<h1 style="font-family:cursive; font-size:18px;background-color: #70dbdb;color:black;text-align:center;padding: 8px">Loan Eligibility Prediction</h1>

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            Determine whether or not to grant a loan based on the likelihood of it being repaid. The goal is to implement a model that predicts whether a loan should be granted to an individual based on the data provided.
        </p>
    </div>

<img src="https://cdn.dribbble.com/users/6238/screenshots/1646680/money.gif" class="center"/>

### <a id='0'>Content</a>

- <a href='#1'>1. Summary Statistics</a>  
- <a href='#2'>2. Missing Value</a>  
- <a href='#3'>3. Skewness & Kurtosis</a>   
- <a href='#4'>4. Correlation</a>
- <a href='#5'>5. Outliers Detection & Removal</a>
- <a href='#6'>6. Normality Test</a>  
- <a href='#7'>7. Exploratory Data Analysis</a>
- <a href='#8'>8. Bi-Variate Exploratory Data Analysis</a>
- <a href='#9'>9. Missing Value Treatment</a>
- <a href='#10'>10. Feature Engineering</a>  
- <a href='#11'>11. Feature Scaling</a>
- <a href='#12'>12. Variance Inflation Factor</a>

<h1 style="font-family:cursive; font-size:18px;background-color: #33cccc;color:black;text-align:center;padding: 8px">Importing Libraries & Files</h1>

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import plotly.express as px
import plotly.graph_objs as go

In [1]:
import altair as alt
alt.data_transformers.disable_max_rows()

In [1]:
data_dictionary=pd.read_csv("../input/loan-eligibility-dataset/Data_Description.csv")
fig = go.Figure(data=[go.Table(columnwidth = [50,170,350],
    header=dict(values=list(data_dictionary.columns),
                fill_color='lightblue',
                line_color='black',
                align='center'),
    cells=dict(values=[data_dictionary.S_No,
                       data_dictionary['Attributes'],
                       data_dictionary['Description']],
                       fill_color='plum',
                       line_color='black',
                       align='left'))])
fig.show()

In [1]:
train=pd.read_csv("../input/loan-eligibility-dataset/LoansTraining.csv")

In [1]:
train.head()

<h1 style="font-family:cursive; font-size:18px;background-color: #70dbdb;color:black;text-align:center;padding: 8px">Uni-Variate Analysis</h1>

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='1'>1. Summary Statistics</a></h1>

In [1]:
sum_stat=train.describe()
sum_stat.T.style.bar(subset=['mean'],color='#ff944d')\
.background_gradient(subset=['std'],cmap='RdPu')\
.background_gradient(subset=['50%'],cmap='YlOrBr')

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            Insights: <br>
            1. The standard deviation of Current Loan Amount, Current Credit Balance, Credit Score and Annual Income is very large, so the values of these columns are widely dispersed around the mean.
        </p>
    </div>

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='2'>2. Missing Value</a></h1>

In [1]:
total_missing=train.isnull().sum().sort_values(ascending=False)
percent=(train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total_missing,percent],axis=1,keys=['Missing_Total','Percent']).head(10)
missing_data.style.bar(subset=['Percent'],color='orange')\
.background_gradient(subset=['Missing_Total'],cmap='Reds')

##### Checking For Categorical & Numeric Columns

In [1]:
numeric_data = train.select_dtypes(include=[np.number])
categorical_data = train.select_dtypes(exclude=[np.number])
# shape[1] is the number of columns; shape alone would print (rows, columns)
print("Numeric_Column_Count =", numeric_data.shape[1])
print("Categorical_Column_Count =", categorical_data.shape[1])

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='3'>3. Skewness & Kurtosis</a></h1>

In [1]:
skewness=numeric_data.skew().to_frame(name='Skewness_Value')
kurtosis=numeric_data.kurt().to_frame(name='Kurtosis_Value')
measures=skewness.merge(kurtosis,left_index=True,right_index=True)
measures.style.background_gradient(subset=['Skewness_Value'],cmap='BuPu')\
                        .background_gradient(subset=['Kurtosis_Value'],cmap='cool')

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='4'>4. Correlation</a></h1>

In [1]:
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(numeric_data)

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='5'>5. Outliers Detection & Removal</a></h1>

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            The interquartile range (IQR) method is used to detect and remove outliers: values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers.
        </p>
    </div>
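A minimal sketch of this kind of IQR filtering on a small made-up series. The helper name `iqr_fences` and the sample values are illustrative assumptions, not part of the loan dataset. It computes the lower and upper fences with k = 1.5 and separates the values that fall outside them.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: nine typical values and one extreme value
sample = pd.Series([10, 12, 11, 13, 12, 14, 11, 13, 12, 100])
lower, upper = iqr_fences(sample)

outliers = sample[(sample < lower) | (sample > upper)]   # flagged values
cleaned = sample[(sample >= lower) & (sample <= upper)]  # values that are kept
print(f"fences: [{lower}, {upper}]")
print("outliers:", outliers.tolist())
```

Only the extreme value falls outside the fences here. The multiplier 1.5 is a convention and can be raised to make the filter less aggressive.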

In [1]:
# Percentage of values in each numeric column that fall on or outside the IQR fences
for k, v in numeric_data.items():
    q1 = v.quantile(0.25)
    q3 = v.quantile(0.75)
    iqr = q3 - q1
    v_col = v[(v <= q1 - 1.5 * iqr) | (v >= q3 + 1.5 * iqr)]
    perc = np.shape(v_col)[0] * 100.0 / np.shape(numeric_data)[0]
    print("Column %s outliers = %.2f%%" % (k, perc))

In [1]:
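# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
Q1 = train.quantile(0.25, numeric_only=True)
Q3 = train.quantile(0.75, numeric_only=True)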
IQR = Q3 - Q1
IQR

In [1]:
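# Compare only the columns for which quartiles were computed (the numeric ones)
num_cols = IQR.index
train = train[~((train[num_cols] < (Q1 - 1.5 * IQR)) | (train[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]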
train.shape

In [1]:
IQR=IQR.to_frame(name='IQR_Value')
IQR

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='6'>6. Normality Test</a></h1>

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            scipy.stats.normaltest tests the null hypothesis that a sample comes from a normal distribution. It is based on D’Agostino and Pearson’s test, which combines skew and kurtosis to produce an omnibus test of normality.
        </p>
    </div>

In [1]:
import scipy
from scipy import stats
num_data = train.select_dtypes(include=[np.number])
X=pd.DataFrame(num_data)
Y=scipy.stats.normaltest(X,nan_policy='omit')
# Row 0 is the test statistic, row 1 the p-value; column names come from the data rather than a hard-coded list
Z = pd.DataFrame(np.asarray(Y), index=['Z-Statistics', 'P-Value'], columns=X.columns)
display(Z.T)

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            Z-statistic: s^2 + k^2, where s is the z-score returned by skewtest and k is the z-score returned by kurtosistest.<br>
            pvalue: A 2-sided chi squared probability for the hypothesis test.
        </p>
    </div>
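The relationship between the three SciPy tests can be checked directly. The sketch below uses synthetic exponential data (an assumption chosen because it is clearly non-normal) and confirms that the `normaltest` statistic equals the sum of the squared z-scores from `skewtest` and `kurtosistest`, with the p-value taken from a chi-squared distribution with 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=500)  # synthetic, strongly right-skewed data

s, _ = stats.skewtest(sample)            # z-score for skewness
k, _ = stats.kurtosistest(sample)        # z-score for kurtosis
k2, p_value = stats.normaltest(sample)   # omnibus statistic and p-value

p_manual = stats.chi2.sf(k2, df=2)       # upper-tail chi-squared probability, 2 dof

print("statistic matches s^2 + k^2:", np.isclose(k2, s**2 + k**2))
print("p-value matches chi2.sf:", np.isclose(p_value, p_manual))
```

A small p-value, as for this skewed sample, means the hypothesis of normality is rejected.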

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='7'>7. Exploratory Data Analysis</a></h1>

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">I. Loan Status</h1>
 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            Counts of the target variable. Among the people who applied for a loan, about 4 times as many loans were accepted as were rejected.
        </p>
    </div>

In [1]:
plt.figure(figsize=(7,5))
target_count = train["Loan Status"].value_counts()
sns.set(style="darkgrid")
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=target_count.index, y=target_count.values, alpha=1,edgecolor='k',palette='autumn')
plt.title('Frequency Distribution showing if Loan Was Given or Not')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('target', fontsize=10)
plt.xticks((0,1),('Loan Given', 'Loan Refused'))
plt.show()

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">II. Loan Term</h1>

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            The duration of the loan could be one reason a loan is approved. Short Term loans are more numerous than Long Term loans.
        </p>
    </div>

In [1]:
plt.figure(figsize=(8,6))
target_count = train["Term"].value_counts()
sns.set(style="darkgrid")
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=target_count.index, y=target_count.values, alpha=1,edgecolor='k',palette='winter')
plt.title('Frequency Distribution showing if Loan given was Short Term or Long Term')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('target', fontsize=12)
plt.xticks((0,1),('Short Term', 'Long Term'))
plt.show()

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">III. Years of Experience</h1>

In [1]:
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen']
# rename_axis + reset_index(name=...) gives the same column names on old and new pandas versions
x=train['Years in current job'].value_counts().rename_axis('Years').reset_index(name='Count')
fig = go.Figure(data=[go.Pie(labels=x['Years'],values=x['Count'],hole=0.5)])
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=15,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title_text='<b>Data Distribution based on Years of Experience <b>',title_x=0.5)
fig.show()

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            Applicants with the most experience, together with those with less than 3 years of experience, make up 55% of the data.
        </p>
    </div>

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">IV. Checking Loan Amount </h1>

In [1]:
def loan_bucket(x):
    if 0 <= x <=23822.000000:
        return 'Low Amount'
    return 'High Amount'
train['Loan_Segment'] = train['Current Loan Amount'].apply(loan_bucket)

In [1]:
colors = ['gold', 'mediumturquoise']
# rename_axis + reset_index(name=...) gives the same column names on old and new pandas versions
x=train['Loan_Segment'].value_counts().rename_axis('Loan_Amt_Type').reset_index(name='Count')
fig = go.Figure(data=[go.Pie(labels=x['Loan_Amt_Type'],values=x['Count'],hole=0.5)])
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=15,
                 marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title_text='<b>Data Distribution based on Current Loan Amount<b>',title_x=0.5)
fig.show()

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            87.5% of applicants fall into the Low Amount bucket and 12.5% into the High Amount bucket.
        </p>
    </div>

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">V. Monthly Debt</h1>

In [1]:
# regex=False treats '$' as a literal dollar sign instead of the end-of-string anchor
train['Monthly Debt']=train['Monthly Debt'].str.replace('$','',regex=False).astype(float)

In [1]:
fig, ax = plt.subplots(figsize=(12, 5))
# histplot replaces the deprecated distplot; kde is off by default
sns.histplot(train['Monthly Debt'], bins=int(180/5), color='orange', edgecolor='black', ax=ax)
# Add labels
plt.title('Histogram of Monthly Debt')
plt.xlabel('Monthly Debt')
plt.ylabel('Count')

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            More than 50% of applicants have a Monthly Debt between 0 and 2000.
        </p>
    </div>

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">VI. Loan Amount</h1>

In [1]:
plt.figure(figsize=(16,6))
plt.style.use('fivethirtyeight')
plt.hist(train["Current Loan Amount"], edgecolor = "black", color = 'crimson',bins=20, label='Current Loan Amount')
plt.xlabel ("Current Loan Amount")
plt.ylabel ("Count")
plt.title ("Distribution of Current Loan Amount after dropping Outliers")

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">VII. Annual Income</h1>

In [1]:
plt.figure(figsize=(16,6))
plt.style.use('fivethirtyeight')
plt.hist(train["Annual Income"], edgecolor = "black", color = 'pink',bins=20, label='Annual Income')
plt.xlabel ("Annual Income")
plt.ylabel ("Count")
plt.title ("Distribution of Annual Income after dropping Outliers")

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">VIII. Current Credit Balance</h1>

In [1]:
plt.figure(figsize=(16,6))
plt.style.use('fivethirtyeight')
plt.hist(train["Current Credit Balance"], edgecolor = "black", color = 'teal',bins=20, label='Current Credit Balance')
plt.xlabel ("Current Credit Balance")
plt.ylabel ("Count")
plt.title ("Distribution of Current Credit Balance after dropping Outliers")

In [1]:
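# Replace the spreadsheet error marker with 0 (literal match) before converting to float
train['Maximum Open Credit']=train['Maximum Open Credit'].str.replace('#VALUE!','0',regex=False).astype(float)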

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='8'>8. Bi-Variate Exploratory Data Analysis</a></h1>

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">Loan Status vs Years in Current Job</h1>

In [1]:
pd.crosstab(train['Loan Status'],train['Years in current job'],margins=True).style.background_gradient(cmap='viridis')

In [1]:
# Share of each job-tenure group within each loan status (every row sums to 1)
c = pd.crosstab(train['Loan Status'],train['Years in current job'],normalize='index')
c=c.T
# Ratio of the "Loan Given" share to the "Loan Refused" share for each tenure group
c["Odds"]=c["Loan Given"]/c["Loan Refused"]
c.style.background_gradient(subset=['Odds'],cmap='copper')

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            1. The odds of the loan being accepted are high for people with 10+ years in their current company and for people with 1-2 years of experience.<br>
            2. Whereas people with between 3 and 10 years in their current company have the lowest chance of the loan being accepted.
        </p>
    </div>
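For comparison, odds can also be computed directly from counts as approved divided by refused within each group. The sketch below uses a made-up toy table (the values are assumptions, not taken from the loan data). Dividing row-normalized shares, as in the cell above, differs from these count-based odds only by a constant factor (total refused divided by total given), so the ranking of the groups is the same.

```python
import pandas as pd

# Hypothetical toy data, not taken from the loan dataset
toy = pd.DataFrame({
    "Loan Status": ["Loan Given"] * 6 + ["Loan Refused"] * 4,
    "Years in current job": [
        "< 1 year", "< 1 year", "< 1 year", "5 years", "5 years", "10+ years",
        "< 1 year", "5 years", "5 years", "10+ years",
    ],
})

# Rows: tenure group, columns: loan status, cells: number of applicants
counts = pd.crosstab(toy["Years in current job"], toy["Loan Status"])

# Odds of approval within each group: approved count / refused count
counts["Odds"] = counts["Loan Given"] / counts["Loan Refused"]
print(counts)
```

Odds above 1 mean a group has more approvals than refusals; odds of exactly 1 mean the two are equally frequent.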

In [1]:
# `height` replaces the removed `size` argument of pairplot
p = sns.pairplot(train, x_vars=['Current Loan Amount', 'Credit Score', 'Monthly Debt', 'Maximum Open Credit'], 
                 y_vars='Annual Income', height=5, aspect=0.5,kind='reg',plot_kws={'line_kws':{'color':'red'}})

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">Monthly Debt VS Years in Current Job</h1>

In [1]:
# catplot replaces the removed factorplot
g = sns.catplot(y="Monthly Debt",x="Years in current job",data=train,kind="box",aspect = 2)
# Rotate the existing tick labels instead of overwriting them with a hard-coded order
g.set_xticklabels(rotation=30)

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">Annual Income vs Years in Current Job</h1>

In [1]:
# catplot replaces the removed factorplot
g = sns.catplot(y="Annual Income",x="Years in current job",data=train,kind="box",aspect = 2)
# Rotate the existing tick labels instead of overwriting them with a hard-coded order
g.set_xticklabels(rotation=30)

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">Loan Status vs Years in Current Job</h1>

In [1]:
train.groupby(['Years in current job','Loan Status']).size().unstack().plot(figsize=(15,8),kind='bar',stacked=True,
                                                                            color=['blue', 'cyan'],edgecolor='black', linestyle='--', linewidth=3,  alpha=0.7)
plt.ylabel('Count')
plt.show()

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='9'>9. Missing Value Treatment</a></h1>

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">I. Random Forest Imputation using Missingpy Library</h1>

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            MissForest is a machine learning-based data imputation algorithm built on Random Forest. First, the missing values are filled in using median/mode imputation. Then the rows with missing values are marked as ‘Predict’ rows and the remaining rows as training rows, and a Random Forest model is trained on the latter to predict the missing values.
        </p>
    </div>
    
![1_m_z8E4HrFtCnHBoDANauTQ.png](attachment:1_m_z8E4HrFtCnHBoDANauTQ.png)

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            This process of looping through the missing data points repeats several times, with each iteration working from progressively better data. It’s like standing on a pile of rocks while continually adding more to raise yourself: the model uses its current position to elevate itself further. In later iterations the model may adjust its predictions or keep them the same.
        </p>
    </div>

![1_6Q3r3_tSadmQ0_JuVpDBUQ.png](attachment:1_6Q3r3_tSadmQ0_JuVpDBUQ.png)
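The loop described above can be sketched with scikit-learn alone. This is a simplified illustration for numeric columns, not the missingpy implementation: the function name `rf_impute`, the fixed number of iterations and the synthetic data are all assumptions, and details such as column ordering and a convergence-based stopping rule are left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_impute(df: pd.DataFrame, n_iter: int = 3, seed: int = 0) -> pd.DataFrame:
    """Simplified MissForest-style imputation for numeric columns."""
    mask = df.isna()                  # remember which cells were missing
    filled = df.fillna(df.median())   # step 1: initial median fill
    cols_with_gaps = [c for c in df.columns if mask[c].any()]
    for _ in range(n_iter):           # step 2: refine the guesses repeatedly
        for col in cols_with_gaps:
            missing = mask[col]
            features = filled.drop(columns=col)
            model = RandomForestRegressor(n_estimators=50, random_state=seed)
            # Train on rows where the column was observed ...
            model.fit(features[~missing], filled.loc[~missing, col])
            # ... and overwrite the current guesses for the rows where it was missing
            filled.loc[missing, col] = model.predict(features[missing])
    return filled

# Synthetic example: column "b" depends strongly on column "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a + rng.normal(scale=0.1, size=200)})
truth = toy["b"].copy()
toy.loc[toy.sample(20, random_state=0).index, "b"] = np.nan

imputed = rf_impute(toy)
print(imputed.isna().sum())
```

Because "b" is predictable from "a" in this toy frame, the forest-based estimates land much closer to the hidden values than a plain median fill would.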

In [1]:
pip install missingpy

In [1]:
import sys
import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

In [1]:
from missingpy import MissForest

In [1]:
X=train[["Current Loan Amount","Annual Income","Monthly Debt","Maximum Open Credit","Credit Score",
         "Number of Open Accounts"]]

In [1]:
imputer = MissForest()
X_imputed = imputer.fit_transform(X)

In [1]:
rf_imputation=pd.DataFrame(X_imputed,columns=X.columns)

In [1]:
train=train.drop(["Current Loan Amount","Annual Income","Monthly Debt","Maximum Open Credit",'Credit Score',
         "Number of Open Accounts"],axis=1)

In [1]:
train=train.reset_index()

In [1]:
 train_new=train.merge(rf_imputation,left_index=True, right_index=True)

In [1]:
train_new.shape

In [1]:
train_new=train_new.reset_index()

In [1]:
train_new=train_new.drop(['index','level_0'],axis=1)
train_new.head()

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">II. Treating Missing by replacing NaN with 0</h1>

In [1]:
train_new["Months since last delinquent"]=train_new["Months since last delinquent"].fillna(0)

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">III. Treating Missing by replacing NaN with Mode</h1>

In [1]:
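# Fill the column itself; the original assigned the filled job-tenure values to "Bankruptcies"
train_new["Years in current job"]=train_new["Years in current job"].fillna(train_new['Years in current job'].mode()[0])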

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">IV. Treating Missing by replacing NaN with Forward Fill</h1>

In [1]:
# ffill() replaces the deprecated fillna(method='ffill')
train_new["Tax Liens"] = train_new['Tax Liens'].ffill()

In [1]:
train_new=train_new.drop(["Loan ID","Customer ID","Bankruptcies"],axis=1) #Dropping Unwanted Columns

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='10'>10. Feature Engineering</a></h1>

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">1. Credit Score</h1>

![FICO-Scores-0474cc0ca87b4b58b9391f065f623c0f-2.jpg](attachment:FICO-Scores-0474cc0ca87b4b58b9391f065f623c0f-2.jpg)

In [1]:
def credit_bucket(x):
    if  x < 580:
        return 'Poor'
    # FICO bands: Fair 580-669, Good 670-739, Very Good 740-799, Exceptional 800+
    elif 580 <= x < 670:
        return 'Fair'
    elif 670 <= x < 740:
        return 'Good'
    elif  740 <= x < 800:
        return 'Very Good'
    else:
        return 'Exceptional'
train_new['Cred_Segment'] = train_new['Credit Score'].apply(credit_bucket)

In [1]:
colors = ['gold', 'mediumturquoise']
# rename_axis + reset_index(name=...) gives the same column names on old and new pandas versions
x=train_new['Cred_Segment'].value_counts().rename_axis('Rating').reset_index(name='Count')
fig = go.Figure(data=[go.Pie(labels=x['Rating'],values=x['Count'],hole=0.5)])
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=15,
                 marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title_text='<b>Data Distribution based on Rating<b>',title_x=0.5)
fig.show()

In [1]:
# `fill` replaces the deprecated `shade` argument of kdeplot
g = sns.kdeplot(train_new["Annual Income"][(train_new["Cred_Segment"] == "Good") & (train_new["Annual Income"].notnull())], color="Red", fill=True)
g = sns.kdeplot(train_new["Annual Income"][(train_new["Cred_Segment"] == "Exceptional") & (train_new["Annual Income"].notnull())], ax =g, color="Blue", fill=True)
g = sns.kdeplot(train_new["Annual Income"][(train_new["Cred_Segment"] == "Very Good") & (train_new["Annual Income"].notnull())], ax =g, color="Orange", fill=True)
g = sns.kdeplot(train_new["Annual Income"][(train_new["Cred_Segment"] == "Fair") & (train_new["Annual Income"].notnull())], ax =g, color="teal", fill=True)
g.set_xlabel("Annual Income")
# A KDE curve shows an estimated density, not a count
g.set_ylabel("Density")
g = g.legend(["Good","Exceptional","Very Good","Fair"])

In [1]:
# alt.Chart(train_new).mark_bar().encode(
#     x='Cred_Segment',
#     y='count(Annual Income)',
#     color='Loan Status'
# ).properties(
#     width=500,
#     height=200,title="Loan Status as Per FICO Segment"
# )

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">2. Debt to Income Ratio</h1>

In [1]:
train_new['DTI']=(train_new['Monthly Debt']/(train_new['Annual Income']/12))*100

In [1]:
train_new['DTI'].isna().sum()
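As a quick worked example of the ratio computed above, the sketch below applies the same formula to one made-up applicant. The function name `debt_to_income` and the figures are illustrative assumptions.

```python
def debt_to_income(monthly_debt: float, annual_income: float) -> float:
    """DTI in percent: monthly debt divided by monthly income, times 100."""
    monthly_income = annual_income / 12
    return monthly_debt / monthly_income * 100

# Hypothetical applicant: 1,500 of monthly debt on an annual income of 60,000
dti = debt_to_income(1_500, 60_000)  # monthly income is 5,000
print(f"DTI = {dti:.1f}%")
```

An annual income of 0 would raise a `ZeroDivisionError` in this scalar version, whereas the vectorized pandas expression yields `inf` or `NaN` instead, which is why non-finite values need to be handled separately.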

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">3. Debt to Limit ratio</h1>

In [1]:
train_new['DTL']=(train_new['Current Credit Balance']/(train_new['Maximum Open Credit']))*100

In [1]:
train_new['DTL'].isna().sum()

In [1]:
train_new['DTL']=train_new['DTL'].fillna(0)

<h1 style="font-family:cursive; font-size:10px;color:#008080;text-align:left;padding: 4px">4. Converting Categorical to Numeric Features</h1>

In [1]:
columns=train_new[['Purpose','Loan_Segment','Cred_Segment','Term','Years in current job']]
categorical_data=pd.get_dummies(columns,drop_first=True).astype(int)

In [1]:
train_new=train_new.drop(['Purpose','Loan_Segment','Cred_Segment','Term','Years in current job'],axis=1)

In [1]:
def func(x):
    if  x == 'Rent' :
        return 1
    elif  x == 'Own Home':
        return 2
    return 0
train_new['Home Ownership'] = train_new['Home Ownership'].apply(func)

In [1]:
def func(x):
    if  x == 'Loan Given' :
        return 1
    return 0
train_new['Loan Status'] = train_new['Loan Status'].apply(func)

In [1]:
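# Check the engineered frame itself (X still refers to the pre-imputation subset)
numeric_values = train_new.select_dtypes(include=[np.number]).to_numpy(dtype=float)
display(np.isnan(numeric_values).any())
display(np.isfinite(numeric_values).all())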

In [1]:
train_new=train_new.replace([np.inf, -np.inf], 0, inplace=False)

In [1]:
np.where(train_new.values >= np.finfo(np.float64).max)

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='11'>11. Feature Scaling</a></h1>

In [1]:
X=train_new.drop(['Loan Status','Home Ownership'],axis=1)
Y=train_new['Loan Status']

In [1]:
from sklearn.preprocessing import MinMaxScaler
scaled_features = MinMaxScaler().fit_transform(X.values)

In [1]:
scaled_features_df = pd.DataFrame(scaled_features,index=X.index, columns=X.columns)

In [1]:
train_new=train_new.drop(['Years of Credit History', 'Months since last delinquent',
       'Number of Credit Problems', 'Current Credit Balance', 'Tax Liens',
       'Current Loan Amount', 'Annual Income', 'Monthly Debt',
       'Maximum Open Credit', 'Credit Score', 'Number of Open Accounts', 'DTI',
       'DTL'],axis=1)

In [1]:
train_new.shape

In [1]:
train_new=pd.concat([train_new,scaled_features_df], axis=1).reindex(train_new.index)

In [1]:
train_new=pd.concat([train_new,categorical_data], axis=1).reindex(train_new.index)

In [1]:
#train_new=train_new.drop(train_new.columns[[0]],axis = 1)

<h1 style="font-family:cursive; font-size:14px;color:#008080;text-align:left;padding: 4px"><a id='12'>12. Variance Inflation Factor</a></h1>

In [1]:
x=train_new.drop(['Loan Status'],axis=1)
y=train_new['Loan Status']

In [1]:
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = add_constant(x)
vif = pd.Series([variance_inflation_factor(X_vif.values, i) 
               for i in range(X_vif.shape[1])], 
              index=X_vif.columns)

In [1]:
display(vif.sort_values(ascending = False).round(2).head(10).to_frame(name='VIF Score'))

In [1]:
train_new=train_new.drop(['Cred_Segment_Very Good','Monthly Debt','Purpose_Debt Consolidation'],axis=1)

 <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:80%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. Mathematically, the VIF of an independent variable is the ratio of the variance of its coefficient in the full model to the variance of that coefficient in a model that includes only that single independent variable; equivalently, VIF = 1 / (1 - R^2), where R^2 comes from regressing that variable on all the other independent variables. This ratio is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.<br>
            1. Multicollinearity does not reduce the explanatory power of the model, but it does reduce the statistical significance of the independent variables.<br>
            2. A large variance inflation factor (VIF) on an independent variable indicates a highly collinear relationship to the other variables that should be considered or adjusted for in the structure of the model and selection of independent variables.
        </p>
    </div>
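To make the definition concrete, the sketch below computes VIF by hand with NumPy using the standard identity VIF = 1 / (1 - R^2), where R^2 comes from regressing one column on the others. The helper name `vif` and the synthetic columns are assumptions for illustration; it is not a replacement for the statsmodels call used above.

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # include an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary least squares fit
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()                  # share of variance explained
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to the other two
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

The two nearly identical columns get very large VIFs, while the unrelated column stays close to 1, which is the pattern that motivates dropping one of a pair of collinear features.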

In [1]:
train_new['Loan Status'].value_counts()

To be continued...