<a href="https://datacamp.com/workspacecompetition" target="_blank">![banner](banner.png)</a>

# Loan Data

Ready to put your coding skills to the test? Join us for our Workspace Competition.  
For more information, visit [datacamp.com/workspacecompetition](https://datacamp.com/workspacecompetition) 

## Context
This dataset ([source](https://www.kaggle.com/itssuru/loan-data)) consists of data from almost 10,000 borrowers that took loans - with some paid back and others still in progress. It was extracted from lendingclub.com which is an organization that connects borrowers with investors. We've included a few suggested questions at the end of this template to help you get started.

In [None]:
# Load packages
import numpy as np 
import pandas as pd 

#import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#to create Pmf and Cdf
!pip install empiricaldist 
from empiricaldist import Pmf
from empiricaldist import Cdf

#To use linear regression between variables
from scipy.stats import linregress

## Load your data

In [None]:
#For Kaggle usage only
import os
#Print out the file paths
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#Loading file from Kaggle
df = pd.read_csv('/kaggle/input/loan-data-analysis-datacamp-workspace/loan_data.csv', index_col=None)

# Load data from the csv file
# df = pd.read_csv('loan_data.csv', index_col=None)

# Change the dots in the column names to underscores
df.columns = [c.replace(".", "_") for c in df.columns]
print(f"Number of rows/records: {df.shape[0]}")
print(f"Number of columns/variables: {df.shape[1]}")
df.head()

## Understand your variables

In [None]:
# Understand your variables
variables = pd.DataFrame(columns=['Variable','Number of unique values','Values'])

for i, var in enumerate(df.columns):
    variables.loc[i] = [var, df[var].nunique(), df[var].unique().tolist()]
    
#Loading file from Kaggle
var_dict = pd.read_csv('/kaggle/input/loan-data-analysis-datacamp-workspace/variable_explanation.csv', index_col=0) 

# Join with the variables dataframe
# var_dict = pd.read_csv('variable_explanation.csv', index_col=0)
variables.set_index('Variable').join(var_dict)

Now you can start to explore this dataset with the chance to win incredible prizes! Can't think of where to start? Try your hand at these suggestions:

- Extract useful insights and visualize them in the most interesting way possible.
- Find out how long it takes for users to pay back their loan.
- Build a model that can predict the probability a user will be able to pay back their loan within a certain period.
- Find out what kind of people take a loan for what purposes.

In [None]:
#See the full description of each variable
#Iterator for var_dict's values to be used in printing
var_description = (item for item in var_dict.values)
for i in range(len(variables['Variable'])):
    print(variables['Variable'][i],':', next(var_description))

### Judging Criteria
| CATEGORY | WEIGHTAGE | DETAILS                                                              |
|:---------|:----------|:---------------------------------------------------------------------|
| **Analysis** | 30%       | <ul><li>Documentation on the goal and what was included in the analysis</li><li>How the question was approached</li><li>Visualisation tools and techniques utilized</li></ul>       |
| **Results**  | 30%       | <ul><li>How the results derived related to the problem chosen</li><li>The ability to trigger potential further analysis</li></ul> |
| **Creativity** | 40% | <ul><li>How "out of the box" the analysis conducted is</li><li>Whether the publication is properly motivated and adds value</li></ul> |

# **Exploratory Data Analysis**

We'll begin with some basic descriptive statistics below to understand more about each variable

In [None]:
#See the important stats of each variable in the dataset
df.describe()

In [None]:
#Check the variable types and whether there're any missing values
df.info()

As we can see, there is no missing data in the dataset, so no imputation is needed.
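The same check can be made explicit by counting missing values per column. A minimal sketch, assuming a small hypothetical frame (`toy` and its values are made up for illustration and are not taken from the loan data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only
toy = pd.DataFrame({"fico": [707, 682, np.nan], "dti": [19.5, 14.3, 11.6]})

# Count missing values per column; all zeros would mean nothing to impute
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running `df.isna().sum()` on the loan data should, by the same logic, report zero for every column.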

In [None]:
sns.set()  #Set to the default style of seaborn
#See the distribution of people who qualify for credit underwriting criteria
fig , ax = plt.subplots()
plt.hist(x = df['credit_policy'], bins = 2)
plt.title("Distribution of credit underwriting criteria")
plt.ylabel('Number of people')
ax.set_xticks((0.25, 0.75))
ax.set_xticklabels(['Do not meet criteria','Meet criteria'])
plt.show()

We can see that **there're almost 4 times the number of people** who meet underwriting criteria compared to those who don't.
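The ratio can also be read off numerically instead of from the bar heights. A minimal sketch, assuming a hypothetical five-row series in place of `df['credit_policy']`:

```python
import pandas as pd

# Hypothetical values: 1 = meets the criteria, 0 = does not
credit_policy = pd.Series([1, 1, 1, 1, 0], name="credit_policy")

# Count each class, then divide to get "how many times more"
counts = credit_policy.value_counts()
ratio = counts.loc[1] / counts.loc[0]
print(counts)
print(f"Ratio of meet to do-not-meet: {ratio:.1f}")
```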

In [None]:
fig, ax = plt.subplots()
# value_counts() sorts from most to least frequent; take labels and heights from
# the same Series so that each bar is paired with the right purpose
values = df['purpose'].value_counts()
plt.bar(x = values.index, height = values)
plt.title("Purpose of the loan by frequency")
plt.ylabel('Frequency')
ax.set_ylim(0, 4500)
# Add the count above each bar
for hor, ver in enumerate(values):
    #hor is horizontal coordinate, ver is vertical coordinate
    ax.text(hor - 0.25, ver+ver*0.05, s = str(ver), color = 'red')
plt.xticks(rotation=45) #Rotate X ticks by 45 degrees
plt.show()

In [None]:
#We'll visualize this with seaborn to see which category has the most number of people who have not paid back their loan
plt.figure(figsize = (10, 7))
sns.countplot(x = 'purpose', hue ='not_fully_paid', data=df, palette = 'Set2')
plt.show()

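The debt consolidation category has the highest number of loans that were not fully paid, at over 500. It is also the largest category overall, so a high count does not by itself imply a high default rate.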
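Counts favor large categories, so it can help to look at the share of unpaid loans within each purpose as well. A minimal sketch, assuming hypothetical rows (the numbers below are invented and say nothing about the real rates):

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "purpose": ["debt_consolidation"] * 4 + ["small_business"] * 2,
    "not_fully_paid": [1, 0, 0, 0, 1, 0],
})

# For a 0/1 column, the sum is the unpaid count and the mean is the unpaid rate
summary = toy.groupby("purpose")["not_fully_paid"].agg(
    unpaid="sum", total="count", rate="mean"
)
print(summary)
```

In this toy data both purposes have one unpaid loan, yet the rates differ because the group sizes differ.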

Next we'll look at how people are distributed across FICO score categories. We'll use the image below as a guideline to divide the scores into categories.
1. 300 - 560: very bad
2. 560 - 650: bad
3. 650 - 700: fair
4. 700 - 750: good
5. 750 - 850: excellent 
![](https://d187qskirji7ti.cloudfront.net/news/wp-content/uploads/2014/04/Credit-Score-Factors.jpg)
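Because adjacent ranges share an endpoint, it is worth checking where a boundary score lands. A minimal sketch with hypothetical scores; by default `pd.cut` makes each bin closed on the right, so a score of exactly 650 is labeled "Bad" rather than "Fair":

```python
import pandas as pd

# Hypothetical scores chosen to sit on or next to the bin edges
scores = pd.Series([560, 650, 651, 700, 750, 850])

bands = pd.cut(
    scores,
    bins=[300, 560, 650, 700, 750, 850],
    labels=["Very bad", "Bad", "Fair", "Good", "Excellent"],
)
print(pd.DataFrame({"score": scores, "band": bands}))
```

Note that the lowest edge is excluded by default, so a score of exactly 300 would come back as missing unless `include_lowest=True` is passed.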

In [None]:
#Bin the FICO scores once and reuse the result for both the plot and the counts
df_fico = pd.cut(df['fico'], bins = [300,560,650,700,750,850],
        labels =['Very bad','Bad','Fair','Good','Excellent'])
plt.suptitle('Distribution of people by FICO score')
df_fico.hist()
plt.show()
print(df_fico.value_counts())

We can see that nobody has a "very bad" credit score. Over 5,200 people, more than 50% of the dataset, have a score rated good or excellent.

In [None]:
#Alternative way to visualize this with credit policy variable
plt.figure(figsize = (10,6))
plt.suptitle('Distribution of people by fico score and their credit')
df[df['credit_policy'] == 1]['fico'].hist(bins = 15,alpha = 0.4, color = 'green',
                                          label ='Meet credit policy')
df[df['credit_policy'] == 0]['fico'].hist(bins = 15,alpha = 0.3, color = 'red',
                                          label ='Do not meet credit policy')
plt.legend()
plt.show()
print("Number of people with good (700) or above credit score but do not meet credit policy is: " + 
      str(df[(df['credit_policy']==0) & (df['fico']>=700)]['credit_policy'].count()))

According to the graph above, a surprising number of people do not meet the underwriting criteria despite having "good" or even "excellent" scores. Next we'll look at the relationship between these variables: interest rate, credit policy, not_fully_paid, and FICO score.

In [None]:
sns.lmplot(x = 'int_rate', y = 'fico', hue = 'credit_policy', col ='not_fully_paid', data=df, palette = 'RdBu')
plt.show()

People who meet the credit policy tend to have lower interest rates and higher FICO scores. In both plots, int_rate and FICO score have a negative linear relationship, as indicated by the fitted lines.

In [None]:
df_nfp_0 = df[df['not_fully_paid'] == 0]
df_nfp_1 = df[df['not_fully_paid'] == 1]
pub_rec_fico0 = linregress(df_nfp_0['pub_rec'], df_nfp_0['fico'])
pub_rec_fico1 = linregress(df_nfp_1['pub_rec'], df_nfp_1['fico'])
print(pub_rec_fico0)
print(pub_rec_fico1)
fx_pub_rec_0 = df_nfp_0['pub_rec']
fx_pub_rec_1 = df_nfp_1['pub_rec']
fy_fico_0 = pub_rec_fico0.intercept + fx_pub_rec_0 * pub_rec_fico0.slope
fy_fico_1 = pub_rec_fico1.intercept + fx_pub_rec_1 * pub_rec_fico1.slope
sns.catplot(x = 'pub_rec', y = 'fico', hue = 'not_fully_paid', data=df)
plt.plot(fx_pub_rec_0, fy_fico_0, '-')
plt.plot(fx_pub_rec_1, fy_fico_1, '-')
plt.show()

Having no public records appears to be associated with having fully paid. Also, people with one or more public records are very likely to have a FICO score below 750. The fitted slopes suggest that each additional public record is associated with a FICO score roughly 18 to 20 points lower, depending on whether the loan has been paid in full. People who have not fully paid have at most 2 public records.
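The "points per public record" reading comes from the slope of the fitted line. A minimal sketch of that interpretation, assuming made-up data that lies exactly on a line (these are not values from the dataset):

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data: FICO falls by exactly 20 points per public record
pub_rec = np.array([0, 1, 2, 3])
fico = np.array([720.0, 700.0, 680.0, 660.0])

fit = linregress(pub_rec, fico)

# slope = change in FICO per extra record, intercept = fitted FICO at zero records
print(f"slope: {fit.slope:.1f}, intercept: {fit.intercept:.1f}")
```

A slope describes an association in the sample; it does not by itself show that a public record causes the score to drop.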

In [None]:
df.groupby(['pub_rec'])['not_fully_paid'].describe()

In [None]:
# Create correlation matrix and graph it
# Restrict to numeric columns: 'purpose' is text and cannot be correlated
corr = df.select_dtypes(include = 'number').corr()
#Mask the upper triangle, which mirrors the lower one
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style('white'):
    fig, ax = plt.subplots(figsize = (10,8))
    ax = sns.heatmap(corr, mask = mask, vmax = 1, square = True)
plt.show()

From the correlation matrix, we can infer a few things: 
- There's a strong negative correlation between FICO score and interest rate; that is, they move in opposite directions. This makes sense because someone with a poor credit score is charged a higher interest rate, and vice versa. We'll graph this below.
- There's a strong negative correlation between the number of inquiries in the past 6 months and credit policy. People who make more credit inquiries are less likely to meet the underwriting criteria.
- There's a strong negative correlation between revolving utilization (the amount of credit used relative to the total credit line available) and FICO score. Using a large share of the available credit is associated with a lower FICO score.
- There's a moderate positive correlation between the natural log of annual income and installment. People who earn more tend to have a bigger monthly payment (installment), plausibly because they have more disposable income.
- Interest rate and revolving utilization have a moderate positive correlation. People who use more of their available credit line tend to be charged a higher interest rate.
- Dti (debt-to-income ratio) and revolving utilization have a mild positive correlation. People who owe more debt also tend to use more of their available credit line.
- Log annual income has a mild positive correlation with days with credit line (how long they have had a credit line).
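Reading strengths off a heatmap is approximate, so it can help to list the pairs ranked by absolute correlation. A minimal sketch on synthetic data; the column names mirror the dataset, but the values and the relationship between them are generated here purely as an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data: int_rate is built to fall as fico rises, noise is unrelated
rng = np.random.default_rng(0)
n = 500
fico = rng.normal(710, 40, n)
toy = pd.DataFrame({
    "fico": fico,
    "int_rate": 0.30 - 0.00025 * fico + rng.normal(0, 0.01, n),
    "noise": rng.normal(0, 1, n),
})

corr = toy.corr()

# Keep each pair once by selecting the strict upper triangle
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs from strongest to weakest, regardless of sign
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```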

In [None]:
sns.jointplot(x ='fico', y = 'int_rate', data =df)
plt.show()

From the graph above, we can see that the interest rate drops as the FICO score improves.

In [None]:
#Select continuous variables (float)
cont_var = [c for c in df.columns if df[c].dtype == 'float']

#Graph continuous variables to see the distribution
for i in cont_var:
    sns.boxplot(y = i, palette = 'rainbow', data = df)
    plt.show()

Next we'll estimate how long each loan will take to pay off, assuming the monthly installment is paid every month. We'll create a new column for this by dividing the revolving balance by the monthly installment, which is a rough estimate that ignores interest. Then we'll use a pivot table to see which types of loans take the longest to pay off.
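As a worked example of that estimate, a balance of 12,000 paid down at 300 per month takes 40 months. A minimal sketch with hypothetical borrowers (the figures are invented, and interest is ignored as an explicit simplification):

```python
import pandas as pd

# Hypothetical balances and monthly installments, for illustration only
toy = pd.DataFrame({
    "revol_bal": [12000.0, 4500.0],
    "installment": [300.0, 150.0],
})

# Months to pay off = balance / monthly payment (no interest assumed)
toy["time_to_paid_off_mths"] = toy["revol_bal"] / toy["installment"]
print(toy)
```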

In [None]:
df['time_to_paid_off_mths'] = df['revol_bal']/df['installment']
pd.pivot_table(index = 'purpose', values = 'time_to_paid_off_mths', aggfunc = 'mean', margins = True, 
               margins_name = 'Average across all purposes', data = df)

Home improvement loans seem to take the longest to pay off, at 95 months on average. The "major_purchase" category has the shortest payoff time, at 42 months.

In [None]:
#'sum' totals the delinquencies per purpose; 'count' would only count the rows
pd.pivot_table(index = 'purpose', values = ['log_annual_inc', 'dti', 'delinq_2yrs'], 
               aggfunc = {'log_annual_inc':'mean','dti':'mean',
                           'delinq_2yrs': 'sum'}, data = df)

There's not much income variation across different purposes, but we can see that the groups with the largest number of delinquencies in the past 2 years are "debt consolidation" and "all other". Debt-to-income ratio is also high in the debt consolidation and credit card groups.

Since income is stored on a log scale, we'll use a KDE plot to see its distribution.

In [None]:
sns.kdeplot(df['log_annual_inc'], hue = df['not_fully_paid'])
plt.show()

The distribution looks approximately normal, with a mean of around 11. There are more people who have fully paid than people who have not.

We'll turn some variables into PMFs (probability mass functions) to visualize them more easily. Let's begin with delinq_2yrs.

In [None]:
pmf_delinq = Pmf.from_seq(df['delinq_2yrs'], normalize = True)
pmf_delinq.bar()
plt.xlabel('Number of delinquencies in the past 2 years')
plt.ylabel('Probability')
plt.show()

Over 85% of people have no delinquencies in the past 2 years. Only a negligible share of people have more than 3.
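A PMF is simply the share of observations at each value, so the same numbers can be obtained with plain pandas. A minimal sketch, assuming ten hypothetical delinquency counts rather than the real column:

```python
import pandas as pd

# Hypothetical delinquency counts for ten borrowers
delinq = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])

# Normalized value counts give the probability of each value
pmf = delinq.value_counts(normalize=True).sort_index()
print(pmf)
```

The probabilities sum to 1, and `pmf.loc[0]` is the share of borrowers with no delinquencies in this toy sample.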

In [None]:
pmf_pubrec = Pmf.from_seq(df['pub_rec'], normalize = True)
pmf_pubrec.bar()
plt.xlabel('Number of derogatory public records')
plt.ylabel('Probability')
plt.show()

Over 90% of people have 0 derogatory public records, and a few percent have 1.

In [None]:
dti_0 = df[df['not_fully_paid'] == 0]['dti']
dti_1 = df[df['not_fully_paid'] == 1]['dti']  
cdf_dti_0 = Cdf.from_seq(dti_0, normalize = True)
cdf_dti_1 = Cdf.from_seq(dti_1, normalize = True)
cdf_dti_0.plot(label = "Fully paid")
cdf_dti_1.plot(label = "Not fully paid")
plt.legend()
plt.xlabel('Debt-to-income ratio')
plt.ylabel('CDF')
plt.show()

Overall, about 82% of the people who have fully paid have debt to income ratio smaller than 20 vs. 80% of people who have not fully paid. 
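Each percentage above is the CDF evaluated at 20, which equals the fraction of values at or below 20. A minimal sketch with hypothetical ratios (not values from the dataset):

```python
import numpy as np

# Hypothetical debt-to-income ratios for five borrowers
dti = np.array([5.0, 12.5, 18.0, 20.0, 24.0])

# Empirical CDF at 20 = share of observations less than or equal to 20
cdf_at_20 = np.mean(dti <= 20)
print(f"Share with dti <= 20: {cdf_at_20:.0%}")
```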

# **Preprocessing**

Next we'll go into the preprocessing aspect for the variables. First we'll import the necessary libraries. 
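One detail matters when a `ColumnTransformer` is used with `remainder='passthrough'`: the transformed columns come first in the output and the passthrough columns are appended on the right, so the output order can differ from the input order. A minimal sketch on a hypothetical two-column frame (the names `flag` and `rate` are made up):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: 'flag' comes first in the input, 'rate' second
toy = pd.DataFrame({"flag": [0, 1, 0, 1], "rate": [0.10, 0.12, 0.14, 0.16]})

# Only 'rate' is standardized; 'flag' passes through unchanged
ct = ColumnTransformer([("cont", StandardScaler(), ["rate"])],
                       remainder="passthrough")
out = ct.fit_transform(toy)

# Output column 0 is the standardized 'rate', column 1 is the untouched 'flag'
print(out)
```

Any code that later matches model coefficients to column names needs to use this output order.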

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import optuna
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay #plot_confusion_matrix was removed in scikit-learn 1.2

In [None]:
#Use dummy encoding for purpose
df_dummy = pd.get_dummies(df['purpose'], prefix = 'purpose', drop_first = True) #to avoid multicollinearity
#Merging dataframes to get the dummy variables
df = df.merge(df_dummy, left_index = True, right_index = True)

In [None]:
#Separate 'not_fully_paid' into the target for prediction
y = df['not_fully_paid']
#Drop the target and old categorical 'purpose' into a new frame, leaving df unchanged
X = df.drop(columns = ['purpose', 'not_fully_paid'])

In [None]:
#Continuous (float) variables will be standardized, except for log_annual_inc
cont_var = [c for c in X.columns if X[c].dtype == 'float']
cont_var.remove('log_annual_inc')
cont_var_exc_inc = cont_var

int_var = [c for c in X.columns if X[c].dtype == 'int']
int_var.remove('credit_policy')
int_var_exc_credit_policy = int_var

#Standardize the continuous variables to improve accuracy
standardizeX = StandardScaler()

# Initialize column transformer; all other columns pass through unchanged
columnTrans = ColumnTransformer(transformers = [
    ('cont',standardizeX,cont_var_exc_inc),
    ],remainder='passthrough')

# **Hyperparameter tuning**

This section below was only run a few times in order to get the optimized parameters for the logistic regression model.

In [None]:
#With Optuna for tuning
accuracy = []
def run(trial):
#     penalty = trial.suggest_categorical('penalty',['l1', 'l2', 'elasticnet', 'none'])
    tol = trial.suggest_float('tol', 0.0000001, 0.0001, log = True)
    C = trial.suggest_float('C', 1.0, 1000)
    max_iter = trial.suggest_int('max_iter', 10, 100000)
    solver = trial.suggest_categorical('solver', ['saga'])
#     l1_ratio = trial.suggest_float('l1_ratio', 0, 1)

    # Using regular train_test split to work with Optuna
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
    #Standardize the continuous variables
    X_train = columnTrans.fit_transform(X_train)
    X_test = columnTrans.transform(X_test)
    #Pass the suggested values to the model so that each trial actually evaluates them
    model1 = LogisticRegression(penalty='l2', tol=tol, C=C, solver=solver, 
                                max_iter=max_iter, n_jobs=-1, random_state = 0)
    #Fitting the model
    model1.fit(X_train, y_train)

    #Scoring the model on the held-out split
    score = model1.score(X_test, y_test)
    accuracy.append(score)
    print(f'Accuracy score is: {score}')
    return score

In [None]:
#Suppress optuna output
optuna.logging.set_verbosity(optuna.logging.WARNING)
#Use random sampling; the sampler must be passed to the study to take effect
sampler = optuna.samplers.RandomSampler(seed = 0)
study = optuna.create_study(direction = 'maximize', sampler = sampler)
study.optimize(run, n_trials = 200)

In [None]:
#Get the best parameters for the model
study.best_params

# **Model Building**

First we'll use logistic regression to predict whether a person has fully paid yet or not. 

In [None]:
#Plugged in with tuned parameters from Optuna 200 trials
best_params = {'tol': 7.651638057784127e-06,
             'C': 952.0477857994174,
             'max_iter': 95461,
             'solver': 'saga'}
model1 = LogisticRegression(random_state = 0,**best_params)

#to hold results
accuracy = []
coef = np.zeros((1,19))
intercept = np.zeros((1,1))

#Establish stratified Kfold with 10 splits
skf = StratifiedKFold(n_splits = 10, shuffle = True)

for fold, (train_index,test_index) in enumerate(skf.split(X, y)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    #Standardize the continuous variables
    X_train = columnTrans.fit_transform(X_train)
    X_test = columnTrans.transform(X_test)

    #Fitting the model
    model1.fit(X_train, y_train)

    #Make prediction
    preds = model1.predict(X_test)

    #Scoring the model
    score = model1.score(X_test, y_test)
    accuracy.append(score)
    coef += model1.coef_
    intercept += model1.intercept_
    print(f'Accuracy score for fold {fold} is: {score}')

print(f'The average score across all 10 folds is {np.mean(accuracy)}')

We can see that this logistic regression model has an **average accuracy of roughly 84%** across 10 folds in predicting who will be most likely to pay back the loan.
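An accuracy figure is easier to judge next to a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. A minimal sketch, assuming hypothetical labels in place of `y` (the class balance below is invented):

```python
import pandas as pd

# Hypothetical target: 8 fully paid (0) and 2 not fully paid (1)
y_toy = pd.Series([0] * 8 + [1] * 2)

# Always predicting the most frequent class scores its relative frequency
baseline_accuracy = y_toy.value_counts(normalize=True).max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.0%}")
```

A model is only informative to the extent that it beats this baseline, which can be computed for the real target with `y.value_counts(normalize=True).max()`.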

Next, we'll examine the effect each independent variable has on not_fully_paid.
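Logistic regression coefficients and the intercept are on the log-odds scale, so they need converting before being read as probabilities. A minimal sketch with hypothetical values (`toy_intercept` and `toy_coef` are made up and are not the fitted parameters):

```python
import numpy as np

def sigmoid(z):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, for illustration only
toy_intercept = 0.0          # log-odds when every feature equals 0
toy_coef = np.log(2.0)       # log-odds added per one-unit increase

p_at_zero = sigmoid(toy_intercept)            # probability with all features at 0
odds_ratio = np.exp(toy_coef)                 # multiplicative change in the odds
p_after = sigmoid(toy_intercept + toy_coef)   # probability after a one-unit increase

print(p_at_zero, odds_ratio, p_after)
```

A positive coefficient raises the odds of not_fully_paid and a negative one lowers them; for standardized features, one unit means one standard deviation.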

In [None]:
#Divide by the number of folds to get the average coefficients and intercept
coef = coef/10
intercept = intercept/10
coef = coef[0]
intercept = intercept[0]

In [None]:
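import pprint #to print dictionary output nicely
#columnTrans puts the standardized columns first, then the passthrough columns,
#so the coefficients follow that order rather than the order of X.columns
feature_names = cont_var_exc_inc + [c for c in X.columns if c not in cont_var_exc_inc]
pos_coef_list = []
neg_coef_list = []
pos_coef = {} #for positive coefficients var
neg_coef = {} #for negative coefficients var
for i,c in enumerate(feature_names): 
    if coef[i] > 0:
        pos_coef[c] = coef[i]
        pos_coef_list.append(coef[i])
    else:
        neg_coef[c] = coef[i]
        neg_coef_list.append(coef[i])
#The intercept is a log-odds value; the logistic function converts it to a probability
baseline_prob = 1 / (1 + np.exp(-intercept))
print('The default person with 0 across all variables has a probability of having not fully paid equal to:', baseline_prob)
print('The bigger the variables below get, the more likely it is that the person has not fully paid')
pprint.pprint(pos_coef)
print("The bigger the variables below get, the more likely it is that the person has fully paid")
pprint.pprint(neg_coef)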