# Table of Contents
1. [Introduction](#1.-Introduction)

    1.1. [Objectives](#1.1.-Objectives)
    
    1.2. [Features](#1.2.-Features)
    
2. [Packages, data loading and cleaning](#2.-Packages,-data-loading-and-cleaning)

    2.1. [Packages](#2.1.-Packages)
    
    2.2. [Data loading](#2.2.-Data-loading)
    
    2.3. [Data cleaning](#2.3.-Data-cleaning)
 
3. [Descriptive analysis](#3.-Descriptive-analysis)

    3.1. [Categories](#3.1.-Categories)
    
    3.2. [Numbers](#3.2.-Numbers)
    
    3.3. [Booleans](#3.3.-Booleans)
    
    3.4. [Descriptive analysis conclusions and considerations.](#3.4.-Descriptive-analysis-conclusions-and-considerations.)

4. [Data Analysis and EDA](#4.-Data-Analysis-and-EDA)
    
    4.1. [Overlaped histograms](#4.1.-Overlaped-histograms)
    
    4.2. [Barplots](#4.2.-Barplots)
    
5. [Model-based feature selection.](#5.-Model-based-feature-selection.)

6. [Decision Tree](#6.-Decision-Tree)

    6.1. [Resampling](#6.1.-Resampling)
    
    6.2. [Tree model](#6.2.-Tree-model)

7. [Conclusions](#7.-Conclusions)
    
    

# 1. Introduction

In this project we explore publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people whose profile shows a high probability of paying you back.

We will use lending data from 2007-2010 and be trying to classify and predict whether or not the borrower paid back their loan in full.

A model that predicts whether a consumer will pay or not could help companies distribute their resources better. If the companies have a borrower profile, they can, for example, allocate their best debt collectors to each case. They can also come up with payment strategies that could help the borrower pay the debt.

In this kernel, we are going to identify the relevant factors that determine whether a borrower will pay a debt or not, then create a borrower profile, and finally draw conclusions and make recommendations to improve the business.

## 1.1. Objectives
- Find relevant factors that influence whether a borrower will pay or not.
- Build a model that predicts whether the borrower will pay or not.
- Create a consumer profile of borrowers who pay and of those who do not.
- Draw conclusions and make recommendations to improve business operations.

## 1.2. Features

- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
- purpose: The purpose of the loan (takes values "creditcard", "debtconsolidation", "educational", "majorpurchase", "smallbusiness", and "all_other").
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
- installment: The monthly installments owed by the borrower if the loan is funded.
- log.annual.inc: The natural log of the self-reported annual income of the borrower.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- fico: The FICO credit score of the borrower.
- days.with.cr.line: The number of days the borrower has had a credit line.
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
- revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

# 2. Packages, data loading and cleaning

## 2.1. Packages

In [None]:
# Data wrangling
import pandas as pd
import numpy as np

# Data viz
import seaborn as sns
import matplotlib.pyplot as plt
import graphviz
sns.set_style('whitegrid')

# preprocessing
from sklearn.feature_selection import RFE

# Ml model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
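try:
    # plot_confusion_matrix was removed in scikit-learn 1.2
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    plot_confusion_matrix = None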

# over-sampling
from imblearn.over_sampling import RandomOverSampler

# Scaling
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

# np.warnings was removed in NumPy 1.24; use the standard library module instead
import warnings
warnings.filterwarnings('ignore')

## 2.2. Data loading

In [None]:
loan = pd.read_csv('../input/loan-data/loan_data.csv')
display(loan)
display(loan.describe())

## 2.3. Data cleaning
In this step we are going to remove duplicates and examine null values to see whether they can be discarded as well. The cleaning process may extend into the descriptive analysis, where suspicious data may come up.

We are also going to take a look at the data types and modify them if they are inconsistent.

### Data types

In [None]:
# info() prints its summary directly and returns None, so no display() is needed
loan.info()

We could leave the data types as they are, since a 0/1 integer can also be interpreted as a boolean. But, to simplify column selection in the descriptive section, we are going to convert "credit.policy" and "not.fully.paid" to booleans. The remaining features are all in order.

In [None]:
loan["credit.policy"] = loan["credit.policy"].astype("bool")
loan["not.fully.paid"] = loan["not.fully.paid"].astype("bool")
loan.info()

### Duplicates

In [None]:
loan.drop_duplicates(inplace=True)

### Null values
Before dropping the nulls, let's first check how they are distributed with a heatmap.

In [None]:
sns.heatmap(loan.isnull())

The heatmap doesn't show any null value. We can then continue with the descriptive analysis.
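
A heatmap can hide a handful of missing cells in a dataset with thousands of rows, so a numeric count per column is a useful cross-check. The sketch below uses a small invented frame standing in for `loan` (the column names mirror the dataset, but the values and the missing entry are made up for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `loan`; values and the missing entry are invented
toy = pd.DataFrame({
    "fico": [707, 682, np.nan, 712],
    "dti": [19.48, 14.29, 11.63, 8.10],
})

# Number of missing values per column, plus a single overall flag
null_counts = toy.isnull().sum()
has_nulls = bool(null_counts.any())

print(null_counts)
print("Any nulls:", has_nulls)
```

Applied to the real data, `loan.isnull().sum()` should return all zeros if the heatmap reading is right.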

# 3. Descriptive analysis

We know from section 2.3. that most of the features are numeric, with only one categorical feature and two booleans, one of which is our target variable. This makes our descriptive analysis quite easy, since we only have to plot histograms for the numeric features, a bar plot for the categorical feature and another bar plot for each boolean.

## 3.1. Categories

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

cat = loan.select_dtypes('object').columns

order = list(loan[cat[0]].value_counts().keys())
# Pass the column as x=; recent seaborn versions no longer accept it positionally
sns.countplot(x=cat[0], data=loan, palette="vlag", order=order)
ax.tick_params(labelrotation=90)
ax.set_title(cat[0])

plt.show()

table = pd.DataFrame(loan[cat[0]].value_counts())
table.rename(columns={'purpose':'count'}, inplace=True)
table['%'] = np.round((table['count']/table['count'].sum()) * 100, 2)
table

From the descriptive analysis we can see that the largest share of loans is for debt consolidation (41.31%). The least common category is educational. It is too soon to draw conclusions, but my intuition tells me that the categories with the most borrowers are also the ones with a larger proportion of paid debt, with educational sitting at the bottom because students tend to have the least money. We will hold that thought until the analysis part, where we examine it further.

## 3.2. Numbers

In [None]:
numbers = loan.select_dtypes(['int64', 'float64']).columns
loan[numbers].hist(figsize=(20,10), edgecolor='white', color='#00afb9')
plt.show()

loan[numbers].describe()

We can see that most of the histograms look roughly normal, except for revol.bal, which, judging from the descriptive stats, has some very extreme values. This could be addressed by removing the outliers or by applying a log transform such as:

In [None]:
TotalLog = np.log(loan['revol.bal'] + 1)
TotalLog.hist(color='#00afb9')

plt.show()

With the log transform, the histogram looks more 'normal'. We are going to take both methods into account when building the ML model.
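
To put a number on "looks more normal", one option is to compare the sample skewness before and after the transform. This is a minimal sketch on synthetic log-normal data standing in for `loan['revol.bal']`; the distribution parameters are assumptions chosen only to produce a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed balances (log-normal), not the real revol.bal column
balances = pd.Series(rng.lognormal(mean=9.0, sigma=1.2, size=5000))

# np.log1p(x) computes log(x + 1), the same transform used above
log_balances = np.log1p(balances)

skew_raw = balances.skew()
skew_log = log_balances.skew()
print(f"skew before: {skew_raw:.2f}, after: {skew_log:.2f}")
```

A skewness close to zero after the transform suggests the long tail has been tamed; on the real column the same two `.skew()` calls apply unchanged.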

Regarding 'delinq.2yrs' and 'pub.rec', their values are concentrated on 4-5 distinct values. They behave like categorical variables even though they are counts. A count plot could give us a better representation of these distributions.

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,5))

inte = ['delinq.2yrs', 'pub.rec']

sns.countplot(x=inte[0], data=loan, ax=ax[0], palette="vlag")
sns.countplot(x=inte[1], data=loan, ax=ax[1], palette="vlag")

plt.show()

print(loan[inte[0]].value_counts())
print('\n', loan[inte[1]].value_counts())

We can see that both distributions have some extreme values. After the data analysis, when building the ML model, we should consider getting rid of them.

## 3.3. Booleans

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,5))

boole = list(loan.select_dtypes(['bool']).columns)

sns.countplot(x=boole[0], data=loan, ax=ax[0], palette="vlag")
ax[0].tick_params(labelrotation=90)
ax[0].set_title(boole[0])

sns.countplot(x=boole[1], data=loan, ax=ax[1], palette="vlag")
ax[1].tick_params(labelrotation=90)
ax[1].set_title(boole[1])
               
plt.show()

print(loan[boole[0]].value_counts())
print('\n', loan[boole[1]].value_counts())

We have a case of imbalanced data, which could be a problem for the ML model, since most of the 'not.fully.paid' values are False. When building the model, we are going to apply a sampling technique to deal with the imbalance.
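
One way to see why the imbalance matters: a model that always predicts the majority class already gets a high accuracy without learning anything. The sketch below uses an invented 84/16 label split (not taken from the dataset) and scikit-learn's `DummyClassifier` as that trivial baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels with an 84/16 split (True = not fully paid)
y = np.array([False] * 84 + [True] * 16)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)
print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

Any real model should be judged against this kind of baseline, and ideally with the confusion matrix rather than accuracy alone.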

## 3.4. Descriptive analysis conclusions and considerations.

- 'revol.bal' is rich in outliers. For the ML model, we might remove the outliers or apply a log transform.
- 'delinq.2yrs' and 'pub.rec' are also rich in outliers. Since they are integer counts, I don't think a log transform would be wise. It's better to remove the outliers for the ML model.
- 'not.fully.paid', the target variable, is highly imbalanced. After the data analysis and before the ML model, I'm going to use a sampling technique to deal with this.
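
The outlier removal mentioned in these bullets could be sketched with the common 1.5 × IQR rule. This is only one possible criterion, and the helper name `iqr_mask` and the toy values are hypothetical, not part of the original analysis.

```python
import pandas as pd

def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Invented balances with one extreme value, standing in for revol.bal
toy = pd.Series([1200, 3500, 4100, 5200, 6100, 7400, 8800, 950000], name="revol.bal")
kept = toy[iqr_mask(toy)]
print(kept.tolist())
```

Note that for count features where most rows are 0 (as seems to be the case for 'delinq.2yrs' and 'pub.rec'), the IQR can be 0 and this rule would drop every non-zero row, so a fixed cap such as `series.clip(upper=...)` may be the more sensible choice there.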

# 4. Data Analysis and EDA

Since our target variable is a boolean, we are going to plot overlapped histograms of the continuous features, colored (hue) by the target variable; this serves mostly as a visual analysis. If we don't find any pattern, I'm going to apply model-based feature selection to reduce the number of features and analyze the most relevant ones for the model.

'Purpose' is a categorical variable, so a barplot will do.

'inq.last.6mths', 'delinq.2yrs' and 'pub.rec' describe the number of times something happened, so a barplot will also work in these cases.

## 4.1. Overlaped histograms

In [None]:
numbers = loan.select_dtypes(['int64', 'float64']).columns
numbers = numbers[:-3]

sns.histplot(data=loan, x=numbers[0], hue='not.fully.paid')

In [None]:
fig, ax = plt.subplots(2,4, figsize=(22,10))
ax=ax.ravel()

# Pair each axis with one feature instead of tracking a manual counter
for axis, col in zip(ax, numbers):
    sns.histplot(data=loan, x=col, hue='not.fully.paid', ax=axis)

Unfortunately, within each feature the two classes have the same shape and differ only in proportion, so looking for a pattern here will be difficult. Let's group the numeric features by our target variable and see how the mean and standard deviation differ.

In [None]:
loan.groupby('not.fully.paid')[numbers].agg(['mean', 'std'])

As expected, they do not differ a lot. The only conclusion we can draw here is that our target variable is almost equally distributed across each feature. Therefore, there's no obvious pattern for our model.
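
Because the features live on very different scales, "do not differ a lot" is easier to judge with a standardized difference between the group means. The sketch below computes Cohen's d (difference in means divided by the pooled standard deviation) on an invented toy frame; the helper `cohens_d` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

def cohens_d(a: pd.Series, b: pd.Series) -> float:
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Invented FICO scores, three per class
toy = pd.DataFrame({
    "fico": [700, 720, 710, 690, 680, 705],
    "not.fully.paid": [False, False, False, True, True, True],
})
paid = toy.loc[~toy["not.fully.paid"], "fico"]
unpaid = toy.loc[toy["not.fully.paid"], "fico"]

d = cohens_d(paid, unpaid)
print(f"Cohen's d for fico: {d:.2f}")
```

As a rough convention, values of |d| around 0.2 are usually read as small and around 0.8 as large, which gives a scale-free way to rank the features from the table above.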

## 4.2. Barplots

In [None]:
# Features to graph
numbers = loan.select_dtypes(['int64', 'float64']).columns
numbers = list(numbers[-3:])
numbers.append("credit.policy")
print(numbers)

# Viz
fig, ax = plt.subplots(2,2, figsize=(20,10))

ax=ax.ravel()

count=0
for i in numbers:
    sns.countplot(x=i, data=loan, hue='not.fully.paid', ax=ax[count])
    count+=1

plt.show()

No relevant insights here either. Let's apply model-based feature selection and let the algorithm show us what is not visible to the eye.

# 5. Model-based feature selection.

Model-based feature selection uses a supervised machine learning model to judge the importance of each feature, and keeps only the most important ones. For this case, we are going to use a random forest classifier, since it usually yields good results without having to normalize the features. Let's take a look.

In [None]:
# Ml values
numbers = loan.select_dtypes(['int64', 'float64', 'bool']).columns

X = loan[numbers].iloc[:,:-1].values
# scikit-learn expects a 1-D target; a column vector triggers a DataConversionWarning
y = loan['not.fully.paid'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
for i in range(1,13):

    select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=i)

    select.fit(X_train, y_train)

    mask = select.get_support()

    X_train_rfe = select.transform(X_train)
    X_test_rfe = select.transform(X_test)

    # Fix the seed so the scores are reproducible between runs
    score = RandomForestClassifier(random_state=42).fit(X_train_rfe, y_train).score(X_test_rfe, y_test)

    print("Test score: {:.3f}".format(score), " number of features: {}".format(i))

There's not much difference between test scores. I'm going to choose 5 features and see what the algorithm chooses.

In [None]:
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=5)

select.fit(X_train, y_train)

mask = select.get_support()

X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)

score = RandomForestClassifier(random_state=42).fit(X_train_rfe, y_train).score(X_test_rfe, y_test)

print("Test score: {:.3f}".format(score), " number of features: {}".format(5))

features = pd.DataFrame({'features':list(loan[numbers].iloc[:,:-1].columns), 'select':list(mask)})
display(features.T)
features = list(features[features['select']==True]['features'])
print("The selected features are: " "\n")
display(features)
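
Besides the boolean mask, a fitted `RFE` object also exposes `ranking_`, which records the order in which features were eliminated (1 means kept). The following is a small sketch on synthetic data from `make_classification`, not on the loan dataset, so the sizes and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic problem: 8 features, of which 3 carry signal
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)

select = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
             n_features_to_select=3)
select.fit(X, y)

# support_ is the same mask returned by get_support()
print("kept:", select.support_)
# ranking_: 1 for kept features, larger numbers were dropped earlier
print("rank:", select.ranking_)
```

Looking at `ranking_` for the loan features would show which discarded features came closest to being selected.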


We have our working features. Let's add the target variable 'not.fully.paid' and proceed by applying a decision tree. Then, I'm going to keep mining and graph the model to see if I can finally find the pattern.

In [None]:
features.append('not.fully.paid')

print("Working dataset", "\n")
loan[features]

# 6. Decision Tree

## 6.1. Resampling
We know from section 3.4 that our data is highly imbalanced, so let's apply a resampling algorithm before building the model. For this case, I'm going to apply random over-sampling, which balances the classes by randomly duplicating existing rows of the minority class (it does not create new synthetic rows, as SMOTE-style methods do).

In [None]:
loan_ros = loan[features]
print("Data before over-sampling")
print(loan_ros['not.fully.paid'].value_counts(), "\n")



In [None]:
# over-sampling
loan_ros = loan[features]
X = loan_ros.iloc[:,:-1]
y = loan_ros.iloc[:,-1]

ros = RandomOverSampler(random_state=42)
x_ros, y_ros = ros.fit_resample(X, y)

loan_ros = x_ros
loan_ros['not.fully.paid'] = y_ros


# visualizing samples
fig, ax = plt.subplots(1,2, figsize=(15,5))

sns.countplot(x='not.fully.paid', data=loan, ax=ax[0], palette="vlag")
ax[0].tick_params(labelrotation=90)
ax[0].set_title("Data before over-sampling")

sns.countplot(x='not.fully.paid', data=loan_ros, ax=ax[1], palette="vlag")
ax[1].tick_params(labelrotation=90)
ax[1].set_title("Data after over-sampling")

plt.show()

print("Data before over-sampling")
print(loan['not.fully.paid'].value_counts(), "\n")

print("Data after over-sampling")
print(loan_ros['not.fully.paid'].value_counts())
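
One caveat worth keeping in mind: because the over-sampling above runs on the whole dataset before the train/test split in section 6.2, copies of the same minority row can land in both the training and the test set, which may inflate the test score. A possible alternative, sketched here on synthetic data with plain NumPy as a stand-in for `RandomOverSampler`, is to split first and over-sample only the training part. It assumes `True` is the minority class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                 # synthetic features
y = np.array([False] * 170 + [True] * 30)     # synthetic imbalanced target

# Split first, so no duplicated row can end up on both sides
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Random over-sampling: draw minority rows with replacement until classes match
minority_idx = np.flatnonzero(y_train)
majority_idx = np.flatnonzero(~y_train)
extra = rng.choice(minority_idx,
                   size=len(majority_idx) - len(minority_idx),
                   replace=True)
train_idx = np.concatenate([majority_idx, minority_idx, extra])

X_train_bal, y_train_bal = X_train[train_idx], y_train[train_idx]
print("train counts:", np.bincount(y_train_bal.astype(int)))
print("test counts: ", np.bincount(y_test.astype(int)))
```

The test set keeps its original imbalance, so the score measured on it reflects the class mix the model would face in practice.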

With balanced data, we can go ahead and build the model. Let's first check the distributions with the newly balanced data.

In [None]:
fig, ax = plt.subplots(3,2, figsize=(22,10))
ax=ax.ravel()

count=0
for i in loan_ros.keys():
    sns.histplot(data=loan_ros, x=i, hue='not.fully.paid', ax=ax[count])
    count+=1

No pattern yet. Let's see what the model tells us.

## 6.2. Tree model

In [None]:
# Selecting training values
loan_model = loan_ros
X = loan_model.iloc[:,:-1].values
y = loan_model.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Building the model


tree = DecisionTreeClassifier(max_depth=12, random_state=0)
tree.fit(X_train, y_train)

print("Accuracy on training set : {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set : {:.3f}".format(tree.score(X_test, y_test)), "\n")


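# Confusion matrix
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

# Classes are ordered [False, True]: False means fully paid, True means not fully paid
disp = ConfusionMatrixDisplay.from_estimator(tree, X_test, y_test,
                                             cmap=plt.cm.Blues,
                                             display_labels=["fully paid", "not fully paid"])
plt.show()

print(disp.confusion_matrix, "\n")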

print("Feature importances: ")
print(tree.feature_importances_)

We have achieved an acceptable score and confusion matrix. The feature importances show why it is so difficult to find a pattern in the data: all of the features have similar importance, which means that the model relies on a combination of fine adjustments from all of the features. It is hard to create a client profile given these features. Still, the algorithm can be useful: when the bank operator enters the data in the system, it can tell the agents how probable it is that the client will fully pay.
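
The "how probable" part maps onto `predict_proba`, which for a decision tree returns the class proportions in the leaf an observation falls into. A minimal sketch on synthetic data follows; the applicant row and the shallow `max_depth=4` are assumptions for illustration, not the model trained above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five selected features and a binary target
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# One hypothetical new applicant (here simply the first synthetic row)
applicant = X[:1]
proba = tree.predict_proba(applicant)[0]

# Probabilities are listed in the order of tree.classes_
print(dict(zip(tree.classes_, proba.round(3))))
```

With a deep tree the leaves tend to be nearly pure, so the probabilities collapse toward 0 or 1; a shallower tree usually gives more graded estimates for agents to act on.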

# 7. Conclusions
- We couldn't find any observable pattern. The target variable is almost equally distributed across all the variables. It almost looks as if the dataset is synthetic.
- Nevertheless, the algorithm was able to reach a decent and more realistic test score (working with the imbalanced data would have yielded a better but misleading result), and was able to classify both outcomes of the target variable in the confusion matrix.
- We couldn't meet the objective of creating a client profile, but the algorithm could be used by agents to see how probable it is that a client will pay, easing the decision process.