# Predicting Loan Repayment

In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.

To address this problem, we will use publicly available data from [LendingClub.com](https://www.lendingclub.com/info/download-data.action), a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable `not.fully.paid` indicates that the loan was not paid back in full (the borrower either defaulted or the loan was "charged off," meaning the borrower was deemed unlikely to ever pay it back).

To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:

__credit.policy__: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

__purpose__: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").

__int.rate__: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

__installment__: The monthly installments ($) owed by the borrower if the loan is funded.

__log.annual.inc__: The natural log of the self-reported annual income of the borrower.

__dti__: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

__fico__: The FICO credit score of the borrower.

__days.with.cr.line__: The number of days the borrower has had a credit line.

__revol.bal__: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

__revol.util__: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

__inq.last.6mths__: The borrower's number of inquiries by creditors in the last 6 months.

__delinq.2yrs__: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

__pub.rec__: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
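
Two of the variables above are stored in transformed units, which is easy to overlook when reading summary statistics: `log.annual.inc` is a natural log and `int.rate` is a proportion. The sketch below converts each back to familiar units using only the standard library. The helper names `income_from_log` and `rate_as_percent` and the sample values are illustrative assumptions, not part of the dataset.

```python
import math

def income_from_log(log_annual_inc):
    """Invert the natural log to recover annual income in dollars."""
    return math.exp(log_annual_inc)

def rate_as_percent(int_rate):
    """Convert a proportion such as 0.11 into a percentage such as 11.0."""
    return int_rate * 100

# Illustrative values only (not rows from loans.csv).
example_income = income_from_log(11.0)
example_rate = rate_as_percent(0.11)
print(f"log income 11.0 -> about ${example_income:,.0f} per year")
print(f"int.rate 0.11 -> {example_rate:.1f}%")
```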

### Load and understand data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

Load the dataset loans.csv into a data frame called loans, and explore it.

In [2]:
loans = pd.read_csv(r'./Data/loans.csv')

In [3]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9574 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9549 non-null float64
revol.bal            9578 non-null int64
revol.util           9516 non-null float64
inq.last.6mths       9549 non-null float64
delinq.2yrs          9549 non-null float64
pub.rec              9549 non-null float64
not.fully.paid       9578 non-null int64
dtypes: float64(9), int64(4), object(1)
memory usage: 1.0+ MB


In [4]:
loans.describe()

| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9578.0 | 9578.0 | 9578.0 | 9574.0 | 9578.0 | 9578.0 | 9549.0 | 9578.0 | 9516.0 | 9549.0 | 9549.0 | 9549.0 | 9578.0 |
| mean | 0.80497 | 0.12264 | 319.089413 | 10.931874 | 12.606679 | 710.846314 | 4562.026085 | 16913.96 | 46.865677 | 1.571578 | 0.163787 | 0.062101 | 0.160054 |
| std | 0.396245 | 0.026847 | 207.071301 | 0.614736 | 6.88397 | 37.970537 | 2497.985733 | 33756.19 | 29.018642 | 2.198095 | 0.546712 | 0.262152 | 0.366676 |
| min | 0.0 | 0.06 | 15.67 | 7.547502 | 0.0 | 612.0 | 178.958333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.1039 | 163.77 | 10.558414 | 7.2125 | 682.0 | 2820.0 | 3187.0 | 22.7 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.1221 | 268.95 | 10.927987 | 12.665 | 707.0 | 4139.958333 | 8596.0 | 46.4 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 0.1407 | 432.7625 | 11.289832 | 17.95 | 737.0 | 5730.0 | 18249.5 | 71.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 0.2164 | 940.14 | 14.528354 | 29.96 | 827.0 | 17639.95833 | 1207359.0 | 119.0 | 33.0 | 13.0 | 5.0 | 1.0 |


What proportion of the loans in the dataset were not paid in full? Please input a number between 0 and 1.

In [5]:
loans['not.fully.paid'].mean()

0.16005429108373356
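
Because `not.fully.paid` is coded as 0 or 1, its mean is the same thing as the share of ones. The sketch below checks this on a made-up list and then converts the proportion reported above back into a number of loans, using the 9,578 rows shown by `loans.info()`. The variable names are illustrative.

```python
# Toy 0/1 outcome column (made-up values, not the real data).
toy_outcomes = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

share_by_mean = sum(toy_outcomes) / len(toy_outcomes)
share_by_count = toy_outcomes.count(1) / len(toy_outcomes)

# Converting the proportion reported above back into a number of loans.
n_loans = 9578
reported_share = 0.16005429108373356
not_fully_paid_count = round(reported_share * n_loans)
print(share_by_mean, share_by_count, not_fully_paid_count)
```

Under these figures, roughly 1,533 of the 9,578 loans were not paid in full.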

What variables have at least one missing observation?

In [6]:
loans.isna().sum(axis=0)

credit.policy         0
purpose               0
int.rate              0
installment           0
log.annual.inc        4
dti                   0
fico                  0
days.with.cr.line    29
revol.bal             0
revol.util           62
inq.last.6mths       29
delinq.2yrs          29
pub.rec              29
not.fully.paid        0
dtype: int64
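
To answer the question directly, the counts can be filtered so that only columns with at least one missing value remain. This is a minimal sketch on a toy frame with made-up values; only the column names echo the dataset, and `columns_with_missing` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names echo the dataset.
toy = pd.DataFrame({
    "fico": [700, 710, 690],
    "revol.util": [45.0, np.nan, np.nan],
    "pub.rec": [0.0, np.nan, 1.0],
})

missing_counts = toy.isna().sum()
# Keep only columns with at least one missing value.
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Applied to `loans`, the same filter would keep the six columns with non-zero counts in the output above.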

For the rest of this problem, we'll be using a revised version of the dataset in which the missing values have been filled in with multiple imputation.

In [7]:
loans_imp = pd.read_csv('./Data/loans_imputed.csv')
loans_imp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB


Now that we have prepared the dataset, we need to split it into a training set and a testing set (randomly select 70% of the observations for the training set). The dependent variable is `not.fully.paid`.

Now, use logistic regression trained on the training set to predict the dependent variable `not.fully.paid` using all the independent variables.

In [8]:
loans_imp['purpose'].value_counts()

debt_consolidation    3957
all_other             2331
credit_card           1262
home_improvement       629
small_business         619
major_purchase         437
educational            343
Name: purpose, dtype: int64

In [9]:
loans_imp = pd.get_dummies(loans_imp, prefix=['purpose'])
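
The call above replaces the text column `purpose` with one indicator column per category. The sketch below shows the effect on a toy frame with made-up rows; it names the column explicitly through `columns=`, which should be equivalent here because `purpose` is the only non-numeric column.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names echo the dataset.
toy = pd.DataFrame({
    "fico": [700, 710, 690],
    "purpose": ["credit_card", "educational", "credit_card"],
})

# One indicator column is created per distinct value of 'purpose'.
encoded = pd.get_dummies(toy, columns=["purpose"], prefix="purpose")
print(list(encoded.columns))
```

Note that the indicator columns for a single variable always sum to 1 in each row, so one of them is redundant when the model also has an intercept. Passing `drop_first=True` is one common way to remove that redundancy, though the notebook keeps all seven.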

In [10]:
from sklearn.model_selection import train_test_split

features = list(loans_imp.columns.drop(labels='not.fully.paid'))

y = loans_imp['not.fully.paid']
X = loans_imp[features]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=70)
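
A quick arithmetic check on the sizes this split should produce. The sketch assumes the test count is the requested share rounded up, with the remainder going to training; that assumption is consistent with the 2,874 test rows that appear in the confusion matrix later on.

```python
import math

n_samples = 9578
test_size = 0.3

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```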

Predict the probability of the test set loans not being paid back in full. Store these predicted probabilities in a variable named `predicted_risk` and add it to your test set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.

In [11]:
from sklearn.linear_model import LogisticRegression

# C=1e10 to avoid regularization
model = LogisticRegression(solver='liblinear', penalty="l2", C=1e10).fit(X_train, y_train)
predicted_risk = model.predict_proba(X_test)

y_pred = (predicted_risk[:,1]>=0.5).astype(int)
#y_pred = model.predict(X_test)

# confusion matrix
pd.crosstab(y_pred, y_test, rownames=['y_pred'], colnames=['y_test'])

| y_pred \ y_test | 0 | 1 |
|---|---|---|
| 0 | 2411 | 455 |
| 1 | 5 | 3 |
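
The four counts in the confusion matrix are enough to derive the usual summary metrics by hand. The sketch below does so with plain arithmetic; the variable names (`tn`, `fn`, `fp`, `tp`) are illustrative and treat "not fully paid" as the positive class.

```python
# Counts read from the confusion matrix above (rows: predicted, columns: actual).
tn, fn = 2411, 455   # predicted 0
fp, tp = 5, 3        # predicted 1

total = tn + fn + fp + tp
accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)   # share of unpaid loans that were flagged
specificity = tn / (tn + fp)   # share of repaid loans correctly passed

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```

The very low sensitivity shows that, at a threshold of 0.5, the model flags almost none of the loans that were not repaid.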


What is the accuracy of the logistic regression model? Input the accuracy as a number between 0 and 1.

In [12]:
(y_pred == y_test).sum()/len(y_pred)
#model.score(X_test, y_test)

0.8399443284620738

What is the accuracy of the baseline model? Input the accuracy as a number between 0 and 1.

In [13]:
# the baseline always predicts the most frequent outcome;
# this is the share of that outcome among the training labels
y_train.value_counts().max()/len(y_train)

0.8396479713603818
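
The figure above is the majority-class share in the training labels. For a like-for-like comparison with the model's test accuracy, the same baseline can be evaluated on the test labels, whose class totals can be read off the confusion matrix columns. This is a sketch with illustrative variable names.

```python
# Actual class totals in the test set, taken from the confusion matrix columns.
actual_paid = 2411 + 5       # not.fully.paid == 0
actual_unpaid = 455 + 3      # not.fully.paid == 1
n_test = actual_paid + actual_unpaid

# Always predicting the majority class is correct for every majority-class row.
baseline_test_accuracy = max(actual_paid, actual_unpaid) / n_test
model_test_accuracy = (2411 + 3) / n_test
print(f"baseline={baseline_test_accuracy:.4f} model={model_test_accuracy:.4f}")
```

On the test set the baseline comes out marginally ahead of the logistic regression at the 0.5 threshold.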

Compute the test set AUC.

In [14]:
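from sklearn.metrics import roc_auc_score
# Note: y_pred holds hard 0/1 labels, so this is the AUC of the thresholded classifier;
# passing predicted_risk[:, 1] instead would score the model's probability ranking.
roc_auc_score(y_test, y_pred)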

0.5022403409583851

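At a threshold of 0.5 the model is barely better than guessing: it flags only 8 of the 2,874 test loans as risky, its accuracy (0.8399) is essentially that of always predicting full repayment (0.8396), and the AUC computed from the thresholded predictions is about 0.502.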
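
AUC measures how well scores rank positives above negatives, so it is normally computed from predicted probabilities rather than from 0/1 predictions. The sketch below implements the pairwise definition in plain Python and applies it to made-up scores, to show how thresholding can discard ranking information. `rank_auc` is a hypothetical helper written for this illustration, not a library function.

```python
def rank_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and probabilities (not model output).
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [0.10, 0.20, 0.30, 0.25, 0.40, 0.45]
toy_hard = [int(p >= 0.5) for p in toy_probs]  # every prediction becomes 0

auc_from_probs = rank_auc(toy_labels, toy_probs)
auc_from_hard = rank_auc(toy_labels, toy_hard)
print(auc_from_probs, auc_from_hard)
```

In this toy case the probabilities rank 8 of the 9 positive/negative pairs correctly, while the thresholded labels are all tied and score 0.5. With scikit-learn, the probability-based version would be `roc_auc_score(y_test, predicted_risk[:, 1])`; its value for this model is not reported here.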