## Module 7 - Case study 3

Business challenge/requirement:

PeerLoanKart is an NBFC (Non-Banking Financial Company) that facilitates peer-to-peer loans.
It connects people who need money (borrowers) with people who have money (investors). As an investor, you want to lend to borrowers whose profile indicates a high probability of paying you back.
As an ML expert, your task is to create a model that predicts whether a borrower will repay the loan.

Key issues:

Keep NPAs (non-performing assets) low: PeerLoanKart wants to be very diligent in giving loans to borrowers

Data volume:

- Approx 9578 records – file loan_borowwer_data.csv

Fields in Data:

• credit.policy: 1 if the customer meets the credit underwriting criteria of PeerLoanKart, and 0 otherwise

• purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other")

• int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by PeerLoanKart to be more risky are assigned higher interest rates

• installment: The monthly installments owed by the borrower if the loan is funded

• log.annual.inc: The natural log of the self-reported annual income of the borrower

• dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income)

• fico: The FICO credit score of the borrower

• days.with.cr.line: The number of days the borrower has had a credit line

• revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle)

• revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available)

• inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months

• delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years

• pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments)

• not.fully.paid: The output (target) field. A value of 1 means the borrower did not pay the loan back in full
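
Two of the fields above are stored in transformed units: `log.annual.inc` is a natural log and `int.rate` is a proportion. A minimal sketch of converting them back to readable units, using the values from the first data row displayed later in this notebook (the variable names are illustrative and not part of the original analysis):

```python
import numpy as np

# Values taken from the first row of the dataset shown in this notebook.
log_annual_inc = 11.350407
int_rate = 0.1189

# Undo the natural log to recover the self-reported annual income.
annual_income = np.exp(log_annual_inc)

# Express the interest rate as a percentage.
int_rate_pct = int_rate * 100

print(f"annual income ~ {annual_income:,.0f}")
print(f"interest rate = {int_rate_pct:.2f}%")
```

Working in log income compresses the long right tail of incomes, which is presumably why the dataset stores it that way.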

Business benefits:

Profits could increase by up to 20%, since NPAs fall when loans are disbursed only to good borrowers

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
df = pd.read_csv(r'D:\E\Courses\Edureka\Assignments\Dataset\module7\loan_borowwer_data.csv')

In [3]:
df

|  | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
| 1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
| 2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
| 3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
| 4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9573 | 0 | all_other | 0.1461 | 344.76 | 12.180755 | 10.39 | 672 | 10474.000000 | 215372 | 82.1 | 2 | 0 | 0 | 1 |
| 9574 | 0 | all_other | 0.1253 | 257.70 | 11.141862 | 0.21 | 722 | 4380.000000 | 184 | 1.1 | 5 | 0 | 0 | 1 |
| 9575 | 0 | debt_consolidation | 0.1071 | 97.81 | 10.596635 | 13.09 | 687 | 3450.041667 | 10036 | 82.9 | 8 | 0 | 0 | 1 |
| 9576 | 0 | home_improvement | 0.1600 | 351.58 | 10.819778 | 19.18 | 692 | 1800.000000 | 0 | 3.2 | 5 | 0 | 0 | 1 |


In [4]:
df.isna().sum()

credit.policy        0
purpose              0
int.rate             0
installment          0
log.annual.inc       0
dti                  0
fico                 0
days.with.cr.line    0
revol.bal            0
revol.util           0
inq.last.6mths       0
delinq.2yrs          0
pub.rec              0
not.fully.paid       0
dtype: int64

In [10]:
x = df.iloc[:,:13].values      # features: the first 13 columns
y = df.iloc[:, 13:14].values   # target: not.fully.paid, kept as a 2-D column

In [11]:
x, y

(array([[1, 'debt_consolidation', 0.1189, ..., 0, 0, 0],
        [1, 'credit_card', 0.1071, ..., 0, 0, 0],
        [1, 'debt_consolidation', 0.1357, ..., 1, 0, 0],
        ...,
        [0, 'debt_consolidation', 0.1071, ..., 8, 0, 0],
        [0, 'home_improvement', 0.16, ..., 5, 0, 0],
        [0, 'debt_consolidation', 0.1392, ..., 6, 0, 0]], dtype=object),
 array([[0],
        [0],
        [0],
        ...,
        [1],
        [1],
        [1]], dtype=int64))

In [15]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [17]:
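labelencoder_x = LabelEncoder()
# Encode 'purpose' (column 1). The original cell passed x[:,0], which copied credit.policy into column 1; the recorded outputs reflect that run.
x[:,1] = labelencoder_x.fit_transform(x[:,1])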

In [18]:
x

array([[1, 1, 0.1189, ..., 0, 0, 0],
       [1, 1, 0.1071, ..., 0, 0, 0],
       [1, 1, 0.1357, ..., 1, 0, 0],
       ...,
       [0, 0, 0.1071, ..., 8, 0, 0],
       [0, 0, 0.16, ..., 5, 0, 0],
       [0, 0, 0.1392, ..., 6, 0, 0]], dtype=object)

In [25]:
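# One-hot encoding was skipped because column 1 shows only two values (0 and 1) in the output above.
# Note: 'purpose' itself has more than two categories, so a correctly encoded 'purpose' column would call for one-hot encoding.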
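
Because `purpose` is a nominal field with more than two categories, integer codes would impose an arbitrary order on it. A minimal sketch of one-hot encoding with `pandas.get_dummies`, on a toy frame that holds a few of the category names from the field description (the frame and variable names are assumptions for illustration, not the notebook's data):

```python
import pandas as pd

# Toy frame with the categorical column only; the values are a few of the
# categories listed in the field description.
toy = pd.DataFrame({
    "purpose": ["credit_card", "debt_consolidation", "educational",
                "major_purchase", "small_business", "all_other"],
})

# One indicator column per category; drop_first removes one redundant column.
encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

print(encoded.columns.tolist())
```

Dropping the first category avoids perfectly collinear indicator columns, which matters mostly for linear models such as logistic regression.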

In [26]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=1)

In [27]:
len(x_train), len(x_test), len(y_train), len(y_test)

(7662, 1916, 7662, 1916)
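
The split above is purely random. With an imbalanced target (323 of the 1916 test rows are class 1 in the reports further down), a stratified split keeps the class ratio identical in both parts. A minimal sketch on toy data, where the arrays `X_toy` and `y_toy` are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 samples, 20% positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy preserves the 20% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```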

### Model-1 : Applying Logistic Regression to predict the customer non-full payment

In [28]:
from sklearn.linear_model import LogisticRegression
log_reg_model = LogisticRegression()
log_reg_model.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [30]:
y_predict_logistic = log_reg_model.predict(x_test)

In [37]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [32]:
confusion_matrix(y_test, y_predict_logistic)

array([[1589,    4],
       [ 314,    9]], dtype=int64)

In [121]:
log_reg_model_accu = round(accuracy_score(y_test, y_predict_logistic),4)
log_reg_model_accu

0.834

In [122]:
accuracy_dict = {}

accuracy_dict['log_reg_model'] = log_reg_model_accu

In [39]:
print(classification_report(y_test, y_predict_logistic))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91      1593
           1       0.69      0.03      0.05       323

    accuracy                           0.83      1916
   macro avg       0.76      0.51      0.48      1916
weighted avg       0.81      0.83      0.76      1916



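###### 83.4% accuracy looks good, but it is barely above the 83.1% obtained by always predicting class 0, and recall for class 1 is only 0.03. We will still try other models and pick the one with the best accuracy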
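
The figures in the classification report follow directly from the confusion matrix. A short worked sketch using the logistic regression matrix reported above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix reported above for logistic regression
# (rows: actual 0/1, columns: predicted 0/1).
cm = np.array([[1589, 4],
               [314, 9]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
baseline = (tn + fp) / cm.sum()      # accuracy of always predicting class 0
recall_1 = tp / (tp + fn)            # share of actual class-1 rows caught
precision_1 = tp / (tp + fp)         # share of class-1 predictions that are right

print(round(accuracy, 4), round(baseline, 4),
      round(recall_1, 2), round(precision_1, 2))
```

For a lender trying to limit NPAs, `recall_1` is arguably the more relevant number, since every missed class-1 borrower is a loan that is not fully repaid.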

### Model-2 : Applying Decision Tree Classifier to predict the customer non-full payment

In [34]:
from sklearn.tree import DecisionTreeClassifier

dec_tree_class_model = DecisionTreeClassifier()
dec_tree_class_model.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [36]:
y_predict_dec_tree_class = dec_tree_class_model.predict(x_test)

In [40]:
confusion_matrix(y_test, y_predict_dec_tree_class)

array([[1340,  253],
       [ 255,   68]], dtype=int64)

In [117]:
accu_dec_tree_class = round(accuracy_score(y_test, y_predict_dec_tree_class),4)
accu_dec_tree_class

0.7349

In [123]:
accuracy_dict['dec_tree_class'] = accu_dec_tree_class

In [46]:
print(classification_report(y_test, y_predict_dec_tree_class))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84      1593
           1       0.21      0.21      0.21       323

    accuracy                           0.73      1916
   macro avg       0.53      0.53      0.53      1916
weighted avg       0.73      0.73      0.73      1916



### Model-3 : Applying Random Forest Classifier to predict the customer non-full payment

In [43]:
from sklearn.ensemble import RandomForestClassifier

rand_for_class_model = RandomForestClassifier(n_estimators=20)
rand_for_class_model.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [45]:
y_predict_rand_for_class = rand_for_class_model.predict(x_test)

In [47]:
confusion_matrix(y_test, y_predict_rand_for_class)

array([[1575,   18],
       [ 312,   11]], dtype=int64)

In [119]:
rand_for_class_accu = round(accuracy_score(y_test, y_predict_rand_for_class),4)
rand_for_class_accu

0.8278

In [124]:
accuracy_dict['rand_for_class'] = rand_for_class_accu

In [50]:
print(classification_report(y_test, y_predict_rand_for_class))

              precision    recall  f1-score   support

           0       0.83      0.99      0.91      1593
           1       0.38      0.03      0.06       323

    accuracy                           0.83      1916
   macro avg       0.61      0.51      0.48      1916
weighted avg       0.76      0.83      0.76      1916
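
The forest above was built with `random_state=None`, so its confusion matrix can differ slightly from run to run. A minimal sketch of making the result repeatable by fixing `random_state`, using synthetic data from `make_classification` (an assumption for illustration, not the loan dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration, not the loan dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Fixing random_state makes the bootstrap samples and feature choices repeatable.
rf_a = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_toy, y_toy)
rf_b = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_toy, y_toy)

same = (rf_a.predict(X_toy) == rf_b.predict(X_toy)).all()
print(same)
```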



### Model-4 : Applying Naive Bayes algorithm to predict the customer non-full payment

In [51]:
from sklearn.naive_bayes import GaussianNB
gnb_class_model = GaussianNB()
gnb_class_model.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [53]:
y_predict_gnb_class_model = gnb_class_model.predict(x_test)

In [54]:
confusion_matrix(y_test, y_predict_gnb_class_model)

array([[1530,   63],
       [ 286,   37]], dtype=int64)

In [125]:
gnb_class_accu =round(accuracy_score(y_test, y_predict_gnb_class_model),4)
gnb_class_accu

0.8178

In [126]:
accuracy_dict['gnb_class'] = gnb_class_accu

In [56]:
print(classification_report(y_test, y_predict_gnb_class_model))

              precision    recall  f1-score   support

           0       0.84      0.96      0.90      1593
           1       0.37      0.11      0.17       323

    accuracy                           0.82      1916
   macro avg       0.61      0.54      0.54      1916
weighted avg       0.76      0.82      0.78      1916



### Model-5 : Applying K Nearest Neighbour classifier to predict the customer non-full payment

In [57]:
# Standardize the features, since KNN is a distance-based algorithm

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x = sc_x.fit_transform(x)

In [58]:
x

array([[ 0.49222226,  0.49222226, -0.13931753, ..., -0.71698894,
        -0.29973008, -0.23700318],
       [ 0.49222226,  0.49222226, -0.57886837, ..., -0.71698894,
        -0.29973008, -0.23700318],
       [ 0.49222226,  0.49222226,  0.48648368, ..., -0.26247044,
        -0.29973008, -0.23700318],
       ...,
       [-2.03160257, -2.03160257, -0.57886837, ...,  2.91915909,
        -0.29973008, -0.23700318],
       [-2.03160257, -2.03160257,  1.39166043, ...,  1.55560358,
        -0.29973008, -0.23700318],
       [-2.03160257, -2.03160257,  0.61685894, ...,  2.01012208,
        -0.29973008, -0.23700318]])

In [59]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=1)
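
In the cells above the scaler is fitted on the full feature matrix before splitting, so test-set statistics influence the scaling. One alternative, sketched here on synthetic data (the data and the name `knn_pipe` are assumptions for illustration), is to wrap the scaler and the classifier in a pipeline so that scaling is learned from the training rows only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the loan dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)

# fit() learns the scaling from X_tr only; predict/score reuse it for X_te.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipe.fit(X_tr, y_tr)

test_accuracy = knn_pipe.score(X_te, y_te)
print(test_accuracy)
```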

In [60]:
from sklearn.neighbors import KNeighborsClassifier
knn_class_model = KNeighborsClassifier()
knn_class_model.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [62]:
y_predict_knn_class_model = knn_class_model.predict(x_test)

In [63]:
confusion_matrix(y_test, y_predict_knn_class_model)

array([[1532,   61],
       [ 298,   25]], dtype=int64)

In [127]:
knn_class_accu = round(accuracy_score(y_test, y_predict_knn_class_model),4)
knn_class_accu

0.8126

In [128]:
accuracy_dict['knn_class'] = knn_class_accu

In [65]:
print(classification_report(y_test, y_predict_knn_class_model))

              precision    recall  f1-score   support

           0       0.84      0.96      0.90      1593
           1       0.29      0.08      0.12       323

    accuracy                           0.81      1916
   macro avg       0.56      0.52      0.51      1916
weighted avg       0.75      0.81      0.76      1916



### Model-6 : Applying SVM classifier to predict the customer non-full payment

In [66]:
from sklearn.svm import SVC

###### 6.1 SVC with kernel= linear

In [67]:
svm_class_lin_model = SVC(kernel='linear')
svm_class_lin_model.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [69]:
y_predict_svm_class_lin_model = svm_class_lin_model.predict(x_test)

In [70]:
confusion_matrix(y_test, y_predict_svm_class_lin_model)

array([[1593,    0],
       [ 323,    0]], dtype=int64)

In [129]:
svm_lin_accu = round(accuracy_score(y_test, y_predict_svm_class_lin_model),4)
svm_lin_accu

0.8314

In [130]:
accuracy_dict['svm_lin'] = svm_lin_accu

In [72]:
print(classification_report(y_test, y_predict_svm_class_lin_model))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91      1593
           1       0.00      0.00      0.00       323

    accuracy                           0.83      1916
   macro avg       0.42      0.50      0.45      1916
weighted avg       0.69      0.83      0.75      1916
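
The linear SVC predicted class 0 for every test row, so precision for class 1 is 0/0 and scikit-learn reports it as 0.00. A minimal sketch of making that choice explicit, assuming scikit-learn 0.22 or later where the `zero_division` parameter is available (the toy labels are an assumption for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels (an assumption for illustration): the model never predicts class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]

# Precision for class 1 is 0/0 here; zero_division sets the value to report
# and suppresses the UndefinedMetricWarning.
p = precision_score(y_true, y_pred, zero_division=0)
r = recall_score(y_true, y_pred)

print(p, r)
```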



###### 6.2 SVC with kernel= rbf

In [73]:
svm_class_rbf_model = SVC(kernel='rbf')
svm_class_rbf_model.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [75]:
y_predict_svm_class_rbf = svm_class_rbf_model.predict(x_test)

In [76]:
confusion_matrix(y_test, y_predict_svm_class_rbf)

array([[1586,    7],
       [ 320,    3]], dtype=int64)

In [131]:
svm_rbf_accu = round(accuracy_score(y_test, y_predict_svm_class_rbf),4)
svm_rbf_accu

0.8293

In [132]:
accuracy_dict['svm_rfb'] = svm_rbf_accu

In [78]:
print(classification_report(y_test, y_predict_svm_class_rbf))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91      1593
           1       0.30      0.01      0.02       323

    accuracy                           0.83      1916
   macro avg       0.57      0.50      0.46      1916
weighted avg       0.74      0.83      0.76      1916



###### 6.3 SVC with kernel= sigmoid

In [79]:
svm_class_sig_model = SVC(kernel='sigmoid')
svm_class_sig_model.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='sigmoid', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [81]:
y_predict_svm_class_sig_model = svm_class_sig_model.predict(x_test)

In [82]:
confusion_matrix(y_test, y_predict_svm_class_sig_model)

array([[1359,  234],
       [ 261,   62]], dtype=int64)

In [133]:
svm_sig_accu = round(accuracy_score(y_test, y_predict_svm_class_sig_model),4)
svm_sig_accu

0.7416

In [134]:
accuracy_dict['svm_sigmoid'] = svm_sig_accu
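
No classification report was recorded for the sigmoid kernel. Since the per-class metrics depend only on the confusion matrix, one way to recover them is to rebuild label vectors that reproduce the reported counts. A sketch under that assumption (the rebuilt `y_true` and `y_pred` are not the original test vectors, only vectors with the same counts):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Confusion matrix reported above for the sigmoid kernel
# (rows: actual 0/1, columns: predicted 0/1).
cm = np.array([[1359, 234],
               [261, 62]])

# Rebuild label vectors that reproduce these counts (row order is arbitrary).
y_true = np.repeat([0, 0, 1, 1], cm.ravel())
y_pred = np.repeat([0, 1, 0, 1], cm.ravel())

# Sanity check: the rebuilt vectors give back the same matrix.
assert (confusion_matrix(y_true, y_pred) == cm).all()
print(classification_report(y_true, y_pred))
```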

###### 6.4 SVC with kernel= polynomial

In [85]:
svm_class_poly_model = SVC(kernel='poly', gamma='auto')
svm_class_poly_model.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [87]:
y_predict_svm_class_poly_model = svm_class_poly_model.predict(x_test)

In [88]:
confusion_matrix(y_test, y_predict_svm_class_poly_model)

array([[1589,    4],
       [ 318,    5]], dtype=int64)

In [135]:
svm_poly_accu = round(accuracy_score(y_test, y_predict_svm_class_poly_model),4)
svm_poly_accu

0.8319

In [136]:
accuracy_dict['svm_ploy'] = svm_poly_accu

In [91]:
print(classification_report(y_test, y_predict_svm_class_poly_model))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91      1593
           1       0.56      0.02      0.03       323

    accuracy                           0.83      1916
   macro avg       0.69      0.51      0.47      1916
weighted avg       0.79      0.83      0.76      1916



In [137]:
accuracy_dict

{'log_reg_model': 0.834,
 'dec_tree_class': 0.7349,
 'rand_for_class': 0.8278,
 'gnb_class': 0.8178,
 'knn_class': 0.8126,
 'svm_lin': 0.8314,
 'svm_rfb': 0.8293,
 'svm_sigmoid': 0.7416,
 'svm_ploy': 0.8319}

### Based on accuracy alone, logistic regression (0.834) is the best model for this case, although its recall on class 1 is only 0.03.
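
The comparison can be made explicit by ranking the recorded accuracies and checking them against the accuracy of always predicting class 0. A short sketch using the values from `accuracy_dict` above (the names `baseline`, `ranked` and `above_baseline` are illustrative):

```python
# Accuracies as recorded in accuracy_dict above (keys kept as in the notebook).
accuracy_dict = {
    'log_reg_model': 0.834, 'dec_tree_class': 0.7349, 'rand_for_class': 0.8278,
    'gnb_class': 0.8178, 'knn_class': 0.8126, 'svm_lin': 0.8314,
    'svm_rfb': 0.8293, 'svm_sigmoid': 0.7416, 'svm_ploy': 0.8319,
}

# Accuracy of always predicting class 0 on the 1916-row test set (1593 negatives).
baseline = round(1593 / 1916, 4)

# Sort models from highest to lowest accuracy.
ranked = sorted(accuracy_dict.items(), key=lambda kv: kv[1], reverse=True)
best_name, best_acc = ranked[0]
above_baseline = [name for name, acc in ranked if acc > baseline]

print(best_name, best_acc, baseline, above_baseline)
```

Only two of the nine models exceed the majority-class baseline, and by less than half a percentage point, which suggests that accuracy on its own separates these models poorly.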