In this notebook, we are going to implement an unsupervised learning algorithm: k-means clustering.

We will use the same loan dataset that we used for the classification algorithms.

Even though it is a labeled dataset, we will ignore the labels while fitting and see how k-means performs.

We will not repeat the exploratory data analysis or explain the feature engineering steps, as we have already covered them as part of the classification algorithms.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix,classification_report

In [2]:
# this file contains a description of each column in the dataset
df_info = pd.read_csv('../dataset/lending_club_info.csv')

In [3]:
df_info

Unnamed: 0,LoanStatNew,Description
0,loan_amnt,The listed amount of the loan applied for by t...
1,term,The number of payments on the loan. Values are...
2,int_rate,Interest Rate on the loan
3,installment,The monthly payment owed by the borrower if th...
4,grade,LC assigned loan grade
5,sub_grade,LC assigned loan subgrade
6,emp_title,The job title supplied by the Borrower when ap...
7,emp_length,Employment length in years. Possible values ar...
8,home_ownership,The home ownership status provided by the borr...
9,annual_inc,The self-reported annual income provided by th...


In [4]:
#actual dataset
df = pd.read_csv('../dataset/lending_club_loan_two.csv')

In [5]:
df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,...,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,mort_acc,pub_rec_bankruptcies,address
0,10000.0,36 months,11.44,329.48,B,B4,Marketing,10+ years,RENT,117000.0,...,16.0,0.0,36369.0,41.8,25.0,w,INDIVIDUAL,0.0,0.0,"0174 Michelle Gateway\nMendozaberg, OK 22690"
1,8000.0,36 months,11.99,265.68,B,B5,Credit analyst,4 years,MORTGAGE,65000.0,...,17.0,0.0,20131.0,53.3,27.0,f,INDIVIDUAL,3.0,0.0,"1076 Carney Fort Apt. 347\nLoganmouth, SD 05113"
2,15600.0,36 months,10.49,506.97,B,B3,Statistician,< 1 year,RENT,43057.0,...,13.0,0.0,11987.0,92.2,26.0,f,INDIVIDUAL,0.0,0.0,"87025 Mark Dale Apt. 269\nNew Sabrina, WV 05113"
3,7200.0,36 months,6.49,220.65,A,A2,Client Advocate,6 years,RENT,54000.0,...,6.0,0.0,5472.0,21.5,13.0,f,INDIVIDUAL,0.0,0.0,"823 Reid Ford\nDelacruzside, MA 00813"
4,24375.0,60 months,17.27,609.33,C,C5,Destiny Management Inc.,9 years,MORTGAGE,55000.0,...,13.0,0.0,24584.0,69.8,43.0,f,INDIVIDUAL,1.0,0.0,"679 Luna Roads\nGreggshire, VA 11650"


In [6]:
# loan_status is the label column: it tells whether the loan was fully paid or charged off
df.groupby('loan_status').count()['loan_amnt']

loan_status
Charged Off     77673
Fully Paid     318357
Name: loan_amnt, dtype: int64

As we can see, the majority of loans are fully paid. We are going to remove this column later and see how k-means groups the observations into two clusters.
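To put the imbalance in numbers, here is a small sketch that uses the two counts printed above. The `majority_baseline` name is only illustrative; it is the accuracy a trivial rule would get by always answering "Fully Paid", which is a useful yardstick for any later comparison against the labels.

```python
import pandas as pd

# Counts copied from the loan_status output above
counts = pd.Series({'Charged Off': 77673, 'Fully Paid': 318357})

# Share of each class in the full dataset
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a trivial rule that always answers the majority class
majority_baseline = shares.max()
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```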

In [7]:
# feature engineering - preparation of data
# we skip the explanation of these steps as we already covered them in classification
df = df.drop('emp_title',axis=1)
df = df.drop('emp_length',axis=1)
df = df.drop('title',axis=1)
# select mort_acc before averaging so that non-numeric columns are not aggregated
mean_total_acc = df.groupby('total_acc')['mort_acc'].mean()
def update_mort_acc(total_acc,mort_acc):
    if np.isnan(mort_acc):
        return mean_total_acc[total_acc]
    else:
        return mort_acc

df['mort_acc'] = df[['total_acc','mort_acc']].apply(lambda x:update_mort_acc(x['total_acc'],x['mort_acc']),axis=1)
df.dropna(inplace=True)
df['term'] = df['term'].apply(lambda x:str(x).strip().split()[0])
df['term'] = df.term.apply(lambda x:int(x))
df.drop('grade',axis=1,inplace=True)
df['issue_d'] = df.issue_d.apply(lambda x:x[-4:])
df['issue_d'] = df.issue_d.apply(lambda x:int(x))
df = df.drop('earliest_cr_line',axis=1)
df['address'] = df['address'].apply(lambda x:x[-5:])
df['address'] = df.address.apply(lambda x:int(x))
dummy_cols = ['sub_grade','home_ownership','verification_status','purpose','initial_list_status','application_type']
dummy_df = pd.get_dummies(df[dummy_cols],drop_first=True) 
df = pd.concat((df,dummy_df),axis=1)
df = df.drop(dummy_cols,axis=1)
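The `mort_acc` step above fills each missing value with the mean `mort_acc` of rows that share the same `total_acc`. As a hedged sketch of the same idea without a row-wise `apply`, the toy frame below (made-up values, not the loan data) uses `groupby(...).transform('mean')` together with `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'total_acc': [10, 10, 10, 20, 20],
    'mort_acc':  [1.0, 3.0, np.nan, 4.0, np.nan],
})

# Mean mort_acc per total_acc group, aligned row by row (NaN is skipped in the mean)
group_mean = toy.groupby('total_acc')['mort_acc'].transform('mean')

# Fill only the missing entries with their group mean
toy['mort_acc'] = toy['mort_acc'].fillna(group_mean)
print(toy)
```

This is usually much faster than `apply(..., axis=1)` on a frame with hundreds of thousands of rows, because the work is done column-wise.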

In [8]:
df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,annual_inc,issue_d,loan_status,dti,open_acc,pub_rec,...,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,initial_list_status_w,application_type_INDIVIDUAL,application_type_JOINT
0,10000.0,36,11.44,329.48,117000.0,2015,Fully Paid,26.24,16.0,0.0,...,0,0,0,0,0,1,0,1,1,0
1,8000.0,36,11.99,265.68,65000.0,2015,Fully Paid,22.05,17.0,0.0,...,0,0,0,0,0,0,0,0,1,0
2,15600.0,36,10.49,506.97,43057.0,2015,Fully Paid,12.79,13.0,0.0,...,0,0,0,0,0,0,0,0,1,0
3,7200.0,36,6.49,220.65,54000.0,2014,Fully Paid,2.6,6.0,0.0,...,0,0,0,0,0,0,0,0,1,0
4,24375.0,60,17.27,609.33,55000.0,2013,Charged Off,33.95,13.0,0.0,...,0,0,0,0,0,0,0,0,1,0


In [9]:
df.columns

Index(['loan_amnt', 'term', 'int_rate', 'installment', 'annual_inc', 'issue_d',
       'loan_status', 'dti', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
       'total_acc', 'mort_acc', 'pub_rec_bankruptcies', 'address',
       'sub_grade_A2', 'sub_grade_A3', 'sub_grade_A4', 'sub_grade_A5',
       'sub_grade_B1', 'sub_grade_B2', 'sub_grade_B3', 'sub_grade_B4',
       'sub_grade_B5', 'sub_grade_C1', 'sub_grade_C2', 'sub_grade_C3',
       'sub_grade_C4', 'sub_grade_C5', 'sub_grade_D1', 'sub_grade_D2',
       'sub_grade_D3', 'sub_grade_D4', 'sub_grade_D5', 'sub_grade_E1',
       'sub_grade_E2', 'sub_grade_E3', 'sub_grade_E4', 'sub_grade_E5',
       'sub_grade_F1', 'sub_grade_F2', 'sub_grade_F3', 'sub_grade_F4',
       'sub_grade_F5', 'sub_grade_G1', 'sub_grade_G2', 'sub_grade_G3',
       'sub_grade_G4', 'sub_grade_G5', 'home_ownership_MORTGAGE',
       'home_ownership_NONE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'verification_status_Source Verifi

In [10]:
# now we convert the label column into numbers
# we do not feed it to the model; we only use it afterwards for verification
df.loan_status = df.loan_status.apply(lambda x: 0 if x == 'Fully Paid' else 1 )

Now that we have done all the data processing, it's time to create the model.

This dataset is also large, with around 400,000 rows and many columns. If you are short of hardware resources, you can use a sample of the data to create the model.

In [11]:
df = df.sample(frac=0.9) # optional: subsample to reduce memory use and runtime (scikit-learn's KMeans runs on the CPU)
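`sample` draws a different subset on every run unless a seed is given, so the row counts and cluster sizes further down can vary between runs. A minimal sketch of a reproducible sample on a toy frame follows; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data
toy = pd.DataFrame({'x': np.arange(1000)})

# The same random_state gives the same rows every time
sample_a = toy.sample(frac=0.9, random_state=42)
sample_b = toy.sample(frac=0.9, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```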

In [12]:
print(len(df))

355697


In [13]:
df['loan_status'].value_counts()

0    286016
1     69681
Name: loan_status, dtype: int64

In [14]:
y = df['loan_status']
X = df.drop('loan_status',axis=1)

## Normalizing the data

In [15]:
scaler = MinMaxScaler()
# scaler = StandardScaler()

In [16]:
# fit_transform returns a scaled copy; the result is not assigned here, so X itself stays unscaled
scaler.fit_transform(X)

array([[0.12658228, 0.        , 0.22088041, ..., 0.        , 1.        ,
        0.        ],
       [0.11392405, 0.        , 0.19633814, ..., 0.        , 1.        ,
        0.        ],
       [0.18987342, 0.        , 0.66809505, ..., 1.        , 1.        ,
        0.        ],
       ...,
       [0.36708861, 1.        , 0.40085703, ..., 1.        , 1.        ,
        0.        ],
       [0.36708861, 1.        , 0.40085703, ..., 0.        , 1.        ,
        0.        ],
       [0.24050633, 1.        , 0.44604597, ..., 0.        , 1.        ,
        0.        ]])
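Note that `fit_transform` returns a new array and leaves its input untouched, so the scaled values are only available if the result is stored. The sketch below shows this on a toy frame; `toy` and `toy_scaled` are hypothetical names, not variables from this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration only
toy = pd.DataFrame({'loan_amnt': [1000.0, 5000.0, 9000.0],
                    'term': [36, 60, 36]})

scaler = MinMaxScaler()

# Keep the scaled values by assigning the result; wrap it to keep column names
toy_scaled = pd.DataFrame(scaler.fit_transform(toy),
                          columns=toy.columns, index=toy.index)

print(toy_scaled)
print(toy['loan_amnt'].max())  # the original frame is unchanged
```

K-means relies on Euclidean distance, so without scaling the columns with the largest numeric ranges (for example `annual_inc` or `revol_bal`) tend to dominate the clustering.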

## K Means

Let's divide the data into two clusters.

In [17]:
model = KMeans(n_clusters=2,max_iter=500)

In [18]:
model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=500,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
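For comparison, here is a hedged sketch of fitting `KMeans` on scaled features with a fixed `random_state`, so that repeated runs give the same clusters. It uses two synthetic blobs rather than the loan data; the blob parameters, the seed and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Two synthetic, well-separated groups; the second feature has a much larger range
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=[0.5, 500.0], size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5000.0], scale=[0.5, 500.0], size=(100, 2))
data = np.vstack([blob_a, blob_b])

# Scale every feature to [0, 1] and keep the result
data_scaled = MinMaxScaler().fit_transform(data)

# n_init and random_state are set explicitly so the fit is reproducible
model = KMeans(n_clusters=2, n_init=10, max_iter=500, random_state=0)
labels = model.fit_predict(data_scaled)

print(model.cluster_centers_)
print(np.bincount(labels))
```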

In [19]:
# coordinates of the two cluster centers (in the original, unscaled feature units)
model.cluster_centers_

array([[1.30000286e+04, 4.15222749e+01, 1.37369833e+01, 3.98954985e+02,
        6.06430272e+04, 2.01360935e+03, 1.79147576e+01, 1.10510777e+01,
        1.82944881e-01, 1.31889224e+04, 5.34047256e+01, 2.45813381e+01,
        1.59422234e+00, 1.29731221e-01, 3.41887297e+04, 2.25605402e-02,
        2.49886386e-02, 3.78627540e-02, 4.44458872e-02, 4.80523275e-02,
        5.63591508e-02, 6.78244498e-02, 6.60066221e-02, 5.73135104e-02,
        6.05856002e-02, 5.81639940e-02, 5.42848796e-02, 5.21521781e-02,
        4.67766020e-02, 4.15146400e-02, 3.65253522e-02, 3.17568006e-02,
        2.95818996e-02, 2.47549179e-02, 2.02168409e-02, 1.86684412e-02,
        1.56235798e-02, 1.36012465e-02, 1.15983899e-02, 8.91709407e-03,
        6.98565215e-03, 5.69694215e-03, 4.52509251e-03, 3.50905668e-03,
        2.57092774e-03, 1.88275011e-03, 1.40557034e-03, 9.02421606e-04,
        7.62838408e-04, 4.72946829e-01, 6.81685386e-05, 2.98643121e-04,
        9.74063494e-02, 4.29270272e-01, 3.23401285e-01, 3.424040

The model has now assigned every observation to cluster 0 or cluster 1.
Let's see what the assignments look like.

In [20]:
model.labels_

array([0, 0, 0, ..., 0, 0, 0])

In [21]:
model.labels_.shape

(355697,)

In [22]:
output = pd.DataFrame(model.labels_)

In [23]:
output.head()

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0


In [24]:
output[0].value_counts()

0    308021
1     47676
Name: 0, dtype: int64
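The cluster ids 0 and 1 are arbitrary: k-means has no notion of "Fully Paid" or "Charged Off", so cluster 1 is not guaranteed to correspond to label 1. One way to compare a clustering with known labels without depending on the ids is the adjusted Rand index. The sketch below uses toy lists, not the loan data, to show that swapping the ids does not change the score.

```python
from sklearn.metrics import adjusted_rand_score

# Toy example: the clustering matches the truth exactly, but with swapped ids
truth = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 1, 0, 0, 0]
flipped = [1 - c for c in clusters]

# Both calls give the same score because only the grouping matters
print(adjusted_rand_score(truth, clusters))
print(adjusted_rand_score(truth, flipped))
```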

In [26]:
print(confusion_matrix(y,output))
print(classification_report(y,output))

[[245285  40731]
 [ 62736   6945]]
              precision    recall  f1-score   support

           0       0.80      0.86      0.83    286016
           1       0.15      0.10      0.12     69681

    accuracy                           0.71    355697
   macro avg       0.47      0.48      0.47    355697
weighted avg       0.67      0.71      0.69    355697



The results above show that cluster 0 lines up fairly well with class 0 (fully paid loans), but cluster 1 captures few of the class 1 (charged off) loans: precision is 0.15 and recall is 0.10.

However, please note that this is an unsupervised model: it grouped the data by recognizing patterns in the features, and it was never given labeled training data to learn from. So, in a way, the model is doing reasonably well, although its clusters should not be read as loan-status predictions.
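As a rough check on these numbers, the sketch below recomputes the accuracy from the confusion matrix printed above, both with the cluster ids as they are and with the ids swapped, and compares it with the share of the majority class. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true label, columns: cluster id)
cm = np.array([[245285, 40731],
               [62736,   6945]])
total = cm.sum()

# Accuracy if cluster 0 means label 0 and cluster 1 means label 1
acc_as_is = np.trace(cm) / total

# Accuracy if the cluster ids are swapped
acc_flipped = (cm[0, 1] + cm[1, 0]) / total

# Accuracy of always answering the majority class (label 0)
majority = cm.sum(axis=1).max() / total

print(f"as is: {acc_as_is:.3f}, flipped: {acc_flipped:.3f}, majority: {majority:.3f}")
```

The mapping used in the report is the better of the two, but its accuracy stays below the majority-class share, which is worth keeping in mind when reading the 0.71 figure.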