# Exercise - Autoencoder

The data set for this exercise comes from the banking industry. It contains data on the home loans of 2,500 bank clients. Each row represents a single loan, and the columns describe the characteristics of the client who took out the loan. We will build an autoencoder to learn the representation of good loans. That way, we can judge whether an unknown loan is likely to be a good or bad loan based on its reconstruction error.

Note: in the data set, we don't have a column that indicates whether a loan is "good" or "bad". So, we can't train a classification model like we did before.

## Description of Variables

The descriptions of the variables are provided in "Loan - Data Dictionary.docx".

# Read and Prepare the Data

In [1]:
# Common imports

import pandas as pd
import numpy as np

np.random.seed(42)

# Get the data

In [2]:
good = pd.read_csv('good loans.csv')

unknown = pd.read_csv('unknown loans.csv')

In [3]:
# Note that there is no GOOD/BAD classification column. However, we know that
# these loans are good

good.head()

,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,25900,61064.0,94714.0,DebtCon,Office,2.0,0.0,0.0,98.809375,0.0,23.0,34.565944
1,26100,113266.0,182082.0,DebtCon,Sales,18.0,0.0,0.0,304.852469,1.0,31.0,33.193949
2,20900,62615.0,87904.0,DebtCon,Office,5.0,,,177.864849,,15.0,36.831076
3,25300,62540.0,101165.0,DebtCon,ProfExe,0.0,0.0,0.0,195.451331,0.0,25.0,35.200865
4,27700,73148.0,101462.0,DebtCon,ProfExe,10.0,0.0,0.0,264.605389,0.0,33.0,40.475793


In [4]:
good.shape

(1489, 12)

In [5]:
unknown.shape

(1011, 12)

# Data Prep

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

## Identify the numeric and categorical columns

In [7]:
# Identify the numerical columns
numeric_columns = good.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = good.select_dtypes('object').columns.to_list()

In [8]:
numeric_columns

['LOAN',
 'MORTDUE',
 'VALUE',
 'YOJ',
 'DEROG',
 'DELINQ',
 'CLAGE',
 'NINQ',
 'CLNO',
 'DEBTINC']

In [9]:
categorical_columns

['REASON', 'JOB']

# Pipeline

In [10]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [11]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [12]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)],   
        remainder='passthrough')

# remainder='passthrough' is optional. Every column is already assigned to a
# transformer above, so there are no remaining columns to pass through.

# Transform: fit_transform() for GOOD

In [13]:
#Fit and transform the train data
good_x = preprocessor.fit_transform(good)

good_x

array([[ 0.6365782 , -0.29793911, -0.12177109, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.6546228 ,  1.02893596,  1.62034212, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.18546313, -0.25851565, -0.25756212, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 4.18234262, -0.04383447,  0.81444926, ...,  0.        ,
         0.        ,  0.        ],
       [-1.30321659, -0.47828045, -0.73681873, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.03208401, -0.46521555, -0.38146895, ...,  0.        ,
         0.        ,  0.        ]])

In [14]:
good_x.shape

(1489, 20)
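
The output is wider than the input (20 columns instead of 12) because each numeric column maps to one scaled column, while each categorical column expands into one column per distinct category. With 10 numeric columns, the remaining 10 output columns come from one-hot encoding `REASON` and `JOB`. The count can be sketched in plain Python as follows. The toy values and the helper `encoded_width` are hypothetical and only illustrate the arithmetic; they are not read from the loan files.

```python
# Hypothetical toy columns; the real category values live in 'good loans.csv'.
toy_numeric = ["LOAN", "YOJ"]
toy_categorical = {
    "REASON": ["DebtCon", "HomeImp", None, "DebtCon"],
    "JOB": ["Office", "Sales", "Office", None],
}

def encoded_width(numeric_names, categorical_values, fill_value="unknown"):
    """Count output columns: one per numeric column plus one per distinct category."""
    width = len(numeric_names)
    for values in categorical_values.values():
        # Missing entries are replaced by the fill value before one-hot encoding,
        # so the fill value becomes a category of its own when anything is missing.
        filled = [fill_value if v is None else v for v in values]
        width += len(set(filled))
    return width

toy_width = encoded_width(toy_numeric, toy_categorical)
print(toy_width)  # 2 numeric + 3 REASON categories + 3 JOB categories = 8
```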

# Transform: transform() for UNKNOWN

In [15]:
# Transform the test data
unknown_x = preprocessor.transform(unknown)

unknown_x

array([[ 2.81095281,  3.755331  ,  3.98956663, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.32079765, -0.54180025, -0.65167516, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.19448543,  0.59006724,  0.48213012, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [-0.79796772,  0.05628738,  0.05541527, ...,  0.        ,
         0.        ,  0.        ],
       [-0.16640662, -0.510536  , -0.45504735, ...,  0.        ,
         0.        ,  0.        ],
       [-0.81601232, -0.92230904, -0.93360606, ...,  0.        ,
         0.        ,  0.        ]])

In [16]:
unknown_x.shape

(1011, 20)
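
The reason for calling `fit_transform()` on the good loans but only `transform()` on the unknown loans is that the imputation and scaling statistics should describe good loans alone. The unknown loans are then measured against that baseline. A minimal NumPy sketch of the idea for a single numeric column, using made-up toy values rather than the loan data:

```python
import numpy as np

# Hypothetical toy values for a single numeric column (not taken from the loan files).
toy_good = np.array([[10.0], [20.0], [30.0]])
toy_unknown = np.array([[20.0], [50.0]])

# "fit": learn the statistics from the good loans only.
mean = toy_good.mean(axis=0)
std = toy_good.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# "transform": apply those same statistics to both data sets.
toy_good_scaled = (toy_good - mean) / std
toy_unknown_scaled = (toy_unknown - mean) / std

print(toy_good_scaled.ravel())
print(toy_unknown_scaled.ravel())
```

The good values end up centered on zero, while the unknown value 50 lands more than three standard deviations away. That distance is exactly the kind of signal the autoencoder later turns into a high reconstruction error.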

# Build an Autoencoder to learn the representation of GOOD loans

In [17]:
import tensorflow as tf
from tensorflow import keras

model = keras.models.Sequential()

#Encoder
model.add(keras.layers.InputLayer(input_shape=(20,)))
model.add(keras.layers.Dense(15, activation='relu')) 
model.add(keras.layers.Dense(10, activation='relu')) 
model.add(keras.layers.Dense(5, activation='relu')) 

#Decoder
model.add(keras.layers.Dense(10, activation='relu')) 
model.add(keras.layers.Dense(20, activation=None)) 

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 15)                315       
                                                                 
 dense_1 (Dense)             (None, 10)                160       
                                                                 
 dense_2 (Dense)             (None, 5)                 55        
                                                                 
 dense_3 (Dense)             (None, 10)                60        
                                                                 
 dense_4 (Dense)             (None, 20)                220       
                                                                 
Total params: 810
Trainable params: 810
Non-trainable params: 0
_________________________________________________________________
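
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight for every input-output pair plus one bias per output. A short sketch of that arithmetic, with the layer widths read off the model definition above:

```python
# Layer widths read off the model definition: 20 inputs, then the Dense layers.
layer_sizes = [20, 15, 10, 5, 10, 20]

# A Dense layer has one weight per (input, output) pair plus one bias per output.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(params_per_layer)       # [315, 160, 55, 60, 220]
print(sum(params_per_layer))  # 810
```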


In [18]:
# The model is compiled with Nadam; passing an optimizer object makes the
# learning rate explicit
nadam = keras.optimizers.Nadam(learning_rate=0.001)

model.compile(loss='mse', optimizer=nadam, metrics=['mean_squared_error'])

In [19]:
from tensorflow.keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')

callback = [earlystop]
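
`EarlyStopping` monitors `val_loss`, so it is most informative when the validation rows are held out from training. The `fit` call in this notebook passes `good_x` as both training and validation data, which means `val_loss` simply tracks the training loss. One possible alternative, shown only as a sketch, is to set aside part of the good loans for validation. The helper `holdout_split` and its 20% fraction are assumptions, not part of the exercise, and using it would change the results reported in this notebook.

```python
import numpy as np

def holdout_split(x, val_fraction=0.2, seed=42):
    """Shuffle the rows of x and split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = int(round(len(x) * val_fraction))
    return x[order[n_val:]], x[order[:n_val]]

# Stand-in array with the same shape as good_x (1489 rows, 20 columns).
toy_x = np.zeros((1489, 20))
train_x, val_x = holdout_split(toy_x)
print(train_x.shape, val_x.shape)  # (1191, 20) (298, 20)

# With the real data, the fit call would then look like:
# model.fit(train_x, train_x, validation_data=(val_x, val_x),
#           epochs=100, batch_size=100, callbacks=callback)
```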

In [20]:
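# Note: the validation data here is the training data itself, so val_loss
# mirrors the training loss rather than measuring generalization
model.fit(good_x, good_x,
          validation_data=(good_x, good_x),
          epochs=100, batch_size=100, callbacks=callback)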



Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x28f5151fa30>

### Check the average MSE on the "good" loans

In [21]:
model.evaluate(good_x, good_x)




[0.22123031318187714, 0.22123031318187714]

In [22]:
model.evaluate(good_x, good_x)[0]*100



22.123031318187714

### Check the average MSE on the "unknown" data

In [23]:
model.evaluate(unknown_x, unknown_x)



[0.9636467099189758, 0.9636467099189758]

In [24]:
model.evaluate(unknown_x, unknown_x)[0]*100



96.36467099189758

### Do you think the "unknown" loans look like good loans or not? (Justify your answer using the interpretation of the average MSE values.)

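The average reconstruction MSE on the unknown loans (0.96) is about 4.4 times the average on the good loans (0.22). The autoencoder reconstructs the unknown loans much less accurately, so as a group they do not look like the good loans it was trained on.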
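
The comparison can be made concrete by dividing the two averages returned by `model.evaluate` above. This is plain arithmetic on the reported values:

```python
# Average MSE values reported by model.evaluate above.
mse_good = 0.22123031318187714
mse_unknown = 0.9636467099189758

ratio = mse_unknown / mse_good
print(round(ratio, 2))  # 4.36
```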

In [25]:
# Looping through the first 20 good loans
from sklearn.metrics import mean_squared_error

for i in range(0,20):
    prediction = model.predict(good_x[i:i+1])
    print((mean_squared_error(good_x[i:i+1], prediction))*100)
    
# The mse * 100 of the first 20 good loan records falls between 5.36 and 33.80

17.455050077189586
10.650773754125057
9.952228323513044
17.84088738788164
12.590331565501794
11.979907710271712
9.829080001328835
28.148115168493636
15.981035875707677
32.68079871232332
13.978573365138857
12.829460901553125
5.3589697420422375
9.817235501719045
6.5489086952886115
10.010341167340375
13.158937294624979
33.797345494992605
14.965246757215672
16.48252591004031
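
The loop above calls `model.predict` once per row. The same per-loan errors can be computed from a single `predict` call by averaging the squared differences across the feature axis. The sketch below assumes only that the model has a `predict` method; `ZeroModel` is a hypothetical stand-in so the example runs without the trained autoencoder.

```python
import numpy as np

def per_row_mse(model, x):
    """Reconstruction MSE of every row of x, from a single predict call."""
    reconstruction = model.predict(x)
    return np.mean((x - reconstruction) ** 2, axis=1)

class ZeroModel:
    """Hypothetical stand-in for the trained autoencoder: reconstructs every row as zeros."""
    def predict(self, x):
        return np.zeros_like(x)

toy_x = np.array([[1.0, 1.0], [3.0, 1.0]])
toy_errors = per_row_mse(ZeroModel(), toy_x)
print(toy_errors)  # row 0: (1 + 1) / 2 = 1.0, row 1: (9 + 1) / 2 = 5.0
```

With the trained autoencoder, `per_row_mse(model, good_x[:20]) * 100` should match the values printed by the loop, up to floating-point rounding.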


In [26]:
# Looping through the first 20 unknown loans
for i in range(0,20):
    prediction = model.predict(unknown_x[i:i+1])
    print((mean_squared_error(unknown_x[i:i+1], prediction))*100)
    
# The mse * 100 of the first 20 unknown loan records falls between 9.80 and 281.36.

58.67682553182565
9.797934008367273
27.573497593325996
31.787245179585238
88.24448396377178
42.67978783164765
281.35755138776346
19.526858583604675
218.03554483582707
272.22413295337316
17.772180074850745
66.171864629121
96.02279048666354
25.773518136734353
98.20152788876045
19.2215885014464
100.00389215613245
30.092340359791425
17.74084824372996
29.096293237538422


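The per-loan MSE is generally much larger in the unknown loan file. Among the first 20 loans of each file, every good loan scores below 34 (MSE × 100), while half of the unknown loans score above that, some as high as 281. This increases the likelihood that many of these loans are not "good" and have a higher chance of default.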
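
One way to turn this observation into a decision rule is to flag any loan whose error exceeds a threshold derived from the good loans. The sketch below uses the 20 values printed above for each file, rounded to two decimals. The threshold choice (the largest error among the sampled good loans) is an assumption made for illustration, not a rule given in the exercise.

```python
# MSE * 100 for the first 20 loans of each file, rounded from the outputs above.
good_errors = [17.46, 10.65, 9.95, 17.84, 12.59, 11.98, 9.83, 28.15, 15.98, 32.68,
               13.98, 12.83, 5.36, 9.82, 6.55, 10.01, 13.16, 33.80, 14.97, 16.48]
unknown_errors = [58.68, 9.80, 27.57, 31.79, 88.24, 42.68, 281.36, 19.53, 218.04, 272.22,
                  17.77, 66.17, 96.02, 25.77, 98.20, 19.22, 100.00, 30.09, 17.74, 29.10]

# Assumed rule for illustration: flag a loan when its error exceeds the
# largest error seen among the sampled good loans.
threshold = max(good_errors)
flagged = [e for e in unknown_errors if e > threshold]

print(threshold)     # 33.8
print(len(flagged))  # 10
```

A percentile of the good-loan errors over the full data set would likely be a more robust threshold than the maximum of a 20-loan sample.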