# Exercise - Neural Networks using Keras

The data set for this exercise comes from the banking industry. It contains data on the home loans of 2,500 bank clients. Each row represents a single loan, and the columns describe the client who took out the loan. This is a binary classification task: predict whether a loan will go bad (1=Yes, 0=No). Banks care about this prediction because it helps them avoid issuing bad loans.

## Description of Variables

The variables are described in "Loan - Data Dictionary.docx".

## Goal

Use the **loan.csv** data set and build a model to predict **BAD**. 

Since you have a relatively small data set, I recommend using cross-validation to evaluate your accuracy.
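
The cross-validation recommended above could look roughly like the sketch below. It is an illustration rather than the exercise's solution: the data is synthetic (from `make_classification`) instead of the loan data, and scikit-learn's `MLPClassifier` is used as a lightweight stand-in for a Keras network. The fold count and layer size are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared loan inputs and the BAD target (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Scaling lives inside the pipeline, so each fold is scaled using only its own training part
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=42),
)

# Stratified folds keep the share of each class similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {np.round(scores, 3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean and spread across folds gives a steadier estimate than a single train/test split when the data set is small.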

# Read and Prepare the Data

In [1]:
# Common imports

import pandas as pd
import numpy as np

np.random.seed(42)

# Get the data

In [2]:
# We will predict the "BAD" value in the data set:

loan = pd.read_csv("loan_keras.csv")
loan.head()

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,0,25900,61064.0,94714.0,DebtCon,Office,2.0,0.0,0.0,98.809375,0.0,23.0,34.565944
1,0,26100,113266.0,182082.0,DebtCon,Sales,18.0,0.0,0.0,304.852469,1.0,31.0,33.193949
2,1,50000,220528.0,300900.0,HomeImp,Self,5.0,0.0,0.0,0.0,0.0,2.0,
3,1,22400,51470.0,68139.0,DebtCon,Mgr,9.0,0.0,0.0,31.168696,2.0,8.0,37.95218
4,0,20900,62615.0,87904.0,DebtCon,Office,5.0,,,177.864849,,15.0,36.831076


# Split data (train/test)

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(loan, test_size=0.2)

# Data Prep

Perform your data prep here. You can use pipelines as we do in the tutorials, or use your own data prep steps. At a minimum, you should:
- Separate inputs from target
- Impute/remove missing values
- Standardize the continuous variables
- One-hot encode categorical variables

In [16]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

## Separate the target variable 

In [5]:
train_target = train['BAD']
test_target = test['BAD']

train_inputs = train.drop(['BAD'], axis=1)
test_inputs = test.drop(['BAD'], axis=1)

## Feature Engineering: Derive a new column

Examples:
- Ratio of delinquent to total number of credit lines
- Ratio of loan to value of current property
- Convert DEROG to a binary variable (i.e., any derogatory report or not)
- (etc.)
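
As an illustration of the second example in the list, here is a minimal sketch of a loan-to-value ratio. The function name `loan_to_value` and the toy rows are assumptions made for this example. The first two rows reuse LOAN and VALUE figures from the preview above, and the third row is invented to show what happens when VALUE is 0.

```python
import numpy as np
import pandas as pd

def loan_to_value(df):
    # Work on a copy so the caller's dataframe is not modified
    out = df.copy()
    ratio = out["LOAN"] / out["VALUE"]
    # Division by zero yields inf; mark it as missing so an imputer can handle it later
    out["loan_to_value"] = ratio.replace([np.inf, -np.inf], np.nan)
    return out[["loan_to_value"]]

toy = pd.DataFrame({
    "LOAN": [25900, 50000, 22400],
    "VALUE": [94714.0, 300900.0, 0.0],  # last value is made up to trigger division by zero
})

ltv = loan_to_value(toy)
print(ltv)
```

Treating the undefined ratios as missing, rather than filling them with a fixed number, is one possible design choice. It leaves the decision to whatever imputation step follows.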

In [6]:
def new_col(df):
    
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()

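    # Ratio of delinquent credit lines to total credit lines.
    # fillna(0) covers 0/0 and rows where DELINQ or CLNO is missing (both produce "nan" values)
    df1['deliq_ratio'] = (df1['DELINQ']/df1['CLNO']).fillna(0)

    # Replace the infinity values with 1 (because a nonzero value divided by 0 generates infinity).
    # Assign the result back instead of using inplace=True on a column selection,
    # which newer pandas versions may not apply to df1.
    df1['deliq_ratio'] = df1['deliq_ratio'].replace(np.inf, 1)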

    return df1[['deliq_ratio']]

In [8]:
new_col(train)

Unnamed: 0,deliq_ratio
2055,0.000000
1961,0.300000
1864,0.041667
2326,0.000000
461,0.000000
...,...
1638,0.125000
1095,0.181818
1130,0.000000
1294,0.000000


## Identify the numeric and categorical columns

In [9]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [10]:
numeric_columns

['LOAN',
 'MORTDUE',
 'VALUE',
 'YOJ',
 'DEROG',
 'DELINQ',
 'CLAGE',
 'NINQ',
 'CLNO',
 'DEBTINC']

In [11]:
categorical_columns

['REASON', 'JOB']

In [12]:
feat_eng_columns = ['DELINQ', 'CLNO']

# Pipeline

In [13]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [14]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [17]:
# Create a pipeline for the transformed column here

my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col)),
                               ('scaler', StandardScaler())])


In [18]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('trans', my_new_column, feat_eng_columns)],   
        remainder='passthrough')

# remainder='passthrough' is optional. You don't have to use it.

# Transform: fit_transform() for TRAIN

In [19]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[ 1.04111078,  0.6878524 ,  0.85091391, ...,  0.        ,
         0.        , -0.42081285],
       [-0.28532644, -0.52374577, -0.00483295, ...,  0.        ,
         0.        ,  3.61741439],
       [-0.49898075, -0.60339785, -0.57929256, ...,  0.        ,
         0.        ,  0.14005205],
       ...,
       [-0.20520607, -0.8316353 , -0.79774855, ...,  0.        ,
         0.        , -0.42081285],
       [-0.45446944,  1.82868782,  1.33923641, ...,  0.        ,
         0.        , -0.42081285],
       [-0.30313096, -0.0781773 , -0.20951181, ...,  0.        ,
         0.        , -0.42081285]])

In [20]:
train_x.shape

(2000, 21)

# Transform: transform() for TEST

In [21]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 0.07966635,  0.38294162,  0.31640967, ...,  0.        ,
         0.        , -0.42081285],
       [ 0.33783199,  0.59525334,  0.4202283 , ...,  1.        ,
         0.        , -0.42081285],
       [-0.32093549,  0.42863133,  0.12227435, ...,  0.        ,
         0.        , -0.42081285],
       ...,
       [-0.58800339, -0.65204285, -0.73238892, ...,  0.        ,
         0.        , -0.42081285],
       [-0.82836449, -0.33280126, -0.49771342, ...,  0.        ,
         0.        , -0.42081285],
       [ 0.69392251,  0.9629992 ,  1.3954629 , ...,  0.        ,
         0.        , -0.42081285]])

In [22]:
test_x.shape

(500, 21)
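
The reason for calling `fit_transform()` on TRAIN but only `transform()` on TEST can be seen in a small sketch. The numbers below are made up. The point is that the scaler learns its mean and standard deviation from the training column alone and then reuses them on the test column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data for illustration
train_col = np.array([[10.0], [20.0], [30.0]])
test_col = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_col)  # learns mean and std from train only
test_scaled = scaler.transform(test_col)        # reuses the train mean and std

print("Learned mean:", scaler.mean_)
print("Scaled train:", train_scaled.ravel())
print("Scaled test: ", test_scaled.ravel())
```

The scaled test values are not centered on zero, which is expected: the test set is measured against the training distribution, so no information from the test set leaks into the preprocessing.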

# Calculate the Baseline

In [24]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_target)

In [25]:
from sklearn.metrics import accuracy_score

In [26]:
#Baseline Train Accuracy
dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_target, dummy_train_pred)

print('Baseline Train Accuracy: {}' .format(baseline_train_acc))

#Baseline Test Accuracy
dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_target, dummy_test_pred)

print('Baseline Test Accuracy: {}' .format(baseline_test_acc))

Baseline Train Accuracy: 0.5995
Baseline Test Accuracy: 0.58
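
With `strategy="most_frequent"`, the dummy classifier always predicts the majority class, so its accuracy equals the share of that class in the data it is scored on. The train accuracy of 0.5995 therefore means that about 60% of the training loans belong to the majority class. A small sketch with hypothetical labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 loans of class 0 and 4 loans of class 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X = np.zeros((len(y), 1))  # the dummy model ignores the inputs

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, dummy.predict(X))

# Share of the most common class, computed directly from the labels
majority_share = np.bincount(y).max() / len(y)

print(f"Dummy accuracy: {acc}, majority share: {majority_share}")
```

Any model worth keeping should beat this number on the test set.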


# Train a shallow (one-layer) Keras model

In [63]:
import tensorflow as tf
from tensorflow import keras

# fix random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

In [64]:
train_x.shape

(2000, 21)

In [65]:
#Define the model: for binary classification

model = keras.models.Sequential()

model.add(keras.layers.Input(shape=(21,)))
model.add(keras.layers.Dense(21, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
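
To make the shallow architecture concrete, the following NumPy sketch mimics its forward pass and counts its parameters. The weights are random placeholders, not the trained Keras weights, and the five input rows are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_hidden = 21, 21

# Random placeholder weights (assumption); Keras learns the real ones during fit()
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)  # Dense(21, activation='relu')
    logit = hidden @ W2 + b2               # Dense(1) before the activation
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid: probability that BAD = 1

x = rng.normal(size=(5, n_inputs))         # 5 made-up standardized rows
probs = forward(x)

# (21*21 + 21) hidden-layer parameters plus (21 + 1) output-layer parameters
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)
```

The single sigmoid output is what makes this a binary classifier: each row gets one probability, which Keras thresholds at 0.5 when it computes accuracy.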



In [66]:
# Compile model

#Optimizer:
adam = keras.optimizers.Adam(learning_rate=0.01)

model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
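
The `binary_crossentropy` loss used above is the average of `-(y*log(p) + (1-y)*log(1-p))` over the rows, where `p` is the predicted probability of BAD = 1. The NumPy sketch below is a simplified illustration with made-up predictions. The clipping constant `eps` is an assumption for numerical safety, not necessarily the value Keras uses.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip probabilities away from 0 and 1 so log() stays finite
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_row = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_row))

y_true = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])  # confident and incorrect

loss_right = binary_crossentropy(y_true, mostly_right)
loss_wrong = binary_crossentropy(y_true, mostly_wrong)

print(f"Loss when mostly right: {loss_right:.4f}")
print(f"Loss when mostly wrong: {loss_wrong:.4f}")
```

Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded, which is why the loss can rise on the test set even when accuracy barely changes.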

## Fit and evaluate the shallow model

In [67]:
# Fit the model

history = model.fit(train_x, train_target, 
                    validation_data=(test_x, test_target), 
                    epochs=20, batch_size=100)

Epoch 1/20
...
Epoch 20/20


In [68]:
# Train values

train_scores = model.evaluate(train_x, train_target, verbose=0)

train_scores

# In results, first is loss, second is accuracy

[0.38676244020462036, 0.8270000219345093]

In [69]:
# Print the values

print(f"Train {model.metrics_names[0]}: {train_scores[0]:.2f}")

print(f"Train {model.metrics_names[1]}: {train_scores[1]*100:.2f}%")


Train loss: 0.39
Train accuracy: 82.70%


In [70]:
# Test values

test_scores = model.evaluate(test_x, test_target, verbose=0)

test_scores

# In results, first is loss, second is accuracy

[0.4245983958244324, 0.8119999766349792]

In [71]:
# Print the values

print(f"Test {model.metrics_names[0]}: {test_scores[0]:.2f}")

print(f"Test {model.metrics_names[1]}: {test_scores[1]*100:.2f}%")

Test loss: 0.42
Test accuracy: 81.20%


# Optional: try different activation functions, optimizers, or configurations (such as wide and deep) to build other models

In [52]:
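# Select the first two columns of train_x: the standardized LOAN and MORTDUE
# (the first two entries of numeric_columns). They feed the "wide" path of the model.
# Note: the "lon_lat" names are carried over from a tutorial; this data has no longitude or latitude.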

train_lon_lat = train_x[:,:2]

train_lon_lat

array([[ 1.04111078,  0.6878524 ],
       [-0.28532644, -0.52374577],
       [-0.49898075, -0.60339785],
       ...,
       [-0.20520607, -0.8316353 ],
       [-0.45446944,  1.82868782],
       [-0.30313096, -0.0781773 ]])

In [53]:
test_lon_lat = test_x[:,:2]

In [54]:
# Functional API: one input for the wide path (2 columns), one for the deep path (all 21 columns)
input1 = keras.layers.Input(shape=(2,))
input2 = keras.layers.Input(shape=(21,))

hidden1 = keras.layers.Dense(21, activation='relu')(input2)
hidden2 = keras.layers.Dense(21, activation='relu')(hidden1)
hidden3 = keras.layers.Dense(21, activation='relu')(hidden2)

concat = keras.layers.Concatenate()([input1, hidden3])

#final layer: a single node with sigmoid (binary classification: bad loan or not)
output = keras.layers.Dense(1, activation='sigmoid')(concat)

model = keras.Model(inputs =[input1, input2], outputs = output)
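
The shapes flowing through this wide-and-deep model can be traced with a small NumPy sketch. It uses random placeholder weights and a made-up batch of 4 rows, with biases omitted for brevity. It only illustrates how the wide input bypasses the hidden layers and joins the deep path just before the output node.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4

wide_in = rng.normal(size=(batch, 2))    # plays the role of input1
deep_in = rng.normal(size=(batch, 21))   # plays the role of input2

def dense_relu(x, n_out):
    # Random placeholder weights (assumption); biases left out for brevity
    W = rng.normal(size=(x.shape[1], n_out))
    return np.maximum(0.0, x @ W)

# Three hidden layers of 21 units each on the deep path
h = deep_in
for _ in range(3):
    h = dense_relu(h, 21)

# The wide input skips the hidden layers and is concatenated with the last hidden output
concat = np.concatenate([wide_in, h], axis=1)

# The sigmoid output node sees 2 + 21 = 23 features, so it has 23 weights + 1 bias
output_params = concat.shape[1] + 1
print(concat.shape, output_params)
```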

In [55]:
# Compile model

#Optimizer:
adam = keras.optimizers.Adam(learning_rate=0.01)

model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

In [56]:
# Fit the model

history = model.fit((train_lon_lat, train_x), train_target, 
                    validation_data=((test_lon_lat, test_x), test_target), 
                    epochs=20, batch_size=100)

Epoch 1/20
...
Epoch 20/20


In [57]:
# Train values

train_scores = model.evaluate((train_lon_lat, train_x), train_target, verbose=0)

train_scores

# In results, first is loss, second is accuracy

[0.21594728529453278, 0.9150000214576721]

In [58]:
# Print the values

print(f"Train {model.metrics_names[0]}: {train_scores[0]:.2f}")

print(f"Train {model.metrics_names[1]}: {train_scores[1]*100:.2f}%")

Train loss: 0.22
Train accuracy: 91.50%


In [59]:
# Test values

test_scores = model.evaluate((test_lon_lat, test_x), test_target, verbose=0)

test_scores

# In results, first is loss, second is accuracy

[0.48974159359931946, 0.8080000281333923]

In [60]:
# Print the values

print(f"Test {model.metrics_names[0]}: {test_scores[0]:.2f}")

print(f"Test {model.metrics_names[1]}: {test_scores[1]*100:.2f}%")


Test loss: 0.49
Test accuracy: 80.80%
