# Lab: Classification for Prosper Loan Dataset

We are going to build a classifier on the Prosper loan dataset, which contains a history of loans made by Prosper.

In [None]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras
print ('tensorflow version:', tf.__version__)
print ('devices found:', tf.config.experimental.list_physical_devices())

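## TF-GPU Debug
The following block reduces TF log noise and asks TF to allocate GPU memory as needed. It also turns on device placement logging, so the output shows whether operations are running on the GPU.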

In [None]:
## This block is to tweak TF running on GPU
## You may comment this out, if you are not using GPU


import os, sys

## disable info logs from TF
#   Level | Level for Humans | Level Description                  
#  -------|------------------|------------------------------------ 
#   0     | DEBUG            | [Default] Print all messages       
#   1     | INFO             | Filter out INFO messages           
#   2     | WARNING          | Filter out INFO & WARNING messages 
#   3     | ERROR            | Filter out all messages 

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # or any of {'0', '1', '2', '3'}
tf.get_logger().setLevel('WARN')


## ---- start Memory setting ----
## Ask TF not to allocate all GPU memory at once.. allocate as needed
## Without this the execution will fail with "failed to initialize algorithm" error

from tensorflow.compat.v1.keras.backend import set_session
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
config.log_device_placement = True  # to log device placement (on which device the operation ran)
sess = tf.compat.v1.Session(config=config)
set_session(sess)
## ---- end Memory setting ----


## Step 1: Load the Data

Notice we are first loading this into a Pandas dataframe. This is fine for a small dataset, but we will need more than this for a large "at scale" notebook.
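For files too large to fit in memory, one option is to stream the CSV in chunks with the `chunksize` parameter of `pd.read_csv` and aggregate as you go. The sketch below is not part of the lab itself. The tiny in-memory CSV and the chunk size of 2 are made up for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for a large CSV file (illustration only)
csv_text = "LoanStatus,Term\n1,36\n0,60\n1,36\n1,12\n0,36\n"

# Read two rows at a time and accumulate label counts,
# so the full dataset never has to sit in memory at once
status_counts = Counter()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    status_counts.update(chunk["LoanStatus"].value_counts().to_dict())

print(dict(status_counts))
```

The same loop shape works for any aggregate that can be built up incrementally, such as counts, sums, or running minimum and maximum.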

In [None]:
import os 
import pandas as pd

## small file, start with this
data_location_local = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data-sample.csv"
data_location = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data-sample.csv"

## this is a large file
# data_location_local = "../data/prosper-loan/prosper-loan-data.csv.gz"
# data_location = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data.csv.gz"

## access the file
if not os.path.exists (data_location_local):
    data_location_local = keras.utils.get_file(fname=os.path.basename(data_location),
                                           origin=data_location)
print ('data_location_local:', data_location_local)
data = pd.read_csv(data_location_local)
data

## Step 2 : Explore Data

In [None]:
prosper_clean = data.dropna()

print("Original record count {:,}, cleaned records count {:,},  dropped {:,}"\
      .format(len(data), len(prosper_clean), 
              (len(data) - len(prosper_clean))))
prosper_clean

In [None]:
## TODO : do you see data skew?
## LoanStatus is what we are trying to predict

print(prosper_clean['LoanStatus'].value_counts())

# 1 - paid
# 0 - defaulted

In [None]:
print(prosper_clean['EmploymentStatus'].value_counts())


In [None]:
print(prosper_clean['ListingCategory'].value_counts())

## Step 3 - Shape Data

### 3.1 - Select Columns to consider

In [None]:
## categorical columns : These columns need to be encoded 
categorical_columns = ['ListingCategory', 'BorrowerState','EmploymentStatus']
label_column = ['LoanStatus']

## numeric columns : these columns will be scaled

## Approach 1: We can manually define these columns

## TODO : Add 'CreditScore' and 'YearsWithCredit' to this
numeric_colums = ['Term', 'BorrowerRate', 'ProsperRating (numeric)', 'ProsperScore', 'EmploymentStatusDuration', 
                   'CurrentCreditLines', 'OpenCreditLines',  '???', '???'
                 ]

## Approach 2 : include every thing but categorical and label
## TODO Later:  Once you have a baseline benchmark, just throw in all numeric columns and see if it gives better results

# numeric_colums = [c for c in prosper_clean.columns if c not in categorical_columns + label_column]

input_columns = categorical_columns + numeric_colums

print ('categorical columns: ', categorical_columns)
print ()
print ('numeric columns: ', numeric_colums)
print ()
print ("label column : ", label_column)

In [None]:
print ("selected data:")
prosper_clean[label_column + input_columns]

### 3.2 - Encode Categorical Data

**Categorical data** takes one of a limited, and usually fixed, number of possible values. Each value assigns the observation to a group or nominal category based on some qualitative property.

Here we use **OneHotEncoder** to encode the three categorical columns (*'ListingCategory', 'BorrowerState', and 'EmploymentStatus'*) into numerical data that our model can use. One-hot encoding represents each categorical value as a binary vector: one column per category, with a 1 in the matching column and 0 everywhere else.
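To make the idea concrete, here is a minimal pure-Python sketch of one-hot encoding. It is not the scikit-learn implementation. The helper `one_hot` and the sample values are hypothetical, and the `_is_` column naming simply mirrors the encoder class used in this lab.

```python
def one_hot(column_name, values):
    """Return (column_names, rows) for a one-hot encoding of `values`."""
    categories = sorted(set(values))  # fixed, ordered list of categories
    names = [f"{column_name}_is_{c}" for c in categories]
    # one row per value, with a 1 only in the column of its category
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows

# Made-up sample values, for illustration only
names, rows = one_hot("EmploymentStatus",
                      ["Employed", "Retired", "Employed", "Self-employed"])
print(names)
for row in rows:
    print(row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.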



In [None]:
from sklearn.preprocessing import OneHotEncoder

class categorical_encoder():
    def __init__(self):
        self.encoder = OneHotEncoder()
    def fit(self, data):
        self.encoder.fit(data)
        self.columns = data.columns
    def transform(self, data):    
        columns = list()
        for j,col in enumerate(self.columns):
            for val in self.encoder.categories_[j]:
                columns.append(col + '_is_' + str(val))
        return pd.DataFrame(self.encoder.transform(data).toarray(), dtype=bool, columns=columns)
    
category_encoder = categorical_encoder()

In [None]:
categorical_data = prosper_clean[categorical_columns]
categorical_data

In [None]:
category_encoder.fit(categorical_data)
categorical_data_encoded = category_encoder.transform(categorical_data)
categorical_data_encoded

### 3.3 - Scale Numerical Data

**Numerical data** is measurable data, such as time, height, weight, or amount. A quick test: if it makes sense to average the values or sort them in ascending or descending order, the data is numerical.

We will standardize the numerical columns using **StandardScaler**, which rescales each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance.
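The arithmetic behind standardization is short enough to do by hand. The sketch below applies it with NumPy to a made-up column. The values are illustrative and not taken from the dataset.

```python
import numpy as np

# Made-up interest rates, for illustration only
borrower_rate = np.array([0.10, 0.20, 0.30, 0.40])

mean = borrower_rate.mean()
std = borrower_rate.std()  # population std (ddof=0), the convention StandardScaler uses
scaled = (borrower_rate - mean) / std

print("scaled values:", scaled)
print("mean after scaling:", scaled.mean())
print("std after scaling :", scaled.std())
```

After scaling, features measured in very different units (a rate near 0.2, a term of 36 months) end up on a comparable scale.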

In [None]:
from sklearn.preprocessing import StandardScaler

numerical_data = prosper_clean[numeric_colums]
ss = StandardScaler()
numerical_data_scaled = pd.DataFrame(ss.fit_transform(numerical_data), columns = numerical_data.columns, dtype='float32')
numerical_data_scaled

In [None]:
# x = prosper_clean [input_columns]
x = pd.concat([categorical_data_encoded, numerical_data_scaled], axis = 1)
y = prosper_clean[label_column]

print (y.head())
print('-----')
x

## TODO : Inspect the final data.  Does it look correct?

### 3.4 - Ensure Columns are in correct data type

In [None]:
## TODO-Later: Once you get the program working, comment out the following and run the notebook
##  See what happens.  It is a good error to remember :-) 

## convert input columns to float
for col in x.columns:
    x[col] = x[col].astype('float32')
    
## y column is int64
## If running on GPUs convert to int8, save some memory
## Use the cleaned dataframe so that y has the same rows as x
y = prosper_clean[label_column].astype('int8')

print (y.head())
print('-----')
x

## TODO : Inspect the final data.  Does it look correct?

### 3.5 - Create train/test split

In [None]:
from sklearn.model_selection import train_test_split

## TODO : Split the data 80% / 20%
## Hint : test_size = 0.2
x_train,x_test, y_train,y_test = train_test_split(x,y,test_size=???,random_state=0) 

# # backup
# x_train_bak = x_train.copy(deep=True)
# x_test_bak = x_test.copy(deep=True)

print ("x_train.shape : ", x_train.shape)
print ("y_train.shape : ", y_train.shape)
print ("x_test.shape : ", x_test.shape)
print ("y_test.shape : ", y_test.shape)

## Step 4 : Build the Model
Since this is a binary classifier, here is how we are going to build the neural network:
- Neurons in input layer  = input dimensions (the number of columns in x after encoding, i.e. x.shape[1])
- Neurons in hidden layer = ???
- Neurons in output layer = 1 (binary classification)
- Output activation is 'sigmoid'
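When sketching the network, it helps to know how layer sizes translate into trainable parameters: a Dense layer has one weight per (input, unit) pair plus one bias per unit. The sketch below counts them for the layer sizes suggested in the TODO further down. The input dimension of 70 is a hypothetical placeholder; in the lab it is whatever `x.shape[1]` turns out to be.

```python
def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair, plus one bias per unit
    return n_inputs * n_units + n_units

input_dim = 70              # hypothetical; in the lab this is x.shape[1]
layer_units = [64, 32, 1]   # hidden, hidden, output

total = 0
n_inputs = input_dim
for units in layer_units:
    params = dense_params(n_inputs, units)
    print(f"Dense({units}): {params:,} parameters")
    total += params
    n_inputs = units        # this layer's outputs feed the next layer

print(f"total: {total:,}")
```

Dropout layers add no trainable parameters, so they do not change this count. You can compare your own numbers against the output of `model.summary()`.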

An **optimizer** is the algorithm that adjusts the network's weights during training to reduce the loss. The choice of optimizer, and of its learning rate, affects how quickly training converges and how good the resulting model is.

Here are the **optimizers** we will be working with; you can switch between them:
- **RMSprop**: a gradient-based optimization technique that divides each gradient by the root of a moving average of recent squared gradients, which normalizes the step size.
- **Adam**: an adaptive learning rate optimization algorithm designed for training deep neural networks. It maintains an individual learning rate for each parameter.
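To see what "a moving average of squared gradients" means in practice, here is a minimal scalar sketch of the basic RMSprop update on a toy loss. It is a simplified illustration, not the Keras implementation. The toy loss, starting point, and hyperparameters are all assumptions chosen for the example.

```python
import math

def grad(w):
    # gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)

w = 0.0                          # starting weight
v = 0.0                          # moving average of squared gradients
lr, rho, eps = 0.01, 0.9, 1e-8   # illustrative hyperparameters

for step in range(2000):
    g = grad(w)
    v = rho * v + (1.0 - rho) * g * g     # update the moving average
    w -= lr * g / (math.sqrt(v) + eps)    # step scaled by the running gradient magnitude

print(f"w after training: {w:.3f}")
```

Because each step is divided by the recent gradient magnitude, the step size stays on the order of the learning rate whether the raw gradient is large or small.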

### TODO : Sketch the neural net
- What are the input dimensions?
- How many neurons are in each hidden layer?
- How many output neurons are there?

<img src="../assets/images/neural-net-unknown.png" style="width:40%"/>

In [None]:
## TODO build a network
##    - number of neurons @ first hidden layer = 64, activation=tf.nn.relu
##    - Dropout layer :  Dropout (0.2) - drop 20% of signals  -
##    - neurons for second hidden layer :  32,  activation=tf.nn.relu
##    - final output layer : 1 neuron, activation=tf.nn.sigmoid

## TODO-Later:  Remove Dropout layers, and run it again.
##              Does it make a difference in results?  Can you explain?

model = tf.keras.Sequential([
    # input layer is implicit
    tf.keras.layers.Dense(units=??, activation=???, input_dim=x.shape[1]),
    tf.keras.layers.Dropout(???),
    tf.keras.layers.Dense(units=???, activation=???),
    tf.keras.layers.Dropout(???),
    
        ## TODO-Later : Experiment by adding more layers?
#     tf.keras.layers.Dense(units=16,activation=tf.nn.relu),
#     tf.keras.layers.Dropout(0.2),
#     tf.keras.layers.Dense(8,activation=tf.nn.relu),
#     tf.keras.layers.Dropout(0.2),

    tf.keras.layers.Dense(units=???, activation=???)
  ])

## include a bunch of metrics
metrics = [
    'accuracy',
    tf.keras.metrics.TruePositives(name='tp'),
    tf.keras.metrics.FalsePositives(name='fp'),
    tf.keras.metrics.TrueNegatives(name='tn'),
    tf.keras.metrics.FalseNegatives(name='fn'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
    tf.keras.metrics.AUC(name='auc')
  ]

# metrics = ['accuracy' ]

## TODO-Later: Experiment with different optimizers.
##   Do they make a diff?

# opt = tf.keras.optimizers.RMSprop()
# opt = tf.keras.optimizers.RMSprop(learning_rate=0.000001)
opt = 'adam'

model.compile(loss='binary_crossentropy',
                optimizer=opt,
                metrics=metrics)

model.summary()  # prints the summary itself; wrapping it in print() adds a stray 'None'

tf.keras.utils.plot_model(model, to_file='model.png', show_shapes=True)

## Step 5 : Tensorboard

In [None]:
## This is fairly boiler plate code
import datetime
import os
import shutil

app_name = 'classification-prosper'
# timestamp  = datetime.datetime.now().strftime("%Y-%m-%d--%H-%M-%S")
tb_top_level_dir= '/tmp/tensorboard-logs'
tb_app_dir = os.path.join (tb_top_level_dir, app_name)
tb_logs_dir = os.path.join (tb_app_dir, datetime.datetime.now().strftime("%H-%M-%S"))
print ("Saving TB logs to : " , tb_logs_dir)
#clear out old logs
shutil.rmtree ( tb_app_dir, ignore_errors=True )
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=tb_logs_dir, write_graph=True, 
                                                      write_images=True, histogram_freq=1)

## This will embed Tensorboard right here in jupyter!
%load_ext tensorboard
%tensorboard --logdir $tb_logs_dir

## Step 6 : Train

In [None]:
%%time


## TODO : Start with epochs 10.  Observe the training
epochs = ???

print ("training starting ...")
## TODO : specify 20% data for validation (Hint : validation_split = 0.2)
history = model.fit(
              x_train, y_train,
              epochs=epochs, validation_split = ???, verbose=1,
              callbacks=[tensorboard_callback])
print ("training done.")

## TODO-Later : Try to increase the epochs
## TODO-Later : Training taking too long?  Switch to GPU !
##              Observe the debug output on the top to make sure you are in fact using GPU

## Step 7 : Plot History

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

if 'accuracy' in history.history:
    plt.plot(history.history['accuracy'], label='train_accuracy')
if 'val_accuracy' in history.history:
    plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.legend()
plt.show()

## Step 8 : Predict

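**predict** returns the raw output of the sigmoid layer: the predicted probability of class 1 for each row.

**predict_classes** returned the predicted class directly, but it was deprecated and then removed in newer TensorFlow 2.x releases, so below we threshold the probabilities ourselves.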
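Turning probabilities into class labels is just a thresholding step. The sketch below shows it on made-up probabilities shaped like the output of `model.predict` for a single sigmoid neuron. The 0.5 threshold is the conventional default, not something tuned for this dataset.

```python
import numpy as np

# Made-up sigmoid outputs with shape (n_samples, 1), for illustration only
probabilities = np.array([[0.91], [0.08], [0.50], [0.49]])

# Flatten to 1-D, then label anything at or above the threshold as class 1
threshold = 0.5
classes = (probabilities.ravel() >= threshold).astype(int)

print(classes)
```

Raising the threshold makes the model more conservative about predicting class 1, which trades recall for precision.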


In [None]:
import numpy as np

## Raw predictions are probabilities
predictions = model.predict(x_test)

## we need to map the probabilities into 0/1
## predictions has shape (n, 1), so flatten it before thresholding
y_pred = predictions2 = (predictions.ravel() >= 0.5).astype('int8')

np.set_printoptions(formatter={'float': '{: 0.2f}'.format})

print ('predictions : ' , predictions[:10])
print ('predictions2 : ' , predictions2[:10])

## Step 9 : Evaluate the model

### 9.1 - Print out metrics

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

In [None]:
metric_names = model.metrics_names
print ("model metrics : " , metric_names)

metrics = model.evaluate(x_test, y_test, verbose=0)

for idx, metric in enumerate(metric_names):
    print ("Metric : {} = {:,.2f}".format (metric_names[idx], metrics[idx]))

### 9.2 - Confusion Matrix
Since this is a classification problem, a confusion matrix is a very effective way to evaluate our model.
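A confusion matrix is simply a table of counts: each row is an actual class and each column is a predicted class. The pure-Python sketch below builds one by hand for made-up labels. The helper `confusion_counts` is hypothetical and uses the same row and column layout that scikit-learn produces for labels [0, 1].

```python
def confusion_counts(y_true, y_pred):
    """Return [[tn, fp], [fn, tp]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # row = actual class, column = predicted class
    return matrix

# Made-up labels, for illustration only
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 0, 0, 1, 1, 0, 1]

cm_demo = confusion_counts(y_true_demo, y_pred_demo)
print(cm_demo)
```

The diagonal holds the correct predictions; everything off the diagonal is a mistake of one kind or the other.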

Visualizing the confusion matrix:

In [None]:
## plain confusion matrix 

from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, y_pred, labels = [0,1])
cm


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize = (8,5))

# colormaps : cmap="YlGnBu" , cmap="Greens", cmap="Blues",  cmap="Reds"
sns.heatmap(cm, annot=True, cmap="Reds", fmt='d')
plt.show()


### 9.3 - Metrics calculated from Confusion Matrix

In [None]:
from sklearn.metrics import classification_report
from pprint import pprint

pprint(classification_report(y_test, y_pred, output_dict=True))

### TODO : Interpret confusion matrix
The instructor will walk you through the matrix.  
Answer these questions:
- Which class is classified correctly most often?
- Which class is classified incorrectly most often?
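One way to answer these questions is to compute per-class rates directly from the matrix. The sketch below uses a hypothetical 2x2 matrix in the same layout as above (rows are actual classes, columns are predicted classes, with 0 = defaulted and 1 = paid). The counts are made up.

```python
# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm_example = [[2, 1],   # actual 0 (defaulted): 2 correct, 1 predicted as paid
              [1, 4]]   # actual 1 (paid): 1 predicted as defaulted, 4 correct

(tn, fp), (fn, tp) = cm_example

recall_paid = tp / (tp + fn)        # share of actual paid loans the model found
recall_defaulted = tn / (tn + fp)   # share of actual defaults the model found
precision_paid = tp / (tp + fp)     # share of "paid" predictions that were right
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall (paid)      : {recall_paid:.2f}")
print(f"recall (defaulted) : {recall_defaulted:.2f}")
print(f"precision (paid)   : {precision_paid:.2f}")
print(f"accuracy           : {accuracy:.2f}")
```

On a skewed dataset, a large gap between the two recall values is a sign that overall accuracy is hiding poor performance on the minority class.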

## Step 10 : Improve the Model

Inspect the following:
- What is the 'accuracy' metric in Step 9.1?
- Verify this with TensorBoard (port 6006).

Most likely, we didn't get a great accuracy.  
How can we improve it?

**Try the following ideas** 

- **Idea-1 : Increase neurons in hidden layer**  
  - In Step 4, increase the number of neurons in the hidden layers (for example, 64 --> 128 in the first hidden layer)  
  - Click 'Kernel --> Restart and Run all Cells'  
  - Hopefully you should see improvement in the accuracy.  
  - Check accuracy metrics / confusion matrix / tensorboard
- **Idea-2 : Increase epochs**
  - Increasing the epochs may cause your model to overfit
  - Keep an eye on how much longer training takes as you increase epochs
- **Idea-3 : Change optimizers** 
  - The optimizer interacts with the initialization scheme, so this might need to be changed.
  - The learning rate may need to be changed.
  - The learning rate schedule may need to be adjusted.
- **Idea-4 : Change scalers**
  - Try different scalers
  - Try data without using a scaler
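As an illustration of what a learning rate schedule looks like, here is a minimal sketch of exponential decay, where the learning rate shrinks by a fixed factor over a fixed number of steps. The function name and all values are assumptions for the example. Keras provides ready-made schedules such as `tf.keras.optimizers.schedules.ExponentialDecay` that can be passed to an optimizer as its learning rate.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    # the learning rate is multiplied by `decay_rate` every `decay_steps` steps
    return initial_lr * decay_rate ** (step / decay_steps)

initial_lr = 0.001   # illustrative values, not tuned for this dataset
for step in (0, 1000, 2000):
    lr = exponential_decay(initial_lr, decay_rate=0.5, decay_steps=1000, step=step)
    print(f"step {step:>4}: learning rate = {lr:.6f}")
```

Starting with a larger learning rate and decaying it lets training move quickly at first and settle more carefully later.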

## Cleanup 
Before running the next exercise, run the following cell to terminate child processes and free up resources.

In [None]:
## Kill any child processes (like tensorboard)

import psutil
import os, signal

current_process = psutil.Process()
children = current_process.children(recursive=True)
for child in children:
    print('Killing Child pid  {}'.format(child.pid))
    os.kill(child.pid, signal.SIGKILL)
    
## This will kill actual kernel itself
# os.kill(os.getpid(), signal.SIGKILL)