# Neural Network : Prosper Loan Dataset

We are going to look at the prosper loan dataset.  This dataset shows a history of loans made by Prosper.


In [1]:
%matplotlib inline
import time,datetime
import pandas as pd
import matplotlib.pyplot as plt




try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf


## Step 1: Load the Data

Notice we are first loading this into a Pandas dataframe. This is fine for a small dataset, but we will need more than this for a large "at scale" notebook.

In [2]:
## small file, start with this
#datafile = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data-sample.csv"
## this is a large file
datafile = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data.csv.gz"


data = pd.read_csv(datafile)
data

Unnamed: 0,Term,LoanStatus,BorrowerRate,ProsperRating (numeric),ProsperScore,ListingCategory,BorrowerState,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,...,ProsperPaymentsOneMonthPlusLate,ProsperPrincipalBorrowed,ProsperPrincipalOutstanding,LoanOriginalAmount,MonthlyLoanPayment,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors,YearsWithCredit
0,36,1,0.1580,4.0,6.0,Unknown,CO,Self-employed,2.0,True,...,0.0,0.0,0.00,9425,330.43,0,0,0.0,258,13
1,36,1,0.1325,4.0,6.0,Unknown,Unknown,Full-time,19.0,False,...,0.0,0.0,0.00,1000,33.81,0,0,0.0,53,14
2,36,0,0.1435,5.0,4.0,Debt,AL,Employed,1.0,False,...,0.0,0.0,0.00,4000,137.39,0,0,0.0,1,18
3,36,0,0.3177,1.0,5.0,Household,FL,Other,121.0,True,...,0.0,0.0,0.00,4000,173.71,0,0,0.0,10,15
4,36,1,0.2075,4.0,6.0,Unknown,MI,Full-time,36.0,False,...,0.0,0.0,0.00,3000,112.64,0,0,0.0,53,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49719,36,1,0.0679,4.0,6.0,Personal,WA,Full-time,69.0,True,...,0.0,1000.0,847.61,4292,132.11,2,0,0.0,194,42
49720,36,1,0.1899,4.0,6.0,Business,CO,Full-time,22.0,False,...,0.0,14250.0,0.02,2000,73.30,0,0,0.0,25,10
49721,36,1,0.2639,2.0,3.0,Reno,FL,Employed,25.0,False,...,0.0,0.0,0.00,2500,101.25,0,0,0.0,26,6
49722,36,0,0.1110,6.0,8.0,Other,PA,Employed,21.0,True,...,0.0,33501.0,4815.42,2000,65.57,0,0,0.0,22,22


In [3]:
## TODO : select a few columns 
## start with: 'LoanStatus',  'EmploymentStatus', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategory'
#select_columns = ['LoanStatus', 'EmploymentStatus', 'CreditScore', '???', '???']


## we can add more later

select_columns = ['LoanStatus',  'EmploymentStatus', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategory']

## Note : vector columns can only have Numbers, don't include Categorical columns here
## And definitely not 'LoanStatus'  (if you are curiuos include and see what happens!)
vector_columns = [ 'EmpIndex', 'CreditScore', 'StatedMonthlyIncome', 'CategoryIndex']

## Feature Columns

feature_columns = ['EmploymentStatusFactor', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategoryFactor']




## Step 2 : Clean Data

In [4]:
## TODO :  Drop any NA, null values.  
## Hint : Using `.na.drop()`
prosper_clean = data.dropna()

print("Original record count {:,}, cleaned records count {:,},  dropped {:,}"\
      .format(len(data), len(prosper_clean), 
              (len(data) - len(prosper_clean))))
prosper_clean

Original record count 49,724, cleaned records count 49,724,  dropped 0


Unnamed: 0,Term,LoanStatus,BorrowerRate,ProsperRating (numeric),ProsperScore,ListingCategory,BorrowerState,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,...,ProsperPaymentsOneMonthPlusLate,ProsperPrincipalBorrowed,ProsperPrincipalOutstanding,LoanOriginalAmount,MonthlyLoanPayment,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors,YearsWithCredit
0,36,1,0.1580,4.0,6.0,Unknown,CO,Self-employed,2.0,True,...,0.0,0.0,0.00,9425,330.43,0,0,0.0,258,13
1,36,1,0.1325,4.0,6.0,Unknown,Unknown,Full-time,19.0,False,...,0.0,0.0,0.00,1000,33.81,0,0,0.0,53,14
2,36,0,0.1435,5.0,4.0,Debt,AL,Employed,1.0,False,...,0.0,0.0,0.00,4000,137.39,0,0,0.0,1,18
3,36,0,0.3177,1.0,5.0,Household,FL,Other,121.0,True,...,0.0,0.0,0.00,4000,173.71,0,0,0.0,10,15
4,36,1,0.2075,4.0,6.0,Unknown,MI,Full-time,36.0,False,...,0.0,0.0,0.00,3000,112.64,0,0,0.0,53,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49719,36,1,0.0679,4.0,6.0,Personal,WA,Full-time,69.0,True,...,0.0,1000.0,847.61,4292,132.11,2,0,0.0,194,42
49720,36,1,0.1899,4.0,6.0,Business,CO,Full-time,22.0,False,...,0.0,14250.0,0.02,2000,73.30,0,0,0.0,25,10
49721,36,1,0.2639,2.0,3.0,Reno,FL,Employed,25.0,False,...,0.0,0.0,0.00,2500,101.25,0,0,0.0,26,6
49722,36,0,0.1110,6.0,8.0,Other,PA,Employed,21.0,True,...,0.0,33501.0,4815.42,2000,65.57,0,0,0.0,22,22


## Look at some summary data

In [5]:
print(prosper_clean['LoanStatus'].value_counts())
print(prosper_clean['EmploymentStatus'].value_counts())
print(prosper_clean['ListingCategory'].value_counts())


1    33530
0    16194
Name: LoanStatus, dtype: int64
Full-time        25016
Employed         18393
Self-employed     3045
Part-time         1060
Other              924
Retired            703
Not employed       583
Name: EmploymentStatus, dtype: int64
Debt             19107
Unknown           9335
Other             6272
Business          4449
Reno              3468
Personal          2392
Auto              1596
Student            756
Household          675
Medical            444
Taxes              246
Vacation           225
LargePurchase      224
Wedding            196
Motorcycle         103
Engagement          72
Cosmetic            47
Baby                46
Boat                30
Green               23
RV                  18
Name: ListingCategory, dtype: int64


**=> What does that say about the cardinality of these categorical columns? ***



## Step 3: Converting Categorical columns 

Convert categorical columns to numeric.   
Here let's convert **EmploymentStatus** column

In [6]:
# use pd.factorize on EmploymentStatus, ListingCategory

prosper_clean['EmploymentStatusFactor'] = pd.factorize(prosper_clean['EmploymentStatus'])[0]
prosper_clean['ListingCategoryFactor'] = pd.factorize(prosper_clean['ListingCategory'])[0]

## Step 4: Build feature vectors 

In [7]:
features = prosper_clean[feature_columns]
label = prosper_clean['LoanStatus']

In [8]:
len(features.columns)

4

## Step 5: Split Data into training and test.

We will split our the data up into training and test.  (You know the drill by now).

**=> TODO: Split dataset into 70% training, 30% validation**


In [9]:
## TODO :  Split the data into 70% training and 30% test sets 
## Hint : 0.7   , 0.3
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(features, label)
print("training set = " , len(train_x))
print("testing set = " , len(test_x))

training set =  37293
testing set =  12431


In [10]:

# Normalize the input features using the sklearn StandardScaler.
# This will set the mean to 0 and standard deviation to 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_features = scaler.fit_transform(train_x)
test_features = scaler.transform(test_x)


In [11]:
len(train_x.keys())

4

In [12]:
y = tf.keras.utils.to_categorical(train_y)
#y=train_y
print(y)



[[0. 1.]
 [0. 1.]
 [1. 0.]
 ...
 [0. 1.]
 [0. 1.]
 [0. 1.]]


## Step 6: Neural Network

Note this using Tensorflow's Keras interface, which is going to be the standard high-level interface for Tensorflow starting with 2.0


In [13]:
def build_model(train_x):
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation=tf.nn.tanh, input_dim=len(features.columns)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation=tf.nn.tanh),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
  ])
  
  
  metrics = [
      tf.keras.metrics.Accuracy(name='accuracy'),
      tf.keras.metrics.TruePositives(name='tp'),
      tf.keras.metrics.FalsePositives(name='fp'),
      tf.keras.metrics.TrueNegatives(name='tn'),
      tf.keras.metrics.FalseNegatives(name='fn'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall'),
      tf.keras.metrics.AUC(name='auc')
  ]


  model.compile(loss='binary_crossentropy',
                optimizer=tf.keras.optimizers.RMSprop(lr=0.000001),
                metrics=metrics)
  return model


model = build_model(train_features)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 64)                320       
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
Total params: 4,545
Trainable params: 4,545
Non-trainable params: 0
_________________________________________________________________


In [14]:

EPOCHS = 10
BATCH_SIZE = 256



log_dir="logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)


model.fit(
  train_x, train_y,
  epochs=EPOCHS, validation_split = 0.2, verbose=2, callbacks=[tensorboard_callback])

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 29834 samples, validate on 7459 samples
Epoch 1/10
29834/29834 - 2s - loss: 0.7223 - accuracy: 0.0000e+00 - tp: 12705.0000 - fp: 6239.0000 - tn: 3487.0000 - fn: 7403.0000 - precision: 0.6707 - recall: 0.6318 - auc: 0.4943 - val_loss: 0.6581 - val_accuracy: 0.0000e+00 - val_tp: 4916.0000 - val_fp: 2318.0000 - val_tn: 93.0000 - val_fn: 132.0000 - val_precision: 0.6796 - val_recall: 0.9739 - val_auc: 0.4890
Epoch 2/10
29834/29834 - 1s - loss: 0.7009 - accuracy: 0.0000e+00 - tp: 14686.0000 - fp: 7085.0000 - tn: 2641.0000 - fn: 5422.0000 - precision: 0.6746 - recall: 0.7304 - auc: 0.5005 - val_loss: 0.6446 - val_accuracy: 0.0000e+00 - val_tp: 4916.0000 - val_fp: 2318.0000 - val_tn: 93.0000 - val_fn: 132.0000 - val_precision: 0.6796 - val_recall: 0.9739 - val_auc: 0.4878
Epoch 3/10
29834/29834 - 1s - loss: 0.6926 - accuracy: 0.0000e+00 - tp: 15903.0000 - fp: 7659.0000 - tn: 2067.0000 - fn: 

<tensorflow.python.keras.callbacks.History at 0x7f10d41fcfd0>

In [15]:
predictions = model.predict(test_features)

In [16]:
predictions

array([[0.44043687],
       [0.42610982],
       [0.61827   ],
       ...,
       [0.666938  ],
       [0.49820292],
       [0.61985314]], dtype=float32)

In [17]:
from sklearn.metrics import confusion_matrix

confusion_matrix(test_y, predictions > 0.5)

array([[2313, 1744],
       [4299, 4075]])

## Step 7: Evaluate the model.

Let us check to see how the model did, using accuracy as a measure.

In [18]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions > 0.5)


0.5138765988255168

## Step 8: Tensorboard

In [19]:
%tensorboard --logdir logs/fit  # For Colab

# jupyter: run the following at the command line: tensorboard --logdir logs/fit

UsageError: Line magic function `%tensorboard` not found.


## Step 9: Improve Accuracy

### Add more features
Look at the schema of the full dataset.  Are there any columns you want to add. Make sure you up the number of neurons in the hidden layer as you add more features.