## LendingClub - Back/Forward Testing

In this notebook, we load the saved Decision Tree model (maxDepth = 6) and run it against loan data from quarters before and after the training period (back and forward testing) to see how the model performs on data it was not developed on.

1. Data used to develop model : 2017Q1 to 2017Q3
2. BackTesting Data           : 2016Q1, 2016Q3, 2016Q4
3. Forward Testing Data       : 2017Q4

Back and Forward Testing Data (csv format) are placed in the "./Dataset/Processing" folder.

Performance is evaluated by counting True and False Positives/Negatives, where "positive" means the model predicted a default (loan status "Charged Off"). Definitions are as follows:

1. True Positive  = how often the model correctly predicted that a loan defaulted
2. False Positive = how often the model predicted that a loan defaulted when it was in fact fully paid
3. True Negative  = how often the model correctly predicted that a loan was fully paid
4. False Negative = how often the model predicted that a loan was fully paid when it had in fact defaulted
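
The four definitions above can be sketched in plain Python, independent of Spark. This is only an illustration: the helper name `confusion_counts` and the sample rows are hypothetical, and it assumes, as the notebook's filters do, that prediction 1 corresponds to "Charged Off" and prediction 0 to "Fully Paid". Rows with any other loan status are ignored.

```python
from collections import Counter

def confusion_counts(rows):
    """Count TP/FP/TN/FN from (prediction, loan_status) pairs.

    Assumption: prediction 1 means "Charged Off" (default) and
    prediction 0 means "Fully Paid". Other statuses are skipped.
    """
    counts = Counter()
    for prediction, status in rows:
        if status not in ("Charged Off", "Fully Paid"):
            continue  # e.g. loans that are still current
        predicted_default = prediction == 1
        actual_default = status == "Charged Off"
        if predicted_default and actual_default:
            counts["TP"] += 1
        elif predicted_default:
            counts["FP"] += 1
        elif actual_default:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}

# Hypothetical sample rows, for illustration only
sample = [
    (1, "Charged Off"),
    (0, "Fully Paid"),
    (0, "Charged Off"),
    (0, "Fully Paid"),
    (1, "Fully Paid"),
    (0, "Current"),
]
print(confusion_counts(sample))
```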

In [23]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
import os
import shutil

# Selected Features for model input
selectedFeatures = ["int_rate", "loan_amnt", "total_pymnt", "out_prncp", "policy_code"]

fileDirectory = './Dataset/Processing/'

# Convert a percentage string with a "%" suffix (e.g. "13.5%") to a float fraction; nulls become 0.0
func3 = udf(lambda x: 0.0 if x is None else float(x.strip('%')) / 100.0, FloatType())

# Load the saved model once; it is reused for every data file
model = PipelineModel.load("./Model/LoanStats_model")

for fname in os.listdir(fileDirectory):

    # Strip the file extension (e.g. ".csv") to get the dataset name
    filename = os.path.splitext(fname)[0]

    # Load the loan data, dropping malformed rows
    LoanStatsDF = spark.read.csv(fileDirectory + fname, header=True, inferSchema=True, mode="DROPMALFORMED")

    # Convert the int_rate percentage strings to floats
    LoanStatsML_df = LoanStatsDF.withColumn('int_rate', func3(col('int_rate')))

    # Score the data with the loaded model
    predictions = model.transform(LoanStatsML_df)
    
    print("=========================================================================================================")
    print("Prediction Stats for {0}".format(filename + ".csv"))
    print("=========================================================================================================")
    
    trueP = predictions.filter((predictions.prediction == 1) & (predictions.loan_status == "Charged Off")).count()
    falseP = predictions.filter((predictions.prediction == 1) & (predictions.loan_status == "Fully Paid")).count()
    trueN = predictions.filter((predictions.prediction == 0) & (predictions.loan_status == "Fully Paid")).count()
    falseN = predictions.filter((predictions.prediction == 0) & (predictions.loan_status == "Charged Off")).count()
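    # Totals of predicted positives and predicted negatives.
    # The percentages printed below are shares of these totals,
    # not shares of the loans that actually defaulted or were paid.
    P_Tot = trueP + falseP
    N_Tot = trueN + falseN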
    
    print("Total Positives : {0}".format(P_Tot))
    print("Total Negatives : {0}".format(N_Tot))
    print("")
    print("True Positives : {0} ({1:.2%})".format(trueP, 1 if P_Tot == 0 else trueP/P_Tot))
    print("False Positives : {0} ({1:.2%})".format(falseP, 0 if P_Tot == 0 else falseP/P_Tot))
    print("True Negatives : {0} ({1:.2%})".format(trueN, 1 if N_Tot == 0 else trueN/N_Tot))
    print("False Negatives : {0} ({1:.2%})".format(falseN, 0 if N_Tot == 0 else falseN/N_Tot))
    print("")
    
    #predictions.toPandas().to_csv("./Dataset/Predictions/" + filename + "_Predictions.csv", sep=',')

Prediction Stats for LoanStats_2017Q4.csv
Total Positives : 25
Total Negatives : 3846

True Positives : 25 (100.00%)
False Positives : 0 (0.00%)
True Negatives : 3843 (99.92%)
False Negatives : 3 (0.08%)

Prediction Stats for LoanStats_2016Q4.csv
Total Positives : 4823
Total Negatives : 21511

True Positives : 4823 (100.00%)
False Positives : 0 (0.00%)
True Negatives : 21205 (98.58%)
False Negatives : 306 (1.42%)

Prediction Stats for LoanStats_2016Q1.csv
Total Positives : 10062
Total Negatives : 47469

True Positives : 10062 (100.00%)
False Positives : 0 (0.00%)
True Negatives : 41833 (88.13%)
False Negatives : 5636 (11.87%)

Prediction Stats for LoanStats_2016Q3.csv
Total Positives : 6869
Total Negatives : 27345

True Positives : 6869 (100.00%)
False Positives : 0 (0.00%)
True Negatives : 25538 (93.39%)
False Negatives : 1807 (6.61%)
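
The percentages above are computed relative to the predicted totals, so the False Negative percentage is a share of predicted negatives, not of the loans that were actually charged off. As a complementary view, the sketch below derives precision and recall from the counts printed above. The helper name `precision_recall` is hypothetical and not part of the notebook; the counts are copied from the output.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); NaN when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# (true positives, false positives, false negatives) from the printed output
reported = {
    "2016Q1": (10062, 0, 5636),
    "2016Q3": (6869, 0, 1807),
    "2016Q4": (4823, 0, 306),
    "2017Q4": (25, 0, 3),
}

for quarter, (tp, fp, fn) in reported.items():
    precision, recall = precision_recall(tp, fp, fn)
    print("{0}: precision = {1:.2%}, recall = {2:.2%}".format(quarter, precision, recall))
```

Here recall is the fraction of loans that were actually charged off and that the model flagged as defaults, which is the quantity most directly affected by the False Negative counts.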

