In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [3]:
train_df.head()

Unnamed: 0,loan_amnt,int_rate,installment,home_ownership,annual_inc,verification_status,pymnt_plan,dti,delinq_2yrs,inq_last_6mths,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,debt_settlement_flag,target
0,7000.0,0.1894,256.38,MORTGAGE,75000.0,Not Verified,n,28.62,0.0,2.0,...,87.5,0.0,0.0,352260.0,62666.0,35000.0,10000.0,N,N,low_risk
1,40000.0,0.1614,975.71,MORTGAGE,102000.0,Source Verified,n,11.72,2.0,0.0,...,0.0,0.0,0.0,294664.0,109911.0,9000.0,71044.0,N,N,low_risk
2,11000.0,0.2055,294.81,RENT,45000.0,Verified,n,37.25,1.0,3.0,...,7.7,0.0,0.0,92228.0,36007.0,33000.0,46328.0,N,N,low_risk
3,4000.0,0.1612,140.87,MORTGAGE,38000.0,Not Verified,n,42.89,1.0,0.0,...,100.0,0.0,0.0,284273.0,52236.0,13500.0,52017.0,N,N,low_risk
4,14000.0,0.1797,505.93,MORTGAGE,43000.0,Source Verified,n,22.16,1.0,0.0,...,25.0,0.0,0.0,120280.0,88147.0,33300.0,78680.0,N,N,low_risk


In [4]:
test_df.head()

Unnamed: 0,loan_amnt,int_rate,installment,home_ownership,annual_inc,verification_status,pymnt_plan,dti,delinq_2yrs,inq_last_6mths,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,debt_settlement_flag,target
0,40000.0,0.1033,856.4,RENT,128700.0,Source Verified,n,12.47,0.0,1.0,...,57.1,0.0,0.0,63915.0,49510.0,49400.0,14515.0,Y,N,low_risk
1,24450.0,0.143,572.72,MORTGAGE,44574.0,Not Verified,n,15.05,0.0,1.0,...,0.0,0.0,0.0,136425.0,19439.0,15500.0,18925.0,N,N,low_risk
2,13500.0,0.143,316.23,OWN,60000.0,Not Verified,n,28.72,0.0,0.0,...,0.0,0.0,0.0,82124.0,65000.0,5400.0,61724.0,Y,N,low_risk
3,10625.0,0.1774,268.31,RENT,60000.0,Verified,n,15.7,0.0,4.0,...,20.0,0.0,0.0,54855.0,50335.0,23200.0,26255.0,N,N,low_risk
4,6375.0,0.1862,232.46,RENT,60000.0,Source Verified,n,35.5,0.0,0.0,...,75.0,0.0,0.0,90445.0,56541.0,15300.0,72345.0,N,N,low_risk


In [5]:
train_df.dtypes

loan_amnt                     float64
int_rate                      float64
installment                   float64
home_ownership                 object
annual_inc                    float64
                               ...   
total_bc_limit                float64
total_il_high_credit_limit    float64
hardship_flag                  object
debt_settlement_flag           object
target                         object
Length: 84, dtype: object

Based on the above, there are a number of categorical variables that must be encoded as numeric variables before we can proceed with scaling, fitting, transforming and training the model. 

In [6]:
# Convert categorical data to numeric and separate target feature for training data
X_train = train_df.drop('target', axis=1)
y_train = train_df['target']
X_train_dummies = pd.get_dummies(X_train)
X_train_dummies.shape

(12180, 92)

In [7]:
# Convert categorical data to numeric and separate target feature for testing data
X_test = test_df.drop('target', axis=1)
y_test = test_df['target']
X_test_dummies = pd.get_dummies(X_test)
X_test_dummies.shape

(4702, 91)

In [8]:
# Check for missing columns between test and train
X_train_dummies.columns.difference(X_test_dummies.columns).tolist()

['debt_settlement_flag_Y']

Based on the above, we can see that there is one column missing from the testing dataset once the data was one-hot encoded. We will need to add this column into the testing dataset so as to have the same 'shape' for training and testing datasets.

In [9]:
# add missing dummy variables to testing set
X_test_dummies['debt_settlement_flag_Y'] = 0
X_test_dummies.shape


(4702, 92)

### Consider the models:
Given that RandomForestClassifier models make use of a subset of the data and bags it and generally perform better than LinearRegression models, it is my prediction that the RandomForestClassifier will perform better than the LinearRegression model because the data is balanced and therefore there is a lower risk of overfit. 

In [15]:
# Train the Logistic Regression model on the unscaled data and print the model score
from sklearn.linear_model import LogisticRegression
lg_class = LogisticRegression()
lg_class.fit(X_train_dummies, y_train)
print(f"Training Data Score: {lg_class.score(X_train_dummies, y_train)}")
print(f"Testing Data Score: {lg_class.score(X_test_dummies, y_test)}")


Training Data Score: 0.6535303776683087
Testing Data Score: 0.5091450446618461


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [26]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
rf_class = RandomForestClassifier(random_state=1, n_estimators=1).fit(X_train_dummies, y_train)
print(f'Training Score: {rf_class.score(X_train_dummies, y_train)}')
print(f'Testing Score: {rf_class.score(X_test_dummies, y_test)}')

Training Score: 0.870935960591133
Testing Score: 0.5793279455550829


Based on the above two models, my prediction, while correct that the testing performance for the Random Forest model would be better, based on the training score (87), we can see that the model is overfit. As such, we must accept the Logistic Regression model as the better performing model despite performance being barely better than chance (50.9%), the difference between the testing and training scores are closer. The next logical step would be to scale the data as both models do not function well with non-normalized data. 

In [18]:
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train_dummies)
X_train_scaled = scaler.transform(X_train_dummies)
X_test_scaled = scaler.transform(X_test_dummies)

Scaling the data should improve the performance of both models

In [19]:
# Train the Logistic Regression model on the scaled data and print the model score
lg_class_scaled = LogisticRegression()
lg_class_scaled.fit(X_train_scaled, y_train)
print(f"Training Data Score: {lg_class.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {lg_class.score(X_test_scaled, y_test)}")

Training Data Score: 0.6301313628899836
Testing Data Score: 0.49276903445342407


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [25]:
# Train a Random Forest Classifier model on the scaled data and print the model score
rf_class_scaled = RandomForestClassifier(random_state=1, n_estimators=1).fit(X_train_scaled, y_train)
print(f'Training Score: {rf_class_scaled.score(X_train_scaled, y_train)}')
print(f'Testing Score: {rf_class_scaled.score(X_test_scaled, y_test)}')

Training Score: 0.8706896551724138
Testing Score: 0.581242024670353


#### How do scores compare with each other:
Looking at the two models on the scaled data we can see that while the Random Forest model performed better on the scaled training data as compared to the Logistic Regression model, however, it suffers from overfit as it has performed significantly less well on the scaled test data. While the Logistic Regession model also performed better on the training than the test data, it is a more balanced model as the gap between the training and the test scores (.13) is smaller than that between the training and test scores of the Random Forest (.28). As such, the Logistic Regression is the better performing model.

#### How do scores compare with previous results (unscaled):
Surprisingly, the Logistic Regression model performed marginally better on the unscaled data than on the scaled data but was once again the more balanced model and the Random Forest model, while performance was about the same on both scaled and unscaled data remained the most overfit.  

#### How does this compare to my prediction.
I had predicted that scaling the data would improve performance for both models however, it decreased the performance of the Logistic Regression data and did not impact the Random Forest. It is likely that the dataset itself may have a more linear trend, some of which was removed, along with the noise, when the data was scaled. I would recommend applying other scaling algorithms and verifying if the data can be further processed to remove noise caused by outliers. 


In [4]:
randscale = 0.8706896551724138 - 0.581242024670353
randscale

0.2894476305020608

In [5]:
logscale = 0.6301313628899836 - 0.49276903445342407
logscale

0.13736232843655954

In [6]:
log = 0.6535303776683087 - 0.5091450446618461

random = 0.870935960591133 - 0.5793279455550829

print(log,random)


0.1443853330064626 0.29160801503605005
