In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [3]:
print(train_df.shape)

(12180, 86)


In [4]:
print(test_df.shape)

(4702, 86)


In [5]:
# Convert categorical data to numeric and separate target feature for training data

# Get X data by removing loan_status column (outcome)
X = train_df.drop(['loan_status', 'Unnamed: 0'], axis=1)
X_dummies = pd.get_dummies(X, drop_first=True)
print(X_dummies.shape)

(12180, 86)


In [6]:
# Convert categorical data to numeric and separate target feature for testing data
X_test = test_df.drop(['loan_status', 'Unnamed: 0'], axis=1)
X_test_dummies = pd.get_dummies(X_test, drop_first=True)
print(X_test_dummies.shape)

for el in X_dummies.columns:
    if el not in X_test_dummies.columns:
        print(el)

# Add missing column to test dummies df
X_test_dummies['debt_settlement_flag_Y'] = 0

(4702, 85)
debt_settlement_flag_Y
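
Assigning the missing column by name works here because only one column is absent. A more general approach, sketched below on made-up toy frames (the column values are assumptions, not the loan data), is to reindex the test dummies against the training columns. This fills any missing dummy columns with 0 and also matches the training column order.

```python
import pandas as pd

# Hypothetical toy frames standing in for the 2019 and 2020 Q1 loan files.
train = pd.DataFrame({
    "loan_amnt": [1000, 2500, 4000],
    "debt_settlement_flag": ["N", "Y", "N"],
})
test = pd.DataFrame({
    "loan_amnt": [1500, 3000],
    "debt_settlement_flag": ["N", "N"],
})

train_dummies = pd.get_dummies(train, drop_first=True)
test_dummies = pd.get_dummies(test, drop_first=True)

# The test frame never contains the "Y" level, so its dummy column is absent.
missing = [c for c in train_dummies.columns if c not in test_dummies.columns]
print(missing)

# reindex adds absent columns filled with 0 and applies the training column order.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Columns that exist only in the test frame are dropped by this call, which is usually what a model trained on the training columns needs.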


In [7]:
# Convert output labels to 0 and 1
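# Fit the encoder once on the training labels so both sets share the same mapping
label_encoder = LabelEncoder().fit(train_df['loan_status'])
y_label = label_encoder.transform(train_df['loan_status'])
y_test_label = label_encoder.transform(test_df['loan_status'])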

# 1 = low risk, 0 = high risk
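
`LabelEncoder` assigns integer codes in sorted class order, which is where the 1 = low risk, 0 = high risk mapping comes from. The sketch below uses made-up label arrays (an assumption; the notebook reads them from `loan_status`) and fits the encoder once so that both sets are guaranteed to share the mapping.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label arrays; the notebook takes these from the loan_status column.
train_status = np.array(["low_risk", "high_risk", "low_risk", "high_risk"])
test_status = np.array(["high_risk", "low_risk", "low_risk"])

# Fit on the training labels only, then reuse the same mapping for the test labels.
encoder = LabelEncoder().fit(train_status)
y_train = encoder.transform(train_status)
y_test = encoder.transform(test_status)

# classes_ is sorted, so code 0 is "high_risk" and code 1 is "low_risk".
print(list(encoder.classes_))
print(y_train, y_test)
```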

I anticipate that the Logistic Regression model will perform better than the Random Forest Classifier. We are working with a large number of features, and I expect the Random Forest to overfit to noise in the training data, generalize poorly, and therefore score lower on the testing set.

In [8]:
# Train the Logistic Regression model on the unscaled data and print the model score
logmodel = LogisticRegression()
logmodel.fit(X_dummies, y_label)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
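
The warning above means the default lbfgs solver stopped at its iteration cap before converging. A minimal sketch of how to inspect this, assuming synthetic data with one deliberately inflated column in place of the loan features (`fit_and_report` is a hypothetical helper, not part of the notebook):

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption); one column is inflated to mimic
# an unscaled feature with a much larger magnitude than the rest.
X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X[:, 0] *= 10_000


def fit_and_report(max_iter):
    """Fit with the given iteration cap; return (iterations used, warned?)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter).fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(model.n_iter_[0]), warned


for cap in (100, 5000):
    used, warned = fit_and_report(cap)
    print(f"max_iter={cap}: iterations used={used}, convergence warning={warned}")
```

Whether the warning fires on this toy data depends on the data and the scikit-learn version. The point is only that `n_iter_` and `max_iter` can be compared directly.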

In [9]:
print(f"Training Data Score: {logmodel.score(X_dummies, y_label)}")
print(f"Testing Data Score: {logmodel.score(X_test_dummies, y_test_label)}")

Training Data Score: 0.6553366174055829
Testing Data Score: 0.5202041684389621


In [10]:
# Train a Random Forest Classifier model and print the model score
clf = RandomForestClassifier(random_state=1, n_estimators=50).fit(X_dummies, y_label)
print(f'Training Score: {clf.score(X_dummies, y_label)}')
print(f'Testing Score: {clf.score(X_test_dummies, y_test_label)}')

Training Score: 1.0
Testing Score: 0.6707783921735432


I believe that, after scaling the data so that features with larger magnitudes no longer dominate, the Random Forest Classifier will perform better than the Logistic Regression model on both the training set and the testing set, since a Random Forest can capture more complex relationships between features and does not assume linearity.

In [11]:
# Scale the data
scaler = StandardScaler().fit(X_dummies)
X_train_scaled = scaler.transform(X_dummies)
X_test_scaled = scaler.transform(X_test_dummies)
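
The Logistic Regression fit on the scaled data is not shown as a cell. A minimal sketch of that step, assuming synthetic data in place of the loan files and the hypothetical name `logmodel_scaled`, could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the loan data).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression on the scaled features and report both scores.
logmodel_scaled = LogisticRegression().fit(X_train_scaled, y_train)
train_score = logmodel_scaled.score(X_train_scaled, y_train)
test_score = logmodel_scaled.score(X_test_scaled, y_test)
print(f"Training Score: {train_score}")
print(f"Testing Score: {test_score}")
```

In the notebook, the same calls would take `X_train_scaled`, `y_label`, `X_test_scaled`, and `y_test_label`.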

In [12]:
# Train a Random Forest Classifier model on the scaled data and print the model score
clf1 = RandomForestClassifier(random_state=1, n_estimators=50).fit(X_train_scaled, y_label)
print(f'Training Score: {clf1.score(X_train_scaled, y_label)}')
print(f'Testing Score: {clf1.score(X_test_scaled, y_test_label)}')

Training Score: 1.0
Testing Score: 0.6714164185452999
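
The near-identical Random Forest scores before and after scaling are consistent with how trees split. Each split compares one feature to a threshold, so standardizing a feature mostly shifts the thresholds without changing which rows fall on each side. A rough check on synthetic data (an assumption, not the loan data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
scaler = StandardScaler().fit(X_train)

# Same hyperparameters and seed; only the feature scaling differs.
raw_forest = RandomForestClassifier(random_state=1, n_estimators=50)
raw_forest.fit(X_train, y_train)
scaled_forest = RandomForestClassifier(random_state=1, n_estimators=50)
scaled_forest.fit(scaler.transform(X_train), y_train)

raw_pred = raw_forest.predict(X_test)
scaled_pred = scaled_forest.predict(scaler.transform(X_test))

# Fraction of test rows on which the two forests make the same prediction.
agreement = (raw_pred == scaled_pred).mean()
print(f"Prediction agreement: {agreement:.3f}")
```

Small differences can still appear through floating-point rounding, which may explain the slight change in the testing score above.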


### Logistic Regression scores:
Unscaled data:

    Training Data Score: 0.6553366174055829
    Testing Data Score: 0.5202041684389621
    
Scaled data:

    Training Score: 0.7130541871921182
    Testing Score: 0.7216078264568269

### Random Forest scores:
Unscaled data:

    Training Score: 1.0
    Testing Score: 0.6707783921735432

Scaled data:

    Training Score: 1.0
    Testing Score: 0.6714164185452999

Scaling the data clearly improved the Logistic Regression model. Both the training and testing scores increased, and the gap between them shrank from about 0.13 to about 0.01, which suggests the scaled model is likely to generalize well.

Scaling had almost no effect on the Random Forest Classifier. This makes sense in hindsight, because tree splits depend on the ordering of feature values rather than on their magnitude. The model still scores 1.0 on the training data and only about 0.67 on the testing data, so it appears to be overfitting.

Overall, I believe the Logistic Regression model with scaled data is the better choice in this situation. It has the highest testing score (about 0.72) and the smallest gap between training and testing scores, so it appears less prone to fitting noise in this data. I think the Random Forest model could improve substantially with feature selection, but that might be overkill for this use case.

My prediction about the effect scaling would have on the models was not accurate. My first prediction, that the Random Forest would not generalize well, was closer to the truth.
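
As a rough illustration of the feature selection idea mentioned above, the sketch below uses scikit-learn's `SelectFromModel` with a Random Forest as the importance estimator. It runs on synthetic data (an assumption), so it says nothing about how much the loan data would benefit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with many uninformative features (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=40, n_informative=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Keep features whose importance is at least the mean importance
# (the default threshold for estimators without an L1 penalty).
selector = SelectFromModel(
    RandomForestClassifier(random_state=1, n_estimators=50)
).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Retrain on the reduced feature set and compare training and testing scores.
clf_sel = RandomForestClassifier(random_state=1, n_estimators=50)
clf_sel.fit(X_train_sel, y_train)
print(f"Features kept: {X_train_sel.shape[1]} of {X_train.shape[1]}")
print(f"Training Score: {clf_sel.score(X_train_sel, y_train)}")
print(f"Testing Score: {clf_sel.score(X_test_sel, y_test)}")
```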