# Credit Risk Evaluator

### Which model will be better? LogisticRegression or RandomForestClassifier?
#### Predictions: 

I predict that the RandomForestClassifier will be the better model because the data set contains several columns of categorical data.

In [32]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

In [33]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [34]:
# train_df

In [35]:
# test_df

In [36]:
# Separate the target feature, then convert categorical data to numeric for the training data
X_tr = train_df.drop('target', axis=1)
# get_dummies one-hot encodes the object-dtype columns and leaves numeric columns unchanged,
# so a single call on the whole DataFrame covers every categorical column
X_tr_dummies = pd.get_dummies(X_tr)
# print(X_tr_dummies.columns)
y_tr = LabelEncoder().fit_transform(train_df['target'])
# X_tr_dummies.columns

In [37]:
# Separate the target feature, then convert categorical data to numeric for the testing data
X_te = test_df.drop('target', axis=1)
# get_dummies one-hot encodes the object-dtype columns and leaves numeric columns unchanged,
# so a single call on the whole DataFrame covers every categorical column
X_te_dummies = pd.get_dummies(X_te)
# print(X_te_dummies.columns)
y_te = LabelEncoder().fit_transform(test_df['target'])
# X_te_dummies.columns
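
Fitting a separate `LabelEncoder` on each data set only produces matching integer codes when both target columns contain the same set of labels. A minimal sketch of fitting the encoder once and reusing it is shown here; the label values `'high_risk'` and `'low_risk'` are made up for illustration and are not taken from the CSV files:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values; the real labels live in the 'target' column.
train_target = ['low_risk', 'high_risk', 'low_risk', 'high_risk']
test_target = ['low_risk', 'low_risk', 'high_risk']

# Fit once on the training labels, then reuse the same mapping for the testing labels.
encoder = LabelEncoder().fit(train_target)
y_train_demo = encoder.transform(train_target)
y_test_demo = encoder.transform(test_target)

print(encoder.classes_)            # classes are stored in sorted order
print(y_train_demo, y_test_demo)   # integer codes share one mapping
```

With a shared encoder, `transform` raises an error if the testing data contains a label the training data never had, which surfaces a mismatch instead of silently encoding it.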

In [59]:
# Add missing dummy variables to the testing set
# symmetric_difference returns the columns that appear in only one of the two DataFrames
X_tr_dummies.columns.symmetric_difference(X_te_dummies.columns)
X_te_dummies['debt_settlement_flag_Y'] = 0
# X_te_dummies
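
Hard-coding a single column name works for these two files but would need updating if a different category went missing. A more general sketch reindexes the testing frame to the training columns. The frames `train_demo` and `test_demo` and the `loan_amnt` column are hypothetical stand-ins for `X_tr_dummies` and `X_te_dummies`:

```python
import pandas as pd

# Small made-up frames: the testing data has no row with flag 'Y'.
train_demo = pd.get_dummies(pd.DataFrame({
    'loan_amnt': [1000, 2500, 4000],
    'debt_settlement_flag': ['N', 'Y', 'N'],
}))
test_demo = pd.get_dummies(pd.DataFrame({
    'loan_amnt': [1500, 3000],
    'debt_settlement_flag': ['N', 'N'],
}))

# Columns the training frame has that the testing frame lacks.
missing = train_demo.columns.difference(test_demo.columns)
print(list(missing))

# Reindex to the training columns: absent columns are added and filled with 0,
# and the column order is made identical to the training frame.
test_aligned = test_demo.reindex(columns=train_demo.columns, fill_value=0)
print(list(test_aligned.columns))
```

Note that `reindex` also drops any testing column that the training frame does not have, which is usually what a fitted model needs.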

In [70]:
# Train the Logistic Regression model on the unscaled data and print the model score
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='lbfgs', max_iter=9000)
classifier.fit(X_tr_dummies, y_tr)
print(f"Training Data Score: {classifier.score(X_tr_dummies, y_tr)}")
print(f"Testing Data Score: {classifier.score(X_te_dummies, y_te)}")

Training Data Score: 0.7045155993431855
Testing Data Score: 0.5595491280306253


In [78]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=5, n_estimators=150).fit(X_tr_dummies, y_tr)
print(f"Training Data Score: {clf.score(X_tr_dummies, y_tr)}")
print(f"Testing Data Score: {clf.score(X_te_dummies, y_te)}")

Training Data Score: 1.0
Testing Data Score: 0.6422798809017439


### Thoughts before scaling data
#### Comparisons from unscaled data: 
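On the unscaled data, the random forest classifier has the higher testing score (0.642 versus 0.560 for the logistic regression model), but its perfect training score of 1.0 suggests it is overfitting. The logistic regression model shows a smaller gap between its training score (0.705) and testing score (0.560), although neither model generalizes well to the 2020 Q1 data.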
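
A training score of 1.0 next to a much lower testing score is a common sign of overfitting. One way to probe it is to compare the train/test gap of an unconstrained forest with that of a depth-limited one. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan data, and `max_depth=5` is an arbitrary illustrative value rather than a tuned setting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy classification data (assumption: not the loan data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.2, random_state=5
)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, random_state=5)

def fit_and_score(model):
    """Fit on the training split and return (training score, testing score)."""
    model.fit(X_a, y_a)
    return model.score(X_a, y_a), model.score(X_b, y_b)

full_train, full_test = fit_and_score(
    RandomForestClassifier(n_estimators=150, random_state=5)
)
shallow_train, shallow_test = fit_and_score(
    RandomForestClassifier(n_estimators=150, max_depth=5, random_state=5)
)

print(f"Unconstrained: train={full_train:.3f}, test={full_test:.3f}")
print(f"max_depth=5:   train={shallow_train:.3f}, test={shallow_test:.3f}")
```

A smaller gap for the depth-limited forest would indicate that the unconstrained trees are memorizing the training rows; whether the testing score itself improves depends on the data.
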
#### Predictions for scaled data: 
I think the random forest classifier overfits the unscaled data too heavily for scaling alone to close the gap with the logistic regression model. Even so, I predict both models will perform better after scaling.

In [74]:
# Scale the data
scaler = StandardScaler().fit(X_tr_dummies)
X_train_scaled = scaler.transform(X_tr_dummies)
X_test_scaled = scaler.transform(X_te_dummies)
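
`StandardScaler` replaces each feature value x with z = (x - mean) / std, where the mean and standard deviation are learned from the training data only and then reused on the testing data. A small sketch with made-up numbers (`train_vals` and `test_vals` are hypothetical) reproduces this by hand and checks it against scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices: two features on very different scales.
train_vals = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_vals = np.array([[2.0, 400.0]])

# Statistics come from the training data only.
mean = train_vals.mean(axis=0)
std = train_vals.std(axis=0)  # population standard deviation (ddof=0)

# Apply the training statistics to the testing data.
manual_test = (test_vals - mean) / std
sk_test = StandardScaler().fit(train_vals).transform(test_vals)

print(manual_test)
print(np.allclose(manual_test, sk_test))
```

Reusing the training statistics keeps information about the testing set out of the preprocessing step, which is why the scaler above is fit on `X_tr_dummies` alone.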

In [75]:
# Train the Logistic Regression model on the scaled data and print the model score
classifier.fit(X_train_scaled, y_tr)
print(f"Training Data Score: {classifier.score(X_train_scaled, y_tr)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_te)}")

Training Data Score: 0.7107553366174055
Testing Data Score: 0.7594640578477244


In [81]:
# Train a Random Forest Classifier model on the scaled data and print the model score
clf2 = RandomForestClassifier(random_state=5, n_estimators=150).fit(X_train_scaled, y_tr)
print(f'Training Score: {clf2.score(X_train_scaled, y_tr)}')
print(f'Testing Score: {clf2.score(X_test_scaled, y_te)}')

Training Score: 1.0
Testing Score: 0.6418545299872395


### Results
#### Comparisons from scaled data: 
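The prediction that both models would improve was incorrect: the random forest classifier's testing score was essentially unchanged after scaling (0.6423 before, 0.6419 after), which is expected because tree-based splits are largely insensitive to feature scaling. The logistic regression model performed significantly better after scaling, with its testing score rising from 0.560 to 0.759.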
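
The near-identical random forest scores are expected: each tree splits on a threshold within a single feature, and standardization shifts and rescales a feature without changing the ordering of its values. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan data and compares predictions from forests trained on raw and standardized features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: not the loan data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=5)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, random_state=5)

# Fit the scaler on the training split only.
scaler_demo = StandardScaler().fit(X_a)

# Same hyperparameters and seed; only the feature scaling differs.
raw_forest = RandomForestClassifier(n_estimators=50, random_state=5).fit(X_a, y_a)
scaled_forest = RandomForestClassifier(n_estimators=50, random_state=5).fit(
    scaler_demo.transform(X_a), y_a
)

raw_pred = raw_forest.predict(X_b)
scaled_pred = scaled_forest.predict(scaler_demo.transform(X_b))

# Fraction of test rows on which the two forests agree.
agreement = np.mean(raw_pred == scaled_pred)
print(f"Share of identical predictions: {agreement:.3f}")
```

Any small disagreement most likely comes from floating-point rounding in the split thresholds, which would be consistent with the tiny difference between the two random forest testing scores.
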
#### Results for scaled data: 
The logistic regression model trained on scaled data is the better choice here, which means my initial prediction in favor of the random forest classifier was incorrect. The data set could probably benefit from further pre-processing to achieve better results.