# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
lending_data = pd.read_csv("Resources/lending_data.csv")
lending_data.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression, and a Random Forests Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct! 

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

Logistic regression

## Split the Data into Training and Testing Sets

In [3]:
# Split the data into X_train, X_test, y_train, y_test
from sklearn.model_selection import train_test_split
y = lending_data['loan_status'].values
X = lending_data.drop('loan_status', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27)
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [4]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=10000)
classifier

LogisticRegression(max_iter=10000)

In [5]:
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [6]:
print(f"Train Data Score: {classifier.score(X_train, y_train)}")
print(f"Test Data Score: {classifier.score(X_test, y_test)}")

Train Data Score: 0.9916597881414225
Test Data Score: 0.9923648369789517


In [7]:
from sklearn.metrics import confusion_matrix
y_true = y_test
y_pred = classifier.predict(X_test)
confusion_matrix(y_true, y_pred)

array([[18656,    85],
       [   63,   580]], dtype=int64)

In [8]:
tp,tn,fp,fn = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn) 
# (18656 + 580) / (18656 + 580 + 63 + 85)
print(f"Accuracy: {accuracy}")

Accuracy: 0.966828312009905


In [9]:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18741
           1       0.87      0.90      0.89       643

    accuracy                           0.99     19384
   macro avg       0.93      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



In [10]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

In [12]:
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [13]:
classifier = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Train Score: {classifier.score(X_train_scaled, y_train)}')
print(f'Test Score: {classifier.score(X_test_scaled, y_test)}')

Train Score: 0.9974893382858715
Test Score: 0.9908171687990095


In [14]:
y_true = y_test
y_pred = classifier.predict(X_test)
confusion_matrix(y_true, y_pred)



array([[18751,     0],
       [  633,     0]], dtype=int64)

In [15]:
tp,tn,fp,fn = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn) 
# (18656 + 580) / (18656 + 580 + 63 + 85)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9673442014032192


In [16]:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98     18751
           1       0.00      0.00      0.00       633

    accuracy                           0.97     19384
   macro avg       0.48      0.50      0.49     19384
weighted avg       0.94      0.97      0.95     19384



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Random Forest Classifier did better but only by a little. I predicted a logistic regression model would do better becasue the data is more catagorical and observational. In this data the calculations are based on wieghted factors such as the "loan size" and "interest rates". So if you have "x" amount of income and "x" amoount for loan and "x" you can predict the "x" value for interest rate, which should be catagorical. But because other numiric ranges are based on diffeerent factors the random forest becomes more accurate or predictive because of other factors.