# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
lending_data = pd.read_csv('Resources/lending_data.csv')
lending_data.head()

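| Unnamed: 0 | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |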


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

## Prediction
*Replace the text in this markdown cell with your predictions, and be sure to provide justification for your guess.*

I expect the Logistic Regression model to perform better than the Random Forest model here because none of the features are categorical. Every feature is numeric and its magnitude carries real meaning, so Logistic Regression can use the data as it is. If the data contained categorical features, they would need one-hot encoding before fitting a Logistic Regression, which is more time-consuming, and in that case a Random Forest might be preferable. In addition, because the goal is to assess the risk level of a loan, Logistic Regression is a natural fit: it directly estimates the probability that a given loan will default.
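
The point about predicting a probability of default can be illustrated with `predict_proba`. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `lending_data.csv`, and the names `X_demo`, `y_demo`, and `default_risk` are hypothetical. It assumes class `1` marks a high-risk loan.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lending data (assumption: 7 numeric features
# and a binary target where 1 marks a high-risk loan).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# predict_proba returns one column per class, ordered as in model.classes_
proba = model.predict_proba(X_te)
default_risk = proba[:, 1]  # estimated probability of class 1

print(model.classes_)
print(default_risk[:5].round(3))
```

A Random Forest also exposes `predict_proba`, so this is a convenience of the logistic model's formulation rather than something only it can do.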

## Split the Data into Training and Testing Sets

In [15]:
# Split the data into X_train, X_test, y_train, y_test
X = lending_data.drop('loan_status', axis=1)
y = lending_data['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y)
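
The split above uses the default settings, so the rows in each set (and therefore the scores) change from run to run. A minimal sketch of a reproducible, stratified split follows. The data here is synthetic, and the 10% share of positive labels is an assumption made for illustration, not a property taken from `lending_data.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the features and loan_status
# (assumption: 10% of rows belong to class 1).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# random_state fixes the shuffle; stratify keeps the class ratio
# the same in the training and testing sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

With a fixed `random_state`, rerunning the notebook reproduces the same scores, which makes small differences between models easier to compare.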

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [16]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression().fit(X_train, y_train)

print(f"The Training Data Score is: {classifier.score(X_train, y_train)}")
print(f"The Testing Data Score is: {classifier.score(X_test, y_test)}")

The Training Data Score is: 0.9914878250103177
The Testing Data Score is: 0.9925196037969459


In [17]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = RandomForestClassifier(n_estimators=500).fit(X_train_scaled, y_train)

print(f"The Training Data Score is: {clf.score(X_train_scaled, y_train)}")
print(f"The Testing Data Score is: {clf.score(X_test_scaled, y_test)}")

The Training Data Score is: 0.9973861604072087
The Testing Data Score is: 0.9918489475856377
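
In the cells above, scaling is applied only to the Random Forest inputs. Tree-based models are generally insensitive to feature scaling, whereas logistic regression solvers often converge more easily on standardized features. The sketch below shows one way to scale inside a pipeline for the logistic model. It runs on synthetic data with made-up feature scales, and `scaled_lr` is a hypothetical name, so the score it prints says nothing about the lending data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=4, random_state=1
)
# Put the columns on very different scales (illustrative values only).
X_demo = X_demo * [10000, 1, 50000, 0.1, 1, 1, 20000]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# The scaler is fit on the training data only, inside the pipeline,
# and then applied automatically when scoring the test data.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

score = scaled_lr.score(X_te, y_te)
print(f"The Testing Data Score is: {score}")
```

Keeping the scaler in the pipeline avoids leaking test-set statistics into training and removes the need for separate `X_train_scaled` and `X_test_scaled` variables.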


## Analysis
*Which model performed better? How does that compare to your prediction? Replace the text in this markdown cell with your answers to these questions.*

The Logistic Regression model performed marginally better than the Random Forest model on the test data, which is consistent with my prediction. Its test score (0.9925) was only slightly higher than the Random Forest's (0.9918), so both models performed well. However, the Random Forest model took noticeably longer to run than the Logistic Regression model. For the lending company, I would therefore recommend the Logistic Regression model: it uses fewer resources, and none of the features are categorical.
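
Because the gap between the two test scores is small and comes from a single random split, one way to check both the accuracy difference and the run-time difference is cross-validation. The sketch below uses `cross_validate`, which reports fit times alongside scores. It runs on synthetic data with a smaller forest (`n_estimators=100`) to keep it quick, so its numbers are illustrative and not results for the lending data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the lending data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=4, random_state=2
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=2),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold is fit and scored separately.
    cv = cross_validate(model, X_demo, y_demo, cv=5)
    results[name] = {
        "mean_accuracy": cv["test_score"].mean(),
        "std_accuracy": cv["test_score"].std(),
        "mean_fit_seconds": cv["fit_time"].mean(),
    }
    print(name, results[name])
```

If the difference in mean accuracy is smaller than the fold-to-fold standard deviation, the two models are effectively tied on accuracy, and fit time becomes the more useful criterion.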