# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
file_path = Path("Resources/lending_data.csv")
df = pd.read_csv(file_path)
df.head()

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  derogatory_marks  total_debt  loan_status
0    10700.0          7.672            52800        0.431818                5                 1       22800            0
1     8400.0          6.692            43600        0.311927                3                 0       13600            0
2     9000.0          6.963            46100        0.349241                3                 0       16100            0
3    10700.0          7.664            52700        0.430740                5                 1       22700            0
4    10800.0          7.698            53000        0.433962                5                 1       23000            0


## Predict Model Performance

You will create and compare two models on this data: a Logistic Regression and a Random Forest Classifier. Before you create, fit, and score the models, predict which model you think will perform better. You do not need to be correct!

## Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

The data has seven features: loan size, interest rate, borrower income, debt-to-income ratio, number of accounts, derogatory marks, and total debt. The target variable is loan status, which indicates whether the loan was paid back or not. The features may relate to the target in complex ways; for example, borrower income and debt-to-income ratio may jointly affect the likelihood of default. The features may also interact with each other, as loan size and interest rate might. A random forest can capture nonlinear relationships and feature interactions without extra feature engineering, whereas logistic regression models a linear decision boundary, so I predict that the random forest will perform better. The only way to confirm this is to create, fit, and score both models on this data and compare the results.
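One quick way to probe the relationships mentioned above is a correlation matrix. The sketch below is illustrative only: it retypes four columns from the five rows shown by `df.head()` into a small frame named `sample` (a hypothetical name), so the numbers say nothing reliable about the full dataset. In the notebook itself, `df.corr()` on the loaded dataframe would be the equivalent call.

```python
import pandas as pd

# Four columns from the five rows displayed by df.head(), retyped as a tiny
# illustrative sample. Five rows are far too few to draw conclusions from.
sample = pd.DataFrame({
    "loan_size": [10700.0, 8400.0, 9000.0, 10700.0, 10800.0],
    "interest_rate": [7.672, 6.692, 6.963, 7.664, 7.698],
    "borrower_income": [52800, 43600, 46100, 52700, 53000],
    "total_debt": [22800, 13600, 16100, 22700, 23000],
})

# Pearson correlation between every pair of columns.
corr = sample.corr()
size_rate_corr = corr.loc["loan_size", "interest_rate"]
income_debt_corr = corr.loc["borrower_income", "total_debt"]

print(corr.round(3))
```

If the features turn out to be strongly correlated on the full data as well, they carry overlapping information, and a linear model may separate the classes about as well as a more flexible one.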

In [3]:
# Define the X (features) and y (target) sets
y = df["loan_status"].values
X = df.drop("loan_status", axis=1)

In [4]:
df.count()

loan_size           77536
interest_rate       77536
borrower_income     77536
debt_to_income      77536
num_of_accounts     77536
derogatory_marks    77536
total_debt          77536
loan_status         77536
dtype: int64

## Split the Data into Training and Testing Sets

In [5]:
# Split the data into X_train, X_test, y_train, y_test
# (train_test_split is imported in the first cell; the default test_size is 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
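The split above uses the default `test_size` of 0.25, so 19,384 of the 77,536 rows are held out for testing. When one class is rare, passing `stratify` is a common way to keep the class ratio the same in both splits. The sketch below shows the idea on synthetic stand-in data; the 3% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration, not values taken from the lending data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 7 features, 3% positive class
# (a hypothetical ratio chosen for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 7))
y_demo = np.array([1] * 30 + [0] * 970)

# Default test_size is 0.25; stratify=y_demo keeps the class ratio
# approximately equal in the training and testing splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

print(f"train rows: {len(y_tr)}, test rows: {len(y_te)}")
print(f"positive rate (train): {y_tr.mean():.3f}")
print(f"positive rate (test):  {y_te.mean():.3f}")
```

Adding `stratify=y` to the notebook's own split would change which rows land in each set, so the scores reported below would change slightly.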

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [6]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=10000)
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [7]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9919177328380795
Testing Data Score: 0.9924680148576145


In [8]:
# Optional: confusion matrix for the logistic regression model
from sklearn.metrics import confusion_matrix, classification_report

y_true = y_test
y_pred = classifier.predict(X_test)
confusion_matrix(y_true, y_pred)

array([[18699,    93],
       [   53,   539]], dtype=int64)

In [9]:
# Optional: classification report for the logistic regression model
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18792
           1       0.85      0.91      0.88       592

    accuracy                           0.99     19384
   macro avg       0.93      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384
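The figures in the report can be reproduced by hand from the confusion matrix above, which helps show why accuracy alone is not very informative here. The sketch below uses only the four counts from that matrix; `baseline_accuracy` is an added illustration (the accuracy of always predicting class 0), not something the notebook computes.

```python
# Counts copied from the logistic regression confusion matrix above.
# scikit-learn convention: rows are true labels, columns are predicted labels.
tn, fp = 18699, 93   # true class 0
fn, tp = 53, 539     # true class 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of loans predicted as class 1, the share that are class 1
recall = tp / (tp + fn)      # of actual class 1 loans, the share that were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts class 0.
baseline_accuracy = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"precision (1)     = {precision:.4f}")
print(f"recall (1)        = {recall:.4f}")
print(f"f1 (1)            = {f1:.4f}")
print(f"baseline accuracy = {baseline_accuracy:.4f}")
```

Since always predicting class 0 already scores about 0.97, the precision and recall for class 1 say more about model quality than the overall accuracy does.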



In [10]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

# Fit the model to the training data
clf = RandomForestClassifier(random_state=1, n_estimators=100).fit(X_train, y_train)

print(f'Training Score: {clf.score(X_train, y_train)}')
print(f'Testing Score: {clf.score(X_test, y_test)}')

Training Score: 0.9971798046498831
Testing Score: 0.9921068922822947


In [11]:
# Optional: confusion matrix for the random forest model
# Predict with clf (the random forest), not classifier (the logistic regression)
y_pred1 = clf.predict(X_test)
confusion_matrix(y_true, y_pred1)

In [12]:
# Optional: classification report for the random forest model
print(classification_report(y_test, y_pred1))



## Which model performed better? How does that compare to your prediction?

The logistic regression model scored slightly higher on the test set: 0.9925 versus 0.9921 for the random forest classifier, a difference of 7 loans out of the 19,384 in the test set. This differs from my prediction that the random forest would perform better. The gap is small enough that it may reflect the particular train/test split or the chosen hyperparameters rather than a real difference between the models. The random forest scored higher on the training data (0.9972) than on the test data (0.9921), which suggests mild overfitting, while the logistic regression scores were nearly identical on both (0.9919 and 0.9925). Because only 592 of the 19,384 test loans belong to class 1, accuracy alone is a weak basis for choosing a model here; precision and recall for class 1 are more informative. Further experimentation, such as cross-validation and hyperparameter tuning, is recommended before settling on a model for this problem.
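One way to carry out the further experimentation suggested above is to compare both models with cross-validation and a metric that is less sensitive to class imbalance than accuracy. The sketch below is a minimal example under stated assumptions: it runs on synthetic data from `make_classification` (roughly 3% positives, a hypothetical ratio), uses `balanced_accuracy` as one possible metric, and uses a smaller forest to keep the run short. Its scores are unrelated to the lending results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (about 3% positives); not the lending data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, n_informative=4, n_redundant=2,
    weights=[0.97, 0.03], random_state=42,
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=10000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=1),
}

# 5-fold cross-validation; for classifiers, scikit-learn stratifies the folds.
results = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="balanced_accuracy"
    )
    results[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

In the notebook, replacing `X_demo` and `y_demo` with `X` and `y` would apply the same comparison to the lending data. Comparing the spread across folds with the gap between the two means gives a rough sense of whether a difference is meaningful.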