# [Supervised ML] Credit Risk Analysis
---
## Step #0 - Import & Setup Dependencies

In [1]:
# Import the modules
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

## Step #1 - Split the Data into Training and Testing Sets
---
### Read the `lending_data.csv` data into a Pandas DataFrame.

In [2]:
# Using Pandas, store the Lending Dataset in a new DataFrame
lending_df = pd.read_csv("Resources/lending_data.csv")

lending_df

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


### Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
y = lending_df["loan_status"]
X = lending_df.drop(columns = "loan_status")

In [15]:
# Review the 'Labels' Dataset
print(y.value_counts())
print()
y

loan_status
0    75036
1     2500
Name: count, dtype: int64



0        0
1        0
2        0
3        0
4        0
        ..
77531    1
77532    1
77533    1
77534    1
77535    1
Name: loan_status, Length: 77536, dtype: int64

In [5]:
# Review the 'Features' DataFrame
X

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.430740,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000
...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600
77532,17700.0,10.662,80900,0.629172,11,2,50900
77533,17600.0,10.595,80300,0.626401,11,2,50300
77534,16300.0,10.068,75300,0.601594,10,2,45300


### Split the data into training and testing datasets by using `train_test_split`

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state = 1, 
                                                    stratify = y)
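
Passing `stratify = y` asks `train_test_split` to keep the share of high-risk loans roughly the same in the training and testing sets. The following is a minimal sketch of that effect, not part of the original notebook: it uses synthetic labels with the same class counts as `y.value_counts()` above and a placeholder feature column, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts reported by y.value_counts()
y_demo = np.array([0] * 75036 + [1] * 2500)
# Placeholder single-column feature matrix (assumption, stands in for X)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same arguments as the split above: default 25% test size, stratified on the labels
_, _, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# Share of high-risk loans (label 1) in each subset
overall_share = y_demo.mean()
train_share = y_train_demo.mean()
test_share = y_test_demo.mean()
print(f"overall: {overall_share:.4f}, train: {train_share:.4f}, test: {test_share:.4f}")
```

Without stratification, a rare class can end up over- or under-represented in the test set purely by chance, which makes per-class metrics harder to compare between runs.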

## Step #2 - Create a Logistic Regression Model with the Original Data
---
### Fit a logistic regression model by using the training data (`X_train` and `y_train`)

In [7]:
# Create a Logistic Regression Model
classifier = LogisticRegression(solver = 'lbfgs', random_state = 1)

# Fit (train) our model using the training data
classifier.fit(X_train, y_train)

### Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model

In [8]:
predictions = classifier.predict(X_test)

predict_df = pd.DataFrame({"Prediction": predictions, "Actual": y_test})

predict_df

Unnamed: 0,Prediction,Actual
36831,0,0
75818,0,1
36563,0,0
13237,0,0
43292,0,0
...,...,...
38069,0,0
36892,0,0
5035,0,0
40821,0,0


### Evaluate Model Performance - Confusion Matrix

In [9]:
confusion_matrix(y_test, predictions)

array([[18679,    80],
       [   67,   558]], dtype=int64)

### Evaluate Model Performance - Classification Report

In [10]:
target_names = ["Healthy Loan (0)", "High-Risk Loan (1)"]

print(classification_report(y_test, predictions, target_names = target_names))

                    precision    recall  f1-score   support

  Healthy Loan (0)       1.00      1.00      1.00     18759
High-Risk Loan (1)       0.87      0.89      0.88       625

          accuracy                           0.99     19384
         macro avg       0.94      0.94      0.94     19384
      weighted avg       0.99      0.99      0.99     19384



### Question - How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

#### Confusion Matrix Breakdown:

True Positive (TP) = **558** CORRECTLY Identified as High-Risk Loans (`1`)

True Negative (TN) = **18679** CORRECTLY Identified as Healthy Loans (`0`)

[Type I Error] False Positive (FP) =  **80** Healthy Loans (`0`) INCORRECTLY Identified as High-Risk Loans (`1`)

[Type II Error] False Negative (FN) = **67** High-Risk Loans (`1`) INCORRECTLY Identified as Healthy Loans (`0`)
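
The four counts above can be read directly from the confusion matrix. As a small sketch, assuming the matrix printed earlier is re-entered by hand as `cm` (a hypothetical name), scikit-learn's layout puts actual labels on rows and predicted labels on columns, so flattening a binary matrix yields TN, FP, FN, TP in that order:

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual label (0, 1), columns = predicted label (0, 1)
cm = np.array([[18679, 80],
               [67, 558]])

# Row-major flattening of a 2x2 matrix gives TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```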


#### Key Classification Report Metrics:

**Accuracy** = 99%

**Healthy Loan (`0`)**
   - Precision = 100%
   - Recall = 100%

**High-Risk Loan (`1`)**
   - Precision = 87%
   - Recall = 89%
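
These percentages follow from the confusion matrix counts. The sketch below recomputes them by hand as a cross-check on the classification report; the variable names are illustrative and the counts are copied from the confusion matrix output.

```python
# Counts copied from the confusion matrix output
tn, fp, fn, tp = 18679, 80, 67, 558

# High-risk loan (1): precision = TP / (TP + FP), recall = TP / (TP + FN)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Healthy loan (0): the roles of the two classes are swapped
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

# Accuracy = all correct predictions / all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"High-risk: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"Healthy:   precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"Accuracy:  {accuracy:.2f}")
```

Note that the healthy-loan figures are slightly below 1 before rounding, since 67 high-risk loans were predicted as healthy and 80 healthy loans were predicted as high-risk.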


#### Final Answer:

With an accuracy of 99%, the logistic regression model performs exceptionally well overall at predicting whether a loan is healthy (`0`) or high-risk (`1`).

Notably, when the model predicts a high-risk loan (`1`), it is correct 87% of the time (precision), and it identifies 89% of all actual high-risk loans (recall). In contrast, its healthy loan (`0`) predictions are correct close to 100% of the time.

While these metrics are favourable in general, the discrepancy is likely due to the uneven distribution of healthy vs high-risk loan datapoints in the original dataset from which the train/test datasets derive.

Of 77536 datapoints, only 3.2% (2500) are high-risk loans. This explains why the model is able to predict healthy loans more accurately than high-risk loans.
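
Because the classes are this uneven, it can help to read the 99% accuracy against a naive baseline. The sketch below is an illustrative addition rather than part of the original analysis: it uses the test-set counts from the confusion matrix to compute the accuracy of a hypothetical model that always predicts "healthy", and the balanced accuracy (the mean of the two per-class recalls) of the fitted model.

```python
# Test-set counts copied from the confusion matrix output
tn, fp, fn, tp = 18679, 80, 67, 558

healthy_total = tn + fp    # actual healthy loans in the test set
high_risk_total = fn + tp  # actual high-risk loans in the test set
total = healthy_total + high_risk_total

# Accuracy of a hypothetical baseline that labels every loan as healthy (0)
baseline_accuracy = healthy_total / total

# Accuracy of the fitted model, for comparison
model_accuracy = (tn + tp) / total

# Balanced accuracy: average recall over both classes,
# so the minority class counts as much as the majority class
balanced_accuracy = (tn / healthy_total + tp / high_risk_total) / 2

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Model accuracy:    {model_accuracy:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```

The balanced accuracy agrees with the macro-average recall of 0.94 in the classification report, which is a more conservative summary than the headline accuracy when one class dominates.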