# Credit Risk Classification

In this project, we'll use various techniques to train and evaluate a model based on loan risk. We'll work with a dataset of historical lending activity from a peer-to-peer lending services company to build a model that can identify the creditworthiness of borrowers.

### Importing the Necessary Libraries and Modules
Imports required libraries and modules for data manipulation, machine learning, and evaluation.

In [6]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

---

### Reading CSV File and Reviewing DataFrame
- Read the CSV file `../data/lending-data.csv` into a Pandas DataFrame.
- Display the first few rows of the DataFrame `df_lending_data` to get an overview of the data.

In [7]:
# Read the lending data CSV file into a Pandas DataFrame
df_lending_data = pd.read_csv('../data/lending-data.csv')

# Review the DataFrame
df_lending_data.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


### Separating Labels and Features
- Separates data into labels (`y`) and features (`X`) from `df_lending_data`.

**Note:**
- By convention, the variable name `X` is written in uppercase because it holds a two-dimensional structure (a matrix or DataFrame with one column per feature), while the lowercase `y` holds a one-dimensional vector of labels.
- In most machine learning contexts, `y` represents the dependent variable (target variable or labels), and `X` represents the independent variables (features or predictors).

In [8]:
# Separate the y variable, the labels
y = df_lending_data["loan_status"]

# Separate the X variable, the features
X = df_lending_data.drop(columns="loan_status")

### Reviewing Target Variable

- Displays summary statistics of the `y` variable (loan_status) using the `describe()` function.

In [9]:
y.describe()

count    77536.000000
mean         0.032243
std          0.176646
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: loan_status, dtype: float64

### Reviewing Feature Variables

- Displays the first few rows of the DataFrame `X` to inspect the feature variables.

In [10]:
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Checking Target Value Balance

- Counts occurrences of each unique value in the `y` variable (loan_status) to check target value balance.

In [11]:
y.value_counts()

0    75036
1     2500
Name: loan_status, dtype: int64
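
The counts above can be converted into class shares to quantify the imbalance. The following is a minimal sketch that rebuilds a label series from the two printed counts (the name `y_demo` is illustrative and not part of the notebook). It also shows that, for a 0/1 label, the mean reported by `describe()` is simply the share of high-risk loans.

```python
import pandas as pd

# Rebuild a label series from the counts printed above (illustrative stand-in for y)
y_demo = pd.Series([0] * 75036 + [1] * 2500, name="loan_status")

# normalize=True returns each class's share of all rows instead of raw counts
class_shares = y_demo.value_counts(normalize=True)
print(class_shares)

# For a binary 0/1 label, the mean equals the share of the positive class
high_risk_share = y_demo.mean()
print(f"High-risk share: {high_risk_share:.6f}")
```

Roughly 3% of the loans are high-risk, which is why plain accuracy alone is a weak yardstick for this dataset.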

### Splitting Data into Train and Test Sets

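- Splits data into training and testing datasets using `train_test_split` with a `random_state = 1`. Because no `test_size` is given, scikit-learn's default applies and 25% of the rows go to the testing dataset.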

**Note:**
- The `random_state` parameter sets a fixed seed value for the random number generator that shuffles the data before splitting. This seed value ensures that the data split is reproducible.
- Running the code with the same `random_state` value on the same data will always produce the same split.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
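
The reproducibility described in the note can be checked directly. The sketch below uses small made-up arrays (`X_demo` and `y_demo` are hypothetical, not the lending data) and also shows the optional `stratify` argument, which the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 4 positive labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 16 + [1] * 4)

# Two calls with the same seed return identical splits
split_a = train_test_split(X_demo, y_demo, random_state=1)
split_b = train_test_split(X_demo, y_demo, random_state=1)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)

# With no test_size given, 25% of the rows are held out for testing
print("Test rows:", len(split_a[1]))

# Optional: stratify keeps the class ratio the same in both subsets
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)
print("Positive rows in stratified test set:", int(y_test_strat.sum()))
```

The notebook's split is not stratified. With only about 3% positives, passing `stratify=y` is one option worth considering so that the test set mirrors the overall class ratio.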

---

### Instantiating and Fitting the Logistic Regression Model

- Creates an instance of the logistic regression model with a `random_state = 1`.
- Fits the logistic regression model using the training data (`X_train` and `y_train`).

In [13]:
# Instantiate the Logistic Regression model
logistic_regression_model = LogisticRegression(random_state=1)

# Fit the model using training data
lr_model_fit = logistic_regression_model.fit(X_train, y_train)

### Making Predictions with Logistic Regression Model

- Use the trained (fitted) logistic regression model to predict loan risk labels for the testing data (`X_test`).

In [14]:
# Make a prediction using the testing data
testing_predictions = lr_model_fit.predict(X_test)

### Evaluating the Model's Performance

- Calculate the Balanced Accuracy Score
    - The balanced accuracy score is the average of the recall obtained on each class. Because each class carries equal weight regardless of its size, it gives a fairer evaluation than plain accuracy when the classes are imbalanced.

- Generate the Confusion Matrix
    - The confusion matrix displays the count of correct and incorrect predictions for each class in a classification model, helping to assess the model's performance and identify potential misclassifications.
    
- Print the Classification Report
    - The classification report provides a comprehensive evaluation of a classification model's performance, including precision, recall, F1-score, and support for each class, aiding in assessing the model's effectiveness for different classes.
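
To see why balanced accuracy matters for imbalanced classes, consider a hypothetical baseline that labels every loan as healthy. The sketch below uses the class counts from the `value_counts()` output earlier in the notebook; the baseline itself is an illustration and not something the notebook trains.

```python
# Class counts from the value_counts() output earlier in the notebook
n_healthy, n_high_risk = 75036, 2500

# Hypothetical baseline: predict "healthy" (0) for every loan
true_negatives, false_positives = n_healthy, 0
false_negatives, true_positives = n_high_risk, 0

# Plain accuracy: share of all loans labeled correctly
accuracy = (true_negatives + true_positives) / (n_healthy + n_high_risk)

# Balanced accuracy: mean of the per-class recalls
recall_healthy = true_negatives / (true_negatives + false_positives)
recall_high_risk = true_positives / (true_positives + false_negatives)
balanced_accuracy_baseline = (recall_healthy + recall_high_risk) / 2

print(f"Accuracy: {accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy_baseline:.4f}")
```

The baseline reaches about 0.968 accuracy without detecting a single high-risk loan, while its balanced accuracy is only 0.5.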

### Calculate the Balanced Accuracy Score

- Calculate the balanced accuracy score for the logistic regression model's predictions on the testing data.

In [15]:
# Calculate the balanced accuracy score
# Print the balanced_accuracy score of the model
balanced_accuracy = balanced_accuracy_score(y_test, testing_predictions)
print("Balanced Accuracy Score:", balanced_accuracy)

Balanced Accuracy Score: 0.9520479254722232


### Generate the Confusion Matrix

- Generate a confusion matrix for the logistic regression model's predictions on the testing data.

In [16]:
# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, testing_predictions)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[18663   102]
 [   56   563]]
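
The balanced accuracy score reported earlier can be recovered from this matrix by hand. scikit-learn's `confusion_matrix` places true labels on the rows and predicted labels on the columns, so each diagonal entry divided by its row sum is the recall for that class. A short sketch using the printed values (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[18663, 102],
               [56, 563]])

# Recall per class = correct predictions on the diagonal / true count per row
per_class_recall = np.diag(cm) / cm.sum(axis=1)

# Balanced accuracy = unweighted mean of the per-class recalls
balanced_acc_from_cm = per_class_recall.mean()

print("Recall per class:", per_class_recall.round(4))
print("Balanced accuracy:", balanced_acc_from_cm)
```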


### Print the Classification Report

- Print the classification report for the logistic regression model's predictions on the testing data.

In [17]:
# Print the classification report for the model
print("Classification Report:")
print(classification_report(y_test, testing_predictions))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



### How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The logistic regression model performs well in predicting both the `0` (healthy loan) and `1` (high-risk loan) labels, with high precision and recall scores for both classes.
- For the `0` (healthy loan) class: High precision (1.00) and high recall (0.99) indicate the model is almost always correct.
- For the `1` (high-risk loan) class: Precision (0.85) and recall (0.91) are high but noticeably lower than for the healthy loans. The model missed 56 of the 619 high-risk loans in the testing data and flagged 102 healthy loans as high-risk.

**Note:** Precision and recall are computed per class, with the class being scored treated as the "positive" class.
- Precision (Positive Predictive Value) = True Positives / (True Positives + False Positives).
    - (i.e., of all loans the model predicted to belong to a class, the proportion that actually belong to it)
- Recall (Sensitivity) = True Positives / (True Positives + False Negatives).
    - (i.e., of all loans that actually belong to a class, the proportion the model correctly identified)
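
As a worked example, the formulas can be applied to the high-risk class (`1`) using the counts from the confusion matrix above. This is a small sketch with illustrative variable names; the F1-score is included because the classification report shows it as well.

```python
# Counts for the high-risk class (label 1) from the confusion matrix above
true_positives = 563   # high-risk loans predicted as high-risk
false_positives = 102  # healthy loans predicted as high-risk
false_negatives = 56   # high-risk loans predicted as healthy

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.85 recall=0.91 f1=0.88
```

These values match the `1` row of the classification report.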

---

### Using Resampled Data, Use a Logistic Regression Model to Predict the Credit Risk Level of Borrowers

**Note:**

- Confirm that the labels have an equal number of data points.
    - `RandomOverSampler` increases the number of samples in the minority class (in this case, high-risk loans) until the class distribution is balanced.
    - It does so by randomly duplicating existing minority rows (sampling with replacement) rather than creating new, synthetic ones.
    - Only the training data is resampled. The testing data keeps its original class distribution, which preserves the integrity of the evaluation.
    - Because the same minority rows are repeated many times, the model may memorize them rather than learn meaningful patterns, which can result in poor performance on new, unseen data.

### Instantiate the Random Oversampler Model and Resample the Training Data with `RandomOverSampler`

- Create an instance of the random oversampler model with a `random_state = 1`.
- Resample the original training data (`X_train` and `y_train`) with `fit_resample` to create a balanced training dataset.

**Note:**
- `RandomOverSampler` balances the training data by randomly duplicating samples of the minority class, which reduces the bias toward the majority class that class imbalance can cause.
- The minority class is the class with fewer samples in the dataset. Of the two classes here, healthy loans (`loan_status = 0`) and high-risk loans (`loan_status = 1`), the high-risk loans are the minority.

In [18]:
# Instantiate the random oversampler model with a random_state of 1
random_oversampler_model = RandomOverSampler(random_state=1)

# Fit the original training data to the random_oversampler model
X_train_resampled, y_train_resampled = random_oversampler_model.fit_resample(X_train, y_train)
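
To illustrate what random oversampling does to the data, here is a minimal NumPy sketch of the idea. It is not imbalanced-learn's implementation; the helper `random_oversample` and the toy arrays are hypothetical and exist only for this example.

```python
import numpy as np

def random_oversample(X, y, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # keep every original row
    for cls, count in zip(classes, counts):
        shortfall = target - count
        if shortfall > 0:
            cls_rows = np.flatnonzero(y == cls)
            # Sample existing rows with replacement; no new feature values are created
            keep.append(rng.choice(cls_rows, size=shortfall, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1)
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))             # [8 8]
print(np.unique(X_bal[y_bal == 1]))   # [8 9]
```

The minority class grows to the size of the majority class, but its feature values are only ever copies of the original two rows.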

### Counting Resampled Labels

- Count and print the distinct values in the resampled training labels (`y_train_resampled`) to check the oversampling effect.

In [19]:
# Count the distinct values of the resampled labels data
distinct_values_count = pd.Series(y_train_resampled).value_counts()

print(distinct_values_count)

0    56271
1    56271
Name: loan_status, dtype: int64
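
The resampled counts can be cross-checked against numbers already shown in the notebook: the overall class counts and the test-set `support` column of the classification report. The sketch below is plain arithmetic with illustrative variable names.

```python
# Overall class counts (from y.value_counts()) and test-set counts
# (from the `support` column of the classification report)
total_healthy, total_high_risk = 75036, 2500
test_healthy, test_high_risk = 18765, 619

# Rows of each class that ended up in the training data
train_healthy = total_healthy - test_healthy
train_high_risk = total_high_risk - test_high_risk

# Duplicated high-risk rows needed to match the healthy class
duplicates_added = train_healthy - train_high_risk

print(train_healthy, train_high_risk, duplicates_added)
# prints: 56271 1881 54390
```

The `56271` healthy training rows match the resampled count above, so the oversampler added `54390` duplicated high-risk rows, meaning each original high-risk training row appears roughly 30 times on average.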


### Instantiate, Fit, and Predict with the Logistic Regression Model on the Resampled Data

- Create a new instance of the logistic regression model with a `random_state = 1` for the resampled data.
- Fit the logistic regression model using the resampled training data (`X_train_resampled` and `y_train_resampled`).
- Use the trained logistic regression model with the resampled data to predict loan risk labels for the testing data (`X_test`).

**Note:**
- The resampled data, in this case, is oversampled to increase the representation of the minority class (high-risk loans) in the training dataset.

In [20]:
# Instantiate a new Logistic Regression model for the resampled data
logistic_regression_model = LogisticRegression(random_state=1)

# Fit the model using the resampled training data
lr_model_resampled = logistic_regression_model.fit(X_train_resampled, y_train_resampled)

# Make a prediction using the original (not resampled) testing data
testing_predictions_resampled = lr_model_resampled.predict(X_test)

### Calculate the Balanced Accuracy Score (Resampled)

- Calculate the balanced accuracy score for the predictions that the model trained on resampled data makes on the testing data.

In [21]:
# Calculate the balanced accuracy score
# Print the balanced_accuracy score of the model
balanced_accuracy = balanced_accuracy_score(y_test, testing_predictions_resampled)
print("Balanced Accuracy Score:", balanced_accuracy)

Balanced Accuracy Score: 0.9936781215845847


### Generate the Confusion Matrix (Resampled)

- Generate a confusion matrix for the predictions that the model trained on resampled data makes on the testing data.

In [22]:
# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, testing_predictions_resampled)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[18649   116]
 [    4   615]]


### Print the Classification Report (Resampled)

- Print the classification report for the predictions that the model trained on resampled data makes on the testing data.

In [23]:
# Print the classification report for the model
print("Classification Report:")
print(classification_report(y_test, testing_predictions_resampled))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384



### How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:**
- The original data is imbalanced, with `75036` healthy loans compared to only `2500` high-risk loans. Training on the oversampled data improved the balanced accuracy score from `~0.952` to `~0.994` and improved the recall for the high-risk loans from `0.91` to `0.99` (missed high-risk loans fell from `56` to `4`). This came with a slight decrease in precision for the high-risk loans, from `0.85` to `0.84` (healthy loans flagged as high-risk rose from `102` to `116`). These results support a case for using oversampling to address the original imbalanced data, particularly when missing a high-risk loan is costlier than flagging a healthy one.
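
The trade-off can be read directly off the two confusion matrices printed earlier. The sketch below compares them side by side; the helper `high_risk_errors` and the variable names are illustrative.

```python
import numpy as np

# Confusion matrices printed earlier: rows are true labels, columns are predictions
cm_original = np.array([[18663, 102], [56, 563]])
cm_resampled = np.array([[18649, 116], [4, 615]])

def high_risk_errors(cm):
    """Return precision, recall, false positives, and false negatives for label 1."""
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fp), tp / (tp + fn), int(fp), int(fn)

p_orig, r_orig, fp_orig, fn_orig = high_risk_errors(cm_original)
p_res, r_res, fp_res, fn_res = high_risk_errors(cm_resampled)

print(f"original:  precision={p_orig:.2f} recall={r_orig:.2f}")
print(f"resampled: precision={p_res:.2f} recall={r_res:.2f}")
print("Fewer missed high-risk loans:", fn_orig - fn_res)
print("Additional false alarms:", fp_res - fp_orig)
```

On this testing data, oversampling catches 52 more high-risk loans at the cost of 14 more healthy loans flagged as high-risk.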

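**Note:** Oversampling, in the context of binary classification, adds samples to the minority class so that both classes have a more balanced representation in the training data. `RandomOverSampler` does this by duplicating existing minority samples chosen at random with replacement, whereas methods such as SMOTE generate synthetic samples instead.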