In [27]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [3]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df_lending_data = pd.read_csv("Resources/lending_data.csv")

# Review the DataFrame
df_lending_data


Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [4]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df_lending_data['loan_status']

# Separate the X variable, the features (all columns except the target)
X = df_lending_data.drop(columns=['loan_status'])

In [5]:
# Review the y variable Series
y.head()

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [6]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [7]:
# Check the balance of our target values
y.value_counts()

0    75036
1     2500
Name: loan_status, dtype: int64
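
The counts above describe a strongly imbalanced target. A minimal sketch of the arithmetic, using only the two counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
healthy_count = 75036    # loan_status == 0
high_risk_count = 2500   # loan_status == 1

total = healthy_count + high_risk_count
high_risk_share = high_risk_count / total
imbalance_ratio = healthy_count / high_risk_count

print(f"Total loans: {total}")
print(f"High-risk share: {high_risk_share:.2%}")
print(f"Healthy loans per high-risk loan: {imbalance_ratio:.1f}")
```

With the notebook's `y` Series, `y.value_counts(normalize=True)` should report the same shares directly.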

### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [8]:
# train_test_split is imported in the first cell, along with the other libraries

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
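
`train_test_split` is called here without a `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. The sketch below reproduces the expected split sizes by hand. The row count comes from the `value_counts` output; rounding the test size up is an assumption about how scikit-learn handles a fractional `test_size`, and it makes no difference here because the row count divides evenly by four.

```python
import math

n_rows = 75036 + 2500          # total rows, from the value_counts() output
default_test_fraction = 0.25   # scikit-learn default when test_size is not given

n_test = math.ceil(default_test_fraction * n_rows)  # assumed rounding rule
n_train = n_rows - n_test

print(n_train, n_test)
```

The resulting test size agrees with the `support` total of 19,384 reported in the classification reports.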

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [9]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_regression_model = LogisticRegression(random_state=1)

# Fit the model using training data
lr_model = logistic_regression_model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [10]:
# Make a prediction using the testing data
testing_predictions = logistic_regression_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [11]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Print the accuracy score of the model
# (stored under a new name so the accuracy_score function is not overwritten)
test_accuracy = accuracy_score(y_test, testing_predictions)
print("Accuracy Score:", test_accuracy)

# Generate a confusion matrix for the model
test_matrix = confusion_matrix(y_test, testing_predictions)
print("Confusion Matrix:")
print(test_matrix)

# Print the classification report for the model
class_report = classification_report(y_test, testing_predictions)
print("Classification Report:")
print(class_report)

Accuracy Score: 0.9918489475856377
Confusion Matrix:
[[18663   102]
 [   56   563]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:**
The primary purposes of this analysis are risk assessment and improved operational efficiency:
1. Risk assessment
The analysis predicts the risk associated with each loan by classifying it as "healthy" (likely to be repaid without issues) or "high-risk" (more likely to default or run into payment problems). This is valuable for banks and other financial institutions, credit agencies, and private lenders, who need to make informed decisions about lending and managing loan portfolios.
2. Operational efficiency and data-driven (credit-score) decisions
A predictive model of this kind can help automate the loan application process by making the risk assessment explicit in the data, which makes decisions faster and more efficient. It also reduces the need for manual or biased decisions, which can lead to losses from customers who default or to lost business from applicants who were rejected even though their loans would have been healthy.

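About the results from the model's report above:

Accuracy Score:
The model's accuracy is very high at 99.18%. Because about 97% of the test loans are healthy, accuracy alone says little about the high-risk class.
Precision and recall are well balanced for '0' (healthy loans) and somewhat lower for '1' (high-risk loans).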
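
Since the healthy class dominates the test set, it can help to put plain accuracy next to balanced accuracy (the mean of the per-class recalls). A minimal sketch using the counts from the confusion matrix output above, with a hypothetical baseline that labels every loan healthy added for comparison:

```python
# Counts copied from the confusion matrix printed above
# (rows = actual class, columns = predicted class).
tn, fp = 18663, 102   # actual healthy (0)
fn, tp = 56, 563      # actual high-risk (1)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

recall_healthy = tn / (tn + fp)
recall_high_risk = tp / (tp + fn)
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

# Hypothetical baseline: predict "healthy" for every loan in the test set.
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:          {accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

In the notebook itself, `balanced_accuracy_score(y_test, testing_predictions)` (imported in the first cell) should give the same balanced figure.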

Confusion Matrix:
True Positives (TP): 563 high-risk loans were correctly classified as such.
True Negatives (TN): 18,663 healthy loans were correctly classified as such.
False Positives (FP): 102 healthy loans were incorrectly classified as high-risk.
False Negatives (FN): 56 high-risk loans were incorrectly classified as healthy.
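
The per-class figures in the classification report follow directly from these four counts. A short sketch of the arithmetic (the helper name `prf` is illustrative):

```python
def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class from its raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the list above, with high-risk (1) as the positive class.
tp, tn, fp, fn = 563, 18663, 102, 56

# High-risk loans (1): hits are TP, false alarms are FP, misses are FN.
high_risk = prf(tp, fp, fn)

# Healthy loans (0): the roles swap. Hits are TN, loans wrongly predicted
# healthy are FN, and healthy loans the model missed are FP.
healthy = prf(tn, fn, fp)

print("high-risk:", [round(v, 2) for v in high_risk])
print("healthy:  ", [round(v, 2) for v in healthy])
```

Rounded to two decimals, these values match the precision, recall, and f1-score columns of the report.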

Healthy Loans '0':

Precision '0': The precision for healthy loans is 1.00 (rounded), which indicates that when the model predicts a loan as healthy (0),
it is almost always correct. Only 56 of the 18,719 loans predicted as healthy were actually high-risk.

Recall '0': The recall for healthy loans is 0.99, meaning the model correctly identified 99% of the actual healthy loans. The remaining 102 healthy loans were flagged as high-risk.

F1-Score '0': The F1-score for healthy loans rounds to 1.00 (about 0.996 before rounding), showing that the model identifies healthy loans consistently and accurately.

High-Risk Loans '1':

Precision '1': The precision for high-risk loans is 0.85, indicating that 85% of the loans predicted as high-risk were indeed high-risk. This is noticeably lower than for healthy loans (102 of the 665 flagged loans were actually healthy), but it is still a strong result.

Recall '1': The recall for high-risk loans is 0.91, meaning the model correctly identified 91% of the actual high-risk loans. It captures most of them but still misses 56 of the 619.

F1-Score '1': The F1-score for high-risk loans is 0.88, the harmonic mean of precision and recall. It reflects the model's ability to classify high-risk loans correctly while keeping false positives reasonably low.

Summary:

The logistic regression model performs well at distinguishing healthy from high-risk loans,
although it is clearly stronger on the healthy class.

For healthy loans, it achieves near-perfect precision, recall, and F1-score, indicating that it identifies
healthy loans with very few false positives or false negatives.

For high-risk loans, precision (0.85) and recall (0.91) are reasonably balanced, giving an F1-score of 0.88.
The model classifies most high-risk loans correctly, but it misses about 9% of them and flags some healthy
loans as high-risk.

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [17]:
# Import the RandomOverSampler module from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
ros = RandomOverSampler(sampling_strategy='auto', random_state=1)

# Fit the original training data to the random_oversampler model
X_train_oversample, y_train_oversample = ros.fit_resample(X_train, y_train)

In [19]:
# Count the distinct values of the resampled labels data
pd.Series(y_train_oversample).value_counts()

0    56271
1    56271
Name: loan_status, dtype: int64
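
Random oversampling balances the classes by drawing minority-class rows with replacement until both classes are the same size. The toy sketch below illustrates the idea with the standard library only. It is not imbalanced-learn's implementation, and the data and the function name `random_oversample` are made up for illustration.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data: eight "healthy" rows (0) and two "high-risk" rows (1).
toy_rows = [[i] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2

new_rows, new_labels = random_oversample(toy_rows, toy_labels)
print(Counter(new_labels))
```

In the notebook, the same effect shows up as 56,271 rows per class: the 1,881 high-risk training rows are resampled up to match the 56,271 healthy ones. Only the training set is resampled; `X_test` and `y_test` are left untouched.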

### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [22]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model = LogisticRegression(random_state=1)

# Fit the model using the resampled training data
logistic_model.fit(X_train_oversample, y_train_oversample)

# Make a prediction using the testing data
y_pred = logistic_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [28]:
# Print the accuracy score of the model
model_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score:", model_accuracy)

# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report for the model
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

Accuracy Score: 0.9938093272802311
Confusion Matrix:
[[18649   116]
 [    4   615]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:**
The overall accuracy of the model is very high at 99.38%, slightly above the 99.18% of the model fit on the original data.

Healthy Loans '0':

Precision '0': The precision for healthy loans is 1.00 (rounded). When the model predicts a loan as healthy (0), it is virtually always correct: only 4 of the 18,653 loans predicted as healthy were actually high-risk.

Recall '0': The recall for healthy loans is 0.99, meaning the model correctly identified 99% of the actual healthy loans. It rarely misses a healthy loan, although 116 were flagged as high-risk.

F1-Score '0': The F1-score for healthy loans rounds to 1.00. The F1-score balances precision and recall, and here it shows that the model identifies healthy loans consistently and accurately.

High-Risk Loans '1':

Precision '1': The precision for high-risk loans is 0.84, meaning that 84% of the loans predicted as high-risk were indeed high-risk. This is lower than the precision for healthy loans, but it is still a strong result.

Recall '1': The recall for high-risk loans is 0.99, meaning the model correctly identified 99% of the actual high-risk loans. It missed only 4 of the 619.

F1-Score '1': The F1-score for high-risk loans is 0.91. It combines the very high recall with the lower precision and indicates that the model classifies high-risk loans correctly while keeping false positives reasonably low.

Summary:

The logistic regression model fit with oversampled data predicts both healthy and high-risk loans well.
For healthy loans, it maintains very high precision and recall, correctly identifying the vast majority of healthy loans with few false positives or false negatives.
For high-risk loans, recall rises to 0.99 while precision stays at about 0.84, giving an F1-score of 0.91. The model identifies nearly all high-risk loans while keeping false positives relatively low.
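
To make the comparison between the two models concrete, the sketch below puts the high-risk figures from both confusion matrices side by side. The numbers are copied from the outputs above; the dictionary layout and key names are just for illustration.

```python
# (TN, FP, FN, TP) copied from the two confusion matrices above.
models = {
    "original":    (18663, 102, 56, 563),
    "oversampled": (18649, 116, 4, 615),
}

summary = {}
for name, (tn, fp, fn, tp) in models.items():
    summary[name] = {
        "missed_high_risk": fn,
        "false_alarms": fp,
        "recall_high_risk": tp / (tp + fn),
        "precision_high_risk": tp / (tp + fp),
    }
    print(name, summary[name])

extra_caught = models["original"][2] - models["oversampled"][2]
extra_false_alarms = models["oversampled"][1] - models["original"][1]
print(f"Oversampling catches {extra_caught} more high-risk loans "
      f"at the cost of {extra_false_alarms} more false alarms.")
```

Whether that trade is worthwhile depends on the relative cost of a missed default versus a wrongly flagged healthy loan, which this notebook does not quantify.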