In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path("Credit_Risk/lending_data.csv")
df_lending = pd.read_csv(file_path)

# Review the DataFrame
df_lending.head()

|   | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df_lending['loan_status']

# Separate the X variable, the features
X = df_lending.drop('loan_status', axis=1)

In [4]:
# Review the y variable Series
print(y)

0        0
1        0
2        0
3        0
4        0
        ..
77531    1
77532    1
77533    1
77534    1
77535    1
Name: loan_status, Length: 77536, dtype: int64


In [5]:
# Review the X variable DataFrame
print(X)

       loan_size  interest_rate  borrower_income  debt_to_income  \
0        10700.0          7.672            52800        0.431818   
1         8400.0          6.692            43600        0.311927   
2         9000.0          6.963            46100        0.349241   
3        10700.0          7.664            52700        0.430740   
4        10800.0          7.698            53000        0.433962   
...          ...            ...              ...             ...   
77531    19100.0         11.261            86600        0.653580   
77532    17700.0         10.662            80900        0.629172   
77533    17600.0         10.595            80300        0.626401   
77534    16300.0         10.068            75300        0.601594   
77535    15600.0          9.742            72300        0.585062   

       num_of_accounts  derogatory_marks  total_debt  
0                    5                 1       22800  
1                    3                 0       13600  
2                 

### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [6]:
# Check the balance of our target values
balance = y.value_counts()
print(balance)

0    75036
1     2500
Name: loan_status, dtype: int64
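
The counts above point to a heavily imbalanced target. As a rough illustration that uses only the two counts printed above (the variable names are chosen here for illustration), the sketch below works out each class's share and the accuracy a trivial model would reach by always predicting `0`:

```python
# Class counts copied from the value_counts() output above
healthy_count = 75036    # loan_status == 0
high_risk_count = 2500   # loan_status == 1
total = healthy_count + high_risk_count

# Share of each class in the full dataset
healthy_share = healthy_count / total
high_risk_share = high_risk_count / total

# A model that always predicts 0 is right on every healthy loan
# and wrong on every high-risk loan
always_zero_accuracy = healthy_count / total

print(f"Healthy share: {healthy_share:.4f}")
print(f"High-risk share: {high_risk_share:.4f}")
print(f"Accuracy of always predicting 0: {always_zero_accuracy:.4f}")
```

Because a model that never flags a high-risk loan would still score well on plain accuracy, the later steps also report balanced accuracy and per-class precision and recall.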


### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model = LogisticRegression(random_state=1)

# Fit the model using training data
logistic_model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
y_predict = logistic_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [10]:
# Import the evaluation metrics
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix, classification_report

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_predict)
print(f"Accuracy: {accuracy:.2f}")

# Generate a confusion matrix (rows are actual labels, columns are predicted labels)
confusion_mat = confusion_matrix(y_test, y_predict)
print("Confusion Matrix:")
print(confusion_mat)

# Print the classification report
class_report = classification_report(y_test, y_predict)
print("Classification Report:")
print(class_report)

# Print the balanced accuracy score of the model (the mean of the per-class recall)
balanced_accuracy = balanced_accuracy_score(y_test, y_predict)
print(f"Balanced Accuracy: {balanced_accuracy:.2f}")

Accuracy: 0.99
Confusion Matrix:
[[14926    75]
 [   46   461]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15001
           1       0.86      0.91      0.88       507

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508

Balanced Accuracy: 0.95


In [11]:
# Generate a confusion matrix for the model
# Store the result under a new name so the imported confusion_matrix function is not overwritten
confusion_mat = confusion_matrix(y_test, y_predict)
print("Confusion Matrix:")
print(confusion_mat)

Confusion Matrix:
[[14926    75]
 [   46   461]]


In [12]:
# Print the classification report for the model
# Store the result under a new name so the imported classification_report function is not overwritten
class_report = classification_report(y_test, y_predict)
print("Classification Report:")
print(class_report)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15001
           1       0.86      0.91      0.88       507

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The model predicts healthy loans (`0`) almost perfectly and high-risk loans (`1`) reasonably well, based on the classification report and the confusion matrix for the 15,508 loans in the test set.

Precision: Precision is the share of loans predicted as a given label that actually carry that label. For label 0 it rounds to 1.00. For label 1 it is 0.86: of the 536 loans the model flagged as high-risk, 461 really were high-risk and 75 were healthy loans flagged in error.

Recall: Recall is the share of loans that actually carry a given label that the model identified correctly. For label 0 it rounds to 1.00 (14,926 of 15,001 healthy loans). For label 1 it is 0.91: the model caught 461 of the 507 high-risk loans and missed 46.

F1-Score: The F1-Score is the harmonic mean of precision and recall. It is 1.00 for label 0 and 0.88 for label 1, so the model is noticeably weaker on the minority class.

Confusion Matrix: With actual labels in rows and predicted labels in columns, and label 1 treated as the positive class, the matrix shows 14,926 true negatives, 75 false positives, 46 false negatives, and 461 true positives.

Overall, the accuracy of 0.99 is dominated by the majority class. The balanced accuracy of 0.95 gives a fairer picture, and the main weakness is the 46 high-risk loans that the model classifies as healthy.
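
As a worked check, the label `1` metrics in the classification report can be recomputed by hand from the confusion matrix printed above. This is a minimal sketch that assumes scikit-learn's layout of actual labels in rows and predicted labels in columns; the variable names are illustrative.

```python
# Confusion matrix values copied from the output above,
# treating label 1 (high-risk loan) as the positive class
tn, fp = 14926, 75   # actual 0: predicted 0, predicted 1
fn, tp = 46, 461     # actual 1: predicted 0, predicted 1

# Precision: of the loans flagged as high-risk, how many really were
precision_1 = tp / (tp + fp)

# Recall: of the loans that really were high-risk, how many were flagged
recall_1 = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Balanced accuracy: mean of the recall for each class
recall_0 = tn / (tn + fp)
balanced_acc = (recall_0 + recall_1) / 2

# Plain accuracy: all correct predictions over all predictions
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"balanced accuracy={balanced_acc:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals, these values agree with the report: 0.86 precision, 0.91 recall, and 0.88 F1 for label 1, with a balanced accuracy of 0.95 and an accuracy of 0.99.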

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [13]:
# Import the RandomOverSampler module from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Split the data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
random_oversampler = RandomOverSampler(random_state=1)

# Fit the original training data to the random_oversampler model
X_resampled, y_resampled = random_oversampler.fit_resample(X_train, y_train)

In [14]:
# Count the distinct values of the resampled labels data
distinct_values = pd.Series(y_resampled).value_counts()
print(distinct_values)

0    60035
1    60035
Name: loan_status, dtype: int64
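
Conceptually, random oversampling draws minority-class rows with replacement until both classes are the same size. The sketch below illustrates that idea in plain Python on a toy dataset. The function name `random_oversample` and the toy rows are made up for illustration, and this is not imbalanced-learn's actual implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Indices of the original rows that belong to this class
        idx = [i for i, lab in enumerate(labels) if lab == label]
        # Draw the shortfall with replacement from those rows
        for i in rng.choices(idx, k=target - count):
            out_rows.append(rows[i])
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (hypothetical): four majority rows and one minority row
toy_rows = ["a", "b", "c", "d", "e"]
toy_labels = [0, 0, 0, 0, 1]

res_rows, res_labels = random_oversample(toy_rows, toy_labels)
print(Counter(res_labels))
```

Only the training split is resampled in this notebook, which is why both classes end up with 60,035 rows: that is the number of healthy loans in `y_train`. The test set keeps its original class distribution.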


### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [15]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model_resampled = LogisticRegression(random_state=1)

# Fit the model using the resampled training data
logistic_model_resampled.fit(X_resampled, y_resampled)

# Make a prediction using the testing data
y_predict_resampled = logistic_model_resampled.predict(X_test)


### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [16]:
# Import the evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, balanced_accuracy_score

# Calculate the accuracy score of the model
accuracy_resampled = accuracy_score(y_test, y_predict_resampled)
print(f"Accuracy: {accuracy_resampled:.2f}")

Accuracy: 0.99


In [17]:
# Generate a confusion matrix for the model
confusion_mat_resampled = confusion_matrix(y_test, y_predict_resampled)
print("Confusion Matrix:")
print(confusion_mat_resampled)

Confusion Matrix:
[[14915    86]
 [    3   504]]


In [18]:
# Print the classification report for the model
class_report_resampled = classification_report(y_test, y_predict_resampled)
print("Classification Report:")
print(class_report_resampled)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     15001
           1       0.85      0.99      0.92       507

    accuracy                           0.99     15508
   macro avg       0.93      0.99      0.96     15508
weighted avg       1.00      0.99      0.99     15508



In [19]:
# Print the balanced accuracy score of the model
balanced_accuracy_resampled = balanced_accuracy_score(y_test, y_predict_resampled)
print(f"Balanced Accuracy: {balanced_accuracy_resampled:.2f}")

Balanced Accuracy: 0.99


### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The logistic regression model fit with oversampled data predicts both the 0 (healthy loan) and 1 (high-risk loan) labels well, and it catches far more high-risk loans than the model fit on the original data:

1. **Accuracy**: The model achieves an accuracy of 0.99, so it correctly classifies about 99% of the 15,508 test loans. Because healthy loans make up most of the test set, accuracy alone says little about label 1.

2. **Confusion Matrix**: Of the 15,001 healthy loans, 14,915 are classified correctly and 86 are flagged as high-risk. Of the 507 high-risk loans, 504 are identified and only 3 are missed.

3. **Classification Report**: For label 0, precision is 1.00, recall is 0.99, and the F1-score is 1.00. For label 1, precision is 0.85, recall is 0.99, and the F1-score is 0.92. The lower precision for label 1 means that roughly 15% of the loans flagged as high-risk are actually healthy.

4. **Balanced Accuracy**: The balanced accuracy score is 0.99. Because it averages the recall of the two classes, it is not inflated by the majority class.

In summary, compared with the model trained on the original data, recall for high-risk loans rises from 0.91 to 0.99 (46 missed loans drop to 3), while precision slips from 0.86 to 0.85 (75 false positives rise to 86). If missing a high-risk loan costs more than reviewing a healthy loan flagged in error, the oversampled model is the better choice.
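
To make the comparison with the first model concrete, the sketch below recomputes the label `1` precision and recall, plus the balanced accuracy, for both models from the two confusion matrices printed earlier. It assumes actual labels in rows and predicted labels in columns, and the helper name `high_risk_metrics` is hypothetical.

```python
def high_risk_metrics(matrix):
    """Return (precision, recall, balanced accuracy), treating label 1 as positive."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)  # recall of label 0
    return precision, recall, (recall + specificity) / 2


# Confusion matrices copied from the outputs above
original = [[14926, 75], [46, 461]]
resampled = [[14915, 86], [3, 504]]

for name, matrix in (("original", original), ("resampled", resampled)):
    p, r, b = high_risk_metrics(matrix)
    print(f"{name}: precision={p:.2f} recall={r:.2f} balanced_accuracy={b:.2f}")
```

The trade-off shows up directly in the counts: oversampling cuts missed high-risk loans from 46 to 3 at the cost of 11 additional healthy loans flagged as high-risk.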