In [2]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [4]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path("Resources/lending_data.csv")
df = pd.read_csv(file_path)

# Review the DataFrame
df.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [5]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df["loan_status"]

# Separate the X variable, the features
X = df.drop(columns=["loan_status"])

In [6]:
# Review the y variable Series
print(f"y shape: {y.shape}")

y shape: (77536,)


In [7]:
# Review the X variable DataFrame
print(f"X shape: {X.shape}")

X shape: (77536, 7)


In [9]:
print(X.head())
print(y.head())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  
0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64
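
Before splitting, it can help to check how the two labels are distributed, because a heavily skewed target changes how accuracy should be read later on. The following is a minimal sketch that uses a small made-up Series (`y_demo`, a hypothetical stand-in for the notebook's `y`); the values are illustrative and are not taken from `lending_data.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's y Series (illustrative values only)
y_demo = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="loan_status")

# Absolute number of rows per label
label_counts = y_demo.value_counts()

# Share of rows per label, as a fraction of all rows
label_shares = y_demo.value_counts(normalize=True)

print(label_counts)
print(label_shares)
```

Running the same two `value_counts` calls on the real `y` would show how many healthy and high-risk loans the dataset contains.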


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [10]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Review the shapes of the training and testing sets to ensure they are correct
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (58152, 7)
X_test shape: (19384, 7)
y_train shape: (58152,)
y_test shape: (19384,)
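
The split in this notebook is purely random. When one label is rare, passing `stratify` to `train_test_split` is one way to keep the label proportions the same in both sets. The sketch below illustrates this on made-up data (`X_demo` and `y_demo` are hypothetical); it is an optional variation, not the call used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 360 rows of label 0 and 40 rows of label 1
X_demo = pd.DataFrame({"feature": np.arange(400)})
y_demo = pd.Series([0] * 360 + [1] * 40, name="loan_status")

# stratify=y_demo asks the splitter to preserve the 90/10 label ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

# Share of label 1 in each split
print(f"Train share of label 1: {y_tr.mean():.3f}")
print(f"Test share of label 1: {y_te.mean():.3f}")
```

Without `stratify`, the two shares are usually close on a large dataset but are not guaranteed to match.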


---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [11]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_regression_model = LogisticRegression(random_state=1)

# Fit the model using the training data
logistic_regression_model.fit(X_train, y_train)

# Output the coefficients and intercept of the trained model
print(f"Model Coefficients: {logistic_regression_model.coef_}")
print(f"Model Intercept: {logistic_regression_model.intercept_}")

Model Coefficients: [[ 4.60888714e-03 -2.68144046e-08 -1.15831857e-03  1.90541720e-05
  -2.01070597e-06  7.62267938e-05  2.61927387e-04]]
Model Intercept: [-4.73416517e-08]
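
The printed coefficient array follows the column order of the training DataFrame, so pairing it with the column names makes it easier to read. The sketch below shows one way to do that on made-up data; `X_demo`, `y_demo`, `feature_a`, and `feature_b` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the label depends only on feature_a
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
y_demo = (X_demo["feature_a"] > 0).astype(int)

model_demo = LogisticRegression(random_state=1)
model_demo.fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem; take the single row
coef_by_feature = pd.Series(model_demo.coef_[0], index=X_demo.columns)
print(coef_by_feature)
```

Coefficient magnitudes depend on the scale of each feature, so raw values for unscaled columns such as `loan_size` and `debt_to_income` are not directly comparable with one another.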


### Step 2: Predict the testing labels from the testing feature data (`X_test`) by using the fitted model, and save the predictions.

In [17]:
# Make a prediction using the testing data
y_pred = logistic_regression_model.predict(X_test)

# Display the predictions (NumPy abbreviates long arrays)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
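
`predict` returns hard labels only. If the predicted probabilities are of interest, `predict_proba` exposes them, and a custom decision threshold can then be applied. The sketch below uses made-up data, and the threshold of 0.3 is a hypothetical value chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the label depends only on the first column
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

model_demo = LogisticRegression(random_state=1).fit(X_demo, y_demo)

# Columns of predict_proba follow model_demo.classes_, here [0, 1]
proba_label_1 = model_demo.predict_proba(X_demo)[:, 1]

# Hard labels from the default decision rule
default_pred = model_demo.predict(X_demo)

# Hypothetical lower threshold: flag more rows as label 1
threshold = 0.3
custom_pred = (proba_label_1 >= threshold).astype(int)

print(f"Flagged by predict(): {default_pred.sum()}")
print(f"Flagged at threshold {threshold}: {custom_pred.sum()}")
```

Lowering the threshold trades precision for recall on label 1; the right setting depends on the cost of each kind of error.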

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [13]:
# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[18663   102]
 [   56   563]]
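
In scikit-learn's confusion matrix, rows are the actual labels and columns are the predicted labels, so the four cells are true negatives, false positives, false negatives, and true positives, in that order. As a worked check, the sketch below recomputes the label-1 precision and recall and the overall accuracy directly from the matrix printed in this notebook.

```python
import numpy as np

# Confusion matrix copied from the notebook output
# (rows: actual label, columns: predicted label)
conf_matrix = np.array([[18663, 102],
                        [56, 563]])

# For a 2x2 matrix, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = conf_matrix.ravel()

precision_1 = tp / (tp + fp)  # 563 / 665
recall_1 = tp / (tp + fn)     # 563 / 619
accuracy = (tp + tn) / conf_matrix.sum()

print(f"Precision (label 1): {precision_1:.2f}")
print(f"Recall (label 1): {recall_1:.2f}")
print(f"Accuracy: {accuracy:.2f}")
```

These round to 0.85, 0.91, and 0.99, the same values that `classification_report` prints for this model.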


In [14]:
# Generate the classification report for the model
class_report = classification_report(y_test, y_pred)

# Print the classification report
print("Classification Report:")
print(class_report)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384
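
`classification_report` can also label the classes by name and return its values as a dictionary, which is convenient when individual metrics are needed in later code. The sketch below rebuilds label arrays that are consistent with the confusion matrix printed in this notebook (the row order is arbitrary and is an assumption made only to reproduce the counts); the names `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import classification_report

# Label arrays rebuilt from the confusion matrix counts:
# 18663 TN, 102 FP, 56 FN, 563 TP (row order is arbitrary)
y_true = np.array([0] * 18765 + [1] * 619)
y_hat = np.array([0] * 18663 + [1] * 102 + [0] * 56 + [1] * 563)

# target_names maps the sorted labels [0, 1] to readable names
report = classification_report(
    y_true,
    y_hat,
    target_names=["healthy loan", "high-risk loan"],
    output_dict=True,
)

high_risk = report["high-risk loan"]
print(f"precision={high_risk['precision']:.2f}, recall={high_risk['recall']:.2f}")
print(f"macro avg recall={report['macro avg']['recall']:.2f}")
```

The macro average weights both classes equally, which is why it sits well below the 0.99 accuracy on this imbalanced test set.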



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 

The model is scikit-learn's `LogisticRegression`, fitted on the original training data and evaluated on the 19,384 loans in the testing set.

Results:

* Accuracy: 0.99
* Precision for label `0` (healthy loan): 1.00
* Precision for label `1` (high-risk loan): 0.85
* Recall for label `0` (healthy loan): 0.99
* Recall for label `1` (high-risk loan): 0.91

Summary:

The model predicts healthy loans almost perfectly and high-risk loans well, though less reliably. Of the 619 high-risk loans in the testing set, it identified 563 and missed 56 (recall 0.91). Of the 665 loans it flagged as high-risk, 102 were actually healthy (precision 0.85). The testing set is heavily imbalanced, with 18,765 healthy loans against 619 high-risk loans, so the 99% accuracy mostly reflects performance on label `0` and overstates performance on label `1`.

Recommendation:

The logistic regression model is a reasonable choice for credit risk analysis on this dataset. It catches about 91% of high-risk loans while misclassifying fewer than 1% of healthy loans (102 of 18,765). Performance is not balanced across the two classes, however: roughly 15% of the loans flagged as high-risk are false alarms, and about 9% of high-risk loans go undetected. Whether that trade-off is acceptable depends on the relative cost of a missed high-risk loan versus a healthy loan flagged for review.

---