In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [3]:
# Read the lending_data.csv file into a Pandas DataFrame
df = pd.read_csv('Resources/lending_data.csv')

# Display the first few rows of the DataFrame to review the data
df.head()



|   | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


In [4]:
# Check the info of the DataFrame to see column names, data types, and missing values
df.info()

# Compute basic statistics of the dataset
# (not rendered below: a notebook cell only displays its last expression)
df.describe()

# Check the value counts for the target variable ("loan_status")
df['loan_status'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 non-null  float64
 1   interest_rate     77536 non-null  float64
 2   borrower_income   77536 non-null  int64  
 3   debt_to_income    77536 non-null  float64
 4   num_of_accounts   77536 non-null  int64  
 5   derogatory_marks  77536 non-null  int64  
 6   total_debt        77536 non-null  int64  
 7   loan_status       77536 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 4.7 MB


loan_status
0    75036
1     2500
Name: count, dtype: int64
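
The counts above point to a heavily imbalanced target. As a quick illustration, the following sketch rebuilds the two counts by hand and converts them to proportions. The `counts` Series is a stand-in for `df['loan_status'].value_counts()` and is not part of the original notebook.

```python
import pandas as pd

# Stand-in for df['loan_status'].value_counts(), using the counts printed above
counts = pd.Series({0: 75036, 1: 2500}, name="count")

# Share of each class in the full dataset
shares = counts / counts.sum()
print(shares.round(4))

# Fraction of loans labeled high-risk (1), roughly 3% of the data
high_risk_share = shares.loc[1]
```

On the real DataFrame, `df['loan_status'].value_counts(normalize=True)` should give the same proportions directly.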

### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [8]:
# Separate the labels (y)
y = df['loan_status']

# Separate the features (X)
X = df.drop('loan_status', axis=1)


In [11]:
# Review the first few entries of the 'y' Series (labels)
print(y.head())

# Check the data type of the 'y' Series
print("Data type of y:", y.dtype)

# Check the size of the 'y' Series (number of elements)
print("Size of y:", y.shape)

# Check the unique values in the 'y' Series and their counts
print("Unique values in y:\n", y.value_counts())



0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64
Data type of y: int64
Size of y: (77536,)
Unique values in y:
 loan_status
0    75036
1     2500
Name: count, dtype: int64


In [12]:
# Review the first few entries of the 'X' DataFrame (features)
print(X.head())

# Check the shape (dimensions) of the 'X' DataFrame
print("Shape of X:", X.shape)

# Check the column names of the 'X' DataFrame to verify the features
print("Column names of X:", X.columns)

# Check the data types of the columns in the 'X' DataFrame
print("Data types of X:\n", X.dtypes)

# Check for any missing values in the 'X' DataFrame
print("Missing values in X:\n", X.isnull().sum())


   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  
Shape of X: (77536, 7)
Column names of X: Index(['loan_size', 'interest_rate', 'borrower_income', 'debt_to_income',
       'num_of_accounts', 'derogatory_marks', 'total_debt'],
      dtype='object')
Data types of X:
 loan_size           float64
interest_rate       float64
borrower_income       int64
debt_to_

### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [13]:
# Import the train_test_split function from scikit-learn
from sklearn.model_selection import train_test_split

# Split the data into training and testing datasets
# Assign a random_state of 1 to the function to ensure reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
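
Because only about 3% of the loans are high-risk, a purely random split can leave the training and testing sets with slightly different class ratios. One optional variation, which the notebook itself does not use, is to pass `stratify=y` so that both sets keep the same ratio. The sketch below demonstrates the idea on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, and applying this to the real data would change the numbers reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3% positive class (not the lending dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 30 + [0] * 970)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Both means should be close to 0.03
print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```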


---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [14]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model with random_state=1
model = LogisticRegression(random_state=1)

# Fit the model using the training data
model.fit(X_train, y_train)


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [15]:
# Make predictions using the fitted model and testing data (X_test)
y_pred = model.predict(X_test)
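
`predict` returns hard class labels. If the estimated probabilities behind those labels are of interest, `predict_proba` exposes them. The sketch below shows both calls side by side on synthetic data. `X_demo`, `y_demo`, and `demo_model` are hypothetical names used only for illustration, not the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in data (not the lending dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, weights=[0.9, 0.1], random_state=1
)

demo_model = LogisticRegression(random_state=1).fit(X_demo, y_demo)

# Hard class labels (0 or 1), the same kind of output as model.predict(X_test)
labels = demo_model.predict(X_demo[:5])

# Estimated class probabilities; column order follows demo_model.classes_
proba = demo_model.predict_proba(X_demo[:5])

print(labels)
print(proba.round(3))
```

For a binary model, each predicted label corresponds to the column with the larger probability in that row.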


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [17]:
# Import the classification_report function from sklearn
from sklearn.metrics import classification_report

# Generate and print the classification report
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)


Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     15001
           1       0.86      0.94      0.90       507

    accuracy                           0.99     15508
   macro avg       0.93      0.97      0.95     15508
weighted avg       0.99      0.99      0.99     15508
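
Step 3 also asks for a confusion matrix. A minimal sketch of how one could be generated with `confusion_matrix` follows, using small made-up label arrays. `y_true_demo` and `y_pred_demo` are hypothetical, not the notebook's data; in the notebook the same call would presumably take `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (0 = healthy, 1 = high-risk)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# For a binary problem, ravel() unpacks the four cells in this order
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Reading the matrix this way makes it easy to see where precision (which depends on false positives) and recall (which depends on false negatives) come from.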



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 
The classification report shows that the model predicts both labels well, though it is more reliable for healthy loans (`0`) than for high-risk loans (`1`). Overall accuracy is 0.99 on the 15,508 test loans, but the test set is heavily imbalanced (15,001 healthy loans versus 507 high-risk loans), so a model that always predicted `0` would already score about 0.97. The per-class metrics are therefore more informative than accuracy alone:

* **Precision** is the share of loans predicted as a class that truly belong to it. For `0` it is 1.00, so nearly every loan the model predicts as healthy is in fact healthy. For `1` it is 0.86, so roughly 14% of the loans flagged as high-risk are actually healthy.

* **Recall** is the share of loans that truly belong to a class that the model identifies. For `0` it is 0.99, so very few healthy loans are flagged as high-risk. For `1` it is 0.94, so the model catches most high-risk loans but misses about 6% of them.

* **F1-score** is the harmonic mean of precision and recall. It is 1.00 for `0` and 0.90 for `1`, which confirms that the model is weaker on the minority class.

In short, the model identifies healthy loans almost perfectly and high-risk loans reasonably well. Its main weakness is the precision of 0.86 for `1`: some healthy loans are flagged as high-risk. Whether that trade-off is acceptable depends on the company's priorities. If missing a high-risk loan is the costlier mistake, the recall of 0.94 for `1` is the key figure, and the false alarms on healthy loans are the price paid for it.
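
To connect these definitions to the reported numbers, here is a small sketch that recomputes two figures by hand from values printed in the classification report. The inputs are the rounded values from the report, so the results are approximate.

```python
# Precision and recall for class 1, as printed (rounded) in the report
precision_1 = 0.86
recall_1 = 0.94

# F1 is the harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print("F1 for class 1:", round(f1_1, 2))

# Support values from the report: 15001 healthy and 507 high-risk test loans
support_0, support_1 = 15001, 507

# Accuracy of a trivial baseline that predicts "healthy" for every loan
baseline_accuracy = support_0 / (support_0 + support_1)
print("All-healthy baseline accuracy:", round(baseline_accuracy, 3))
```

The recomputed F1 agrees with the 0.90 in the report, and the baseline shows why a high accuracy on its own says little on a dataset this imbalanced.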

---