# Credit Risk Classification

Credit risk poses a classification problem that’s inherently imbalanced. This is because healthy loans easily outnumber risky loans. In this Challenge, you’ll use various techniques to train and evaluate models with imbalanced classes. You’ll use a dataset of historical lending activity from a peer-to-peer lending services company to build a model that can identify the creditworthiness of borrowers.

## Instructions:

This challenge consists of the following subsections:

* Split the Data into Training and Testing Sets

* Create a Logistic Regression Model with the Original Data

* Predict a Logistic Regression Model with Resampled Training Data 

### Split the Data into Training and Testing Sets

Open the starter code notebook and then use it to complete the following steps.

1. Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

2. Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

    > **Note** A value of `0` in the “loan_status” column means that the loan is healthy. A value of `1` means that the loan has a high risk of defaulting.  

3. Check the balance of the labels variable (`y`) by using the `value_counts` function.

4. Split the data into training and testing datasets by using `train_test_split`.

### Create a Logistic Regression Model with the Original Data

Employ your knowledge of logistic regression to complete the following steps:

1. Fit a logistic regression model by using the training data (`X_train` and `y_train`).

2. Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

3. Evaluate the model’s performance by doing the following:

    * Calculate the accuracy score of the model.

    * Generate a confusion matrix.

    * Print the classification report.

4. Answer the following question: How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

### Predict a Logistic Regression Model with Resampled Training Data

Did you notice the small number of high-risk loan labels? Perhaps a model that uses resampled data will perform better. You’ll thus resample the training data and then reevaluate the model. Specifically, you’ll use `RandomOverSampler`.

To do so, complete the following steps:

1. Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

2. Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

3. Evaluate the model’s performance by doing the following:

    * Calculate the accuracy score of the model.

    * Generate a confusion matrix.

    * Print the classification report.
    
4. Answer the following question: How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

### Write a Credit Risk Analysis Report

For this section, you’ll write a brief report that includes a summary and an analysis of the performance of both machine learning models that you used in this challenge. You should write this report as the `README.md` file included in your GitHub repository.

Structure your report by using the report template that `Starter_Code.zip` includes, and make sure that it contains the following:

1. An overview of the analysis: Explain the purpose of this analysis.


2. The results: Using bulleted lists, describe the balanced accuracy scores and the precision and recall scores of both machine learning models.

3. A summary: Summarize the results from the machine learning models. Compare the two versions of the dataset predictions. Include your recommendation for the model to use, if any, on the original vs. the resampled data. If you don’t recommend either model, justify your reasoning.

In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [19]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
lending_data = pd.read_csv(
    Path('Resources/lending_data.csv')
)

# Review the DataFrame
display(lending_data.head())
display(lending_data.tail())

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 77531 | 19100.0 | 11.261 | 86600 | 0.65358 | 12 | 2 | 56600 | 1 |
| 77532 | 17700.0 | 10.662 | 80900 | 0.629172 | 11 | 2 | 50900 | 1 |
| 77533 | 17600.0 | 10.595 | 80300 | 0.626401 | 11 | 2 | 50300 | 1 |
| 77534 | 16300.0 | 10.068 | 75300 | 0.601594 | 10 | 2 | 45300 | 1 |
| 77535 | 15600.0 | 9.742 | 72300 | 0.585062 | 9 | 2 | 42300 | 1 |


### Step 2: Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [20]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data['loan_status']

# Separate the X variable, the features
X = lending_data.drop(columns=['loan_status'])

In [21]:
# Review the y variable Series
display(y.head())
display(y.tail())

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

77531    1
77532    1
77533    1
77534    1
77535    1
Name: loan_status, dtype: int64

In [22]:
# Review the X variable DataFrame
display(X.head())
display(X.tail())

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 77531 | 19100.0 | 11.261 | 86600 | 0.65358 | 12 | 2 | 56600 |
| 77532 | 17700.0 | 10.662 | 80900 | 0.629172 | 11 | 2 | 50900 |
| 77533 | 17600.0 | 10.595 | 80300 | 0.626401 | 11 | 2 | 50300 |
| 77534 | 16300.0 | 10.068 | 75300 | 0.601594 | 10 | 2 | 45300 |
| 77535 | 15600.0 | 9.742 | 72300 | 0.585062 | 9 | 2 | 42300 |


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [23]:
# Check the balance of our target values
y.value_counts()

loan_status
0    75036
1     2500
Name: count, dtype: int64
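
To put the imbalance in numbers, the counts above can be turned into class shares. The following is a minimal sketch in plain Python; the two counts are copied from the `value_counts` output, and the variable names are illustrative rather than part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 75036, 1: 2500}

total = sum(counts.values())

# Share of each class in the full dataset
shares = {label: n / total for label, n in counts.items()}

# How many healthy loans there are for each high-risk loan
imbalance_ratio = counts[0] / counts[1]

for label, share in shares.items():
    print(f"class {label}: {share:.2%} of {total} loans")
print(f"healthy loans per high-risk loan: {imbalance_ratio:.1f}")
```

With roughly 30 healthy loans for every high-risk loan, a model that predicted `0` for every row would already score close to 97% on plain accuracy, which is one reason the notebook reports balanced accuracy instead.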

### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=1,
                                                    stratify=y)
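
Because the call passes `stratify=y` and leaves `test_size` at its default, both splits should keep the same share of high-risk loans as the full dataset. The sketch below checks this with plain arithmetic, assuming the test-set class counts reported in the `sup` column of the classification report further down (18759 and 625); the helper name `minority_share` is hypothetical.

```python
# Full-dataset class counts from Step 3
full = {0: 75036, 1: 2500}

# Test-set class counts, taken from the "sup" column of the classification report
test = {0: 18759, 1: 625}

# Whatever is not in the test set is in the training set
train = {label: full[label] - test[label] for label in full}


def minority_share(counts):
    """Fraction of samples that belong to class 1 (high-risk loan)."""
    return counts[1] / (counts[0] + counts[1])


for name, counts in [("full", full), ("train", train), ("test", test)]:
    print(f"{name:>5}: {counts}  minority share = {minority_share(counts):.4f}")

# Fraction of all rows that ended up in the test set
test_fraction = sum(test.values()) / sum(full.values())
print(f"test fraction: {test_fraction:.2f}")
```

In the notebook itself, comparing `y_train.value_counts(normalize=True)` with `y_test.value_counts(normalize=True)` would give the same check directly on the split data.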

In [9]:
# Instantiate a StandardScaler instance
#scaler = StandardScaler()

# Fit the training data to the standard scaler
#X_scaler = scaler.fit(X_train)

# Transform the training data using the scaler
#X_train_scaled = X_scaler.transform(X_train)

# Transform the testing data using the scaler
#X_test_scaled = X_scaler.transform(X_test)
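
The commented-out cell above outlines the usual scaling pattern: learn the scaling statistics from the training data only, then apply them to both splits. As a rough illustration of what that standardization does, here is a plain-Python sketch for a single feature. The values are a handful of `loan_size` entries picked for illustration, and the helper `standardize` is hypothetical; this is not scikit-learn's implementation.

```python
from statistics import fmean, pstdev

# Illustrative values for one feature, split into train and test
train_values = [10700.0, 8400.0, 9000.0, 10800.0]
test_values = [19100.0, 15600.0]

# "Fit": learn the mean and population standard deviation from the training data only
mean = fmean(train_values)
std = pstdev(train_values)


def standardize(values):
    """Apply z = (x - mean) / std using the training statistics."""
    return [(x - mean) / std for x in values]


# "Transform": the same training statistics are applied to both splits
train_scaled = standardize(train_values)
test_scaled = standardize(test_values)

print(train_scaled)
print(test_scaled)
```

The population standard deviation is used here because `StandardScaler` also divides by the standard deviation computed without the sample correction. Reusing the training statistics for the test values keeps information about the test set out of the fitting step.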

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
classifier = LogisticRegression(solver='lbfgs', random_state=1)
# Fit the model using training data
classifier.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [8]:
# Make a prediction using the testing data
predictions = classifier.predict(X_test)
df = pd.DataFrame({'Prediction': predictions, 'Actual': y_test})
display(df.head())
display(df.tail())

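| | Prediction | Actual |
| --- | --- | --- |
| 36831 | 0 | 0 |
| 75818 | 0 | 1 |
| 36563 | 0 | 0 |
| 13237 | 0 | 0 |
| 43292 | 0 | 0 |

| | Prediction | Actual |
| --- | --- | --- |
| 38069 | 0 | 0 |
| 36892 | 0 | 0 |
| 5035 | 0 | 0 |
| 40821 | 0 | 0 |
| 35030 | 0 | 0 |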


### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [9]:
# Print the balanced_accuracy score of the model
original_balanced_accuracy_score = balanced_accuracy_score(y_test, predictions)
original_balanced_accuracy_score

0.9442676901753825

In [10]:
# Generate a confusion matrix for the model
original_confusion_matrix = confusion_matrix(y_test, predictions)
original_confusion_matrix

array([[18679,    80],
       [   67,   558]], dtype=int64)
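
The scores in this step can all be derived from the four cells of the confusion matrix. The sketch below does that by hand, assuming scikit-learn's layout (rows are actual labels, columns are predicted labels) and treating `1` (high-risk loan) as the positive class; the variable names are illustrative.

```python
# Confusion matrix from the cell above: rows are actual labels, columns are predictions
tn, fp = 18679, 80   # actual 0 (healthy loan)
fn, tp = 67, 558     # actual 1 (high-risk loan)

recall_0 = tn / (tn + fp)      # healthy loans correctly identified
recall_1 = tp / (tp + fn)      # high-risk loans correctly identified
precision_1 = tp / (tp + fp)   # predicted high-risk loans that really are high-risk

# Balanced accuracy is the mean of the per-class recalls
balanced_accuracy = (recall_0 + recall_1) / 2

# Plain accuracy is dominated by the majority class
plain_accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"recall (class 0):    {recall_0:.4f}")
print(f"recall (class 1):    {recall_1:.4f}")
print(f"precision (class 1): {precision_1:.4f}")
print(f"balanced accuracy:   {balanced_accuracy:.4f}")
print(f"plain accuracy:      {plain_accuracy:.4f}")
```

The gap between plain accuracy and balanced accuracy comes from the weaker recall on the minority class, which plain accuracy largely hides.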

In [11]:
# Print the classification report for the model
original_classification_report = classification_report_imbalanced(y_test, predictions)
print(original_classification_report)

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      1.00      0.89      1.00      0.94      0.90     18759
          1       0.87      0.89      1.00      0.88      0.94      0.88       625

avg / total       0.99      0.99      0.90      0.99      0.94      0.90     19384
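
The `spe` and `geo` columns are less familiar than precision and recall. As a sketch of how they relate to the confusion matrix: in a two-class problem, the specificity of one class equals the recall of the other, and `geo` is the geometric mean of recall and specificity. The dictionaries below are illustrative and use the counts from the confusion matrix above.

```python
from math import sqrt

# Per-class recall from the original model's confusion matrix
recall = {0: 18679 / (18679 + 80), 1: 558 / (558 + 67)}

# With two classes, specificity for one class is the recall of the other class
specificity = {label: recall[1 - label] for label in recall}

# Geometric mean of recall (sensitivity) and specificity
geo = {label: sqrt(recall[label] * specificity[label]) for label in recall}

for label in (0, 1):
    print(
        f"class {label}: rec={recall[label]:.2f} "
        f"spe={specificity[label]:.2f} geo={geo[label]:.2f}"
    )
```

These values round to the `rec`, `spe`, and `geo` entries in the report, and `geo` is identical for both classes because it multiplies the same two recalls.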



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** This model reaches a balanced accuracy score of 94% on a dataset with notably imbalanced classes: the `loan_status` target has 75,036 data points for `0` (healthy loan) and only 2,500 for `1` (high-risk loan). Because the split is stratified, the training data carries the same imbalance, so the model learns far more about healthy loans than about high-risk loans. The classification report reflects this: precision and recall both round to 1.00 for healthy loans, but drop to 0.87 and 0.89 for high-risk loans. With a high-risk loan as the positive class, the confusion matrix shows 67 false negatives (high-risk loans predicted as healthy) and 80 false positives (healthy loans predicted as high-risk).

If the goal is to capture as many high-risk loans as possible, recall should take priority, even if that leads to some false positives. The question I raise is: which is more detrimental, flagging a healthy customer as high-risk or flagging a high-risk customer as healthy? In the first case, the wrongly flagged customer would likely contact the lender, which leads to an investigation and further clarification. In the second case, the lender may lose money, and a financial enterprise's goal is to increase revenue, not lose it. So I think it is more beneficial to prioritize a high recall value and reduce the false negatives, that is, customers with high-risk loans flagged as healthy.

Another observation: this data wasn't scaled, because I was following the instructions in the notebook. However, once I scaled the training data with `StandardScaler`, I got a much higher accuracy score (98%). Preprocessing the data before fitting the model is therefore something to consider as well.

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [13]:
# Import the RandomOverSampler module from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
ros = RandomOverSampler(random_state=1)
# Fit the original training data to the random_oversampler model
X_oversampled, y_oversampled = ros.fit_resample(X_train, y_train)

In [29]:
# Count the distinct values of the resampled labels data
y_oversampled.value_counts()

loan_status
0    56277
1    56277
Name: count, dtype: int64
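
Both classes now have 56277 training rows, which is the size of the majority class in the training split. Conceptually, random oversampling draws existing minority-class rows with replacement until the classes are the same size. The sketch below illustrates that idea in plain Python on made-up toy data; `random_oversample` is a hypothetical helper, not imbalanced-learn's implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, n in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        # Draw the missing samples with replacement from the existing ones
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data: 6 healthy loans (0) and 2 high-risk loans (1); the values are made up
rows = [[8400.0], [9000.0], [10700.0], [10800.0], [9500.0], [9900.0], [19100.0], [17700.0]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

As in the notebook, only the training data is resampled. The test set keeps its original class distribution so that the evaluation reflects the data the model would actually see.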

### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [15]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
classifier_resampled = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using the resampled training data
classifier_resampled.fit(X_oversampled, y_oversampled)

# Make a prediction using the testing data
predictions_resampled = classifier_resampled.predict(X_test)
df_resampled = pd.DataFrame({'Prediction Resampled': predictions_resampled, 'Actual': y_test})
display(df_resampled.head())
display(df_resampled.tail())

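| | Prediction Resampled | Actual |
| --- | --- | --- |
| 36831 | 0 | 0 |
| 75818 | 1 | 1 |
| 36563 | 0 | 0 |
| 13237 | 0 | 0 |
| 43292 | 0 | 0 |

| | Prediction Resampled | Actual |
| --- | --- | --- |
| 38069 | 0 | 0 |
| 36892 | 0 | 0 |
| 5035 | 0 | 0 |
| 40821 | 0 | 0 |
| 35030 | 0 | 0 |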


### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [16]:
# Print the balanced_accuracy score of the model 
resampled_balanced_accuracy_score = balanced_accuracy_score(y_test, predictions_resampled)
resampled_balanced_accuracy_score

0.9959744975744975

In [17]:
# Generate a confusion matrix for the model
resampled_confusion_matrix = confusion_matrix(y_test, predictions_resampled)
resampled_confusion_matrix

array([[18668,    91],
       [    2,   623]], dtype=int64)
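
Placing this matrix next to the one from the original model makes the trade-off concrete. The sketch below copies both matrices from the notebook output and compares the class `1` (high-risk loan) metrics; the helper `class_1_metrics` is hypothetical.

```python
# Confusion matrices copied from the notebook output, laid out as [[TN, FP], [FN, TP]]
original = [[18679, 80], [67, 558]]
resampled = [[18668, 91], [2, 623]]


def class_1_metrics(matrix):
    """Precision, recall, and error counts with class 1 as the positive class."""
    (tn, fp), (fn, tp) = matrix
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_negatives": fn,
        "false_positives": fp,
    }


before = class_1_metrics(original)
after = class_1_metrics(resampled)

for key in before:
    print(f"{key:>16}: {before[key]:.4g} -> {after[key]:.4g}")
```

Under these numbers, oversampling removes 65 false negatives at the cost of 11 additional false positives, so recall on high-risk loans improves sharply while precision barely moves.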

In [18]:
# Print the classification report for the model
resampled_classification_report = classification_report_imbalanced(y_test, predictions_resampled)
print(resampled_classification_report)

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      1.00      1.00      1.00      1.00      0.99     18759
          1       0.87      1.00      1.00      0.93      1.00      0.99       625

avg / total       1.00      1.00      1.00      1.00      1.00      0.99     19384



### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** With the resampled training data, the balanced accuracy score rises to over 99%. After applying the oversampler, both classes have the same number of training data points (56,277 each for healthy loans, `0`, and high-risk loans, `1`), so the training data is no longer imbalanced.

For this model I ask myself the same question: which is more painful, dealing with more false negatives (a high-risk loan incorrectly predicted as healthy) or more false positives (a healthy loan incorrectly predicted as high-risk)? In the grand scheme of things, in finance we want to grow revenue, and we want customers to be responsible with their financial decisions and not default on a payment. When a borrower defaults, the lender may recover the asset (a house, in the case of a mortgage loan), but the customer loses it, and that isn't an outcome we want. Customers default for many reasons, but overall we want to flag the accounts with high-risk loans, which means we want a higher recall value.

In this case, recall on class `1` (high-risk loan) rounds to 100%: only 2 of the 625 high-risk loans in the test set are missed, compared with 67 for the original model. The trade-off is a few more false positives (91 instead of 80), so some healthy accounts are flagged when they shouldn't be, but precision stays at about 87%, which isn't a bad score. I have learned that the choice of metric depends on the use case and the problem we are trying to solve: we should choose the metrics that bring the most benefit and lead to the least costly or painful outcome.