# Activity: Comparing Imbalanced Classifiers

In this activity, you’ll fit various balanced and imbalanced models to small business loan data. You’ll then compare the results by using the metrics that you’ve learned.

## Overview

The U.S. Small Business Administration (SBA) is a government agency that exists to support the creation and growth of small companies. The SBA supports this growth in several ways, one of which is lending to these firms.

The dataset for this activity contains information about actual small business loans that the SBA has issued. This dataset contains the following columns:

- “Year”: The fiscal year of the loan application.

- “Month”: The month of the fiscal year.

- “Amount”: The issued loan amount.

- “Term”: The term of the loan, in months.

- “Zip”: The borrower’s zip code.

- “CreateJob”: The number of jobs that were created by using the loan.

- “NoEmp”: The number of business employees.

- “RealEstate”: Whether the loan is backed by real estate.

- “RevLineCr”: Whether the loan is a revolving line of credit.

- “UrbanRural”: The location type of the borrower.

- “Default”: Whether the borrower defaulted on the loan (1) or not (0).

This dataset is imbalanced. Failing to repay a loan (that is, when the “Default” value equals 1) occurred rarely compared to the number of loans that borrowers successfully repaid.

Using some of these variables as features, you'll try several models to find the one that best predicts which SBA loans are most likely to default.

## Files

Use the Jupyter notebook file in the `Unsolved` folder to write your code. The `Resources` folder contains the CSV file that you’ll import.

## Instructions

1. Read in the CSV file from the `Resources` folder into a Pandas DataFrame.

2. Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

3. Split the features and labels into training and testing sets, and scale your `X` data by using `StandardScaler`.

4. Check the magnitude of imbalance in the dataset by viewing the count of each distinct value (`value_counts`) for the labels.

5. Fit two versions of a random forest model to the data: the first, a regular `RandomForest` classifier, and the second, a `BalancedRandomForest` classifier.

6. Resample and fit the training data by one additional method for imbalanced data, such as `RandomOverSampler`, undersampling, or a synthetic technique.

7. Print the confusion matrices, accuracy scores, and classification reports for the three different models.

8. Evaluate the effectiveness of `RandomForest`, `BalancedRandomForest`, and your one additional imbalanced classifier for predicting the minority class. 
    * Answer the following question: Does the model generated using one of the resampled classifiers more accurately flag all the loans that eventually defaulted?


In [1]:
# Import the required modules
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced

## Step 1: Read in the CSV file from the `Resources` folder into a Pandas DataFrame. 

In [4]:
# Read the sba_loans.csv file from the Resources folder into a Pandas DataFrame
loans_df = pd.read_csv(Path('../Resources/sba_loans.csv'))

# Review the DataFrame
loans_df.head()

Unnamed: 0,Year,Month,Amount,Term,Zip,CreateJob,NoEmp,RealEstate,RevLineCr,UrbanRural,Default
0,2001,11,32812,36,92801,0,1,0,1,0,0
1,2001,4,30000,56,90505,0,1,0,1,0,0
2,2001,4,30000,36,92103,0,10,0,1,0,0
3,2003,10,50000,36,92108,0,6,0,1,0,0
4,2006,7,343000,240,91345,3,65,1,0,2,0


## Step 2: Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

In [6]:
# Split the data into X (features) and y (labels)

# The y variable should focus on the Default column
y = loans_df['Default']

# The X variable should include all features except the Default column
X = loans_df.drop(columns='Default')

## Step 3: Split the features and labels into training and testing sets, and scale your `X` data by using `StandardScaler`.

In [7]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [8]:
# Scale the data
scaler = StandardScaler()
X_scaler = scaler.fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
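
To make the scaling cell concrete: `StandardScaler` learns each column's mean and standard deviation from the training set, then applies `z = (x - mean) / std` to both the training and testing sets. The sketch below reproduces that arithmetic for a single column in plain Python. The values are hypothetical (the amounts are borrowed from the preview rows, and the split is invented for illustration).

```python
from statistics import fmean, pstdev

# Hypothetical split of a single "Amount" column (illustrative values only)
amount_train = [32812, 30000, 30000, 50000]
amount_test = [343000]

# Learn the statistics from the training data only
train_mean = fmean(amount_train)
train_std = pstdev(amount_train)  # population standard deviation (ddof=0)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(x - mean) / std for x in values]

# The same training statistics are reused for the testing data
amount_train_scaled = standardize(amount_train, train_mean, train_std)
amount_test_scaled = standardize(amount_test, train_mean, train_std)

print(amount_train_scaled)
print(amount_test_scaled)
```

Reusing the training statistics for the testing data keeps information about the test set out of the fitted scaler.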

## Step 4: Check the magnitude of imbalance in the dataset by viewing the count of each distinct value (`value_counts`) for the labels.

In [9]:
# Count the distinct values in the original labels data
y_train.value_counts()

0    1055
1     104
Name: Default, dtype: int64
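
The counts above can be turned into a minority share and an imbalance ratio, which make the degree of imbalance easier to compare across datasets. The sketch below hard-codes the two counts from the output. In the notebook, `y_train.value_counts(normalize=True)` would give the proportions directly.

```python
# Class counts copied from the value_counts output above
class_counts = {0: 1055, 1: 104}

total = sum(class_counts.values())
minority_share = class_counts[1] / total
imbalance_ratio = class_counts[0] / class_counts[1]

print(f"Training rows: {total}")
print(f"Share of defaults: {minority_share:.1%}")
print(f"Repaid loans per default: {imbalance_ratio:.1f}")
```

With these counts, defaults make up roughly 9% of the training rows, or about one default for every ten repaid loans.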

## Step 5: Fit two versions of a random forest model to the data: the first, a regular `RandomForest` classifier, and the second, a `BalancedRandomForest` classifier.

In [10]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rf_model = RandomForestClassifier()

# Fitting the model
rf_model = rf_model.fit(X_train_scaled,y_train)

# Making predictions using the testing data
rf_predictions = rf_model.predict(X_test_scaled)

In [11]:
# Import BalancedRandomForestClassifier from imblearn
from imblearn.ensemble import BalancedRandomForestClassifier

# Instantiate a BalancedRandomForestClassifier instance
brf = BalancedRandomForestClassifier()

# Fit the model to the training data
brf.fit(X_train_scaled,y_train)



In [12]:
# Predict labels for testing features
brf_predictions = brf.predict(X_test_scaled)

## Step 6: Resample and fit the training data by one additional method for imbalanced data, such as `RandomOverSampler`, undersampling, or a synthetic technique. Then fit a `RandomForest` classifier to the resampled data.

In [13]:
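from imblearn.over_sampling import SMOTE

# Instantiate SMOTE, which synthesizes new minority-class samples
smote_sampler = SMOTE(random_state = 1, sampling_strategy = 'auto')

# Resample the training data only (this cell uses the unscaled features)
X_resampled, y_resampled = smote_sampler.fit_resample(X_train, y_train)

# Fit a random forest classifier to the resampled data
model_resampled_rf = RandomForestClassifier()
model_resampled_rf.fit(X_resampled, y_resampled)

# Predict labels for the unscaled testing features
rf_resampled_predictions = model_resampled_rf.predict(X_test)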
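
SMOTE is only one of the options that Step 6 lists. As a point of comparison, the simplest one, random oversampling, can be sketched in plain Python: minority-class rows are redrawn with replacement until every class is the same size. This is a toy illustration of the idea, not imbalanced-learn's `RandomOverSampler` implementation. The function name `random_oversample` and the sample rows are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Rows that belong to this class in the original data
        candidates = [row for row, lab in zip(rows, labels) if lab == label]
        # Draw with replacement to fill the gap to the largest class
        for _ in range(target - count):
            new_rows.append(rng.choice(candidates))
            new_labels.append(label)
    return new_rows, new_labels

# Hypothetical toy data: [Amount, Term] rows with six repaid loans and two defaults
rows = [[30000, 36], [50000, 36], [32812, 36], [343000, 240],
        [30000, 56], [75000, 84], [20000, 12], [15000, 24]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

rows_resampled, labels_resampled = random_oversample(rows, labels)
print(Counter(labels_resampled))
```

As in the notebook cell, only the training data is resampled. The testing data keeps its original class balance so that the metrics reflect the imbalance a deployed model would face.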

## Step 7: Print the confusion matrices, accuracy scores, and classification reports for the three different models.

In [17]:
# Print the confusion matrix for RandomForest on the original data
confusion_matrix(y_test,rf_predictions)

array([[353,   3],
       [  8,  23]])

In [18]:
# Print the confusion matrix for balanced random forest data
confusion_matrix(y_test,brf_predictions)

array([[326,  30],
       [  2,  29]])

In [19]:
# Print the confusion matrix for your additional model on the resampled data
confusion_matrix(y_test,rf_resampled_predictions)

array([[346,  10],
       [  4,  27]])

In [20]:
# Print the balanced accuracy score for the original data
balanced_accuracy_score(y_test,rf_predictions)

0.8667542587894165

In [21]:
# Print the balanced accuracy score for the balanced random forest data
balanced_accuracy_score(y_test,brf_predictions)

0.9256071040231968

In [22]:
# Print the balanced accuracy score for your additional model with resampled data
balanced_accuracy_score(y_test,rf_resampled_predictions)

0.9214389271475172
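
Balanced accuracy is the average of the recall on each class, so the three scores above can be re-derived from the confusion matrices printed earlier in this step. The sketch below assumes scikit-learn's layout for a binary confusion matrix (rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`). The helper name `balanced_accuracy_from_cm` is hypothetical.

```python
def balanced_accuracy_from_cm(cm):
    """Average the per-class recall from a binary matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    recall_class_0 = tn / (tn + fp)  # share of repaid loans predicted as repaid
    recall_class_1 = tp / (tp + fn)  # share of defaults predicted as defaults
    return (recall_class_0 + recall_class_1) / 2

# Confusion matrices copied from the outputs above
confusion_matrices = {
    "RandomForest": [[353, 3], [8, 23]],
    "BalancedRandomForest": [[326, 30], [2, 29]],
    "SMOTE + RandomForest": [[346, 10], [4, 27]],
}

balanced_scores = {
    name: balanced_accuracy_from_cm(cm) for name, cm in confusion_matrices.items()
}

for name, score in balanced_scores.items():
    print(f"{name}: {score:.4f}")
```

Because each class contributes equally to the average, the 31 defaults in the test set carry as much weight as the 356 repaid loans, which is why this score separates the three models more clearly than plain accuracy would.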

In [27]:
# Print the classification report for the original data
print(classification_report_imbalanced(y_test,rf_predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.98      0.99      0.74      0.98      0.86      0.75       356
          1       0.88      0.74      0.99      0.81      0.86      0.72        31

avg / total       0.97      0.97      0.76      0.97      0.86      0.75       387



In [28]:
# Print the classification report for the balanced random forest data
print(classification_report_imbalanced(y_test,brf_predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.99      0.92      0.94      0.95      0.93      0.85       356
          1       0.49      0.94      0.92      0.64      0.93      0.86        31

avg / total       0.95      0.92      0.93      0.93      0.93      0.86       387



In [29]:
# Print the classification report for your additional model with resampled data
print(classification_report_imbalanced(y_test,rf_resampled_predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.99      0.97      0.87      0.98      0.92      0.86       356
          1       0.73      0.87      0.97      0.79      0.92      0.84        31

avg / total       0.97      0.96      0.88      0.97      0.92      0.85       387



## Step 8: Evaluate the effectiveness of `RandomForest`, `BalancedRandomForest`, and your one additional imbalanced classifier for predicting the minority class. 

**Question:** Does the model generated using one of the imbalanced methods more accurately flag all the loans that eventually defaulted?
    
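**Answer:** The `BalancedRandomForest` (BRF) model does the best job of flagging the loans that defaulted, because it has the highest minority-class recall: 0.94 (29 of 31 defaults), compared with 0.87 for the SMOTE-resampled random forest and 0.74 for the regular random forest. The trade-off is precision, which falls to 0.49 for BRF because it also flags 30 loans that were repaid.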
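
The minority-class figures behind this comparison can be checked directly against the Step 7 confusion matrices. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that scikit-learn uses for binary labels. The helper name `minority_metrics` is hypothetical.

```python
def minority_metrics(cm):
    """Return precision, recall, and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)  # flagged loans that really defaulted
    recall = tp / (tp + fn)     # actual defaults that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the Step 7 outputs
confusion_matrices = {
    "RandomForest": [[353, 3], [8, 23]],
    "BalancedRandomForest": [[326, 30], [2, 29]],
    "SMOTE + RandomForest": [[346, 10], [4, 27]],
}

results = {name: minority_metrics(cm) for name, cm in confusion_matrices.items()}

for name, (precision, recall, f1) in results.items():
    missed_defaults = confusion_matrices[name][1][0]
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}, "
          f"f1={f1:.2f}, missed defaults={missed_defaults}")
```

Recall answers the activity's question (how many actual defaults were flagged), while precision shows the cost of that coverage: how many of the flagged loans did not default.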