# Synthetic Sampling

In this activity, you will use the provided dataset from a bank's telemarketing campaign to compare the effectiveness of synthetic resampling methods. For each method, you will fit a random forest to the resampled training data and compare its recall on the minority class with that of a random forest fitted to the original training data.

**Hint**: The column `y` is the target column.

## Instructions:

1. Read the data into a Pandas DataFrame.

2. Separate the features `X` from the target `y`.

3. Encode the categorical variables from the features data using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

4. Separate the data into training and testing subsets.

5. Scale the data using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

**RandomForestClassifier**

6. Create and fit a `RandomForestClassifier` to the **scaled** training data.

7.  Make predictions using the scaled testing data.

**Cluster Centroids**

8. Import `ClusterCentroids` from `imblearn`.

9. Fit the `ClusterCentroids` model to the scaled training data.

10. Check the `value_counts` for the resampled target.

11. Create and fit a `RandomForestClassifier` to the resampled training data.

12. Make predictions using the scaled testing data.

13. Generate and compare classification reports for each model.

**SMOTE**

14. Import `SMOTE` from `imblearn`.

15. Fit the `SMOTE` model to the scaled training data.

16. Check the `value_counts` for the resampled target.

17. Create and fit a `RandomForestClassifier` to the resampled training data.

18. Make predictions using the scaled testing data.

19. Generate and compare classification reports for each model.

**SMOTEENN**

20. Import `SMOTEENN` from `imblearn`.

21. Fit the `SMOTEENN` model to the scaled training data.

22. Check the `value_counts` for the resampled target.

23. Create and fit a `RandomForestClassifier` to the resampled training data.

24. Make predictions using the scaled testing data.

25. Generate and compare classification reports for each model.

## Prepare the Data

In [1]:
# Import modules
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

### 1. Read the data into a Pandas DataFrame.

In [2]:
# Read the data from the CSV file into a Pandas DataFrame
df = # YOUR CODE HERE

# Review the DataFrame
# YOUR CODE HERE


Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
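
As a minimal sketch of this step, the snippet below reads a small in-memory CSV with `pd.read_csv`. The inline text `csv_text` and the name `toy_df` are hypothetical stand-ins, since the path of the provided file is not shown here; with the real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the provided CSV file (the real path is not shown here).
csv_text = """,age,job,balance,y
0,30,unemployed,1787,no
1,33,services,4789,no
2,35,management,1350,yes
"""

# pd.read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

# A header cell left empty in the file is loaded as the column "Unnamed: 0".
print(toy_df.head())
```

The empty first header cell is why an `Unnamed: 0` column appears in the output above.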


### 2. Separate the features `X` from the target `y`

In [3]:
# Separate the features data, X, from the target data, y
y = # YOUR CODE HERE
X = # YOUR CODE HERE


### 3. Encode the categorical variables from the features data using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

In [4]:
# Encode the features dataset's categorical variables using get_dummies
X = # YOUR CODE HERE

# Review the features DataFrame
# YOUR CODE HERE


Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,79,1,-1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
1,33,4789,11,220,1,339,4,0,0,0,...,0,0,1,0,0,0,1,0,0,0
2,35,1350,16,185,1,330,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,30,1476,3,199,4,-1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,59,0,5,226,1,-1,0,0,1,0,...,0,0,1,0,0,0,0,0,0,1
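
The two steps above can be sketched on a toy frame. The names `toy_df`, `toy_X`, and `toy_y` are illustrative only. Recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed here as one way to get 0/1 values like those in the output above.

```python
import pandas as pd

# Toy frame with one categorical feature and the target column "y".
toy_df = pd.DataFrame({
    "age": [30, 33, 35, 59],
    "marital": ["married", "married", "single", "married"],
    "y": ["no", "no", "yes", "no"],
})

# The target is a single column; the features are every other column.
toy_y = toy_df["y"]
toy_X = toy_df.drop(columns="y")

# get_dummies one-hot encodes the text columns and leaves numeric columns unchanged.
toy_X = pd.get_dummies(toy_X, dtype=int)
print(toy_X.columns.tolist())
```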


### 4. Separate the data into training and testing subsets.

In [5]:
# Split data into training and testing datasets
X_train, X_test, y_train, y_test = # YOUR CODE HERE


In [6]:
# Review the counts of the distinct values in the training target, y_train
# YOUR CODE HERE


no     3012
yes     378
Name: y, dtype: int64
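
A minimal sketch of splitting imbalanced data and counting the classes in the training target, using toy data. The `random_state` and `stratify` arguments are illustrative assumptions, not requirements of this activity; `stratify` keeps the class proportions the same in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 "no" rows and 20 "yes" rows.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["no"] * 80 + ["yes"] * 20, name="y")

# By default, train_test_split holds out 25% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, random_state=1, stratify=toy_y
)

# Count how many rows of each class ended up in the training target.
print(y_tr.value_counts())
```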

### 5. Scale the data using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [7]:
# Instantiate a StandardScaler instance
scaler = # YOUR CODE HERE

# Fit the standard scaler to the training data
X_scaler = # YOUR CODE HERE

# Transform the training data using the scaler
X_train_scaled = # YOUR CODE HERE

# Transform the testing data using the scaler
X_test_scaled = # YOUR CODE HERE
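
The fit-then-transform pattern can be sketched on a tiny array. The names below (`scaler_demo`, `train`, `test`) are illustrative. The scaler learns the mean and standard deviation from the training rows only and then applies those same statistics to the testing rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# Learn the per-column mean and standard deviation from the training data only.
scaler_demo = StandardScaler()
scaler_demo.fit(train)

# Apply the training statistics to both subsets.
train_scaled = scaler_demo.transform(train)
test_scaled = scaler_demo.transform(test)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting on the training data alone avoids leaking information about the testing data into the model.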

---

## RandomForestClassifier

### 6. Create and fit a `RandomForestClassifier` to the **scaled** training data.

In [8]:
# Import the RandomForestClassifier from sklearn
# YOUR CODE HERE

# Instantiate a RandomForestClassifier instance
model = # YOUR CODE HERE

# Fit the model to the training data
# YOUR CODE HERE


RandomForestClassifier()

### 7. Make predictions using the scaled testing data.

In [9]:
# Predict labels for original scaled testing features
y_pred = # YOUR CODE HERE
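
A minimal sketch of fitting a random forest and predicting on held-out data, using a synthetic imbalanced dataset from `make_classification` rather than the bank data. The names `rf_demo` and `demo_pred` and the `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 90% of rows in class 0 and 10% in class 1.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Fit on the training subset, then predict labels for the testing subset.
rf_demo = RandomForestClassifier(random_state=1)
rf_demo.fit(X_tr, y_tr)
demo_pred = rf_demo.predict(X_te)

print(demo_pred[:10])
```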


---

## Cluster Centroids

### 8. Import `ClusterCentroids` from `imblearn`.

In [10]:
# Import ClusterCentroids from imblearn
# YOUR CODE HERE

# Instantiate a ClusterCentroids instance
cc_sampler = # YOUR CODE HERE


### 9. Fit the `ClusterCentroids` model to the scaled training data.

In [11]:
# Fit the training data to the cluster centroids model
X_resampled, y_resampled = # YOUR CODE HERE


### 10. Check the `value_counts` for the resampled target.

In [12]:
# Count distinct values for the resampled target data
# YOUR CODE HERE


no     378
yes    378
Name: y, dtype: int64

### 11. Create and fit a `RandomForestClassifier` to the resampled training data.

In [13]:
# Instantiate a new RandomForestClassifier model
cc_model = # YOUR CODE HERE

# Fit the new model to the resampled data
# YOUR CODE HERE


RandomForestClassifier()

### 12. Make predictions using the scaled testing data.

In [14]:
# Predict labels for the scaled testing features using the model fitted to the resampled data
cc_y_pred = # YOUR CODE HERE


### 13. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the data resampled with ClusterCentroids

In [15]:
# Print classification reports
print(f"Classification Report - Original Data")
print(# YOUR CODE HERE)
print("---------")
print(f"Classification Report - Resampled Data - ClusterCentroids")
print(# YOUR CODE HERE)


Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.89      0.98      0.93       988
         yes       0.56      0.19      0.28       143

    accuracy                           0.88      1131
   macro avg       0.73      0.58      0.61      1131
weighted avg       0.85      0.88      0.85      1131

---------
Classification Report - Resampled Data - ClusterCentroids
              precision    recall  f1-score   support

          no       1.00      0.26      0.41       988
         yes       0.16      0.99      0.28       143

    accuracy                           0.35      1131
   macro avg       0.58      0.63      0.35      1131
weighted avg       0.89      0.35      0.39      1131
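
Because this activity focuses on recall for the minority class, it can help to pull that single number out rather than reading it from the printed report. The sketch below uses small hand-written label lists (`y_true`, `y_hat`), which are illustrative and not taken from the bank data.

```python
from sklearn.metrics import classification_report, recall_score

# Eight majority ("no") rows and four minority ("yes") rows.
y_true = ["no"] * 8 + ["yes"] * 4
# The hypothetical model finds only one of the four "yes" rows.
y_hat = ["no"] * 7 + ["yes"] + ["yes", "no", "no", "no"]

# Full report, as printed in the cells above.
print(classification_report(y_true, y_hat))

# Recall for the minority class only: true "yes" predictions / actual "yes" rows.
minority_recall = recall_score(y_true, y_hat, pos_label="yes")
print(minority_recall)

# The same value is available from the report when requested as a dictionary.
report = classification_report(y_true, y_hat, output_dict=True)
print(report["yes"]["recall"])
```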



---

## SMOTE

### 14. Import `SMOTE` from `imblearn`.

In [16]:
# Import SMOTE from imblearn
# YOUR CODE HERE

# Instantiate a SMOTE instance
# Set the sampling_strategy parameter to "auto"
smote_sampler = # YOUR CODE HERE


### 15. Fit the `SMOTE` model to the scaled training data.

In [17]:
# Fit the training data to the smote_sampler model
X_resampled, y_resampled = # YOUR CODE HERE


### 16. Check the `value_counts` for the resampled target.

In [18]:
# Count distinct values for the resampled target data
# YOUR CODE HERE


no     3012
yes    3012
Name: y, dtype: int64

### 17. Create and fit a `RandomForestClassifier` to the resampled training data.

In [19]:
# Instantiate a new RandomForestClassifier model
smote_model = # YOUR CODE HERE

# Fit the resampled data to the new model
# YOUR CODE HERE


RandomForestClassifier()

### 18. Make predictions using the scaled testing data.

In [20]:
# Predict labels for the scaled testing features using the model fitted to the resampled data
smote_y_pred = # YOUR CODE HERE


### 19. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the data resampled with SMOTE

In [21]:
# Print classification reports
print(f"Classification Report - Original Data")
print(# YOUR CODE HERE)
print("---------")
print(f"Classification Report - Resampled Data - SMOTE")
print(# YOUR CODE HERE)

Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.89      0.98      0.93       988
         yes       0.56      0.19      0.28       143

    accuracy                           0.88      1131
   macro avg       0.73      0.58      0.61      1131
weighted avg       0.85      0.88      0.85      1131

---------
Classification Report - Resampled Data - SMOTE
              precision    recall  f1-score   support

          no       0.90      0.96      0.93       988
         yes       0.47      0.25      0.33       143

    accuracy                           0.87      1131
   macro avg       0.69      0.61      0.63      1131
weighted avg       0.84      0.87      0.85      1131



---

## SMOTEENN

### 20. Import `SMOTEENN` from `imblearn`.

In [22]:
# Import SMOTEENN from imblearn
# YOUR CODE HERE

# Instantiate the SMOTEENN instance
smote_enn = # YOUR CODE HERE


### 21. Fit the `SMOTEENN` model to the scaled training data.

In [23]:
# Fit the model to the training data
X_resampled, y_resampled = # YOUR CODE HERE


### 22. Check the `value_counts` for the resampled target.

In [24]:
# Count distinct values for the resampled target data
# YOUR CODE HERE


yes    2954
no     2394
Name: y, dtype: int64

### 23. Create and fit a `RandomForestClassifier` to the resampled training data.

In [25]:
# Instantiate a new RandomForestClassifier model
smoteenn_model = # YOUR CODE HERE

# Fit the new model to the resampled data
# YOUR CODE HERE


RandomForestClassifier()

### 24. Make predictions using the scaled testing data.

In [27]:
# Predict labels for the scaled testing features using the model fitted to the resampled data
smoteenn_y_pred = # YOUR CODE HERE


### 25. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the data resampled using SMOTEENN

In [29]:
# Print classification reports
print(f"Classification Report - Original Data")
print(# YOUR CODE HERE)
print("---------")
print(f"Classification Report - Resampled Data - SMOTEENN")
print(# YOUR CODE HERE)


Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.89      0.98      0.93       988
         yes       0.56      0.19      0.28       143

    accuracy                           0.88      1131
   macro avg       0.73      0.58      0.61      1131
weighted avg       0.85      0.88      0.85      1131

---------
Classification Report - Resampled Data - SMOTEENN
              precision    recall  f1-score   support

          no       0.93      0.91      0.92       988
         yes       0.45      0.52      0.48       143

    accuracy                           0.86      1131
   macro avg       0.69      0.72      0.70      1131
weighted avg       0.87      0.86      0.86      1131

