# Synthetic Sampling

In this activity, you will use the provided dataset from a bank's telemarketing campaign to compare the effectiveness of synthetic resampling methods. For each method, you will fit one random forest to the resampled training data and one to the original training data, then compare their recall on the minority class.

**Hint**: The column `y` is the target column.
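
Recall for the minority class is the fraction of actual `yes` rows that the model also labels `yes`. The sketch below computes it by hand on a small made-up list of labels. The helper name `minority_recall` and the sample labels are illustrative assumptions, not part of the activity.

```python
def minority_recall(y_true, y_pred, positive="yes"):
    """Return TP / (TP + FN) for the class named by `positive`."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    # Guard against a label list with no positive rows
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical labels, for illustration only
y_true_demo = ["no", "no", "yes", "yes", "no", "yes", "no", "yes"]
y_pred_demo = ["no", "no", "yes", "no", "no", "no", "yes", "yes"]

demo_recall = minority_recall(y_true_demo, y_pred_demo)
print(demo_recall)  # 2 of the 4 actual "yes" rows were found -> 0.5
```

A model can reach high accuracy on imbalanced data while still scoring low on this number, which is why the activity tracks it separately.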

## Prepare the Data

In [1]:
# Import modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

### 1. Read the data into a Pandas DataFrame.

In [2]:
# Read the data from the CSV file into a Pandas DataFrame
df = pd.read_csv('https://static.bc-edx.com/mbc/ai/m5/datasets/bank.csv')

# Review the DataFrame
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


### 2. Separate the features `X` from the target `y`

In [3]:
# Separate the features data, X, from the target data, y
y = df['y']
X = df.drop(columns='y')

### 3. Encode the categorical variables from the features data using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

In [4]:
# Encode the dataset's categorical variables using get_dummies
X = pd.get_dummies(X)

# Review the features DataFrame
X.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,79,1,-1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
1,33,4789,11,220,1,339,4,0,0,0,...,0,0,1,0,0,0,1,0,0,0
2,35,1350,16,185,1,330,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,30,1476,3,199,4,-1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,59,0,5,226,1,-1,0,0,1,0,...,0,0,1,0,0,0,0,0,0,1
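
To see what `get_dummies` does to a single categorical column, here is a minimal sketch on a made-up three-row frame (the `demo` DataFrame is an assumption for illustration). Numeric columns pass through unchanged, and each category becomes its own indicator column. Passing `dtype=int` requests 0/1 columns like the ones shown above; recent pandas versions may otherwise return boolean columns.

```python
import pandas as pd

# Hypothetical miniature of the features table
demo = pd.DataFrame({
    "age": [30, 33, 35],
    "marital": ["married", "married", "single"],
})

# dtype=int asks for 0/1 indicator columns instead of booleans
encoded_demo = pd.get_dummies(demo, dtype=int)

print(encoded_demo.columns.tolist())
# ['age', 'marital_married', 'marital_single']
```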


### 4. Separate the data into training and testing subsets.

In [5]:
# Split data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [6]:
# Review the distinct values from y
y_train.value_counts()

no     3012
yes     378
Name: y, dtype: int64
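
The counts above show how imbalanced the training target is. As a quick check of the arithmetic, this sketch hard-codes the two counts from that output (the variable names are illustrative) and derives the minority share and the majority-to-minority ratio.

```python
# Class counts copied from the y_train.value_counts() output above
train_counts = {"no": 3012, "yes": 378}

total = sum(train_counts.values())
minority_share = train_counts["yes"] / total
imbalance_ratio = train_counts["no"] / train_counts["yes"]

print(f"{minority_share:.1%} of training rows are 'yes'")  # 11.2%
print(f"{imbalance_ratio:.1f} 'no' rows per 'yes' row")     # 8.0
```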

### 5. Scale the data using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [7]:
# Instantiate a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler to the training data only
X_scaler = scaler.fit(X_train)

# Transform the training data using the scaler
X_train_scaled = X_scaler.transform(X_train)

# Transform the testing data using the scaler
X_test_scaled = X_scaler.transform(X_test)
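
The scaler is fitted on the training data only, so the testing data is transformed with the training mean and standard deviation. The sketch below illustrates this on made-up one-column arrays (`train_demo` and `test_demo` are assumptions for illustration) by reproducing the transformation by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and testing features (one column)
train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
test_demo = np.array([[5.0]])

demo_scaler = StandardScaler().fit(train_demo)

# Reproduce the scaling by hand with the training mean and
# population standard deviation (NumPy's default, ddof=0)
train_mean = train_demo.mean(axis=0)
train_std = train_demo.std(axis=0)
manual_train = (train_demo - train_mean) / train_std
manual_test = (test_demo - train_mean) / train_std

print(np.allclose(demo_scaler.transform(train_demo), manual_train))  # True
print(np.allclose(demo_scaler.transform(test_demo), manual_test))    # True
```

Fitting on the training split alone keeps information about the testing rows out of the preprocessing step.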

---

## RandomForestClassifier

### 6. Create and fit a `RandomForestClassifier` to the **scaled** training data.

In [8]:
# Import the RandomForestClassifier from sklearn
from sklearn.ensemble import RandomForestClassifier

# Instantiate a RandomForestClassifier instance
model = RandomForestClassifier()

# Fit the model to the scaled training data
model.fit(X_train_scaled, y_train)

### 7. Make predictions using the scaled testing data.

In [9]:
# Predict labels for original scaled testing features
y_pred = model.predict(X_test_scaled)

---

## Cluster Centroids

### 8. Import `ClusterCentroids` from `imblearn`.

In [10]:
# Import ClusterCentroids from imblearn
from sklearn.cluster import KMeans
from imblearn.under_sampling import ClusterCentroids

# Instantiate a ClusterCentroids instance
cc_sampler = ClusterCentroids(estimator=KMeans(n_init='auto', random_state=0), random_state=1)

### 9. Fit the `ClusterCentroids` model to the scaled training data.

In [11]:
# Resample the scaled training data with the ClusterCentroids sampler
X_resampled, y_resampled = cc_sampler.fit_resample(X_train_scaled, y_train)

### 10. Check the `value_counts` for the resampled target.

In [12]:
# Count distinct values for the resampled target data
y_resampled.value_counts()

no     378
yes    378
Name: y, dtype: int64
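
The resampled counts are equal because the majority class has been shrunk to the size of the minority class. A rough sketch of the idea, using plain `KMeans` on made-up data: fit one cluster per minority row on the majority class, then keep the cluster centers in place of the original majority rows. The arrays and names here are illustrative assumptions, and imblearn's `ClusterCentroids` may differ in its details.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
majority_demo = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # hypothetical "no" rows
minority_demo = rng.normal(loc=3.0, scale=1.0, size=(20, 2))   # hypothetical "yes" rows

# One cluster per minority row, fitted on the majority class only
kmeans_demo = KMeans(n_clusters=len(minority_demo), n_init=10, random_state=0)
kmeans_demo.fit(majority_demo)

# The centroids stand in for the original majority rows
majority_reduced = kmeans_demo.cluster_centers_
print(majority_reduced.shape, minority_demo.shape)  # (20, 2) (20, 2)
```

Because the centroids are averages rather than observed rows, the model trains on far fewer, synthetic majority examples.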

### 11. Create and fit a `RandomForestClassifier` to the resampled training data.

In [13]:
# Instantiate a new RandomForestClassifier model
cc_model = RandomForestClassifier()

# Fit the new model to the resampled data
cc_model.fit(X_resampled, y_resampled)

### 12. Make predictions using the scaled testing data.

In [14]:
# Predict labels for the scaled testing features using the model fitted to the resampled data
cc_y_pred = cc_model.predict(X_test_scaled)

### 13. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the data resampled with ClusterCentroids

In [28]:
# Print classification reports
print("Classification Report - Original Data")
print(classification_report(y_test, y_pred))
print("---------")
print("Classification Report - Resampled Data - ClusterCentroids")
print(classification_report(y_test, cc_y_pred))

Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.89      0.98      0.94       988
         yes       0.60      0.19      0.29       143

    accuracy                           0.88      1131
   macro avg       0.75      0.59      0.61      1131
weighted avg       0.86      0.88      0.85      1131

---------
Classification Report - Resampled Data - ClusterCentroids
              precision    recall  f1-score   support

          no       0.99      0.28      0.44       988
         yes       0.17      0.99      0.28       143

    accuracy                           0.37      1131
   macro avg       0.58      0.63      0.36      1131
weighted avg       0.89      0.37      0.42      1131
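
When only the minority-class numbers matter, the printed report can be awkward to compare by eye. One option is to ask `classification_report` for a dictionary and read the `yes` entries directly. The labels below are made up for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import classification_report

# Hypothetical labels, for illustration only
y_true_demo = ["no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred_demo = ["no", "no", "yes", "no", "yes", "no", "yes", "no"]

# output_dict=True returns nested dictionaries keyed by class label
report_demo = classification_report(y_true_demo, y_pred_demo, output_dict=True)
yes_recall = report_demo["yes"]["recall"]
yes_precision = report_demo["yes"]["precision"]

print(round(yes_recall, 2), round(yes_precision, 2))  # 0.67 0.67
```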



---

## SMOTE

### 14. Import `SMOTE` from `imblearn`.

In [16]:
# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE

# Instantiate the SMOTE instance 
# Set the sampling_strategy parameter equal to auto
smote_sampler = SMOTE(random_state=1, sampling_strategy='auto')


### 15. Fit the `SMOTE` model to the scaled training data.

In [17]:
# Resample the scaled training data with the SMOTE sampler
X_resampled, y_resampled = smote_sampler.fit_resample(X_train_scaled, y_train)

### 16. Check the `value_counts` for the resampled target.

In [18]:
# Count distinct values for the resampled target data
y_resampled.value_counts()

no     3012
yes    3012
Name: y, dtype: int64
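
The `yes` class grew from 378 to 3012 rows, so 2634 of them are synthetic. SMOTE creates each synthetic row by interpolating between a minority row and one of its nearest minority neighbors. The sketch below shows only that interpolation step on two made-up rows. The helper name and the values are assumptions, and the neighbor search is omitted.

```python
import random

def smote_like_point(sample, neighbor, rng):
    """Return a point on the segment between `sample` and `neighbor`."""
    lam = rng.random()  # one interpolation weight in [0, 1), shared by all features
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng_demo = random.Random(1)
sample_demo = [1.0, 10.0]    # hypothetical minority row
neighbor_demo = [3.0, 14.0]  # hypothetical nearby minority row

synthetic_demo = smote_like_point(sample_demo, neighbor_demo, rng_demo)
print(synthetic_demo)
```

Every synthetic row lies between two real minority rows, so SMOTE fills in the minority region rather than duplicating existing rows.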

### 17. Create and fit a `RandomForestClassifier` to the resampled training data.

In [19]:
# Instantiate a new RandomForestClassifier model
smote_model = RandomForestClassifier()

# Fit the resampled data to the new model
smote_model.fit(X_resampled, y_resampled)

### 18. Make predictions using the scaled testing data.

In [20]:
# Predict labels for the scaled testing features using the model fitted to the resampled data
smote_y_pred = smote_model.predict(X_test_scaled)

### 19. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the data resampled with SMOTE

In [21]:
# Print classification reports
print("Classification Report - Original Data")
print(classification_report(y_test, y_pred))
print("---------")
print("Classification Report - Resampled Data - SMOTE")
print(classification_report(y_test, smote_y_pred))

Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.89      0.98      0.94       988
         yes       0.60      0.19      0.29       143

    accuracy                           0.88      1131
   macro avg       0.75      0.59      0.61      1131
weighted avg       0.86      0.88      0.85      1131

---------
Classification Report - Resampled Data - SMOTE
              precision    recall  f1-score   support

          no       0.90      0.96      0.93       988
         yes       0.49      0.27      0.35       143

    accuracy                           0.87      1131
   macro avg       0.70      0.62      0.64      1131
weighted avg       0.85      0.87      0.86      1131



---

## SMOTEENN

### 20. Import `SMOTEENN` from `imblearn`.

In [22]:
# Import SMOTEENN from imblearn
from imblearn.combine import SMOTEENN

# Instantiate the SMOTEENN instance
smote_enn = SMOTEENN(random_state=1)


### 21. Fit the `SMOTEENN` model to the scaled training data.

In [23]:
# Resample the scaled training data with the SMOTEENN sampler
X_resampled, y_resampled = smote_enn.fit_resample(X_train_scaled, y_train)

### 22. Check the `value_counts` for the resampled target.

In [24]:
# Count distinct values for the resampled target data
y_resampled.value_counts()

yes    2927
no     2394
Name: y, dtype: int64
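
Unlike the earlier samplers, SMOTEENN does not return equal class counts. It first oversamples with SMOTE and then applies Edited Nearest Neighbours (ENN), which removes rows whose neighbors mostly belong to the other class. Assuming the SMOTE step balanced both classes at 3012 rows, ENN then removed 618 `no` rows and 85 `yes` rows. The sketch below is a simplified, one-dimensional version of the cleaning rule using a majority vote. The function name and data are assumptions, and imblearn's implementation may apply a different selection rule.

```python
from collections import Counter

def enn_keep_indices(points, labels, k=3):
    """Keep rows whose label matches the majority label of their k nearest neighbors."""
    keep = []
    for i, (point, label) in enumerate(zip(points, labels)):
        # Distance from row i to every other row, paired with that row's label
        others = [(abs(point - q), labels[j]) for j, q in enumerate(points) if j != i]
        others.sort(key=lambda pair: pair[0])
        votes = Counter(lbl for _, lbl in others[:k])
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return keep

# Hypothetical 1-D data: the last "yes" row sits inside the "no" cluster
points_demo = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 0.15]
labels_demo = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

kept_demo = enn_keep_indices(points_demo, labels_demo)
print(kept_demo)  # [0, 1, 2, 3, 4, 5, 6]: the stray "yes" row at 0.15 is dropped
```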

### 23. Create and fit a `RandomForestClassifier` to the resampled training data.

In [25]:
# Instantiate a new RandomForestClassifier model
smoteenn_model = RandomForestClassifier()

# Fit the new model to the resampled data
smoteenn_model.fit(X_resampled, y_resampled)

### 24. Make predictions using the scaled testing data.

In [26]:
# Predict labels for the scaled testing features using the model fitted to the resampled data
smoteenn_y_pred = smoteenn_model.predict(X_test_scaled)

### 25. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the data resampled using SMOTEENN

In [29]:
# Print classification reports
print("Classification Report - Original Data")
print(classification_report(y_test, y_pred))
print("---------")
print("Classification Report - Resampled Data - SMOTEENN")
print(classification_report(y_test, smoteenn_y_pred))


Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.89      0.98      0.94       988
         yes       0.60      0.19      0.29       143

    accuracy                           0.88      1131
   macro avg       0.75      0.59      0.61      1131
weighted avg       0.86      0.88      0.85      1131

---------
Classification Report - Resampled Data - SMOTEENN
              precision    recall  f1-score   support

          no       0.94      0.90      0.92       988
         yes       0.46      0.57      0.51       143

    accuracy                           0.86      1131
   macro avg       0.70      0.73      0.71      1131
weighted avg       0.87      0.86      0.87      1131



# Interpretations
Which synthetic resampling tool would you recommend for this application?

ANSWER: SMOTE and SMOTEENN both improved the recall of the "yes" class (from 0.19 to 0.27 and 0.57, respectively), while Cluster Centroids raised it *dramatically*, to 0.99. That said, Cluster Centroids also cut the precision of the "yes" class from 0.60 to 0.17 and dropped overall accuracy from 0.88 to 0.37. SMOTEENN appears to give the best improvement in recall without sacrificing too much elsewhere: "yes" precision falls to 0.46, accuracy stays at 0.86, and the "yes" F1-score is the highest of the four models (0.51).
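
As a quick cross-check of the answer, the sketch below hard-codes the "yes"-class metrics from the four classification reports and picks the model with the highest recall and the one with the highest F1-score. The dictionary layout and variable names are illustrative. Because the random forests are created without a fixed `random_state`, rerunning the notebook may give slightly different numbers.

```python
# "yes"-class metrics copied from the classification reports above
yes_metrics = {
    "Original":         {"precision": 0.60, "recall": 0.19, "f1": 0.29},
    "ClusterCentroids": {"precision": 0.17, "recall": 0.99, "f1": 0.28},
    "SMOTE":            {"precision": 0.49, "recall": 0.27, "f1": 0.35},
    "SMOTEENN":         {"precision": 0.46, "recall": 0.57, "f1": 0.51},
}

best_recall = max(yes_metrics, key=lambda name: yes_metrics[name]["recall"])
best_f1 = max(yes_metrics, key=lambda name: yes_metrics[name]["f1"])

print(best_recall)  # ClusterCentroids
print(best_f1)      # SMOTEENN
```

Ranking by recall alone favors Cluster Centroids, while the F1-score, which also penalizes the loss of precision, favors SMOTEENN.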