# Random Sampling

In this activity, you will use a dataset from a bank's telemarketing campaign to compare the effectiveness of random resampling methods. You will fit a random forest to the original training data and to each resampled version, then compare the models' recall for the minority class.

## Prepare the Data

In [1]:
# Import modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

### 1. Read the CSV file into a Pandas DataFrame

In [2]:
# Read the CSV file into a Pandas DataFrame
df = pd.read_csv('https://static.bc-edx.com/mbc/ai/m5/datasets/bank.csv')

# Review the DataFrame
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


### 2. Separate the features, `X`, from the target, `y`, data.

In [3]:
# Split the features and target data
y = df['y']
X = df.drop(columns='y')

### 3. Encode categorical variables with `get_dummies`

In [4]:
# Encode the features dataset's categorical variables using get_dummies
X = pd.get_dummies(X)

# Review the features DataFrame
X.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,79,1,-1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
1,33,4789,11,220,1,339,4,0,0,0,...,0,0,1,0,0,0,1,0,0,0
2,35,1350,16,185,1,330,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,30,1476,3,199,4,-1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,59,0,5,226,1,-1,0,0,1,0,...,0,0,1,0,0,0,0,0,0,1


### 4. Split the data into training and testing sets

In [5]:
# Split data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
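
With a target this imbalanced, one optional variation is to pass `stratify` to `train_test_split` so that both splits keep the same class ratio. The notebook itself does not do this. The sketch below uses hypothetical toy data (`toy_X`, `toy_y`), not the bank dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 "no" labels and 20 "yes" labels
toy_y = pd.Series(["no"] * 80 + ["yes"] * 20)
toy_X = pd.DataFrame({"feature": range(100)})

# stratify=toy_y keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, random_state=1, stratify=toy_y
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

On the notebook's data, the equivalent call would add `stratify=y` to the existing arguments; doing so would change the split and therefore the reported scores.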

In [6]:
# Review the distinct values from y
y_train.value_counts()

no     3012
yes     378
Name: y, dtype: int64
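
These counts mean that "yes" makes up roughly 11% of the training set. A minimal sketch of that arithmetic follows; the Series is rebuilt by hand from the counts printed above rather than loaded from the dataset.

```python
import pandas as pd

# Counts copied from the y_train.value_counts() output above
counts = pd.Series({"no": 3012, "yes": 378}, name="y")

# Share of each class in the training set
proportions = counts / counts.sum()

# Number of majority samples per minority sample
imbalance_ratio = counts["no"] / counts["yes"]

print(proportions.round(3))
print(f"Imbalance ratio (no:yes): {imbalance_ratio:.2f}")
```

On the real data, `y_train.value_counts(normalize=True)` returns the same proportions directly.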

### 5. Scale the data using `StandardScaler`

In [7]:
# Instantiate a StandardScaler instance
scaler = StandardScaler()

# Fit the standard scaler to the training data
X_scaler = scaler.fit(X_train)

# Transform the training data using the scaler
X_train_scaled = X_scaler.transform(X_train)

# Transform the testing data using the scaler
X_test_scaled = X_scaler.transform(X_test)

---

## RandomForestClassifier

### 6. Create and fit a `RandomForestClassifier` to the **scaled** training data.

In [8]:
# Import the RandomForestClassifier from sklearn
from sklearn.ensemble import RandomForestClassifier

# Instantiate a RandomForestClassifier instance
model = RandomForestClassifier()

# Fit the model to the training data
model.fit(X_train_scaled, y_train)
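
Because `RandomForestClassifier()` is instantiated here without a `random_state`, refitting the model can produce slightly different scores from run to run. The sketch below illustrates how a fixed seed makes fits repeatable. It assumes synthetic stand-in data from `make_classification` and an arbitrary `n_estimators=25`, neither of which comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=1
)

# Two forests built with the same seed are fitted identically
rf_a = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# Compare their predictions element by element
same_predictions = bool((rf_a.predict(X_demo) == rf_b.predict(X_demo)).all())
print(same_predictions)
```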

### 7. Make predictions using the scaled testing data.

In [9]:
# Predict labels for original scaled testing features
y_pred = model.predict(X_test_scaled)

---

## Random Undersampler

### 8. Import `RandomUnderSampler` from `imblearn`.

In [10]:
# Import RandomUnderSampler from imblearn
from imblearn.under_sampling import RandomUnderSampler

# Instantiate a RandomUnderSampler instance
rus = RandomUnderSampler(random_state=1)

### 9. Fit the random undersampler to the scaled training data.

In [11]:
# Resample the scaled training data with the random undersampler
X_undersampled, y_undersampled = rus.fit_resample(X_train_scaled, y_train)

### 10. Check the `value_counts` for the undersampled target.

In [12]:
# Count distinct values for the resampled target data
y_undersampled.value_counts()

no     378
yes    378
Name: y, dtype: int64
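
The balanced counts come from discarding randomly chosen majority-class rows until both classes are the same size. A minimal sketch of that idea in plain pandas follows, assuming a hypothetical toy DataFrame (`toy`); it illustrates the technique and is not how `imblearn` is implemented.

```python
import pandas as pd

# Hypothetical toy training set: 8 "no" rows and 2 "yes" rows
toy = pd.DataFrame({
    "feature": range(10),
    "y": ["no"] * 8 + ["yes"] * 2,
})

# Size of the smallest class
n_minority = toy["y"].value_counts().min()

# Keep n_minority randomly chosen rows (without replacement) from each class;
# for the minority class this simply keeps every row
undersampled = pd.concat([
    group.sample(n=n_minority, random_state=1)
    for _, group in toy.groupby("y")
])

print(undersampled["y"].value_counts())
```

The cost is visible in the counts: six of the eight "no" rows are thrown away, just as 2,634 of the 3,012 "no" rows are dropped in the notebook.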

### 11. Create and fit a `RandomForestClassifier` to the **undersampled** training data.

In [13]:
# Instantiate a new RandomForestClassifier model
model_undersampled = RandomForestClassifier()

# Fit the new model to the undersampled data
model_undersampled.fit(X_undersampled, y_undersampled)

### 12. Make predictions using the scaled testing data.

In [14]:
# Predict labels for the scaled testing features using the undersampled model
y_pred_undersampled = model_undersampled.predict(X_test_scaled)

### 13. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the undersampled data

In [15]:
# Print classification reports
print("Classification Report - Original Data")
print(classification_report(y_test, y_pred))
print("---------")
print("Classification Report - Undersampled Data")
print(classification_report(y_test, y_pred_undersampled))

Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.90      0.98      0.94       988
         yes       0.58      0.22      0.32       143

    accuracy                           0.88      1131
   macro avg       0.74      0.60      0.63      1131
weighted avg       0.86      0.88      0.86      1131

---------
Classification Report - Undersampled Data
              precision    recall  f1-score   support

          no       0.97      0.80      0.88       988
         yes       0.37      0.80      0.51       143

    accuracy                           0.80      1131
   macro avg       0.67      0.80      0.69      1131
weighted avg       0.89      0.80      0.83      1131
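
The f1-score column is the harmonic mean of precision and recall, which is why the "yes" f1-score rises from 0.32 to 0.51 even though precision drops. A small check of that arithmetic, using the rounded "yes" values from the two reports above (`f1_from` is a helper name introduced here for illustration):

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded "yes" precision and recall copied from the reports above
f1_original = f1_from(0.58, 0.22)
f1_undersampled = f1_from(0.37, 0.80)

print(f"original: {f1_original:.2f}, undersampled: {f1_undersampled:.2f}")
```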



---

## Random Oversampler

### 14. Import `RandomOverSampler` from `imblearn`.

In [16]:
# Import RandomOverSampler from imblearn
from imblearn.over_sampling import RandomOverSampler

# Instantiate a RandomOverSampler instance
ros = RandomOverSampler(random_state=1)


### 15. Fit the random oversampler to the scaled training data.

In [17]:
# Resample the scaled training data with the random oversampler
X_oversampled, y_oversampled = ros.fit_resample(X_train_scaled, y_train)

### 16. Check the `value_counts` for the resampled target.

In [18]:
# Count distinct values
y_oversampled.value_counts()

no     3012
yes    3012
Name: y, dtype: int64
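
Here the classes are balanced the other way: every original row is kept, and minority-class rows are duplicated at random until they match the majority count. A minimal pandas sketch of that idea follows, assuming a hypothetical toy DataFrame (`toy`); it illustrates the technique and is not how `imblearn` is implemented.

```python
import pandas as pd

# Hypothetical toy training set: 8 "no" rows and 2 "yes" rows
toy = pd.DataFrame({
    "feature": range(10),
    "y": ["no"] * 8 + ["yes"] * 2,
})

# Size of the largest class
n_majority = toy["y"].value_counts().max()

parts = []
for _, group in toy.groupby("y"):
    n_extra = n_majority - len(group)
    if n_extra > 0:
        # Draw duplicates with replacement and append them to the originals
        extra = group.sample(n=n_extra, replace=True, random_state=1)
        group = pd.concat([group, extra])
    parts.append(group)

oversampled = pd.concat(parts)
print(oversampled["y"].value_counts())
```

No new information is created: the extra "yes" rows are exact copies of existing ones, which may help explain why oversampling changes the model's behavior less than undersampling does in this activity.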

### 17. Create and fit a `RandomForestClassifier` to the **oversampled** training data.

In [19]:
# Instantiate a new RandomForestClassifier model
model_oversampled = RandomForestClassifier()

# Fit the new model to the oversampled data
model_oversampled.fit(X_oversampled, y_oversampled)

### 18. Make predictions using the scaled testing data.

In [20]:
# Predict labels for the scaled testing features using the oversampled model
y_pred_oversampled = model_oversampled.predict(X_test_scaled)

### 19. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the undersampled data
  * Print a classification report for the model fitted to the oversampled data

In [21]:
# Print classification reports
print("Classification Report - Original Data")
print(classification_report(y_test, y_pred))
print("---------")
print("Classification Report - Undersampled Data")
print(classification_report(y_test, y_pred_undersampled))
print("---------")
print("Classification Report - Oversampled Data")
print(classification_report(y_test, y_pred_oversampled))

Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.90      0.98      0.94       988
         yes       0.58      0.22      0.32       143

    accuracy                           0.88      1131
   macro avg       0.74      0.60      0.63      1131
weighted avg       0.86      0.88      0.86      1131

---------
Classification Report - Undersampled Data
              precision    recall  f1-score   support

          no       0.97      0.80      0.88       988
         yes       0.37      0.80      0.51       143

    accuracy                           0.80      1131
   macro avg       0.67      0.80      0.69      1131
weighted avg       0.89      0.80      0.83      1131

---------
Classification Report - Oversampled Data
              precision    recall  f1-score   support

          no       0.90      0.98      0.94       988
         yes       0.61      0.26      0.36       143

    accuracy                           0.89 

# Interpretations

Which of these techniques yielded the best results?

ANSWER: Judged by accuracy alone, the models fitted to the original data (0.88) and the oversampled data (0.89) outperform the model fitted to the undersampled data (0.80). However, the instructions ask us to focus on recall for the minority "yes" class. The model fitted to the original data had a poor 0.22 recall for the "yes" class. Oversampling yielded a slight improvement, to 0.26. Undersampling showed a far larger improvement: 0.80, at the cost of "yes" precision falling from 0.58 to 0.37. Of these three options, undersampling performed best on the metric we were asked to optimize.
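
The minority-class recall can also be pulled out as a single number instead of being read from the full report. A minimal sketch using `recall_score` with `pos_label` follows; the labels and predictions are hypothetical toy values, not results from the bank data.

```python
from sklearn.metrics import recall_score

# Hypothetical toy labels and predictions (not from the bank data)
y_true = ["no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes"]
y_hat = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes", "no"]

# Recall for "yes": correctly predicted "yes" / all actual "yes" samples
yes_recall = recall_score(y_true, y_hat, pos_label="yes")

print(f"{yes_recall:.2f}")
```

In the notebook, the same call would take `y_test` and one of `y_pred`, `y_pred_undersampled`, or `y_pred_oversampled`.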