# Random Sampling

In this activity, you will use the provided dataset from a bank's telemarketing campaign to compare the effectiveness of random resampling methods. You will fit one random forest to the original training data and others to randomly resampled training data, then compare each model's recall on the minority class.
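Recall for the minority class is the share of actual minority-class rows that the model labels correctly. The following is a minimal sketch of that calculation using made-up labels (not drawn from the bank dataset); the helper name `minority_recall` is hypothetical and exists only for illustration.

```python
def minority_recall(y_true, y_pred, positive="yes"):
    """Recall = true positives / all actual positives."""
    true_pos = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Made-up labels for illustration only
y_true_demo = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]
y_pred_demo = ["no", "no", "yes", "no", "no", "no", "yes", "no"]

# Two of the four actual "yes" rows are predicted as "yes"
recall_demo = minority_recall(y_true_demo, y_pred_demo)
print(recall_demo)
```

A model can score high accuracy on imbalanced data while its minority recall stays low, which is why this activity compares recall rather than accuracy alone.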

## Instructions:

1. Read the CSV file into a Pandas DataFrame.

2. Separate the features `X` from the target `y`.

3. Encode the categorical variables from the features data using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

4. Separate the data into training and testing subsets.

5. Scale the data using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

**RandomForestClassifier**

6. Create and fit a `RandomForestClassifier` to the **scaled** training data.

7. Make predictions using the scaled testing data.

**Random Undersampler**

8. Import `RandomUnderSampler` from `imblearn`.

9. Fit the random undersampler to the scaled training data.

10. Check the `value_counts` for the resampled target.

11. Create and fit a `RandomForestClassifier` to the **undersampled** training data.

12. Make predictions using the scaled testing data.

13. Generate and compare classification reports for each model.

**Random Oversampler**

14. Import `RandomOverSampler` from `imblearn`.

15. Fit the random oversampler to the scaled training data.

16. Check the `value_counts` for the resampled target.

17. Create and fit a `RandomForestClassifier` to the **oversampled** training data.

18. Make predictions using the scaled testing data.

19. Generate and compare classification reports for each model.

## Prepare the Data

In [1]:
# Import modules
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

### 1. Read the CSV file into a Pandas DataFrame

In [2]:
# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(Path("../Resources/bank.csv"))

# Review the DataFrame
df

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no


In [3]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')

### 2. Separate the features, `X`, from the target, `y`, data.

In [5]:
# Split the features and target data
y = df['y']
X = df.drop(columns='y')

# Review the features DataFrame
X


,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other


### 3. Encode categorical variables with `get_dummies`

In [6]:
# Encode the features dataset's categorical variables using get_dummies
X = pd.get_dummies(X, drop_first=True)

# Review the features DataFrame
X

,age,balance,day,duration,campaign,pdays,previous,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,79,1,-1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
1,33,4789,11,220,1,339,4,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,35,1350,16,185,1,330,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,30,1476,3,199,4,-1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
4,59,0,5,226,1,-1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,-333,30,329,5,-1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4517,57,-3313,9,153,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
4518,57,295,19,151,11,-1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4519,28,1137,6,129,4,211,3,1,0,0,...,0,0,0,0,0,0,0,1,0,0
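To see what `drop_first=True` does, here is a small sketch on a made-up DataFrame (the `toy` data is an assumption for illustration, not part of the bank dataset). Passing `dtype=int` is also an assumption made here so the dummy columns print as 0/1 integers; recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy features for illustration only (not the bank dataset)
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "divorced"],
})

# drop_first=True drops the first category (in sorted order) of each
# encoded column; dtype=int requests 0/1 integers rather than booleans
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_encoded.columns.tolist())
```

The dropped category ("divorced" here) is still recoverable: it is the row where every remaining `marital_*` column is 0.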


### 4. Split the data into training and testing sets

In [7]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
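Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows for testing. The sketch below shows this on made-up data; the names `features_demo` and `target_demo` are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only
features_demo = [[i] for i in range(8)]
target_demo = ["no", "no", "no", "yes", "no", "no", "yes", "no"]

# With no test_size given, scikit-learn holds out 25% of the rows for testing
f_train, f_test, t_train, t_test = train_test_split(
    features_demo, target_demo, random_state=1
)
print(len(f_train), len(f_test))
```

Fixing `random_state` makes the split reproducible, so reruns of the notebook compare models on the same training and testing rows.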


In [23]:
# Review the distinct values from y
y_train.value_counts()


no     3012
yes     378
Name: y, dtype: int64
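The counts above show how imbalanced the training target is. As a quick arithmetic check, this sketch computes the minority share directly from those two counts (the variable names are illustrative only):

```python
# Training-set class counts reported above
train_counts = {"no": 3012, "yes": 378}

total = sum(train_counts.values())
minority_share = train_counts["yes"] / total
print(f"Minority share: {minority_share:.1%}")
```

Roughly one training row in nine belongs to the `yes` class, which is the imbalance the resampling steps below try to address.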

### 5. Scale the data using `StandardScaler`

In [25]:
# Instantiate a StandardScaler instance
scaler = StandardScaler()

# Fit the standard scaler to the training data
X_scaler = scaler.fit(X_train)

# Transform the training data using the scaler
X_train_scaled = X_scaler.transform(X_train)

# Transform the testing data using the scaler
X_test_scaled = X_scaler.transform(X_test)
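The scaler is fitted on the training data only, and the same learned mean and standard deviation are then applied to the testing data. A minimal sketch with made-up numbers (the `*_demo` names are assumptions for illustration) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric features for illustration only
train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_demo = np.array([[2.0, 400.0]])

# Learn the mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(train_demo)

# Apply the same training statistics to both subsets
train_demo_scaled = demo_scaler.transform(train_demo)
test_demo_scaled = demo_scaler.transform(test_demo)

print(train_demo_scaled.mean(axis=0))  # approximately zero for each column
print(test_demo_scaled)
```

The scaled training columns are centered on zero, while the testing row is expressed relative to the training statistics, so no information from the testing set leaks into preprocessing.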

---

## RandomForestClassifier

### 6. Create and fit a `RandomForestClassifier` to the **scaled** training data.

In [27]:
# Import the RandomForestClassifier from sklearn
from sklearn.ensemble import RandomForestClassifier

# Instantiate a RandomForestClassifier instance
model = RandomForestClassifier(n_estimators=500, random_state=78)

# Fit the model to the training data
model = model.fit(X_train_scaled, y_train)


### 7. Make predictions using the scaled testing data.

In [28]:
# Predict labels for original scaled testing features
y_pred = model.predict(X_test_scaled)



---

## Random Undersampler

### 8. Import `RandomUnderSampler` from `imblearn`.

In [29]:
# Import RandomUnderSampler from imblearn
from imblearn.under_sampling import RandomUnderSampler

# Instantiate a RandomUnderSampler instance
rus = RandomUnderSampler(random_state=1)


### 9. Fit the random undersampler to the scaled training data.

In [32]:
# Fit the random undersampler to the scaled training data and resample it
X_resampled, y_resampled = rus.fit_resample(X_train_scaled, y_train)

### 10. Check the `value_counts` for the undersampled target.

In [33]:
# Count distinct values for the resampled target data
y_resampled.value_counts()

no     378
yes    378
Name: y, dtype: int64
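Conceptually, random undersampling keeps every minority-class row and a random subset of the majority class of the same size. The following hand-rolled sketch illustrates the idea in plain Python; it is not `imblearn`'s implementation, and the function name `random_undersample` is hypothetical.

```python
import random
from collections import Counter


def random_undersample(labels, seed=1):
    """Return row indices with every class cut down to the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target_size = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement down to the minority class size
        kept.extend(rng.sample(indices, target_size))
    return sorted(kept)


# Made-up imbalanced target for illustration only
labels_demo = ["no"] * 8 + ["yes"] * 2
kept_idx = random_undersample(labels_demo)
undersampled_counts = Counter(labels_demo[i] for i in kept_idx)
print(undersampled_counts)
```

The trade-off is that the discarded majority rows never reach the model, so the training set shrinks to twice the minority class size.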

### 11. Create and fit a `RandomForestClassifier` to the **undersampled** training data.

In [34]:
# Instantiate a new RandomForestClassifier model
model_resampled = RandomForestClassifier(n_estimators=500, random_state=78)

# Fit the new model to the undersampled data
model_resampled.fit(X_resampled, y_resampled)


RandomForestClassifier(n_estimators=500, random_state=78)

### 12. Make predictions using the scaled testing data.

In [35]:
# Predict labels for the scaled testing features using the undersampled model
y_pred_undersampled = model_resampled.predict(X_test_scaled)


### 13. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the undersampled data

In [36]:
# Print classification reports
print("Classification Report - Original Data")
print(classification_report(y_test, y_pred))
print("---------")
print("Classification Report - Undersampled Data")
print(classification_report(y_test, y_pred_undersampled))

Classification Report - Original Data
              precision    recall  f1-score   support

          no       0.89      0.98      0.93       988
         yes       0.56      0.20      0.29       143

    accuracy                           0.88      1131
   macro avg       0.73      0.59      0.61      1131
weighted avg       0.85      0.88      0.85      1131

---------
Classification Report - Undersampled Data
              precision    recall  f1-score   support

          no       0.97      0.80      0.88       988
         yes       0.38      0.81      0.51       143

    accuracy                           0.81      1131
   macro avg       0.67      0.81      0.70      1131
weighted avg       0.89      0.81      0.83      1131
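When comparing models, it can help to pull the minority-class recall out of each report programmatically rather than reading it off the printed table. This sketch uses the `output_dict=True` option of `classification_report` on made-up labels; the `pred_a` and `pred_b` predictions are assumptions for illustration and do not come from the models above.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration only
y_true_demo = ["no", "no", "no", "yes", "yes", "no", "yes", "yes"]
pred_a = ["no", "no", "no", "yes", "no", "no", "no", "no"]
pred_b = ["no", "yes", "no", "yes", "yes", "yes", "yes", "no"]

# output_dict=True returns the report as nested dictionaries keyed by class label
report_a = classification_report(y_true_demo, pred_a, output_dict=True)
report_b = classification_report(y_true_demo, pred_b, output_dict=True)

recall_a = report_a["yes"]["recall"]
recall_b = report_b["yes"]["recall"]
print(f"Minority recall: A={recall_a:.2f}, B={recall_b:.2f}")
```

In this toy case the second set of predictions finds more of the `yes` rows at the cost of more false alarms, the same precision and recall trade-off visible in the reports above.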



---

## Random Oversampler

### 14. Import `RandomOverSampler` from `imblearn`.

In [None]:
# Import RandomOverSampler from imblearn
from imblearn.over_sampling import RandomOverSampler

# Instantiate a RandomOverSampler instance
ros = # YOUR CODE HERE


### 15. Fit the random oversampler to the scaled training data.

In [None]:
# Fit the random oversampler to the scaled training data and resample it
X_oversampled, y_oversampled = # YOUR CODE HERE


### 16. Check the `value_counts` for the resampled target.

In [None]:
# Count distinct values
# YOUR CODE HERE
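Conceptually, random oversampling keeps every original row and adds randomly chosen duplicates of minority-class rows until the classes are the same size. The hand-rolled sketch below illustrates the idea in plain Python on made-up labels; it is not `imblearn`'s implementation and is not a solution to the cells in this section. The function name `random_oversample` is hypothetical.

```python
import random
from collections import Counter


def random_oversample(labels, seed=1):
    """Return row indices with smaller classes padded by random duplicates."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target_size = max(len(indices) for indices in by_class.values())
    result = []
    for indices in by_class.values():
        # Keep every original row of the class
        result.extend(indices)
        # Draw extra rows with replacement to reach the majority class size
        extra = target_size - len(indices)
        result.extend(rng.choices(indices, k=extra))
    return result


# Made-up imbalanced target for illustration only
labels_demo = ["no"] * 8 + ["yes"] * 2
oversampled_idx = random_oversample(labels_demo)
oversampled_counts = Counter(labels_demo[i] for i in oversampled_idx)
print(oversampled_counts)
```

Unlike undersampling, no rows are discarded, but the repeated minority rows carry no new information, so a model can overfit to them.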


### 17. Create and fit a `RandomForestClassifier` to the **oversampled** training data.

In [None]:
# Instantiate a new RandomForestClassifier model
model_oversampled = # YOUR CODE HERE

# Fit the new model to the oversampled data
# YOUR CODE HERE


### 18. Make predictions using the scaled testing data.

In [None]:
# Predict labels for the scaled testing features using the oversampled model
y_pred_oversampled = # YOUR CODE HERE


### 19. Generate and compare classification reports for each model.
  * Print a classification report for the model fitted to the original data
  * Print a classification report for the model fitted to the undersampled data
  * Print a classification report for the model fitted to the oversampled data

In [None]:
# Print classification reports
print("Classification Report - Original Data")
print(# YOUR CODE HERE)
print("---------")
print("Classification Report - Undersampled Data")
print(# YOUR CODE HERE)
print("---------")
print("Classification Report - Oversampled Data")
print(# YOUR CODE HERE)