# Module 5 Assignment


A few things you should keep in mind when working on assignments:

1. Before you submit your assignment, make sure everything runs as expected. From the menu bar, select Kernel, then Restart & Run All.
2. Make sure that you save your work.
3. Upload your notebook to Compass.

-----


# Prepare MPG Data

In this assignment, we will use the mpg dataset to make a regression model. Before we attempt to build a model, we first prepare the data.

Please run the next code cell before proceeding to Problem 1.

-----

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Load MPG dataset
mpg = pd.read_csv('mpg.csv')
mpg.dropna(inplace=True)
mpg['origin'] = LabelEncoder().fit_transform(mpg.origin)
y = mpg['mpg']
x = mpg[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin']]

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=23)
x_train.sample(2)

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
162,6,231.0,110.0,3039,15.0,75,2
374,4,120.0,88.0,2160,14.5,82,1


---
# Problem 1: Get Feature Ranking by Recursive Feature Elimination

Perform RFE on a Random Forest Regressor and retrieve feature rankings.

This problem will use **x** and __y__ created above.

To solve this problem do the following:
1. Import needed modules.
2. Create a `RandomForestRegressor` estimator. Set `n_estimators` to 100 and accept default values for all other hyperparameters.
3. Create a Recursive Feature Elimination selector `RFE` using the random forest regressor created in step 2 as the `estimator`, and set `n_features_to_select` to 1. Accept default values for other arguments.
4. Fit the RFE estimator using **x** and __y__.
5. Display feature rankings.
 - Retrieve feature rankings from the `RFE` selector's `ranking_` attribute.
 - Zip the feature rankings with the column names of **x** and sort the zipped object by ranking.
 - Print out the feature rankings in `column_name rank = ranking` format (e.g., `displacement rank = 1`).


In [2]:
# Your answer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rfr = RandomForestRegressor(n_estimators=100)
rfe = RFE(rfr, n_features_to_select=1)
rfe.fit(x, y)
for rank, name in sorted(zip(rfe.ranking_, x.columns), key=lambda pair: pair[0]):
    print(f'{name:>18} rank = {rank}')

      displacement rank = 1
            weight rank = 2
        horsepower rank = 3
         cylinders rank = 4
        model_year rank = 5
      acceleration rank = 6
            origin rank = 7
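
Because the `RandomForestRegressor` above is created without a `random_state`, the ranking can change from run to run. The following is a minimal sketch of how `ranking_` and `support_` behave when the seed is fixed. It uses synthetic data from `make_regression` as a stand-in (an assumption, not the mpg dataset), and the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in data (assumption): 5 features, only 2 carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, n_informative=2, noise=1.0, random_state=0
)

# Fixing random_state makes the forest, and therefore the ranking, repeatable.
demo_rfe = RFE(
    RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=1,
)
demo_rfe.fit(X_demo, y_demo)

# ranking_ gives 1 to the selected feature; larger values were eliminated earlier.
demo_ranking = demo_rfe.ranking_.tolist()
# support_ is a boolean mask marking the selected feature(s).
demo_support = demo_rfe.support_

for idx, rank in sorted(enumerate(demo_ranking), key=lambda pair: pair[1]):
    print(f'feature_{idx} rank = {rank}')
```

With `n_features_to_select=1` and the default `step=1`, one feature is dropped per round, so each feature receives a distinct rank.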


---

# Problem 2: Get the $R^2$ Score of a Random Forest Regressor

This problem will use **x_train, x_test, y_train** and __y_test__ created above.

Your task for this problem is to build and train a `RandomForestRegressor` estimator on mpg data and calculate the estimator's $R^2$ score.

To solve this problem do the following:
1. Import needed modules.
2. Create a `RandomForestRegressor` estimator **rfr**. Set `n_estimators` to 100 and accept default values for all other hyperparameters.
3. Fit the RandomForestRegressor estimator using x_train and y_train.
4. Call rfr's `predict` method on x_test to get the predicted mpg, and save it as **y_pred**.
5. Use the `r2_score` function with y_test and y_pred to get $R^2$, and display the $R^2$ score.

-----

In [3]:
# Your answer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(x_train, y_train)
y_pred = rfr.predict(x_test)
r2 = r2_score(y_test, y_pred)
r2

0.841259293744281
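
For reference, $R^2$ is defined as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below computes this by hand with NumPy. The helper name `r2_manual` and the toy values are hypothetical and are not part of the assignment.

```python
import numpy as np

def r2_manual(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Toy values (assumption), chosen so the arithmetic is easy to follow:
# SS_res = 0.5 and SS_tot = 20.
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.5, 5.5, 7.0, 9.0]
print(round(r2_manual(toy_true, toy_pred), 3))
```

A perfect prediction gives 1, and always predicting the mean of the true values gives 0.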

---

# Problem 3: Get the Cross Validation Scores

Get the cross validation scores for a random forest regressor.

This problem will use **x** and __y__ created above.

To solve this problem do the following:
1. Import needed modules.
2. Create a `RandomForestRegressor` estimator. Set `n_estimators` to 100 and accept default values for all other hyperparameters.
3. Create a `KFold` iterator. Set `n_splits` to 5.
4. Calculate cross validation scores using the `cross_val_score` function with the random forest regressor, x, y and the `KFold` iterator. Assign the scores to variable **cv_scores**.
5. Use NumPy's `mean()` function to calculate the average cross validation score and display the average score.

-----

In [4]:
# Your answer
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

rfr = RandomForestRegressor(n_estimators=100)
kf = KFold(n_splits=5)
cv_scores = cross_val_score(rfr, x, y, cv=kf)
np.mean(cv_scores)

0.7498041953962813
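
To make the mechanics of `cross_val_score` concrete, here is a rough sketch of the same computation written as an explicit loop over the `KFold` splits. It uses synthetic data from `make_regression` (an assumption, not the mpg dataset) and a fixed `random_state` so the two approaches can be compared. All variable names are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption).
X_cv, y_cv = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
base_model = RandomForestRegressor(n_estimators=20, random_state=0)
kf_demo = KFold(n_splits=5)  # no shuffling by default: folds are consecutive blocks

manual_scores = []
for train_idx, test_idx in kf_demo.split(X_cv):
    fold_model = clone(base_model)                    # fresh, unfitted copy per fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])  # train on 4 of the 5 folds
    # For regressors, score() returns R^2 on the held-out fold.
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))

library_scores = cross_val_score(base_model, X_cv, y_cv, cv=kf_demo)
print(np.mean(manual_scores), np.mean(library_scores))
```

Each score is an $R^2$ value computed on data the fold's model did not see during fitting, which is why the average can be lower than a single train/test split suggests.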

-----

# Prepare Breast Cancer Data

For the next two problems we will use the breast cancer dataset. Before we attempt to build models, we first prepare the data.

Please run the next code cell before proceeding to Problem 4.

-----

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load breast cancer dataset
df = pd.read_csv('breast-cancer-wisconsin.csv')
# Reduce data size
df = df.sample(200, random_state=23)
label = df['class']
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]
data.head(2)

Unnamed: 0,clump thickness,uniformity cell size,uniformity cell shape,marginal adhesion,epithelial cell size,bare nuclei,bland chromatin,normal nucleoli,mitoses
117,3,2,1,1,2,2,3,1,1
29,3,1,1,1,1,1,2,1,1


---

# Problem 4: Find Best Estimator with Grid Search

Conduct grid search cross validation on the random forest classifier and get the best estimator.

This problem will use **data** and __label__ created above.

To solve this problem do the following:
1. Import needed modules.
2. Create a `RandomForestClassifier` estimator.
3. Create a `StratifiedKFold` iterator with `n_splits` set to 5.
4. Create variable **estimators** with value `[20, 40, 60, 80, 100]`.
5. Create a parameter dictionary **params** with the key `'n_estimators'` and, as its value, the list `estimators` created in the previous step.
6. Create `GridSearchCV` object **gse**:
 - Set `estimator` to the random forest classifier estimator.
 - Set `param_grid` to the parameter dictionary `params`.
 - Set `cv` to the stratified k-fold iterator created in step 3.
7. Fit the `GridSearchCV` object created in the previous step using **data** and __label__.
8. Print out the optimum `n_estimators`.
 - Retrieve the best estimator's parameters by calling the `get_params()` method on the `GridSearchCV` object's `best_estimator_` attribute.
 - Get the optimum n_estimators value with the key `n_estimators`.
9. Retrieve the best cross validation score from the `GridSearchCV` object's `best_score_` attribute and print out the best score.

-----

In [6]:
# Your answer

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

rfc = RandomForestClassifier()
estimators = [20, 40, 60, 80, 100]
skf = StratifiedKFold(n_splits=5)

# Create a dictionary of hyperparameters and values
params = {'n_estimators': estimators}

# Create grid search cross validator
gse = GridSearchCV(estimator=rfc, param_grid=params, cv=skf)

# Fit estimator
gse.fit(data, label)

print(f'Best n_estimators={gse.best_estimator_.get_params()["n_estimators"]}')
print(f'Best CV Score = {gse.best_score_:4.3f}')

Best n_estimators=40
Best CV Score = 0.975
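
As a supplementary sketch, a fitted `GridSearchCV` object also exposes `best_params_` and `cv_results_`, which give the winning hyperparameters and the mean score of every candidate. The example uses synthetic data from `make_classification` and a smaller grid (both assumptions, not the breast cancer data). The `demo_search` name is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption).
X_gs, y_gs = make_classification(n_samples=120, n_features=6, random_state=0)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 20, 30]},
    cv=StratifiedKFold(n_splits=3),
)
demo_search.fit(X_gs, y_gs)

# best_params_ holds only the searched hyperparameters of the best candidate.
best_n = demo_search.best_params_['n_estimators']

# cv_results_ records every candidate together with its mean CV score.
candidates = demo_search.cv_results_['params']
mean_scores = demo_search.cv_results_['mean_test_score']
for candidate, score in zip(candidates, mean_scores):
    print(f"n_estimators={candidate['n_estimators']:3d} mean CV score = {score:.3f}")
print(f'Best n_estimators={best_n}')
```

Looking at all candidates, rather than only the best one, shows how sensitive the score is to `n_estimators`.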


---

# Problem 5: Find Best Estimator with Randomized Search

Conduct randomized search cross validation on the random forest classifier and get the best estimator.

This problem will use **data** and __label__ created above.

To solve this problem do the following:
1. Import needed modules.
2. Create a `RandomForestClassifier` estimator.
3. Create a `StratifiedKFold` iterator with `n_splits` set to 5.
4. Create variable **estimators** with value `range(20,100)`.
5. Create variable __weights__ with value `[None, 'balanced']`.
6. Create a parameter dictionary **params** with two keys, `'n_estimators'` and `'class_weight'`, whose values are `estimators` and `weights` created in the previous steps.
7. Create `RandomizedSearchCV` object **rgse**.
 - Set `estimator` to the random forest classifier estimator.
 - Set `param_distributions` to the parameter dictionary `params`.
 - Set `cv` to the stratified k-fold iterator.
 - Set `n_iter` to 5.
8. Fit the `RandomizedSearchCV` object created in the previous step using **data** and __label__.
9. Print out the optimum `n_estimators` and `class_weight`.
 - Retrieve the best estimator's parameters by calling the `get_params()` method on the `RandomizedSearchCV` object's `best_estimator_` attribute.
 - Get the optimum n_estimators value with the key `n_estimators`.
 - Get the optimum class weight value with the key `class_weight`.
10. Retrieve the best cross validation score from the `RandomizedSearchCV` object's `best_score_` attribute and print out the best score.

-----

In [7]:
# Your answer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

rfc = RandomForestClassifier()
skf = StratifiedKFold(n_splits=5)

estimators = range(20, 100)
weights = [None, 'balanced']
# Create a dictionary of hyperparameters and values
params = {'n_estimators':estimators, 'class_weight':weights}

# Create randomized search cross validator
rgse = RandomizedSearchCV(estimator=rfc, param_distributions=params, n_iter=5, cv=skf)

# Fit estimator
rgse.fit(data, label)

print(f'Best n_estimators={rgse.best_estimator_.get_params()["n_estimators"]}')
print(f'Best class_weight={rgse.best_estimator_.get_params()["class_weight"]}')
print(f'Best CV Score = {rgse.best_score_:5.3f}')

Best n_estimators=69
Best class_weight=None
Best CV Score = 0.975
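
Unlike a grid search, `RandomizedSearchCV` evaluates only `n_iter` sampled parameter combinations, so its result depends on which ones are drawn. The sketch below fixes `random_state` on both the search and the classifier to make the draw repeatable. It uses synthetic data from `make_classification` and a smaller search space (both assumptions, not the breast cancer data). The `demo_rsearch` name is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption).
X_rs, y_rs = make_classification(n_samples=120, n_features=6, random_state=0)

demo_rsearch = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={
        'n_estimators': list(range(10, 40)),
        'class_weight': [None, 'balanced'],
    },
    n_iter=4,                        # only 4 of the 60 combinations are tried
    cv=StratifiedKFold(n_splits=3),
    random_state=0,                  # fixes which combinations are sampled
)
demo_rsearch.fit(X_rs, y_rs)

tried = demo_rsearch.cv_results_['params']
print(f'Combinations tried: {len(tried)}')
print(f'Best params: {demo_rsearch.best_params_}')
print(f'Best CV Score = {demo_rsearch.best_score_:5.3f}')
```

The best score here is the best among the sampled candidates only, so a different seed or a larger `n_iter` may find a better combination.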
