# Module 5 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [None]:
import pandas as pd
import numpy as np

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")

-----

# Prepare MPG Data

In this assignment, we will use the mpg dataset to make a regression model. Before we attempt to build a model, we first prepare the data.

Please run the next code cell before proceeding to Problem 1.
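
As a rough illustration of the two preparation steps used in the next cell (label encoding and a train/test split), here is a minimal sketch on a made-up table. The `toy` DataFrame, its values, and every `toy_*` name are hypothetical and chosen so they do not overwrite the variables the assignment relies on.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; these values are made up for illustration only
toy = pd.DataFrame({
    'weight': [2100, 3500, 2800, 4100, 1900, 3000, 2600, 3900, 2200, 3300],
    'origin': ['usa', 'japan', 'europe', 'usa', 'japan',
               'usa', 'europe', 'usa', 'japan', 'europe'],
    'mpg': [31.0, 18.5, 24.0, 14.0, 35.5, 21.0, 26.5, 15.5, 30.0, 19.0],
})

# LabelEncoder maps each distinct string to an integer code
# (the classes are stored in sorted order)
encoder = LabelEncoder()
toy['origin'] = encoder.fit_transform(toy['origin'])

toy_x = toy[['weight', 'origin']]
toy_y = toy['mpg']

# Hold out 40% of the rows for testing; a fixed random_state
# makes the split reproducible
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.4, random_state=23)

print(list(encoder.classes_))
print(toy_x_train.shape, toy_x_test.shape)
```

The integer codes impose an arbitrary order on the categories, which tree-based models such as random forests generally tolerate.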

-----

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Load MPG dataset
mpg = pd.read_csv('data/mpg.csv')
mpg.dropna(inplace=True)
mpg['origin'] = LabelEncoder().fit_transform(mpg.origin)
y = mpg['mpg']
x = mpg[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin']]

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=23)
x_train.sample(2)

---

# Problem 1: Get Feature Ranking by Recursive Feature Elimination

Perform RFE on a Random Forest Regressor and retrieve feature rankings.

This problem will use **x** and __y__ created above.

To solve this problem do the following:
- Create a `RandomForestRegressor` estimator. Set `n_estimators` to 100, `random_state` to 23 and accept default values for all other hyperparameters.
- Create a Recursive Feature Elimination selector (`RFE`), passing the random forest regressor from the previous step as the `estimator` and setting `n_features_to_select` to 1. Accept default values for all other arguments.
- Fit the `RFE` selector using **x** and __y__.
- Retrieve feature rankings from the `RFE` selector's `ranking_` attribute and assign it to variable **feature_ranking**.

After this problem, there's a new variable **feature_ranking** defined.
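
To show the general shape of this workflow without using the mpg data, the following is a minimal sketch on synthetic data. The data from `make_regression`, the smaller forest, and all `demo_*` names are assumptions made for illustration; they are not part of the assignment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic regression data (an assumption for illustration, not the mpg data)
demo_x, demo_y = make_regression(
    n_samples=120, n_features=5, n_informative=2, random_state=0)

demo_forest = RandomForestRegressor(n_estimators=20, random_state=0)

# With n_features_to_select=1 and the default step of 1, RFE drops one
# feature per round until a single feature remains, so ranking_ is a
# permutation of 1..n_features (rank 1 = the feature kept to the end)
demo_selector = RFE(estimator=demo_forest, n_features_to_select=1)
demo_selector.fit(demo_x, demo_y)

demo_ranking = demo_selector.ranking_
print(demo_ranking)
```

A lower rank means the feature survived more elimination rounds, which is why the autograder cell sorts features by rank before printing them.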

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# YOUR CODE HERE


In [None]:
assert_equal(feature_ranking.tolist(), [3, 1, 5, 2, 6, 4, 7])
# Display feature ranking
print('Feature Ranking:')
for var, name in sorted(zip(feature_ranking, x.columns), key=lambda x: x[0]):
    print(f'{name:>12} = {var}')

---

# Problem 2: Get the $R^2$ Score of a Random Forest Regressor

This problem will use **x_train, x_test, y_train** and __y_test__ created above.

Your task for this problem is to build and train a `RandomForestRegressor` estimator on mpg data and calculate the estimator's $R^2$ score.

To solve this problem do the following:
- Create a `RandomForestRegressor` estimator **rfr**. Set `n_estimators` to 100, `random_state` to 23 and accept default values for all other hyperparameters.
- Fit **rfr** using **x_train** and **y_train**.
- Call the `predict` method of **rfr** on **x_test** to get the predicted mpg values and save them as **y_pred**.
- Call the `r2_score` function with **y_test** and **y_pred** to get $R^2$ and assign it to variable **r2**.

After this problem, there will be a new variable **r2** defined.
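
For reference, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below checks that `r2_score` matches this definition on synthetic data; the data and all `demo_*` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the mpg data)
demo_x, demo_y = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=0)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.4, random_state=0)

demo_model = RandomForestRegressor(n_estimators=20, random_state=0)
demo_model.fit(demo_x_train, demo_y_train)
demo_pred = demo_model.predict(demo_x_test)

# r2_score expects the true values first, then the predictions
demo_r2 = r2_score(demo_y_test, demo_pred)

# The same quantity computed by hand: 1 - SS_res / SS_tot
ss_res = np.sum((demo_y_test - demo_pred) ** 2)
ss_tot = np.sum((demo_y_test - np.mean(demo_y_test)) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(demo_r2, manual_r2)
```

Because the argument order matters, swapping the true and predicted values generally gives a different number.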

-----

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# YOUR CODE HERE


In [None]:
assert_almost_equal(r2, 0.8441667341695959, msg="R2 score is not correct")
print(f'R2 score: {r2:4.2f}')

---

# Problem 3: Get the Cross Validation Scores

Get the cross validation scores for a random forest regressor.

This problem will use **x** and __y__ created above.

To solve this problem do the following:
- Create a `RandomForestRegressor` estimator. Set `n_estimators` to 100, `random_state` to 23 and accept default values for all other hyperparameters.
- Create a `KFold` iterator. Set `n_splits` to 5 and `random_state` to 23.
- Calculate cross validation scores by calling the `cross_val_score` function with the random forest regressor, **x**, __y__, and the `KFold` iterator as the `cv` argument. Assign the scores to variable **cv_scores**.

After this problem, there's a new variable **cv_scores** defined.
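
The following is a minimal sketch of k-fold cross validation on synthetic data. The data and all `demo_*` names are assumptions for illustration. The sketch also sets `shuffle=True`, which the steps above do not ask for: recent scikit-learn releases only accept a `random_state` on `KFold` together with `shuffle=True`, so treat that argument as an assumption of this example rather than part of the expected answer.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration, not the mpg data)
demo_x, demo_y = make_regression(
    n_samples=100, n_features=4, noise=5.0, random_state=0)
demo_model = RandomForestRegressor(n_estimators=20, random_state=0)

# shuffle=True is an assumption of this sketch (see the note above)
demo_folds = KFold(n_splits=5, shuffle=True, random_state=0)

# For a regressor, the default score is R^2; one value is returned per fold
demo_scores = cross_val_score(demo_model, demo_x, demo_y, cv=demo_folds)

print(demo_scores.shape)
print(demo_scores.mean())
```

Each fold serves once as the held-out set, so the array has as many entries as `n_splits`.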

-----

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# YOUR CODE HERE


In [None]:
assert_almost_equal(cv_scores[0], 0.88705262, msg='Cross validation scores are not correct')
assert_almost_equal(cv_scores[2], 0.85913104, msg='Cross validation scores are not correct')
print(f"Average Cross Validation Score: {np.mean(cv_scores):4.2f}")

-----

# Prepare Breast Cancer Data

For the next two problems, we will use the breast cancer dataset. Before we attempt to build models, we first prepare the data.

Please run the next two code cells before proceeding to Problem 4.

-----

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load breast cancer dataset
df = pd.read_csv('data/breast-cancer-wisconsin.csv')
# Reduce data size
df = df.sample(200, random_state=23)
label = df['class']
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]

---

# Problem 4: Find Best Estimator with Grid Search

Conduct grid search cross validation on the random forest classifier and get the best estimator.

This problem will use **data** and __label__ created above.

To solve this problem do the following:
- Create a `RandomForestClassifier` estimator. Set `random_state` to 23.
- Create a `StratifiedKFold` iterator with `n_splits` set to 5 and `random_state` set to 23.
- Create variable **estimators** with value `[20, 40, 60, 80, 100]`.
- Create a parameter dictionary **params** with the key `'n_estimators'` and the list `estimators` from the previous step as its value.
- Create `GridSearchCV` object **gse**:
  - Set `estimator` to the random forest classifier estimator.
  - Set `param_grid` to the parameter dictionary `params`.
  - Set `cv` to the stratified k-fold iterator.
- Fit the `GridSearchCV` object created in the previous step using **data** and __label__.
- Retrieve the best estimator from the `GridSearchCV` object's `best_estimator_` attribute and assign it to variable **best_estimator_gs**.
- Retrieve the best cross validation score from the `GridSearchCV` object's `best_score_` attribute and assign it to variable **best_score_gs**.

After this problem, there will be five new variables defined: **estimators, params, gse, best_estimator_gs,** and __best_score_gs__.
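
As a rough illustration of how the pieces fit together, here is a minimal grid search sketch on synthetic data. The data from `make_classification`, the tiny grid, and all `demo_*` names are assumptions for illustration. As in the cross validation sketches elsewhere, `shuffle=True` is an assumption added because recent scikit-learn releases only accept a `random_state` on `StratifiedKFold` together with `shuffle=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data (an assumption for illustration, not the breast cancer data)
demo_data, demo_label = make_classification(
    n_samples=100, n_features=6, random_state=0)

demo_clf = RandomForestClassifier(random_state=0)

# shuffle=True is an assumption of this sketch
demo_folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Grid search tries every combination in the grid; here that is 2 candidates
demo_grid = {'n_estimators': [5, 10]}

demo_search = GridSearchCV(
    estimator=demo_clf, param_grid=demo_grid, cv=demo_folds)
demo_search.fit(demo_data, demo_label)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

After fitting, `best_estimator_` holds a classifier refit on the full data with the winning parameters, and `best_score_` is the mean cross validation score of that candidate.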

-----

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# YOUR CODE HERE


In [None]:
assert_true(20 in gse.get_params()['param_grid']['n_estimators'] and \
            40 in gse.get_params()['param_grid']['n_estimators'] and \
            60 in gse.get_params()['param_grid']['n_estimators'] and \
            80 in gse.get_params()['param_grid']['n_estimators'] and \
            100 in gse.get_params()['param_grid']['n_estimators'], msg='Option of n_estimators is not correct.')
assert_is_instance(gse.get_params()['cv'], StratifiedKFold, msg='Cross validation is not StratifiedKFold.')

print(f'Best n_estimators={best_estimator_gs.get_params()["n_estimators"]}')
print(f'Best CV Score = {best_score_gs:5.3f}')

---

# Problem 5: Find Best Estimator with Randomized Search

Conduct randomized search cross validation on the random forest classifier and get the best estimator.

This problem will use **data** and __label__ created above.

To solve this problem do the following:
- Create a `RandomForestClassifier` estimator. Set `random_state` to 23.
- Create a `StratifiedKFold` iterator with `n_splits` set to 5 and `random_state` set to 23.
- Create variable **estimators** with value `range(20,100)`.
- Create variable __weights__ with value `[None, 'balanced']`.
- Create a parameter dictionary **params** with two keys, `'n_estimators'` and `'class_weight'`, whose values are `estimators` and `weights` from the previous two steps.
- Create `RandomizedSearchCV` object **rgse**:
  - Set `estimator` to the random forest classifier estimator.
  - Set `param_distributions` to the parameter dictionary `params`.
  - Set `cv` to the stratified k-fold iterator.
  - Set `n_iter` to 5.
  - Set `random_state` to 23.
- Fit the `RandomizedSearchCV` object created in the previous step using **data** and __label__.
- Retrieve the best estimator from the `RandomizedSearchCV` object's `best_estimator_` attribute and assign it to variable **best_estimator_rgs**.
- Retrieve the best cross validation score from the `RandomizedSearchCV` object's `best_score_` attribute and assign it to variable **best_score_rgs**.

After this problem, there will be six new variables defined: **estimators, weights, params, rgse, best_estimator_rgs** and __best_score_rgs__.

**Note**: If you get warning messages after running your code, just ignore them.
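
Unlike an exhaustive grid search, a randomized search samples only `n_iter` parameter combinations from the search space. The sketch below shows this on synthetic data; the data, the small search space, and all `demo_*` names are assumptions for illustration, and `shuffle=True` is again an assumption added because recent scikit-learn releases only accept a `random_state` on `StratifiedKFold` together with `shuffle=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic data (an assumption for illustration, not the breast cancer data)
demo_data, demo_label = make_classification(
    n_samples=100, n_features=6, random_state=0)

demo_clf = RandomForestClassifier(random_state=0)

# shuffle=True is an assumption of this sketch
demo_folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# 10 x 2 = 20 possible combinations; only n_iter of them are evaluated
demo_space = {
    'n_estimators': list(range(5, 15)),
    'class_weight': [None, 'balanced'],
}

demo_search = RandomizedSearchCV(
    estimator=demo_clf, param_distributions=demo_space,
    n_iter=4, cv=demo_folds, random_state=0)
demo_search.fit(demo_data, demo_label)

print(len(demo_search.cv_results_['params']))
print(demo_search.best_params_)
```

Fixing `random_state` on the search object makes the sampled combinations reproducible between runs.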

-----

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

# YOUR CODE HERE


In [None]:
assert_equal(min(rgse.get_params()['param_distributions']['n_estimators']), 20, msg="n_estimator range is not correct.")
assert_equal(max(rgse.get_params()['param_distributions']['n_estimators']), 99, msg="n_estimator range is not correct.")
assert_is_instance(rgse.get_params()['estimator'], RandomForestClassifier, msg="estimator is not RandomForestClassifier")
assert_true('balanced' in rgse.get_params()['param_distributions']['class_weight'], msg="class_weight option is not correct.")
print(f'Best n_estimators={best_estimator_rgs.get_params()["n_estimators"]}')
print(f'Best class_weight={best_estimator_rgs.get_params()["class_weight"]}')
print(f'Best CV Score = {best_score_rgs:5.3f}')