# Module 4 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in every place that says `# YOUR CODE HERE`. Do not write your answer anywhere other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong with your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")

-----

# Prepare MPG Data

In this assignment, we will use the mpg dataset to make a regression model. Before we attempt to build a model, we first prepare the data.

Please run the next code cell before proceeding to Problem 1.

-----

In [2]:
from sklearn.model_selection import train_test_split
# Load the MPG dataset
mpg = pd.read_csv('data/mpg.csv')
y = mpg['mpg']
x = mpg[['cylinders', 'displacement', 'weight', 'acceleration', 'model_year']]

# Split data into training and testing sets (60% train, 40% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=23)
x_train.sample(2)

| | cylinders | displacement | weight | acceleration | model_year |
|---|---|---|---|---|---|
| 346 | 4 | 105.0 | 2215 | 14.9 | 81 |
| 23 | 4 | 121.0 | 2234 | 12.5 | 70 |


---

# Problem 1: Perform Random Forest Regression

For this problem, use x_train, y_train, x_test and y_test created above.

Your task for this problem is to build a scikit-learn `RandomForestRegressor` estimator that will be used to make predictions on the mpg dataset.  
To complete this problem, you must explicitly:  
- Create a `RandomForestRegressor` estimator **rfr_model** by using scikit-learn. Set `n_estimators` to 100, `random_state` to 23 and accept default values for all other hyperparameters.
- Fit the `RandomForestRegressor` estimator using x_train and y_train.

After this problem, there will be a trained Random Forest Regression model **rfr_model**.
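
As background on what the fitted estimator does, the sketch below illustrates that a random forest regressor's prediction is the average of the predictions of its individual trees. It uses a small synthetic dataset, not the mpg data, and the names `X_demo`, `y_demo` and `demo_model` are hypothetical and chosen only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical, not the mpg dataset):
# the target depends linearly on two features plus some noise.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=200)

# A small forest so the per-tree predictions are easy to inspect.
demo_model = RandomForestRegressor(n_estimators=10, random_state=0)
demo_model.fit(X_demo, y_demo)

# The fitted trees are stored in the estimators_ attribute.
per_tree = np.array([tree.predict(X_demo[:5]) for tree in demo_model.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = demo_model.predict(X_demo[:5])

# Compare the hand-computed average with the forest's own prediction.
print(np.allclose(manual_average, forest_prediction))
```

Increasing `n_estimators` averages over more trees, which tends to make predictions more stable at the cost of longer training time.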

-----

In [3]:
from sklearn.ensemble import RandomForestRegressor

### BEGIN SOLUTION
rfr_model = RandomForestRegressor(n_estimators=100, random_state=23)
rfr_model.fit(x_train, y_train)
### END SOLUTION

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=23, verbose=0,
                      warm_start=False)

In [4]:
assert_equal(type(rfr_model), type(RandomForestRegressor()), msg="rfr_model is not defined as a RandomForestRegressor model")
assert_equal(rfr_model.get_params()['random_state'], 23,
            msg="rfr_model is not created with random_state 23")
assert_equal(rfr_model.get_params()['n_estimators'], 100,
            msg="rfr_model is not created with 100 estimators")

---

# Problem 2: Calculate Regression Metrics

For this problem, you will compute the regression metrics for rfr_model created in problem 1.  
To complete this problem, you must explicitly:
- Call the `predict` method of rfr_model on x_test to get the predicted mpg, and save it as **y_pred**.
- Use y_test and y_pred to calculate:
    - The R2 score, using the `r2_score` function in the metrics module. Assign the value to **r2**.
    - The Mean Absolute Error (MAE), using the `mean_absolute_error` function in the metrics module. Assign the value to **mae**.
    - The Mean Squared Error (MSE), using the `mean_squared_error` function in the metrics module. Assign the value to **mse**.
    - The Root Mean Squared Error (RMSE), which is the square root of the MSE. Assign the value to **rmse**.

After this problem, there will be four new variables, **r2**, **mae**, **mse** and **rmse**, defined.
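
To make the definitions of these metrics concrete, here is a minimal sketch that computes them by hand in plain Python on a few made-up values. The toy numbers and the `_demo` variable names are assumptions for illustration only; they are unrelated to the mpg data and to the values the autograder expects.

```python
import math

# Toy values (hypothetical, not taken from the mpg dataset)
y_true = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.5, 5.0, 3.0, 8.0]
n = len(y_true)

# Residuals: true value minus predicted value
residuals = [t - p for t, p in zip(y_true, y_hat)]

# MAE: mean of the absolute residuals
mae_demo = sum(abs(r) for r in residuals) / n
# MSE: mean of the squared residuals
mse_demo = sum(r ** 2 for r in residuals) / n
# RMSE: square root of the MSE, in the same units as the target
rmse_demo = math.sqrt(mse_demo)

# R2: one minus the ratio of residual to total sum of squares
mean_true = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2_demo = 1 - ss_res / ss_tot

print(mae_demo, mse_demo, rmse_demo, round(r2_demo, 4))
# prints: 0.625 0.5625 0.75 0.8475
```

The scikit-learn functions named above compute the same quantities; in every case the true values go first and the predictions second.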

-----

In [5]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import math

### BEGIN SOLUTION
y_pred = rfr_model.predict(x_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
### END SOLUTION


In [6]:
assert_almost_equal(r2, 0.8320456240094305, msg="R2 score is not correct")
assert_almost_equal(mae, 2.076878980891719, msg="Mean absolute error is not correct")
assert_almost_equal(mse, 8.593142904458595, msg="Mean squared error is not correct")
assert_almost_equal(rmse, 2.931406301497388, msg="Root mean squared error is not correct")


---
# Prepare Breast Cancer Data

The next two problems will use the breast cancer dataset. Before we attempt to build a model, we first prepare the data.

Please run the next code cell before proceeding to Problem 3.


In [7]:
from sklearn.model_selection import train_test_split

# Load the breast cancer dataset
df = pd.read_csv('data/breast-cancer-wisconsin.csv')
label = df['class']
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]
d_train, d_test, l_train, l_test = train_test_split(data, label, test_size=0.3, random_state=23)
d_train.sample(2)

| | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 466 | 5 | 3 | 2 | 4 | 2 | 1 | 1 | 1 | 1 |
| 582 | 5 | 1 | 3 | 1 | 2 | 1 | 3 | 1 | 1 |


---

# Problem 3: Perform Random Forest Classification

For this problem, use d_train, l_train, d_test and l_test created above.

Your task for this problem is to build a scikit-learn `RandomForestClassifier` estimator that will be used to make predictions on the breast cancer dataset.  
To complete this problem, you must explicitly:  
- Create a `RandomForestClassifier` estimator **rfc_model** by using scikit-learn. Set `n_estimators` to 100, `random_state` to 23 and accept default values for all other hyperparameters.
- Fit the `RandomForestClassifier` estimator using d_train and l_train.

After this problem, there will be a trained Random Forest Classification model **rfc_model**.

-----

In [8]:
from sklearn.ensemble import RandomForestClassifier

### BEGIN SOLUTION
rfc_model = RandomForestClassifier(n_estimators=100, random_state=23)
rfc_model.fit(d_train, l_train)
### END SOLUTION

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=23, verbose=0,
                       warm_start=False)

In [9]:
assert_equal(type(rfc_model), type(RandomForestClassifier()), msg="rfc_model is not defined as a RandomForestClassifier model")
assert_equal(rfc_model.get_params()['random_state'], 23,
            msg="rfc_model is not created with random_state 23")
assert_equal(rfc_model.get_params()['n_estimators'], 100,
            msg="rfc_model is not created with 100 estimators")

---

# Problem 4: Calculate Classification Metrics

For this problem, you will compute the classification metrics of the rfc_model.  

To complete this problem, you must explicitly:

- Call the `predict` method of rfc_model on d_test to get the predicted labels, and assign them to the variable **l_pred**.
- Use l_test and l_pred to calculate:
    - The mean accuracy score, using the `accuracy_score` function in the `metrics` module. Save the value to **mas_score**.
    - The classification report, using the `classification_report` function in the `metrics` module. Save it to the variable **c_report**.

After this problem, there will be two new variables, **mas_score** and **c_report**, defined.
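
As a rough illustration of what these metrics measure, the sketch below computes the accuracy, and the precision, recall and support for one class, by hand in plain Python. The toy labels and all variable names are hypothetical; the codes 2 and 4 are used only because they match the class labels in this dataset.

```python
# Toy labels (hypothetical, not taken from the breast cancer dataset)
true_labels = [2, 4, 2, 2, 4]
pred_labels = [2, 4, 4, 2, 4]
pairs = list(zip(true_labels, pred_labels))

# Accuracy: fraction of predictions that match the true label
accuracy_demo = sum(t == p for t, p in pairs) / len(pairs)

# Counts for class 4, treated as the "positive" class
tp = sum(t == 4 and p == 4 for t, p in pairs)  # true positives
fp = sum(t != 4 and p == 4 for t, p in pairs)  # false positives
fn = sum(t == 4 and p != 4 for t, p in pairs)  # false negatives

# Precision: of the samples predicted as 4, how many really are 4
precision_4 = tp / (tp + fp)
# Recall: of the samples that really are 4, how many were found
recall_4 = tp / (tp + fn)
# Support: number of true samples of class 4
support_4 = tp + fn

print(accuracy_demo, round(precision_4, 2), recall_4, support_4)
# prints: 0.8 0.67 1.0 2
```

The classification report lists the same per-class precision, recall and support, together with the F1 score, which is the harmonic mean of precision and recall.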

-----

In [10]:
from sklearn import metrics

### BEGIN SOLUTION
l_pred = rfc_model.predict(d_test)
mas_score = metrics.accuracy_score(l_test, l_pred)
c_report = metrics.classification_report(l_test, l_pred)
### END SOLUTION

In [11]:
assert_almost_equal(mas_score, 0.9707317073170731, msg="Mean accuracy score is not correct")
assert_true('127' in c_report, msg="classification report is not correct")
assert_true('205' in c_report, msg="classification report is not correct")
print(c_report)

              precision    recall  f1-score   support

           2       0.97      0.98      0.98       127
           4       0.97      0.95      0.96        78

    accuracy                           0.97       205
   macro avg       0.97      0.97      0.97       205
weighted avg       0.97      0.97      0.97       205

