# Module 3 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, and Restart & Run all. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")


-----

# Predicting Breast Cancer

In this assignment, we will work with a breast cancer data set to build a classification model. Before we build the model, we first load the data into the assignment notebook and randomly sample several rows. Next, we display the DataFrame information. The data is clean and all columns are numerical.

Please run the next two code cells before proceeding to Problem 1.

-----

In [2]:
df = pd.read_csv('data/breast-cancer-wisconsin.csv')
df.sample(5)

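| | id | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 1137156 | 2 | 2 | 2 | 1 | 1 | 1 | 7 | 1 | 1 | 2 |
| 14 | 1044572 | 8 | 7 | 5 | 10 | 7 | 9 | 5 | 5 | 4 | 4 |
| 280 | 555977 | 5 | 6 | 6 | 8 | 6 | 10 | 4 | 10 | 4 | 4 |
| 220 | 1227481 | 10 | 5 | 7 | 4 | 4 | 10 | 8 | 9 | 1 | 4 |
| 262 | 390840 | 8 | 4 | 7 | 1 | 3 | 10 | 3 | 9 | 2 | 4 |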


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 11 columns):
id                       683 non-null int64
clump thickness          683 non-null int64
uniformity cell size     683 non-null int64
uniformity cell shape    683 non-null int64
marginal adhesion        683 non-null int64
epithelial cell size     683 non-null int64
bare nuclei              683 non-null int64
bland chromatin          683 non-null int64
normal nucleoli          683 non-null int64
mitoses                  683 non-null int64
class                    683 non-null int64
dtypes: int64(11)
memory usage: 58.8 KB


---

# Problem 1: Data Preprocessing

For this problem you will use the DataFrame **df** defined above.

To complete the task, do the following:
1. Choose column 'class' as the label and assign it to variable **label**. Note: because `class` is a reserved keyword in Python, `df.class` is a syntax error. Use `df['class']` instead.
2. Choose the following columns as training data and assign them to variable **data**:  
'clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses'.   
__data__ should be a DataFrame.
3. Split the independent and dependent variables into training and testing sets.
    - Assign the training and testing data to variables d_train and d_test.
    - Assign the training and testing labels to variables l_train and l_test.
    - The `test_size` argument in `train_test_split` should be set to 0.3.
    - The `random_state` argument in `train_test_split` should be set to 23.

After this problem, there will be six new variables defined: data, label, d_train, d_test, l_train, and l_test.
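
The selection and splitting steps can be tried out on a small stand-in DataFrame first. The sketch below is only an illustration: the frame `toy` and its column names are hypothetical placeholders, not the assignment data. It shows that bracket access works for a column named 'class', that selecting with a list of names returns a DataFrame, and that a fixed `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; the names and values are placeholders, not the assignment data.
toy = pd.DataFrame({
    "feature a": range(10),
    "feature b": range(10, 20),
    "class": [2, 4] * 5,
})

toy_label = toy["class"]                     # bracket access works for any column name
toy_data = toy[["feature a", "feature b"]]   # a list of names returns a DataFrame

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle.
a_train, a_test, y_train, y_test = train_test_split(
    toy_data, toy_label, test_size=0.3, random_state=23
)

# Repeating the call with the same random_state reproduces the same split.
b_train, b_test, _, _ = train_test_split(
    toy_data, toy_label, test_size=0.3, random_state=23
)

print(len(a_train), len(a_test))
print(a_test.index.equals(b_test.index))
```

Because the split is deterministic for a given `random_state`, the autograder can compare specific rows of the test set against known values.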

-----

In [4]:
from sklearn.model_selection import train_test_split

### BEGIN SOLUTION
label = df['class']
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]
d_train, d_test, l_train, l_test = train_test_split(data, label, test_size=0.3, random_state=23)
### END SOLUTION

In [5]:
assert_equal(type(data), pd.DataFrame, msg="data is not a DataFrame")
assert_equal(data.shape, (683, 9), msg="data is not correct")
assert_equal(len(l_test), 205, msg="Test set size is not correct.")
assert_equal(tuple(d_test.values[0]), (3, 2, 1, 1, 2, 2, 3, 1, 1),
             msg='Test data is not correct. Make sure you set random_state=23 when splitting the dataset')
# Display the first two rows of the training data
d_train.head(2)

| | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 408 | 5 | 1 | 3 | 1 | 2 | 1 | 2 | 1 | 1 |
| 319 | 5 | 4 | 6 | 6 | 4 | 10 | 4 | 3 | 1 |


---

# Problem 2: Create and Train a Random Forest Classifier

Your task for this problem is to build and use the scikit-learn library's `RandomForestClassifier` estimator to make predictions on the breast cancer dataset.

To complete this problem, you must explicitly:
- Create a `RandomForestClassifier` estimator **rdf_model** by using scikit-learn. Set __n_estimators__ to 100 and **random_state** to 23, and accept the default values for all other hyperparameters.
- Fit the `RandomForestClassifier` estimator using d_train and l_train created in Problem 1.
- Calculate the mean accuracy score of rdf_model and assign it to variable **mas_score**:
    - Apply the rdf_model `predict` function to d_test to get the predicted labels, and assign them to variable **predicted**.
    - Compute the mean accuracy score using the `accuracy_score` function in the `metrics` module with true label **l_test** and predicted label __predicted__.
    - Assign the accuracy score to variable **mas_score**.

After this problem, there will be two new variables defined, **rdf_model** and __mas_score__.
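
The mean accuracy score is the fraction of test samples whose predicted label equals the true label. The sketch below checks this on a pair of hypothetical label arrays (made up for illustration, not taken from the assignment data) by comparing a manual NumPy computation with `accuracy_score`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array([2, 4, 4, 2, 2, 4, 2, 2])
y_pred = np.array([2, 4, 2, 2, 2, 4, 4, 2])

# Fraction of positions where the prediction matches the true label.
manual = np.mean(y_true == y_pred)

# accuracy_score takes the true labels first, then the predicted labels.
library = metrics.accuracy_score(y_true, y_pred)

print(manual, library)
```

Six of the eight hypothetical predictions match, so both computations give 0.75.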

-----

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

### BEGIN SOLUTION
# Create the random forest classifier and fit it to the training data
rdf_model = RandomForestClassifier(n_estimators=100, random_state=23)
rdf_model.fit(d_train, l_train)
predicted = rdf_model.predict(d_test)
# accuracy_score expects the true labels first, then the predictions
mas_score = metrics.accuracy_score(l_test, predicted)
### END SOLUTION

In [7]:
assert_equal(type(rdf_model), type(RandomForestClassifier()), msg="rdf_model is not a RandomForestClassifier")
assert_equal(rdf_model.get_params()['random_state'], 23,
            msg="rdf_model is not created with random_state 23")
assert_equal(rdf_model.get_params()['n_estimators'], 100,
            msg="rdf_model is not created with 100 n_estimators")
assert_almost_equal(mas_score, 0.9707317073170731, msg="Mean accuracy score is not correct")
print(f"Random Forest Classifier prediction accuracy = {mas_score*100:4.1f}%")

Random Forest Classifier prediction accuracy = 97.1%


---

# Problem 3: Get Feature Importance

For this problem, you will retrieve the feature importances from the rdf_model created in Problem 2.

To complete this problem, you must explicitly:

- Get the feature importances from the `feature_importances_` attribute of the rdf_model.
- Zip the feature importances with the training features (the columns in d_train) and create a DataFrame **feature_importance** with two columns, 'Feature' and 'Importance'.
- Sort the **feature_importance** DataFrame by the 'Importance' column in descending order. __Note__: Either sort the DataFrame _inplace_ or assign the sorted DataFrame back to feature_importance.

After this problem, there will be a sorted DataFrame **feature_importance** defined.
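
The zip-and-sort pattern can be sketched on made-up values before applying it to the model. The names and scores below are hypothetical placeholders. The point is that `sort_values` returns a new DataFrame unless `inplace=True` is passed, and that the original row index travels with each row after sorting.

```python
import pandas as pd

# Hypothetical feature names and importance values, for illustration only.
names = ["feature a", "feature b", "feature c"]
scores = [0.2, 0.5, 0.3]

# zip pairs each name with its score; each pair becomes one row.
toy_importance = pd.DataFrame(list(zip(names, scores)),
                              columns=["Feature", "Importance"])

# sort_values returns a new DataFrame by default, so assign the result back.
toy_importance = toy_importance.sort_values(by="Importance", ascending=False)

print(toy_importance)
```

After sorting, the index labels are no longer in ascending order, which is why a sorted importance table shows a shuffled index column.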

-----

In [8]:
### BEGIN SOLUTION
feature_importance = pd.DataFrame(list(zip(d_train.columns, rdf_model.feature_importances_)), columns=['Feature', 'Importance'])
feature_importance.sort_values(by='Importance', ascending=False, inplace=True)
### END SOLUTION

In [14]:
assert_equal(type(feature_importance), pd.DataFrame, msg="feature_importance is not a DataFrame")
assert_almost_equal(feature_importance.iloc[1,1], 0.2119099489409781, msg="feature_importance is not sorted in descending order")
feature_importance

| | Feature | Importance |
|---|---|---|
| 5 | bare nuclei | 0.253141 |
| 2 | uniformity cell shape | 0.21191 |
| 1 | uniformity cell size | 0.162631 |
| 7 | normal nucleoli | 0.12667 |
| 4 | epithelial cell size | 0.098348 |
| 6 | bland chromatin | 0.058242 |
| 0 | clump thickness | 0.05165 |
| 3 | marginal adhesion | 0.029204 |
| 8 | mitoses | 0.008203 |


---

# Problem 4: Scale Train and Test Data

Your task for this problem is to scale the training and testing data. Use the d_train and d_test created in problem 1.

To complete this problem, you must explicitly:
- Create a MinMaxScaler
- Fit the MinMaxScaler with training data d_train
- Transform d_train with the MinMaxScaler and set transformed data to **d_train_mms**
- Transform d_test with the MinMaxScaler and set transformed data to **d_test_mms**

After this problem, there will be two scaled datasets, **d_train_mms** and __d_test_mms__.
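
With its default feature range of 0 to 1, `MinMaxScaler` maps each value x in a column to (x - min) / (max - min), where min and max are the column's minimum and maximum in the data passed to `fit`. The sketch below uses small hypothetical arrays, not the assignment data, to show that the test data is scaled with the training minimum and maximum, so a test value can land outside the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data, for illustration only.
train = np.array([[1.0], [5.0], [10.0]])
test = np.array([[3.0], [12.0]])

# The scaler learns the column minimum and maximum from the training data only.
scaler = MinMaxScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Manual version of the same formula, using the training min and max.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```

In this toy column, which spans 1 to 10, the value 5 becomes 4/9 and the value 3 becomes 2/9, while 12 becomes 11/9 because it exceeds the training maximum.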

-----

In [10]:
from sklearn.preprocessing import MinMaxScaler

### BEGIN SOLUTION
# Create and fit scaler
mms = MinMaxScaler().fit(d_train)

d_train_mms = mms.transform(d_train)
d_test_mms = mms.transform(d_test)
### END SOLUTION

In [11]:
assert_true(0.4444444444444445 in d_train_mms[0], msg="Train set is not scaled correctly")
assert_true(0.2222222222222222 in d_train_mms[0], msg="Train set is not scaled correctly")
assert_true(0.2222222222222222 in d_test_mms[0], msg="Test set is not scaled correctly")
assert_true(0.1111111111111111 in d_test_mms[0], msg="Test set is not scaled correctly")

---

# Problem 5: Create and Train a Support Vector Machine Classifier

Your task for this problem is to build and use the scikit-learn library's `SVC` estimator to make predictions on the breast cancer dataset.

To complete this problem, you must explicitly:
- Create an `SVC` estimator **svc_model** by using scikit-learn. Accept the default values for all hyperparameters.
- Fit the `SVC` estimator using **d_train_mms** and __l_train__.
- Calculate the mean accuracy score of svc_model and assign it to variable **mas_score_svc**:
    - Apply the svc_model `predict` function to __d_test_mms__ to get the predicted labels, and assign them to variable **predicted**.
    - Compute the mean accuracy score using the `accuracy_score` function in the `metrics` module with true label **l_test** and predicted label __predicted__.
    - Assign the accuracy score to variable **mas_score_svc**.


After this problem, there will be two new variables defined, **svc_model** and __mas_score_svc__.
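
As a sanity check on the scale, fit, predict, and score sequence, the sketch below runs it on synthetic data generated with `make_classification` (an assumption made for illustration; it is not the breast cancer data). It also compares `accuracy_score` with the classifier's own `score` method, which returns the mean accuracy on the given data and labels.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training data, then scale both sets with it.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Default SVC (rbf kernel) trained on the scaled training data.
model = SVC().fit(X_train_scaled, y_train)

from_metrics = metrics.accuracy_score(y_test, model.predict(X_test_scaled))
from_score = model.score(X_test_scaled, y_test)

print(from_metrics == from_score)
```

The assignment asks for `accuracy_score` explicitly, so `score` is best treated as a cross-check rather than a replacement.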

-----

In [12]:
from sklearn.svm import SVC
from sklearn import metrics

### BEGIN SOLUTION
# Create the support vector classifier and fit it to the scaled training data
svc_model = SVC()
svc_model.fit(d_train_mms, l_train)
predicted = svc_model.predict(d_test_mms)
# accuracy_score expects the true labels first, then the predictions
mas_score_svc = metrics.accuracy_score(l_test, predicted)
### END SOLUTION

In [13]:
assert_equal(type(svc_model), type(SVC()), msg="svc_model is not an SVC")
assert_equal(svc_model.get_params()['kernel'], 'rbf',
            msg="svc_model doesn't have default kernel rbf")
assert_almost_equal(mas_score_svc, 0.9804878048780488, msg="Mean accuracy score is not correct")
print(f"Support Vector Machine Classifier prediction accuracy = {mas_score_svc*100:4.1f}%")

Support Vector Machine Classifier prediction accuracy = 98.0%
