# Module 3 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menu bar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menu bar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")


-----

# Predicting Breast Cancer

In this assignment, we will work with a breast cancer data set to make a classification model. Before we build a model, we first load the data into the assignment notebook, and randomly sample several rows. Next, we display the DataFrame information. The data is clean, and all columns are numerical.

Please run the next two Code cells before proceeding to Problem 1.

-----

In [2]:
# Load the breast cancer data set
df = pd.read_csv('data/breast-cancer-wisconsin.csv')
df.sample(5)

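| | id | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 161 | 1198128 | 10 | 8 | 10 | 10 | 6 | 1 | 3 | 1 | 10 | 4 |
| 653 | 1350423 | 5 | 10 | 10 | 8 | 5 | 5 | 7 | 10 | 1 | 4 |
| 305 | 718641 | 1 | 1 | 1 | 1 | 5 | 1 | 3 | 1 | 1 | 2 |
| 468 | 474162 | 8 | 7 | 8 | 5 | 5 | 10 | 9 | 10 | 1 | 4 |
| 241 | 167528 | 4 | 1 | 1 | 1 | 2 | 1 | 3 | 6 | 1 | 2 |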


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   id                     683 non-null    int64
 1   clump thickness        683 non-null    int64
 2   uniformity cell size   683 non-null    int64
 3   uniformity cell shape  683 non-null    int64
 4   marginal adhesion      683 non-null    int64
 5   epithelial cell size   683 non-null    int64
 6   bare nuclei            683 non-null    int64
 7   bland chromatin        683 non-null    int64
 8   normal nucleoli        683 non-null    int64
 9   mitoses                683 non-null    int64
 10  class                  683 non-null    int64
dtypes: int64(11)
memory usage: 58.8 KB


---

# Problem 1: Data Preprocessing

For this problem you will use the DataFrame **df** defined above.

To complete the task, do the following:
1. Select the 'class' column as the label and assign it to the variable **label**. Note: because `class` is a reserved keyword in Python, `df.class` is a syntax error, so you can't refer to the 'class' column that way. Use `df['class']` instead.
2. Select the following columns as the training data and assign them to the variable **data**:  
'clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses'.   
**data** should be a DataFrame.
3. Split the independent variables (**data**) and the dependent variable (**label**) into training and testing sets.
    - Assign the training and testing data to the variables d_train and d_test.
    - Assign the training and testing labels to the variables l_train and l_test.
    - The `test_size` argument in `train_test_split` should be set to 0.3.
    - The `random_state` argument in `train_test_split` should be set to 23.

After this problem, six new variables will be defined: data, label, d_train, d_test, l_train, and l_test.
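
The steps above can be tried out on a small stand-in first. The following is a minimal sketch, not the assignment solution: the frame `toy` and its column names are hypothetical, and only the `test_size` and `random_state` values are taken from the instructions. It shows that single brackets return a Series while a list of column names returns a DataFrame, and how the rows are divided by the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; the column names are for illustration only.
toy = pd.DataFrame({
    "feature a": range(10),
    "feature b": range(10, 20),
    "class": [2, 4] * 5,
})

toy_label = toy["class"]                     # single brackets -> Series
toy_data = toy[["feature a", "feature b"]]   # list of names -> DataFrame

# Same arguments as in the instructions, applied to the toy frame.
toy_d_train, toy_d_test, toy_l_train, toy_l_test = train_test_split(
    toy_data, toy_label, test_size=0.3, random_state=23
)

print(type(toy_label).__name__, type(toy_data).__name__)
print(len(toy_d_train), len(toy_d_test))
```

With ten rows and `test_size=0.3`, three rows go to the test set and seven to the training set. The data and label pieces keep matching row indices, so each label still belongs to its row.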

-----

In [4]:
from sklearn.model_selection import train_test_split

### BEGIN SOLUTION
label = df['class']
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]
d_train, d_test, l_train, l_test = train_test_split(data, label, test_size=0.3, random_state=23)
### END SOLUTION

In [5]:
assert_equal(type(data), pd.DataFrame, msg="data is not a DataFrame")
assert_equal(data.shape, (683, 9), msg="data is not correct")
assert_equal(len(l_test), 205, msg="Test set size is not correct.")
assert_equal(tuple(d_test.values[0]), (3, 2, 1, 1, 2, 2, 3, 1, 1),
             msg='Test data is not correct. Make sure you set random_state=23 when splitting the dataset')
# Display the first two rows of the training data
d_train.head(2)

| | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 408 | 5 | 1 | 3 | 1 | 2 | 1 | 2 | 1 | 1 |
| 319 | 5 | 4 | 6 | 6 | 4 | 10 | 4 | 3 | 1 |


---

# Problem 2: Create and Train a K-Nearest Neighbors Classification Model

Your task for this problem is to build and use the scikit-learn library's `KNeighborsClassifier` estimator to make predictions on the breast cancer dataset. 

To complete this problem, you must explicitly:
- Create a `KNeighborsClassifier` estimator **knn_model** by using scikit-learn. Set n_neighbors to 10 and accept the default values for all other hyperparameters.
- Fit the `KNeighborsClassifier` estimator using d_train and l_train created in problem 1.

After this problem, there will be a trained K-Nearest Neighbors Classifier **knn_model**.
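
To make the role of n_neighbors concrete, here is a minimal pure-Python sketch of the idea behind k-nearest neighbors: a query point receives the most common label among its k closest training points. This is a conceptual illustration only, not how scikit-learn implements `KNeighborsClassifier`. The function `knn_predict` and the toy points are hypothetical; the labels 2 and 4 mirror the class values seen in the sample rows.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
toy_points = [(1, 1), (1, 2), (2, 1), (8, 9), (9, 8), (9, 9)]
toy_labels = [2, 2, 2, 4, 4, 4]

print(knn_predict(toy_points, toy_labels, query=(2, 2), k=3))
print(knn_predict(toy_points, toy_labels, query=(8, 8), k=3))
```

The first query sits next to the group labeled 2 and the second next to the group labeled 4, so the votes are unanimous in both cases. A larger k smooths the decision at the cost of letting more distant points vote.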

-----

In [6]:
from sklearn.neighbors import KNeighborsClassifier

### BEGIN SOLUTION
# Create and fit our K-Nearest Neighbors Classifier
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(d_train, l_train)
### END SOLUTION

KNeighborsClassifier(n_neighbors=10)

In [7]:
assert_equal(type(knn_model), type(KNeighborsClassifier()), msg="knn_model is not a KNeighborsClassifier model")
assert_equal(knn_model.get_params()['n_neighbors'], 10,
            msg="n_neighbors of knn_model is not 10.")

---

# Problem 3: Calculate Accuracy Score

For this problem, you will compute the accuracy score of the knn_model created in problem 2.  

To complete this problem, you must explicitly:

- Apply the `predict` method of knn_model to d_test to get the predicted labels, and assign the result to the variable **predicted**.
- Compute the accuracy score using the `accuracy_score` function in the `metrics` module, with the true labels **l_test** and the predicted labels **predicted**.
- Assign the accuracy score to the variable **mas_score**.

After this problem, there will be a new variable **mas_score** defined.
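
The accuracy score is the fraction of test rows whose predicted label equals the true label. The sketch below computes it by hand; the helper `accuracy` and the toy label lists are hypothetical and stand in for `metrics.accuracy_score`, **l_test**, and **predicted**. The last line relates the value expected by the autograder cell to the 205 test rows.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hypothetical labels, using the data set's class values 2 and 4.
toy_true = [2, 2, 4, 4, 2]
toy_pred = [2, 4, 4, 4, 2]
print(accuracy(toy_true, toy_pred))  # 4 of the 5 labels match

# The score checked by the autograder corresponds to 202 correct
# predictions out of the 205 rows in the test set.
print(202 / 205)
```

Accuracy gives the same result if the two arguments are swapped, but many other scikit-learn metrics do not, so it is a good habit to pass the true labels first and the predicted labels second.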

-----

In [8]:
from sklearn import metrics

### BEGIN SOLUTION
predicted = knn_model.predict(d_test)
mas_score = metrics.accuracy_score(l_test, predicted)
### END SOLUTION

In [9]:
assert_almost_equal(mas_score, 0.9853658536585366, msg="Accuracy score is not correct")
print(f"K-Nearest Neighbors Classifier prediction accuracy = {mas_score*100:4.1f}%")

K-Nearest Neighbors Classifier prediction accuracy = 98.5%


---

# Problem 4: Create and Train a Random Forest Classifier

Your task for this problem is to build and use the scikit-learn library's `RandomForestClassifier` estimator to make predictions on the breast cancer dataset. 

To complete this problem, you must explicitly:
- Create a `RandomForestClassifier` estimator **rdf_model** by using scikit-learn. Set n_estimators to 10 and random_state to 23, and accept the default values for all other hyperparameters.
- Fit the `RandomForestClassifier` estimator using d_train and l_train created in problem 1.
- Calculate the accuracy score of rdf_model:
    - Apply the `predict` method of rdf_model to d_test to get the predicted labels, and assign the result to the variable **predicted**.
    - Compute the accuracy score using the `accuracy_score` function in the `metrics` module, with the true labels **l_test** and the predicted labels **predicted**.
    - Assign the accuracy score to the variable **mas_score_rdf**.


After this problem, there will be two new variables defined, **rdf_model** and **mas_score_rdf**.
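
As a rough illustration of what n_estimators and random_state control, the sketch below fits a forest on a hypothetical one-feature toy data set (the names `toy_X`, `toy_y`, `forest`, and `forest_again` are not part of the assignment). It assumes only that n_estimators sets the number of trees and that a fixed random_state makes the fit repeatable.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one feature, two well-separated groups.
toy_X = [[1], [2], [3], [4], [10], [11], [12], [13]]
toy_y = [2, 2, 2, 2, 4, 4, 4, 4]

forest = RandomForestClassifier(n_estimators=10, random_state=23)
forest.fit(toy_X, toy_y)

# After fitting, the individual trees are available in estimators_.
print(len(forest.estimators_))

# Refitting with the same random_state reproduces the same predictions.
forest_again = RandomForestClassifier(n_estimators=10, random_state=23)
forest_again.fit(toy_X, toy_y)
toy_queries = [[0], [5], [14]]
print(list(forest.predict(toy_queries)) == list(forest_again.predict(toy_queries)))
```

Fixing random_state is what lets the autograder compare your accuracy score against a single expected value; without it, the bootstrap samples and feature choices would differ from run to run.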

-----

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

### BEGIN SOLUTION
# Create and fit our random forest classifier on the training data
rdf_model = RandomForestClassifier(n_estimators=10, random_state=23)
rdf_model.fit(d_train, l_train)
# Predict on the test data and score against the true labels
predicted = rdf_model.predict(d_test)
mas_score_rdf = metrics.accuracy_score(l_test, predicted)
### END SOLUTION

In [11]:
assert_equal(type(rdf_model), type(RandomForestClassifier()), msg="rdf_model is not a RandomForestClassifier")
assert_equal(rdf_model.get_params()['random_state'], 23,
            msg="rdf_model is not created with random_state 23")
assert_almost_equal(mas_score_rdf, 0.9707317073170731, msg="Accuracy score is not correct")
print(f"Random Forest Classifier prediction accuracy = {mas_score_rdf*100:4.1f}%")

Random Forest Classifier prediction accuracy = 97.1%
