# Module 2 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import the modules needed by this assignment before proceeding to the problems.
2. Make sure you fill in every place that says `# YOUR CODE HERE`. Do not write your answer anywhere other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
import numpy as np

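# nose is unmaintained and may not be installed; unittest offers the same helpers
import unittest

_tc = unittest.TestCase()
assert_equal = _tc.assertEqual
assert_almost_equal = _tc.assertAlmostEqual
assert_true = _tc.assertTrue
assert_is_instance = _tc.assertIsInstance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")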

-----

# Predicting Breast Cancer

In this assignment, we will work with a breast cancer data set to build a classification model. Before we build a model, we first load the data into the assignment notebook and randomly sample several rows. Next, we display the DataFrame information. The data is clean and all columns are numerical.

Please run the next two Code cells before proceeding to Problem 1.

-----

In [None]:
#Load breast cancer dataset
df = pd.read_csv('data/breast-cancer-wisconsin.csv')
df.sample(5)

In [None]:
df.info()

---

# Problem 1: Data Preprocessing

For this problem you will use the DataFrame **df** defined above.

To complete the task, do the following:
1. Choose column 'class' as the label and assign it to variable **label**. Note: because `class` is a reserved keyword in Python, `df.class` is a syntax error. Use `df['class']` instead.
2. Choose the following columns as training data and assign them to variable **data**:  
'clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', and 'mitoses'.   
__data__ should be a DataFrame.
3. Split the independent and dependent variables into training and testing sets.
    - Assign the training and testing data to variables d_train and d_test.
    - Assign the training and testing labels to variables l_train and l_test.
    - The `test_size` argument in `train_test_split` should be set to 0.3.
    - The `random_state` argument in `train_test_split` should be set to 23.

After this problem, there are six new variables defined, **data, label, d_train, d_test, l_train** and __l_test__.
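
The selection and split steps above can be tried on a small stand-in table first. This is only a sketch: the frame `toy`, its column names, and the `a_*`/`b_*` variable names are made up for illustration and are not part of the graded notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; it only mimics the "features + class" layout
toy = pd.DataFrame({
    "feature a": range(10),
    "feature b": range(10, 20),
    "class": [0, 1] * 5,
})

toy_label = toy["class"]                    # bracket access works for any column name
toy_data = toy[["feature a", "feature b"]]  # a list of names returns a DataFrame

# Same call pattern as the assignment: 30% held out, fixed seed
a_train, a_test, b_train, b_test = train_test_split(
    toy_data, toy_label, test_size=0.3, random_state=23
)

# Repeating the call with the same random_state gives the same rows
a_train2, a_test2, b_train2, b_test2 = train_test_split(
    toy_data, toy_label, test_size=0.3, random_state=23
)
print(a_test.index.equals(a_test2.index))  # True
```

scikit-learn rounds the test share up, so with the 683 rows of **df**, 0.3 × 683 = 204.9 gives the 205 test rows that the autograder checks for.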

-----

In [None]:
from sklearn.model_selection import train_test_split

### BEGIN SOLUTION
label = df['class']
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]
d_train, d_test, l_train, l_test = train_test_split(data, label, test_size=0.3, random_state=23)
### END SOLUTION

In [None]:
assert_equal(type(data), pd.DataFrame, msg="data is not a DataFrame")
assert_equal(data.shape, (683, 9), msg="data is not correct")
assert_equal(len(l_test), 205, msg="Test set size is not correct.")
assert_equal(tuple(d_test.values[0]), (3, 2, 1, 1, 2, 2, 3, 1, 1),
             msg='Test data is not correct. Make sure you set random_state=23 when splitting the dataset')
# Display the first 2 rows of the training data
d_train.head(2)

---

# Problem 2: Create and Train a Logistic Regression Model

Your task for this problem is to build and use the scikit-learn library's `LogisticRegression` estimator to make predictions on the breast cancer dataset.  

To complete this task, you must explicitly:
- Create a `LogisticRegression` estimator **lr_model** by using scikit-learn. Accept default values for all arguments.
- Fit the `LogisticRegression` estimator using d_train and l_train created in Problem 1.

After this problem, there will be a trained logistic regression model **lr_model**.
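
"Accept default values" can be checked with `get_params()`, which is also what the autograder cell inspects. A minimal sketch follows; the names `default_model` and `custom_model` are hypothetical and separate from the graded **lr_model**.

```python
from sklearn.linear_model import LogisticRegression

# An estimator built with no arguments carries scikit-learn's defaults
default_model = LogisticRegression()
default_params = default_model.get_params()

# Passing an argument overrides the default, and get_params() reflects it
custom_model = LogisticRegression(C=0.5)
custom_params = custom_model.get_params()

# C is the inverse regularization strength
print(default_params["C"], custom_params["C"])  # 1.0 0.5
```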

-----

In [None]:
from sklearn.linear_model import LogisticRegression

### BEGIN SOLUTION
# Create and fit our logistic regression model to training data
lr_model = LogisticRegression()
lr_model.fit(d_train, l_train)
### END SOLUTION

In [None]:
assert_equal(type(lr_model), type(LogisticRegression()), msg="lr_model is not a LogisticRegression model")
assert_equal(lr_model.get_params()['C'], 1.0,
            msg="lr_model is not created with all default argument values")

---

# Problem 3: Calculate Mean Accuracy Score

For this problem, you will compute the mean accuracy score of the lr_model.  

To complete this task, you must explicitly:

- Apply the `predict` method of lr_model to d_test to get the predicted labels, and assign them to variable **predicted**.
- Compute the mean accuracy score using the `accuracy_score` function in the `metrics` module with true labels **l_test** and predicted labels __predicted__.
- Assign the accuracy score to variable **mas_score**.

After this problem, there will be a new variable **mas_score** defined.
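
The accuracy score is the fraction of predictions that match the true labels. The sketch below uses made-up label arrays (`y_true` and `y_pred` are hypothetical, not the assignment data) to show that `accuracy_score` agrees with counting matches by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of the 8 predictions match
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 0, 0])

by_hand = np.mean(y_true == y_pred)          # share of matching positions
by_sklearn = accuracy_score(y_true, y_pred)  # signature is (y_true, y_pred)

print(by_hand, by_sklearn)  # 0.75 0.75
```

By the same arithmetic, a score of 0.975609756097561 on 205 test rows corresponds to 200 correct predictions, since 200 / 205 ≈ 0.9756.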

-----

In [None]:
from sklearn import metrics

### BEGIN SOLUTION
predicted = lr_model.predict(d_test)
# accuracy_score expects the true labels first, then the predictions
mas_score = metrics.accuracy_score(l_test, predicted)
### END SOLUTION

In [None]:
assert_almost_equal(mas_score, 0.975609756097561, msg="Mean accuracy score is not correct")
print(f"Logistic Regression prediction accuracy = {mas_score*100:4.1f}%")

---

# Problem 4: Create and Train a Decision Tree Classifier

Your task for this problem is to build and use the scikit-learn library's `DecisionTreeClassifier` estimator to make predictions on the breast cancer dataset. 

To complete this task, you must explicitly:
- Create a `DecisionTreeClassifier` estimator **dtc_model** by using scikit-learn. Set **random_state** to 23 and accept default values for all other hyperparameters.
- Fit the `DecisionTreeClassifier` estimator using d_train and l_train created in Problem 1.

After this problem, there will be a trained decision tree classification model **dtc_model**.
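
Fixing **random_state** matters because scikit-learn's tree builder permutes the features at each split, so equally good splits can be chosen differently from run to run. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the breast cancer data) and hypothetical names `tree_a` and `tree_b` to show that two trees built with the same seed agree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 9 features, like the assignment's feature count
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

# Two separately built trees with the same seed and default hyperparameters
tree_a = DecisionTreeClassifier(random_state=23).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=23).fit(X, y)

same = bool((tree_a.predict(X) == tree_b.predict(X)).all())
print(same)                               # True
print(tree_a.get_params()["criterion"])   # gini
```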

-----

In [None]:
from sklearn.tree import DecisionTreeClassifier

### BEGIN SOLUTION
# Create and fit our decision tree classifier to training data
dtc_model = DecisionTreeClassifier(random_state=23)
dtc_model.fit(d_train, l_train)
### END SOLUTION

In [None]:
assert_equal(type(dtc_model), type(DecisionTreeClassifier()), msg="dtc_model is not a DecisionTreeClassifier")
assert_equal(dtc_model.get_params()['random_state'], 23,
            msg="dtc_model is not created with random_state 23")
assert_equal(dtc_model.get_params()['criterion'], 'gini',
            msg="dtc_model is not created with all default argument values")

---

# Problem 5: Calculate Mean Accuracy Score

For this problem, you will compute the mean accuracy score of the dtc_model.

To complete this task, you must explicitly:

- Apply the `predict` method of dtc_model to d_test to get the predicted labels, and assign them to variable **predicted_dtc**.
- Compute the mean accuracy score using the `accuracy_score` function in the `metrics` module with true labels **l_test** and predicted labels __predicted_dtc__.
- Assign the accuracy score to variable **mas_score_dtc**.

After this problem, there will be a new variable **mas_score_dtc** defined.

-----

In [None]:
from sklearn import metrics

### BEGIN SOLUTION
predicted_dtc = dtc_model.predict(d_test)
# accuracy_score expects the true labels first, then the predictions
mas_score_dtc = metrics.accuracy_score(l_test, predicted_dtc)
### END SOLUTION

In [None]:
assert_almost_equal(mas_score_dtc, 0.9609756097560975, msg="Mean accuracy score is not correct")
print(f"Decision Tree prediction accuracy = {mas_score_dtc*100:4.1f}%")