# Assignment 3: Machine Learning

## Overview

The purpose of this assignment is to get you more familiar with scikit-learn. Using the same principles from **Lab 3a** and **Lab 3b**, you'll be building a classifier that, given a list of tumor biopsy features, will determine whether a breast tumor is malignant or benign!

The dataset is from the Breast Cancer Wisconsin Diagnostic Database. It is a classic that has been used in many machine learning and statistics courses.
***



# Part 1: Preparing the data

To train a machine learning model, we first need data. The dataset we'll be using is located in `data/breastcancer_data.csv`

Here is what each column of the file represents, as well as its domain.

        Attribute                        Domain
     1. Sample code number               id number
     2. Clump Thickness                  1 - 10
     3. Uniformity of Cell Size          1 - 10
     4. Uniformity of Cell Shape         1 - 10
     5. Marginal Adhesion                1 - 10
     6. Single Epithelial Cell Size      1 - 10
     7. Bare Nuclei                      1 - 10
     8. Bland Chromatin                  1 - 10
     9. Normal Nucleoli                  1 - 10
    10. Mitoses                          1 - 10
    11. Class                            (2 for benign, 4 for malignant)
    
   
Unless you have a background in biology, you may not be familiar with what each of these attributes means. As data scientists, however, you're able to recognize that some attributes will be more useful than others in training a machine learning model.


### **Task 1**: Load the data
Read in the breast cancer data from the CSV file and create a dataframe using `pandas`. Print the first five rows and produce a summary of the data.

**NOTE:** The csv file we provide has no header. Notice that if you use the [pandas.read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function from Pandas without specifying the `names` parameter, it takes the first row of the data as the column names. This is not what we want!

Instead, pass in a list of column names in your call to `read_csv`. Finally, store the result in a variable named `df`.

Hint: Use the following [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) methods:
* To print out the first n (5, by default) rows, use `DataFrame.head(n=5)`.
* Use `DataFrame.describe()` to produce a nice data summary.
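
As a minimal sketch of the `names` parameter, the snippet below reads a small header-less CSV from an in-memory string rather than the assignment file. The column names and values are made up for illustration only.

```python
import io
import pandas as pd

# Two rows of made-up, header-less CSV data (not the assignment file).
raw = io.StringIO("1001,5,1,2\n1002,3,7,4\n")

# Without `names`, pandas would treat the first data row as the header.
# These column names are hypothetical, chosen only for this example.
toy = pd.read_csv(raw, names=["id", "thickness", "cell_size", "label"])

print(toy.head())      # first rows (up to 5 by default)
print(toy.describe())  # count, mean, std, min, quartiles, max per column
```

Both rows survive as data because the column names come from the `names` list instead of the first line of the file.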

In [None]:
#Import the required packages
import pandas as pd
import numpy as np

# names = ...
# df = ...
# YOUR CODE HERE



### **Task 2:** Using lab3a and lab3b as a reference, answer the following questions

1. Which attributes in your data will you include in the model? In other words, which attributes above will be your **features**?

2. Which attributes will you not include in the model? Why?

3. What is the **target**/**response**, the attribute we are trying to predict? 

Before we use scikit-learn to train a model based on our data, we need the data to be in the format scikit-learn expects.

From lab3a, recall the following requirements for working with data in scikit-learn:

### Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**


All of our data is in a single dataframe. We don't satisfy the first requirement, and thus, none of the following requirements.

### **Task 3:** Separate your dataframe into two NumPy arrays
You will be creating two new NumPy arrays, *data* and *target*. Use your answers to Task 2 to guide you. Are there any columns in the dataframe you can ignore?

**Hints:**
* The target values are in the `class` column.
* pandas dataframes store tabular data internally using NumPy arrays. You can access this data directly using `DataFrame.values`.

If you're confused, look at **Lab 3a** as a reference.
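
The general pattern of pulling NumPy arrays out of a dataframe can be sketched on a toy frame. The column names below are hypothetical and the values are made up; which columns to keep in the real dataset is still your decision.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names (not the assignment data).
toy = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "thickness": [5, 3, 8],
    "cell_size": [1, 7, 10],
    "label": [2, 2, 4],
})

# Features: a 2-D array with one row per sample, one column per feature.
features = toy.drop(columns=["id", "label"]).values
# Response: a 1-D array with one entry per sample.
labels = toy["label"].values

print(features.shape)  # (3, 2)
print(labels.shape)    # (3,)
```

The two shapes agree on their first dimension, which is what scikit-learn checks when you later call `fit`.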

In [None]:
# YOUR CODE HERE

You may have been surprised after printing out your accuracy score in **Task 4**. 

### **Task 5:** What are the drawbacks to training and testing your classifier on the entire dataset? 

Please write your response in the context of the breast cancer dataset we're using.

### Model Selection
Instead of simply training and testing your classifier on the entire dataset, you can perform a technique called **K-fold cross validation**. 

K-fold cross validation splits your dataset into K chunks. Then, for each chunk, it fits and scores the classifier using all the other chunks as the **training set** and the current chunk as the **test set**. We then treat the average of all the scores as the total score of the model.  This provides a more accurate estimate of how your classifier will perform in the real world (i.e. with new data).

![5-fold cross-validation](images/07_cross_validation_diagram.png)
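
To make the procedure concrete, here is a rough sketch of the fold loop written out by hand with `sklearn.model_selection.KFold`. It uses synthetic data from `make_classification` as a stand-in for a real dataset, and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset.
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X_toy):
    # Fit on the other four chunks, score on the held-out chunk.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    scores.append(clf.score(X_toy[test_idx], y_toy[test_idx]))

print(len(scores))      # one score per fold
print(np.mean(scores))  # the average is the overall estimate
```

`cross_val_score` automates a loop of this kind, so in practice you rarely need to write it yourself.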

### **Task 6**: Use cross-validation on your decision tree classifier.
Using [sklearn.model_selection.cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), perform 10-fold cross-validation on your decision tree classifier.

Print the average of all the scores.

In [None]:
from sklearn.model_selection import cross_val_score
# YOUR CODE HERE

***
# Part 3: Hyperparameter tuning

The focus of this part will be grid search, which you will implement from scratch.

See the link below for more information:

https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)#Grid_search

In machine learning, there are parameters that the data scientist can set before training their model. These parameters are called **hyperparameters**. 

Decision trees have many such parameters, such as `max_features`, `max_depth`, and `min_samples_leaf`. This makes decision trees well suited to procedures such as [grid search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or [randomized search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), which find better hyperparameters by maximizing cross-validation score.

***
Although Grid Search may sound complicated, it can typically be implemented in just a few lines of code.

Consider the pseudocode for Grid Search below, which finds the optimal combination of two parameters: `max_features` and `max_depth`. (Note: these are both parameters of sklearn's DecisionTreeClassifier.)

```
    maxFeaturesList = list of "reasonable" values for max_features
    maxDepthList = list of "reasonable" values for max_depth
    
    
    for max_features in maxFeaturesList:
        for max_depth in maxDepthList:
            if crossValidation(tree(max_features, max_depth)) is highest yet:
                bestMaxFeatures = max_features
                bestMaxDepth = max_depth
```
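
The nested loops above generalize to any number of hyperparameters with `itertools.product`. The sketch below shows only the search pattern: `toy_score` is a made-up stand-in for a cross-validation score, and the candidate values are arbitrary.

```python
from itertools import product

def toy_score(a, b):
    """Made-up objective standing in for a cross-validation score."""
    return -(a - 3) ** 2 - (b - 2) ** 2

a_values = [1, 2, 3, 4]
b_values = [1, 2, 3]

best_score = float("-inf")
best_params = None
# product() yields every (a, b) pair, like two nested for-loops.
for a, b in product(a_values, b_values):
    score = toy_score(a, b)
    if score > best_score:
        best_score = score
        best_params = (a, b)

print(best_params, best_score)  # (3, 2) 0
```

Initializing the best score to negative infinity guarantees that the first combination tried always becomes the initial best.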

### **Task 7**: Implement Grid Search 
Improve your classifier by coding a grid search algorithm to find better hyperparameter values for `max_features`, `max_depth`, and `min_samples_leaf`.

You are responsible for creating a "reasonable" set of values for each hyperparameter. Read the documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) for some ideas.

**DO NOT** use `sklearn.model_selection.GridSearchCV`. You can, however, use `cross_val_score` to evaluate the score at each step.

In [None]:
'''
Hints: 
Initialize a list of reasonable values for each hyperparameter
* maxFeaturesList = ...
* maxDepthList = ...
* minSamplesLeafList = ...

Then train and test your DecisionTreeClassifier using all possible 
combinations of the selected parameters. This will require 
len(maxFeaturesList)*len(maxDepthList)*len(minSamplesLeafList) train and test steps.


Print the best set of hyperparameters and the score obtained 
using those hyperparameters.
'''

# YOUR CODE HERE

Our implementation of grid search finds hyperparameters that achieve a cross-validation score of 97%. In theory, our classifier would classify almost all tumors correctly. But how often would you classify a benign tumor as malignant? And how often would you classify a malignant tumor as benign?

### **Task 8**:  Confusion Matrix
Using the classifier with optimal hyperparameters, construct a confusion matrix that describes its performance.

Read documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) to see how to build a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

We've provided code that splits the dataset into a training set and a test set. Call `fit` using the training set, and then call `predict` and construct the confusion matrix using the test set.
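
As a small illustration of the `confusion_matrix` call itself, the snippet below uses hand-written label lists instead of real predictions. The values are made up; only the 2/4 class coding is taken from the dataset description.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels using the dataset's coding: 2 = benign, 4 = malignant.
y_true = [2, 2, 2, 4, 4, 4, 4, 2]
y_pred = [2, 2, 4, 4, 4, 2, 4, 2]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
print(cm)
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to interpret.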


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Print your confusion matrix here

# YOUR CODE HERE

### **Task 9**: Describe your confusion matrix.

What does each value in your confusion matrix correspond to?

Identify what the following terms mean with regard to the breast cancer dataset:

- **True Positives (TP):** 
- **True Negatives (TN):** 
- **False Positives (FP):** 
- **False Negatives (FN):** 

In the context of diagnosing breast cancer, is it worse to have a false positive or a false negative? Why?

### **Task 10**: Mini-project
Perform the same set of classification tasks in Tasks 1 through 9 on either the [Titanic dataset](https://www.kaggle.com/c/titanic) or a dataset of your choosing. Use data from a source other than sklearn. Include a set of next steps (i.e. suggestions) on how you could improve your classifier's performance.

Hints:
1. Be careful during data preparation to handle missing values appropriately. You can either fill them in with an estimate (say, the mean of the values that are present), or you can drop rows or columns that contain them.
2. You will need to either drop columns that contain categorical data (unordered data) or convert them to numerical form.
3. Converting categorical data (unordered category data) to numerical form requires using [One Hot Encoding](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/) or some other appropriate technique. Here is some [discussion](http://pbpython.com/categorical-encoding.html).
4. Use the Z-transform functionality in sklearn to convert continuous values to z-scores and use them appropriately. Read this general information on [sklearn preprocessing](http://scikit-learn.org/stable/modules/preprocessing.html) and then use [sklearn.preprocessing.StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).
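
The preparation steps in these hints can be sketched on a tiny made-up frame. The column names and values are hypothetical, and mean imputation plus `pandas.get_dummies` is just one reasonable combination, not the required approach.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up passenger-like records; column names are hypothetical.
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 58.0],
    "port": ["S", "C", "S", "Q"],
})

# Fill the missing age with the mean of the observed ages.
toy["age"] = toy["age"].fillna(toy["age"].mean())

# One-hot encode the unordered categorical column.
toy = pd.get_dummies(toy, columns=["port"])

# Convert the continuous column to z-scores (mean 0, unit variance).
toy[["age"]] = StandardScaler().fit_transform(toy[["age"]])

print(toy)
```

When you evaluate on held-out data, it is usually safer to fit the scaler on the training portion only and then apply it to the test portion.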


In [None]:
# Your solution here.  Please include additional cells as needed.