# Assignment 3: Machine Learning

## Overview

The purpose of this assignment is to get you more familiar with scikit-learn. Using the same principles from **Lab 3a** and **Lab 3b**, you'll be building a classifier that, given a list of tumor biopsy features, will determine whether a breast tumor is malignant or benign!

The dataset is from the Breast Cancer Wisconsin Diagnostic Database.  It is a classic in that has been used in many machine learning and statistics courses.  
***



# Part 1: Preparing the data

To train a machine learning model, we first need data. The dataset we'll be using is located in `data/breastcancer_data.csv`

Here is what each column of the file represents, as well as its domain.

            Attribute                 Domain
    1. Sample code number            id number
    2. Clump Thickness               1 - 10
    3. Uniformity of Cell Size       1 - 10
    4. Uniformity of Cell Shape      1 - 10
    5. Marginal Adhesion             1 - 10
    6. Single Epithelial Cell Size   1 - 10
    7. Bare Nuclei                   1 - 10
    8. Bland Chromatin               1 - 10
    9. Normal Nucleoli               1 - 10
    10. Mitoses                       1 - 10
    11. Class:                        (2 for benign, 4 for malignant)
    
   
Unless you have a background in biology, you may not be familiar with what each of these attributes means. As data scientists, however, you're able to recognize that some attributes will be more useful than others in training a machine learning model.


### **Task 1**: Load the data
Read in the breast cancer data from the CSV file and create a dataframe using `pandas.
`. Print the first five rows.  Produce a summary of the data.

**NOTE:** The csv file we provide has no header. Notice that if you use the [pandas.read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function from Pandas without specifying the `names` parameter, it takes the first row of the data as the column names. This is not what we want!

Instead, pass in a list of column names in your call to `read_csv`. Finally, store the result in a variable named `df`.

Hint: Use the following [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) methods:
* To print out the first n (5, by default) rows, use 'DataFrame.head(n=5)'. 
* Use 'DataFrame.describe()' to produce a nice data summary

In [1]:
# Import the required packages
import pandas as pd
import numpy as np

names = ['sample_code_number', 'clump_thickness', 'cell_size_uniformity', 'uniformity of cell shape', \
         'marginal_adhesion', 'single_epithelial_cell_size', 'bare_nuclei', 'bland_chromatin',        \
         'normal nucleoli', 'mitoses', 'class']
df = pd.read_csv('~/Desktop/data1030/a3-sjmccorm1993/data/breastcancer_data.csv', names=names) 
print(df.describe())
df.head()

       sample_code_number  clump_thickness  cell_size_uniformity  \
count        6.830000e+02       683.000000            683.000000   
mean         1.076720e+06         4.442167              3.150805   
std          6.206440e+05         2.820761              3.065145   
min          6.337500e+04         1.000000              1.000000   
25%          8.776170e+05         2.000000              1.000000   
50%          1.171795e+06         4.000000              1.000000   
75%          1.238705e+06         6.000000              5.000000   
max          1.345435e+07        10.000000             10.000000   

       uniformity of cell shape  marginal_adhesion  \
count                683.000000         683.000000   
mean                   3.215227           2.830161   
std                    2.988581           2.864562   
min                    1.000000           1.000000   
25%                    1.000000           1.000000   
50%                    1.000000           1.000000   
75%      

Unnamed: 0,sample_code_number,clump_thickness,cell_size_uniformity,uniformity of cell shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2




### **Task 2:** Using lab3a and lab3b as a reference, answer the following questions

1. Which attributes in your data will you include in the model? In other words, which attributes above will be your **features**?

2. Which attributes will you not include in the model? Why?

3. What is the **target**/**response**, the attribute we are trying to predict? 

Before we use scikit-learn to train a model based on our data, we need the data to be in the format scikit-learn requests.

From lab3a, recall the following requirements for working with data in scikit-learn:

#### Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**


All of our data is in a single dataframe. We don't satisfy the first requirement, and thus, none of the following requirements.

### **Task 3:** Separate your dataframe into two NumPy arrays
You will be creating two new NumPy arrays, *data* and *target*. Use your answers to Task 2 to guide you. Are there any columns in the dataframe you can ignore?

**Hints:**
* The target values are in the `class` column.
* Panda's dataframes store tabular data internally using 'numpy' arrays.  You can access this data directly using `Dataframe.values`

If you're confused, look at **Lab 3a** as a reference.

In [2]:
# you may consider df.pop('id'), but there are other ways to select columns too!

# get every column except class and id, then convert that dataframe into a numpy array
features = df.drop('class', axis=1)
features.drop('sample_code_number', axis=1, inplace=True)

print(features.columns)
data = features.as_matrix()

# Repeat same process for target
target = df['class'].values

# Test cases, do not change!
assert(683, 9) == data.shape
assert(683,) == target.shape

Index(['clump_thickness', 'cell_size_uniformity', 'uniformity of cell shape',
       'marginal_adhesion', 'single_epithelial_cell_size', 'bare_nuclei',
       'bland_chromatin', 'normal nucleoli', 'mitoses'],
      dtype='object')


***
# Part 2: Training machine learning models

Now that your data is in the correct format for using scikit-learn, you can use `sklearn`'s built in classification models!

For this assignment, we'll be using [Decision Trees](http://scikit-learn.org/stable/modules/tree.html) to perform binary classification.

### **Task 4:** Train and test your classifier on training data
Train and test your classifier on all of your data and test its accuracy on the same data.

For this task, you'll first need to initialize your classifier by creating an instance of it and then calling the `fit` method to train it.


Hint: To test the model's accuracy, use [sklearn.metrics.accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

Look at **Lab 3a** for an example.

In [3]:
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# usually we use X to denote training features, and y to denote training targets
X = data
y = target

# instantiate the model (using the default parameters)
tree = DecisionTreeClassifier()

# fit the model using the entire dataset
tree.fit(X, y)

# find the accuracy score of your model by testing it on the entire dataset, again
y_pred = tree.predict(X)
accuracy = metrics.accuracy_score(y_pred, y)

print(accuracy)

# YOUR CODE HERE

1.0


You may have been surprised after printing out your accuracy score in **Task 4**. 

### **Task 5:** What are the drawbacks to training and testing your classifier on the entire dataset? 

Please write your response in the context of the breast cancer dataset we're using.

### Model Selection
Instead of simply training and testing your classifier on the entire dataset, you can perform a technique called **K-fold cross validation**. 

K-fold cross validation splits your dataset into K chunks. Then, for each chunk, it fits and scores the classifier using all the other chunks as the **training set** and the current chunk as the **test set**. We then treat the average of all the scores as the total score of the model.  This provides a more accurate estimate of how your classifier will perform in the real world (i.e. with new data).

![5-fold cross-validation](images/07_cross_validation_diagram.png)

### **Task 6**: Use cross-validation on your decision tree classifier.
Using [sklearn.model_selection.cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), perform 10-fold cross-validation on your decision tree classifier.

Print the average of all the scores.

In [4]:
from sklearn.cross_validation import cross_val_score

print(cross_val_score(tree, X, y, cv=10).mean())

0.945927968851




***
# Part 3: Hyperparameter tuning

The focus for this assignment will be in Grid Search, which you will be implementing from scratch.

See the link below for more information:

https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)#Grid_search

In machine learning, there are parameters that the data scientist can set before training their model. These parameters are called **hyperparameters**. 

Decision trees have many such parameters, such as `max_features`, `max_depth`, and `min_samples_leaf`. This makes Decision trees perfect for using procedures such as [grid search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or [randomized search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), which find better hyperparameters by maximizing cross-validation score.

***
Although Grid Search may sound complicated, it can typically be implemented in just a few lines of code.

Consider the pseudocode for Grid Search below, which finds the optimal combination of two parameters: `max_features` and `max_depth`. (Note: these are both features in sklearn's DecisionTreeClassifier).

```
    maxFeaturesList = list of "reasonable" values for max_features
    maxDepthList = list of "reasonable" values for max_depth
    
    for max_features in maxFeaturesList:
        for max_depth in maxDepthList:
            if crossValidation(tree(max_features, max_depth)) is highest yet:
                bestMaxFeatures = max_features
                bestMaxDepth = max_depth
```

### **Task 7**: Implement Grid Search 
Improved your classifier by coding a grid search algorithm to find the better hyperparameter values for `max_features`, `max_depth`, and `min_samples_leaf`.

You are responsible for creating a "reasonable" set of values for each hyperparameter. Read the documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) for some ideas.

**DO NOT** use `sklearn.model_selection.GridSearchCV`. You can, however, use `cross_val_score` to evaluate the score at each step.

In [5]:
"""
Hints: 
Initialize a list of reasonable values for each hyperparemeter
* maxFeaturesList = ...
* maxDepthList = ...
* minSamplesLeafList = ...

Then train and test your DecisionTreeClassifier using all possible 
combinations of the selected parameters. This will require 
len(maxFeatureList)*len(maxDepthList)*len(minSampleLeafList) 
train and test steps


Print the best set of hyperparemeters and the score obtained 
using those hyperparameters.
"""

# There are 9 features in total, so it makes sense to try all integers in the range 1 to 9
maxFeaturesList = [i for i in range(1, 10)]

# The depth of the tree is the length of the longest path from a root to a leaf
# Assume that this length should not be longer than the number of features
maxDepthList = [i for i in range(1, 10)]

# Minimum number of samples required to be at a leaf node (a point where tree does not continue to split)
# 1 is a logical value to start with, as this will lead to the largest tree possible; try a range of values
# from 0 to 20
minSamplesLeafList = [i for i in range(1, 20)]

accuracy = 0

# Loop through all possibilities
for max_features in maxFeaturesList:
    for max_depth in maxDepthList:
        for min_samples_leaf in minSamplesLeafList:
            
            tree = DecisionTreeClassifier(max_features = max_features, max_depth = max_depth, \
                                          min_samples_leaf=min_samples_leaf)
            tree.fit(X, y)

            # find the accuracy score of your model by testing it on the entire dataset, again
            y_pred = tree.predict(X)
            
            if cross_val_score(tree, X, y).mean() > accuracy:
                
                accuracy = cross_val_score(tree, X, y).mean()
                
                bestMaxFeatures = max_features
                bestMaxDepth = max_depth
                bestMinSamplesLeaf = min_samples_leaf
                
         
            
            
# Print results
print("The optimal tree has the following hyperparameters:")
print("Max Features: ", bestMaxFeatures)
print("Max Depth: ", bestMaxDepth)
print("Min Samples Leaf: ", bestMinSamplesLeaf)
print("Accuracy: ", accuracy)
            


The optimal tree has the following hyperparameters:
Max Features:  7
Max Depth:  5
Min Samples Leaf:  5
Accuracy:  0.959006363191


Our implementation of grid search finds hyperparemeters that achieve a cross validation score of 95%. In theory, our classifier would almost always classify tumors correctly. But how often would you classify a benign tumor as malignant? And how often would you classify a malignant tumor as benign?

### **Task 8**:  Confusion Matrix
Using the classifier with optimal hyperparameters, construct a confusion matrix that describes its performance

Read documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) to see how to build a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

We've provided code that splits the dataset into a training set and a test set. Call `fit` using the training set, and then call `predict` and construct the confusion matrix using the test set.


In [6]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create tree using optimal parameter values
tree = DecisionTreeClassifier(max_features = 5, max_depth = 8, \
                              min_samples_leaf=5)
# Fit model
tree.fit(X_train, y_train)

# Predict new values using test data
y_pred = tree.predict(X_test)

# print your confusion matrix here
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))

[[103   4]
 [  8  56]]


### **Task 9**: Describe your confusion matrix.

What does each value in your confusion matrix correspond to?

Identify what the following terms mean in regards to the breast cancer dataset:

- **True Positives (TP):** 
- **True Negatives (TN):** 
- **False Positives (FP):** 
- **False Negatives (FN):** 

In the context of diagnosing breast cancer, is it worse to have a false positive or a false negative? Why?

### Task 10: Mini-project
Perform the same set of classification tasks in Tasks 1 through 9 on a either the [Titanic dataset](https://www.kaggle.com/c/titanic) or a dataset of your choosing.  Use your data from a source other than sklearn.  Include a set of next steps (i.e. suggestions) on how you could improve your classifiers performance.

Hints:
1. Be careful during data preparation to handle missing values appropriately.  You can either fill them in with estimate (say by the mean of the ones that are present), or you can drop rows or columns that contain them.
3. You will need to either drop columns that contain categorical data (unordered data) or convert them to numerical form.
2. Converting categorical data (unordered category data) to numerical form requires using [One Hot Encoding](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/) or some other appropriate technique.  Here is some [discussion](http://pbpython.com/categorical-encoding.html)
3. Use the Z-transform functionality in sklean to convert continuous values to z-scores and use them appropriately. Read this general information on [sklearn preprocessing]( http://scikit-learn.org/stable/modules/preprocessing.html) and then use [sklearn.preprocessing.StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)


In [7]:
# Task 1: Load the Titanic data
import numpy as np
import pandas as pd

df = pd.read_csv('~/Desktop/data1030/a3-sjmccorm1993/data/titanic_train.csv')

# Examine shape, column names, and first few rows of data
print(df.shape)
print(df.head())
print(df.columns)

(891, 12)
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN    

In [8]:
# Task 2: Answer questions about which variables to use as features vs. target

# Q: Which attributes in your data will you include in the model? 

# A: I will plan to use the following variables as features: 
#    Pclass, sex, age, # siblings/spouses aboard, # parents/children aboard, fare, and embarkation point

# Q: Which attributes will you not include and why?

# A: Passenger ID, since it contains no information about the passenger;
#    Name, for the same reason as passenger ID
#    Ticket, because I don't have enough knowledge to understand how the ticket information might impact survival
#    Cabin, since there are many missing values

# Q: What is the target/response, the attribute we are trying to predict?

# A: Survived, an indicator variable telling us whether the passenger survived

In [9]:
# Task 3: Separate the data into data and target arrays

# Create features dataframe, and include survival variable for now 
# (will drop rows based on NA values and target vector length has to match features vector length)
feature_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked', 'Survived']
features = df[feature_cols]

print(features.columns)

# Look at distinct unique values for non-continuous variables
print(features.Pclass.unique())
print(features.Sex.unique())
print(features.SibSp.unique())
print(features.Parch.unique())
print(features.Embarked.unique())


Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked',
       'Survived'],
      dtype='object')
[3 1 2]
['male' 'female']
[1 0 3 4 2 5 8]
[0 1 2 5 3 4 6]
['S' 'C' 'Q' nan]


In [10]:
# Pre-processing steps

# Two null values in Embarked variable - drop these rows
print(features.Embarked.isnull().sum())
features = features[pd.notnull(features['Embarked'])]

# 687 null values in Cabin variable: won't include this variable in model
print(features.Cabin.isnull().sum())
features.drop('Cabin', axis=1, inplace=True)
print(features.columns)

2
687
Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked',
       'Survived'],
      dtype='object')


In [11]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Convert gender and embarcation variable to numeric categorical variables
label_encoder = LabelEncoder()
features['Embarked'] = label_encoder.fit_transform(features['Embarked']) 
features['Sex'] = label_encoder.fit_transform(features['Sex'])

In [12]:
# Convert fare and age variables to z-scores and drop missing values
from sklearn.preprocessing import StandardScaler

# Remove rows that have null values for fare and age variables
features = features[pd.notnull(features['Fare'])]
features = features[pd.notnull(features['Age'])]

# values.reshape used as a result of a DeprecationError
x_fare = features['Fare'].values.reshape(-1, 1)
x_age = features['Age'].values.reshape(-1, 1)

# Use standard scaler to transform continuous varibles
standard_scaler = StandardScaler()
features['Fare'] = standard_scaler.fit_transform(X = x_fare) 
features['Age'] = standard_scaler.fit_transform(X = x_age)

In [13]:
# Extract target variable from features dataframe and remove it
target = features['Survived']
features.drop('Survived', axis=1, inplace=True)

# Convert features df to matrix
data = features.as_matrix()

# Test cases (two variables and ~200 obs containing missing values dropped from original dataset)
assert(712, 7) == data.shape
assert(712,) == target.shape

In [14]:
# Task 4: Train classifier on training data
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# X = training features, y = target
X = data
y = target

# Instantiate the model (using the default parameters)
titanic_tree = DecisionTreeClassifier()

# fit the model using the entire dataset
titanic_tree.fit(X, y)

# find the accuracy score of your model by testing it on the entire dataset
y_pred = titanic_tree.predict(X)
accuracy = metrics.accuracy_score(y_pred, y)

print(accuracy)

0.98595505618


In [15]:
# Task 5

# Training and testing a classifier on the entire dataset causes problems because while it will result in 
# very accurate predictions on the training data (the entire dataset, in this case) this leaves no data 
# for validating and testing that our model works well on new data 
# (other data than what the classifier was trained on). 

# In the context of the dataset we're using, this means that we can create a model that classifies 
# whether a passenger will survived with very high accuracy ONLY for the passenger in our training dataset; 
# however, there is no guarantee that our classifier will be accurate when presented when we test our 
# model on the official Kaggle "test" dataset.

In [16]:
# Task 6: Perform 10-fold cross validation on model
from sklearn.cross_validation import cross_val_score

print(cross_val_score(titanic_tree, X, y, cv=10).mean())

0.771386653253


In [17]:
# Task 7: Implement grid search (using built-in GridSearch module from sklearn this time)
from sklearn.model_selection import GridSearchCV

# define the parameter values that should be searched
feat_range = list(range(1, 7))
depth_range = list(range(1, 25)) 
min_samples = list(range(1, 10))

# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(max_features=feat_range, max_depth=depth_range, min_samples_leaf=min_samples) 

In [18]:
# instantiate the grid
grid = GridSearchCV(titanic_tree, param_grid, cv=10, scoring='accuracy')

# Run the grid search on the data
grid.fit(X, y)

# Print parameters corresponding to best model
print(grid.best_params_)
print(grid.best_score_)
print(grid.best_estimator_)

{'max_depth': 9, 'max_features': 3, 'min_samples_leaf': 3}
0.823033707865
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
            max_features=3, max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')


In [19]:
# Task 8: Produce confusion matrix
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create tree using optimal parameter values
titanic_tree_opt = DecisionTreeClassifier(max_features = 4, max_depth = 10, \
                              min_samples_leaf=7)
# Fit model
titanic_tree_opt.fit(X_train, y_train)

# Predict new values using test data
y_pred = titanic_tree_opt.predict(X_test)

# print your confusion matrix here
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))

[[83 21]
 [23 51]]


In [20]:
# Task 9: Describe your confusion matrix

# True positives: The number of people who the model predicted would survive who actually survived (49)
# True negatives: The number of people who the model predicted would not survive who died (89)
# False positives: The number of people who the model predicted would surive who died (15)
# False negatives: The number of people who the model predicted would not survive who did survive (25)
