# Week 1 Problem 2

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run All*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. **If your code does not pass the unit tests, it will not pass the autograder.**

# Due Date: 6 PM, January 22, 2018

In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state
from sklearn.metrics import accuracy_score

from nose.tools import assert_equal, assert_in, assert_is_not
from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.testing import assert_frame_equal, assert_index_equal

# Breast Cancer Dataset

For this assignment, we will use scikit-learn's built-in Breast Cancer Wisconsin (Diagnostic) dataset. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei. The target indicates whether each image shows a malignant (class 0) or benign (class 1) mass. Our goal is to see whether we can use classification to predict breast cancer.
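
The class encoding can be checked directly on the loaded dataset. The short sketch below reads the `target_names` attribute, which is ordered by class index; the variable name `class_names` is only illustrative.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ordered by class index: entry 0 names class 0, entry 1 names class 1
class_names = {index: str(name) for index, name in enumerate(data.target_names)}
print(class_names)

# Number of samples and number of features in the raw feature matrix
print(data.data.shape)
```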

In [2]:
# Load in the dataset as a Pandas DataFrame
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add in the target (labels)
df['target'] = data.target

# Preview the first few lines
df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


Let's examine the dataset (and review a bit of pandas) to see what we are working with.

In [3]:
# View the columns
print(df.columns)

# Count the columns (30 features plus the target column)
print("Number of features:", len(df.columns))

# View the label distribution
print(df.target.value_counts(ascending=True))

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')
Number of features: 31
0    212
1    357
Name: target, dtype: int64


Although scikit-learn accepts pandas DataFrames, its estimators expect the features and the labels as separate objects, so we need to split them.

In [4]:
# Split the DataFrame into a DataFrame of features and a Series of labels
cancer_df = df[df.columns[:-1]]
labels = df.target
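
As an alternative sketch, the features can be selected by column name rather than by position, which keeps working even if the `target` column is not the last one. The names `features_by_name` and `features_by_position` are illustrative only and are not used elsewhere in the assignment.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Select the features by dropping the label column by name
features_by_name = df.drop(columns='target')
labels_by_name = df['target']

# Positional selection, as in the cell above, for comparison
features_by_position = df[df.columns[:-1]]

print(features_by_name.equals(features_by_position))
```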

## Question 1

Use scikit-learn's built-in function `train_test_split()` to split `cancer_df` and `labels` into a training set and a testing set by creating a function called `split()`. Your function should take four parameters: (1) the features, (2) the labels, (3) the test size, and (4) the random state.

In [9]:
def split(data_x, data_y, ts, rs):
    '''
    Perform a train-test split of the dataset.
    
    Parameters
    ----------
    data_x: A pandas.DataFrame of the features
    data_y: A pandas.Series of the labels
    ts: A float representing the proportion of the dataset to use as the test set
    rs: A numpy.random.RandomState instance
    
    Returns
    -------
    X_train, X_test: pandas.DataFrames of the training and testing features
    y_train, y_test: pandas.Series of the training and testing labels
    '''
    # train_test_split is already imported at the top of the notebook
    X_train, X_test, y_train, y_test = train_test_split(
        data_x, data_y, test_size=ts, random_state=rs)

    return X_train, X_test, y_train, y_test

In [10]:
X_train, X_test, y_train, y_test = split(cancer_df, labels, 0.3, check_random_state(0))

assert_equal(isinstance(X_train, pd.DataFrame), True)
assert_equal(isinstance(X_test, pd.DataFrame), True)
assert_equal(isinstance(y_train, pd.Series), True)
assert_equal(isinstance(y_test, pd.Series), True)

# Check the size of the split
assert_equal(len(X_train) - np.round(len(cancer_df) * 0.7) <= 1, True)
assert_equal(len(y_train) - np.round(len(cancer_df) * 0.7) <= 1, True)

# Check to make sure the features are the same
assert_index_equal(X_train.columns, cancer_df.columns)
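
The `random_state` argument controls the shuffling applied before the split, so a fixed seed makes the split reproducible. The sketch below illustrates this with an integer seed on small made-up arrays (`X` and `y` here are toy data, not the cancer dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 12 samples with 2 features each
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

# Two independent calls with the same integer seed
first = train_test_split(X, y, test_size=0.25, random_state=0)
second = train_test_split(X, y, test_size=0.25, random_state=0)

# Every returned array (X_train, X_test, y_train, y_test) matches between the calls
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)

# With test_size=0.25, 3 of the 12 samples go to the test set
print(len(first[0]), len(first[1]))
```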



## Question 2

Let us train a k-NN model using our training data. Write a function called `train_knn()` that takes the training set, the training labels, and the number of neighbors, and fits a k-NN classifier model.

In [14]:
def train_knn(X_train, y_train, num_neighbors):
    '''
    Fit a k-nearest neighbors classifier on the training data.
    
    Parameters
    ----------
    X_train: A pandas.DataFrame of the features
    y_train: A pandas.Series of the labels
    num_neighbors: An integer specifying the number of neighbors
    
    Returns
    -------
    model: A sklearn.neighbors.KNeighborsClassifier instance
    '''
    # Construct the model (KNeighborsClassifier is imported at the top of the notebook)
    model = KNeighborsClassifier(n_neighbors=num_neighbors)

    # Train the model
    model.fit(X_train, y_train)
    
    return model

In [15]:
# Train 3-NN model
knn_model = train_knn(X_train, y_train, 3)

In [16]:
assert_equal(isinstance(knn_model, KNeighborsClassifier), True)
assert_equal(knn_model.n_neighbors, 3)
assert_array_almost_equal(knn_model._fit_X, X_train)
assert_array_equal(knn_model._y, y_train.values.ravel())
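
To make the voting behavior of k-NN concrete, here is a minimal sketch on made-up one-dimensional data (the `X_toy`, `y_toy`, and `toy_model` names are illustrative). With `n_neighbors=3`, each query point takes the majority label of its three closest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: three points near 0 (class 0) and three points near 10 (class 1)
X_toy = np.array([[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# 1.5 is closest to 1, 2 and 0 (all class 0); 9.5 is closest to 9, 10 and 11 (all class 1)
toy_pred = toy_model.predict(np.array([[1.5], [9.5]]))
print(toy_pred)
```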


## Question 3

Create a wrapper function called `predict_knn()` that takes a `KNeighborsClassifier` model and the test set, and returns the predicted values as a `numpy.ndarray`.

In [17]:
def predict_knn(model, X_test):
    '''
    Returns a `numpy.ndarray` of predicted values.
    
    Parameters
    ----------
    model: An sklearn.neighbors.KNeighborsClassifier object.
    X_test: A pandas.DataFrame of the testing features
    
    Returns
    -------
    prediction: A numpy.ndarray
    '''
    # Generate predictions
    prediction = model.predict(X_test)
     
    return prediction

In [18]:
# Obtain the prediction
prediction = predict_knn(knn_model, X_test)

In [19]:
assert_equal(isinstance(prediction, np.ndarray), True)
assert_equal(len(prediction), len(X_test))
assert_equal(set(prediction), {0,1})

Let us compute the accuracy of our classifier model.

In [20]:
accuracy_score(y_test, prediction)

0.91812865497076024
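
Accuracy is the fraction of predictions that match the true labels; the score above is consistent with 157 of the 171 test samples being classified correctly. The sketch below checks the definition against `accuracy_score()` on small made-up arrays (`y_true` and `y_pred` are illustrative, not the assignment's data).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels for illustration: 4 of the 5 predictions match
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Fraction of matching positions, computed by hand
manual = np.mean(y_true == y_pred)

# The same quantity from scikit-learn
library = accuracy_score(y_true, y_pred)

print(manual, library)
```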

## Question 4

We want to determine which value of `k` yields the highest accuracy score. Use your functions `train_knn()` and `predict_knn()` together with the built-in function `accuracy_score()` to create a function called `compute_accuracy()`. Your function should take the following inputs:

1. X_train: training set
2. y_train: labels of training set
3. X_test: testing set
4. y_test: labels of testing set
5. N: the max number of neighbors to train (i.e., 1 to N neighbors)

Your function should return a list of accuracy scores. For example, `scores[0]` corresponds to k=1, `scores[1]` corresponds to k=2, and so on.

In [27]:
def compute_accuracy(X_train, y_train, X_test, y_test, N):
    '''
    Returns a list of accuracy scores for k = 1 to N neighbors.
    
    Parameters
    ----------
    X_train: A pandas.DataFrame of the training features
    y_train: A pandas.Series of the training labels
    X_test: A pandas.DataFrame of the testing features
    y_test: A pandas.Series of the testing labels
    N: An int representing the max value of k to compute
    
    Returns
    -------
    scores: A list of accuracy scores
    '''
    
    scores = []
    for k in range(1, N+1):
        mod = train_knn(X_train, y_train, k)
        pred = predict_knn(mod, X_test)
        scores.append(accuracy_score(y_test, pred))

    return scores

In [28]:
scores = compute_accuracy(X_train, y_train, X_test, y_test, 10)
assert_equal(len(scores), 10)
assert_equal(all(0 <= j <= 1 for j in scores), True)

scores2 = compute_accuracy(X_train, y_train, X_test, y_test, 100)
assert_equal(len(scores2), 100)
assert_equal(all(0 <= j <= 1 for j in scores2), True)
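
Once the list of scores is available, one possible way to pick the best `k` is to take the position of the highest score and add 1, since `scores[0]` corresponds to k=1. The sketch below uses hypothetical scores (`example_scores` is made up for illustration and is not output from this notebook).

```python
import numpy as np

# Hypothetical accuracy scores for k = 1..5, for illustration only
example_scores = [0.90, 0.92, 0.95, 0.93, 0.95]

# np.argmax returns the index of the first maximum, so ties go to the smaller k
best_index = int(np.argmax(example_scores))

# Index 0 corresponds to k=1, so shift by one
best_k = best_index + 1
print(best_k, example_scores[best_index])
```

Breaking ties toward the smaller `k` is only one possible convention.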
