# K Nearest Neighbors classification walkthrough

In this notebook we look at how the kNN algorithm classifies tumors as malignant or benign in the Wisconsin breast cancer dataset.

## 1. Import necessary packages:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

# seaborn is a nice package for plotting, but you have to use pip to install
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier

## 2. Load in the dataset

Path is provided.

In [None]:
bcw = pd.read_csv('../assets/datasets/wdbc.data', header=None, index_col=None)

### 2.2 Assign the columns

The attributes below will be the columns of the dataset.

      Attribute                     
   --------------------------------------------
   1. Sample code number [subject ID]
   2. Class
   3. Cell nucleus mean radius
   4. Texture mean
   5. Perimeter mean
   6. Area mean
   7. Smoothness mean
   8. Compactness mean
   9. Concavity mean
   10. Concave points mean
   11. Symmetry mean
   12. Fractal dimension mean
   13. Cell nucleus SE radius
   14. Texture SE
   15. Perimeter SE
   16. Area SE
   17. Smoothness SE
   18. Compactness SE
   19. Concavity SE
   20. Concave points SE
   21. Symmetry SE
   22. Fractal dimension SE
   23. Cell nucleus worst radius
   24. Texture worst
   25. Perimeter worst
   26. Area worst
   27. Smoothness worst
   28. Compactness worst
   29. Concavity worst
   30. Concave points worst
   31. Symmetry worst
   32. Fractal dimension worst

The column names are taken from the dataset info file. The file stores the ten mean features first (fields 3-12), then the ten standard errors (fields 13-22), then the ten worst values (fields 23-32), so the names must be listed in that order.

For more information check out the information file:

`../assets/datasets/wdbc.names`

You can open it with a text editor of your choice.

Create a list with the column names and assign it to the columns of the loaded DataFrame.

In [None]:
# 'nucleus' here denotes the cell nucleus radius
column_names = ['id','malignant',
                'nucleus_mean','texture_mean','perimeter_mean','area_mean',
                'smoothness_mean','compactness_mean','concavity_mean',
                'concave_pts_mean','symmetry_mean','fractal_dim_mean',
                'nucleus_se','texture_se','perimeter_se','area_se',
                'smoothness_se','compactness_se','concavity_se',
                'concave_pts_se','symmetry_se','fractal_dim_se',
                'nucleus_worst','texture_worst','perimeter_worst','area_worst',
                'smoothness_worst','compactness_worst','concavity_worst',
                'concave_pts_worst','symmetry_worst','fractal_dim_worst']

bcw.columns = column_names

### 2.3 Check out the dataset information

Print out the head and the datatypes.
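
A minimal sketch of this check. It uses a tiny hypothetical frame (`demo`, with made-up values) in place of `bcw` so that it runs on its own; on the real data you would call the same methods on `bcw`.

```python
import pandas as pd

# hypothetical stand-in for bcw: three rows and three of the columns
demo = pd.DataFrame({
    'id': [101, 102, 103],
    'malignant': ['M', 'B', 'B'],
    'nucleus_mean': [17.9, 12.4, 11.3],
})

# first rows of the data
print(demo.head())

# datatype of each column; the class column is still text, not a number
print(demo.dtypes)
```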

### 2.4 Recode the class field to be 0 vs. 1

The malignant class field is coded as "B" for benign and "M" for malignant. 

It is best to recode this as a binary integer for classification, with "1" as malignant and "0" as benign (malignant is assigned to 1 because our goal is to predict malignant tumors with the data).
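
One way to do the recoding is with the pandas `Series.map()` method. The sketch below runs on a small hypothetical frame (`demo`) rather than on `bcw`; the same line applied to `bcw['malignant']` should behave the same way, assuming the column contains only "M" and "B".

```python
import pandas as pd

# hypothetical stand-in for the class column of bcw
demo = pd.DataFrame({'malignant': ['M', 'B', 'B', 'M']})

# map "M" -> 1 (malignant) and "B" -> 0 (benign)
demo['malignant'] = demo['malignant'].map({'M': 1, 'B': 0})

print(demo['malignant'].tolist())
```

Any value missing from the mapping dictionary becomes `NaN`, so it is worth checking for missing values after the recode.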

## 3. Break up the data and look at correlations

Split up the data into 3 datasets for the "mean", "standard error", and "worst" statistics on each predictor variable.

---

NOTE: The difference between standard error and standard deviation is subtle:

For roughly normally distributed data, a new observation has about a 95% chance of being within **2 standard deviations** of the sample mean.

The sample mean has about a 95% chance of being within **2 standard errors** of the real population mean.
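
The relationship between the two can be sketched numerically: the standard error of the mean is the standard deviation divided by the square root of the sample size. The sample below is made up purely for illustration.

```python
import math
import statistics

# hypothetical sample of 25 measurements (made-up values with mean 10)
sample = [9.0, 11.0] * 12 + [10.0]

n = len(sample)
sd = statistics.stdev(sample)   # spread of the individual observations
se = sd / math.sqrt(n)          # uncertainty of the sample mean

print('standard deviation:', sd)
print('standard error:', se)
```

For this sample the standard deviation is 1.0 and the standard error is 0.2. Collecting more data shrinks the standard error but not the standard deviation.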


In [None]:
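# A function that subsets the data to the columns indicating the
# mean, se, or worst variable types
def df_subsetter(df, suffix):
    column_select = [x for x in df.columns if suffix in x]
    # copy so that renaming the columns doesn't touch the original data
    df_subset = df[['id','malignant'] + column_select].copy()
    df_subset.columns = [x.replace(suffix, '') for x in df_subset.columns]
    return df_subset

# build the three subsets; bcw_mean is used later in the notebook
# (bcw_se and bcw_worst are names chosen here to match it)
bcw_mean = df_subsetter(bcw, '_mean')
bcw_se = df_subsetter(bcw, '_se')
bcw_worst = df_subsetter(bcw, '_worst')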


### 3.1 Examine correlation matrices for the 3 datasets

Look at the correlations between variables for each of the subset datasets, excluding the id column.

1. The mean columns subset
2. The standard error columns subset
3. The "worst value" columns subset
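
A minimal sketch of one such correlation matrix. The frame `demo` is a hypothetical stand-in with made-up values for one of the subsets; the `drop()` and `corr()` calls are the part that carries over.

```python
import pandas as pd

# hypothetical stand-in for one of the subsets (made-up values)
demo = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'malignant': [0, 0, 1, 1, 1],
    'nucleus': [11.0, 12.0, 17.0, 18.0, 20.0],
    'area': [380.0, 450.0, 900.0, 1000.0, 1250.0],
})

# drop the id column, then compute pairwise Pearson correlations
corr = demo.drop(columns='id').corr()
print(corr.round(2))
```

Passing `corr` to seaborn's `heatmap()` function is one way to view the matrix graphically.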

### 3.2 Look at correlations between mean, standard error, and worst within variable

Look at the correlations between each variable's mean, se, and worst value:

In [None]:
# A function that prints the variable name, subsets the data to just
# the columns that contain that variable name, and prints out the
# correlations between those columns
def variable_corr_printer(df, varname):
    print(varname)
    df_sub = df[[x for x in df.columns if varname in x]]
    print(df_sub.corr())
    print('--------------------------------------------\n')

# get the variable names without the _mean, _se, _worst suffixes and
# remove duplicate names by filtering
varnames = [
    x.replace('_mean','')
    for x in bcw.columns
    if x not in ['id','malignant']
    and '_se' not in x
    and '_worst' not in x
]



## 4. Use seaborn's pairplot to visualize relationships between variables

Look at the data using seaborn's `pairplot()` function. The hue will be the class variable "malignant". The variables will be the other columns excluding, of course, the subject ID column.

Most of these predictors are highly correlated with the "class" variable. This is already an indication that our classifier is very likely to perform well.

In [None]:
# set the seaborn style to have a white background
sns.set(style="ticks", color_codes=True)

# This function does a pairplot across your variables with the color
# set as the outcome "malignant" class variable
def bcw_pairplotter(df, variables, sample_frac=0.3):
    # sample_frac lets you specify the fraction of the data to sample for the plot.
    # this speeds up the function, which can take a while with the full data.
    
    # get the number of rows/data points:
    rows = df.shape[0]
    
    # get downsample indices for the data, if specified;
    # otherwise use every row
    if sample_frac < 1.0:
        sample_inds = np.random.choice(rows, 
                                       size=int(round(rows*sample_frac)), 
                                       replace=False)
    else:
        sample_inds = np.arange(rows)
    
    # make the pairplot for the variables:
    pairs = sns.pairplot(df.iloc[sample_inds, :], 
                         vars=variables, 
                         hue="malignant", 
                         palette=sns.xkcd_palette(['windows blue', 'amber']))


# get out the column variable names to put into the pairplotter function
colvars = [x for x in bcw_mean if x not in ['id','malignant']]

### 4.2 Plot the mean data subset with the pairplotter function

### 4.3 Plot the standard error data subset with the pairplotter function

### 4.4 Plot the worst value data subset using the pairplotter function

## 5. Test the performance of kNN classifiers on the data using cross-validation

Let's see how the kNN classifier performs on the dataset with cross-validation.

We are going to set some parameters in the classifier constructor. Some clarification below:

1. **n_neighbors**: how many neighbors will vote on the class.
2. **weights**: with `'uniform'` weights (the default), all neighbors' votes count the same.
3. **metric** and **p**: when the metric is `'minkowski'` (the default) and `p == 2` (the default), the distance is the ordinary Euclidean distance.
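
To make the voting idea concrete, here is a hand-rolled sketch of a uniform-weight, Euclidean-distance kNN prediction. It illustrates the idea only and is not scikit-learn's implementation; the function name `knn_predict` and the data points are made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, n_neighbors=3):
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # indices of the n_neighbors closest training points
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    nearest = order[:n_neighbors]
    # uniform weights: each neighbor casts one vote for its own class
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# made-up 2D points: class 0 sits near the origin, class 1 near (5, 5)
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

The first query is assigned class 0 and the second class 1, because in each case all three nearest neighbors belong to that class.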

Load scikit-learn's handy cross-validation tools from the `model_selection` module (the older `cross_validation` module was removed in scikit-learn 0.20).

The `StratifiedKFold` class generates cross-validation indices: its `split(X, y)` method yields pairs of train and test index arrays, which you can use to subset your data in a for loop that fits the model and tests it.

The **stratified** version of cross-validation ensures that each fold has approximately the same proportion of each class as the full dataset.

In [None]:
from sklearn.model_selection import StratifiedKFold
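
The effect of stratification can be checked directly. The sketch below assumes a scikit-learn version that provides `sklearn.model_selection`, and uses made-up labels rather than the tumor data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# made-up labels: 30 benign (0) and 20 malignant (1)
y = np.array([0] * 30 + [1] * 20)
# placeholder features; only the labels matter for the split
X = np.zeros((len(y), 2))

skf = StratifiedKFold(n_splits=5)
fold_fractions = []
for train_i, test_i in skf.split(X, y):
    # fraction of malignant cases in this test fold
    fold_fractions.append(y[test_i].mean())

print(fold_fractions)
```

Every test fold contains 40% malignant cases, matching the 20 out of 50 in the full set of labels.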

In [None]:
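# Function to cross-validate the accuracy of a knn model across folds.
# X and Y must be numpy arrays; cv_indices is an iterable of
# (train indices, test indices) pairs.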
def accuracy_crossvalidator(X, Y, knn, cv_indices):
    
    # list to store the scores/accuracy of folds
    scores = []
    
    # iterate through the training and testing folds in cv_indices
    for train_i, test_i in cv_indices:
        
        # get the current X train & test subsets of X
        X_train = X[train_i, :]
        X_test = X[test_i, :]

        # get the Y train & test subsets of Y
        Y_train = Y[train_i]
        Y_test = Y[test_i]

        # fit the knn model on the training data
        knn.fit(X_train, Y_train)
        
        # get the accuracy predicting the testing data
        acc = knn.score(X_test, Y_test)
        scores.append(acc)
        
        print('Fold accuracy:', acc)
        
    print('Mean CV accuracy:', np.mean(scores))
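
For comparison, scikit-learn has a built-in helper, `cross_val_score`, that runs the same kind of loop. This is a sketch of an alternative, not a replacement for the function above; it assumes a scikit-learn version with `sklearn.model_selection` and uses synthetic data from `make_classification` instead of the tumor features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic two-class data standing in for the tumor features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
folds = StratifiedKFold(n_splits=5)

# one accuracy score per fold (accuracy is the default score for classifiers)
scores = cross_val_score(knn, X, y, cv=folds)
print('Fold accuracies:', scores)
print('Mean CV accuracy:', scores.mean())
```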


### 5.2 Cross-validate accuracy for a kNN model with 5 neighbors on the mean data subset

### 5.3 Cross-validate accuracy for a kNN model with 1 neighbor on the mean data subset

As you can see the mean cross-validated accuracy is very high with 5 neighbors. 

Let's see what it's like when we use only 1 neighbor:

### 5.4 Cross-validate accuracy for a kNN model with 5 neighbors on the standard error subset

### 5.5 Cross-validate accuracy for a kNN model with 5 neighbors on the worst value subset

## 6. Plot the kNN prediction boundary

Even with 1 neighbor we do quite well at predicting the malignant observations.

We will fit a kNN classifier with n_neighbors=5 using just **`nucleus`** and **`perimeter`** predicting the **`malignant`** class column.

The plotting function below will plot the points and the boundary of where the classifier votes between malignant vs. benign classes. 

---

Below is the helper function for plotting. All the sections are documented so you can walk through it and see how it works! (As usual, matplotlib code is not easy to read.)

In [None]:
from matplotlib.colors import ListedColormap


# MOST OF THIS FUNCTION STUFF LIFTED FROM SCIKIT-LEARN EXAMPLE!
# see:
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#example-neighbors-plot-classification-py

def knn_boundary_plotter(df, var1, var2, classvar='malignant',
                         nn=3, granularity=50.):
    
    # Subset the data to just the two variables to plot and the class variable
    # (copy so we don't modify a view of the caller's DataFrame)
    df = df[[var1, var2, classvar]].copy()
    
    # reset the index so the rows are numbered from 0
    df = df.reset_index(drop=True)
    
    # get the point colors from a seaborn built in palette
    point_colors = sns.xkcd_palette(['windows blue', 'amber'])
    
    # set the mesh colors to be more "faded"/brighter versions of the point colors
    mesh_colors = ['#8FCCFF', '#FFED79']

    # the 'pcolormesh' matplotlib function requires we convert the mesh colors into a 
    # 'colormap'
    colormap = ListedColormap(mesh_colors)

    # fit a knn on the data with the nearest neighbors number passed into the function
    knn_mod = KNeighborsClassifier(n_neighbors=nn)
    knn_mod.fit(df[[var1, var2]].values, df[classvar].values)

    # get the minimum and maximum values for each of the predictor variables
    v1_min, v1_max = np.min(df[var1]), np.max(df[var1])
    v2_min, v2_max = np.min(df[var2]), np.max(df[var2])

    # get the range of each variable
    v1_range = v1_max - v1_min
    v2_range = v2_max - v2_min

    # set up the min and max ranges of the axes of the plot
    # I add a buffer here (1/15th of the range) so no points are on the axes
    buffer_denom =  15.
    
    x_min = v1_min - (v1_range/buffer_denom)
    x_max = v1_max + (v1_range/buffer_denom)
    
    y_min = v2_min - (v2_range/buffer_denom)
    y_max = v2_max + (v2_range/buffer_denom)

    # use the numpy meshgrid function to make a bunch of points across the range
    # of values.
    xx, yy = np.meshgrid(np.arange(x_min, x_max, (v1_range/granularity)),
                         np.arange(y_min, y_max, (v2_range/granularity)))
    
    # Predict using the knn model on all the meshgrid points. This will let us see
    # the knn boundary of where it predicts between one class and another!
    Z = knn_mod.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # get out the values of our two predictors and class target variable
    v1_points = df[var1].values
    v2_points = df[var2].values
    class_colors = df[classvar].values

    # point size of 70 seems to work well
    point_sizes = 70

    # Set the figure size to be big enough to see stuff
    plt.figure(figsize=[11,9])
    
    # Plot the background colormesh colors, showing the decision boundary
    # of the fit k nearest neighbors algorithm:
    plt.pcolormesh(xx, yy, Z, cmap=colormap)

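    # Plot the actual points of the 2 predictor variables, colored by class
    # (this assumes the class variable is coded as 0/1)
    plt.scatter(v1_points, v2_points,
                c=[point_colors[int(c)] for c in class_colors],
                s=point_sizes)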
    
    # set the axis limits:
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    
    # Add the labels corresponding to the variables and a title
    # (I remembered this time, Sam!)
    plt.xlabel(var1, fontsize=20)
    plt.ylabel(var2, fontsize=20)
    plt.title('kNN='+str(nn)+' model predicting '+classvar+' with '+var1+' & '+var2+'\n',
              fontsize=20)
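
The meshgrid step is the least obvious part of the function, so here it is in isolation on a tiny grid. The threshold "prediction" is a made-up stand-in for `knn_mod.predict()`, used only to show how the shapes line up.

```python
import numpy as np

# a coarse grid: x takes 3 values, y takes 2 values
xx, yy = np.meshgrid(np.array([0.0, 1.0, 2.0]), np.array([10.0, 20.0]))

# flatten both coordinate arrays and pair them up: one row per grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)

# stand-in "prediction" per grid point (here just whether x >= 1),
# reshaped back to the grid layout that pcolormesh expects
Z = (grid_points[:, 0] >= 1.0).astype(int).reshape(xx.shape)
print(Z)
```

The six grid points form a (6, 2) array for prediction, and the predictions are reshaped to the (2, 3) shape of `xx` so that each one lands on its own cell of the color mesh.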


### 6.2 Use the boundary plotter function to plot area vs. symmetry using the mean value data and nn=3

### 6.3 Use the boundary plotter function to plot area vs. symmetry using the mean value data and nn=9

### 6.4 Use the interactive widget to explore the effects of changing the knn values

Feel free to change the axis variables!

In [None]:
import ipywidgets as widgets

In [None]:
x_axis_var = 'area'
y_axis_var = 'symmetry'

def knn_area_symmetry_slider(nn):
    knn_boundary_plotter(bcw_mean, x_axis_var, y_axis_var, nn=nn)
    
widgets.interact(knn_area_symmetry_slider, 
                 nn=widgets.IntSlider(min=1, max=101, step=1, value=1))

## 7. What is the effect of increasing/decreasing the neighbors?

## 8. What could be wrong with using accuracy as your measure of performance?

## 9. Examine more paired variables


## 10. Explain changing the number of neighbors in terms of bias-variance tradeoff