# Nearest Neighbours Classification

In this notebook, we discuss how to fit and evaluate nearest neighbour classification models in Python.

We will use the Breast Cancer Wisconsin (Diagnostic) dataset, downloaded from the UCI Machine Learning Repository.

In [1]:
# Import data
import pandas as pd

df = pd.read_csv('wdbc.csv')
df.head()

,target,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_pts_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_pts_worst,symmetry_worst,fractal_dim_worst
0,B,13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,0.1885,...,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259
1,B,13.08,15.71,85.63,520.0,0.1075,0.127,0.04568,0.0311,0.1967,...,14.5,20.49,96.09,630.5,0.1312,0.2776,0.189,0.07283,0.3184,0.08183
2,B,9.504,12.44,60.34,273.9,0.1024,0.06492,0.02956,0.02076,0.1815,...,10.23,15.66,65.13,314.9,0.1324,0.1148,0.08867,0.06227,0.245,0.07773
3,B,13.03,18.42,82.61,523.8,0.08983,0.03766,0.02562,0.02923,0.1467,...,13.3,22.81,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169
4,B,8.196,16.84,51.71,201.9,0.086,0.05943,0.01588,0.005917,0.1769,...,8.964,21.96,57.26,242.2,0.1297,0.1357,0.0688,0.02564,0.3105,0.07409


The target variable is called `target`, and we want to use all remaining variables to try to predict it.

In [2]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X = df.drop('target', axis = 1)
y = df.target

# We specify the random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1,
                                                    random_state = 1234)
(X_train.shape, X_test.shape)

((512, 30), (57, 30))

A nearest neighbour classifier can be fitted with `scikit-learn`, and the syntax should now look familiar.

In [3]:
# Fit K-Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors = 5)
model.fit(X_train, y_train)

# Get predicted values
y_pred = model.predict(X_test)
y_pred

array(['M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'B',
       'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M',
       'M', 'B', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'B',
       'B', 'M', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'M', 'M', 'B', 'B',
       'M', 'M', 'M', 'B', 'M'], dtype=object)
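
For intuition about what happens at prediction time, here is a minimal NumPy sketch of nearest neighbour voting. It assumes Euclidean distance and an unweighted majority vote, which correspond to the default settings of `KNeighborsClassifier`. The function `knn_predict_one` and the toy points are made up for illustration; this is not how `scikit-learn` implements the search internally.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x_new, k=3):
    """Predict one label by majority vote among the k closest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbours and return the most common one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three benign-like points and two malignant-like points
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9]])
y_toy = np.array(['B', 'B', 'B', 'M', 'M'])

pred_near_b = knn_predict_one(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_near_m = knn_predict_one(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_near_b, pred_near_m)
```

Note that the second query point has only two `M` neighbours nearby, so with `k=3` one `B` point also votes; the majority still decides the label.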

### Exercise

Compute the accuracy of this classification.
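
As a hint, accuracy is the fraction of predictions that match the true labels. The sketch below shows one way to compute it with `accuracy_score` from `sklearn.metrics`, using made-up toy labels (`y_true_toy` and `y_pred_toy` are hypothetical and not taken from the dataset).

```python
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, not taken from the notebook's data
y_true_toy = ['M', 'B', 'B', 'M', 'B']
y_pred_toy = ['M', 'B', 'M', 'M', 'B']

# Accuracy = number of matching labels / total number of labels
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(toy_accuracy)
```

Four of the five toy predictions agree with the true labels. The same call should work on `y_test` and `y_pred`.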

In [4]:
# Write your code below




### Exercise

Change the number of nearest neighbours used. Can you construct a more accurate classifier?
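
One possible approach is to loop over a few values of `n_neighbors` and record the test accuracy of each model. The sketch below uses `load_breast_cancer`, the copy of this dataset bundled with `scikit-learn`, as a stand-in so that it runs without `wdbc.csv`. This is an assumption: the labels there are coded as 0/1 and the row order may differ, so the numbers need not match those obtained from the CSV file. The candidate values of `k` are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's bundled copy of the dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1,
                                          random_state=1234)

# Fit one classifier per candidate k and record its test accuracy
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)  # score() returns the mean accuracy

for k, acc in scores.items():
    print(f"k = {k:2d}: accuracy = {acc:.3f}")
```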

In [5]:
# Write your code below




### Exercise

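The training dataset is (slightly) imbalanced, so accuracy alone can be misleading. Compute the precision, recall, and F-score for the two models above: the original classifier with 5 neighbours and the one you built in the previous exercise.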
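
As a hint, `precision_recall_fscore_support` from `sklearn.metrics` returns all three quantities at once. The sketch below uses made-up toy labels and treats `'M'` (malignant) as the positive class; that choice of positive class is an assumption for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical toy labels, not taken from the notebook's data
y_true_toy = ['M', 'B', 'B', 'M', 'B', 'M']
y_pred_toy = ['M', 'B', 'M', 'M', 'B', 'B']

# average='binary' reports the metrics for the class given by pos_label
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, pos_label='M', average='binary')

print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

In the toy data there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.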

In [6]:
# Write your code below


