# K-Nearest Neighbor
This algorithm looks at the data points closest to the each data point and determines if they are the same class.  Changing the distance parameter/algorithm can give different results.

In [171]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.neighbors import KNeighborsClassifier

## LandSat Data

The database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a number. The Landsat satellite data is one of the many sources of information available for a scene. The interpretation of a scene by integrating spatial data of diverse types and resolutions including multispectral and radar data, maps indicating topography, land use etc. is expected to assume significant importance with the onset of an era characterised by integrative approaches to remote sensing (for example, NASA's Earth Observing System commencing this decade). Existing statistical methods are ill-equipped for handling such diverse data types. Note that this is not true for Landsat MSS data considered in isolation (as in this sample database). This data satisfies the important requirements of being numerical and at a single resolution, and standard maximum-likelihood classification performs very well. Consequently, for this data, it should be interesting to compare the performance of other methods against the statistical approach. One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is a 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels. The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood and a number indicating the classification label of the central pixel. The number is a code for the following classes: Number Class 1 red soil 2 cotton crop 3 grey soil 4 damp grey soil 5 soil with vegetation stubble 6 mixture class (all types present) 7 very damp grey soil NB. There are no examples with class 6 in this dataset. The data is given in random order and certain lines of data have been removed so you cannot reconstruct the original image from this dataset. In each line of data the four spectral values for the top-left pixel are given first followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17,18,19 and 20. If you like you can use only these four attributes, while ignoring the others. This avoids the problem which arises when a 3x3 neighbourhood straddles a boundary.


In [511]:
landsat_train = pd.read_csv("LandSat/sat.trn", sep=" ",header=None)

In [512]:
landsat_train.columns

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36],
           dtype='int64')

I can't work with these column names. Based on my reading of the dataset, it appears that that each set of four numbers represent spectral values (R,G, Infrared1 and Infrared2).  The code below will create the column names I need for this data set by creating a Cartesian product of the pixel positions and the spectral values.

In [513]:
positions = ['top_left_','top_middle_','top_right_',
           'middle_left_','central_pixel_','middle_right_',
           'bottom_left_','bottom_middle_','bottom_right_']

In [514]:
pixels = ['r','g','infra1','infra2']

The product function gets the Cartesian product of two lists.

In [515]:
from itertools import product
temp = list(product(positions, pixels))

In [516]:
colnames = list()

In [517]:
for name in temp:
    colnames.append(name[0]+name[1])

colnames.append('land_type')

In [518]:
print(colnames)

['top_left_r', 'top_left_g', 'top_left_infra1', 'top_left_infra2', 'top_middle_r', 'top_middle_g', 'top_middle_infra1', 'top_middle_infra2', 'top_right_r', 'top_right_g', 'top_right_infra1', 'top_right_infra2', 'middle_left_r', 'middle_left_g', 'middle_left_infra1', 'middle_left_infra2', 'central_pixel_r', 'central_pixel_g', 'central_pixel_infra1', 'central_pixel_infra2', 'middle_right_r', 'middle_right_g', 'middle_right_infra1', 'middle_right_infra2', 'bottom_left_r', 'bottom_left_g', 'bottom_left_infra1', 'bottom_left_infra2', 'bottom_middle_r', 'bottom_middle_g', 'bottom_middle_infra1', 'bottom_middle_infra2', 'bottom_right_r', 'bottom_right_g', 'bottom_right_infra1', 'bottom_right_infra2', 'land_type']


In [519]:
landsat_train.columns = colnames

In [520]:
landsat_train.head()

Unnamed: 0,top_left_r,top_left_g,top_left_infra1,top_left_infra2,top_middle_r,top_middle_g,top_middle_infra1,top_middle_infra2,top_right_r,top_right_g,...,bottom_left_infra2,bottom_middle_r,bottom_middle_g,bottom_middle_infra1,bottom_middle_infra2,bottom_right_r,bottom_right_g,bottom_right_infra1,bottom_right_infra2,land_type
0,92,115,120,94,84,102,106,79,84,102,...,104,88,121,128,100,84,107,113,87,3
1,84,102,106,79,84,102,102,83,80,102,...,100,84,107,113,87,84,99,104,79,3
2,84,102,102,83,80,102,102,79,84,94,...,87,84,99,104,79,84,99,104,79,3
3,80,102,102,79,84,94,102,79,80,94,...,79,84,99,104,79,84,103,104,79,3
4,84,94,102,79,80,94,98,76,80,102,...,79,84,103,104,79,79,107,109,87,3


I going to put all of previous code into a function so I can apply it to the test data set.

In [521]:
def prep_data(df):
    positions = ['top_left_','top_middle_','top_right_',
           'middle_left_','central_pixel_','middle_right_',
           'bottom_left_','bottom_middle_','bottom_right_']
    
    pixels = ['r','g','infra1','infra2']
    
    temp = list(product(positions, pixels))
    
    colnames = list()
    
    for name in temp:
        colnames.append(name[0]+name[1])

    colnames.append('land_type')
    
    df.columns = colnames
    
    return df

In [522]:
temp = pd.read_csv("LandSat/sat.tst", sep=" ",header=None)
landsat_test = prep_data(temp)

In [523]:
landsat_test.head()

Unnamed: 0,top_left_r,top_left_g,top_left_infra1,top_left_infra2,top_middle_r,top_middle_g,top_middle_infra1,top_middle_infra2,top_right_r,top_right_g,...,bottom_left_infra2,bottom_middle_r,bottom_middle_g,bottom_middle_infra1,bottom_middle_infra2,bottom_right_r,bottom_right_g,bottom_right_infra1,bottom_right_infra2,land_type
0,80,102,102,79,76,102,102,79,76,102,...,87,79,107,109,87,79,107,113,87,3
1,76,102,102,79,76,102,106,83,76,102,...,87,79,107,113,87,79,103,104,83,3
2,80,98,106,79,76,94,102,76,76,94,...,79,79,95,100,79,79,95,96,75,4
3,76,94,102,76,76,94,102,76,76,94,...,79,79,95,96,75,79,95,100,75,4
4,76,94,102,76,76,94,102,76,76,89,...,75,79,95,100,75,75,95,100,79,4


In [554]:
X_train = np.array(landsat_train.iloc[:,0:36])

In [555]:
X_train 

array([[ 92, 115, 120, ..., 107, 113,  87],
       [ 84, 102, 106, ...,  99, 104,  79],
       [ 84, 102, 102, ...,  99, 104,  79],
       ...,
       [ 68,  75, 108, ..., 100, 104,  85],
       [ 71,  87, 108, ...,  91, 104,  85],
       [ 71,  91, 100, ...,  91, 100,  81]], dtype=int64)

In [556]:
y_train = np.array(landsat_train.iloc[:,-1])

In [573]:
X_test = np.array(landsat_test.iloc[:,0:36])
y_test = np.array(landsat_test.iloc[:,-1])

In [574]:
neigh = KNeighborsClassifier(n_neighbors=4)

In [576]:
neigh.fit(X_test, y_test)

KNeighborsClassifier(n_neighbors=4)

In [579]:
y_pred= neigh.predict(X_test)

In [578]:
neigh.score(X_test,y_test)

0.9305

In [580]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [581]:
accuracy_score(y_pred, y_test)*100

93.05

In [584]:
confusion_matrix(y_pred, y_test)

array([[461,   0,   2,   1,  13,   1],
       [  0, 221,   1,   0,   1,   0],
       [  0,   0, 380,  23,   1,  10],
       [  0,   0,   8, 173,   0,  40],
       [  0,   1,   1,   3, 217,  10],
       [  0,   2,   5,  11,   5, 409]], dtype=int64)

## Mushroom Data

In [585]:
mushroom = pd.read_csv("Mushroom/agaricus-lepiota.data", header=None)

In [586]:
mushroom.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


The column names are not in the dataset. I need to get the column names so I can use them. I'm going to use Regular Expression to do this.

In [587]:
import re

This is the pattern to search for.

In [588]:
pattern="\d+.\s[a-z]+-?[a-z]+"

Read the file that contains the column names.

In [589]:
file = open('Mushroom/columns.txt', mode='r')
data = file.read()
file.close()

Search the text and place the matched data into a list.

In [590]:
x = re.findall(pattern, data)

In [591]:
colnames = list()

In [592]:
colnames.append('mushroom_type')
for name in x:
    colnames.append(re.sub("^\d+.\s", "", name))

In [593]:
colnames

['mushroom_type',
 'cap-shape',
 'cap-surface',
 'cap-color',
 'bruises',
 'odor',
 'gill-attachment',
 'gill-spacing',
 'gill-size',
 'gill-color',
 'stalk-shape',
 'stalk-root',
 'stalk-surface',
 'stalk-surface',
 'stalk-color',
 'stalk-color',
 'veil-type',
 'veil-color',
 'ring-number',
 'ring-type',
 'spore-print',
 'population',
 'habitat']

In [594]:
mushroom.columns = colnames

In [595]:
mushroom.head()

Unnamed: 0,mushroom_type,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface,stalk-color,stalk-color.1,veil-type,veil-color,ring-number,ring-type,spore-print,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [596]:
len(mushroom.columns)

23

In [597]:
len(mushroom)

8124

In [598]:
X = np.array(mushroom.iloc[:,1:22])
y = np.array(mushroom.iloc[:,0])

In [599]:
from sklearn import preprocessing

In [600]:
l = preprocessing.LabelEncoder()
y = l.fit_transform(y)

In [601]:
y

array([1, 0, 0, ..., 0, 1, 0])

In [602]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')

In [603]:
X = enc.fit_transform(X).toarray()

In [604]:
enc.get_feature_names()

array(['x0_b', 'x0_c', 'x0_f', 'x0_k', 'x0_s', 'x0_x', 'x1_f', 'x1_g',
       'x1_s', 'x1_y', 'x2_b', 'x2_c', 'x2_e', 'x2_g', 'x2_n', 'x2_p',
       'x2_r', 'x2_u', 'x2_w', 'x2_y', 'x3_f', 'x3_t', 'x4_a', 'x4_c',
       'x4_f', 'x4_l', 'x4_m', 'x4_n', 'x4_p', 'x4_s', 'x4_y', 'x5_a',
       'x5_f', 'x6_c', 'x6_w', 'x7_b', 'x7_n', 'x8_b', 'x8_e', 'x8_g',
       'x8_h', 'x8_k', 'x8_n', 'x8_o', 'x8_p', 'x8_r', 'x8_u', 'x8_w',
       'x8_y', 'x9_e', 'x9_t', 'x10_?', 'x10_b', 'x10_c', 'x10_e',
       'x10_r', 'x11_f', 'x11_k', 'x11_s', 'x11_y', 'x12_f', 'x12_k',
       'x12_s', 'x12_y', 'x13_b', 'x13_c', 'x13_e', 'x13_g', 'x13_n',
       'x13_o', 'x13_p', 'x13_w', 'x13_y', 'x14_b', 'x14_c', 'x14_e',
       'x14_g', 'x14_n', 'x14_o', 'x14_p', 'x14_w', 'x14_y', 'x15_p',
       'x16_n', 'x16_o', 'x16_w', 'x16_y', 'x17_n', 'x17_o', 'x17_t',
       'x18_e', 'x18_f', 'x18_l', 'x18_n', 'x18_p', 'x19_b', 'x19_h',
       'x19_k', 'x19_n', 'x19_o', 'x19_r', 'x19_u', 'x19_w', 'x19_y',
       'x20_a', '

In [605]:
from sklearn.model_selection import train_test_split

In [606]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.20, random_state=32)

In [607]:
X_train

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [608]:
neigh = KNeighborsClassifier(n_neighbors=5)

In [609]:
neigh.fit(X_train,y_train)

KNeighborsClassifier()

In [610]:
y_pred = neigh.predict(X_test)

In [611]:
neigh.score(X_test,y_test)

1.0

In [612]:
confusion_matrix(y_pred, y_test)

array([[846,   0],
       [  0, 779]], dtype=int64)