# K Nearest Neighbors

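Unlike <b>Linear Regression</b>, where our approach is to develop a model that best fits the data, in <b>K Nearest Neighbors</b> we aim for a model that best divides the data into classes. It looks a bit like clustering because points are grouped by how close they are, but it is a classifier: the class labels are known in advance.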

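The algorithm assigns a point to the majority class among its K closest training points, with closeness measured by Euclidean distance. Because every prediction compares the new point against every training point, this becomes too slow for large datasets, and that is where <b>SVM</b> comes to the rescue.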
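To make the distance-and-vote idea concrete, here is a minimal from-scratch sketch in plain Python. It is not what scikit-learn does internally; the function names `euclidean_distance` and `knn_predict` and the toy points are made up for illustration.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair every training point's distance to the query with its label.
    distances = [
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the labels of the k closest points.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    nearest_labels = [label for _, label in nearest]
    # The most common label among the neighbours wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (made up for illustration): two small groups labelled 'a' and 'b'.
train_points = [(1, 2), (2, 3), (3, 1), (6, 5), (7, 7), (8, 6)]
train_labels = ['a', 'a', 'a', 'b', 'b', 'b']

print(knn_predict(train_points, train_labels, (2, 2)))  # a
print(knn_predict(train_points, train_labels, (7, 6)))  # b
```

Note that each call to `knn_predict` walks over the whole training set, which is why the approach gets slow as the dataset grows.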

DATASET - http://archive.ics.uci.edu/ml/datasets.html<br/>
We will be working with the Breast Cancer Wisconsin (Original) dataset.

In [11]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, neighbors
# cross_validation was removed from scikit-learn; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

df = pd.read_csv('bcw.data')
df.replace('?', -99999, inplace=True)

'''
We replace ? with -99999 because most algorithms
will readily treat it as an outlier. Dumping the
data is rarely a good choice. Real-world datasets
have lots of missing data, and dropping a whole row
(all of its columns) for just one missing value
is never a good idea.
'''

'''
Finding the useless data
1. The ID has no relation to cancer
'''

# axis must be passed by keyword in recent pandas versions
df.drop(['id'], axis=1, inplace=True)

'''
(BIG)X = Features
(small)y = Label

Features here are everything except class
Label is the class
'''
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

'''
We shuffle our data and split it into a train and a test set.
The test set is 20 percent of the full dataset.
'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Default KNeighborsClassifier uses k = 5 neighbors
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

# Fraction of test samples classified correctly
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.964285714286


Want to see something crazy? Rerun the cell above without dropping the ID column and compare the accuracy: it usually gets much worse.
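A rough sketch of why the ID column hurts: IDs are large, arbitrary numbers, so they swamp the small feature values in the Euclidean distance. The rows, IDs, and the helper `nearest_label` below are made up for illustration; only the label values 2 and 4 follow the dataset's class column.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_label(rows, labels, query):
    """1-nearest-neighbour: return the label of the single closest row."""
    best = min(range(len(rows)), key=lambda i: euclidean_distance(rows[i], query))
    return labels[best]

# Made-up rows in the form (id, feature_1, feature_2).
rows_with_id = [(1000025, 1, 1), (1200000, 9, 10)]
labels = [2, 4]

# The query's features look like the first row, but its ID is close to the second.
query_with_id = (1199990, 1, 2)

# Same data with the ID stripped off.
rows_no_id = [row[1:] for row in rows_with_id]
query_no_id = query_with_id[1:]

print(nearest_label(rows_no_id, labels, query_no_id))      # 2
print(nearest_label(rows_with_id, labels, query_with_id))  # 4
```

With the ID left in, the neighbour is chosen almost entirely by how close the ID numbers are, which says nothing about the class.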

In [16]:
example_measures = np.array([4,2,1,1,1,2,3,2,1])
example_measures = example_measures.reshape(1, -1)
'''
Make sure this sample is not in the training data,
so the machine is predicting on something it has
never seen before
'''
prediction = clf.predict(example_measures)
print(prediction)

[2]


In [17]:
example_measures = np.array([[4,2,1,1,1,2,3,2,1], [4,2,1,2,2,2,3,2,1]])
example_measures = example_measures.reshape(2, -1)
prediction = clf.predict(example_measures)
print(prediction)

[2 2]


In [18]:
example_measures = np.array([[4,2,1,1,1,2,3,2,1], [4,2,1,2,2,2,3,2,1]])
example_measures = example_measures.reshape(len(example_measures), -1)
prediction = clf.predict(example_measures)
print(prediction)

[2 2]
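The reshape calls in the three cells above are there because scikit-learn estimators expect a 2D array of shape (number of samples, number of features). A small sketch of what each reshape does to the array shape, using the same example values:

```python
import numpy as np

# A single sample is 1D: 9 feature values, no sample axis.
single = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
print(single.shape)                 # (9,)

# reshape(1, -1) makes it one row; -1 lets NumPy infer the column count.
print(single.reshape(1, -1).shape)  # (1, 9)

# Two samples given as nested lists are already 2D.
batch = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1],
                  [4, 2, 1, 2, 2, 2, 3, 2, 1]])
print(batch.shape)                  # (2, 9)

# reshape(len(batch), -1) works for any number of samples without hard-coding it.
print(batch.reshape(len(batch), -1).shape)  # (2, 9)
```

For the two-sample case the reshape leaves the shape unchanged, so using `len(example_measures)` is mainly a convenience that keeps the same line working when the number of samples changes.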
