# Introduction to k-Nearest Neighbors

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. 

* In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

* In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
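The two rules above can be sketched in a few lines of plain Python. This is only a minimal illustration, not how scikit-learn implements k-NN: it assumes Euclidean distance, and the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example.

```python
import math
from collections import Counter

def k_nearest(train_points, query, k):
    """Return the indices of the k training points closest to query (Euclidean distance)."""
    distances = sorted((math.dist(p, query), i) for i, p in enumerate(train_points))
    return [i for _, i in distances[:k]]

def knn_classify(train_points, train_labels, query, k=3):
    """Classification: majority vote among the labels of the k nearest neighbors."""
    neighbor_ids = k_nearest(train_points, query, k)
    votes = Counter(train_labels[i] for i in neighbor_ids)
    return votes.most_common(1)[0][0]

def knn_regress(train_points, train_values, query, k=3):
    """Regression: average of the values of the k nearest neighbors."""
    neighbor_ids = k_nearest(train_points, query, k)
    return sum(train_values[i] for i in neighbor_ids) / k

# Hypothetical toy data: two small clusters in a 2-D feature space.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = [2, 2, 2, 4, 4, 4]                      # class labels for classification
values = [10.0, 12.0, 14.0, 50.0, 60.0, 70.0]    # target values for regression

print(knn_classify(points, labels, (1.2, 1.1), k=3))  # majority class of the 3 nearest points
print(knn_regress(points, values, (6.4, 6.9), k=3))   # mean value of the 3 nearest points
```

With `k=1`, `knn_classify` simply returns the label of the single closest point, which matches the special case described above.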

Please refer to https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm for more details about k-NN. It is highly recommended that you read it before proceeding further with this lab exercise.

In the first part, you will use scikit-learn's built-in k-NN classifier, `KNeighborsClassifier`.

Note: *Stack Overflow* is a programmer's best friend. If you have any doubts, syntax-related or otherwise, there is a high probability that someone has already posted about them.

We will use the breast cancer database provided by the UCI Machine Learning Repository.

* The data file is available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

* The description of what each attribute in the data represents is available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

* In short, Class <b>2</b> represents a benign tumour and Class <b>4</b> represents a malignant tumour.

## Using the *scikit-learn KNeighborsClassifier()* function

In [7]:
# Import the required libraries.

import numpy as np 
from sklearn import preprocessing, neighbors
from sklearn.model_selection import cross_validate, train_test_split
import pandas as pd 

In [8]:
# Read the required data. 
# This data is from UCI Machine Learning repository's Breast cancer database.
# https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
# Refer to https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
# for more info on data.

df = pd.read_csv('data/breast-cancer-wisconsin.data.txt')
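# Handle missing attributes: '?' in the data denotes a missing value.
# It is replaced by the large sentinel value -99999 so that the column stays usable.
df.replace('?', -99999, inplace=True)
# Drop the 'id' column (axis=1 selects columns; recent pandas versions
# no longer accept the axis as a positional argument).
df.drop(['id'], axis=1, inplace=True)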
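Replacing `'?'` with a sentinel value is one way to handle missing attributes. A common alternative, sketched below, is to read `'?'` as a missing value and drop the affected rows. The three-row CSV, its ids, and its column names are hypothetical stand-ins for the lab's data file; `na_values`, `dropna`, and `drop(columns=...)` are standard pandas features.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt; the real file has more columns and rows.
csv_text = """id,clump_thickness,bare_nuclei,class
101,5,1,2
102,8,?,4
103,8,10,4
"""

# Treat '?' as a missing value (NaN) while reading.
demo = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(demo["bare_nuclei"].isna().sum())   # number of missing entries in this column

# Drop the rows with missing values, then drop the 'id' column.
demo_clean = demo.dropna().drop(columns=["id"])
print(demo_clean.shape)
```

Dropping rows discards data, so it is a reasonable choice mainly when only a small fraction of rows are affected.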

In [9]:
# Get the features and labels.
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

In [10]:
# Use a random 80-20 split of the data for training and testing, respectively.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
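Because the split is random, the accuracy reported later will change slightly from run to run. The sketch below shows one way to make a split repeatable and keep the class proportions the same in both parts, using the `random_state` and `stratify` parameters of `train_test_split`. The data here is a randomly generated stand-in, not the breast cancer data, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 9 integer features,
# 65 labeled benign (2) and 35 labeled malignant (4).
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([2] * 65 + [4] * 35)

# random_state fixes the shuffle; stratify keeps the 65/35 class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.mean(y_tr == 4), np.mean(y_te == 4))   # malignant fraction in each part
```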

In [11]:
# Using scikit learn's KNeighborsClassifier() function
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

KNeighborsClassifier()

In [12]:
# Find the accuracy on the test set. Not bad, huh?
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.9571428571428572
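For a classifier, `score` returns the mean accuracy, that is, the fraction of test points whose predicted label matches the true label. The sketch below checks this by hand on a small made-up dataset; the arrays and the name `model` are hypothetical and unrelated to the breast cancer data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated clusters labeled 2 and 4.
X_toy = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]])
y_toy = np.array([2, 2, 2, 2, 4, 4, 4, 4])

model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Four new points; the third "true" label is deliberately inconsistent with its cluster.
X_new = np.array([[1.5, 1.5], [8.5, 8.5], [2.5, 2.5], [7.5, 7.5]])
y_new = np.array([2, 4, 4, 4])

# Accuracy computed by hand versus the value returned by score().
manual_accuracy = np.mean(model.predict(X_new) == y_new)
print(manual_accuracy, model.score(X_new, y_new))
```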


In [13]:
# Find the predictions of this k-NN classifier on new data points.
# Here [4,2,1,1,1,2,3,2,1] and [4,2,1,2,2,2,3,2,1] are the two data points.
example_measures = np.array([[4,2,1,1,1,2,3,2,1], [4,2,1,2,2,2,3,2,1]])

# To get rid of the deprecation error, use the line below.
example_measures = example_measures.reshape(len(example_measures), -1)

prediction = clf.predict(example_measures)
print(prediction)


[2 2]
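To see why a point received its label, it can help to look at the neighbors that voted. The sketch below uses the `kneighbors` and `predict_proba` methods of `KNeighborsClassifier` on a small made-up dataset; the arrays and the name `knn` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three points per class.
X_small = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_small = np.array([2, 2, 2, 4, 4, 4])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_small, y_small)

query = np.array([[2, 2]])

# Distances to, and row indices of, the 3 nearest training points.
distances, indices = knn.kneighbors(query)
print(indices)
print(y_small[indices])            # labels of those neighbors, i.e. the votes

# Fraction of votes per class, in the order given by knn.classes_.
print(knn.classes_)
print(knn.predict_proba(query))
```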


#### Questions:

  1. What do you think is the significance of the `df.drop(['id'], ...)` line, which removes the `id` column, in the above example? Run the program again with this line commented out. What do you observe, and why?