# K-Nearest Neighbour

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used for both classification and regression problems. KNN assumes that similar things exist in close proximity: data points that lie near each other in feature space tend to share the same label. For classification, it uses a set of labeled data points, separated into several classes, to predict the class of a new sample point.

## Algorithm

1. Initialize K to your chosen number of neighbors
2. For each example in the data, calculate the distance between the query example and the current example
3. Add the distance and the index of the example to an ordered collection
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. For classification, return the most common label among them; for regression, return their mean
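The steps above can be sketched in plain Python. This is a minimal illustration rather than how scikit-learn implements KNN: the function name `knn_predict`, the toy points, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Steps 2-3: distance from the query to every training example,
    # stored together with the index of that example.
    distances = [
        (math.dist(point, query), index)
        for index, point in enumerate(train_points)
    ]
    # Step 4: sort by distance, smallest first.
    distances.sort()
    # Steps 5-6: keep the first k entries and look up their labels.
    k_labels = [train_labels[index] for _, index in distances[:k]]
    # Classification: return the most common label among the k neighbors.
    return Counter(k_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.2, 1.4), k=3))  # a
print(knn_predict(points, labels, (6.8, 7.0), k=3))  # b
```

Sorting every distance keeps the sketch close to the listed steps; it is not meant to be efficient on large datasets.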




Here I am using the **breast_cancer** dataset from the sklearn library.

In [0]:
from sklearn.datasets import load_breast_cancer       # loader for the breast cancer dataset
from sklearn.model_selection import train_test_split  # splits arrays or matrices into random train and test subsets
from sklearn.neighbors import KNeighborsClassifier    # the KNN classifier
from sklearn import metrics                           # performance measures such as accuracy
dset = load_breast_cancer()                           # load the breast cancer dataset into dset

The breast cancer dataset is a binary classification dataset.

In [46]:
print(dset.data.shape) # (number of rows, number of columns) in the dataset

(569, 30)


It consists of 569 samples, each with 30 features.

In [47]:
print(dset.target_names) # print the names of the target classes
print('\n')
print(dset.target)       # print the integer class label of every sample

['malignant' 'benign']


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1


The dataset consists of two target classes, 'malignant' and 'benign', encoded as 0 and 1 respectively.
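To see how the integer labels map onto the class names, the short sketch below counts the samples per class. The use of `numpy.bincount` is a choice made for this illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dset = load_breast_cancer()

# Count how many samples carry each integer label (index 0, then index 1).
counts = np.bincount(dset.target)

# The position of a name in target_names is the integer label it stands for.
for label, name in enumerate(dset.target_names):
    print(label, name, counts[label])
```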

In [48]:
print(dset.feature_names) # return all the feature names.

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [0]:
X_train, X_test, y_train, y_test = train_test_split(dset.data, dset.target, test_size=0.3) # 70% training and 30% test

A single train/test split is made easy with the train_test_split function from sklearn.model_selection. We split the dataset in a 70:30 proportion, keeping 30% as test data. We assign more data to training so that the model is trained well enough; with too little training data, the model is likely to underfit.
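The split sizes can be checked directly. The sketch below repeats the split with a fixed `random_state`; the seed value 0 is an assumption added here so that the same rows land in each subset on every run, and it is not part of the original call.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dset = load_breast_cancer()

# random_state=0 is an arbitrary seed (an assumption, not in the original call).
X_train, X_test, y_train, y_test = train_test_split(
    dset.data, dset.target, test_size=0.3, random_state=0
)

# Number of rows and columns in each subset.
print(X_train.shape, X_test.shape)
```

With 569 samples and `test_size=0.3`, this should give 398 training rows and 171 test rows, since the test size is rounded up.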

In [0]:
#Create KNN Classifier with k value=5
knn = KNeighborsClassifier(n_neighbors=5)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = knn.predict(X_test)

The value of K determines the decision boundaries between the classes. To select the K that is right for your data, run the KNN algorithm several times with different values of K and choose the K that gives the fewest errors.

In [51]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)) # compute the accuracy on the test set

Accuracy: 0.9298245614035088
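With its default arguments, `metrics.accuracy_score` returns the fraction of samples whose predicted label matches the true label. A minimal pure-Python sketch of that calculation follows; the helper name `accuracy` and the toy label lists are hypothetical and used only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy labels: the third prediction is wrong, the rest are right.
y_true_toy = [0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 0, 0, 1]

print(accuracy(y_true_toy, y_pred_toy))  # 4 of 5 correct -> 0.8
```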


In [0]:
#Create KNN Classifier with k = 4
knn = KNeighborsClassifier(n_neighbors=4)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = knn.predict(X_test)

In [84]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)) # compute the accuracy on the test set

Accuracy: 0.9064327485380117


With K = 4, the classification accuracy decreases from about 0.930 to about 0.906.
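Rather than refitting by hand for each K, the comparison can be run in a loop. The sketch below is one possible way to do it, under stated assumptions: the candidate range 1 to 15 and the fixed `random_state` are arbitrary choices made here, so the accuracies will differ from the ones printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

dset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dset.data, dset.target, test_size=0.3, random_state=0  # seed is an assumption
)

# Fit one classifier per candidate K and record its test accuracy.
accuracy_by_k = {}
for k in range(1, 16):  # candidate range 1..15 is an arbitrary choice
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracy_by_k[k] = metrics.accuracy_score(y_test, model.predict(X_test))

# Pick the K with the highest accuracy (ties go to the smallest K).
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print("Best K:", best_k, "accuracy:", accuracy_by_k[best_k])
```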