# KNN (K Nearest Neighbours)

KNN is an algorithm that classifies a new data point on the basis of its neighbours.

The algorithm is simple: it finds the k nearest neighbours of a data point and assigns the point to the class that most of those neighbours belong to.

<img src='https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/63621/versions/2/screenshot.gif'>
Here, k=5 and the new point will be classified as orange.
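
As a minimal sketch of the idea (not the implementation used later in this notebook), the function below classifies a single point by majority vote among its k nearest neighbours. The function name `knn_predict` and the toy data are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Return the majority class among the k training points closest to x_new."""
    # squared Euclidean distance from x_new to every training point
    dists = ((X_train - x_new) ** 2).sum(axis=1)
    # positions of the k smallest distances
    nearest = np.argsort(dists, kind="stable")[:k]
    # majority vote over the neighbours' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# toy data: two well-separated clusters (made up for illustration)
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
y_toy = np.array(["blue", "blue", "blue", "orange", "orange", "orange"])

print(knn_predict(X_toy, y_toy, np.array([7.5, 8.5]), k=3))  # orange
```

Squared distances are enough here because taking the square root does not change which neighbours are closest.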

A common way of choosing k is to try every value from 1 to 20 and keep the one whose model performs best on held-out (cross-validation) data.
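
This search can be sketched as a loop that scores each candidate k on a validation set and keeps the best one. Everything here is hypothetical and only meant to illustrate the procedure: the helper names `knn_accuracy` and `best_k`, the 0/1 labels, and the synthetic data.

```python
import numpy as np

def knn_accuracy(X_tr, y_tr, X_val, y_val, k):
    """Fraction of validation points whose k-NN majority vote matches the label."""
    correct = 0
    for x, y in zip(X_val, y_val):
        dists = ((X_tr - x) ** 2).sum(axis=1)
        nearest = np.argsort(dists, kind="stable")[:k]
        # labels are 0/1, so the mean is the vote share for class 1 (ties go to class 0)
        pred = int(y_tr[nearest].mean() > 0.5)
        correct += int(pred == y)
    return correct / len(y_val)

def best_k(X_tr, y_tr, X_val, y_val, candidates=range(1, 21)):
    """Return (k, accuracy) for the candidate with the highest validation accuracy."""
    scores = {k: knn_accuracy(X_tr, y_tr, X_val, y_val, k) for k in candidates}
    k_star = max(scores, key=scores.get)  # ties resolve to the smallest k
    return k_star, scores[k_star]

# synthetic two-class data (made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
order = rng.permutation(len(y))
X, y = X[order], y[order]

k_star, acc = best_k(X[:80], y[:80], X[80:], y[80:])
print(k_star, acc)
```

Odd values of k avoid tied votes in a two-class problem, which is one reason a value such as 11 is a convenient choice.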

In [1]:
#importing dependencies
import pandas as pd #for manipulating data
import numpy as np  #for working with matrices

In [2]:
#reading data 
names=['ID','Clump_thickness','uni_size','uni_shape','adhesion','epi_cellsize','bland_nucleii','bland_chromatin','normal','mitoses','class']
data=pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None, names=names)

In [3]:
#the 'class' column labels each patient's tumour (2 = benign, 4 = malignant)
data

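| | ID | Clump_thickness | uni_size | uni_shape | adhesion | epi_cellsize | bland_nucleii | bland_chromatin | normal | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |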


In [4]:
#replacing missing values ('?') with an outlier value (100)
data.loc[data.bland_nucleii == '?','bland_nucleii'] = '100'

In [5]:
#dropping useless info
data.drop(labels='ID',axis=1,inplace=True)

In [6]:
#converting the column from object (string) to numeric
data.bland_nucleii = pd.to_numeric(data.bland_nucleii)
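
An alternative to substituting an outlier value, sketched here on a small made-up column rather than the real dataset, is to let `pd.to_numeric` turn unparseable entries into `NaN` with `errors='coerce'` and then drop or impute those rows. The frame `toy` below is an assumption for illustration only.

```python
import pandas as pd

# small made-up frame with '?' marking a missing value, as in the dataset above
toy = pd.DataFrame({"bland_nucleii": ["1", "10", "?", "4"],
                    "class": [2, 4, 2, 4]})

# '?' cannot be parsed, so errors='coerce' turns it into NaN
toy["bland_nucleii"] = pd.to_numeric(toy["bland_nucleii"], errors="coerce")

# option 1: drop the rows with missing values
dropped = toy.dropna(subset=["bland_nucleii"])

# option 2: fill missing values with the column median
filled = toy.fillna({"bland_nucleii": toy["bland_nucleii"].median()})

print(len(dropped), filled["bland_nucleii"].tolist())
```

Which option suits a distance-based method better depends on the data; a large placeholder such as 100 pushes the affected rows far away from every other point.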

In [7]:
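#splitting the data (shuffling the rows first is recommended)
#data=data.sample(frac=1).reset_index(drop=True)
#un-comment the line above for better and more general performance
#note: .loc slices include both endpoints, so rows 200 and 630 each fall into two of the sets below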
Xtest=data.loc[630:699,'Clump_thickness':'mitoses']
Ytest=data.loc[630:699,'class']
Xcv=data.loc[200:630,'Clump_thickness':'mitoses']
Ycv=data.loc[200:630,'class']
Xtr=data.loc[0:200,'Clump_thickness':'mitoses']
Ytr=data.loc[0:200,'class']
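
Because `.loc` slices include both endpoints, a split without shared boundary rows can be sketched with `.iloc`, whose slices exclude the end position. The toy frame and the 60/20/20 proportions below are assumptions for illustration, not the split used in this notebook.

```python
import numpy as np
import pandas as pd

# made-up frame standing in for the real data
toy = pd.DataFrame({"feature": np.arange(10), "class": [2, 4] * 5})

# shuffle the rows reproducibly, then reset the index
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

n = len(shuffled)
n_tr, n_cv = int(0.6 * n), int(0.2 * n)

# .iloc slices are half-open, so the three parts do not overlap
train = shuffled.iloc[:n_tr]
cv = shuffled.iloc[n_tr:n_tr + n_cv]
test = shuffled.iloc[n_tr + n_cv:]

print(len(train), len(cv), len(test))
```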

In [11]:
#knn algorithm
Ypr=[]  #predicted class
Ypr2=[]  #predicted confidence
k=11
for idx in Xcv.index:
    #squared Euclidean distance from this validation point to every training point
    dist=((Xtr-Xcv.loc[idx,:])**2).sum(axis=1)
    #index labels of the k training points with the smallest distances
    nearest=dist.sort_values().iloc[0:k].index
    #classes are 2 or 4, so this sum is 2*(number of malignant neighbours)
    nfinder=(Ytr.loc[nearest]-2).sum()
    #malignant if more than half of the k neighbours are malignant
    if nfinder>=(k+1):
        Ypr.append(4)
        Ypr2.append(nfinder/(2*k))
    else:
        Ypr.append(2)
        Ypr2.append(1-(nfinder/(2*k)))

Confidence is the fraction of the k nearest neighbours that belong to the predicted class. We take it into consideration when giving predictions. In this example a wrong prediction can be fatal, so we err on the safe side: the model can report that a tumour is malignant with, say, 70% confidence, and the doctor can then decide whether to start further treatment.
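
One way to err on the safe side can be sketched as lowering the share of malignant neighbours needed to flag a case. The function name `predict_with_threshold` and the threshold value are hypothetical; the notebook's own code uses a plain majority vote.

```python
import numpy as np

def predict_with_threshold(neighbour_classes, threshold=0.5):
    """Classify from the classes (2 = benign, 4 = malignant) of the k nearest neighbours.

    Returns (predicted_class, confidence), where confidence is the share of
    neighbours belonging to the predicted class.
    """
    neighbour_classes = np.asarray(neighbour_classes)
    malignant_share = float((neighbour_classes == 4).mean())
    if malignant_share >= threshold:
        return 4, malignant_share
    return 2, 1 - malignant_share

# 4 of 11 neighbours are malignant (made-up example)
neighbours = [4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2]

print(predict_with_threshold(neighbours))                 # majority vote: class 2
print(predict_with_threshold(neighbours, threshold=0.3))  # cautious setting: class 4
```

A lower threshold trades more false alarms for fewer missed malignant cases, so the right value is a clinical judgement rather than something the algorithm can decide.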

In [15]:
#conf_array pairs each predicted class ('2' or '4') with the confidence of that prediction
conf_array=np.column_stack((Ypr,Ypr2))
print(conf_array)

[[4.         0.90909091]
 [4.         1.        ]
 [2.         1.        ]
 [2.         1.        ]
 [2.         1.        ]
 [4.         1.        ]
 [4.         0.90909091]
 [2.         1.        ]
 [2.         1.        ]
 [2.         1.        ]
 [4.         1.        ]
 [4.         1.        ]
 [2.         1.        ]
 [4.         1.        ]
 [4.         1.        ]
 [4.         0.90909091]
 [2.         1.        ]
 [2.         1.        ]
 [4.         1.        ]
 [2.         1.        ]
 [2.         1.        ]
 [4.         0.90909091]
 [2.         1.        ]
 [4.         0.90909091]
 [4.         0.90909091]
 [2.         1.        ]
 [4.         0.90909091]
 [4.         0.90909091]
 [2.         1.        ]
 [4.         1.        ]
 [4.         0.90909091]
 [4.         0.90909091]
 [4.         0.90909091]
 [4.         0.90909091]
 [2.         0.72727273]
 [2.         0.54545455]
 [4.         1.        ]
 [4.         0.90909091]
 [4.         1.        ]
 [4.         1.        ]


In [10]:
#accuracy
((Ypr==Ycv).sum())/len(Ypr)

0.9675174013921114

## Confusion Matrix

A confusion matrix summarises how a model's predictions compare with the actual classes. It looks like this:
<img src='https://i0.wp.com/lh3.ggpht.com/_qIDcOEX659I/SzjW6wGbmyI/AAAAAAAAAtY/Nls9tSN6DgU/contingency_thumb%5B3%5D.png'>

- TP is the number of actual positives which were correctly predicted as positive by our model
- FP is the number of actual negatives which were predicted as positive by our model
- FN is the number of actual positives which were predicted as negative by our model
- TN is the number of actual negatives which were correctly predicted as negative by our model
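
These four counts can be computed directly with boolean masks. The sketch below uses made-up label arrays and treats 4 (malignant) as the positive class; the array and variable names are assumptions for illustration.

```python
import numpy as np

# made-up labels: 2 = benign (negative), 4 = malignant (positive)
actual    = np.array([4, 4, 4, 2, 2, 2, 2, 4])
predicted = np.array([4, 4, 2, 2, 2, 4, 2, 4])

tp = int(((actual == 4) & (predicted == 4)).sum())  # malignant, predicted malignant
fn = int(((actual == 4) & (predicted == 2)).sum())  # malignant, predicted benign
fp = int(((actual == 2) & (predicted == 4)).sum())  # benign, predicted malignant
tn = int(((actual == 2) & (predicted == 2)).sum())  # benign, predicted benign

# rows are actual classes, columns are predicted classes
confusion = np.array([[tn, fp],
                      [fn, tp]])
print(confusion)
```

The four counts always add up to the number of examples, which is a quick sanity check on any confusion matrix.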

Various evaluation metrics derived from the confusion matrix are used to compare models.

We will use the F1 score, which combines the precision and recall of the model.

- Precision is the ratio of TP to the predicted positives (TP + FP)
- Recall is the ratio of TP to the actual positives (TP + FN)
- F1 score is the harmonic mean of precision and recall
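
Written as formulas, using the counts defined above:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

For instance, with $TP = 278$ and $FP + FN = 14$, as in the confusion matrix computed below, $F_1 = 556 / 570 \approx 0.975$. The last form shows that F1 does not depend on how the errors are divided between FP and FN, only on their total.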

In [40]:
#### true negatives(2)
tn=(Ycv[(Ypr == Ycv)] - 4).sum() * -1/2

In [36]:
#### true positives(4)
tp=((Ycv[(Ypr == Ycv)]) -2).sum() * 1/2

In [51]:
#### false negatives (actual 4, predicted 2)
fn=(Ycv[(Ypr != Ycv)] -2).sum() * 1/2

In [52]:
#### false positives (actual 2, predicted 4)
fp = (Ycv[(Ypr != Ycv)] -4).sum() * -1/2


In [53]:
#confusion matrix (rows: actual class, columns: predicted class)
c_matrix = np.array([[tn,fp],[fn,tp]])
print(c_matrix)

[[278.   9.]
 [  5. 278.]]


In [54]:
#f1 score
precision=tp/(tp+fp)
recall=tp/(fn+tp)
f1_score=(2*precision*recall)/(precision+recall)
print(f1_score)

0.9754385964912281


As you can see, our model achieves an F1 score of about 0.975, which is pretty good.