# K-nearest neighbors (KNN)

## Theory

K-nearest neighbors is a supervised machine learning algorithm used for classification 
and regression. It assumes that data points belonging to the same class lie close 
to each other. To classify a new data point, the algorithm computes the distance between 
it and every training data point, selects the K nearest neighbors, and counts how often 
each class occurs among them. The most common class among the K nearest neighbors is the 
predicted class of the new example. For regression, the prediction is instead typically 
the average of the neighbors' target values.
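
The description above leaves the distance measure open; the implementation below uses the Euclidean distance. As a sketch in notation that is not part of the original text (with $x$ the new point, $x_i$ a training point with label $y_i$, $n$ the number of features, and $N_K(x)$ the indices of the $K$ training points nearest to $x$), the distance and the majority vote can be written as:

$$
d(x, x_i) = \sqrt{\sum_{j=1}^{n} \left(x_j - x_{i,j}\right)^2},
\qquad
\hat{y}(x) = \arg\max_{c} \sum_{i \in N_K(x)} \mathbf{1}\left[y_i = c\right]
$$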

## Implementation

In [42]:
import numpy as np
import pandas as pd 
import warnings
from collections import Counter

In [51]:
def k_nearest_neighbors(data, predict, k=3):
	if len(data) >= k:
		warnings.warn('K is set to a value less than or equal to the total number of voting groups!')

	distances = []

	# Compute the distance between the predict data point and each training data point
	for group in data:
		for features in data[group]:
			euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
			distances.append([euclidean_distance, group])

	# Select the k nearest neighbors and count which class is the most common
	votes = [distance[1] for distance in sorted(distances)[:k]]
	vote_result, vote_count = Counter(votes).most_common(1)[0]
	# Confidence is the share of the k votes won by the predicted class
	confidence = vote_count / k

	return vote_result, confidence
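
As a quick sanity check of the voting logic, the sketch below applies the same idea to a tiny hand-made dataset. It is a vectorized variant, not the function above: `knn_predict` and the toy points are hypothetical names and values chosen for illustration, and the data layout (a dict mapping each class label to a list of feature vectors) is assumed to match `data`.

```python
import numpy as np
from collections import Counter

def knn_predict(data, point, k=3):
    """Vectorized KNN vote; `data` maps class label -> list of feature vectors."""
    # Stack all training points and remember the label of each row
    features = np.vstack([np.asarray(rows, dtype=float) for rows in data.values()])
    labels = np.concatenate([[label] * len(rows) for label, rows in data.items()])

    # Euclidean distance from `point` to every training row at once
    distances = np.linalg.norm(features - np.asarray(point, dtype=float), axis=1)

    # Labels of the k closest rows, then a majority vote
    nearest = labels[np.argsort(distances, kind="stable")[:k]]
    label, count = Counter(nearest.tolist()).most_common(1)[0]
    return label, count / k

# Hypothetical toy data: two well-separated clusters
toy_data = {0: [[1, 2], [2, 3], [3, 1]], 1: [[6, 5], [7, 7], [8, 6]]}
label, confidence = knn_predict(toy_data, [5, 7], k=3)
print(label, confidence)
```

Computing all distances in one `np.linalg.norm(..., axis=1)` call avoids the Python-level double loop, which may matter once the training set grows.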

In [61]:
df = pd.read_csv("../data/data.csv")
df.drop(labels=["id", "Unnamed: 32"], axis=1, inplace=True)

train_fraction = 0.2  # fraction of the total row count taken from each class for training
n_train = int(train_fraction * df.shape[0])
benign_data = df[df['diagnosis'] == 'B'].drop(labels=['diagnosis'], axis=1).to_numpy()
malignant_data = df[df['diagnosis'] == 'M'].drop(labels=['diagnosis'], axis=1).to_numpy()

# The first n_train rows of each class form the training set; the remaining rows form the test set
train_set = {0: benign_data[:n_train], 1: malignant_data[:n_train]}
test_set = {0: benign_data[n_train:], 1: malignant_data[n_train:]}

correct = 0
total = 0

for group in test_set:
    for data in test_set[group]:
        vote, confidence = k_nearest_neighbors(train_set, data, k=5)
        correct += int(group == vote)
        total += 1

print('Accuracy', correct/total)

Accuracy 0.9416909620991254
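
The split above takes the first rows of each class in file order, so the reported accuracy depends on how the CSV happens to be sorted. A minimal sketch of a shuffled per-class split follows. `split_class`, the seed, and the toy arrays are hypothetical and only illustrate the idea; they are not part of the original notebook. Note that the fraction is applied per class here, which differs from the cell above.

```python
import numpy as np

def split_class(rows, train_fraction, rng):
    """Shuffle one class's rows and split them into (train, test) arrays."""
    rows = np.asarray(rows)
    shuffled = rows[rng.permutation(len(rows))]
    n_train = int(train_fraction * len(rows))
    return shuffled[:n_train], shuffled[n_train:]

rng = np.random.default_rng(0)  # fixed seed (arbitrary) for repeatability

# Toy stand-ins for benign_data and malignant_data
class_rows = {0: np.arange(20).reshape(10, 2), 1: np.arange(100, 112).reshape(6, 2)}

train_set, test_set = {}, {}
for label, rows in class_rows.items():
    train_set[label], test_set[label] = split_class(rows, 0.8, rng)

print({label: (len(train_set[label]), len(test_set[label])) for label in class_rows})
```

The resulting `train_set` and `test_set` dicts have the same shape as the ones used in the evaluation loop, so they could be passed to `k_nearest_neighbors` unchanged.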
