# k-Nearest Neighbors implementation

- Implements KNN from scratch, without relying on a library implementation.
- Uses scikit-learn to calculate evaluation metrics and the confusion matrix.

The file name, the k value, and the training/test split ratio can be passed as command-line arguments, for example:
        python knn.py data/iris.csv 5 0.67
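
A minimal sketch of how those three positional arguments might be read, assuming they arrive in the order shown above. `parse_args`, its default values, and its validation rules are illustrative assumptions; they are not taken from `knn.py`.

```python
def parse_args(argv):
    """Read file name, k, and split ratio from positional arguments.

    Hypothetical helper: the fallback values below are assumptions
    chosen to mirror the example command, not documented defaults.
    """
    filename = argv[1] if len(argv) > 1 else "data/iris.csv"
    k = int(argv[2]) if len(argv) > 2 else 5
    split = float(argv[3]) if len(argv) > 3 else 0.67
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not 0.0 < split < 1.0:
        raise ValueError("split ratio must be strictly between 0 and 1")
    return filename, k, split


# Same values as the example command line
print(parse_args(["knn.py", "data/iris.csv", "5", "0.67"]))
```

In a script, the list passed in would normally be `sys.argv`.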

It has been tested with the following example data sets:
- [arrhythmia](./data/arrhythmia.csv): missing values replaced by -1 (https://archive.ics.uci.edu/ml/datasets/Arrhythmia)
- [banknote](./data/banknote.csv): nothing changed, converted to CSV (https://archive.ics.uci.edu/ml/datasets/banknote+authentication)
- [forestfires](./data/forestfires.csv): categorical values (mon, day) converted to numeric values, all values larger than 0 in the burned area column converted to 1 (https://archive.ics.uci.edu/ml/datasets/Forest+Fires)
- [iris](./data/iris.csv): categorical result values converted to numeric values (https://archive.ics.uci.edu/ml/datasets/Iris)
- [lung-cancer](./data/lung-cancer.csv): target values moved to the last column, missing values replaced by -1 (https://archive.ics.uci.edu/ml/datasets/Lung+Cancer)
- [phishing-websites](./data/phishing-websites.csv): nothing changed, converted to CSV without header (https://archive.ics.uci.edu/ml/datasets/Phishing+Websites)
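
The preprocessing steps listed above could be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, the `"?"` missing-value marker is an assumption about the raw files, and the numeric codes actually used in the CSV files in `data/` may differ.

```python
def replace_missing(row, marker="?", fill="-1"):
    """Replace missing-value markers in one CSV row with a fill value."""
    return [fill if value.strip() == marker else value for value in row]


def encode_labels(values):
    """Map each distinct categorical value to an integer,
    numbered in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


def binarize(values, threshold=0.0):
    """Turn a numeric column into 0/1: 1 where the value exceeds threshold."""
    return [1 if float(value) > threshold else 0 for value in values]


print(replace_missing(["75", "?", "0.5"]))
print(encode_labels(["Iris-setosa", "Iris-versicolor", "Iris-setosa"])[0])
print(binarize(["0", "0.36", "10.5"]))
```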

The main source for the code is the following tutorial: [Develop k-Nearest Neighbors in Python From Scratch](http://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/)

In [None]:
from operator import itemgetter
from utility import display, euclidean, load_dataset, split_dataset

## Locate the most similar neighbors

In [None]:
def get_neighbors(training, test, k):
    # Distance from the test instance to every training instance, keyed by index
    distances = {}
    for x in range(len(training)):
        distances[x] = euclidean(test, training[x])
    # Sort ascending so the closest training instances come first
    distances = sorted(distances.items(), key=itemgetter(1))
    # Keep the indices of the k closest instances
    return [index for index, _ in distances[:k]]
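
To make the neighbor search concrete, here is a self-contained sketch on toy data. `nearest_indices` is a hypothetical name, and `math.dist` (Python 3.8+) stands in for `utility.euclidean`, which is not shown in this notebook. The important detail is that the k neighbors are the entries with the smallest distances.

```python
import math


def nearest_indices(training, test, k):
    """Return the indices of the k training rows closest to `test`."""
    # math.dist is a stand-in for the notebook's euclidean helper (assumption)
    distances = [(i, math.dist(test, row)) for i, row in enumerate(training)]
    # Ascending sort: smallest distance (most similar row) first
    distances.sort(key=lambda pair: pair[1])
    return [i for i, _ in distances[:k]]


training = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
print(nearest_indices(training, [0.9, 1.2], 2))  # [1, 0]
```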

## Make a classification prediction with neighbors

In [None]:
def predict(neighbors, target):
    # Count how many of the neighbors belong to each class
    class_votes = {}
    for x in neighbors:
        response = target[x]
        class_votes[response] = class_votes.get(response, 0) + 1
    # The class with the most votes wins
    sorted_votes = sorted(class_votes.items(),
                          key=itemgetter(1), reverse=True)
    return sorted_votes[0][0]
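
The same majority vote can be written with `collections.Counter`; the sketch below is an alternative formulation for illustration, with `majority_vote` as a hypothetical name. On a tie, `Counter.most_common` keeps the class that was encountered first, so the result depends on neighbor order; choosing an odd k for two-class problems avoids ties.

```python
from collections import Counter


def majority_vote(neighbors, target):
    """Return the most common class label among the given neighbor indices."""
    votes = Counter(target[i] for i in neighbors)
    # most_common(1) yields a one-element list of (label, count)
    return votes.most_common(1)[0][0]


target = ["a", "b", "a", "b", "b"]
print(majority_vote([0, 2, 4], target))  # a
```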

## Load data

In [None]:
dataset, target = load_dataset("data/forestfires.csv")

## Split data

In [None]:
train_x, train_y, test_x, test_y = split_dataset(dataset, target, 0.8)
print("Training set size: %d" % (len(train_x)))
print("Testing set size: %d" % (len(test_x)))
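
`split_dataset` lives in the `utility` module and is not shown here. The sketch below is one plausible way such a split could work, assuming a shuffled index split that returns values in the same order as the call above; `random_split` and its `seed` parameter are hypothetical.

```python
import random


def random_split(dataset, target, ratio, seed=None):
    """Shuffle row indices and split features and labels with the same cut."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    # The first `ratio` share of the shuffled rows becomes the training set
    cut = int(len(indices) * ratio)
    train_idx, test_idx = indices[:cut], indices[cut:]
    train_x = [dataset[i] for i in train_idx]
    train_y = [target[i] for i in train_idx]
    test_x = [dataset[i] for i in test_idx]
    test_y = [target[i] for i in test_idx]
    return train_x, train_y, test_x, test_y


rows = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
train_x, train_y, test_x, test_y = random_split(rows, labels, 0.8, seed=1)
print(len(train_x), len(test_x))  # 8 2
```

Splitting by index keeps each feature row paired with its label, which the prediction step relies on when it looks up `train_y` by neighbor index.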

## Predict

In [None]:
predictions = []
actual = []
for x in range(len(test_x)):
    neighbors = get_neighbors(train_x, test_x[x], 5)
    result = predict(neighbors, train_y)
    predictions.append(result)
    actual.append(test_y[x])

## Calculate and display scores

In [None]:
display(actual, predictions)
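
The `display` helper is not shown here; according to the notes at the top it relies on scikit-learn. As a dependency-free illustration of the kind of numbers such a report is built from, the sketch below computes accuracy and confusion counts by hand. The function names are hypothetical, and this is not the notebook's implementation.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) label pair."""
    return Counter(zip(actual, predicted))


actual_labels = [0, 0, 1, 1, 1]
predicted_labels = [0, 1, 1, 1, 0]
print(accuracy(actual_labels, predicted_labels))  # 0.6
print(confusion_counts(actual_labels, predicted_labels)[(1, 1)])  # 2
```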