# Nearest Neighbors
In this notebook, I will create a simple nearest neighbors model to classify the leaves dataset provided by Kaggle. I make use of Euclidean distances, k nearest neighbors, log loss as a metric, and cross-validation.

* **Processing Data**
    * Loading Data
    * Extract features & labels
    * Normalize Features
    * Training & Validation Split
* **Nearest Neighbors**
    * Euclidean Distances
    * K Nearest Neighbors
    * Probability Dataframe
    * Predictions
    * Log Loss Metric
    * Cross-Validation
    * Probabilities Testing Set

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Process Data
## Loading Data
First, the train and test data are loaded into pandas dataframes.

In [None]:
data = pd.read_csv('../input/train.csv', index_col=0)
testData = pd.read_csv('../input/test.csv', index_col=0)
data.head(6)

## Extract features & labels
We shuffle the data at this early stage, so that the original row order has no influence when we split into training and validation sets. Next, the training labels and features are separated.

In [None]:
data = data.sample(frac=1)
features = data[data.columns[1:193]]
labels = data['species']

In [None]:
features.head()

## Normalize Features
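The training and testing features are normalized: each feature column is divided by its Euclidean (L2) norm, so that every column has unit length. Note that the training and test sets are each scaled by their own column norms.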

In [None]:
features = features / np.linalg.norm(features, axis=0)
testData = testData / np.linalg.norm(testData, axis=0)
features.head()
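
As a minimal sketch of what this column-wise normalization does, consider a toy table. The names `toy`, `columnNorms` and `toyNormalized` and the values are made up for illustration and are not part of the leaves dataset.

```python
import numpy as np
import pandas as pd

# Toy feature table (hypothetical values, not from the leaves dataset)
toy = pd.DataFrame({'a': [3.0, 4.0], 'b': [0.0, 2.0]})

# With axis=0, np.linalg.norm returns one Euclidean norm per column
columnNorms = np.linalg.norm(toy, axis=0)

# The division is broadcast over the rows, so each column is scaled by its own norm
toyNormalized = toy / columnNorms
print(toyNormalized)
```

After the division, every column of `toyNormalized` has a Euclidean norm of one.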

## Training & Validation Split
To test the helper functions that follow, we split the training data into a training and a validation set. Later, we'll do this again via cross-validation. For now, we put aside one tenth for validation. As we shuffled already, we can simply split by row position.

In [None]:
trainFeatures = features.iloc[:891,:]
validFeatures = features.iloc[891:,:]
trainLabels =  labels.iloc[:891]
validLabels = labels.iloc[891:]
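
The split position can also be derived from the number of rows instead of being hard-coded. The following is only a sketch on a toy dataframe; `toyFeatures`, `splitIndex`, `toyTrain` and `toyValid` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy dataframe with 20 rows (hypothetical data)
toyFeatures = pd.DataFrame(np.arange(40).reshape(20, 2), columns=['f1', 'f2'])

# Keep nine tenths of the rows for training, using integer arithmetic
splitIndex = len(toyFeatures) * 9 // 10

toyTrain = toyFeatures.iloc[:splitIndex, :]
toyValid = toyFeatures.iloc[splitIndex:, :]
print(len(toyTrain), len(toyValid))
```

With 990 rows, the same expression gives the split position 891 used above.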

# Nearest Neighbors
## Euclidean Distances
Let's say we want to find the closest neighbor in the training set for the datapoint at position 1 in the validation set. We create a function that calculates the Euclidean distances between this query and all training points. Then, we look at the label of the training point with the smallest distance. The label turns out to be the same as the label of the query.

In [None]:
#Computes distance(s) between a query and training point(s)
def computeDistances(trainData, query):
    distances = np.sqrt(np.sum((trainData - query)**2, axis=1))
    return distances
distances = computeDistances(trainFeatures, validFeatures.iloc[1,:])
# idxmin returns the index label of the smallest distance, which is what .loc expects
trainLabels.loc[distances.idxmin()]

In [None]:
validLabels.iloc[1]
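
To make the distance computation concrete, here is a small worked sketch with three training points and one query. All names and values (`toyTrain`, `toyQuery`, the index labels 10, 20 and 30) are invented for illustration. It also shows the difference between the index label of the nearest point and its row position, which matters once the data has been shuffled.

```python
import numpy as np
import pandas as pd

# Three toy training points with non-consecutive index labels (hypothetical data)
toyTrain = pd.DataFrame({'x': [0.0, 3.0, 6.0], 'y': [0.0, 4.0, 8.0]},
                        index=[10, 20, 30])
toyQuery = np.array([3.0, 3.0])

# Same formula as computeDistances: square root of the summed squared differences per row
toyDistances = np.sqrt(((toyTrain - toyQuery) ** 2).sum(axis=1))

nearestLabel = toyDistances.idxmin()                 # index label, to be used with .loc
nearestPosition = np.argmin(toyDistances.values)     # row position, to be used with .iloc
print(nearestLabel, nearestPosition)
```

The nearest point is the one with index label 20, which sits at row position 1.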

## K Nearest Neighbors
Now, we would like to use a given number (k) of closest neighbors to form a probability distribution. All k nearest neighbors are equally weighted, each contributing a probability of 1/k. We create a function that, for a given query, returns a dictionary with the leaf classes of the closest neighbors as keys and their summed probabilities as values.

In [None]:
def getNearestNeighbors(trainFeatures, trainLabels, query, k):
    distances = computeDistances(trainFeatures, query)
    # Index labels of the k training points closest to the query
    nearestIndex = distances.sort_values().head(k).index
    results = dict()
    for x in trainLabels.loc[nearestIndex]:
        if x not in results:
            results[x] = 1/float(k)
        else:
            results[x] += 1/float(k)
    return results

In [None]:
getNearestNeighbors(trainFeatures, trainLabels, validFeatures.iloc[1,:], 10)

So in this instance with k=10, each 0.1 acts as a 'vote' for that leaf class.
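
The voting step on its own can be sketched with the standard library. The neighbor labels below (`'maple'`, `'oak'`) are placeholders and not classes from the dataset, and `neighborLabels` and `votes` are hypothetical names.

```python
from collections import Counter

# Labels of the k nearest neighbors of some query (placeholder values)
neighborLabels = ['maple', 'maple', 'oak', 'maple']
k = len(neighborLabels)

# Every neighbor contributes 1/k to the class it belongs to
votes = {label: count / k for label, count in Counter(neighborLabels).items()}
print(votes)
```

The values always sum to one, since k neighbors each contribute 1/k.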

## Probability Dataframe
The purpose of the following is to generate dictionaries like the one above for all queries in a validation or test set, and to collect the probabilities in one dataframe. First, we store all the unique leaf classes in alphabetical order as the columns of this dataframe. We initialize the probability dataframe with zeros. Then, for each query, we loop through the classes of its closest neighbors and fill in the corresponding positions in the probability dataframe.

In [None]:
def getProbabilities(trainFeatures, trainLabels, testFeatures, k):
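    # np.unique returns the sorted unique class names (taken from the global training data)
    leaves = np.unique(data['species'])
    # Initialize with floats (0.0) so that fractional probabilities can be stored
    probabilities = pd.DataFrame(0.0, index=testFeatures.index, columns=leaves)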
    for index, query in zip(testFeatures.index, testFeatures.values):
        for key, value in getNearestNeighbors(trainFeatures, trainLabels, query, k).items():
            probabilities.loc[index, key] = value
    return probabilities

In [None]:
getProbabilities(trainFeatures, trainLabels, validFeatures, 100).head()

Here we used a massive k-value, just to view some probabilities. Note that each row will sum to one, as it forms a probability distribution of the 99 leaf classes.

## Predictions
Based on the probabilities, we are now interested in the leaf class with the highest probability, as this would be our prediction if we had to make one. So we generate a probability dataframe, and return the leaf class with the highest probability per row.

In [None]:
def getPredictions(trainFeatures, trainLabels, testFeatures, k):
    testProbabilities = getProbabilities(trainFeatures, trainLabels, testFeatures, k)
    return testProbabilities.idxmax(axis = 1)

In [None]:
validPredictions = getPredictions(trainFeatures, trainLabels, validFeatures, 10)
validPredictions.head()

The above leaf classes are our predictions for the first five validation leaves, based on ten nearest neighbors. We'll see next that the log loss is a more interesting metric, but at this point, we can calculate the accuracy as well. We simply count how many times our prediction matches the label, and divide this number by the total number of labels.

In [None]:
accuracy = sum(validPredictions==validLabels)/float(len(validLabels))
accuracy

## Log Loss Metric
The log loss is the metric used by Kaggle, and it accounts for (un)certainty. This is the formula, where L is the number of query leaves, C is the number of leaf classes, y is a binary value indicating whether leaf l actually belongs to class c (1 if so, 0 if not), and p is the predicted probability that leaf l belongs to class c.

$$\text{logloss} = -\frac{1}{L}\sum_{l=1}^L\sum_{c=1}^C{y_{lc}\log(p_{lc})}$$
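
As a minimal worked sketch of this formula, take two query leaves and three classes. The arrays `y` and `p` below are made-up values, not results from the model.

```python
import numpy as np

# One-hot indicators y_lc: leaf 0 belongs to class 0, leaf 1 belongs to class 1
y = np.array([[1, 0, 0],
              [0, 1, 0]])

# Predicted probabilities p_lc (hypothetical, each row sums to one)
p = np.array([[0.5, 0.25, 0.25],
              [0.1, 0.8, 0.1]])

# Inner sum over classes, then the negative mean over leaves
logloss = -np.mean(np.sum(y * np.log(p), axis=1))
print(logloss)
```

Only the probability assigned to the true class of each leaf contributes, so the result equals -(log(0.5) + log(0.8)) / 2, which is about 0.458.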

Here we implement the log loss function for a given training and validation set. Note that we replace extreme values (0 and 1) in the probability matrix with values very close to them (1e-15 and 1 - 1e-15), because the logarithm of 0 is undefined.

In [None]:
def getLogLoss(trainFeatures, trainLabels, validFeatures, validLabels, k):
    leaves = np.unique(data.species.sort_values().values)
    validProbabilities = getProbabilities(trainFeatures, trainLabels, validFeatures, k)
    totalLoss = 0
    for index, row in zip(validProbabilities.index, validProbabilities.values):
        bools = np.zeros(len(leaves))
        bools[np.where(leaves==validLabels.loc[index])] = 1
        probs = np.zeros(len(row))
        for i, x in enumerate(row):
            probs[i] = np.log(max(min(x,1-10**-15),10**-15))
        totalLoss += sum(bools*probs)
    logLoss = totalLoss / -len(validProbabilities.values)
    return logLoss

In [None]:
for i in range(15):
    print('k = ' + str(i+1) + ': ' + str(getLogLoss(trainFeatures, trainLabels, validFeatures, validLabels, i+1)))

Here we calculated the logloss for 15 different values of k.

## Cross-Validation
The log loss as calculated above is quite heavily influenced by the choice of the training and validation split, though. With cross-validation, we reduce this variation. For a given number of folds, we compute the log loss on shifting validation sets. We then take the average of these losses as our final logloss. The function takes the original features and labels as arguments.

In [None]:
def crossValidation(features, labels, folds, k):
    n = len(features)
    totalLoss = 0
    for i in range(folds):
        start = int((n*i)/folds)
        end = int((n*(i+1))/folds)
        validFeatures = features.iloc[start:end,:]   
        validLabels = labels.iloc[start:end]   
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        trainFeatures = pd.concat([features.iloc[0:start,:], features.iloc[end:n,:]])
        trainLabels = pd.concat([labels.iloc[0:start], labels.iloc[end:n]])
        totalLoss += getLogLoss(trainFeatures, trainLabels, validFeatures, validLabels, k)
    averageLoss = totalLoss / folds
    return averageLoss
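
To see how the `start` and `end` positions carve up the data, the boundary arithmetic can be looked at in isolation. `foldBounds` is a hypothetical helper written for this illustration only; it uses integer division, which gives the same result as the `int(...)` calls above for non-negative numbers.

```python
def foldBounds(n, folds):
    # (start, end) row positions of the validation slice for every fold
    return [(n * i // folds, n * (i + 1) // folds) for i in range(folds)]

# Ten rows split into three folds
bounds = foldBounds(10, 3)
print(bounds)
```

For 10 rows and 3 folds this gives `[(0, 3), (3, 6), (6, 10)]`: the slices are contiguous, do not overlap, and together cover every row exactly once, even when the number of rows is not divisible by the number of folds.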

In [None]:
lossAll = np.zeros(15)
for i in range(15):
    lossAll[i] = crossValidation(features, labels, folds=10, k=i+1)
    print('k = ' + str(i+1) + ': ' + str(lossAll[i]))

Here we did the same as above, but with cross-validation (using 10 folds). The differences are: (1) the values are much more stable and will be similar when rerun, and (2) the calculation takes longer because, for each k-value, the log loss is calculated ten times. We can plot these numbers to see that a k-value of about 5 results in the lowest logloss.

In [None]:
plt.plot(range(1, 16),lossAll, 'r')
plt.xlabel('# Nearest Neighbors = k')
plt.ylabel('logLoss')
plt.show()

## Probabilities Testing Set
Given the loglosses calculated above, we create a probability dataframe with k=5, based on all of the training data, for the test set.

In [None]:
submission = getProbabilities(features, labels, testData, 5)
submission.head()