# KNN (K-Nearest Neighbors) Lab

<img src="assets/Data_Science_VD.png" width="400px">

We have been diving into both Hacking Skills and Math and Statistics Knowledge.

Substantive Expertise (also called Domain Knowledge) is something that unfortunately cannot be taught.

Over the next few lessons, we will look at some machine learning techniques that can help overcome this shortcoming.

---

## Domain & Data

### Domain

Prepared for the Neural Information Processing Systems (NIPS) 2003 Feature Extraction Workshop

http://clopinet.com/isabelle/Projects/NIPS2003

### Data 

MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

![](assets/Large171.jpg)

MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features, one must separate the examples into the 2 classes (corresponding to the ±1 labels). A number of distractor features called 'probes', which have no predictive power, were also added. The order of the features and patterns was randomized.
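To get a feel for this construction, we can generate a Madelon-like dataset ourselves. This is only a sketch: scikit-learn's `make_classification` is documented as being adapted from the Madelon generator design, but it is not the original challenge data. The sample count, `class_sep`, `flip_y`, and the `X_demo`/`y_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# Assumed sizes: 2000 samples and 500 features are illustrative choices.
X_demo, y_demo = make_classification(
    n_samples=2000,
    n_features=500,
    n_informative=5,          # the five hypercube dimensions
    n_redundant=15,           # linear combinations of the informative features
    n_repeated=0,             # the remaining 480 features are pure noise ("probes")
    n_classes=2,
    n_clusters_per_class=16,  # 2 classes x 16 clusters = 32 clusters
    class_sep=2.0,            # assumed cluster separation
    flip_y=0.01,              # assumed fraction of randomly flipped labels
    shuffle=True,             # randomize the order of samples and features
    random_state=0,
)

# make_classification labels the classes 0/1; map them to -1/+1 as in MADELON.
y_demo = np.where(y_demo == 1, 1, -1)

print(X_demo.shape, np.unique(y_demo))
```

Because only 20 of the 500 columns carry signal, a distance-based model such as KNN has to cope with a large amount of noise in every distance it computes.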



## Problem Statement

The NIPS 2003 challenge in feature selection is to find feature selection algorithms that significantly outperform methods using all features in performing a binary classification task.

![](assets/180px-Binary-classification-labeled.png)
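As a rough illustration of what the challenge asks for, the sketch below compares a KNN model that uses all features against one that first keeps a subset. Everything here is an assumption made for illustration: the synthetic data, the choice of `SelectKBest` with `f_classif`, and `k=10` are not the challenge's methods, and no particular outcome is implied.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Small synthetic stand-in for a dataset with many uninformative features.
X_syn, y_syn = make_classification(n_samples=400, n_features=100,
                                   n_informative=5, n_redundant=5,
                                   random_state=0)

# Model 1: KNN on every feature.
all_features_model = KNeighborsClassifier()

# Model 2: keep the 10 features with the highest univariate F-score, then KNN.
# The selector sits inside the pipeline so it is refit on each training fold.
selected_features_model = make_pipeline(SelectKBest(f_classif, k=10),
                                        KNeighborsClassifier())

all_scores = cross_val_score(all_features_model, X_syn, y_syn, cv=5)
selected_scores = cross_val_score(selected_features_model, X_syn, y_syn, cv=5)

print("all features:      {:.3f}".format(all_scores.mean()))
print("selected features: {:.3f}".format(selected_scores.mean()))
```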



## Solution Statement

We will develop a binary classification model using a K Nearest Neighbors classifier.

<img src="assets/2012-10-26-knn-concept.png" width="400px">
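Before using the scikit-learn implementation, it may help to see the idea in a few lines of NumPy. This is a minimal sketch, not how scikit-learn implements it: `knn_predict` and the toy arrays are hypothetical names, the distance is assumed to be Euclidean, and ties are assumed to go to the smallest label.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in np.asarray(X_query, dtype=float):
        # Euclidean distance from the query point to every training point
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote; np.unique sorts labels, so a tie goes to the smallest label
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
toy_y = np.array([-1, -1, 1, 1])
toy_predictions = knn_predict(toy_X, toy_y, [[0.05, 0.1], [1.0, 0.9]], k=3)
print(toy_predictions)
```

The first query point sits next to the two `-1` examples and the second next to the two `+1` examples, so each is assigned the label of its closer group.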


## Metric 

Today, we are largely exploring the dataset, so we will use the default metric included with the classifier. For `KNeighborsClassifier`, the `score` method returns mean accuracy.
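As a quick check of what the default metric is, the sketch below compares the classifier's `score` method with `accuracy_score` on toy data. The toy arrays and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data: the label depends only on the sign of the first column.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.where(X_toy[:, 0] > 0, 1, -1)

clf = KNeighborsClassifier().fit(X_toy, y_toy)

# The classifier's default score ...
score_value = clf.score(X_toy, y_toy)
# ... is the fraction of correctly predicted labels.
accuracy_value = accuracy_score(y_toy, clf.predict(X_toy))

print(score_value, accuracy_value)
```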

## Benchmark

After today's work, we will assess what an appropriate benchmark might be.

In [None]:
from __future__ import print_function
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

### Load the Dataset into a DataFrame

Unfortunately, the features are nameless. Remember, this is a synthetic dataset, so the features have no real-world meaning. As such, visualization will be less helpful than usual for gaining insight into our data before we begin.

In [None]:
feature_datafile_location = "../../../data/madelon_data.csv"

In [None]:
madelon_feature_df = pd.read_csv(feature_datafile_location,
                                 sep=' ',
                                 header=None)
madelon_feature_df.columns = ['feat_' + str(col) 
                              for col in madelon_feature_df.columns]
madelon_feature_df.head(1)

In [None]:
fig = plt.figure(figsize=(12,4))

# Plot three randomly chosen pairs of features against each other.
# The features are continuous, so a scatter plot is used here;
# swarmplot would treat every distinct x value as its own category.
for position in (131, 132, 133):
    ax = fig.add_subplot(position)
    i = str(np.random.randint(500))
    j = str(np.random.randint(500))
    ax.scatter(madelon_feature_df['feat_'+i],
               madelon_feature_df['feat_'+j],
               s=5)
    ax.set_xlabel('feat_'+i)
    ax.set_ylabel('feat_'+j)

Looking at randomly selected features shows very little.

We can look at the shape of the data.

In [None]:
madelon_feature_df.shape

In [None]:
target_datafile_location = "../../../data/madelon_labels.csv"

In [None]:
madelon_target_df = pd.read_csv(target_datafile_location,
                                header=None)
madelon_target_df.columns = ['trgt_' + str(col) 
                             for col in madelon_target_df.columns]
madelon_target_df.head(1)

We can look at the unique target values.

In [None]:
madelon_target_df['trgt_0'].unique()

We can also look at the distribution of the target set.

In [None]:
madelon_target_df.groupby(['trgt_0']).size()

It appears as though our target set is evenly distributed.
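Class balance matters for reading accuracy scores: a model that always predicts the most common class scores exactly that class's share. The sketch below computes this share on a toy target. `toy_target_df` is a hypothetical stand-in with a perfectly even split, not the real labels.

```python
import pandas as pd

# Hypothetical stand-in for the target DataFrame: 50 examples of each class.
toy_target_df = pd.DataFrame({'trgt_0': [1, -1] * 50})

# Share of each class in the target column
class_share = toy_target_df['trgt_0'].value_counts(normalize=True)

# Accuracy obtained by always predicting the most common class
majority_baseline = class_share.max()

print(class_share)
print(majority_baseline)
```

With an evenly split target, this baseline is 0.5, so any score should be judged against that level rather than against zero.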

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, \
    X_test, \
    y_train, \
    y_test = train_test_split(madelon_feature_df,
                              madelon_target_df,
                              random_state=42)


In [1]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

In [10]:
knn.__str__()

"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n           metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n           weights='uniform')"

In [None]:
knn.fit(X_train, y_train)


The KNN fit emits a warning. It is associated with the shape of our target data: `y_train` is a single-column DataFrame (a two-dimensional column vector), while the classifier expects a one-dimensional array.
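The sketch below reproduces the situation on toy data by recording any warnings raised while fitting on a single-column DataFrame, then flattening the target. The toy frames and names are assumptions for illustration, and the exact warning text depends on the installed scikit-learn version.

```python
import warnings
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

toy_X = pd.DataFrame({'feat_0': [0.0, 0.1, 1.0, 1.1, 0.2, 0.9]})
toy_y = pd.DataFrame({'trgt_0': [-1, -1, 1, 1, -1, 1]})

# Record any warnings raised while fitting on the two-dimensional target.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)
recorded_messages = [str(w.message) for w in caught]

# Flatten the (n, 1) column vector into a one-dimensional (n,) array.
flat_y = np.ravel(toy_y)

print(toy_y.shape, flat_y.shape)
print(recorded_messages)
```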

In [None]:
y_train.shape, y_test.shape

In [None]:
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

In [None]:
y_train.shape, y_test.shape

In [None]:
knn.fit(X_train, y_train)

In [None]:
print("Train Score: {}".format(knn.score(X_train, y_train)))
print(" Test Score: {}".format(knn.score(X_test, y_test)))

### The Standard Sklearn Template For Classification

1. Load the data
1. Split the data into training and testing sets
1. Create a new model
1. Fit the model
1. Score the model

In [None]:
def standard_classification_knn(X, y, n_neighbors, random_state):
    X_train,     \
        X_test,  \
        y_train, \
        y_test = train_test_split(X, y, 
                                  random_state=random_state)

    y_train = np.ravel(y_train)
    y_test = np.ravel(y_test)
    
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=4)
    knn.fit(X_train, y_train)
    
    train_score = knn.score(X_train, y_train)
    test_score = knn.score(X_test, y_test)
    
    print("{} ".format(n_neighbors), end='')
    
    return {'n_neighbors': n_neighbors,
            'train_score' : train_score,
            'test_score' : test_score}

In [None]:
standard_classification_knn(madelon_feature_df,
                            madelon_target_df,
                            3,
                            random_state=42) 

To put these scores in context, we should compare train and test scores across multiple values of `n_neighbors`.

In [None]:
results = [standard_classification_knn(madelon_feature_df,
                                       madelon_target_df,
                                       n_neighbors,
                                       random_state=42) 
           for n_neighbors in range(2,20)]

results_df = pd.DataFrame(results)
results_df.head(2)

In [None]:
plt.plot(results_df['n_neighbors'], results_df['test_score'], label='test score')
plt.plot(results_df['n_neighbors'], results_df['train_score'], label='training score')
plt.legend()

In [None]:
results = [standard_classification_knn(madelon_feature_df,
                                       madelon_target_df,
                                       n_neighbors,
                                       random_state=21) 
           for n_neighbors in range(2,20)]

results_df = pd.DataFrame(results)
plt.plot(results_df['n_neighbors'], results_df['test_score'], label='test score')
plt.plot(results_df['n_neighbors'], results_df['train_score'], label='training score')
plt.legend()


### Why would our performance suffer when we evaluate based upon an even number of neighbors?
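One likely contributor is tied votes: with two classes and an even number of neighbors, the vote can split evenly, and the prediction then rests on a tie-breaking rule rather than on the data. The sketch below shows such a tie on a toy one-dimensional example. The arrays and names are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four training points on a line, two per class.
line_X = np.array([[0.0], [1.0], [3.0], [4.5]])
line_y = np.array([-1, -1, 1, 1])

# A query point exactly halfway between the points at 1.0 and 3.0.
query = np.array([[2.0]])

# k=2: the two nearest neighbors belong to different classes, so the vote ties.
proba_even = KNeighborsClassifier(n_neighbors=2).fit(line_X, line_y).predict_proba(query)

# k=3: the third neighbor (at 0.0) breaks the tie in favor of class -1.
proba_odd = KNeighborsClassifier(n_neighbors=3).fit(line_X, line_y).predict_proba(query)

# Columns are ordered by class label: [-1, +1].
print(proba_even)
print(proba_odd)
```

With an odd `n_neighbors` and two classes, an exact tie in the vote cannot occur.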

In [None]:
def plot_standard_knn_classification_for_odd_values_of_n(X, y, random_state):    
    # Use the X and y passed in, rather than the global DataFrames.
    results = [standard_classification_knn(X,
                                           y,
                                           n_neighbors,
                                           random_state=random_state) 
               for n_neighbors in range(3, 30, 2)]  # odd values only

    print()
    
    results_df = pd.DataFrame(results)
    plt.plot(results_df['n_neighbors'], results_df['test_score'], label='test score')
    plt.plot(results_df['n_neighbors'], results_df['train_score'], label='training score')
    plt.legend()

In [None]:
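fig = plt.figure(figsize=(20,6))
# suptitle labels the whole figure; plt.title would create an extra empty axes here.
# The curves show accuracy scores, not error.
fig.suptitle('Plots of Accuracy v Number of Neighbors for Various Random States')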

fig.add_subplot(141)
plot_standard_knn_classification_for_odd_values_of_n(madelon_feature_df, madelon_target_df, 12)

fig.add_subplot(142)
plot_standard_knn_classification_for_odd_values_of_n(madelon_feature_df, madelon_target_df, 18)

fig.add_subplot(143)
plot_standard_knn_classification_for_odd_values_of_n(madelon_feature_df, madelon_target_df, 25)

fig.add_subplot(144)
plot_standard_knn_classification_for_odd_values_of_n(madelon_feature_df, madelon_target_df, 42)
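Since the curves change with the random state, one way to summarize them is to average the test score over several splits for each value of `n_neighbors`. The sketch below does this on a small synthetic dataset. `mean_test_score`, the synthetic data, and the reuse of the four random states are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in so the sketch runs on its own.
X_syn, y_syn = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=0)

def mean_test_score(X, y, n_neighbors, random_states):
    """Average the test accuracy over several train/test splits."""
    scores = []
    for random_state in random_states:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y,
                                                  random_state=random_state)
        knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_tr, y_tr)
        scores.append(knn.score(X_te, y_te))
    return np.mean(scores)

# Odd values of n_neighbors only, averaged over four random states.
summary = {k: mean_test_score(X_syn, y_syn, k, [12, 18, 25, 42])
           for k in range(3, 12, 2)}

for k, score in summary.items():
    print(k, round(score, 3))
```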
