# KNN - K Nearest Neighbors with Online retail data

This section uses the same data set as the K Means notebook. In summary, the data set is the following.

http://archive.ics.uci.edu/ml/datasets/online+retail<br>

This is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer.

KNN is a simple concept: define some distance metric between the items in your dataset, and find the K closest items. You can then use those items to predict some property of a test item, by having them somehow "vote" on it. 
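
To make the "distance, then vote" idea concrete, here is a minimal from-scratch sketch in NumPy. The function name `knn_predict` and the toy points are made up for illustration, and Euclidean distance with a plain majority vote is an assumed choice; the rest of this notebook uses scikit-learn's implementation instead.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (illustrative only): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.2, 1.4])))  # 0
print(knn_predict(X_toy, y_toy, np.array([8.7, 8.8])))  # 1
```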

<br><b>First, let's define the difference between K-Means and K-nearest neighbors</b><br>
<br>
<b>K-means</b> is a clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other. It is unsupervised because the points have no external classification.
<br><br>
<b>K-nearest neighbors</b> is a classification (or regression) algorithm that determines the classification of a point by combining the classifications of its K nearest points. It is supervised because you are trying to classify a point based on the known classification of other points.
<br>
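
The contrast can be sketched on toy data: K-means is fitted on the points alone, while KNN must also be given labels. Here the labels come from the K-means output, which mirrors the workflow of these two notebooks. The toy points and variable names are assumptions for illustration, not the retail data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Toy points (illustrative only): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

# Unsupervised: K-means sees only the points and invents its own cluster ids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = kmeans.labels_

# Supervised: KNN needs known labels (here, the K-means output) to learn from
knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, cluster_ids)

# A new point near the second group receives that group's cluster id
new_point = np.array([[8.2, 8.4]])
print(knn.predict(new_point)[0] == cluster_ids[3])  # True
```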

So for this analysis we are going to use the classified data we already created in the last notebook: the online customers and their assigned classifications.<br>
With KNN you need to provide the labels for the dataset, in other words the answers. These are then used to decide how new points fit into the model.

So we will import the classified data we exported in the last notebook. First come the standard packages, then the Sklearn packages we will need.

In [10]:
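import numpy as np
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# replaces it and is aliased to the old name so calls written against it still work.
from sklearn import preprocessing, neighbors
from sklearn import model_selection as cross_validation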


Now we read the CSV file into a Pandas Dataframe

In [11]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'C:\Users\Glandore\Desktop\Github\Filtered_data.csv', sep='\t', encoding='utf-8')
df.head()


Unnamed: 0.1,Unnamed: 0,CustomerID,Quantity,UnitPrice,Classfication
0,0,17850,6,2.55,4
1,1,17851,6,3.39,4
2,2,17852,8,2.75,4
3,3,17853,6,3.39,4
4,4,17854,6,3.39,4


We could drop the unnecessary index columns from the DataFrame. The line is left commented out here, so those columns stay in the data.

In [12]:
#df = df.drop(df.columns[[0, 1]], axis=1)

Next, we define our features (X) and labels (y). We drop the class column from the main data set to get the features, and keep that column on its own as the labels.

In [13]:
X = np.array(df.drop(columns=['Classfication']))
y = np.array(df['Classfication'])

The features X are everything except for the class. The labels, y, are just the class column.

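Now we create training and testing samples using Scikit-Learn's train_test_split. It lives in sklearn.model_selection; the older sklearn.cross_validation module was removed in scikit-learn 0.20.

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)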
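
The split above is random, so the accuracy reported later can vary from run to run. As a hedged sketch on made-up data, passing `random_state` makes the split repeatable and `stratify` keeps the class proportions the same in both parts. Whether stratification is needed for the retail classes is an assumption to check against their actual counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, two balanced classes
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixed seed so the split is repeatable
    stratify=y_demo,    # preserve class proportions in both parts
)

print(X_tr.shape, X_te.shape)  # (16, 2) (4, 2)
print(np.bincount(y_te))       # [2 2]
```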

Now we have to define the classifier.

In [15]:
clf = neighbors.KNeighborsClassifier()

In this case, we're using the <A href ='https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier'>Nearest Neighbors classifier</A> from Sklearn.
<br><br>
<b>Train the classifier on the data</b>

In [16]:
clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
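
The printed defaults show `n_neighbors=5`, `metric='minkowski'` and `p=2`. The Minkowski distance with `p=2` is the ordinary Euclidean distance, and with `p=1` it is the Manhattan distance. The helper `minkowski` below is a hypothetical illustration of the formula, not scikit-learn's internal code.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, p=2))  # 5.0 (Euclidean: sqrt(3^2 + 4^2))
print(minkowski(a, b, p=1))  # 7.0 (Manhattan: 3 + 4)
```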

Now we need to test the model on the held-out test set.

In [17]:
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.99


99% accuracy is pretty impressive. It means that when new values come in, we can assign them to a class based on their features with a high degree of accuracy, at least judging by this test split.
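
One caveat worth checking with any distance-based model is feature scale: a column with very large values (an ID column, for instance) can dominate the distance. The sketch below uses synthetic data, invented for illustration, to compare an unscaled KNN with one wrapped in a `StandardScaler` pipeline. It says nothing about whether scaling would change the retail result; that is an assumption to test on the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)

# Informative feature: small scale, shifted by class
informative = rng.normal(loc=labels * 6.0, scale=1.0)
# Uninformative feature: huge scale, unrelated to the class
noise = rng.uniform(0, 100_000, size=n)
X_syn = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, labels, test_size=0.25, random_state=0)

# Same classifier, with and without standardising the features first
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print("unscaled:", raw_score)
print("scaled:  ", scaled_score)
```

On this synthetic set the scaled pipeline should score clearly higher, because the large noise column no longer swamps the informative one.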

<b>So how can we use this to find the classification of incoming data and customers?</b><br><br>To do this, we will make up a line of data that mimics a new customer.

In [34]:
example_client = np.array([111,17851,4,6.49])

All we do is pass this array into the clf model and ask it to predict the class.

In [35]:
example_client = example_client.reshape(1, -1) # need to reshape for a single sample
prediction = clf.predict(example_client)
print(prediction)

[4]


The result, 4, is the class held by the majority of the nearest neighbouring points to this sample.
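
To see how such a vote is made, a fitted classifier can report which training points were nearest and how the vote was shared. The sketch below uses a small toy classifier (`toy_clf`, with made-up points and labels) rather than the `clf` trained above, so it runs on its own; `kneighbors` and `predict_proba` are standard `KNeighborsClassifier` methods.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (illustrative only), labelled with classes 2 and 4
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_toy = np.array([2, 2, 2, 4, 4, 4])

toy_clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
sample = np.array([[8.2, 8.4]])

# Distances to, and row indices of, the 3 nearest training points
distances, indices = toy_clf.kneighbors(sample)
print(y_toy[indices[0]])              # labels of those neighbours

# Share of the vote per class, in the order given by toy_clf.classes_
print(toy_clf.classes_)
print(toy_clf.predict_proba(sample))

print(toy_clf.predict(sample))        # the winning class
```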

<b>What if we have a group of clients that need to be classified?</b><br><br>We can pass in many rows as a NumPy array and the model will predict an outcome for each.

In [46]:
example_clients = np.array([
    [111, 17851, 45, 11122.77], [132, 17321, 6, 2.42],
    [112, 11121, 1330, 546.29], [144, 13461, 1222, 88800],
    [161, 17544, 2, 7.79], [151, 17221, 5, 77],
])
example_clients = example_clients.reshape(len(example_clients), -1)
prediction = clf.predict(example_clients)
print(prediction)

[2 4 4 3 4 4]
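
For a larger batch it can be handy to keep each prediction next to the row it belongs to. A minimal sketch, assuming a toy classifier and a small pandas DataFrame whose column names are borrowed from this notebook purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy classifier (illustrative only), labelled with classes 2 and 4
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_toy = np.array([2, 2, 2, 4, 4, 4])
toy_clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Made-up batch of new rows to classify
batch = pd.DataFrame({"Quantity": [1.2, 8.7, 2.1],
                      "UnitPrice": [1.4, 8.8, 1.0]})

# Predict on the feature columns and store the result alongside them
features = batch[["Quantity", "UnitPrice"]].to_numpy()
batch["PredictedClass"] = toy_clf.predict(features)
print(batch)
```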
