<a href="https://www.bigdatauniversity.com"><img src="https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width="400" align="center"></a>
# <center>K-Nearest Neighbors</center>


In this Lab you will load the Skulls dataset, fit the data, and use K-Nearest Neighbors to predict a data point. But what is **K-Nearest Neighbors**?

**K-Nearest Neighbors** is an algorithm for supervised learning, in which the model is 'trained' on data points that are labeled with their classification. When a new point is to be predicted, the algorithm looks at the 'K' nearest training points to determine its classification.

### Here's a visualization of the K-Nearest Neighbors algorithm.

<img src = "https://ibm.box.com/shared/static/mgkn92xck0z05v7yjq8pqziukxvc2461.png">

In this case, we have data points of Class A and B. We want to predict what the star (test data point) is. If we consider a k value of 3 (3 nearest data points) we will obtain a prediction of Class B. Yet if we consider a k value of 6, we will obtain a prediction of Class A.

In this sense, it is important to consider the value of k. But hopefully from this diagram, you should get a sense of what the K-Nearest Neighbors algorithm is. It considers the 'K' Nearest Neighbors (points) when it predicts the classification of the test point.
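The effect of k described above can be sketched in a few lines of NumPy. This is only an illustration of the majority-vote idea, not how scikit-learn implements it; the function name `knn_predict` and the point coordinates are hypothetical, chosen so that the three nearest points favor Class B and the six nearest favor Class A, as in the diagram.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical layout (not taken from the figure's actual coordinates)
train_points = np.array([
    [1.0, 0.0], [0.0, 1.2],               # Class B, close to the star
    [1.5, 0.0],                           # Class A, close to the star
    [2.0, 0.0], [0.0, 2.2], [2.0, 2.0],   # Class A, a bit farther away
    [5.0, 5.0], [6.0, 5.0],               # Class B, far away
])
train_labels = np.array(["B", "B", "A", "A", "A", "A", "B", "B"])
star = np.array([0.0, 0.0])

pred_k3 = knn_predict(train_points, train_labels, star, k=3)
pred_k6 = knn_predict(train_points, train_labels, star, k=6)
print(pred_k3, pred_k6)
```

With k=3 the vote is two B against one A; with k=6 it is four A against two B, so the prediction flips.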

---
## <u>Train/Test Split on the Skulls Dataset with K-Nearest Neighbors</u>

### Import Libraries
Import the Following Libraries:
<ul>
    <li> numpy (as np) </li>
    <li> pandas </li>
    <li> KNeighborsClassifier from sklearn.neighbors </li>
</ul>

In [1]:
import numpy as np 
import pandas 
from sklearn.neighbors import KNeighborsClassifier

Next, a little information about the dataset. We are using a dataset called skulls.csv, which contains the measurements made on Egyptian skulls from five epochs.

<img src="https://ibm.box.com/shared/static/02z8krlr99hwrqa2ecx3ycuiwqkcuzjv.png" align="left">



<b>epoch</b> - The epoch the skull was assigned to, a factor with levels c4000BC, c3300BC, c1850BC, c200BC, and cAD150, where the years are only given approximately.

<b>mb</b> - Maximal breadth of the skull.

<b>bh</b> - Basibregmatic height of the skull.

<b>bl</b> - Basialveolar length of the skull.

<b>nh</b> - Nasal height of the skull.

---

Using my_data as the <b>skulls.csv</b> data read by pandas, declare variables <b>X</b> as the <b>Feature Matrix</b> (<i>data of my_data</i>) and <b>y</b> as the <b>response vector</b> (<i>target</i>)<br>

<i>Note: Use the <b>target</b> function for the <b>response vector</b> and the <b>removeColumns</b> function for the <b>Feature Matrix</b> </i>
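The cell that reads the file into my_data is not shown in this notebook. A minimal sketch of that step follows, assuming a comma-separated file whose columns are a row number, <b>epoch</b>, <b>mb</b>, <b>bh</b>, <b>bl</b>, and <b>nh</b>. The rows are made-up values for illustration, and `demo_data` is a hypothetical name used so that the real my_data is not overwritten.

```python
import io
import pandas

# Stand-in for the file on disk; with the real dataset this would be something like
# pandas.read_csv("skulls.csv"), where the path is an assumption.
csv_text = io.StringIO(
    '"",epoch,mb,bh,bl,nh\n'
    "1,c4000BC,131,138,89,49\n"
    "2,c4000BC,125,131,92,48\n"
    "3,c3300BC,124,138,101,48\n"
    "4,cAD150,137,123,91,50\n"
)

demo_data = pandas.read_csv(csv_text)

# Column 0 holds the row number and column 1 holds the epoch label,
# which is why those two positions are treated specially later on.
print(demo_data.columns.tolist())
print(demo_data.shape)
```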

In [4]:
def target(numpyArray, targetColumnIndex):
    # Map each distinct label in the target column to an integer code (0, 1, 2, ...),
    # assigned in order of first appearance.
    values = np.asarray(numpyArray)  # works for a pandas DataFrame or a numpy array
    target_dict = dict()
    target = list()
    count = -1
    for row in values:
        label = row[targetColumnIndex]
        if label not in target_dict:
            count += 1
            target_dict[label] = count
        target.append(target_dict[label])
    return np.asarray(target)

In [5]:
# Remove the column containing the target name since it doesn't contain numeric values.
# Also remove the column that contains the row number.
# axis=1 means we are removing columns instead of rows.
# Function takes in a pandas DataFrame and column numbers and returns a numpy array
# without the stated columns.
def removeColumns(pandasArray, *column):
    # column is a tuple of positions, e.g. (0, 1); convert it to a flat list for indexing
    return pandasArray.drop(pandasArray.columns[list(column)], axis=1).values

Now we can prepare our data:

In [8]:
X = removeColumns(my_data, 0, 1)
y = target(my_data, 1)
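For comparison, pandas offers built-in operations that cover the same two steps. The sketch below is an alternative, not the lab's method: `pandas.factorize` assigns integer codes in order of first appearance, and `DataFrame.drop` removes columns by name. The DataFrame `toy` and its values are made up for illustration and are not rows from skulls.csv.

```python
import pandas

# Made-up rows for illustration only
toy = pandas.DataFrame({
    "row": [1, 2, 3, 4],
    "epoch": ["c4000BC", "c4000BC", "c3300BC", "cAD150"],
    "mb": [131, 125, 124, 137],
    "bh": [138, 131, 138, 123],
    "bl": [89, 92, 101, 91],
    "nh": [49, 48, 48, 50],
})

# Integer code per row, plus the distinct labels in order of first appearance
codes, labels = pandas.factorize(toy["epoch"])

# Drop the non-feature columns by name and take the underlying numpy array
features = toy.drop(columns=["row", "epoch"]).values

print(codes)           # one integer per row
print(list(labels))    # the label each integer stands for
print(features.shape)  # (rows, feature columns)
```

Keeping `labels` around makes it possible to translate a predicted integer back into an epoch name.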

Now to perform <b>train/test split</b> we have to split the <b>X</b> and <b>y</b> into two different sets: The <b>training</b> and <b>testing</b> set. Luckily there is a sklearn function for just that!

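Import <b>train_test_split</b> from <b>sklearn.model_selection</b> (older versions of scikit-learn exposed it under <b>sklearn.cross_validation</b>, which has since been removed).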

In [12]:
from sklearn.model_selection import train_test_split

Now <b>train_test_split</b> will return <b>4</b> different values. We will name these <b>X_trainset</b>, <b>X_testset</b>, <b>y_trainset</b>, <b>y_testset</b>. The <b>train_test_split</b> function needs the parameters <b>X</b>, <b>y</b>, <b>test_size=0.3</b>, and <b>random_state=7</b>. <b>X</b> and <b>y</b> are the arrays to be split, <b>test_size</b> is the fraction of the data held out for testing, and <b>random_state</b> fixes the random seed so that we obtain the same split on every run.

In [13]:
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=7)
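To see what <b>test_size</b> and <b>random_state</b> do in isolation, here is a small sketch on placeholder arrays. The arrays `X_demo` and `y_demo` are hypothetical stand-ins with 150 rows and 4 features, a size assumed to be consistent with the 45-row test set reported later in this lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 150 rows, 4 features, 5 classes of 30 rows each (an assumption)
X_demo = np.arange(600).reshape(150, 4)
y_demo = np.repeat(np.arange(5), 30)

# Two calls with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (105, 4) (45, 4)

# Check that both calls produced identical train and test arrays
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Changing or omitting <b>random_state</b> would shuffle the rows differently, so accuracy figures could vary from run to run.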

### Practice
Now, as practice, print the shapes of the training sets to check that their row counts match.

In [15]:
# write your code here



Double click __here__ to see the solution

<!-- Your answer is below:
    
print (X_trainset.shape)    
print (y_trainset.shape )  

-->

Let's check the same with the testing sets! They should both match up!

In [16]:
print (X_testset.shape)
print (y_testset.shape)

(45, 4)
(45,)


### Modeling
Now, let's create an instance of <b>KNeighborsClassifier</b>. 
<b>neigh</b>   -> <b>n_neighbors = 1</b> <br>

In [26]:
neigh = KNeighborsClassifier(n_neighbors = 1)

Now we will fit the instance of <b>KNeighborsClassifier</b> with the <b>X_trainset</b> and <b>y_trainset</b>

In [27]:
neigh.fit(X_trainset, y_trainset)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

### Prediction
You can predict <b>multiple</b> data points at once. To do this, pass the <b>X_testset</b>, which contains multiple test points, to the <b>predict</b> function of <b>KNeighborsClassifier</b>.

Let's pass <b>X_testset</b> to the <b>predict</b> function of <b>neigh</b> and store its returned value in <b>pred</b>.


In [29]:
pred = neigh.predict(X_testset)

### Evaluation
Awesome! Now let's compute neigh's <b>prediction accuracy</b>. We can do this by using the <b>metrics.accuracy_score</b> function

In [21]:
from sklearn import metrics
print("Neigh's Accuracy: ", metrics.accuracy_score(y_testset, pred))

Neigh's Accuracy:  0.2222222222222222
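Accuracy here is simply the fraction of test points whose predicted label equals the true label. The sketch below checks this by hand on made-up label arrays (`y_true` and `y_pred` are hypothetical and unrelated to the skulls data); 2 correct out of 9 gives the same ratio as 10 correct out of 45.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration: only positions 0 and 2 agree
y_true = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3])
y_pred = np.array([0, 2, 2, 1, 0, 3, 0, 1, 0])

# Fraction of positions where prediction and truth agree
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = metrics.accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```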


Interesting! Let's do the same for the other instances of KNeighborsClassifier.

### Different K ?

As in the last part, let's create instances of <b>KNeighborsClassifier</b>, except this time we will create <b>2</b> different ones:<br>
<b>neigh23</b> -> <b>n_neighbors = 23</b> <br>
<b>neigh90</b> -> <b>n_neighbors = 90</b> <br>

In [None]:
neigh23 = KNeighborsClassifier(n_neighbors = 23)
neigh90 = KNeighborsClassifier(n_neighbors = 90)

Now we will fit each instance of <b>KNeighborsClassifier</b> with the <b>X_trainset</b> and <b>y_trainset</b>

In [28]:
neigh23.fit(X_trainset, y_trainset)
neigh90.fit(X_trainset, y_trainset)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=90, p=2,
           weights='uniform')

Let's pass <b>X_testset</b> to the <b>predict</b> function of each instance of <b>KNeighborsClassifier</b> and store the returned values in <b>pred23</b> and <b>pred90</b> (corresponding to each of their names)


In [30]:
pred23 = neigh23.predict(X_testset)
pred90 = neigh90.predict(X_testset)

Now let's compute each model's <b>prediction accuracy</b>

In [23]:
print("Neigh23's Accuracy: ", metrics.accuracy_score(y_testset, pred23))
print("Neigh90's Accuracy: ", metrics.accuracy_score(y_testset, pred90))

Neigh23's Accuracy:  0.24444444444444444
Neigh90's Accuracy:  0.13333333333333333


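As shown, <b>neigh23</b> has the highest accuracy of the three. The usual explanation is that with <b>n_neighbors = 1</b> the model tends to <b>overfit</b> the training data (<i>too specific</i>), while with <b>n_neighbors = 90</b> it tends to <b>underfit</b> (<i>too generalized</i>), because each prediction is a vote over most of the training set. A value in between, such as <b>n_neighbors = 23</b>, strikes a better balance between <b>Bias</b> and <b>Variance</b>. Keep in mind, though, that the gap here is small: on a 45-point test set, 0.244 versus 0.222 is a difference of one correct prediction, and with five epochs all three scores sit close to the 20% expected from random guessing.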
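Comparing several values of k is easier with a loop than with one variable per model. The following is a sketch on synthetic data, not on the skulls dataset: `make_blobs` generates five hypothetical clusters, and the dictionary name `accuracy_by_k` and the extra value k=5 are illustrative choices. The numbers it prints will differ from the ones reported in this lab.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Synthetic stand-in data: 150 points, 4 features, 5 overlapping classes
X_demo, y_demo = make_blobs(n_samples=150, n_features=4, centers=5,
                            cluster_std=3.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=7)

# Fit one classifier per k and record its accuracy on the held-out set
accuracy_by_k = {}
for k in (1, 5, 23, 90):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    accuracy_by_k[k] = metrics.accuracy_score(y_te, model.predict(X_te))

for k, acc in accuracy_by_k.items():
    print(f"k={k:>2}: accuracy={acc:.3f}")
```

Note that k cannot exceed the number of training points (105 here), and that a single train/test split gives a noisy estimate, so small differences between values of k should be read with caution.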

## Want to learn more?

IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course: [SPSS Modeler](http://cocl.us/ML0101EN-SPSSModeler).

Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at [Watson Studio](https://cocl.us/ML0101EN_DSX)

### Thanks for completing this lesson!

Notebook created by: <a href = "https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>

<hr>

Copyright &copy; 2018 [Cognitive Class](https://cocl.us/DX0108EN_CC). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).