---
# Machine Learning Methods: K-Neighbors Classification
---

We will use scikit-learn for classification in the following example. Scikit-learn is the go-to package for machine learning in Python. It is built on top of the other packages we've discussed (e.g. numpy, SciPy, matplotlib).

For this module, we'll use the familiar [Iris Dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) and attempt to classify flowers' species based on their petal and sepal sizes.

In [1]:
from sklearn import datasets
iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch

For a little more information about the dataset, print the `DESCR` attribute:

In [2]:
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

Since the dataset is a `Bunch` object, rather than a normal dataframe or numpy array, its dependent and independent variables are stored in separate attributes of the object, like this:

In [3]:
X = iris.data
Y = iris.target

print(type(X))
print(type(Y))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
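
Beyond `data` and `target`, the `Bunch` also carries human-readable names. The following is a minimal sketch of inspecting the array shapes and those names; it assumes only the `feature_names` and `target_names` attributes that `load_iris()` provides.

```python
from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# One row per flower, one column per measurement
print(X.shape)
# One integer label per flower
print(Y.shape)

# Names for the four columns of X and the three label values in Y
print(iris.feature_names)
print(iris.target_names)
```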


Viewed in a more user-friendly format, the data looks like this:

In [4]:
import pandas as pd
pd.DataFrame(iris.data).head(10)

     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
5  5.4  3.9  1.7  0.4
6  4.6  3.4  1.4  0.3
7  5.0  3.4  1.5  0.2
8  4.4  2.9  1.4  0.2
9  4.9  3.1  1.5  0.1
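
The integer column headers are not very informative. As a small sketch, the dataframe can be given readable column names by passing the dataset's own `feature_names` list; the variable name `df` is just an illustrative choice.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Use the feature names shipped with the dataset as column labels
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head(10))
```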


Looking back to the description above, we see that the columns correspond to the sepal length, sepal width, petal length, and petal width, in that order. The `target` attribute, then, contains the types of flowers:

In [5]:
Y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Again, from the description, we glean that these three values correspond to Setosa, Versicolour, and Virginica flowers respectively. 
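
Rather than reading the mapping off the description, it can also be looked up programmatically. This sketch indexes `iris.target_names` with the integer labels to get one species name per flower, then counts them; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Index the array of names with the integer labels: 0 -> first name, 1 -> second, ...
species = iris.target_names[iris.target]

# Count how many flowers belong to each species
labels, counts = np.unique(species, return_counts=True)
print({str(label): int(count) for label, count in zip(labels, counts)})
```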

With an understanding of the dataset, we're ready to use scikit-learn to train a basic [K-Nearest Neighbor classification model](http://scikit-learn.org/stable/modules/neighbors.html).

Before we do, though, we have to split the data into a **training set** and a **test set**.

In [6]:
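# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split

# Split X and Y into four arrays: training and test features, training and test labels
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)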
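
By default the split is random, so results change from run to run. As an optional sketch, the split can be made repeatable and class-balanced using the `random_state` and `stratify` parameters of `train_test_split`; the seed value `0` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# random_state fixes the shuffle; stratify=Y keeps the class proportions
# the same in the training and test halves
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=0, stratify=Y
)

print(X_train.shape, X_test.shape)
# Number of flowers of each class in each half
print(np.bincount(Y_train), np.bincount(Y_test))
```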



One of the simplest classification models available, the K-Nearest Neighbor algorithm takes continuous variables as inputs to train the model, so the Iris dataset is an ideal input. 

For each new test value, the algorithm finds the _k_ datapoints in the training set that are closest to the test value in [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), and classifies the test value as whatever class the majority of those neighbors belong to. With _k_ = 1 this reduces to copying the label of the single nearest neighbor; scikit-learn uses _k_ = 5 by default.
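
To make the idea concrete, here is a hand-rolled sketch of the nearest-neighbor rule in plain numpy. It is not how scikit-learn implements it; the function name `knn_predict` and the toy data are made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every row of X_train
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of points
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # → 0
print(knn_predict(X_toy, y_toy, np.array([4.9, 5.0]), k=3))  # → 1
```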

It's extremely simple, but surprisingly accurate. Let's test it - first we train the model:

In [7]:
from sklearn.neighbors import KNeighborsClassifier

my_classifier = KNeighborsClassifier()

my_classifier.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
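
The printed parameters show the defaults in use: `n_neighbors=5`, and the Minkowski metric with `p=2`, which is ordinary Euclidean distance. As a rough sketch of how the number of neighbors can be varied, the loop below fits the classifier with a few values of `n_neighbors`; the values tried and the fixed seed are arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0
)

accuracy_by_k = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, Y_train)
    # score() returns the mean accuracy on the given test data
    accuracy_by_k[k] = model.score(X_test, Y_test)
    print(k, accuracy_by_k[k])
```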

Then, we use it to predict the classifications (flower species) of the test data:

In [8]:
predictions = my_classifier.predict(X_test)

And finally, we'll assess the accuracy:

In [9]:
from sklearn.metrics import accuracy_score

print(accuracy_score(Y_test, predictions))

0.96


Around 96% accurate in this run (the exact figure depends on how your train/test data were partitioned, so it will be slightly different every time).

Not bad, especially considering that it only required about five lines of code.
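
To get a feel for how much the accuracy moves with the partition, the following sketch repeats the split-train-score cycle over several seeds and reports the range. The number of repetitions (10) and the use of `random_state` as the seed are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

scores = []
for seed in range(10):
    # A different seed gives a different random partition
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.5, random_state=seed
    )
    model = KNeighborsClassifier().fit(X_train, Y_train)
    scores.append(accuracy_score(Y_test, model.predict(X_test)))

print("lowest:", min(scores), "highest:", max(scores))
```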

Training other classification models with scikit-learn is just as easy. Another widely used model is the [decision tree](https://en.wikipedia.org/wiki/Decision_tree_learning), which builds a "tree" of logical if/else rules to classify observations.

To use this classifier, we would repeat the exact same steps:

In [10]:
from sklearn.tree import DecisionTreeClassifier
#Let's reshuffle our train/test data:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.5)

my_decision_tree = DecisionTreeClassifier()
my_decision_tree.fit(X_train, Y_train)

tree_predictions = my_decision_tree.predict(X_test)

print(accuracy_score(Y_test, tree_predictions))

0.946666666667
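
To see the rules a decision tree learns, scikit-learn can print a fitted tree as text. This is a small sketch using `export_text` from `sklearn.tree`; the depth limit of 2 and the fixed seed are arbitrary choices to keep the output short, and the tree is fitted on the full dataset purely for display.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()

# A shallow tree keeps the printed rules readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the learned if/else splits with the dataset's feature names
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```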


Broadly speaking, algorithms like K-Nearest Neighbor are better at classifying based on _continuous_ data, whereas algorithms like decision trees are better at classifying based on _categorical_ data. But clearly both work well on this dataset.
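
A single train/test split gives a noisy comparison. One common alternative is cross-validation, sketched below with `cross_val_score`; the choice of 5 folds and the dictionary of models are illustrative assumptions, not part of the original walkthrough.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

models = {
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

mean_scores = {}
for name, model in models.items():
    # Train and score on 5 different folds, then average the accuracies
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_scores[name] = scores.mean()
    print(name, round(mean_scores[name], 3))
```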

This course won't go into depth on the differences or statistical underpinnings of each algorithm, but learning about them in greater detail will help you better apply these powerful techniques to client work.

### Other Useful Resources
- [Google Classification Tutorial](https://www.youtube.com/watch?v=AoeEHqVSNOw&t=21s)
- [Google Decision Tree Series](https://www.youtube.com/watch?v=tNa99PG8hR8)
- Udemy: Machine Learning in Python