<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

#  Classifiers & K-Nearest Neighbors with `scikit-learn`

_Authors: Alex Sherman (DC), Steven Longstreet (DC)_

<a id="learning-objectives"></a>
## Learning Objectives

1. Utilize the KNN model on the iris data set.
2. Implement scikit-learn's KNN model.
3. Assess the fit of a KNN Model using scikit-learn.
4. Optimize KNN using GridSearchCV.

### Lesson Guide
- [Learning Objectives](#learning-objectives)
- [Loading the Iris Data Set](#overview-of-the-iris-dataset)
	- [Terminology](#terminology)
- [Exercise: "Human Learning" With Iris Data](#exercise-human-learning-with-iris-data)
- [Human Learning on the Iris Data Set](#human-learning-on-the-iris-dataset)
- [K-Nearest Neighbors (KNN) Classification](#k-nearest-neighbors-knn-classification)
	- [Using the Train/Test Split Procedure (K=1)](#using-the-traintest-split-procedure-k)
- [Tuning a KNN Model](#tuning-a-knn-model)
	- [What Happens If We View the Accuracy of our Training Data?](#what-happen-if-we-view-the-accuracy-of-our-training-data)
	- [Training Error Versus Testing Error](#training-error-versus-testing-error)
- [Standardizing Features](#standardizing-features)
	- [Use `StandardScaler` to Standardize our Data](#use-standardscaler-to-standardize-our-data)
- [Comparing KNN With Other Models](#comparing-knn-with-other-models)

## Welcome to Classifiers!

Previously we looked at linear models to predict **continuous numbers** using regression. What if we need to predict for **discrete values**?

We have a few options:

1. Build a classifier ourselves using a basic heuristic
2. Write a complex classifier
3. Leverage a Machine Learning Classifier

Our approach may surprise you. Google's [Rules of Machine Learning](https://developers.google.com/machine-learning/rules-of-ml/) lists a few steps to consider before implementing a machine learning algorithm. It is often better to start with a simple heuristic to solve a goal, understand your data, or serve as a baseline. Eventually the heuristic becomes too complex and hard to maintain, so, abiding by **rule three**, we move on to machine learning and choose the right algorithm. But which model should we pick? scikit-learn offers quite a few:

* [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
* [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)
* [BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)
* [CalibratedClassifierCV](http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html)
* [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* [ExtraTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html)
* [ExtraTreesClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
* [GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
* [GaussianProcessClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html)
* [GradientBoostingClassifier](http://scikit-learn.org/0.15/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
* [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* [LabelPropagation](http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html)
* [LabelSpreading](http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelSpreading.html)
* [LinearDiscriminantAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)
* [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
* [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [LogisticRegressionCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
* [MLPClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)
* [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
* [NearestCentroid](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html)
* [NuSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html)
* [PassiveAggressiveClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveClassifier.html)
* [Perceptron](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html)
* [QuadraticDiscriminantAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html)
* [RadiusNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsClassifier.html)
* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [RidgeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html)
* [RidgeClassifierCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifierCV.html)
* [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)
* [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

Whew! That's a lot of classifiers, and we're only using a subset of them in this class. You may still want to review the full list if your capstone project uses a classification algorithm. Here are a few terms to get you started in understanding the models. As your data science lexicon grows, you should be able to return to this list and have a high-level understanding of what each classifier does from its name alone!

* NB - Uses Naive Bayes
* CV - Uses cross-validation
* SV - Uses support vectors (Support Vector Classification)

#### Models often used in the Data Science Part Time Series
* [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* [GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
* [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [LogisticRegressionCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
* [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
* [NearestCentroid](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html)
* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [RidgeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html)
* [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

## What about today?

Using your brilliant data science minds I believe your powers of deduction have helped you answer this question.

In this lesson, we will get an intuitive and practical feel for the **k-Nearest Neighbors** model. kNN is a **non-parametric model**, which means that it does not make any assumptions about the underlying data distribution. As a result, the model is not represented as an equation with parameters (e.g. the $\beta$ values in linear regression).

We'll introduce it shortly, but first let's go back to Google's Rules of Machine Learning.

>**Before Machine Learning - Rule #2 - First, design and implement metrics**
> As you work to understand your data, a simple heuristic is easier to implement.

>**Rule #3: Choose machine learning over a complex heuristic**
> Once you have a basic understanding, move on to machine learning.

>```A simple heuristic can get your product out the door. A complex heuristic is unmaintainable. Once you have data and a basic idea of what you are trying to accomplish, move on to machine learning. As in most software engineering tasks, you will want to be constantly updating your approach, whether it is a heuristic or a machine-learned model, and you will find that the machine-learned model is easier to update and maintain - Google Machine Learning Rules```

First, we will make a model by hand to classify iris flower data. Next, we will build a model automatically using kNN.

> You may have heard of the clustering algorithm **k-Means Clustering**. These techniques have nothing in common, aside from both having a parameter k!

<a id="overview-of-the-iris-dataset"></a>
## Loading the Iris Data Set
---

#### Read the iris data into a pandas DataFrame, including column names.

In [None]:
# Read the iris data into a DataFrame.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Display plots in-notebook
%matplotlib inline

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

data = 'data/iris.data'
iris = pd.read_csv(data)

In [None]:
iris.head(30)


<a id="terminology"></a>
### Terminology

- **150 observations** (n=150): Each observation is one iris flower.
- **Four features** (p=4): sepal length, sepal width, petal length, and petal width.
- **Response**: One of three possible iris species (setosa, versicolor, or virginica)
- **Classification problem** because response is categorical.

### Sometimes a picture helps us understand the problem

Because plants **will** vary, telling species apart from an image alone seems challenging. Luckily, biology labs have an endless supply of high school, college, and master's students, mixed in with PhD candidates, who spend their days measuring plants in a field. DNA sequencing for flowers is expensive, so let's see if we can help them analyze their data.

![15NN classification map](./assets/iris-machinelearning.png)



<a id="exercise-human-learning-with-iris-data"></a>
## Guided Practice: "Human Learning" With Iris Data

**Question:** Can we predict the species of an iris using petal and sepal measurements? Together, we will:

1. Read the iris data into a Pandas DataFrame, including column names.
2. Gather some basic information about the data.
3. Use sorting, split-apply-combine, and/or visualization to look for differences between species.
4. Write down a set of rules that could be used to predict species based on iris measurements.

**BONUS:** Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data and check the accuracy of your predictions.

#### Gather some basic information about the data.

In [None]:
# 150 observations, 5 columns (the 4 features & response)
iris.shape

In [None]:
# What kinds of data are we working with?
iris.dtypes

In [None]:
# Verify the basic stats look appropriate
iris.describe()

In [None]:
# Check for imbalanced classes (hint: see what happens when you set the normalize parameter to True).
iris.species.value_counts()

In [None]:
# Verify we are not missing any data
iris.isnull().sum()

#### Use sorting, split-apply-combine, and/or visualization to look for differences between species.

In [None]:
iris.head()

In [None]:
# Sort the DataFrame by petal_width.
iris.sort_values(by='petal_width', ascending=True, inplace=True)
iris.head(50)

#### Split-apply-combine: Explore the data while using a `groupby` on `'species'`.

In [None]:
# Mean of sepal_length, grouped by species.
iris.groupby(by='species')['sepal_length'].mean()

In [None]:
# Mean of all numeric columns, grouped by species.
iris.groupby('species').mean()

In [None]:
# describe() of all numeric columns, grouped by species.
iris.groupby('species').describe()

In [None]:
# Box plot of petal_width, grouped by species.
iris.boxplot(column=['petal_width', 'petal_length'], by='species', figsize=(10,8));

In [None]:
# Box plot of all numeric columns, grouped by species.
# This is the first time we've used `rot`. Remove it and see if you can tell what it does.
iris.boxplot(by='species', rot=45, figsize=(10,8));

In [None]:
# Map species to a numeric value so that plots can be colored by species.
iris['species_num'] = iris.species.map({'Iris-setosa':0,
                                        'Iris-versicolor':1,
                                        'Iris-virginica':2})

# Alternative method:
#iris['species_num'] = iris.species.factorize()[0]

In [None]:
iris.head()

In [None]:
# Scatterplot of petal_length vs. petal_width, colored by species
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap='brg');

In [None]:
# Scatter matrix of all features, colored by species.
pd.plotting.scatter_matrix(iris.drop('species_num', axis=1), c=iris.species_num, figsize=(12, 10));

#### Class Exercise: Using the graphs above, can you write down a set of rules that can accurately predict species based on iris measurements?

In [None]:
# Feel free to do more analysis if needed to make good rules!

#### Bonus: If you have time during the class break or after class, try to implement these rules to make your own classifier!

Write a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data and check the accuracy of your predictions.

In [None]:
def predict_flower(df):
    preds = ['Iris-setosa'] * len(df)   # temporary!
    
    # for each row of df, make a prediction

    # add a column to the DataFrame with the predictions
    df['prediction'] = preds
    
    
predict_flower(iris)

In [None]:
iris.head()

In [None]:
# Let's see what fraction your manual classifier gets correct!
# 0.3333 means 1/3 are classified correctly.

(iris.species == iris.prediction).mean()

<a id="human-learning-on-the-iris-dataset"></a>
## Human Learning on the Iris Data Set
---

How did we (as humans) predict the species of an iris?

1. We observed that the different species had (somewhat) dissimilar measurements.
2. We focused on features that seemed to correlate with the response.
3. We created a set of rules (using those features) to predict the species of an unknown iris.

We assumed that if an **unknown iris** had measurements similar to **previous irises**, then its species was most likely the same as those previous irises.
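
A set of hand-written rules like this can be captured in a small function. The sketch below is only an illustration: the function name `predict_species` is hypothetical, and the thresholds (2.5 for petal length, 1.7 for petal width) are assumptions picked by eye from plots like the ones above, not values given in this lesson.

```python
def predict_species(row):
    """Predict an iris species from a dict-like row of measurements.

    The cut-offs are illustrative assumptions, not tuned values.
    """
    if row['petal_length'] < 2.5:      # setosa petals are much shorter
        return 'Iris-setosa'
    if row['petal_width'] < 1.7:       # versicolor petals tend to be narrower
        return 'Iris-versicolor'
    return 'Iris-virginica'


# Three made-up flowers, one per rule.
flowers = [
    {'petal_length': 1.4, 'petal_width': 0.2},
    {'petal_length': 4.5, 'petal_width': 1.4},
    {'petal_length': 5.8, 'petal_width': 2.2},
]
predictions = [predict_species(f) for f in flowers]
print(predictions)
```

Because the function only indexes the row by column name, it should also work row by row on a DataFrame, for example with `iris.apply(predict_species, axis=1)`.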

In [None]:
# Allow plots to appear in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14

# Create a custom color map.
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

<a id="k-nearest-neighbors-knn-classification"></a>
## Introducing K-Nearest Neighbors (KNN) Classification
---

K-nearest neighbors classification is (as its name implies) a classification model that uses the "K" most similar observations in order to make a prediction.

KNN is a supervised learning method; therefore, the training data must have known target values.


The process of prediction using KNN is fairly straightforward:

1. Pick a value for K.
2. Search for the K observations in the data that are "nearest" to the measurements of the unknown iris.
    - Euclidean distance is often used as the distance metric, but other metrics are allowed.
3. Use the most popular response value from the K "nearest neighbors" as the predicted response value for the unknown iris.
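
The three steps above can be sketched in plain Python. This is a minimal illustration rather than how scikit-learn implements KNN: the function name `knn_predict` and the measurements are made up for the example, and `math.dist` assumes Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest (distance, label) pairs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most common label among those neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up (petal_length, petal_width) measurements for illustration only.
train_points = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (5.9, 2.1), (6.1, 2.3)]
train_labels = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']

# Step 1: pick a value for K.
prediction = knn_predict(train_points, train_labels, (4.6, 1.3), k=3)
print(prediction)
```

Note that there is no separate training step: all of the work happens at prediction time, when distances to the stored observations are computed.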



### Understanding the Model

Conceptually, KNN is very simple. Given a dataset for which class labels are known, you want to predict the class of a new data point.

The strategy is to compare the new observation to those observations already labeled. The predicted class will be based on the known classes of the nearest k neighbors (i.e. based on the class labels of the other data points most similar to the one you're trying to predict).

#### Example of KNN in practice
Imagine a bunch of widgets which belong to either the "Blue" or the "Red" class. Each widget has 2 variables associated with it: **x** and **y**. We can plot our widgets in 2D.

![Basic Graph](./assets/knn_reds_and_blues.png)


#### Adding a new point
Now let's say I get another widget (**black dot**) that also has **x** and **y** variables. How can we tell whether this widget is blue or red?

![New Point](./assets/knn_new_point.png)

#### Visualizing KNN in action
The KNN approach to classification calls for comparing this new point to the other nearby points. If we were using KNN with 3 neighbors, we'd grab the 3 nearest dots to our black dot and look at the colors. The nearest dots would then "vote", with the more predominant color being the color we'll assign to our new black dot

![New Point](./assets/knn_new_point_with_circle.png)

#### Presto - Chango: We get a red dot!

![New Point](./assets/knn_new_point_pred.png)


#### Let's try a prediction with 5 neighbors

If we had chosen 5 neighbors instead of 3 neighbors, things would have turned out differently. Looking at the plot below, we can see that the vote would tally blue: 3, red: 2. So we would classify our new dot as blue.

##### New Circle
![New Point](./assets/knn_5_circle.png)

##### New Color
![New Point](./assets/knn_new_point_pred_blue.png)

[source](http://blog.yhat.com/posts/classification-using-knn-and-python.html)
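
The same flip from red to blue can be reproduced with scikit-learn's `KNeighborsClassifier`. The coordinates below are invented so that the two nearest widgets are red and the next three are blue; they are not the points from the figures.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up widget coordinates: two reds sit closest to the origin,
# three blues sit slightly farther out, and one red is far away.
X_widgets = [[0.5, 0.0], [0.0, 0.6],                 # red
             [0.7, 0.0], [0.0, -0.8], [-0.9, 0.0],   # blue
             [5.0, 5.0]]                             # red (distant)
y_widgets = ['red', 'red', 'blue', 'blue', 'blue', 'red']

new_widget = [[0.0, 0.0]]

pred_k3 = KNeighborsClassifier(n_neighbors=3).fit(X_widgets, y_widgets).predict(new_widget)[0]
pred_k5 = KNeighborsClassifier(n_neighbors=5).fit(X_widgets, y_widgets).predict(new_widget)[0]

print(pred_k3)  # vote among 3 neighbors: 2 red, 1 blue
print(pred_k5)  # vote among 5 neighbors: 3 blue, 2 red
```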


#### How does KNN make its choice? It's democratic

![New Point](./assets/knn2.jpg)

[source](https://www.kdnuggets.com/2016/01/implementing-your-own-knn-using-python.html)

### Choosing the right value of k

KNN requires us to choose a k value. Logically, your next question is "how do we choose k?" Brilliant Question!

It turns out that choosing the right number of neighbors matters. A lot. But, choosing the right k is as challenging as it is important.

>...choosing the best value for k is ["not easy but laborious"](https://books.google.com/books?id=Vhw8KlcYaxQC&pg=PA128&lpg=PA128&dq=%22Choosing+the+best+value+for+k+is+not+easy+but+laborious%22&source=bl&ots=UtjE33ajQT&sig=eAAJdj0zazrJatCHy7ZC3-r03WI&hl=en&sa=X&ei=fFXxUfKPFLT_4APbjoHICg&ved=0CCoQ6AEwAA#v=onepage&q=%22Choosing%20the%20best%20value%20for%20k%20is%20not%20easy%20but%20laborious%22&f=false)

> Phillips and Lee, Mining Positive Associations of Urban Criminal Activities

With KNN our goal is to maximize the accuracy of the prediction. 
- Part of that is understanding what we can glean from our dataset. 
- Part is understanding what error can tell us. 
- Part is starting with some good understandings of k.

Generally it's good to try an odd number for k to start out. This helps avoid situations where your classifier "ties" as a result of having the same number of votes for two different classes. This is particularly true if your dataset has only two classes (e.g. if k=4 and an observation has nearest neighbors ['blue', 'blue', 'red', 'red'], you've got a tie on your hands).
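
To make the tie scenario concrete, here is a small sketch that counts votes and reports whether the top two classes are level. The helper name `vote` is hypothetical, and how a real library breaks such ties is an implementation detail this sketch does not try to reproduce.

```python
from collections import Counter


def vote(neighbor_labels):
    """Return (winning_label, is_tie) for a list of neighbor labels."""
    counts = Counter(neighbor_labels).most_common()
    top_label, top_count = counts[0]
    # A tie exists when a second label has the same number of votes.
    is_tie = len(counts) > 1 and counts[1][1] == top_count
    return top_label, is_tie


even_k = vote(['blue', 'blue', 'red', 'red'])   # k=4, two classes
odd_k = vote(['blue', 'blue', 'red'])           # k=3, two classes
print(even_k)
print(odd_k)
```

With more than two classes, an odd k reduces ties but does not rule them out: three neighbors with three different labels still tie.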

### Let's look at it with our IRIS data

The visualizations below show how a given area can change in its prediction as K changes.

- Colored points represent true values and colored areas represent a **prediction space**. (For K=1, this partition of the plane is a Voronoi diagram.)
- Each prediction space is where the majority of the "K" nearest points are the color of the space.
- To predict the class of a new point, we guess the class corresponding to the color of the space it lies in.

<a id="knn-classification-map-for-iris-k"></a>
### KNN Classification Map for Iris (K=1)

![1NN classification map](./assets/iris_01nn_map.png)

### KNN Classification Map for Iris (K=5)

![5NN classification map](./assets/iris_05nn_map.png)

### KNN Classification Map for Iris (K=15)

![15NN classification map](./assets/iris_15nn_map.png)

<a id="knn-classification-map-for-iris-k"></a>
### KNN Classification Map for Iris (K=50)

![50NN classification map](./assets/iris_50nn_map.png)

We can see that, as K increases, the classification spaces' borders become more distinct. However, you can also see that the spaces are not perfectly pure when it comes to the known elements within them.

**How are outliers affected by K?** As K increases, outliers are "smoothed out". Look at the plots above and notice how outliers strongly affect the prediction space when K=1. When K=50, outliers no longer affect region boundaries. This is a classic bias-variance tradeoff -- with increasing K, the bias increases but the variance decreases.

**Question:** What's the "best" value for K in this case?

**Answer:** ...

## Guided Intro to KNN: NBA Position KNN Classifier

For the rest of the lesson, we will be using a dataset containing the 2015 season statistics for ~500 NBA players. This dataset leads to a nice choice of K, as we'll see below. The columns we'll use for features (and the target 'pos') are:


| Column | Meaning |
| ---    | ---     |
| pos | C: Center. F: Forward. G: Guard |
| ast | Assists per game | 
| stl | Steals per game | 
| blk | Blocks per game |
| tov | Turnovers per game | 
| pf  | Personal fouls per game | 

For information about the other columns, see [this glossary](https://www.basketball-reference.com/about/glossary.html).

In [None]:
# Read the NBA data into a DataFrame.
import pandas as pd

path = 'data/NBA_players_2015.csv'
nba = pd.read_csv(path, index_col=0)

In [None]:
# Map positions to numbers
nba['pos_num'] = nba.pos.map({'C':0, 'F':1, 'G':2})

In [None]:
# Create feature matrix (X).
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']
X = nba[feature_cols]

In [None]:
# Create response vector (y).
y = nba.pos_num

In [None]:
print(X.shape)
print(y.shape)
X.head()

<a id="using-the-traintest-split-procedure-k"></a>
### Using the Train/Test Split Procedure (K=1)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

#### Step 1: Split X and y into training and testing sets (using `random_state` for reproducibility).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

#### Step 2: Train the model on the training set (using K=1).

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

#### Step 3: Test the model on the testing set and check the accuracy.

In [None]:
y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
knn.score(X_test,y_test)

**Question:** If we had trained on the entire dataset and tested on the entire dataset, using 1-KNN what accuracy would we likely get? If the resulting accuracy is not this number, what must some data points look like?

**Answer:** ...

#### Repeating for K=50.

In [None]:
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

**Question:** Suppose we again train and test on the entire data set, but using 50-KNN. Would we expect the accuracy to be higher, lower, or the same as compared to 1-KNN?

**Answer:** ...

#### Comparing Testing Accuracy With Null Accuracy

Null accuracy is the accuracy that can be achieved by **always predicting the most frequent class**. For example, if most players are Centers, we would always predict Center.

The null accuracy is a benchmark against which you may want to measure every classification model.
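
As a quick illustration of the idea, the sketch below computes null accuracy by hand on made-up labels (the lists `y_train_example` and `y_test_example` are invented and are not the NBA data).

```python
from collections import Counter

# Made-up position labels for illustration (0=C, 1=F, 2=G).
y_train_example = [2, 2, 1, 2, 0, 1, 2, 1]
y_test_example = [2, 1, 2, 0]

# The "model": always predict the most frequent class seen in training.
most_frequent = Counter(y_train_example).most_common(1)[0][0]

# Null accuracy: how often that constant guess is right on the test labels.
null_accuracy = sum(label == most_frequent for label in y_test_example) / len(y_test_example)

print(most_frequent)
print(null_accuracy)
```

scikit-learn offers the same baseline as an estimator, `DummyClassifier(strategy='most_frequent')`, which can be convenient when you want to score it with the same code as your real model.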

#### Examine the class distribution from the training set.

Remember that we are comparing KNN to this simpler model. So, we must find the most frequent class **of the training set**.

**Reminder** - we mapped positions as {'C':0, 'F':1, 'G':2}

In [None]:
most_freq_class = y_train.value_counts().index[0]

print(y_train.value_counts())
most_freq_class

#### Compute null accuracy.

In [None]:
y_test.value_counts()[most_freq_class] / len(y_test)

<a id="tuning-a-knn-model"></a>
## Tuning a KNN Model
---

In [None]:
# Instantiate the model (using the value K=5).
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model with data.
knn.fit(X, y)

# Store the predicted response values.
y_pred_class = knn.predict(X)

**Question:** Which model produced the correct predictions for the two unknown irises?

**Answer:** ...

**Question:** Does that mean that we have to guess how well our models are likely to do?

**Answer:** ...

In [None]:
# Calculate predicted probabilities of class membership.
# Each row sums to one and contains the probabilities of the point being a 0-Center, 1-Forward, 2-Guard.
knn.predict_proba(X)

In [None]:
# How can we use these predictions?
# Attach the predicted class probabilities (columns 0-Center, 1-Forward, 2-Guard) to the features.

preds = pd.DataFrame(knn.predict_proba(X))
X_withpreds = pd.concat([X.reset_index(), preds.reset_index()], axis=1)
X_withpreds.head()

In [None]:
#Show the predicted class
preds = pd.DataFrame(knn.predict(X), columns=['pos'])
X_withpreds = pd.concat([X.reset_index(), preds.reset_index()], axis=1)
X_withpreds.head()

<a id="what-happen-if-we-view-the-accuracy-of-our-training-data"></a>
### What Happens If We View the Accuracy of our Training Data?

In [None]:
scores = []
for k in range(1,101):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X,y)
    pred = knn.predict(X)
    score = float(sum(pred == y)) / len(y)
    scores.append([k, score])

In [None]:
data = pd.DataFrame(scores,columns=['k','score'])
data.plot.line(x='k',y='score');

**Question:** As K increases, why does the accuracy fall?

**Answer:** ...

#### Search for the "best" value of K.

In [None]:
# Calculate TRAINING ERROR and TESTING ERROR for K=1 through 100.

k_range = list(range(1, 101))
training_error = []
testing_error = []

# Find test accuracy for all values of K between 1 and 100 (inclusive).
for k in k_range:

    # Instantiate the model with the current K value.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    
    # Calculate training error (error = 1 - accuracy) on the data the model was fit on.
    y_pred_class = knn.predict(X_train)
    training_accuracy = metrics.accuracy_score(y_train, y_pred_class)
    training_error.append(1 - training_accuracy)
    
    # Calculate testing error.
    y_pred_class = knn.predict(X_test)
    testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
    testing_error.append(1 - testing_accuracy)

In [None]:
# Allow plots to appear in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt
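# matplotlib 3.6 renamed the seaborn styles; fall back to the old name on earlier versions.
style = 'seaborn-v0_8-deep' if 'seaborn-v0_8-deep' in plt.style.available else 'seaborn-deep'
plt.style.use(style)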

In [None]:
# Create a DataFrame of K, training error, and testing error.
column_dict = {'K': k_range, 'training error':training_error, 'testing error':testing_error}
df = pd.DataFrame(column_dict).set_index('K').sort_index(ascending=True)
df.head()

In [None]:
# Plot the relationship between K (LOW TO HIGH) and TESTING ERROR.
df.plot(y='testing error', linewidth=2, figsize=(10, 8));
plt.xlabel('Value of K for KNN');
plt.ylabel('Error (lower is better)');

In [None]:
# Find the minimum testing error and the associated K value.
df.sort_values('testing error').head()

In [None]:
# Alternative method:
min(list(zip(testing_error, k_range)))

<a id="training-error-versus-testing-error"></a>
### Training Error Versus Testing Error

In [None]:
# Plot the relationship between K (LOW TO HIGH) and both TRAINING ERROR and TESTING ERROR.
df.plot();
plt.xlabel('Value of K for KNN');
plt.ylabel('Error (lower is better)');

- **Training error** decreases as model complexity increases (lower value of K)
- **Testing error** is minimized at the optimum model complexity

Evaluating the training and testing error is important. For example:

- If the training error is much lower than the test error, then our model is likely overfitting. 
- If the test error starts increasing as we vary a hyperparameter, we may be overfitting.
- If either error plateaus, our model is likely underfitting (not complex enough).
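
To see these patterns in numbers, the sketch below compares training and testing error for a few values of K. It uses scikit-learn's bundled iris data rather than the NBA file so that it runs on its own; the variable names (`X_demo`, `errors`, and so on) and the particular K values are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

errors = {}
for k in (1, 5, 25, 75):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_error = 1 - model.score(X_tr, y_tr)   # error on data the model has seen
    test_error = 1 - model.score(X_te, y_te)    # error on held-out data
    errors[k] = (train_error, test_error)
    print(k, round(train_error, 3), round(test_error, 3))
```

At K=1 every training point is its own nearest neighbor, so a training error of zero says little about how the model will do on new data; the testing error is the number to watch.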

![Training_testing_error](./assets/training_testing_error.png)

#### What is the best value of k for this model?

#### Making Predictions on Out-of-Sample Data

Given the statistics of a (truly) unknown NBA player, how do we predict his position?

In [None]:
#import numpy as np

# Instantiate the model with the best-known parameters.
knn = KNeighborsClassifier(n_neighbors=14)

# Re-train the model with X and y (not X_train and y_train). Why?
knn.fit(X, y)

# Make a prediction for an out-of-sample observation.
knn.predict(np.array([2, 1, 0, 1, 2]).reshape(1, -1))


In [None]:
#What's the probability of that prediction?
np.amax(knn.predict_proba([[2, 1, 0, 1, 2]]))
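
Where does that probability come from? With the default `weights='uniform'`, the value KNN reports for each class is simply the share of the K neighbors that belong to it. The toy data below is made up to show this with K=5.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D feature with two classes, for illustration only.
X_toy = [[1.0], [2.0], [3.0], [4.0], [5.0], [20.0]]
y_toy = [0, 0, 1, 1, 1, 0]

model = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)

# The five nearest neighbors of 3.0 are 1, 2, 3, 4 and 5:
# two of class 0 and three of class 1.
proba = model.predict_proba([[3.0]])[0]
print(model.classes_)   # column order of the probabilities
print(proba)            # vote shares: 2/5 for class 0, 3/5 for class 1
```

This also means the probabilities can only take values that are multiples of 1/K, so they are coarse for small K.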


In [None]:
min(list(zip(testing_error, k_range)))

What could we conclude?

- When using KNN on this data set with these features, the **best value for K** is likely to be around 14.
- Given the statistics of an **unknown player**, we estimate that we would be able to correctly predict his position about 74% of the time.

## Introducing GridSearch

[GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

#### What is GridSearch?

We did a lot of work above to find and refine the best value of **k**. Bad news: that wasn't the only parameter we could optimize!

Luckily, GridSearch was designed for intellectually curious and realistically lazy practitioners such as ourselves. I love the official description from scikit-learn:

> *Exhaustive search over specified parameter values for an estimator.*

GridSearch allows you to define a grid of parameters that will be searched using K-fold cross-validation. Essentially it's going to automate our **for** loop from above.
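
To get a feel for what "exhaustive" means, the sketch below enumerates every combination in a small grid using only the standard library. It illustrates the idea and is not scikit-learn's implementation; the grid itself (30 values of `n_neighbors` and the two `weights` options) is a hypothetical example.

```python
from itertools import product

# A hypothetical grid: 30 values of K and two weighting schemes.
param_grid_example = {
    'n_neighbors': list(range(1, 31)),
    'weights': ['uniform', 'distance'],
}

# Every combination of one value per parameter.
names = list(param_grid_example)
combinations = [dict(zip(names, values))
                for values in product(*param_grid_example.values())]

n_folds = 10
n_fits = len(combinations) * n_folds

print(len(combinations))   # number of candidate models
print(n_fits)              # number of model fits with 10-fold cross-validation
print(combinations[0])
```

The number of fits grows multiplicatively with every parameter you add, which is worth keeping in mind before building a very large grid.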



In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV 

## Two ways we can explore this section below

1. Let's start with the Iris Dataset
2. Use the second cell on your own to review against the NBA dataset


In [None]:
# If you chose the Iris dataset, run this cell.
# Read in the iris data.
iris = load_iris()

# create X (features) and y (response)
X = iris.data
y = iris.target


In [None]:
# If you're using the Basketball dataset then run this cell - otherwise skip for now. Commented out to protect it

# Create feature matrix (X).
#feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']

#X = nba[feature_cols]
#y = nba.pos_num  # Create response vector (y).

#### Let's run a KNN using a cross-validation of 10 with k=5

**Notes**
- We'll create a variable called `scores` to store our cross-validation scores
- Our scoring method will be 'accuracy' since it's a classification problem
- Cross validation will take care of our splitting so we don't need a train_test_split. We'll pass our whole X and y into the model
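
If it helps to picture the splitting, here is a minimal sketch of plain, unshuffled k-fold splitting written from scratch. The helper `kfold_indices` is hypothetical. Note one difference from what happens in the lesson: when `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds, which also keep class proportions similar in each fold.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for plain, unshuffled k-fold splitting."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=150, n_folds=10))
print(len(folds))                            # 10 train/test pairs
print(len(folds[0][0]), len(folds[0][1]))    # 135 training rows, 15 test rows
```

Every observation lands in the test set exactly once, which is why the mean of the ten scores is a reasonable estimate of out-of-sample accuracy.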

In [None]:

# instantiate model
knn = KNeighborsClassifier(n_neighbors=5)

# Capture our cross validation scores
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)

#### We use average accuracy as an estimate of out-of-sample accuracy

In [None]:
# What kind of object is scores?
type(scores)

In [None]:
# It's a numpy array! We know those! Let's calculate the mean to give us an estimate of out-of-sample accuracy
print(scores.mean())

#### Good - well to understand where we're going we have to understand where we've been. Let's run through this quickly the old way

In the cells below we'll capture the accuracy for k from 1 to 30. To do so we'll:
- establish a range to pass to k
- Create a list to hold our scores
- Design a for loop to do the heavy lifting for us
- Look at our scores


In [None]:
# list of integers 1 to 30
k_range = range(1, 31)
# list of scores from k_range
k_scores = []
# 1. we will loop through reasonable values of k
for k in k_range:
    # 2. run KNeighborsClassifier with k neighbours
    knn = KNeighborsClassifier(n_neighbors=k)
    # 3. obtain cross_val_score for KNeighborsClassifier with k neighbours
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    # 4. append mean of scores for k neighbors to k_scores list
    k_scores.append(scores.mean())
print(k_scores)

In [None]:
k_dict=dict(zip(k_range,k_scores))
k_dict

In [None]:
# What value of k gave us our highest score?
max(k_dict, key= k_dict.get)

#### Now we'll plot it to see how the accuracy changes as K increases

How is this different than our NBA model?

In [None]:
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

#### Even with some practice this isn't as intuitive. Let's take a step forward

In [None]:
# We have our k_range let's use that to build a parameter grid

param_grid=dict(n_neighbors=list(k_range))
print(param_grid)

#### Now we're ready to cook with gas

Let's instantiate a GridSearchCV algorithm. From the documentation we see what we need to pass into it

>```class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')```

- estimator = KNN
- param_grid = pass in our newly minted param_grid
- scoring = 'accuracy' (since it's a classification)
- cv=10
- Everything else (leave as a default for now)

In [None]:
# instantiate a GridSearch object - here a classifier (clf)
clf = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=True)

In [None]:
type(clf)

In [None]:
clf.fit(X,y)

#### What just happened?

The `clf` object (our grid) just ran 10-fold cross validation on a KNN model, using classification accuracy as the evaluation metric
- The parameter grid told it to repeat the 10-fold cross validation process 30 times
- Each time, the n_neighbors parameter was given a different value from the list
- We can't give GridSearchCV just a list. We had to specify a particular parameter for our model, KNN, and give it values. In this case we told it n_neighbors should take a value of 1 through 30
- You can also set n_jobs = -1 to run computations in parallel (if supported by your computer and OS). This is called parallel processing
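As a sanity check on the idea that the grid search automates the earlier **for** loop, the sketch below runs both approaches and compares the mean accuracies. It uses a shorter range of k (1 to 10) to keep it quick; the `_demo` and `grid` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
k_values = list(range(1, 11))

# The manual approach: one cross_val_score call per value of k
loop_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
                    cv=10, scoring='accuracy').mean()
    for k in k_values
]

# The grid search approach: the same folds and the same metric
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': k_values},
                    cv=10, scoring='accuracy')
grid.fit(X_demo, y_demo)
grid_means = grid.cv_results_['mean_test_score']

print(np.allclose(loop_means, grid_means))
```

Both routes use the same unshuffled stratified folds, so the per-k averages should agree.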

#### Well if it's ready - Let's kick the tires and light the fires!

In [None]:
# Want to know everything about what we just tested? Warning! It's exhaustive!
clf.cv_results_

In [None]:
# You can slice by the key/value pairs you want to make this a little easier to understand

list(zip(clf.cv_results_['mean_test_score'],clf.cv_results_['params']))

In [None]:
# We can find ways to make this easier on ourselves
knn_score=list(zip(clf.cv_results_['mean_test_score'],clf.cv_results_['params']))

sorted(knn_score,key=lambda x: x[0],reverse=True)[0]
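Another option for reading `cv_results_`, sketched here under the assumption that pandas is available, is to load the dictionary into a DataFrame and sort by rank. The `demo_grid` object is a small stand-in for the lesson's `clf`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
demo_grid = GridSearchCV(KNeighborsClassifier(),
                         {'n_neighbors': list(range(1, 11))},
                         cv=10, scoring='accuracy')
demo_grid.fit(X_demo, y_demo)

# cv_results_ is a dict of equal-length arrays, so it converts cleanly
results = pd.DataFrame(demo_grid.cv_results_)
columns = ['param_n_neighbors', 'mean_test_score',
           'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score')

print(summary.head())
```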

In [None]:
# You can also skip to the end and stand on the shoulders of the people who came before us
print(clf.best_score_)
print(clf.best_params_)
knn=clf.best_estimator_

In [None]:
# plot the results
# this is identical to the one we generated above
mean_scores=clf.cv_results_['mean_test_score'].tolist()
plt.plot(k_range, mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

### If you're not convinced of the power of GridSearch yet - strap yourself in

We just tuned KNN for the k value - but what else can we pass to [KNN](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)?

>```KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)```


We could try them all but for now let's increase to two parameters
1. **n_neighbors**
2. **weights**

Here's what you need to know about weights from the documentation

>**weights** : str or callable, optional (default = 'uniform')

>weight function used in prediction. Possible values:
> - 'uniform' : uniform weights. All points in each neighborhood are weighted equally.
> - 'distance' : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
> - [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

We won't make our own callable function this time but it's important to know we could pass in our own if needed.
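For reference, a callable weight function could look like the sketch below. The `gaussian_weights` function is a hypothetical example, not something from the lesson or from scikit-learn; it only follows the documented contract of taking an array of distances and returning weights of the same shape.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

def gaussian_weights(distances):
    """Hypothetical weighting: influence decays smoothly with distance."""
    return np.exp(-(distances ** 2) / 2.0)

X_demo, y_demo = load_iris(return_X_y=True)

# Pass the function itself (not a string) as the weights argument
knn_custom = KNeighborsClassifier(n_neighbors=5, weights=gaussian_weights)
knn_custom.fit(X_demo, y_demo)

custom_pred = knn_custom.predict(X_demo[:3])
print(custom_pred)
```

A callable like this could also be added to `weight_options` in a parameter grid, alongside `'uniform'` and `'distance'`.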

In [None]:
#Let's setup our param_grid

#First the k's
k_range=list(range(1,31))

#Now we'll add in weights
weight_options=['uniform','distance']

#Remember - we need to pass these to the param_grid as a dictionary
param_grid=dict(n_neighbors=k_range, weights=weight_options)

In [None]:
#Never hurts to check our work
print(param_grid)
print(type(param_grid))

In [None]:
#Now to instantiate our grid

clf=GridSearchCV(knn,param_grid,cv=10, scoring='accuracy')

#Next we fit the grid

clf.fit(X,y)

In [None]:
#Using our cheat sheet to pull out the accuracy and parameters values from the grid
knn_score=list(zip(clf.cv_results_['mean_test_score'],clf.cv_results_['params']))

sorted(knn_score,key=lambda x: x[0],reverse=True)[0]

In [None]:
# Finally - what worked best?

print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Fantastic! - GridSearch found the best parameters for our model

#### Next - let's use them to make predictions

Steps
 - Import the model (already done)
 - Instantiate the model with the best parameters
 - Fit the model
 - Predict

In [None]:
# instantiate model with best parameters
knn = KNeighborsClassifier(n_neighbors=13, weights='uniform')



#### Quick note on fitting

We are going to make predictions with **new** data. At this point I don't need cross validation or train_test_split. If I train with either method I may throw away potentially valuable data given how KNN works. We want to use all known observations for our classification

In [None]:
#Fit the model
knn.fit(X,y)

In [None]:
# Make a prediction. Here we need our four measurements

knn.predict([[1,2,3,4]])

### We've taken our ideal parameters and used them in our model to make a more accurate prediction

#### But... is there an easier way?

Remember how I said GridSearch was for the intellectually curious but... lazy? Well there's another shortcut.

Since we passed our model, data and parameter ranges to GridSearch - it has everything it needs to make a prediction on the best model
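This works because `refit=True` is the default: after the search, GridSearchCV refits the best parameter combination on all of the data it was given and forwards `predict` to that model. A minimal sketch, using an illustrative `demo_grid` and the measurements of the first iris row as the sample:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
demo_grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]},
                         cv=10, scoring='accuracy', refit=True)
demo_grid.fit(X_demo, y_demo)

sample = [[5.1, 3.5, 1.4, 0.2]]  # first row of the iris data (a setosa)

via_grid = demo_grid.predict(sample)                  # the shortcut
via_best = demo_grid.best_estimator_.predict(sample)  # the explicit route

print(via_grid, via_best)
```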

In [None]:
#Here's your shortcut

clf.predict([[1,2,3,4]])

In [None]:
# What does that mean? Either look at the data dictionary for Iris or inspect the data. To keep it easy, let's look at the target names: the predicted integer is an index into this array

iris.target_names

# What was our prediction?

## Other Great Tools

1. [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) Eventually you'll want to search across many parameters, or across ranges with len(values)>100, and you may reach the limits of time or your machine. In that case, use something like RandomizedSearchCV
> In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.

2. [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) From our preprocessing toolkit, this transformer helps with the many machine learning models that are sensitive to scale. More information is outlined in the section below, as this **WILL** be important for many of your projects
    - [PreProcessing Tools](http://scikit-learn.org/stable/modules/preprocessing.html) Actually - it's worth looking through the whole preprocessing toolset



3. [LabelPropagation](http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html#sklearn.semi_supervised.LabelPropagation) Label Propagation is a semi-supervised machine learning algorithm that assigns labels to previously unlabeled data points. At the start of the algorithm, a (generally small) subset of the data points have labels (or classifications). These labels are propagated to the unlabeled points throughout the course of the algorithm.
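As a rough sketch of the first tool in this list, the example below samples 10 settings from a KNN search space instead of trying all of them. The search space, `n_iter=10`, and `random_state=42` are illustrative choices, not values from the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# 30 * 2 * 2 = 120 possible combinations; we only sample 10 of them
param_distributions = {
    'n_neighbors': list(range(1, 31)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions,
                            n_iter=10, cv=10, scoring='accuracy',
                            random_state=42)
search.fit(X_demo, y_demo)

print(len(search.cv_results_['params']), search.best_params_)
```

The trade-off is that the best sampled setting may not be the best overall, but the run time is set by `n_iter` rather than by the size of the grid.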

<a id="standardizing-features"></a>
## Standardizing Features
---

There is one major issue that applies to many machine learning models: They are sensitive to feature scale. 

> KNN in particular is sensitive to feature scale because it (by default) uses the Euclidean distance metric. To determine closeness, Euclidean distance sums the square difference along each axis. So, if one axis has large differences and another has small differences, the former axis will contribute much more to the distance than the latter axis.

This means that it matters whether our features are centered around zero and have similar variance to each other.

Unfortunately, most data does not naturally start at a mean of zero and a shared variance. Other models tend to struggle with scale as well, even linear regression, when you get into more advanced methods such as regularization.
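To see how strongly one large-scale feature can dominate, here is a small sketch on made-up data: the first column is in the thousands and the second in single digits. The data and the `squared_distance_share` helper are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is on a much larger scale than column 1
X_demo = np.array([[1000., 1.],
                   [2000., 2.],
                   [3000., 3.],
                   [4000., 4.]])

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared_diffs = (a - b) ** 2
    return squared_diffs / squared_diffs.sum()

raw_share = squared_distance_share(X_demo[0], X_demo[1])

X_scaled = StandardScaler().fit_transform(X_demo)
scaled_share = squared_distance_share(X_scaled[0], X_scaled[1])

print(raw_share)     # almost everything comes from column 0
print(scaled_share)  # the two columns now contribute equally
```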

Fortunately, this is an easy fix.

<a id="use-standardscaler-to-standardize-our-data"></a>
### Use `StandardScaler` to Standardize our Data

StandardScaler standardizes our data by subtracting the mean from each feature and dividing by its standard deviation.
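The calculation can be checked by hand. This sketch, on a small made-up array, compares a manual `(x - mean) / std` against `StandardScaler`; it relies on both NumPy's `std` and `StandardScaler` using the population standard deviation by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with two features on different scales
X_demo = np.array([[1., 10.],
                   [2., 20.],
                   [3., 60.]])

# Manual standardization, column by column
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same thing with scikit-learn
scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has a mean of (approximately) zero and a standard deviation of one.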

#### Separate feature matrix and response for scikit-learn.

In [None]:
# Create feature matrix (X).
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']

X = nba[feature_cols]
y = nba.pos_num  # Create response vector (y).

#### Create the train/test split.

Notice that we create the train/test split first. This is because we will reveal information about our testing data if we standardize right away.
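The same concern applies inside cross validation: the scaler should only see the training folds. One way to get that behavior, sketched here on the iris data rather than the NBA data, is to wrap the scaler and the model in a pipeline so the scaler is refit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Within each fold, StandardScaler is fit on the training portion only,
# then applied to the held-out portion before KNN scores it
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))

pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring='accuracy')
print(pipe_scores.mean())
```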

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

#### Instantiate and fit `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Let's investigate what we did
pd.DataFrame(X_train).describe()

#### Fit a KNN model and look at the testing error.
Can you find a number of neighbors that improves our results from before?

In [None]:
# Calculate testing error.
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)

y_pred_class = knn.predict(X_test)
testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
testing_error = 1 - testing_accuracy

print(testing_error)

<a id="comparing-knn-with-other-models"></a>
## Comparing KNN With Other Models
---

**Advantages of KNN:**

- It's simple to understand and explain.
- It's a lazy learner, so model training is fast.
- It can be used for classification and regression (for regression, take the average value of the K nearest points!).
- Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.
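The regression point can be illustrated with a tiny made-up data set: with uniform weights, the prediction is the average target of the K nearest points. The data and query value below are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-feature data, chosen so the neighbors are easy to see
X_demo = np.array([[0.0], [1.0], [2.0], [10.0]])
y_demo = np.array([0.0, 1.0, 2.0, 10.0])

reg = KNeighborsRegressor(n_neighbors=3)  # uniform weights by default
reg.fit(X_demo, y_demo)

query = [[1.2]]
reg_pred = reg.predict(query)

# The three nearest training points to 1.2 are x=1, x=2 and x=0
manual_mean = y_demo[[1, 2, 0]].mean()

print(reg_pred[0], manual_mean)
```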

**Disadvantages of KNN:**

- It must store all of the training data.
- Its prediction phase can be slow when n is large.
- It is sensitive to irrelevant features.
- It is sensitive to the scale of the data. (Check out the Standardizing Features section above.)
- Accuracy is (generally) not competitive with the best supervised learning methods.

## Bonus Exercise - Predicting Customer Churn

In [None]:
# read in data

cell = pd.read_csv('./data/cell_phone_churn.csv')

In [None]:
cell.head()

**Assign X and Y variables.**

In [None]:
y = cell['churn']
X = cell.iloc[:, 0:19]

**Convert y variable if needed and create dummy variables as needed**

Learning a new tool: [Label Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

In [3]:
from sklearn.preprocessing import LabelEncoder

# Convert the y variable to a binary variable
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
# Dummify Categorical Variables in X dataframe, so it can be input into model
X = pd.get_dummies(X)
X.head()

**Split into train test using 70/30 split. The solutions use random_state=1984. You can select any value, but note that your results may not match exactly; they should be close.**

In [None]:
X_train, X_test, y_train, y_test =  train_test_split(X,y, test_size=0.3, random_state=1984)

#### Instantiate and fit `StandardScaler`.

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Calculate Baseline Accuracy

In [None]:
cell.churn.value_counts(normalize=True)

In [None]:
pd.Series(y_test).value_counts(normalize=True)
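The baseline is the accuracy you would get by always predicting the most common class. As a sketch on made-up labels (85% class 0, 15% class 1, not the churn data), scikit-learn's `DummyClassifier` computes the same number that `value_counts(normalize=True)` shows for the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 85 customers who stayed (0) and 15 who churned (1)
y_demo = np.array([0] * 85 + [1] * 15)
X_placeholder = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_placeholder, y_demo)

baseline_accuracy = baseline.score(X_placeholder, y_demo)
print(baseline_accuracy)
```

Any model worth keeping should beat this number on the test set.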

#### Try Cross-Validation

In [None]:
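import numpy as np
from sklearn.model_selection import GridSearchCV

# Candidate values of k: 1 through 20
k = np.arange(1, 21)
knn = KNeighborsClassifier()

params = {'n_neighbors': k}
gs = GridSearchCV(
    estimator=knn,
    param_grid=params,
    cv=5)

# Scale the full feature matrix with the scaler fit on the training split
X = scaler.transform(X)

gs.fit(X, y)
gs.cv_results_['mean_test_score']  # Mean cross-validated accuracy for each k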

#### Which K is best?

In [None]:
k[gs.best_index_]

**Use that K with the train/test split.**

In [None]:
from sklearn.metrics import accuracy_score

# Fit KNN with the chosen k, then compare training and testing accuracy.
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)

train_pred = knn.predict(X_train)
test_pred = knn.predict(X_test)
# Calculate train and test accuracy
print('Training Accuracy Score:', accuracy_score(y_train, train_pred))
print('Testing Accuracy Score:', accuracy_score(y_test, test_pred))


#### 2. Repeat Using Feature Selection

Reassign the X and y variables from the start.

In [None]:
y = cell['churn']
X = cell.iloc[:, 0:19]

**Convert y variable if needed and create dummy variables as needed**

In [None]:
# Convert the y variable to a binary variable
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
# Dummify Categorical Variables in X dataframe, so it can be input into model
X = pd.get_dummies(X)
X.head()

**Split into train test using 70/30 split.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1984)

#### Instantiate and fit `StandardScaler`.

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
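from sklearn.ensemble import ExtraTreesRegressor

# Feature selection: rank features by importance.
# Note: ExtraTreesRegressor treats the encoded 0/1 churn label as a number;
# ExtraTreesClassifier is the classification counterpart.
model = ExtraTreesRegressor(random_state=99)
model.fit(X_train, y_train)
print(model.feature_importances_)  # Higher scores mean more important features.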

In [None]:
[x for _,x in sorted(zip(model.feature_importances_,X), reverse=True)][0:10]
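An alternative to zipping and sorting by hand, sketched here on the iris data with an `ExtraTreesClassifier` (an assumption for this illustration, not the model used in this cell), is to label the importances with column names in a pandas Series.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

iris_frame = load_iris(as_frame=True)
X_demo, y_demo = iris_frame.data, iris_frame.target

forest = ExtraTreesClassifier(n_estimators=100, random_state=99)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name, largest first
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

top_two = importances.head(2).index.tolist()  # names of the two strongest features
print(importances)
```

`X_demo[top_two]` would then select those columns, much like the list comprehension above does for the top 10.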

**Reassign X with new columns and resplit**

In [None]:
X=X[[x for _,x in sorted(zip(model.feature_importances_,X), reverse=True)][0:10]]

**Split into train test using 70/30 split.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1984)

#### Instantiate and fit `StandardScaler`.

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Try Cross-Validation

In [None]:
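import numpy as np
from sklearn.model_selection import GridSearchCV

# Candidate values of k: 1 through 20
k = np.arange(1, 21)
knn = KNeighborsClassifier()

params = {'n_neighbors': k}
gs = GridSearchCV(
    estimator=knn,
    param_grid=params,
    cv=5)

# Scale the reduced feature matrix with the scaler fit on the training split
X = scaler.transform(X)

gs.fit(X, y)
gs.cv_results_['mean_test_score']  # Mean cross-validated accuracy for each k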

#### Which K is best?

In [None]:
k[gs.best_index_]

**Use that K with the train/test split.**

In [None]:
from sklearn.metrics import accuracy_score

# Fit KNN with the chosen k, then compare training and testing accuracy.
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)

train_pred = knn.predict(X_train)
test_pred = knn.predict(X_test)
# Calculate train and test accuracy
print('Training Accuracy Score:', accuracy_score(y_train, train_pred))
print('Testing Accuracy Score:', accuracy_score(y_test, test_pred))

**Append a prediction column to the original dataset.**

In [None]:
pred=knn.predict(X)
cell['predicted_churn']=pred

In [None]:
cell.head(20)

**As a manager, how would you use this model and what strategies would you put in place?**