# K-Nearest Neighbors and Feature Scaling

Goals:

- Learn how the K-Nearest Neighbors (KNN) machine learning algorithm works and how to use it.
- Use the KNN model on the 2016 Democratic dataset.
- Feature engineering continued: scaling data with standard and minmax scalers.
- How and when to use scaling for your data.
- Class work: compare and contrast KNN and decision tree models on supervised classification datasets.

## K-Nearest Neighbors

- Known as the "easy" machine learning model
- Classifies an event based on its closest relatives in the data the model has been trained on. Hence the term "Nearest Neighbors". K = number of neighbors.
- Known as a voting classifier because the K nearest neighbors vote for the classification.
- Uses Euclidean distance (by default) to measure similarity.
- Pros: Fast to train, intuitive, easy to interpret, able to output class probabilities.
- Cons: Poor at handling many features, especially "noisy" features because it treats every feature equally. Not good with small sample sizes. Usually requires scaling.

![ED](https://4.bp.blogspot.com/-UDuXTjw5pbw/WkZ_Yt7qrWI/AAAAAAAAARw/BWh39dRCPzwP1jowVg9lSOH8yfHvrv1lQCLcBGAs/s1600/euclidian.PNG)

Source: Sumit Jha
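
To make the voting idea concrete, here is a minimal from-scratch sketch of the bullets above. It is not how scikit-learn implements KNN; the function name `knn_predict` and the toy arrays `X_toy` and `y_toy` are made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, point, k=3):
    """Classify `point` by majority vote among its k nearest training rows."""
    # Euclidean distance from the new point to every training row
    distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Each neighbor casts one vote for its own label
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: two small clusters labelled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_prediction = knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3)
print(toy_prediction)
```

The new point sits inside the first cluster, so all three of its nearest neighbors vote for class 0.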

In [None]:
#Imports
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight");

In [None]:
#Fake data time
data = make_classification(n_samples=200,
                           n_features=2,
                           n_classes=2,
                           n_informative=2,
                           n_redundant=0,
                           class_sep=.35,
                           random_state=5)

In [None]:
#Slice the features and target variable from data

X = 
y = 

In [None]:
#Plot the data with its color-encodings


Time to use K-Nearest Neighbors (KNN) to model this data.

Train a KNN model using 3 neighbors

In [None]:
#Initialize model and set n_neighbors equal to 3

#Fit the model on the "fake data"

#Find the accuracy score of the model on the data

print ("The model accurately labelled {:.2f} percent of the data".format())

Now with 5 neighbors

In [None]:
#Initialize model and set n_neighbors equal to 5

#Fit the model on the "fake data"

#Find the accuracy score of the model on the data

print ("The model accurately labelled {:.2f} percent of the data".format())

Apply model on a new point

In [None]:
#New data point
new_data = np.asarray([0.18,0.15]).reshape(1,-1)

#Make predictions on new_data using both models
pred3 = 
pred5 = 

#Call those predictions
print ("The knn3 model thinks new_data belongs to class {}".format(pred3[0]))
print ("The knn5 model thinks new_data belongs to class {}".format(pred5[0]))

Look at class probabilities

In [None]:
#Use predict_proba to find class probabilities on new_data


In [None]:
#For 5 neighbors


These probabilities are the vote percentages.
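
One way to check the vote-percentage claim is to look up the neighbors yourself and count their labels. The sketch below assumes the default uniform weighting and rebuilds the fake data under its own names (`X_demo`, `y_demo`, `knn_demo`), so it does not depend on the cells you fill in above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_classes=2,
                                     n_informative=2, n_redundant=0,
                                     class_sep=.35, random_state=5)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)
point = np.asarray([0.18, 0.15]).reshape(1, -1)

# Indices of the 5 training rows closest to the point
_, neighbor_idx = knn_demo.kneighbors(point)
neighbor_labels = y_demo[neighbor_idx[0]]

# Fraction of neighbors voting for class 0 and for class 1
vote_share = np.bincount(neighbor_labels, minlength=2) / len(neighbor_labels)

print(vote_share)
print(knn_demo.predict_proba(point)[0])
```

Both printed arrays should match: with uniform weights, `predict_proba` is the share of neighbors in each class.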

Visualize new point in relation to data

In [None]:
plt.figure(figsize=(15,11))
plt.xlim(0,0.4)
plt.ylim(0,.4)
plt.scatter(X[:,0], X[:,1], c=y, cmap = "RdBu", s=350)
#Plot of new_data point
plt.scatter([0.18], [0.15], c="purple", marker="*", s=2500)
plt.xlabel("Feature One")
plt.ylabel("Feature Two");

Classify the purple star using the KNN method.

Visualizing KNN

In [None]:
#Load in the plot_decision_boundary function
def plot_decision_boundary(model, X, y, n_neighbors):
    X_max = X.max(axis=0)
    X_min = X.min(axis=0)
    xticks = np.linspace(X_min[0], X_max[0], 100)
    yticks = np.linspace(X_min[1], X_max[1], 100)
    xx, yy = np.meshgrid(xticks, yticks)
    ZZ = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = ZZ >= 0.5
    Z = Z.reshape(xx.shape)
    fig, ax = plt.subplots()
    ax.contourf(xx, yy, Z, cmap="RdBu", alpha=0.2)
    ax.scatter(X[:,0], X[:,1], c=y, cmap = "RdBu",s=40, alpha=0.4)
    plt.title("Plot of {} neighbors".format(n_neighbors))
    plt.xlabel("Feature One")
    plt.ylabel("Feature Two")

In [None]:
#Visualize the knn3 model


In [None]:
#Visualize the knn5 model


13 neighbors

In [None]:
#Plot 13 neighbors
#Fit model first before you plot

25 neighbors! 

In [None]:
#Plot 25 neighbors
#Fit model first before you plot

## 2016 Democratic Primary Data

### Data cleaning

Dataset: County-level results of 2016 Democratic Primary and county demographic information.

Kaggle page: https://www.kaggle.com/benhamner/2016-us-election

In [None]:
#Load in data files
primary = pd.read_csv("../../data/primary_data/primary_results.csv")
county = pd.read_csv("../../data/primary_data/county_facts.csv")
county_dict = pd.read_csv("../../data/primary_data/county_facts_dictionary.csv")

Before we can model, we have to clean the data, but I've already done that work.

In [None]:
#Data cleaning 

subset_col_index = [0,3,5,9,10,12,18,20,23,25,33,34,53]

county = county.iloc[:,subset_col_index].copy()

subset_cols = ["fips","population", "pop_change", "senior_pop_per", "female_pop_per", "black_pop_per",
               "white_pop_per", "foreign_pop_per", "college_degree_pop_", "commute_time", "median_income",
               "poverty_rate", "pop_density"]

col_dict = dict(zip(county.columns, subset_cols))
#Use dictionary to rename the columns
county.rename(columns=col_dict, inplace=True)
primary.dropna(inplace=True)
bern = primary[primary.candidate== "Bernie Sanders"]
hill = primary[primary.candidate== "Hillary Clinton"]
bern = bern[["fips", "candidate", "votes"]]
dem = pd.merge(hill, bern, on="fips")
dem.rename(columns={"votes_x":"clinton_votes", "votes_y":"sanders_votes"}, inplace=True)
dem["winner"] = dem.clinton_votes - dem.sanders_votes
def vote_winner(x):
    if x >0:
        return "H"
    elif x == 0:
        return "TIE"
    else:
        return "B"
    
dem["winner"] = dem.winner.apply(vote_winner)

dem = dem[dem.winner!= "TIE"]
dem = dem[["fips", "winner"]]
df = pd.merge(county, dem, on="fips")
df.set_index("fips", inplace=True)
df.head()

Time for some modeling. We're going to use KNN to classify counties as Hillary or Bernie.

In [None]:
#Check null accuracy


In [None]:
#Assign X and y

X = 
y = 

In [None]:
#Fit model using a single neighbor



Perfect model!!!!

Oh wait

In [None]:
#Fit model using three neighbors


What happened here?

What about 7 neighbors?

In [None]:
#Fit model using seven neighbors

knn7 = 

Let's try something much higher

In [None]:
#Fit model using 29 neighbors
knn29 =

Try it out on a testing set

In [None]:
#Make a train/test split. Set test_size = .25
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=.25, 
                                                    random_state=42)

Fit model with 5 neighbors on training data and test model on testing data

In [None]:
#Fit model on training data


In [None]:
#Call confusion_matrix 



Good or bad?
<br><br>
Let's increase number of neighbors

In [None]:
#Fit model on training data


Big difference?

**Cross validation.**
<br><br>
Class exercise time: Make a validation curve plot of neighbors vs the 5-fold cross validated accuracy score of a KNN model. Use odd numbers from 3 to 39.
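
If you are unsure how to structure the loop, here is a rough sketch on synthetic data rather than the primary data. The names `X_demo`, `y_demo`, `neighbors`, and `cv_scores` are placeholders, and the null accuracy is computed as the share of the most frequent class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real feature matrix and target
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

neighbors = list(range(3, 40, 2))  # odd values 3, 5, ..., 39
cv_scores = []
for k in neighbors:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy across 5 folds for this value of k
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores.append(scores.mean())

# Null accuracy: always predict the most frequent class
null_accuracy = np.bincount(y_demo).max() / len(y_demo)

best_k = neighbors[int(np.argmax(cv_scores))]
print(best_k, max(cv_scores), null_accuracy)
```

Passing `neighbors` and `cv_scores` to `plt.plot` gives the validation curve, and a horizontal line at `null_accuracy` makes the baseline easy to compare against.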

In [None]:
#Answer


Which neighbor value(s) produce the best accuracy score?
<br><br>
How does that compare to the null accuracy?

What is the issue here? We can't seem to build a model that can significantly beat our null accuracy.
<br><br>
Think about the features and how they differ from each other.


## Scaling Data
<br><br>
[Feature scaling:](https://en.wikipedia.org/wiki/Feature_scaling) A method used to standardize the range of independent variables or features of data

Let's take a look at this sample data set.

In [None]:
#Initialize data
sample = {"income":[30000, 55000, 36000], 
          "white_pop":[50, 85, 95], 
          "college_deg":[15, 40, 50], 
          "class":["A","B", "X"]}

sample= pd.DataFrame(sample)
sample

Which class is row 2 (class X) closer to: A or B?
Let's use Euclidean distance to figure that out.

In [None]:
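#Assign the numeric feature columns of each row to variables
feature_cols = ["income", "white_pop", "college_deg"]
class_A = sample.loc[0, feature_cols].values.astype(float)
class_B = sample.loc[1, feature_cols].values.astype(float)
class_X = sample.loc[2, feature_cols].values.astype(float)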

In [None]:
#Euclidean distance between class A and class X


In [None]:
#Euclidean distance between class B and class X


Which class should class_X be assigned to based on this calculation? Do you agree or disagree?

This example demonstrates the necessity of feature scaling.
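
Here is a small worked sketch of that point, using the three sample rows typed in by hand. Standardizing only three rows is purely illustrative, and the helper `euclidean` is defined here rather than taken from a library.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows from the sample table: income, white_pop, college_deg
features = np.array([[30000, 50, 15],   # class A
                     [55000, 85, 40],   # class B
                     [36000, 95, 50]],  # class X
                    dtype=float)


def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return np.sqrt(((u - v) ** 2).sum())


# Distances on the raw values
raw_a = euclidean(features[0], features[2])
raw_b = euclidean(features[1], features[2])

# Distances after putting every column on the same scale
scaled = StandardScaler().fit_transform(features)
scaled_a = euclidean(scaled[0], scaled[2])
scaled_b = euclidean(scaled[1], scaled[2])

print(raw_a, raw_b)
print(scaled_a, scaled_b)
```

On the raw values income dominates the distance and X lands closer to A. Once the columns are standardized, the two percentage features count as much as income and X lands closer to B.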

From [Sebastian Raschka](http://sebastianraschka.com/Articles/2014_about_feature_scaling.html)

<b>Standardization</b>: "The result of standardization (or Z-score normalization) is that the features will be rescaled so that they’ll have the properties of a standard normal distribution with μ=0 and σ=1

Where μ is the mean (average) and σ is the standard deviation from the mean; standard scores (also called z scores) of the samples are calculated as follows:"
![e](http://www.statisticshowto.com/wp-content/uploads/2016/11/alternate-z-score.png)
<br><br>
<b>MinMax Scaling</b>: "An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max scaling (often also simply called “normalization” - a common cause for ambiguities).
In this approach, the data is scaled to a fixed range - usually 0 to 1.
The cost of having this bounded range - in contrast to standardization - is that we will end up with smaller standard deviations, which can suppress the effect of outliers.

A Min-Max scaling is typically done via the following equation:"
![d](https://qph.ec.quoracdn.net/main-qimg-0d692d88876aeb26b1f1a578d1c5a94e)
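
Both formulas are short enough to write by hand. The sketch below applies them to a single made-up column (`values`) and compares the results with scikit-learn's scalers. Note that `StandardScaler` divides by the population standard deviation, which matches NumPy's default `std`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[30000.0], [55000.0], [36000.0]])  # one feature column

# Standardization: subtract the mean, divide by the standard deviation
z_manual = (values - values.mean(axis=0)) / values.std(axis=0)

# Min-Max scaling: map the minimum to 0 and the maximum to 1
value_range = values.max(axis=0) - values.min(axis=0)
mm_manual = (values - values.min(axis=0)) / value_range

z_sklearn = StandardScaler().fit_transform(values)
mm_sklearn = MinMaxScaler().fit_transform(values)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
```

pandas uses the sample standard deviation by default in `.std()` and `.describe()`, so standardized columns viewed through pandas can show a standard deviation slightly different from 1.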

Let's scale the features using StandardScaler and MinMaxScaler

In [None]:
#Imports
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [None]:
#Initialize scalers


#Fit scalers on data


Fitting alone doesn't give us any scaled data yet; we then need to transform the data using the fitted scalers.

In [None]:
#Use ss and mm to transform X


We can fit and transform at the same time

In [None]:
#Initialize scalers
ss = StandardScaler()
mm = MinMaxScaler()

#Fit and transform data using scalers
X_ss = 
X_mm = 

In [None]:
#Make data frames from scaled data. Use columns from X

X_ss = 
X_mm = 

In [None]:
#Take a look at both data frames
X_ss.head()

In [None]:
X_mm.head()

In [None]:
#What happens when you call .describe() on X_ss


What do you notice about the means and standard deviations?

What happens when we receive new data? How do we scale it using the scale of our previous data?
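
The general pattern, sketched on made-up numbers: fit the scaler once on the original data, then call only `transform` on anything that arrives later. The arrays `X_old` and `X_new` and the scaler names `ss_demo` and `mm_demo` are placeholders.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical "old" data the scalers are fitted on
X_old = np.array([[30000.0, 50.0], [55000.0, 85.0], [36000.0, 95.0]])
# Hypothetical new row that arrives later
X_new = np.array([[70000.0, 40.0]])

ss_demo = StandardScaler().fit(X_old)
mm_demo = MinMaxScaler().fit(X_old)

# transform() reuses the statistics learned from X_old; it does not refit
X_new_ss = ss_demo.transform(X_new)
X_new_mm = mm_demo.transform(X_new)

print(X_new_ss)
print(X_new_mm)
```

Because the new row lies outside the range of the old data, its Min-Max values fall outside 0 to 1. That is expected when you reuse a fitted scaler.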

In [None]:
#Select San Francisco and Santa Cruz counties
ba = county[(county.fips==6075) | (county.fips==6087)].drop("fips", axis=1)
ba

In [None]:
#Use the ss scaler object used to fit and transform X to transform ba.
ba_ss = 
ba_ss

In [None]:
#Use the mm scaler object used to fit and transform X to transform ba.
ba_mm = 
ba_mm

Class exercise time: 

Work with a partner to investigate whether or not our model significantly improves when using scaled data. Which scaler improves our modeling more? Use cross validation and charts and examine as many neighbors as possible.
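
One possible setup for the comparison, sketched on synthetic data. It assumes a scikit-learn pipeline so the scaler is refit inside each training fold. The data here is deliberately constructed with one noise feature on a huge scale, so don't expect the same gap on real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two informative features plus one pure-noise feature on a much larger scale
X_info, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, class_sep=2.0, random_state=0)
rng = np.random.RandomState(0)
noise = rng.normal(scale=10000, size=(300, 1))
X_demo = np.hstack([X_info, noise])

raw_model = KNeighborsClassifier(n_neighbors=5)
# The pipeline scales the training folds, then applies that scaling to the test fold
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validated accuracy with and without scaling
raw_mean = cross_val_score(raw_model, X_demo, y_demo, cv=5).mean()
scaled_mean = cross_val_score(scaled_model, X_demo, y_demo, cv=5).mean()
print(raw_mean, scaled_mean)
```

Swapping `StandardScaler()` for `MinMaxScaler()` and looping over several values of `n_neighbors` extends this into the full comparison.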


In [None]:
#Answer

### Resources:
<br>
KNN:
- https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
- https://www.dataquest.io/blog/k-nearest-neighbors-in-python/
- https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/
- http://people.revoledu.com/kardi/tutorial/KNN/index.html
<br><br>

Feature scaling:

- https://machinelearningmastery.com/scale-machine-learning-data-scratch-python/
- https://www.datacamp.com/community/tutorials/preprocessing-in-data-science-part-1-centering-scaling-and-knn
- https://pythonprogramming.net/preprocessing-machine-learning/
- http://benalexkeen.com/feature-scaling-with-scikit-learn/

## In-class lab.
<br><br>
For the rest of class, work on modeling one of the following datasets: primary, spotify, employee churn (HR_comma_sep.csv), iris, titanic, or use fake data from sklearn.
<br><br>
Compare and contrast decision trees with KNN. Drop and transform features. Play around as much as possible with the data and see if that improves your model.
<br><br>
Check out the bonus lesson, in which I use a KNN-like algorithm to determine similarities between soccer players and decide which city Amazon should choose for its new headquarters.

### Spotify data

A dataset of songs I like and dislike and their attributes from Spotify. 1 = like, 0 = dislike<br><br>



<b>Attributes:</b>

- Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- Instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
        
More details: https://developer.spotify.com/web-api/get-audio-features/

My article using this dataset: https://opendatascience.com/blog/a-machine-learning-deep-dive-into-my-spotify-data/

In [None]:
# music = pd.read_pickle("../data/Spotify_Data.pkl")
# music