# GA Data Science 10 (DAT10) - Lab4

### KNN, Bokeh

Justin Breucop

### Last Time:

- #### Numpy
- #### Pandas
- #### Bokeh Overview

#### Questions?

### Agenda

1. Scikit Learn & Bokeh Templates: Prepping for ML
2. KNN: Proximal Vox Populi

Appendix: A note on `bokeh.charts`

# 1. Scikit Learn & Bokeh Templates

Where do we find Machine Learning algorithms in Python?

    sklearn - http://scikit-learn.org/stable/

Scikit Learn is a large collection of tools for data mining & data analysis. It contains the base algorithms for many machine learning strategies and also has well-developed data processing and model selection capabilities. Many complex products can be built using sklearn.

In [None]:
# from the datasets module, load the iris data into a variable called sk_iris
from sklearn import datasets

sk_iris = datasets.load_iris()

What does the sk_iris data set look like?

In [None]:
type(sk_iris)
# sk_iris

In [None]:
# Since this is a new data type, we need to understand what methods we can use on it. 
help(sk_iris)

In [None]:
# Since it's like a dict object, we can explore it like we would a python dictionary (keys!!!)
sk_iris.keys()

## Inspecting Data Visually

In [None]:
from bokeh.plotting import figure, output_notebook, show, gridplot
import numpy as np
import pandas as pd

# display matplotlib items in the notebook (used with pd.DataFrame.plot())
%matplotlib inline
output_notebook()  # display bokeh visuals within the notebook

In [None]:
sk_iris['data']

In [None]:
sk_iris['feature_names']

In [None]:
sk_iris['target_names']

Some thoughts about dataframes: they are very easy to use, work with, & plot. If you can, convert your data to one during your exploration. When you have an array and a list of column names, a handy way to construct a dataframe is the core constructor function `pd.DataFrame`.

As a dataframe we can do some quick exploration and understand more about our data. Throughout this lab, we'll use `sk_iris` to denote the sklearn dataset and `iris` to refer to our dataframe.

In [None]:
iris = pd.DataFrame(sk_iris['data'],columns=sk_iris['feature_names'])
iris

In [None]:
iris.plot(kind="scatter",x=1,y=2,c='r',title="Base Visual")

But something is missing here. We have what we'd call **labelled data**: every row of measurements has been labelled with the scientific name of the iris it came from. We can also call this the "target" data, or the target label we use to classify our data. To work with labels, we need to use the `target` array of our original dataset.

In [None]:
sk_iris['target']

In [None]:
# We have three possible values: 0, 1, or 2. We can construct a vector of colors to 
# make our plot easier to read.

colors = []
for target in sk_iris['target']:
    if target == 0:
        colors.append('red')
    elif target == 1:
        colors.append('orange')
    elif target == 2:
        colors.append('blue')
print(colors)

In [None]:
#another way to build that list, using list comprehensions:
colorMap = {0:'red',1:'orange',2:'blue'}
colors = [ colorMap[x] for x in sk_iris['target'] ]
print(colors)

In [None]:
#We can pass our list of colors to the plot like so to get a better visual of what's going on.

iris.plot(kind='scatter',x=3,y=1,c=colors)

That's a great start, but for a more advanced, prettier visualization, let's use Bokeh.

In [None]:
feat_x = iris.columns[1]
feat_y = iris.columns[3]

p1 = figure(plot_width=400, plot_height=400, 
            x_axis_label=feat_x, y_axis_label=feat_y)
p1.circle(iris[feat_x], iris[feat_y], line_width=1, color=colors, alpha=0.4,size=8)

show(p1)

Notice that because alpha makes each point partly transparent, we can see where data overlaps: bolder colors mean more data points at that spot.

This is only 1 of many plots we can make. Let's generate the entire set programmatically!

In [None]:
plots = []
for feat_x in iris.columns:
    for feat_y in iris.columns:
        
        temp_p = figure(plot_width=200, 
                        plot_height=200, 
                        x_axis_label=feat_x, 
                        y_axis_label=feat_y
                       )
        temp_p.circle(iris[feat_x], 
                      iris[feat_y], 
                      line_width=1, 
                      color=colors, 
                      alpha=0.4,
                      size=5)
        
        temp_p.xaxis.axis_label_text_font_size = '9pt'
        temp_p.yaxis.axis_label_text_font_size = '9pt'

        plots.append(temp_p)

# gridplot takes nested lists of bokeh figures and arranges them on the grid in the positions given.
# Passing None inserts a blank.

# To get a square grid, we reshape the flat list into an array with the # of rows equal to the # of columns.
# reshape needs integer dimensions, so we cast the square root to int.
side = int(len(plots) ** 0.5)
gplots = np.array(plots).reshape(side, side)

# REMEMBER: gridplot takes a list of lists, so we convert gplots with the .tolist() method
a = gridplot(gplots.tolist())
show(a)


This is a very quick way to visually inspect your data.
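
The reshape step can also be done without NumPy. As a minimal sketch (the helper name `to_grid` is made up for illustration and is not part of Bokeh), the function below chunks a flat list into the list-of-lists shape that `gridplot` expects, padding an incomplete last row with `None` so those cells stay blank.

```python
def to_grid(items, ncols):
    """Chunk a flat list into rows of length ncols, padding with None."""
    rows = []
    for start in range(0, len(items), ncols):
        row = list(items[start:start + ncols])
        # Pad a short final row so every row has the same length.
        row.extend([None] * (ncols - len(row)))
        rows.append(row)
    return rows

# Strings stand in for bokeh figures here so the sketch runs on its own.
placeholders = ["p{}".format(i) for i in range(5)]
grid = to_grid(placeholders, 2)
```

With real figures, `gridplot(to_grid(plots, 4))` would be the equivalent call for the 16 plots above.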

# 2. KNN

A big part of this lab is understanding general sklearn syntax. Each family of algorithms has its own knobs and levers for tuning, but the models share an overall structure that will help you as you move forward.
1. All models need to be trained. Sklearn models have a `.fit` method for doing so.
2. We need to use the model to make a guess. The `.predict` method takes data and returns the model's guess for the value. The details of what it accepts and returns depend on the specific model.
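
To make the `.fit` / `.predict` pattern concrete, here is a rough sketch of a k-nearest-neighbors classifier in plain Python. The class name `TinyKNN` and the toy data are invented for illustration, and the use of Euclidean distance with a simple majority vote is an assumption of this sketch; it is not sklearn's implementation.

```python
from collections import Counter

class TinyKNN(object):
    """Illustrative k-nearest-neighbors classifier (hypothetical, not sklearn's)."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # "Training" a KNN model just means remembering the labelled points.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self  # returning self allows TinyKNN(3).fit(X, y)

    def predict(self, X):
        return [self._predict_one(row) for row in X]

    def _predict_one(self, row):
        # Squared Euclidean distance from this row to every stored point.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(row, stored)), label)
            for stored, label in zip(self.X_, self.y_)
        ]
        dists.sort(key=lambda pair: pair[0])
        nearest = [label for _, label in dists[:self.n_neighbors]]
        # Majority vote among the k nearest labels.
        return Counter(nearest).most_common(1)[0][0]

# Made-up data: two well-separated clusters.
toy_X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]]
toy_y = [0, 0, 0, 1, 1, 1]
model = TinyKNN(n_neighbors=3).fit(toy_X, toy_y)
guesses = model.predict([[1.1, 1.0], [5.0, 5.1]])
```

Each new point is assigned the label held by most of its three closest neighbors, which is the "vote of the nearby crowd" idea behind KNN.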

Let's re-assign the data to standard named variables

In [None]:
X = sk_iris.data
y = sk_iris.target
Names = sk_iris.target_names

Split the data into training set and test set

In [None]:
# is there a function to do that in sklearn?
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.20, random_state=0)
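
To see roughly what `test_size` and `random_state` control, here is a simplified shuffle-and-split in plain Python. The function `simple_split`, its `seed` parameter, and the rounding rule are assumptions made for this sketch; sklearn's own implementation differs in detail and will not produce the same shuffle.

```python
import math
import random

def simple_split(X, y, test_size=0.20, seed=0):
    """Illustrative shuffle-and-split (hypothetical helper, not sklearn's)."""
    indices = list(range(len(X)))
    # A fixed seed gives the same shuffle, and so the same split, on every run.
    random.Random(seed).shuffle(indices)
    n_test = int(math.ceil(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up data: 10 rows, each labelled by whether its first value is odd.
rows = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_split(rows, labels, test_size=0.20, seed=0)
```

The key property is that rows and labels are shuffled together, so every row keeps its own label on whichever side of the split it lands.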

In [None]:
# Train KNN classifier defined function on the train data
from sklearn.neighbors import KNeighborsClassifier

In [None]:
myknn = KNeighborsClassifier(3).fit(X_train,y_train)

Let's figure out how good our model is. The most common score is the percentage of labels we identified correctly. This is called **accuracy**. There are other types of statistical scores, but we will start here. We'll ask our model to predict the labels for our test set, then generate a score.

In [None]:
myknn.predict(X_test)

In [None]:
correct = 0

# We'll build this together.

print("Number correct: {}".format(correct))
print("Score: {}".format(float(correct) / len(y_test)))
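
One way the counting loop could look, sketched on made-up labels rather than the lab's `y_test` so that it runs on its own. The function name `accuracy` and both label lists are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            correct += 1
    return float(correct) / len(y_true)

# Made-up labels, purely for illustration: one of six predictions is wrong.
true_labels = [0, 1, 2, 2, 1, 0]
predicted = [0, 1, 2, 1, 1, 0]
score = accuracy(true_labels, predicted)
```

In the lab itself, the same loop would compare `y_test` against `myknn.predict(X_test)`.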

That was easy enough. Sklearn classifiers also have a `.score` method that computes this value for us.

In [None]:
myknn.score(X_test, y_test)

Sklearn can also show more detail about the predictions. Here, we use `sklearn.metrics.classification_report`, which reports precision, recall, f1-score, and support for each class. The Wikipedia page on precision and recall is a good starting point if you're looking to understand more.

https://en.wikipedia.org/wiki/Precision_and_recall
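
As a rough illustration of what those per-class numbers mean, the sketch below computes them by hand for a single class. The function name `precision_recall_f1` and the label lists are made up for this example, and returning 0.0 when a denominator is zero is a choice of the sketch.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, f1 and support for one class, treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    # Precision: of everything predicted as this class, how much was right?
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly in this class, how much did we find?
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    # f1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    support = tp + fn  # number of true examples of this class
    return precision, recall, f1, support

# Made-up labels: one virginica is mistaken for a versicolor.
true_labels = ['setosa', 'versicolor', 'virginica', 'virginica', 'versicolor', 'setosa']
predicted = ['setosa', 'versicolor', 'virginica', 'versicolor', 'versicolor', 'setosa']
p, r, f1, support = precision_recall_f1(true_labels, predicted, 'versicolor')
```

Here every true versicolor is found, so recall is perfect, but one of the three versicolor predictions is wrong, which lowers precision.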

In [None]:
from sklearn import metrics

print(metrics.classification_report([sk_iris['target_names'][label] for label in y_test],
                                    [sk_iris['target_names'][label] for label in myknn.predict(X_test)]))

## Exercise #1
### How does the model perform when you increase the number of neighbors?  

### Can you plot the score as a function of the number of neighbors?

### How much do the scores vary each time you shuffle and split?


## Appendix: Bokeh.charts & Bar Chart Example

Some of you have noticed the `bokeh.charts` module, so I'll discuss it briefly here. `bokeh.charts` is high level, meaning it abstracts away a lot of detail and generates plots more easily than `bokeh.plotting`. In practice, this means less control but faster visuals. Stick with `bokeh.plotting` for now, but if you find that you really need bar charts, here is an example of how to generate one.

In [None]:
flowers = []
for val in sk_iris['target']:
    flowers.append(sk_iris['target_names'][val])

iris['target'] = flowers
iris_agg = iris.groupby('target').mean()
iris_agg

In [None]:
data = {}
for target in iris_agg.index:
    data[target] = iris_agg.loc[target].values
data

In [None]:
from bokeh.charts import Bar, show

p=Bar(data, cat=list(iris_agg.columns), title="Bar example",
        xlabel='Flowers', ylabel='Average Length (cm)', width=600, height=600, legend="top_right")
show(p)