# GA Data Science 18 (DAT18) - Lab5

### Plotting and KNN



# 1. Plotting Apple Stock Prices

In [None]:
import pandas.io.data
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
aapl = pd.io.data.get_data_yahoo('AAPL', 
                                 start=datetime.datetime(2015, 4, 1), 
                                 end=datetime.datetime(2015, 4, 28))
aapl.head()

In [None]:
fig = plt.figure(figsize=(20,16))

ax = fig.add_subplot(2,2,1)
ax.plot(aapl.index, aapl['Close'])

## Wait... what just happened:

We pulled stock price data straight from Yahoo through some [pandas magic](http://pandas.pydata.org/pandas-docs/stable/remote_data.html), then plotted a simple line graph of the closing price by referencing a single column, `aapl['Close']`.
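
To see what "referencing a single column" gives you without hitting the network, here is a small sketch on a hand-built dataframe. The prices are made up for illustration (they are not real AAPL quotes), and the names `prices` and `close` are just placeholders.

```python
import pandas as pd

# Made-up prices for illustration only; these are NOT real AAPL quotes
dates = pd.date_range('2015-04-01', periods=5, freq='D')
prices = pd.DataFrame({'Open': [10.0, 10.5, 10.2, 10.8, 11.0],
                       'Close': [10.4, 10.3, 10.7, 10.9, 11.2]},
                      index=dates)

# Selecting one column returns a Series that keeps the date index,
# which is why it can be plotted against time directly
close = prices['Close']
print(type(close).__name__)   # Series
print(close.max())            # 11.2
```

Because the Series carries its own index, `ax.plot(close.index, close)` and `ax.plot(close)` should draw the same line.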
 


In [None]:
# Put the cursor inside the parentheses and press Shift+Tab to find out more
pd.io.data.get_data_yahoo()

In [None]:
# What's that datetime business?
datetime.datetime(2015, 4, 1)
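
`datetime.datetime(year, month, day)` builds a timestamp object from Python's standard library, and the `start` and `end` arguments above are two of them. A quick sketch of what these objects can do (the variable names are only for illustration):

```python
import datetime

# The same start and end dates used in the AAPL request above
start = datetime.datetime(2015, 4, 1)
end = datetime.datetime(2015, 4, 28)

# Subtracting two datetimes gives a timedelta
span = end - start
print(span.days)                    # 27

# Datetimes compare chronologically and can be formatted as strings
print(start < end)                  # True
print(start.strftime('%Y-%m-%d'))   # 2015-04-01
print(start.weekday())              # 2 (Monday is 0, so a Wednesday)
```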

### Some subplots to show what's possible

In [None]:
fig = plt.figure(figsize=(20,16))

ax = fig.add_subplot(2,2,1)
ax.plot(aapl.index, aapl['Close'])
ax.set_title('Line plots', size=24)

ax = fig.add_subplot(2,2,2)
ax.plot(aapl['Close'], 'o')
ax.set_title('Scatter plots', size=24)

# generate the sample before plotting it: 1000 draws from a normal distribution
mu, sigma = 0, 0.1
normal_dist = np.random.normal(mu, sigma, 1000)

ax = fig.add_subplot(2,2,3)
ax.hist(normal_dist, bins=50)
ax.set_title('Histograms', size=24)
ax.set_ylabel('count', size=16)

ax = fig.add_subplot(2,2,4)
ax.boxplot(normal_dist)
ax.set_title('Boxplots', size=24)
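
The histogram and the boxplot summarize the same kind of sample in two ways. As a rough sketch of the numbers behind a boxplot, the snippet below draws a similar sample with the standard library and computes its quartiles. It uses `random.gauss` instead of `np.random.normal`, and `statistics.quantiles` may interpolate slightly differently from matplotlib, so treat the values as approximate.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
mu, sigma = 0, 0.1
sample = [random.gauss(mu, sigma) for _ in range(1000)]

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1

print("mean  %.3f" % statistics.mean(sample))
print("stdev %.3f" % statistics.stdev(sample))
print("Q1 %.3f  median %.3f  Q3 %.3f  IQR %.3f" % (q1, median, q3, iqr))
```

For a normal distribution the interquartile range is roughly 1.35 times sigma, so the box should be about 0.13 to 0.14 tall here.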

-------------
#### Now let's check out Bokeh

Bokeh is built by the same people who created Anaconda (Continuum Analytics) and is designed for web display out of the box, which makes it handy for quickly creating presentation-ready, interactive visuals. Labs in this course will be shown in Bokeh. Check out http://bokeh.pydata.org/en/latest/docs/quickstart.html#concepts to see the range of its capabilities.

In [None]:
from bokeh.plotting import figure, output_notebook, show
output_notebook()

In [None]:
# prepare some data
x = aapl.Low
y = aapl['High']

# create a new plot with a title and axis labels
p = figure(title="Stock High vs. Low", x_axis_label='Low', y_axis_label='High')

# add a circle (scatter) renderer with legend and outline thickness
p.circle(x, y, legend="High vs. Low", line_width=2)

# show the results
show(p)

### Exercise 1: 

On your own (using Bokeh or Matplotlib), do the following:

1. Get the open and close prices for Facebook's stock (ticker = FB) for the same date range we used for AAPL.
2. Join the close prices for each stock into a single dataframe.
3. Use a scatter plot to see if there is a relationship between Apple's close price and Facebook's close price.

In [None]:
#Your code here:









______________
# 2. SkLearn: Using datasets and KNN

Where do we find Machine Learning algorithms in python?

    sklearn - http://scikit-learn.org/stable/

Scikit-learn is a large collection of tools for data mining and data analysis. It contains the base algorithms for many machine learning strategies, along with mature data processing and model selection capabilities. A wide range of complex products can be built with sklearn.

In [None]:
# load the data
iris = pd.read_csv('data/iris.csv')

What does the iris data set look like?

In [None]:
iris.head()

In [None]:
#How many different types of iris are there? 
iris.iris_type.unique()

### Exercise 2:

Using the data cleaning techniques we have learned so far, do the following:

Create a new column called 'target' that holds the value 0 if the row is a setosa, 1 if it's versicolor, and 2 if it's virginica.

Hint: write a custom function and use `apply`.
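
If `apply` is new to you, here is the general pattern on toy data unrelated to the iris file. The dataframe `shirts`, its column names, and the function `size_to_code` are all made up for this sketch.

```python
import pandas as pd

# Toy data unrelated to the iris file; names and labels are made up
shirts = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

def size_to_code(label):
    """Map a text label to an integer code."""
    if label == 'small':
        return 0
    elif label == 'medium':
        return 1
    return 2

# apply calls the function once per value and collects the results in a new Series
shirts['size_code'] = shirts['size'].apply(size_to_code)
print(shirts['size_code'].tolist())   # [0, 2, 1, 0]
```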

In [None]:
def convert_type(x):
    #your code
    return x

iris['target'] = #use the apply method

#### Inspecting Data Visually

In [None]:
from bokeh.plotting import figure,output_notebook,show,VBox,HBox,gridplot 
import numpy as np
import pandas as pd

#display matplotlib items in the notebook (used with pd.DataFrame.plot())
%matplotlib inline 
output_notebook() #display bokeh visuals within the notebook 

As a dataframe we can do some quick exploration and understand more about our data. 

In [None]:
iris.plot(kind="scatter",x=1,y=2,c='r',title="Base Visual")

But something is missing here. We have what we'd call **labelled data**. Even though all of our data sits in one dataframe and is drawn in a single color, each row has been labelled with the scientific name of its iris species. We can also call this the "target" data: the label we use to classify each observation. To work with the labels, we use the `target` column we added to the dataset.

In [None]:
iris['target']

In [None]:
# We have three possible values: 0, 1, or 2. We can construct a vector of colors to 
# make our plot easier to read.

colors = []
for target in iris['target']:
    if target == 0:
        colors.append('red')
    elif target == 1:
        colors.append('orange')
    elif target == 2:
        colors.append('blue')
print(colors)

In [None]:
#another way to build that list, using list comprehensions:
colorMap = {0:'red',1:'orange',2:'blue'}
colors = [ colorMap[x] for x in iris['target'] ]
print(colors)

In [None]:
#We can pass our list of colors to the plot like so to get a better visual of what's going on.

iris.plot(kind='scatter',x=3,y=1,c=colors)

Great start, but if we want a more advanced, prettier visualization, let's use Bokeh.

In [None]:
feat_x = iris.columns[1]
feat_y = iris.columns[3]

p1 = figure(plot_width=400, plot_height=400, 
            x_axis_label=feat_x, y_axis_label=feat_y)
p1.circle(iris[feat_x], iris[feat_y], line_width=1, color=colors, alpha=0.4,size=8)

show(p1)

Notice that because `alpha` makes each marker partly transparent, we can see where points overlap: bolder colors mean more data points at that spot.

This is only 1 of many plots we can make. Let's generate the entire set programmatically!

In [None]:
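plots = []
# plot only the numeric feature columns; the last two columns (iris_type, target) are labels
features = iris.columns[:-2]
for feat_x in features:
    for feat_y in features: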
        
        temp_p = figure(plot_width=200, 
                        plot_height=200, 
                        x_axis_label=feat_x, 
                        y_axis_label=feat_y
                       )
        temp_p.circle(iris[feat_x], 
                      iris[feat_y], 
                      line_width=1, 
                      color=colors, 
                      alpha=0.4,
                      size=5)
        
        temp_p.xaxis.axis_label_text_font_size = '9pt'
        temp_p.yaxis.axis_label_text_font_size = '9pt'

        plots.append(temp_p)

# gridplot takes nested lists of bokeh figures and arranges them on the grid in the positions given. 
# Passing None inserts a blank.

# To make a square grid, we reshape the array so the # of rows equals the # of columns.
# reshape needs integer dimensions, so cast the square root to int.
side = int(round(len(plots)**0.5))
gplots = np.array(plots).reshape(side, side)

# REMEMBER: gridplot takes a list of lists, so we convert gplots with the .tolist() method
a = gridplot(gplots.tolist())
show(a)
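
The comments above note that gridplot wants a list of lists and that `None` leaves a blank cell. If the number of plots is not a perfect square, a reshape will not work, so here is one possible way to build the nested list by hand. `to_grid` is a hypothetical helper written for this sketch, and strings stand in for the bokeh figures.

```python
def to_grid(items, n_cols):
    """Split a flat list into rows of n_cols items, padding the last row with None."""
    rows = [items[i:i + n_cols] for i in range(0, len(items), n_cols)]
    if rows and len(rows[-1]) < n_cols:
        rows[-1] = rows[-1] + [None] * (n_cols - len(rows[-1]))
    return rows

labels = ['p%d' % i for i in range(7)]   # stand-ins for seven bokeh figures
grid = to_grid(labels, 3)
for row in grid:
    print(row)
# ['p0', 'p1', 'p2']
# ['p3', 'p4', 'p5']
# ['p6', None, None]
```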


This is a very quick way to visually inspect your data.

# 3. KNN

A big part of this lab is understanding general sklearn syntax. Each family of algorithms has its own knobs and levers for tuning, but the models share a general structure that will help you as you move forward.

1. All models need to be trained. Sklearn models have a `.fit` method for doing so.
2. We need to use the model to make a guess. The `.predict` method takes data and returns the model's guess for the value. The exact requirements depend on the specific model.
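
To make the fit/predict pattern concrete, here is a from-scratch sketch of a nearest-neighbors classifier in plain Python. `TinyKNN` is a hypothetical class written for this lab, not sklearn's implementation: it assumes Euclidean distance and a simple majority vote, and it ignores tie-breaking and efficiency.

```python
from collections import Counter
import math

class TinyKNN:
    """Hypothetical from-scratch KNN that mimics the fit/predict pattern."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # KNN "training" just stores the labelled points
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        return [self._predict_one(row) for row in X]

    def _predict_one(self, row):
        # Rank training points by Euclidean distance to the query point
        order = sorted(range(len(self.X_)),
                       key=lambda i: math.dist(row, self.X_[i]))
        nearest = [self.y_[i] for i in order[:self.n_neighbors]]
        # Majority vote among the k nearest labels
        return Counter(nearest).most_common(1)[0][0]

# Two well-separated toy clusters (made-up points)
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

model = TinyKNN(3).fit(X_toy, y_toy)
print(model.predict([[1.1, 1.0], [5.1, 5.0]]))   # [0, 1]
```

Notice that `fit` returns the model itself, which is what lets sklearn-style code chain the constructor and `.fit` on one line.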

Let's re-assign the data to standard named variables

In [None]:
iris.iloc[:, :-2].values

In [None]:
X = iris.iloc[:, :-2].values
y = iris.target.values
Names = iris.iris_type

Split the data into training set and test set

In [None]:
# is there a function to do that in sklearn?
# yes: train_test_split (it lived in sklearn.cross_validation before scikit-learn 0.18)
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.20, random_state=0)
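
Conceptually, splitting means shuffling the rows and holding some of them out for testing. Here is a rough sketch of that idea in plain Python. `simple_split` is a hypothetical helper, not sklearn's implementation, and details such as rounding or stratification may differ from `train_test_split`.

```python
import random

def simple_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a test_size fraction of the rows."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)   # seeded, like random_state
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up rows; each label is derived from its row so alignment is easy to check
X_demo = [[i, i * 2] for i in range(10)]
y_demo = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_split(X_demo, y_demo, test_size=0.2, seed=0)
print(len(X_tr), len(X_te))   # 8 2
```

The key point is that features and labels are indexed with the same shuffled positions, so every row keeps its own label.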

In [None]:
# Train a KNN classifier on the training data
from sklearn.neighbors import KNeighborsClassifier

In [None]:
myknn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

Let's figure out how good our model is. The most basic score is the percentage of labels we identified correctly. This is called **accuracy**. There are other statistical scores, such as precision and recall, but we will start here. We'll ask our model to predict the labels for our test set, then generate a score.

In [None]:
myknn.predict(X_test)

In [None]:
correct = 0

# We'll build this together.

print("Number correct: {}".format(correct))
print("Score: {}".format(float(correct)/len(y_test)))

That was easy enough. Sklearn also has an easy method for generating a score. 

In [None]:
myknn.score(X_test, y_test)
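
For a classifier such as `KNeighborsClassifier`, `.score` returns the mean accuracy: the fraction of test rows whose predicted label matches the true one. A minimal sketch of that calculation on made-up labels (toy values, not output from the iris model):

```python
# Toy labels, not output from the iris model
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

# Count the positions where prediction and truth agree
matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = matches / len(y_true)
print(matches, accuracy)   # 6 0.75
```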

Sklearn also has a way of showing more information about the prediction. Here, we're using `sklearn.metrics.classification_report` to generate a more informative picture, with precision, recall, f1-score, and support for each class. The Wikipedia page on precision and recall is a good place to start if you're looking to understand more.

https://en.wikipedia.org/wiki/Precision_and_recall

In [None]:
from sklearn import metrics

# class names in the same order as the 0/1/2 codes in the 'target' column
target_names = ['setosa', 'versicolor', 'virginica']

print(metrics.classification_report(y_test, myknn.predict(X_test),
                                    target_names=target_names))
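
To see where the columns of the report come from, here is a sketch that computes precision, recall, f1, and support per class by counting. `per_class_report` is a hypothetical helper for illustration, the labels are made up, and the real report also includes averaged rows that this sketch leaves out.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} computed by counting."""
    report = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)   # true + false positives
        support = sum(1 for t in y_true if t == label)     # true positives + false negatives
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1, support)
    return report

# Toy labels, not output from the iris model
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(label, ["%.2f" % v for v in row[:3]], row[3])
# 0 ['1.00', '1.00', '1.00'] 2
# 1 ['0.67', '0.67', '0.67'] 3
# 2 ['0.67', '0.67', '0.67'] 3
```

Precision asks "of the rows I labelled as this class, how many were right?", while recall asks "of the rows that really are this class, how many did I find?".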

## Exercise #3

### How does the model perform when you increase the number of neighbors?  

### Can you plot the score as a function of the number of neighbors?

In [None]:
#your code here






### How much do the scores vary each time you shuffle and split?


In [None]:
#Your Code here




