Lesson 1 - Open Data Science for everyone
=========================================

1.1 Use Anaconda Repository for data science artifacts
------------------------------------------------------

Open Data Science is about powerful, easy access to the best numerical computing, data processing, and visualization tools available today.

With [Anaconda](http://continuum.io/downloads) installed, you can use Navigator to start a Jupyter Notebook session and get started right away, which is exactly what we're going to do.

<center>
<img src=http://ijstokes-public.s3.amazonaws.com/dspyr/img/AnacondaCIO_Logo width=400 />
</center>

This first lesson quickly demonstrates how easy it is to perform key aspects of the data science workflow:

* data ingest
* data manipulation
* data visualization
* data analysis
* data modeling

You aren't expected to follow every step exactly.  We'll go into the details later.  For now, this overview should motivate you to stick with the course and get the most out of the exciting world of Open Data Science using the Anaconda platform.

Navigate to [Anaconda Cloud at http://anaconda.org](http://anaconda.org), which provides a public instance of Anaconda Repository, and go to the course account `datasciencepythonr`.  From there you can get a template version of the Notebook for each lesson that will allow you to follow along:
<p>
<font size=+1>
<a href=https://anaconda.org/datasciencepythonr/1-open-data-science-for-everyone>https://anaconda.org/datasciencepythonr/1-open-data-science-for-everyone</a>
</font>
</p>

1.2 Use Anaconda Navigator to open and run Jupyter Notebooks
------------------------------------------------------------

Start Anaconda Navigator and click the *"Launch"* button to start Jupyter Notebook:

![Anaconda Navigator](http://ijstokes-public.s3.amazonaws.com/dspyr/img/screenshot_anaconda_navigator.png)

### Open Data Science = Your Toolbox, Your Tools

Anaconda hands you a well-stocked toolbox for Open Data Science. More importantly, it gives you the agility to work quickly and easily on many different problems, and the flexibility to customize the tools you have on hand through the Conda package management system.

Windows, Mac, and Linux: everyone can work from the same starting point.  No administrator rights required, no interfering with software and tools you already have installed.

1.3 Perform fundamental Jupyter operations
------------------------------------------

More tips are available under the *Help* menu entry *Keyboard shortcuts*.

* Cells
* Execute
    * `SHIFT-ENTER` to execute a cell and go to the next cell
    * `CTRL-ENTER` to execute a cell but keep focus on the same cell
    * `ALT-ENTER` to execute a cell and insert a new cell after it
* Stop
* Insert new cell
* Re-order cells
* Clear output
* Restart kernel

1.4 Ingest, analyze, and clean data with Pandas
-----------------------------------------------
<center>
<img src=http://pandas.pydata.org/_static/pandas_logo.png />
</center>
Reference: [Pandas documentation](http://pandas.pydata.org)

### Get Your Hands Dirty, and Code First

Once upon a time the coding part was complicated, time consuming, and only possible for experts who were often far removed from the line-of-business objectives they were trying to serve.  Today you need to move beyond spreadsheets for data insights and be empowered to find answers yourself, with little or no reference to a distant analytics team.

Anaconda was created to give anyone who can use Excel the tools they need to ingest data, analyze it, visualize it, build interactive apps, and deploy production analysis models at scale.

So I will start by demonstrating a simple analysis workflow on a well-known sample data set of roughly 400 cars from 1970 to 1982.

In [1]:
import pandas as pd

pd.set_option("display.max_rows",10)

Normally we wouldn't read data this way, but this sample data is bundled with Bokeh for convenience.

In [2]:
from bokeh.sampledata.autompg import autompg
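
For comparison, data more typically arrives as a file or a URL and is loaded with `pandas.read_csv`. The sketch below is only an illustration of that pattern: it assumes a tiny CSV held in a string (so the example is self-contained) with a few rows and columns copied from the sample data, and the name `cars` is made up here.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk or at a URL (assumption for illustration).
csv_text = """mpg,cyl,hp,name
18.0,8,130,chevrolet chevelle malibu
15.0,8,165,buick skylark 320
26.0,4,46,volkswagen 1131 deluxe sedan
"""

# read_csv accepts a path, a URL, or any file-like object.
cars = pd.read_csv(io.StringIO(csv_text))
print(cars.shape)
print(cars.dtypes)
```

Replacing `io.StringIO(csv_text)` with a file path is the usual form in practice.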

In a Jupyter Notebook, a cell whose last line is just the name of a variable will show you a "representation" of that variable. In the case of a `pandas.DataFrame` we'll see a data table.

In [3]:
autompg

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79,2625,18.6,82,1,ford ranger


Look at the data, sorted by the fuel efficiency dimension `mpg`:

In [4]:
autompg.sort_values(by='mpg')

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
28,9.0,8,304.0,193,4732,18.5,70,1,hi 1200d
26,10.0,8,307.0,200,4376,15.0,70,1,chevy c20
25,10.0,8,360.0,215,4615,14.0,70,1,ford f250
27,11.0,8,318.0,210,4382,13.5,70,1,dodge d200
123,11.0,8,350.0,180,3664,11.0,73,1,oldsmobile omega
...,...,...,...,...,...,...,...,...,...
324,43.4,4,90.0,48,2335,23.7,80,2,vw dasher (diesel)
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
323,44.3,4,90.0,48,2085,21.7,80,2,vw rabbit c (diesel)
327,44.6,4,91.0,67,1850,13.8,80,3,honda civic 1500 gl
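
`sort_values` sorts in ascending order by default, which is why the least efficient vehicles appear first. A minimal sketch of the reverse ordering, assuming a small stand-in frame (`cars`, a name chosen here) built from a few rows of the output above:

```python
import pandas as pd

# A handful of rows copied from the sample data, for illustration only.
cars = pd.DataFrame({
    "mpg": [9.0, 44.6, 18.0, 44.3],
    "name": ["hi 1200d", "honda civic 1500 gl",
             "chevrolet chevelle malibu", "vw rabbit c (diesel)"],
})

# ascending=False puts the most fuel-efficient cars first;
# head(2) keeps only the top two rows.
best = cars.sort_values(by="mpg", ascending=False).head(2)
print(best)
```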


Notice the first word in each name entry is the *make*.  We'll use this to add a `make` field to the `pandas.DataFrame` object:

In [5]:
autompg['make'] = pd.Series((n.split()[0] for n in autompg.name), index=autompg.index)
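
An alternative way to extract the first word is the vectorized `.str` accessor that pandas provides on string columns. The following is a sketch on a three-row stand-in frame (the name `cars` is an assumption), not a replacement for the cell above:

```python
import pandas as pd

# Three names copied from the sample data.
cars = pd.DataFrame({"name": ["ford torino", "vw pickup", "amc rebel sst"]})

# .str applies string methods element-wise:
# split() breaks each name on whitespace, .str[0] takes the first token.
cars["make"] = cars["name"].str.split().str[0]
print(cars)
```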

Let's look at a sorted list of the values in this new dimension:

In [6]:
sorted(autompg.make.unique())

['amc',
 'audi',
 'bmw',
 'buick',
 'cadillac',
 'capri',
 'chevroelt',
 'chevrolet',
 'chevy',
 'chrysler',
 'datsun',
 'dodge',
 'fiat',
 'ford',
 'hi',
 'honda',
 'maxda',
 'mazda',
 'mercedes',
 'mercedes-benz',
 'mercury',
 'nissan',
 'oldsmobile',
 'opel',
 'peugeot',
 'plymouth',
 'pontiac',
 'renault',
 'saab',
 'subaru',
 'toyota',
 'toyouta',
 'triumph',
 'vokswagen',
 'volkswagen',
 'volvo',
 'vw']

Several makes have spelling errors or inconsistencies. Pandas can help us fix those in place:

In [8]:
autompg.loc[autompg.make == 'chevroelt', 'make'] = 'chevrolet'
autompg.loc[autompg.make == 'chevy',     'make'] = 'chevrolet'
autompg.loc[autompg.make == 'maxda',     'make'] = 'mazda'
autompg.loc[autompg.make == 'mercedes',  'make'] = 'mercedes-benz'
autompg.loc[autompg.make == 'toyouta',   'make'] = 'toyota'
autompg.loc[autompg.make == 'vokswagen', 'make'] = 'volkswagen'
autompg.loc[autompg.make == 'vw',        'make'] = 'volkswagen'
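
The same corrections can also be written as a single mapping passed to `Series.replace`. This is an alternative sketch rather than the approach used in the cell above; the short `makes` Series is made up for illustration, while the mapping mirrors the seven fixes listed.

```python
import pandas as pd

# Hypothetical sample of raw make values, including some misspellings.
makes = pd.Series(["chevy", "maxda", "vw", "ford", "toyouta"])

# Mapping from each misspelling or abbreviation to its canonical make.
corrections = {
    "chevroelt": "chevrolet", "chevy": "chevrolet",
    "maxda": "mazda", "mercedes": "mercedes-benz",
    "toyouta": "toyota", "vokswagen": "volkswagen", "vw": "volkswagen",
}

# replace() swaps values that match a key exactly and leaves the rest alone.
cleaned = makes.replace(corrections)
print(cleaned.tolist())
```

Keeping the corrections in one dictionary makes it easy to review them and to add new ones as further inconsistencies turn up.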

In [11]:
with pd.option_context('display.max_rows', 999):
    print(autompg.groupby('make').size())

[1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 6, 6, 7, 8, 8, 10, 11, 12, 13, 16, 17, 22, 23, 26, 27, 28, 31, 47, 48]
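
`groupby('make').size()` counts the rows in each group and returns a Series indexed by the group key. A minimal sketch on a few hypothetical rows (the frame `cars` is an assumption, not the course data):

```python
import pandas as pd

# Five made-up rows; only the make column matters here.
cars = pd.DataFrame({"make": ["ford", "honda", "ford", "volkswagen", "ford"]})

# One entry per distinct make, holding the number of rows for that make.
counts = cars.groupby("make").size()
print(counts.sort_values(ascending=False))
```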


1.5 Visualize data with Bokeh
-----------------------------
<center>
<img src=https://bokeh.github.io/images/logo.svg />
</center>
Reference: [Bokeh documentation](http://bokeh.pydata.org)

In [12]:
from bokeh.charts import Scatter, Histogram
from bokeh.models import HoverTool
from bokeh.io import output_notebook, show

output_notebook()

What trend do we observe in this data set?
What was happening through the 1970s that might cause this?

In [13]:
s = Scatter(data=autompg, x='yr', y='mpg', height=400)
show(s)

Let's look at just one make from each of the US, Germany, and Japan:

In [14]:
s = Scatter(data=autompg[autompg.make.isin('ford volkswagen honda'.split())],
            x='yr', y='mpg', color='make', height=400)
show(s)

How did we do that? With a sub-select on the `autompg` object using something called *"fancy indexing"* (the pandas documentation calls it boolean indexing) to select just the rows whose make is one of `ford`, `volkswagen`, or `honda`.

In [15]:
autompg[autompg.make.isin('ford volkswagen honda'.split())]

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name,make
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,ford
5,15.0,8,429.0,198,4341,10.0,70,1,ford galaxie 500,ford
17,21.0,6,200.0,85,2587,16.0,70,1,ford maverick,ford
19,26.0,4,97.0,46,1835,20.5,70,2,volkswagen 1131 deluxe sedan,volkswagen
25,10.0,8,360.0,215,4615,14.0,70,1,ford f250,ford
...,...,...,...,...,...,...,...,...,...,...
378,32.0,4,91.0,67,1965,15.7,82,3,honda civic (auto),honda
383,22.0,6,232.0,112,2835,14.7,82,1,ford granada l,ford
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl,ford
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup,volkswagen
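
The sub-select happens in two steps, which the following sketch separates out. It assumes a four-row stand-in frame (`cars`, with values taken from rows of the sample data) so that the intermediate mask is easy to read:

```python
import pandas as pd

cars = pd.DataFrame({
    "make": ["ford", "dodge", "honda", "volkswagen"],
    "mpg":  [17.0, 12.0, 32.0, 44.0],
})

# Step 1: isin() yields one True/False value per row.
mask = cars["make"].isin(["ford", "volkswagen", "honda"])
print(mask.tolist())

# Step 2: indexing the frame with the mask keeps only the True rows.
subset = cars[mask]
print(subset)
```

The original index labels are preserved in `subset`, which is why the filtered table above still shows row labels such as 4, 5, and 17.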


Use Bokeh to add some simple interactivity:

In [17]:
s = Scatter(data=autompg[autompg.make.isin('ford volkswagen honda'.split())],
            x='yr', y='mpg', color='make',
            height=400, width=800,
            title='Fuel efficiency of selected vehicles from 1970-1982',
            tools='hover, box_zoom, lasso_select, save, reset',
            tooltips=[
                ('Make','@make'),
                ('MPG','@mpg'),
                ('hp','@hp')]
           )

show(s)

1.6 Create machine learning and predictive models with Scikit-Learn
--------------------------------------------------------------------
<center>
<img src=http://scikit-learn.org/stable/_images/scikit-learn-logo-notext.png />
</center>
Reference: [Scikit-Learn documentation](http://scikit-learn.org/)

In [18]:
from sklearn.svm          import SVC                    as support_vector_classifier
from sklearn.ensemble     import RandomForestClassifier as random_forest_classifier
from sklearn.neighbors    import KNeighborsClassifier   as knn_classifier
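# LinearRegression is a regressor; the alias keeps the naming pattern used above
from sklearn.linear_model import LinearRegression       as linear_regression_classifier

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split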



In [19]:
train, test = train_test_split(autompg, train_size=0.80, random_state=123)
train = train.copy()
test  = test.copy()
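
To see what the split does, here is a small sketch on ten hypothetical rows (the frame `data` and the names `train_part` and `test_part` are assumptions). With `train_size=0.80`, eight rows go to training and the remaining two to testing, and fixing `random_state` makes the shuffle repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows; only the row count matters here.
data = pd.DataFrame({"x": range(10), "y": range(10, 20)})

train_part, test_part = train_test_split(data, train_size=0.80, random_state=123)
print(len(train_part), len(test_part))

# Every original row lands in exactly one of the two parts.
all_rows = sorted(train_part.index.tolist() + test_part.index.tolist())
print(all_rows)
```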

### Linear Regression Model

In [20]:
model = linear_regression_classifier()
model.fit(train['cyl displ hp weight accel yr'.split()],
          train['mpg'].values.ravel())

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [21]:
model.score(test['cyl displ hp weight accel yr'.split()],
#            test['mpg'].astype(int).values.ravel()) # for discrete predictions
            test['mpg'].values.ravel())              # for continuous predictions

0.75077542748160853
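
For a regression model, `score` returns the coefficient of determination, R². The sketch below computes R² by hand and checks it against `score`, assuming a tiny one-feature data set invented for the purpose (`X`, `y`, and `reg` are not part of the lesson's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y is roughly 2*x + 1 with a little noise.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, reg.score(X, y))
```

A value of 1.0 would mean the predictions match the targets exactly, so a score near 0.75 says the model accounts for roughly three quarters of the variance in `mpg` on the test set.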

In [22]:
predictions = model.predict(test['cyl displ hp weight accel yr'.split()])
predictions

array([ 16.93513471,  31.44386166,  13.92513989,  25.33100232,
        30.86517173,  16.12912297,  29.84316806,  20.79424901,
        17.51682663,  33.5962678 ,  15.16484689,  24.05681003,
        13.43861853,  30.4849891 ,  15.96175622,  22.14058281,
        29.18249959,   7.26444002,  12.34592112,  14.11414099,
        22.31888799,  28.78553786,  29.70553983,  35.29035859,
        34.53708373,  16.04079521,  26.86016519,  31.90293837,
        22.70010858,  27.39754987,  23.10237142,  31.48471994,
        17.0945224 ,  20.83682113,  27.73349153,  28.37221631,
        27.18683665,  27.98877592,  26.11857502,  11.08719175,
        17.4036681 ,  23.45466615,  25.0567287 ,  21.985068  ,
        20.55618207,  20.54065738,  28.84321132,  29.74538267,
        24.29508899,  20.39074464,  20.98390062,  33.4709278 ,
        16.27399269,  22.37692882,  22.76363708,  14.67741727,
         7.73764472,  24.38242977,  19.44958426,  30.11714922,
        16.5131058 ,   6.38634806,  24.59175118,  27.26

In [23]:
test['mpg']

220    17.0
245    39.4
134    16.0
147    24.0
390    28.0
       ... 
274    21.6
318    37.0
380    25.0
114    15.0
189    22.0
Name: mpg, dtype: float64

In [24]:
h = Histogram(predictions - test.mpg, height=400)
show(h)

In [25]:
delta          = predictions - test.mpg
test['mpg_lr'] = predictions
test['lr']     = delta
test[delta.abs() > 5]

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name,make,mpg_lr,lr
245,39.4,4,85.0,70,2070,18.6,78,3,datsun b210 gx,datsun,31.443862,-7.956138
327,44.6,4,91.0,67,1850,13.8,80,3,honda civic 1500 gl,honda,33.596268,-11.003732
273,17.0,6,163.0,125,3140,13.6,78,2,volvo 264gl,volvo,24.056810,7.056810
240,21.5,4,121.0,110,2600,12.8,77,2,bmw 320i,bmw,26.860165,5.360165
307,41.5,4,98.0,76,2144,14.7,80,2,vw rabbit,volkswagen,31.902938,-9.597062
...,...,...,...,...,...,...,...,...,...,...,...,...
42,13.0,8,400.0,170,4746,12.0,71,1,ford country squire (sw),ford,7.737645,-5.262355
41,12.0,8,383.0,180,4955,11.5,71,1,dodge monaco (sw),dodge,6.386348,-5.613652
272,20.3,5,131.0,103,2830,15.9,78,2,audi 5000,audi,26.233178,5.933178
274,21.6,4,121.0,115,2795,15.7,78,2,saab 99gle,saab,26.963536,5.363536


### Random Forest Classifier

In [26]:
model = random_forest_classifier(n_estimators=100)
model.fit(train['cyl displ hp weight accel yr'.split()],
          train['mpg'].astype(int).values.ravel())

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

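So how does Random Forest compare to Linear Regression? Keep in mind that `score` on a classifier reports accuracy (the fraction of test cars whose integer `mpg` was predicted exactly), whereas on the regression model it reported R², so the two numbers are not directly comparable.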

In [27]:
model.score(test['cyl displ hp weight accel yr'.split()],
            test['mpg'].astype(int).values.ravel()) # for discrete predictions
#            test['mpg'].values.ravel())              # for continuous predictions

0.13924050632911392
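
A classifier's `score` is plain accuracy: a prediction of 31 for a true value of 32 counts as a miss just like a prediction of 15 would. The sketch below illustrates this on a made-up data set (`X`, `y`, and `clf` are assumptions for illustration) by comparing a hand-computed accuracy with `score`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: two features and integer class labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]])
y = np.array([10, 10, 20, 20, 30, 30])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
pred = clf.predict(X)

# For classifiers, score() is the fraction of exactly matching labels.
accuracy_manual = np.mean(pred == y)
print(accuracy_manual, clf.score(X, y))
```

For a continuous target such as `mpg`, a regressor like `sklearn.ensemble.RandomForestRegressor` reports R² from `score`, which would make the comparison with Linear Regression like-for-like.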

In [28]:
predictions = model.predict(test['cyl displ hp weight accel yr'.split()])
predictions

array([15, 31, 14, 23, 31, 18, 23, 18, 15, 31, 16, 20, 15, 30, 15, 20, 26,
       13, 14, 15, 20, 26, 26, 39, 40, 15, 21, 33, 23, 26, 20, 27, 19, 19,
       19, 27, 24, 32, 23, 11, 17, 20, 21, 30, 18, 15, 26, 28, 21, 20, 25,
       38, 16, 17, 20, 16, 13, 21, 23, 31, 15, 13, 24, 24, 17, 39, 34, 20,
       20, 19, 24, 35, 27, 13, 19, 23, 22, 13, 20])

In [29]:
delta = predictions - test.mpg
test['mpg_rf'] = predictions
test['rf'] = delta
test[delta.abs() > 5]

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name,make,mpg_lr,lr,mpg_rf,rf
245,39.4,4,85.0,70,2070,18.6,78,3,datsun b210 gx,datsun,31.443862,-7.956138,31,-8.4
327,44.6,4,91.0,67,1850,13.8,80,3,honda civic 1500 gl,honda,33.596268,-11.003732,31,-13.6
307,41.5,4,98.0,76,2144,14.7,80,2,vw rabbit,volkswagen,31.902938,-9.597062,33,-8.5
381,38.0,6,262.0,85,3015,17.0,82,1,oldsmobile cutlass ciera (diesel),oldsmobile,27.733492,-10.266508,19,-19.0
329,29.8,4,89.0,62,1845,15.3,80,2,vokswagen rabbit,volkswagen,33.779161,3.979161,39,9.2
318,37.0,4,119.0,92,2434,15.0,80,3,datsun 510 hatchback,datsun,30.353193,-6.646807,23,-14.0


In [30]:
h = Histogram(delta, height=300)
show(h)

In [31]:
s = Scatter(data=test, x='rf', y='lr', height=300, width=700,
            tools='hover, box_zoom, save, reset',
            title='Comparing model predictions',
            xlabel='Random Forest',
            ylabel='Linear Regression',
            tooltips = [
              ('Make','@make'),
              ('MPG', '@mpg'),
              ('hp',  '@hp')
            ])

hover = s.select(dict(type=HoverTool))

hover.tooltips = [
    ('Make', '@make'),
    ('MPG', '@mpg'),
    ('hp', '@hp')
]

show(s)


Summary
=======
In this first lesson we explored a simple data analysis workflow and made use of three of the most popular packages in your Anaconda Open Data Science tool box:

* Pandas for data import and processing
* Bokeh for data visualization
* Scikit-Learn for machine learning and predictive modeling

Go Further
==========
1. Run this notebook one cell at a time (by pressing SHIFT-ENTER, which executes a cell and moves to the next one) and try experimenting with parameters. Interact with the visualizations.

2. Modify the scatter plot to examine different dimensions, such as *Year* vs. *Horsepower*, or *Displacement* vs. *Horsepower*.

3. Modify which features the `HoverTool` displays.

4. Try training the model with fewer features (e.g. try just *Displacement* and *Weight*) -- remember that when you exercise it to generate predictions you can only provide those features.

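5. Try using the *K-Nearest Neighbor* or *Support Vector Classifier* (imported earlier as `knn_classifier` and `support_vector_classifier`) in place of the Random Forest.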
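
As a starting point for exercises 4 and 5, here is a minimal sketch that trains a K-Nearest Neighbor classifier on just two features. It assumes five rows copied from the sample data (displacement, weight, and `mpg` truncated to an integer) and two made-up query cars; `n_neighbors=1` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Columns: displacement, weight. Labels: integer mpg.
X = np.array([[97, 1835], [91, 1850], [302, 3449], [429, 4341], [360, 4615]])
y = np.array([26, 44, 17, 15, 10])

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Predictions must be given the same two features, in the same order.
queries = np.array([[98, 1900], [400, 4400]])  # hypothetical cars
preds = knn.predict(queries)
print(preds)
```

Because weight spans a much larger numeric range than displacement, it dominates the distance calculation here; scaling the features first is a common refinement.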