![](https://d0.awsstatic.com/asset-repository/products/Amazon%20Machine%20Learning/MachineLearning_VideoThumbnail.png)

# Machine Learning

> The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn.

<footer>~ Alvin Toffler</footer>

![break](assets/agenda.png)

1. [What Is Machine Learning?](#What-is-Machine-Learning?)
1. [Machine Learning Problems](#Machine-Learning-Problems)
1. [Python Libraries](#Python-Libraries)

**Labs**:
1. [Numpy, Scipy, And Pandas](#)

![break](assets/theory.png)

## What _is_ Machine Learning?

> “Machine learning, a branch of artificial intelligence, is about the construction and study of
systems that can learn from data.”

<footer>~ Wikipedia</footer>

With machine learning we use _input_ variables to predict _output_ variables. That's it. But that's easier said than done …

Let's say we are data analysts working for Samsung and we are trying to figure out the best way to spend our stupendous marketing budget. How can we best predict _smartphone sales_ from the spend on our various channels? Samsung buys online advertising, runs billboard campaigns, and pays for product placement in TV shows.

Let $Y$ be the output variable (e.g. sales), and $X$ the _vector_ of input variables $X_{1},X_{2},X_{3},\ldots$, then

$$ Y = f(X)+\epsilon$$

We want to work out what $f$ is. $\epsilon$ is unavoidable noise that is independent of $X$.

How do we estimate $f$ from the data, and how do we evaluate our estimate? That is a question machine learning can answer.

### Prediction and its limits

Once we have an estimate $\hat{f}$ for $f$, we can predict unavailable values of $Y$ for known values of $X$:

$$ \hat{Y} = \hat{f}(X) $$

How good an estimate of $Y$ is $\hat{Y}$? The difference between the two values can be partitioned into **reducible** and **irreducible** errors:

$$E(Y-\hat{Y})^{2}=[f(X)-\hat{f}(X)]^{2}+\sigma^{2}_{\epsilon}$$

where $[f(X)-\hat{f}(X)]^{2}$ is the reducible error and $\sigma^{2}_{\epsilon}$, the variance of the noise $\epsilon$, is the irreducible error.
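
To make the decomposition concrete, here is a minimal simulation sketch. The functions `f` and `f_hat`, the noise level `sigma`, and the input `x0` are all made-up assumptions for illustration; only the decomposition itself comes from the formula above.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable

def f(x):
    # The "true" relationship (an assumption for this sketch)
    return 2.0 * x + 1.0

def f_hat(x):
    # A deliberately imperfect estimate of f
    return 1.8 * x + 1.2

sigma = 0.5   # standard deviation of the noise epsilon
x0 = 3.0      # a single fixed input value

# Draw many noisy observations Y = f(x0) + epsilon
y = f(x0) + rng.normal(0.0, sigma, size=200000)

mse = np.mean((y - f_hat(x0)) ** 2)     # estimate of E(Y - Y_hat)^2
reducible = (f(x0) - f_hat(x0)) ** 2    # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                # variance of epsilon

print("simulated error: %.3f" % mse)
print("reducible + irreducible: %.3f" % (reducible + irreducible))
```

A better `f_hat` shrinks only the first term; no estimate can push the expected error below `sigma ** 2`.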

### How to estimate $f$

Two main approaches:

#### Parametric

An assumption is made about the form of $f$. For example, the **linear model** states that

$$\hat{f}(X)=β_{0}+β_{1}X_{1}+β_{2}X_{2}+ \ldots +β_{p}X_{p}$$

Then we use the training data to choose the values of $β_{0},β_{1}, \ldots ,β_{p}$, the parameters.

* **Advantage**: Much easier to estimate parameters than the whole function.
* **Disadvantage**: Our choice of the form of $f$ might be wrong, or even very wrong.

We can try to make our parametric form more **flexible** in order to reduce the risk of choosing the wrong $f$, but this also makes $\hat{f}$ more complex, and it may follow the noise too closely, thereby **overfitting**.
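
One way to see the flexibility trade-off is the sketch below, which fits polynomials of increasing degree to made-up data (a straight line plus noise). The data, the degrees, and the helper `train_mse` are illustrative assumptions. The training error can only fall as the degree grows, which is exactly why a low training error on its own says little about how well $\hat{f}$ matches $f$.

```python
import numpy as np

rng = np.random.RandomState(1)

# Training data: a straight line plus noise (made-up numbers)
x_train = np.linspace(0.0, 1.0, 15)
y_train = 3.0 * x_train + rng.normal(0.0, 0.3, size=x_train.size)

def train_mse(degree):
    # Fit a polynomial of the given degree by least squares,
    # then measure its error on the same data it was fitted to.
    coefs = np.polyfit(x_train, y_train, degree)
    residuals = y_train - np.polyval(coefs, x_train)
    return np.mean(residuals ** 2)

for degree in (1, 3, 9):
    print("degree %d: training MSE %.4f" % (degree, train_mse(degree)))
```

The degree-9 fit has the smallest training error even though the data really came from a line, so part of what it has "learned" is noise.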

#### Non-parametric

Just get $f$ as close as possible to the data points, subject to not being too wiggly or too unsmooth.

* **Advantage**: More likely to get $f$ right, especially if $f$ is weird.
* **Disadvantage**: Far more data is needed to obtain a good estimate for $f$.
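
As one possible illustration of a non-parametric estimate, the sketch below uses nearest-neighbour averaging: no form is assumed for $f$, and the prediction at a point is just the mean response of the closest training points. The helper `knn_predict` and the data are hypothetical, chosen only to keep the arithmetic readable.

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k=3):
    # Hypothetical helper: average the responses of the k training
    # points whose inputs are closest to x0.
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

x_train = [0, 1, 2, 3, 4]
y_train = [0, 1, 4, 9, 16]   # y = x^2 with no noise, for a readable example

# The two nearest inputs to 1.1 are x=1 and x=2, so this averages 1 and 4
print(knn_predict(x_train, y_train, x0=1.1, k=2))
```

With only five points the estimate is coarse, which reflects the disadvantage listed above: methods like this need far more data to pin $f$ down.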

#### Supervised vs unsupervised learning

What if we are only given input variables and no outputs? Then our learning will be **unsupervised**; we are blind.

What can we do? We can try to understand the relationship between the input variables or between the observations. One example is to cluster observations or variables together into groups.
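
Clustering can be sketched with a naive k-means loop. The text does not name a specific algorithm, so k-means is an assumption here, as are the helper `kmeans`, its simple initialisation, and the toy points.

```python
import numpy as np

def kmeans(points, k, n_iter=10):
    # Hypothetical helper: naive k-means, initialised from k evenly spaced rows.
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centers = points[idx].astype(float)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Two obvious groups of observations; note there is no output variable at all
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
labels, centers = kmeans(points, k=2)
print(labels)
```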

> “The core of machine learning deals with representation and generalization...”

* representation – extracting structure from data
* generalization – making predictions from data

#### Regression vs classification

![](http://ipython-books.github.io/images/ml.png)

Variables can be either _quantitative_ or _qualitative_. We care about this because different variable types affect the class of values we can make predictions from and therefore the possible nature of $f$, and also how to measure the size of an error from a wrong prediction.

When the output variable is quantitative, prediction is called _regression_.
When the output variable is qualitative, prediction is _classification_.
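
The two cases also call for different error measures. The sketch below uses mean squared error for a quantitative output and the misclassification rate for a qualitative one; both helper functions and all the numbers are made up for illustration.

```python
def mean_squared_error(actual, predicted):
    # Regression: how far off were we, on average (squared)?
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / float(len(squared))

def misclassification_rate(actual, predicted):
    # Classification: what fraction of labels did we get wrong?
    wrong = [a != p for a, p in zip(actual, predicted)]
    return sum(wrong) / float(len(wrong))

# Quantitative output, e.g. sales
sales_actual = [10.0, 12.0, 15.0]
sales_predicted = [11.0, 12.0, 13.0]

# Qualitative output, e.g. whether a customer buys
labels_actual = ["buy", "skip", "buy", "skip"]
labels_predicted = ["buy", "buy", "buy", "skip"]

print(mean_squared_error(sales_actual, sales_predicted))
print(misclassification_rate(labels_actual, labels_predicted))
```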

### Mapping this to our class outline

    ------------ INTRODUCTION ------------
    01 - Data Science Toolkit
    02 - Linear Algebra
    03 - Machine Learning
    04 - Exploratory Data Analysis
    ------------ ESTIMATION ------------
    05 - Linear Regression
    06 - Polynomial Regression
    07 - Data Wrangling 
    08 - Intro to R
    09 - Group Presentations
    ------------ MACHINE LEARNING ------------
    10 - Logistic Regression
    11 - kNN 
    12 - Decision Trees
    13 - Random Forests
    14 - Network Analysis
    15 - Feature Engineering I (PCA)
    16 - Feature Engineering II (Clustering)
    ------------ REAL WORLD APPLICATIONS ------------
    17 - Kaggle I (EDA)
    18 - Kaggle II (Feature Engineering)
    19 - Big Data I (MapReduce)
    20 - Big Data II (Spark)
    21 - Project Labs
    22 - Final Presentations

#### Remember This? 

![](assets/DS_venn_diagram.png)

### How do you get from 'Machine Learning' to 'Data Science'?

![reference](http://tonyfisherpuzzles.net/images/1200%20res%206th%20april%202009b.jpg)

### Problem Solving!!!

Implementing solutions to ML problems is the focus of this course.

### What is the goal of Machine Learning?

The goal is determined by the type of problem.

|Type   	|Use   	|Method |
|:-:	|:-:	|:-:	|
|Supervised   	|Making Predictions   	|Generalisation |
|Unsupervised   	|Extracting Structure   	|Representation |


### How do you determine the right approach?

The right approach is determined by the desired solution.

![](assets/ml_algorithms.png)

But what approaches are available to you will always be dependent on the type and quality of the data you have!

### What do you do with your results?

Interpret them and react accordingly.

![](assets/benfry-workflow-recursion.png)

This also relies on your problem solving skills!

![break](assets/code.png)

## Python Libraries

### import

Python libraries are imported into scripts using the

> **import** statement.

The import statement can be used in three ways:

In [30]:
%%sh
conda

usage: conda [-h] [-V] [--debug] command ...

conda is a tool for managing and deploying applications, environments and packages.

Options:

positional arguments:
  command
    info         Display information about current conda install.
    help         Displays a list of available conda commands and their
                 help strings.
    list         List linked packages in a conda environment.
    search       Search for packages and display their information. The
                 input is a regular expression. To perform a search
                 with a search string that starts with a -, separate
                 the search from the options with --, like 'conda
                 search -- -h'. A * in the results means that package
                 is installed in the current environment. A . means
                 that package is not installed but is cached in the
                 pkgs directory.
    create       Create a new conda environment from a list of
                 spe

In [31]:
%%sh
conda install docopt

Fetching package metadata: ....
Solving package specifications: ..............
# All requested packages already installed.
# packages in environment at /home/io/.tools/anaconda/envs/ds:
#
docopt                    0.6.2                    py27_0  


In [32]:
from sys import maxsize

from os.path import curdir

maxsize

9223372036854775807

In [43]:
import os

from operator import itemgetter

from os.path import expanduser

from sys import *

In [44]:
expanduser('~/Documents')

'/home/io/Documents'

In [48]:
platform

'linux2'

In [58]:
import os 
 
from os import path

path.islink('')

False

In [52]:
pwd

u'/home/io/ga/ds/DS_HK_7/notebooks'

The differences have to do with how each import statement interacts with the local namespace.
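
The three forms can be compared side by side in the sketch below. It uses the standard `math` module rather than `os`, because `from os import *` would also replace the built-in `open`. The dictionary `star_namespace` is a hypothetical scratch namespace, used only so we can inspect what a star import would add.

```python
import math             # 1. binds the module name; access as math.sqrt
from math import sqrt   # 2. binds one chosen name directly

# 3. `from math import *` binds every public name. It is run here inside a
#    scratch dictionary so we can see exactly which names it would add.
star_namespace = {}
exec("from math import *", star_namespace)
public_names = sorted(n for n in star_namespace if not n.startswith("_"))

print(math.sqrt(16))      # qualified access through the module
print(sqrt(16))           # direct access to the imported name
print(len(public_names))  # how many names the star import brings in

# The star import includes `pow`, which would shadow the built-in pow
print(star_namespace["pow"] is math.pow)
```

The first form keeps your namespace clean, the second adds only what you ask for, and the third adds everything, including names that may silently replace ones you already had.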

### Namespaces

Python has three types of namespaces:
    
> **local**, **global**, and **built-in**

For our purposes, namespaces are important because they control how imported code can be accessed:


In [61]:
x = 1 

def func():
    x = 2  # local to func; the global x is untouched
    return x

func()

2

In [60]:
x

1

In [62]:
import os
os.path.expanduser('~')

'/home/io'

In [63]:
path.expanduser('~')

'/home/io'

In [64]:
from os import *
from os import path

In [65]:
path.expanduser('~')

'/home/io'

In [66]:
import numpy as np

### NumPy

We’ll be using four external libraries that help us structure our data accordingly.
> Numpy offers the ability to create arrays (matrices and vectors), as well as some linear algebra functions!

In [68]:
import numpy

In [69]:
dir(numpy)

['ALLOW_THREADS',
 'BUFSIZE',
 'CLIP',
 'DataSource',
 'ERR_CALL',
 'ERR_DEFAULT',
 'ERR_IGNORE',
 'ERR_LOG',
 'ERR_PRINT',
 'ERR_RAISE',
 'ERR_WARN',
 'FLOATING_POINT_SUPPORT',
 'FPE_DIVIDEBYZERO',
 'FPE_INVALID',
 'FPE_OVERFLOW',
 'FPE_UNDERFLOW',
 'False_',
 'Inf',
 'Infinity',
 'MAXDIMS',
 'MachAr',
 'NAN',
 'NINF',
 'NZERO',
 'NaN',
 'PINF',
 'PZERO',
 'PackageLoader',
 'RAISE',
 'SHIFT_DIVIDEBYZERO',
 'SHIFT_INVALID',
 'SHIFT_OVERFLOW',
 'SHIFT_UNDERFLOW',
 'ScalarType',
 'Tester',
 'True_',
 'UFUNC_BUFSIZE_DEFAULT',
 'UFUNC_PYVALS_NAME',
 'WRAP',
 '_NoValue',
 '__NUMPY_SETUP__',
 '__all__',
 '__builtins__',
 '__config__',
 '__doc__',
 '__file__',
 '__git_revision__',
 '__name__',
 '__package__',
 '__path__',
 '__version__',
 '_import_tools',
 '_mat',
 'abs',
 'absolute',
 'absolute_import',
 'add',
 'add_docstring',
 'add_newdoc',
 'add_newdoc_ufunc',
 'add_newdocs',
 'alen',
 'all',
 'allclose',
 'alltrue',
 'alterdot',
 'amax',
 'amin',
 'angle',
 'any',
 'append',
 'apply_alo

In [94]:
from numpy import *
A = matrix('1 2; 3 4; 5 6')
A

matrix([[1, 2],
        [3, 4],
        [5, 6]])

In [79]:
np.array([1, 2, 3])

array([1, 2, 3])

### SciPy

SciPy extends NumPy by offering additional linear algebra functions, signal processing, Fourier transforms, and statistics functions.

In [96]:
from numpy import array
from scipy import linalg
A = array([[1,2],[3,4]])
A

array([[1, 2],
       [3, 4]])

Compute the inverse of a matrix.

In [98]:
A

array([[1, 2],
       [3, 4]])

In [97]:
linalg.inv(A)

array([[-2. ,  1. ],
       [ 1.5, -0.5]])

In [101]:
linalg??

A dot product of a matrix with its inverse results in an identity matrix.

In [None]:
dir(A)

In [None]:
A.dot(linalg.inv(A))
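
A minimal check of that claim is sketched below. It uses `numpy.linalg.inv` instead of SciPy's `linalg.inv` so that it depends on one library only, which is a choice made for this sketch. Floating-point rounding can leave tiny non-zero values off the diagonal, so the comparison uses a tolerance.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
A_inv = np.linalg.inv(A)

product = A.dot(A_inv)

# Compare against the 2x2 identity with a tolerance rather than ==
print(np.allclose(product, np.eye(2)))
```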

### Pandas

pandas (Python data analysis) provides more rigid data structures that are more attuned to other stats languages, like R or MATLAB. It's a Python package providing fast, flexible, and expressive data structures designed to work with both *relational* and *labeled* data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
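
A few of these features can be sketched on a tiny table. The column names and values below are made up for illustration and are not part of the class data.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing sales figure
sales = pd.DataFrame({
    "channel": ["online", "tv", "online", "billboard"],
    "spend":   [100.0, 250.0, 120.0, 80.0],
    "sales":   [10.0, np.nan, 14.0, 6.0],
})

# Missing data: mean() skips NaN by default
mean_sales = sales["sales"].mean()

# Group by: split by channel, apply sum, combine into one Series
spend_by_channel = sales.groupby("channel")["spend"].sum()

# Merging: join a second table on the shared "channel" key
owners = pd.DataFrame({"channel": ["online", "tv", "billboard"],
                       "owner":   ["Ana", "Ben", "Cy"]})
merged = sales.merge(owners, on="channel", how="left")

print(mean_sales)
print(spend_by_channel)
print(merged)
```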

The author provides a [10 minutes to Pandas](http://pandas.pydata.org/pandas-docs/dev/10min.html) guide with an overview of some of the key features.

### SciKit-Learn

Scikit-learn is a library which contains the majority of our machine learning algorithms. We will primarily be using scikit-learn in class to experiment with and learn various ML functionality.

There are a lot of other libraries out there that enable you to do some incredibly great things. We definitely won’t explore all of them here, but don’t be afraid to use our best friend (Google) to help you find libraries that do things you want to get done.



![](assets/scikit_learn_cheat_sheet.png)

![break](assets/code.png)

## Lab: Numpy, Scipy, And Pandas

* Build our numpy and PANDAS repertoire: array, matrix, dataframes
* Compare the work from last week with these libraries

### Confirm our theory!

We have our input features, and we're trying to predict the response vector.

In [104]:
from numpy import array, dot
from numpy.linalg import inv

X = array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = array([[1], [2], [3], [4]])
print X
print y

[[1 1]
 [1 2]
 [1 3]
 [1 4]]
[[1]
 [2]
 [3]
 [4]]


A linear regression in its simplest form:

$$y = α + βx + ε$$

but we can fold $α$ into $β$ by giving $X$ a leading column of ones (as in the `X` above), and assume $ε$ is zero! So really we're trying to capture this relationship and determine $β$:

$$y = βx$$

but we want to solve for β, which means our new equation looks more like this:

$$β = ( X^TX)^{-1} X^Ty$$

How did we get there?

$$β = \frac{y}x$$

That's problematic, as we cannot divide by a matrix! So we first multiply the top and bottom by $x$, which squares the denominator:

$$\frac{xy}{x^2}$$

In [106]:
X

array([[1, 1],
       [1, 2],
       [1, 3],
       [1, 4]])

In [105]:
print dot(X.T, X)

[[ 4 10]
 [10 30]]


That's how we avoid division by a matrix:

$$\frac{1}{x^{2}} \cdot \frac{xy}{1}$$

We use inversion instead, since raising $x$ to the power of $-1$ is equal to $1$ over $x$:

$$(X^TX)^{-1}X^Ty$$
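
In matrix form, the same steps can be sketched as follows. This assumes $X^TX$ is invertible, which the text does not state explicitly. $X$ itself is not square, so it has no inverse, but $X^TX$ is square, which is why we multiply by $X^T$ first.

$$
\begin{aligned}
y &= Xβ \\
X^T y &= X^T X β \\
(X^TX)^{-1} X^T y &= (X^TX)^{-1} X^T X β = β
\end{aligned}
$$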

In [107]:
n = inv(dot(X.T, X))
print n

[[ 1.5 -0.5]
 [-0.5  0.2]]


and

In [108]:
k = dot(X.T, y)
k

array([[10],
       [30]])

And finally to make it programmer friendly

$$β = ( X^TX)^{-1} X^Ty$$

In [109]:
coef_ = dot(n, k)
coef_

array([[ 0.],
       [ 1.]])

And thus we've created a nice short and sweet regression function with no accounting for error:

In [110]:
def regression(inputs, response):
    # Ordinary least squares via the normal equations: (X^T X)^-1 X^T y
    return dot(inv(dot(inputs.T, inputs)), dot(inputs.T, response))

In [111]:
regression(X,y)

array([[ 0.],
       [ 1.]])
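
As a cross-check, NumPy's own least-squares solver should recover the same coefficients without forming the inverse explicitly. This is a sketch, not a replacement for the function above; the `rcond=None` argument assumes a reasonably recent NumPy.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([[1], [2], [3], [4]], dtype=float)

# lstsq returns (solution, residuals, rank, singular values);
# only the solution is needed here
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)
```

The first coefficient is the intercept (from the column of ones) and the second is the slope.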

### Practice: NumPy

Last class we spent a huge portion of our lab time working on some basic linear algebra functions. Thankfully, NumPy and SciPy offer a lot of this already for us and more.

In [113]:
arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [119]:
from numpy import *

arrayOne = arange(15).reshape(3, 5)
arrayOne

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [118]:
arrayTwo = arange(15).reshape(5, 3)
arrayTwo

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [120]:
vector = array([10, 15, 20])
vector

array([10, 15, 20])

By default, NumPy uses arrays, which can have any number of dimensions, though NumPy also has a specific subclass of arrays called matrices, which are always two-dimensional.

In [121]:
matrixOne = matrix('1 2 3; 4 5 6')
matrixOne

matrix([[1, 2, 3],
        [4, 5, 6]])

In [122]:
matrixTwo = matrix('1 2; 3 4; 5 6')
matrixTwo

matrix([[1, 2],
        [3, 4],
        [5, 6]])

Bear in mind that these are still two different structures, and therefore interact differently in Python. For example: which of these produces the answer we'd expect from last class? What does the other one actually do?

In [123]:
a1 = array([ [1, 2], [3, 4] ])
a2 = array([ [1, 3], [2, 4] ])
m1 = matrix('1 2; 3 4')
m2 = matrix('1 3; 2 4')

In [124]:
type(a1)

numpy.ndarray

In [126]:
print a1
print a2
a1 * a2

[[1 2]
 [3 4]]
[[1 3]
 [2 4]]


array([[ 1,  6],
       [ 6, 16]])

In [127]:
m1 * m2

matrix([[ 5, 11],
        [11, 25]])

Note that we can easily get around this issue using the _dot_ function:

In [128]:
dot(a1, a2)

array([[ 5, 11],
       [11, 25]])

In [129]:
dot(m1, m2)

matrix([[ 5, 11],
        [11, 25]])

That said, here are some other common functions that we built last week that are likely much more efficient in their NumPy versions:

In [130]:
# .T Transposes the array or matrix:
a1.T

array([[1, 3],
       [2, 4]])

In [131]:
# .I returns the matrix inverse:
m1.I

matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

In [134]:
# eye(value) creates an identity matrix:
iFive = eye(5)
iFive

array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

Keep track of when you are using n-dimensional arrays and when you are using matrices in numpy: this can cause a lot of headaches!

### On your own

Verify the work from last week by building the matrices you tested last week and comparing your functions to NumPy's. Those of you with more Python experience, feel free to check speed with `timeit`.

In [None]:
%%timeit
dot(a1, a2)

NumPy also has a lot of basic math functions built in which are not built into plain Python (the standard `math` module offers scalar versions) but are common in stats languages. Feel free to go through these on your own:

In [None]:
exp(10) # e ^ value

In [None]:
log(10)

In [None]:
sqrt(4)
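
The practical difference is that the NumPy versions work element-wise on whole arrays, as the short sketch below shows with made-up values.

```python
import math
import numpy as np

values = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(values)         # element-wise square root of the whole array
logs = np.log(np.exp(values))   # log undoes exp, element by element

print(roots)
print(logs)
print(math.sqrt(4.0))           # the standard library version takes one number
```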

### Practice: PANDAS

In [None]:
import pandas as pd

In [None]:
%%sh
# Run the curl command if you don't have the data already
curl http://stat.columbia.edu/~rachel/datasets/nyt1.csv > ../data/nytimes.csv

In [None]:
df = pd.read_csv('../data/nytimes.csv')

In [None]:
df

In [None]:
df.describe()

In [None]:
df[:10]

In [None]:
# Select the columns we care about.
cols  = ['Age', 'Impressions', 'Clicks']
df[cols]

In [None]:
# Create the average impressions and clicks for each age.
dfg = df[cols].groupby(['Age']).agg([np.mean])

In [None]:
dfg[:10]

In [None]:
%matplotlib inline

In [None]:
# Likewise, we can create new variables:
df['log_impressions'] = df['Impressions'].apply(log)
df['log_impressions']

In [None]:
# Or even recluster our values into more specific age groups:

def map_age_category(x):
    """
    Function that groups users by age.
    """
    if x == 0:
        return 0
    elif x < 18:
        return 1
    elif x < 25:
        return 2
    elif x < 32:
        return 3
    elif x < 45:
        return 4
    else:
        return 5

In [None]:
df['age_categories'] = df['Age'].apply(map_age_category)

df[['age_categories']].describe()
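
The same grouping can probably be expressed without a hand-written function by using `pd.cut`. This is a sketch under the assumption that ages are non-negative whole numbers; the sample `ages` are made up, and the bin edges mirror the thresholds in `map_age_category`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 5, 17, 18, 24, 25, 31, 32, 44, 45, 80])

# Left-closed bins: [0, 1), [1, 18), [18, 25), [25, 32), [32, 45), [45, inf)
bin_edges = [0, 1, 18, 25, 32, 45, np.inf]

# labels=False returns the integer bin index instead of an interval label
age_categories = pd.cut(ages, bins=bin_edges, right=False, labels=False)
print(age_categories.tolist())
```

One difference to keep in mind: a negative age falls outside every bin and becomes missing here, whereas the function above would put it in category 1.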

### Classwork

The NYTimes data is hosted across 30 csv files:

```bash
# Replace # with anything between 1 and 30
http://stat.columbia.edu/~rachel/datasets/nyt#.csv
```

We'd like to use Pandas and numpy to have a simple script that aggregates all of this data into one dataframe. This time, let's just get the **click through rate** per **age**, **gender**, and **signed_in** (remember that CTR is calculated as clicks/impressions).

You can export the final dataframe using pandas to_csv:

In [None]:
df.to_csv('nytimes_aggregation.csv')

In [None]:
for n in range(1,31):
    !curl http://{n}
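
The aggregation step itself might look like the sketch below. The helper `aggregate_ctr` is hypothetical, and the column names `Gender` and `Signed_In` are assumptions about the CSV layout (only `Age`, `Impressions`, and `Clicks` appear in the cells above). Two tiny made-up frames stand in for the downloaded files; in the real exercise each frame would come from `pd.read_csv`.

```python
import pandas as pd

def aggregate_ctr(frames):
    # Hypothetical helper: stack the daily frames, total clicks and
    # impressions per group, then divide to get the click-through rate.
    combined = pd.concat(frames, ignore_index=True)
    keys = ["Age", "Gender", "Signed_In"]   # assumed column names
    grouped = combined.groupby(keys)[["Clicks", "Impressions"]].sum()
    # Keep only groups that were shown ads, to avoid dividing by zero
    grouped = grouped[grouped["Impressions"] > 0].copy()
    grouped["CTR"] = grouped["Clicks"] / grouped["Impressions"].astype(float)
    return grouped.reset_index()

# Two made-up "days" of data
day1 = pd.DataFrame({"Age": [30, 30, 40], "Gender": [0, 0, 1],
                     "Signed_In": [1, 1, 1],
                     "Impressions": [4, 6, 5], "Clicks": [1, 0, 1]})
day2 = pd.DataFrame({"Age": [30, 40], "Gender": [0, 1],
                     "Signed_In": [1, 1],
                     "Impressions": [10, 5], "Clicks": [1, 1]})

ctr = aggregate_ctr([day1, day2])
print(ctr)
```

Summing clicks and impressions before dividing weights each group by how often it was shown ads, which is usually what you want; averaging per-row rates would give a different number.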

Explore plotting your new aggregated data in various forms to understand the feature space, and try using sklearn’s linear model function with your aggregate data to predict CTR per age.

You may have to use the `astype` function to help convert between ints and floats. Refer to the [numpy documentation](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.astype.html) for more.

In [None]:
# inspect the dtypes
df.dtypes

Pandas DataFrames have some built-in plotting methods that are wrappers for matplotlib. To see a histogram of your distribution, see the [hist()](http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.hist.html) documentation.

In [None]:
import matplotlib.pyplot as plt

df.hist(figsize=(18,12))
plt.show()

More generally, we'll be using matplotlib to plot our graphics. Play around with the following code to understand what it does, or consult the [matplotlib.plot() documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) for further details.

In [None]:
x = linspace(-15,15,100) # 100 linearly spaced numbers
y = sin(x)/x # computing the values of sin(x)/x

# compose plot
fig = plt.figure(figsize=(18, 8), dpi=300)
plt.plot(x,y) # sin(x)/x
plt.plot(x,y,'co') # same function with cyan dots
plt.plot(x,2*y,x,3*y) # 2*sin(x)/x and 3*sin(x)/x
plt.show() # show the plot

Once you've figured out how to use the plot method from matplotlib, try plotting the various relationships from your NYTimes DataFrame.

Once you're ready to submit your work, keep the csv file on your computer, but send your code over to GitHub. This should be much simpler than anything else we've done so far!

**Send your scripts to Gist and send Dickson and myself the link!**

## Discussion : Final Project Ideas

1. Curate a list of potential final project ideas, as our goal is to answer a question using machine learning. For each question: which “problem” does it fall under?
2. We’ll discuss these in smaller groups first, and share some ideas together as a class.

![break](assets/resources.png)

## Resources

## Colophon

In [28]:
from utils import *
print_versions()

Python    2.7.10
IPython   4.0.0
numpy     1.10.1
pandas    0.17.0
sklearn   0.16.1
seaborn   0.6.0


In [29]:
%%html

<link rel="stylesheet" href="theme/custom.css">