# Introduction to Statistical Learning 
Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is considered a canonical text in the field of statistical/machine learning and is a fantastic way to move forward in your analytics career. [The text is free to download](http://www-bcf.usc.edu/~gareth/ISL/) and an [online course by the authors themselves](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about) is currently available in self-paced mode, meaning you can complete it at any time. Make sure to **[REGISTER FOR THE STANFORD COURSE!](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)** The videos have also been [archived here on YouTube](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/).

# How will Houston Data Science cover the course?
The Stanford online course covers the entire book in 9 weeks using the R programming language. The pace at which we cover the book is yet to be determined, as there are many unknowns such as interest from members, availability of a venue and the general skill level of those participating. That said, the target is one meeting per week to discuss the current chapter or the previous chapter's solutions.


# Python in place of R
Although R is a fantastic programming language and is the language that all the ISLR labs are written in, Python has, with rare exceptions, analogous libraries that provide the same statistical functionality as those in R.

# Notes, Exercises and Programming Assignments all in the Jupyter Notebook
ISLR has both end-of-chapter problems and programming assignments. All chapter problems and programming assignments will be answered in the notebook.

# Replicating Plots
The plots in ISLR are created in R. Many of them will be replicated here in the notebook when they appear in the text.

# Book Data
The data from the book was downloaded using R. All the datasets are found in either the MASS or ISLR packages. They are now in the data directory. See below.

In [1]:
ls data

Advertising.csv* carseats.csv     khan_xtrain.csv  portfolio.csv
Credit.csv       college.csv      khan_ytest.csv   smarket.csv
auto.csv         default.csv      khan_ytrain.csv  usarrests.csv
boston.csv       hitters.csv      nci60_data.csv   wage.csv*
caravan.csv      khan_xtest.csv   nci60_labs.csv   weekly.csv


# ISLR Videos
[All Old Videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)

# Chapter 9: Support Vector Machines
"New" learning method invented by Vladimir Vapnik in the 1990s. Likely the best classifier of its time, now often surpassed by gradient boosted trees and neural networks.

There are three different but very closely related classifiers in this chapter:
* Maximum margin classifier
* Support vector classifier
* Support vector machine

## Maximum Margin Classifier
An optimal hyperplane that separates classes.  
**Hyperplane** - In a p-dimensional space, a hyperplane is a flat subspace of dimension p-1: a line in 2 dimensions, a plane in 3 dimensions. Mathematical definition in p dimensions: $\beta_0 + \beta_1 X_1 + ... + \beta_p X_p = 0$. A hyperplane divides the space into two halves, one on each side of it.

## Linearly Separable Case
First, and easiest, we will look at 2-dimensional data that is perfectly linearly separable. Here the hyperplane is a line. ![line](https://www.otexts.org/sites/default/files/resize/sfml/images/sep_hyp-600x486.png)

Many different lines can be drawn here to separate the data. To simplify the math, let $y$ equal 1 for one class and -1 for the other. Then if $\beta_0 + \beta_1 X_1 + ... + \beta_p X_p > 0$ we classify the observation as 1, and if $\beta_0 + \beta_1 X_1 + ... + \beta_p X_p < 0$ we classify it as -1.

Multiplying both inequalities by $y$ (which flips the second inequality, since $y = -1$ there) yields a single condition, $y(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p) > 0$, that holds for any correctly classified observation.
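The sign rule and the combined condition can be sketched in a few lines of plain Python. The coefficients and observations below are made up for illustration (they are not from ISLR); the line used is $-3 - 0.5 X_1 + X_2 = 0$.

```python
# Hypothetical coefficients for the line -3 - 0.5*X1 + X2 = 0 (i.e. X2 = 0.5*X1 + 3)
beta0 = -3.0
beta = [-0.5, 1.0]

def f(x):
    """Value of beta_0 + beta_1*x_1 + ... + beta_p*x_p for one observation."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def classify(x):
    """Return 1 if the observation is on the positive side, otherwise -1."""
    return 1 if f(x) > 0 else -1

# Made-up observations in the form ((x1, x2), y)
data = [((0.0, 5.0), 1), ((2.0, 6.0), 1), ((0.0, 1.0), -1), ((4.0, 3.0), -1)]

predictions = [classify(x) for x, _ in data]

# y * f(x) > 0 is True exactly when the observation is correctly classified
correct = [y * f(x) > 0 for x, y in data]

print(predictions)
print(correct)
```

Flipping the label of any observation in `data` makes its entry in `correct` become `False`, which is a quick way to see that the product $y \cdot f(x)$ encodes "right side or wrong side" in one number.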

If the data is perfectly separable then an infinite number of hyperplanes exist that perfectly separate the data. A natural choice is the hyperplane that is farthest from the closest training observations - one that has a large margin - the maximum margin.

## What defines maximum margin?
In the linearly separable case we find the line that has the maximum margin between the two classes. The margin of a separating hyperplane is the distance from the hyperplane to the closest point, and the maximum margin hyperplane is the one for which this distance is largest. So, we are maximizing the minimum distance from the hyperplane. All other points are of no consequence, which is a bit scary, but it happens to work well. The minimum distance points are called the support vectors.
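As a rough illustration of "maximizing the minimum distance", the sketch below computes the margin of two candidate separating lines on a small made-up data set and lists the points that sit exactly at the minimum distance. The data, the two candidate lines and the helper names are all assumptions for this example; it compares candidates only and does not solve the optimization problem.

```python
import math

def distance(beta0, beta, x):
    """Perpendicular distance from x to the hyperplane beta0 + beta . x = 0."""
    value = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return abs(value) / math.sqrt(sum(b * b for b in beta))

def margin(beta0, beta, points):
    """Margin of a separating hyperplane: distance to the closest point."""
    return min(distance(beta0, beta, x) for x in points)

# Made-up points: the first three belong to class 1, the last three to class -1
points = [(0.0, 5.0), (2.0, 6.5), (4.0, 8.0), (0.0, 1.0), (2.0, 1.5), (4.0, 3.0)]

# Two hypothetical lines that both separate the classes
line_a = (-3.0, (-0.5, 1.0))  # X2 = 0.5*X1 + 3
line_b = (-4.0, (-0.5, 1.0))  # X2 = 0.5*X1 + 4

margin_a = margin(*line_a, points)
margin_b = margin(*line_b, points)

# Points at the minimum distance from line_a play the role of support vectors
closest = [x for x in points if math.isclose(distance(*line_a, x), margin_a)]

print(margin_a, margin_b)
print(closest)
```

Of the two candidates, `line_a` has the larger margin, and only the points in `closest` determine that margin. Moving any of the other points (without crossing the margin) leaves `margin_a` unchanged.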

## Non-Separable Data
If the data is not linearly separable then no hyperplane can separate the classes and thus no margin exists. This case is the most common with real data. Even when the data is separable, the maximum margin classifier is very sensitive to single data points: the hyperplane can change drastically with the addition of one new observation. To help combat this type of overfitting and to allow for non-separable data, we can use a soft margin. We allow some observations to be within the margin or even on the wrong side of the hyperplane. These margin violations make the margin 'soft'.

The problem formulation is tweaked to allow for some total amount of error, C. This total acts as a budget, like a balance in the bank that you can spend on margin violations. The size of each observation's violation is measured by a slack variable, and the slack variables must sum to no more than C. C is chosen through cross-validation.
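To make the budget idea concrete, here is a small sketch that computes slack values for a fixed hyperplane and margin width, assuming ISLR's formulation $y_i f(x_i) \geq M(1 - \epsilon_i)$ with the coefficients scaled to unit length. The hyperplane, the margin width `M`, the data and the function name are made up for illustration. Note that scikit-learn's `C` parameter in `SVC` is a penalty rather than a budget, so it works in the opposite direction: a larger value tolerates fewer violations.

```python
import math

def slacks(beta0, beta, M, data):
    """Slack for each observation: 0 outside the margin, between 0 and 1
    inside the margin on the correct side, above 1 on the wrong side."""
    norm = math.sqrt(sum(b * b for b in beta))
    result = []
    for x, y in data:
        signed_distance = y * (beta0 + sum(b * xj for b, xj in zip(beta, x))) / norm
        result.append(max(0.0, 1.0 - signed_distance / M))
    return result

# Hypothetical hyperplane X2 = 3 with margin width M = 1
beta0, beta, M = -3.0, (0.0, 1.0), 1.0

# Made-up observations in the form ((x1, x2), y)
data = [
    ((0.0, 5.0), 1),   # well outside the margin
    ((1.0, 3.5), 1),   # inside the margin, correct side
    ((2.0, 2.5), 1),   # wrong side of the hyperplane
    ((0.0, 1.0), -1),  # well outside the margin
    ((3.0, 2.0), -1),  # exactly on the margin
]

eps = slacks(beta0, beta, M, data)
total = sum(eps)

print(eps)
print(total)
```

With a budget of C = 2 this hyperplane is feasible, since the slacks sum to exactly 2. With C = 1 it is not, and the optimizer would have to pick a different hyperplane or a different margin width.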

## Support Vector Machines
For data whose classes are separated by a non-linear boundary, something different must be done. We can transform the variables as in previous chapters (squaring them, creating interaction terms, etc.) and fit a linear boundary in the enlarged feature space, or we can use kernels. The support vector machine uses kernels to enlarge the feature space efficiently, without computing these transformations explicitly.
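A tiny example of why enlarging the feature space helps: the made-up one-dimensional data below cannot be split by any single threshold on $X$, but a single threshold on $X^2$ separates it perfectly. The data, the thresholds and the helper `separates` are assumptions for illustration only.

```python
# Made-up 1-D data: class 1 on the outside, class -1 in the middle
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 if abs(x) > 1.5 else -1 for x in xs]

def separates(scores, labels, threshold):
    """True if 'score > threshold' (or its mirror image) reproduces every label."""
    preds = [1 if s > threshold else -1 for s in scores]
    return preds == labels or [-p for p in preds] == labels

# Try every threshold that falls between two neighbouring x values
thresholds = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
linear_ok = any(separates(xs, ys, t) for t in thresholds)

# Enlarge the feature space with X^2 and threshold on the new feature
squared = [x * x for x in xs]
squared_ok = separates(squared, ys, 2.25)

print(linear_ok)
print(squared_ok)
```

In the enlarged space $(X, X^2)$ the boundary $X^2 = 2.25$ is a hyperplane, even though in the original space it corresponds to the two points $X = \pm 1.5$.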

The solution to the SVM optimization problem involves only inner products of the observations. The decision function is just a weighted sum of inner products between the new observation and the support vectors. The inner product can be replaced with a kernel function, and there are several to choose from. The linear kernel is just the standard inner product, $K(x_i, x_{i'}) = \sum_{j=1}^p x_{ij} x_{i'j}$. The polynomial kernel of degree $d$ raises one plus the inner product to the power $d$, $K(x_i, x_{i'}) = (1 + \sum_{j=1}^p x_{ij} x_{i'j})^d$. The radial basis function (RBF) kernel decays exponentially with the squared distance between points, $K(x_i, x_{i'}) = \exp(-\gamma \sum_{j=1}^p (x_{ij} - x_{i'j})^2)$. A kernel measures the similarity of two observations. For the radial kernel, the further apart the two points are, the smaller the result of the kernel calculation.
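The three kernels can be written directly as small functions. This is a minimal sketch assuming the forms used in ISLR: the polynomial kernel as $(1 + \langle a, b \rangle)^d$ and the radial kernel as $\exp(-\gamma \lVert a - b \rVert^2)$. The function names, default parameters and sample points are made up for illustration.

```python
import math

def linear_kernel(a, b):
    """Standard inner product of two observations."""
    return sum(ai * bi for ai, bi in zip(a, b))

def polynomial_kernel(a, b, degree=2):
    """One plus the inner product, raised to the chosen degree."""
    return (1.0 + linear_kernel(a, b)) ** degree

def rbf_kernel(a, b, gamma=1.0):
    """Decays exponentially with the squared Euclidean distance."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Made-up observations
x = (1.0, 2.0)
near = (1.0, 2.5)
far = (4.0, 6.0)

print(linear_kernel(x, near))
print(polynomial_kernel(x, near))
print(rbf_kernel(x, x), rbf_kernel(x, near), rbf_kernel(x, far))
```

The radial kernel equals 1 when the two points coincide and shrinks toward 0 as they move apart, so distant training observations contribute almost nothing to the prediction for a given test point.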

Kernels allow for very high dimensional (infinite with the radial basis function) feature space enlargement without actually computing anything in that space.

## Multi-Class SVM
There are two different approaches for K classes where K > 2. One vs One constructs a separate SVM for every pair of classes, K(K-1)/2 in total. A test observation is assigned to the class that gets the most votes across these pairwise classifiers. One vs All constructs K SVMs, each using all observations and comparing one class to the other K-1 classes combined. The test observation is assigned to the class whose classifier places it furthest on the positive side of its hyperplane.
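Both strategies are available as wrappers in scikit-learn. The sketch below assumes scikit-learn is installed and uses synthetic data from `make_blobs` (not a data set from the book) to confirm how many underlying SVMs each strategy fits for K = 4.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

K = 4  # number of classes in the synthetic data

# Synthetic 2-D data with K clusters, one cluster per class
X, y = make_blobs(n_samples=200, centers=K, random_state=0)

# One vs One: one SVM per pair of classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One vs All (called one-vs-rest in scikit-learn): one SVM per class
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

n_ovo = len(ovo.estimators_)
n_ovr = len(ovr.estimators_)

print(n_ovo, n_ovr)
```

One vs One fits more models, but each one only sees the observations from two classes. One vs All fits fewer models, each on the full data set.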

# for class
Do a simple linearly separable case (hard margin) with the separating line $x_2 = \frac{1}{2} x_1 + 3$ or something similar.

Write data points as (x1, x2), y where y is -1 or 1.

Choose the data points so that one additional point of one class, placed close to the other class, has tremendous influence on the line.
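One possible version of this exercise is sketched below. It assumes scikit-learn is installed and approximates a hard margin by fitting `SVC` with a linear kernel and a very large `C`. The data points, the added point and the helper `fit_line` are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def fit_line(X, y):
    """Fit an (approximately) hard-margin linear SVM and return the classifier,
    the slope and intercept of the separating line, and the margin width."""
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    w = clf.coef_[0]
    b = clf.intercept_[0]
    slope = -w[0] / w[1]
    intercept = -b / w[1]
    margin = 1.0 / np.linalg.norm(w)
    return clf, slope, intercept, margin

# Made-up points: class 1 lies on x2 = 0.5*x1 + 5, class -1 on x2 = 0.5*x1 + 1
X_before = np.array([[0.0, 5.0], [2.0, 6.0], [4.0, 7.0],
                     [0.0, 1.0], [2.0, 2.0], [4.0, 3.0]])
y_before = np.array([1, 1, 1, -1, -1, -1])

# One extra point of class -1 placed close to class 1
X_after = np.vstack([X_before, [[2.0, 4.5]]])
y_after = np.append(y_before, -1)

clf_before, slope_before, intercept_before, margin_before = fit_line(X_before, y_before)
clf_after, slope_after, intercept_after, margin_after = fit_line(X_after, y_after)

print("before:", slope_before, intercept_before, margin_before)
print("after: ", slope_after, intercept_after, margin_after)
```

Before the extra point is added, the fitted line is close to $x_2 = \frac{1}{2} x_1 + 3$. The single new observation forces the line toward class 1 and cuts the margin to less than half its original width, which is the sensitivity the hard margin classifier is known for.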