# Introduction to Statistical Learning 
*An Introduction to Statistical Learning* (ISLR) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is considered a canonical text in statistical/machine learning and is a fantastic way to move forward in your analytics career. [The text is free to download](http://www-bcf.usc.edu/~gareth/ISL/) and an [online course by the authors themselves](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about) is currently available in self-paced mode, meaning you can complete it at any time. Make sure to **[REGISTER FOR THE STANFORD COURSE!](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)** The videos have also been [archived here on YouTube](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/).

# How will Houston Data Science cover the course?
The Stanford online course covers the entire book in 9 weeks using the R programming language. The pace at which we cover the book is yet to be determined, as there are many unknowns such as interest from members, availability of a venue and the general skill level of those participating. That said, the target is one meeting per week to discuss the current chapter or the previous chapter's solutions.


# Python in place of R
Although R is a fantastic programming language and is the language that all the ISLR labs are written in, Python has, with rare exceptions, analogous libraries that provide the same statistical functionality as those in R.

# Notes, Exercises and Programming Assignments all in the Jupyter Notebook
ISLR has both end-of-chapter problems and programming assignments. All chapter problems and programming assignments will be answered in the notebook.

# Replicating Plots
The plots in ISLR are created in R. Many of them will be replicated here in the notebook when they appear in the text.

# Book Data
The data from the book was downloaded using R. All the datasets are found in either the MASS or ISLR packages. They are now in the data directory, listed below.

In [1]:
ls data

Advertising.csv  carseats.csv     khan_xtrain.csv  portfolio.csv
Credit.csv       college.csv      khan_ytest.csv   smarket.csv
auto.csv         default.csv      khan_ytrain.csv  usarrests.csv
boston.csv       hitters.csv      nci60_data.csv   wage.csv
caravan.csv      khan_xtest.csv   nci60_labs.csv   weekly.csv


# ISLR Videos
[All Old Videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)

# Chapter 8: Tree-Based Methods
The fan-favorite decision tree and its relatives, the random forest and the Kaggle champion gradient boosted tree, are discussed in this chapter. Decision trees are simple and easy to interpret, and they can give very good results when bagged and boosted.

## Decision Trees
Decision trees can be applied to both regression and classification. A decision tree is a diagram in which you start at the "root" and work your way down by making a decision at each "branch" until you end up at a "leaf" that gives you the prediction.

![tree image](http://image.slidesharecdn.com/decisiontree-151015165353-lva1-app6892/95/classification-using-decision-tree-12-638.jpg?cb=1444928106)
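The root-to-leaf traversal described above can be sketched in a few lines of Python. This is only an illustration: the nested-dictionary layout, the feature names `years` and `hits`, the thresholds and the leaf values are all hypothetical, chosen to show the idea rather than taken from a fitted model.

```python
# Hypothetical hand-built regression tree: internal nodes hold a feature name,
# a threshold and two children; leaves hold only a prediction.
toy_tree = {
    "feature": "years", "threshold": 4.5,
    "left": {"prediction": 5.11},
    "right": {
        "feature": "hits", "threshold": 117.5,
        "left": {"prediction": 6.00},
        "right": {"prediction": 6.74},
    },
}

def predict(node, observation):
    """Walk from the root to a leaf, going left when feature < threshold."""
    while "prediction" not in node:
        goes_left = observation[node["feature"]] < node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["prediction"]

print(predict(toy_tree, {"years": 3, "hits": 150}))   # stops at the first leaf
print(predict(toy_tree, {"years": 10, "hits": 90}))   # takes two decisions
```

Each prediction costs only as many comparisons as the depth of the leaf it lands in, which is part of why trees are so easy to read and explain.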

The book describes two general steps for building a (regression) decision tree:
1. Divide the predictor space into $J$ distinct, non-overlapping regions.
2. For each region, find the mean response of the training observations in it and use that as the predicted value. The mean minimizes the squared error for that region.
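A minimal sketch of these two steps with NumPy, assuming made-up data, a single predictor and a hypothetical cut point at 5 (none of these come from the book):

```python
import numpy as np

# Made-up data: one predictor x and a response y.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# Step 1: divide the predictor space into J = 2 regions at a hypothetical cut.
cut = 5.0
region = (x >= cut).astype(int)  # region id 0 or 1 for every observation

# Step 2: predict the mean response within each region.
region_means = np.array([y[region == j].mean() for j in (0, 1)])
predictions = region_means[region]
rss = float(np.sum((y - predictions) ** 2))

# Any other constant prediction in a region gives a larger squared error there.
in_r0 = y[region == 0]
rss_r0_mean = float(np.sum((in_r0 - in_r0.mean()) ** 2))
rss_r0_other = float(np.sum((in_r0 - (in_r0.mean() + 0.5)) ** 2))

print(region_means, rss, rss_r0_mean, rss_r0_other)
```

Shifting the prediction for the first region away from its mean by 0.5 raises that region's squared error, which is the sense in which the mean is the best constant prediction.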

In theory the regions can be divided in any manner you choose, but in practice they are divided into high-dimensional rectangles (boxes). See image 8.1 from the book. For example, you could have used a line with a non-zero slope to partition the region below and get a more accurate fit, but simplicity wins here and we split only on horizontal and vertical lines, hence "high-dimensional rectangles".

![rectangle](Images/decision.png)

### How do we get the branches?
We could try to build every possible tree and pick the one with the lowest squared error, but this is computationally infeasible even for a relatively small number of predictors. Instead, a greedy approach is used, building the tree one branch at a time. The first branch is constructed by testing many different binary splits of the data.

For example, $X_1 < 5$ and $X_1 \geq 5$ would be one potential split. $X_2 = \text{YES}$ and $X_2 = \text{NO}$ could be another binary split. Whichever split yields the lowest squared error is considered the best and is chosen for the first branch. The process then continues recursively within each resulting branch until some stopping criterion is met (maximum number of branches, minimum number of observations in a branch, etc.).
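The search for a single best split can be sketched as follows. This is an illustrative implementation, not the book's code: the helper names `rss` and `best_split` are made up, the data is synthetic, only numeric predictors are handled, and candidate thresholds are taken as midpoints between consecutive distinct values (one common convention).

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the mean (0 for an empty array)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y):
    """Return (feature index, threshold, total RSS) of the best binary split."""
    best = (None, None, rss(y))
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < t
            total = rss(y[left]) + rss(y[~left])
            if total < best[2]:
                best = (j, float(t), total)
    return best

# Made-up data: the response jumps when the first predictor crosses about 4.5,
# while the second predictor is unrelated noise.
X = np.array([[1.0, 9.0], [2.0, 3.0], [3.0, 7.0],
              [6.0, 4.0], [7.0, 8.0], [8.0, 2.0]])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

feature, threshold, total_rss = best_split(X, y)
print(feature, threshold, total_rss)
```

Growing a full tree would apply `best_split` again to the observations on each side of the chosen split, stopping when a criterion such as a minimum number of observations per branch is reached.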

### Tree Pruning
It is possible to build a decision tree so specific (one with so many branches) that each observation can be predicted exactly. This would be complete memorization, i.e. overfitting, of the data. Because we want the tree to work on unseen data, we can prune the tree.

One strategy would be to set a threshold for stopping a branch from splitting: a split is made only if it decreases RSS by a certain amount. Since this might miss a good split deeper in the tree, pruning is preferred.
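A small made-up example shows how a threshold rule can miss a good split deeper in the tree. The XOR-style data below is an assumption chosen for illustration: no single first split reduces RSS at all, yet two levels of splits fit the data perfectly.

```python
import numpy as np

# Made-up XOR-style data: the response is 1 only when exactly one predictor is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def rss(values):
    """Residual sum of squares around the mean (0 for an empty array)."""
    return float(np.sum((values - values.mean()) ** 2)) if values.size else 0.0

def split_rss(X, y, feature, threshold=0.5):
    """Total RSS after splitting on one feature at the given threshold."""
    left = X[:, feature] < threshold
    return rss(y[left]) + rss(y[~left])

root_rss = rss(y)
first_split_gain = root_rss - min(split_rss(X, y, 0), split_rss(X, y, 1))

# Split on feature 0 anyway, then split each child on feature 1.
left = X[:, 0] < 0.5
rss_after_two_levels = (split_rss(X[left], y[left], 1)
                        + split_rss(X[~left], y[~left], 1))

print(first_split_gain, rss_after_two_levels)
```

A rule that demands any positive RSS decrease would stop at the root here, while growing the tree first and pruning afterwards keeps the useful second-level splits.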

Pruning works by:
1. Growing a very large tree, stopping only when a minimum number of observations is left in each branch.
2. Adding a penalty term $\alpha|T|$ to the RSS of each subtree $T$ of the large tree, where $|T|$ is the number of terminal nodes, and finding the subtree that minimizes this penalized RSS for each value of $\alpha$.
3. This gives a function that maps $\alpha$ to a particular subtree. So $\alpha = 0$ maps to the original huge tree and, for example, $\alpha = 5$ could map to a tree with only half of the terminal nodes.
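Written out, the penalized criterion with the $\alpha|T|$ term is the following, where $R_m$ denotes the region of the $m$-th terminal node and $\hat{y}_{R_m}$ the mean training response in that region (symbols introduced here following the book's cost complexity pruning notation, not defined earlier in these notes):

$$
\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha |T|
$$

The first term is the RSS of the subtree $T$ and shrinks as leaves are added, while the second term grows with the number of leaves, so larger values of $\alpha$ favor smaller subtrees.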

Choose $\alpha$ through cross-validation by:
1. Splitting the training data into K folds.
2. For each fold, growing a large tree on the other K-1 folds and applying the penalty term exactly as above (mapping each $\alpha$ to a particular subtree).
3. Evaluating each $\alpha$ (subtree) on the left-out fold.
4. Averaging the error for each $\alpha$ over the K folds and picking the $\alpha$ with the lowest average error.

Then use this $\alpha$ to choose the subtree of the large tree grown on all the training data, as above.
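One way to try this in Python is scikit-learn, whose decision trees expose cost complexity pruning through the `ccp_alpha` parameter and the `cost_complexity_pruning_path` method. The sketch below is an approximation of the procedure above, not the book's R lab: the data is synthetic, `min_samples_leaf=5` and K = 5 are arbitrary choices, and scikit-learn penalizes mean squared error rather than RSS, so its $\alpha$ values are on a different scale from the book's.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy step function.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(scale=0.5, size=200)

# Candidate alphas: one per subtree in the pruning sequence of a large tree.
big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = big_tree.cost_complexity_pruning_path(X, y)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))  # guard against tiny negatives

# K-fold cross-validation (K = 5): average held-out error for each alpha.
search = GridSearchCV(
    DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["ccp_alpha"]
pruned_tree = search.best_estimator_  # refit on all the training data
print(best_alpha, pruned_tree.get_n_leaves())
```

`GridSearchCV` refits the winning setting on the full training set by default, which corresponds to the final step of using the chosen $\alpha$ to pick the subtree.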

### Classification Trees
In a classification tree, each terminal node predicts the most commonly occurring class among the training observations in that node.
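A minimal sketch of that majority-vote rule, assuming a made-up list of class labels for the training observations that landed in one leaf (the helper name `majority_class` is hypothetical):

```python
from collections import Counter

# Made-up class labels of the training observations that fall in one leaf.
leaf_labels = ["Yes", "No", "Yes", "Yes", "No"]

def majority_class(labels):
    """Return the most commonly occurring class among the labels."""
    return Counter(labels).most_common(1)[0][0]

prediction = majority_class(leaf_labels)

# The class proportions in the leaf show how confident the prediction is.
class_proportions = {
    label: count / len(leaf_labels)
    for label, count in Counter(leaf_labels).items()
}

print(prediction, class_proportions)
```

The proportions are worth keeping alongside the prediction, since a leaf that is 60% "Yes" is a much weaker vote than one that is 100% "Yes".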