# Visualizing CARTs with admissions data

Using the admissions data from earlier in the course, build classification and regression trees (CARTs), examine how they work visually, and compare their performance to more standard, parametric models.


---

### Install and load the packages required to visually show decision tree branching

You will first need to:

1. Install `graphviz`. On macOS with Homebrew the command is `brew install graphviz`; on Debian/Ubuntu Linux it is typically `sudo apt-get install graphviz`.
2. Install `pydotplus` with `pip install pydotplus`.
3. Load the packages as shown below (you may need to restart the kernel after the installations).

In [2]:
# REQUIREMENTS:
# pip install pydotplus
# brew install graphviz

# Use graphviz to make a chart of the regression tree decision points:
# sklearn.externals.six has been removed from scikit-learn; use the standard library instead
from io import StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

---

### Load in admissions data and other python packages

In [25]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm

# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [14]:
admit = pd.read_csv('assets/admissions.csv')

In [31]:
# Drop duplicate rows and rows with missing values, then preview the cleaned data
admit = admit.drop_duplicates()
admit = admit.dropna()
admit.head()

---

### Create regression and classification X, y data

The regression data will be:

    Xr = [admit, gre, prestige]
    yr = gpa
    
The classification data will be:

    Xc = [gre, gpa, prestige]
    yc = admit

In [35]:
Xr = admit[['admit', 'gre', 'prestige']]
yr = admit['gpa']
Xc = admit[['gre', 'gpa', 'prestige']]
yc = admit['admit']




---

### Cross-validate regression and logistic regression on the data

Fit a linear regression for the regression problem and a logistic regression for the classification problem. Cross-validate the R2 and accuracy scores.

In [37]:
Xr = sm.add_constant(Xr)
# Note the difference in argument order
model = sm.OLS(yr, Xr).fit()
predictions = model.predict(Xr)
model.summary()

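Statistic        | Value            | Statistic           | Value
-----------------|------------------|---------------------|---------
Dep. Variable    | gpa              | R-squared           | 0.151
Model            | OLS              | Adj. R-squared      | 0.144
Method           | Least Squares    | F-statistic         | 22.93
Date             | Mon, 24 Jul 2017 | Prob (F-statistic)  | 1.09e-13
Time             | 19:31:11         | Log-Likelihood      | -142.11
No. Observations | 392              | AIC                 | 292.2
Df Residuals     | 388              | BIC                 | 308.1
Df Model         | 3                |                     |
Covariance Type  | nonrobust        |                     |

Term     | coef   | std err | t      | P>\|t\| | [0.025 | 0.975]
---------|--------|---------|--------|---------|--------|-------
const    | 2.6558 | 0.109   | 24.433 | 0.000   | 2.442  | 2.870
admit    | 0.0813 | 0.040   | 2.049  | 0.041   | 0.003  | 0.159
gre      | 0.0012 | 0.000   | 7.561  | 0.000   | 0.001  | 0.001
prestige | 0.0065 | 0.019   | 0.338  | 0.736   | -0.031 | 0.045

Statistic     | Value  | Statistic        | Value
--------------|--------|------------------|-------
Omnibus       | 5.689  | Durbin-Watson    | 1.917
Prob(Omnibus) | 0.058  | Jarque-Bera (JB) | 5.654
Skew          | -0.263 | Prob(JB)         | 0.0592
Kurtosis      | 2.737  | Cond. No.        | 3700.0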


In [40]:
Xc = sm.add_constant(Xc)
# Note the difference in argument order
model = sm.Logit(yc, Xc).fit()
# Predict with the classification design matrix (Xc), not the regression one
predictions = model.predict(Xc)
model.summary()

Optimization terminated successfully.
         Current function value: 0.574528
         Iterations 6


Statistic     | Value            | Statistic        | Value
--------------|------------------|------------------|----------
Dep. Variable | admit            | No. Observations | 392
Model         | Logit            | Df Residuals     | 388
Method        | MLE              | Df Model         | 3
Date          | Mon, 24 Jul 2017 | Pseudo R-squ.    | 0.07645
Time          | 19:32:15         | Log-Likelihood   | -225.22
converged     | True             | LL-Null          | -243.86
              |                  | LLR p-value      | 4.001e-08

Term     | coef    | std err | z      | P>\|z\| | [0.025   | 0.975]
---------|---------|---------|--------|---------|----------|-------
const    | -3.0647 | 1.146   | -2.675 | 0.007   | -5.311   | -0.819
gre      | 0.0022  | 0.001   | 2.018  | 0.044   | 6.37e-05 | 0.004
gpa      | 0.6785  | 0.331   | 2.052  | 0.040   | 0.030    | 1.327
prestige | -0.5647 | 0.128   | -4.406 | 0.000   | -0.816   | -0.314
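
The statsmodels fits above report in-sample statistics only. A minimal sketch of the cross-validation step, using scikit-learn estimators instead, might look like the following. The `demo` DataFrame is synthetic stand-in data generated here (an assumption, not the real admissions file), and the `*_demo` names are hypothetical; with the real data you would pass `Xr`, `yr`, `Xc`, and `yc` before the constant column is added.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the admissions data (NOT the real file).
rng = np.random.RandomState(42)
n = 400
demo = pd.DataFrame({
    'gre': rng.randint(220, 801, size=n),
    'gpa': rng.uniform(2.0, 4.0, size=n).round(2),
    'prestige': rng.randint(1, 5, size=n),
})
# Admission odds rise with gre and gpa and fall with prestige rank.
log_odds = -3.0 + 0.002 * demo['gre'] + 0.7 * demo['gpa'] - 0.5 * demo['prestige']
demo['admit'] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

Xr_demo, yr_demo = demo[['admit', 'gre', 'prestige']], demo['gpa']
Xc_demo, yc_demo = demo[['gre', 'gpa', 'prestige']], demo['admit']

# 5-fold cross-validated R2 for the linear regression.
linreg_r2 = cross_val_score(LinearRegression(), Xr_demo, yr_demo,
                            cv=5, scoring='r2')

# 5-fold cross-validated accuracy for the logistic regression.
# Note: sklearn's LogisticRegression is L2-regularized by default,
# unlike statsmodels' Logit, so coefficients will not match exactly.
logreg = make_pipeline(StandardScaler(), LogisticRegression())
logreg_acc = cross_val_score(logreg, Xc_demo, yc_demo,
                             cv=5, scoring='accuracy')

print('Linear regression mean R2: {:.3f}'.format(linreg_r2.mean()))
print('Logistic regression mean accuracy: {:.3f}'.format(logreg_acc.mean()))
```

These two mean scores are the baselines to compare the trees against in the sections that follow.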


---

### Building regression trees

With `DecisionTreeRegressor`:

1. Build 4 models with different parameters for `max_depth`: `max_depth=1`, `max_depth=2`, `max_depth=3`, and `max_depth=None`
2. Cross-validate the R2 scores of each of the models and compare to the linear regression earlier.
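
One possible way to approach the two steps above is sketched here. The data are synthetic stand-ins generated in the block (an assumption; swap in the real `Xr` and `yr`), and `Xr_demo`, `yr_demo`, and `r2_by_depth` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the regression data (NOT the real admissions file).
rng = np.random.RandomState(0)
n = 400
Xr_demo = pd.DataFrame({
    'admit': rng.randint(0, 2, size=n),
    'gre': rng.randint(220, 801, size=n),
    'prestige': rng.randint(1, 5, size=n),
})
yr_demo = 2.6 + 0.0012 * Xr_demo['gre'] + rng.normal(0, 0.35, size=n)

# Cross-validate one tree per max_depth setting.
r2_by_depth = {}
for depth in [1, 2, 3, None]:
    dtr = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(dtr, Xr_demo, yr_demo, cv=5, scoring='r2')
    r2_by_depth[depth] = scores.mean()

for depth, r2 in r2_by_depth.items():
    print('max_depth={}: mean R2 = {:.3f}'.format(depth, r2))
```

A fully grown tree (`max_depth=None`) often scores worse out of sample than a shallow one because it fits noise in the training folds, which is worth checking against the linear regression baseline.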

---

### Visualizing the regression tree decisions

Use the template code below to create charts that show the logic/branching of your four decision tree regressions from above.

#### Interpreting a regression tree diagram

- First line is the condition used to split that node (go left if true, go right if false)
- `samples` is the number of observations in that node before splitting
- `mse` is the mean squared error calculated by comparing the actual response values in that node against the mean response value in that node
- `value` is the mean response value in that node
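
The quantities in the list above can also be read directly from a fitted tree's `tree_` attribute. The following is a small sketch on synthetic data (an assumption, not the admissions data) that recomputes the root node's `value` and `mse` by hand. Newer scikit-learn releases label the `mse` line `squared_error` in the diagram.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the response jumps by 3 when the first feature exceeds 5.
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * (X_demo[:, 0] > 5) + rng.normal(0, 0.5, size=200)

dtr_demo = DecisionTreeRegressor(max_depth=1).fit(X_demo, y_demo)
tree = dtr_demo.tree_

# Node 0 is the root.
root_samples = tree.n_node_samples[0]   # observations in the node before splitting
root_value = tree.value[0][0][0]        # mean response in the node
root_mse = tree.impurity[0]             # MSE around that mean
root_feature = tree.feature[0]          # index of the feature used for the split
root_threshold = tree.threshold[0]      # go left if feature <= threshold

# Recompute value and mse by hand for comparison.
manual_value = y_demo.mean()
manual_mse = np.mean((y_demo - manual_value) ** 2)

print('samples:', root_samples)
print('value: {:.4f} (manual {:.4f})'.format(root_value, manual_value))
print('mse:   {:.4f} (manual {:.4f})'.format(root_mse, manual_mse))
print('split: X[{}] <= {:.3f}'.format(root_feature, root_threshold))
```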

In [None]:
# TEMPLATE CODE

# initialize the output file object
dot_data = StringIO() 

# My fitted DecisionTreeRegressor object here is: dtr1
# For feature_names I pass the columns of my Xr matrix, converted to a
# plain list, which is the safest form across scikit-learn versions
export_graphviz(dtr1, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=list(Xr.columns))

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())  

---

### Building classification trees

With `DecisionTreeClassifier`:

1. Again build 4 models with different parameters for `max_depth`: `max_depth=1`, `max_depth=2`, `max_depth=3`, and `max_depth=None`
2. Cross-validate the accuracy scores of each of the models and compare to the logistic regression earlier.

Note that now you'll be using the classification task where we are predicting `admit`.
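
A sketch of the classification version follows, again on synthetic stand-in data (an assumption; use the real `Xc` and `yc` in the lab). The names `Xc_demo`, `yc_demo`, `acc_by_depth`, and `baseline` are hypothetical. The majority-class rate is printed as a reference point, since accuracy is only meaningful relative to it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the classification data (NOT the real admissions file).
rng = np.random.RandomState(0)
n = 400
Xc_demo = pd.DataFrame({
    'gre': rng.randint(220, 801, size=n),
    'gpa': rng.uniform(2.0, 4.0, size=n).round(2),
    'prestige': rng.randint(1, 5, size=n),
})
log_odds = (-3.0 + 0.002 * Xc_demo['gre'] + 0.7 * Xc_demo['gpa']
            - 0.5 * Xc_demo['prestige'])
yc_demo = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Stratified folds keep the admit/reject ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

acc_by_depth = {}
for depth in [1, 2, 3, None]:
    dtc = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(dtc, Xc_demo, yc_demo, cv=cv, scoring='accuracy')
    acc_by_depth[depth] = scores.mean()

# Accuracy of always predicting the majority class.
baseline = max(yc_demo.mean(), 1 - yc_demo.mean())

print('baseline accuracy: {:.3f}'.format(baseline))
for depth, acc in acc_by_depth.items():
    print('max_depth={}: mean accuracy = {:.3f}'.format(depth, acc))
```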

---

### Visualize the classification trees

The plotting code is the same as for regression; you just need to change the model you're using for each plot and the feature names.

The output changes somewhat from the regression tree chart. Instead of the MSE, each node now reports its impurity (`gini` by default), and the `value` line tells you the count of each class at that node rather than the mean response.
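
To make the node contents concrete, here is a small sketch on synthetic data (an assumption) that checks the root node of a fitted classifier against hand-computed class proportions and Gini impurity. The raw `tree_.value` entries are normalized first because, depending on the scikit-learn version, they may be stored as counts or as fractions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data.
rng = np.random.RandomState(2)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = (X_demo[:, 0] + rng.normal(0, 2, size=300) > 6).astype(int)

dtc_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Class distribution stored at the root node (node 0), normalized to proportions.
root_value = dtc_demo.tree_.value[0][0]
root_props = root_value / root_value.sum()
root_gini = dtc_demo.tree_.impurity[0]

# Hand-computed equivalents: class shares and Gini = 1 - sum(p_k ** 2).
manual_props = np.bincount(y_demo) / len(y_demo)
manual_gini = 1 - np.sum(manual_props ** 2)

print('class proportions at root:', root_props.round(3))
print('gini at root: {:.4f} (manual {:.4f})'.format(root_gini, manual_gini))
```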

---

### Using GridSearchCV to find the best decision tree classifier

Decision tree regression and classification models in sklearn offer a variety of ways to "pre-prune" the tree, by restricting how many times it can branch and what it can use to do so.

Parameter         | What it does
------------------|-------------
max_depth         | How many levels deep can the decision tree go?
max_features      | How many features are considered when looking for the best split?
max_leaf_nodes    | How many leaf nodes can the tree have in total?
min_samples_leaf  | How many samples must end up in each leaf, at a minimum?
min_samples_split | How many samples must a node contain, at a minimum, before it can be split?

It is not always best to search over _all_ of these in a grid search unless you have a small dataset. Many of them, while not strictly redundant, have very similar effects on your model's fit.
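
As a rough sketch of such a search, the example below grids over just two of the pre-pruning parameters. The data come from `make_classification` (synthetic, an assumption), and the grid values are illustrative rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for a real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

# Search a small grid: 5 depths x 3 leaf sizes = 15 candidate trees.
param_grid = {
    'max_depth': [1, 2, 3, 5, None],
    'min_samples_leaf': [1, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_demo, y_demo)

print('best parameters:', grid.best_params_)
print('best cross-validated accuracy: {:.3f}'.format(grid.best_score_))
```

After fitting, `grid.best_estimator_` holds the best tree refit on all of the data, because `refit=True` is the default.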

Check out the documentation here:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

---

## Switch over to the college stats dataset for the rest of the lab.

We are going to be predicting whether or not a college is public or private. Set up your `X`, `y` variables accordingly.

In [None]:
import pandas as pd
# Relative path, matching how the admissions data was loaded above
col = pd.read_csv('assets/college.csv')

---

### Set up and run the GridSearch on the college data, building the best decision tree classifier.

### Build a bagging classifier using GridSearchCV.

### Build a Random Forest classifier using GridSearchCV.

### Build an ExtraTrees classifier using GridSearchCV.

### Compare the best models. Based on GridSearchCV, which is the best model to build? Build it.
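
One way to organize the comparison is to run a small grid search per model and rank the results by best cross-validated score. This is a sketch under stated assumptions: the data are synthetic (`make_classification`), the grids are deliberately tiny so the example runs quickly, and the `searches`, `results`, `best_name`, and `best_model` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the college data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

# One (estimator, parameter grid) pair per model family.
searches = {
    'decision tree': (DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 4, None]}),
    'bagging': (BaggingClassifier(random_state=0),
                {'n_estimators': [10, 30]}),
    'random forest': (RandomForestClassifier(random_state=0),
                      {'n_estimators': [30], 'max_depth': [3, None]}),
    'extra trees': (ExtraTreesClassifier(random_state=0),
                    {'n_estimators': [30], 'max_depth': [3, None]}),
}

results = {}
for name, (estimator, param_grid) in searches.items():
    search = GridSearchCV(estimator, param_grid, cv=5, scoring='accuracy')
    search.fit(X_demo, y_demo)
    results[name] = search

# Rank the models by their best cross-validated accuracy.
ranked = sorted(results.items(), key=lambda kv: kv[1].best_score_, reverse=True)
for name, search in ranked:
    print('{:<14} {:.3f}  {}'.format(name, search.best_score_, search.best_params_))

# "Build" the winner: best_estimator_ is already refit on all the data.
best_name = ranked[0][0]
best_model = results[best_name].best_estimator_
```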

---

### Print out the "feature importances" of this best model.

The fitted model has an attribute called `.feature_importances_`, which tells us which features were most important relative to the others. Each value lies between 0 and 1 (higher means more important), and the values sum to 1.

An easy way to think about the feature importance is how much that particular variable was used to make decisions. Really though, it also takes into account how much that feature contributed to splitting up the class or reducing the variance.

A feature with higher feature importance reduced the criterion (impurity) more than the other features.

Below, show the feature importances for each variable predicting private vs. not, sorted by most important feature to least.
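
A minimal sketch of the sorting step is given below. It assumes a random forest as the winning model and uses synthetic data with made-up column names (`feature_0` through `feature_5`); with the college data you would use your own `X` columns and your best fitted model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names.
X_arr, y_demo = make_classification(n_samples=300, n_features=6,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['feature_{}'.format(i) for i in range(6)])

best_demo = RandomForestClassifier(n_estimators=50, random_state=0)
best_demo.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(best_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Note that `BaggingClassifier` does not expose `feature_importances_` directly, so if bagging wins the comparison you would need to average the importances of its individual fitted trees instead.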