# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do this on a few datasets, starting with the ones that ship with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and Y.
2. Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Use 5-fold cross-validation.
3. Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Use 5-fold cross-validation.
4. Which score is better? Are the scores significantly different? How can you judge that?

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [58]:
import sklearn.datasets as dataloader
bcancer = dataloader.load_breast_cancer()
X = bcancer.data
Y = bcancer.target

cols = bcancer.feature_names

In [59]:
X = pd.DataFrame(X)

X.columns = cols

In [60]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.ensemble import BaggingClassifier

In [61]:
# Without Bagging
dtc = DecisionTreeClassifier(max_depth=None)

cross_val_score(dtc, X, Y, cv=5)

array([ 0.91304348,  0.93913043,  0.92035398,  0.95575221,  0.90265487])

In [62]:
# With Bagging
dtc = DecisionTreeClassifier(max_depth=None)

bag = BaggingClassifier(dtc, max_samples = 0.5, max_features = 0.5) 

cross_val_score(bag, X, Y, cv=5)

array([ 0.94782609,  0.96521739,  0.97345133,  0.95575221,  0.94690265])

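One rough way to judge whether the two sets of fold scores differ is to compare their means against the fold-to-fold spread and to look at the per-fold differences. The sketch below hard-codes the scores printed above (re-running the cells will give slightly different numbers, since neither model has a fixed random_state) and computes a paired t statistic by hand. With only 5 folds that share training data, treat the result as an indication rather than a rigorous test.

```python
import numpy as np

# Fold scores copied from the two cells above.
dt_scores = np.array([0.91304348, 0.93913043, 0.92035398, 0.95575221, 0.90265487])
bag_scores = np.array([0.94782609, 0.96521739, 0.97345133, 0.95575221, 0.94690265])

for name, scores in [("tree", dt_scores), ("bagging", bag_scores)]:
    print(f"{name:8s} mean={scores.mean():.4f}  std={scores.std(ddof=1):.4f}")

# Paired comparison: cv=5 produces the same 5 folds for both models.
diff = bag_scores - dt_scores
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
print(f"mean difference={diff.mean():.4f}  paired t statistic={t_stat:.2f}")
```

If SciPy is available, `scipy.stats.ttest_rel(bag_scores, dt_scores)` computes the same statistic and also returns a p-value.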
### 1.b Scale (normalize) data
As you may have noticed, the features are not normalized. Do the scores improve with normalization?

1. Normalize the predictors.
2. Build a decision tree classifier and bagging decision tree classifier.
3. Are scores different from non-scaled data?


In [75]:
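# Standardize each column: subtract its mean, then divide by its standard deviation.
# pandas .std() uses ddof=1, so the values differ slightly from StandardScaler (ddof=0).
X_norm = (X - X.mean()) / X.std()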

mean radius                  3.524049
mean texture                 4.301036
mean perimeter              24.298981
mean area                  351.914129
mean smoothness              0.014064
mean compactness             0.052813
mean concavity               0.079720
mean concave points          0.038803
mean symmetry                0.027414
mean fractal dimension       0.007060
radius error                 0.277313
texture error                0.551648
perimeter error              2.021855
area error                  45.491006
smoothness error             0.003003
compactness error            0.017908
concavity error              0.030186
concave points error         0.006170
symmetry error               0.008266
fractal dimension error      0.002646
worst radius                 4.833242
worst texture                6.146258
worst perimeter             33.602542
worst area                 569.356993
worst smoothness             0.022832
worst compactness            0.157336
worst concav

In [71]:
from sklearn.preprocessing import StandardScaler

ssc = StandardScaler()

X_n = ssc.fit_transform(X)

In [72]:
X_n = pd.DataFrame(X_n)

In [73]:
X_n.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | 1.88669 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.24389 | 0.28119 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.94221 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | 1.51187 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955 | 1.152255 | 0.201391 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.93501 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.53934 | 1.371011 | 1.428493 | -0.00956 | -0.56245 | ... | 1.298575 | -1.46677 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.3971 |


In [74]:
X_norm.head()

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0961 | -2.071512 | 1.268817 | 0.98351 | 1.567087 | 3.280628 | 2.650542 | 2.530249 | 2.215566 | 2.253764 | ... | 1.885031 | -1.358098 | 2.301575 | 1.999478 | 1.306537 | 2.614365 | 2.107672 | 2.294058 | 2.748204 | 1.935312 |
| 1 | 1.828212 | -0.353322 | 1.684473 | 1.90703 | -0.826235 | -0.486643 | -0.023825 | 0.547662 | 0.001391 | -0.867889 | ... | 1.80434 | -0.368879 | 1.533776 | 1.888827 | -0.375282 | -0.430066 | -0.14662 | 1.086129 | -0.243675 | 0.280943 |
| 2 | 1.578499 | 0.455786 | 1.565126 | 1.557513 | 0.941382 | 1.052 | 1.36228 | 2.03544 | 0.938859 | -0.397658 | ... | 1.510541 | -0.023953 | 1.346291 | 1.455004 | 0.526944 | 1.08198 | 0.854222 | 1.953282 | 1.151242 | 0.201214 |
| 3 | -0.768233 | 0.253509 | -0.592166 | -0.763792 | 3.280667 | 3.399917 | 1.914213 | 1.450431 | 2.864862 | 4.906602 | ... | -0.281217 | 0.133866 | -0.24972 | -0.549538 | 3.391291 | 3.889975 | 1.987839 | 2.173873 | 6.040726 | 4.930672 |
| 4 | 1.748758 | -1.150804 | 1.775011 | 1.824624 | 0.280125 | 0.538866 | 1.369806 | 1.427237 | -0.009552 | -0.561956 | ... | 1.297434 | -1.465481 | 1.337363 | 1.219651 | 0.220362 | -0.313119 | 0.61264 | 0.728618 | -0.86759 | -0.396751 |

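The two tables above differ slightly in the third decimal place. The sketch below illustrates why: pandas' `.std()` divides by n - 1 (ddof=1), while `StandardScaler` divides by n (ddof=0). It reloads the data so that it runs on its own, and the variable names are illustrative rather than taken from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

X_pandas = (X - X.mean()) / X.std()              # sample std, ddof=1
X_pandas0 = (X - X.mean()) / X.std(ddof=0)       # population std, ddof=0
X_sklearn = StandardScaler().fit_transform(X)    # also uses ddof=0

# The two conventions differ by a constant factor sqrt(n / (n - 1)).
n = len(X)
ratio = X_sklearn[0, 0] / X_pandas.iloc[0, 0]
print(ratio, np.sqrt(n / (n - 1)))

# With ddof=0 the manual version matches StandardScaler.
print(np.allclose(X_pandas0.values, X_sklearn))
```

For 569 rows the factor is very close to 1, so either convention is fine for this exercise.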

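Steps 2 and 3 can be sketched as follows. Tree splits depend only on the ordering of values within each feature, so standardizing should leave the scores essentially unchanged; any small differences would come from floating-point tie-breaking. The sketch places the scaler inside a pipeline so that it is refit on each training fold, and it fixes random_state (an assumption, not part of the cells above) so that random variation does not mask the comparison.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 max_samples=0.5, max_features=0.5,
                                 random_state=0),
}

results = {}
for name, model in models.items():
    # cross_val_score clones the estimator, so reusing `model` is safe.
    raw = cross_val_score(model, X, y, cv=5)
    scaled = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5)
    results[name] = (raw.mean(), scaled.mean())
    print(f"{name:8s} raw={raw.mean():.4f}  scaled={scaled.mean():.4f}")
```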
### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross-validation for the Decision Tree Classifier.
2. Search over a few parameter values to try to improve the score of the classifier.
3. Check the best\_score\_ once you've trained it. Is it better than before?
4. How does the score of the grid-searched DT compare with the score of the Bagging DT?
5. Initialize a GridSearchCV with 5-fold cross-validation for the Bagging Decision Tree Classifier.
6. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator (see example).
    - Note that there are also additional parameters to change (see example).
    - Note that you may end up with a grid too large to search in a short time, so choose smaller ranges of parameters!
    - Make use of the n_jobs parameter to speed up your grid search (-1 uses all cores).
7. Does the score improve for the grid-searched Bagging Classifier?
8. Which score is better? Are the scores significantly different? How could/would you judge that?

---

**EXAMPLE**
```python
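from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Keys prefixed with "base_estimator__" are forwarded to the wrapped tree.
# In scikit-learn >= 1.2 the prefix is "estimator__" instead.
params = {"base_estimator__max_depth": [3, 5, 10, 20],
          "base_estimator__max_features": [None, "sqrt"],  # "auto" meant "sqrt" for classifiers and was later removed
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          "bootstrap_features": [False, True],
          "max_features": [0.5, 0.7, 1.0],
          "max_samples": [0.5, 0.7, 1.0],
          "n_estimators": [2, 5, 10, 20],
         }

bagged_decision_trees = BaggingClassifier(DecisionTreeClassifier())

gsbdt = GridSearchCV(bagged_decision_trees, params, n_jobs=-1, cv=5)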
```
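
The double-underscore names used in the grid can be checked against the estimator itself. As a minimal sketch, `get_params()` lists every name that GridSearchCV will accept for the wrapped model, which also reveals whether the installed scikit-learn version uses the `base_estimator__` or the `estimator__` prefix:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(DecisionTreeClassifier())

# All tunable names, including those of the wrapped tree.
names = sorted(bag.get_params().keys())

# Nested names contain "__" and are forwarded to the wrapped tree.
nested = [name for name in names if "__" in name]
print(nested)
```

Any key in the parameter grid that does not appear in this list will cause the search to fail with an invalid-parameter error.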

In [77]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

In [89]:
params = {"base_estimator__max_depth" : [3,5,8],
         "base_estimator__max_features" : [None, "auto"],
         "base_estimator__min_samples_leaf" : [1,2,3],
         "base_estimator__min_samples_split" : [2,3,5],
         'bootstrap_features' : [False,True],
         'max_features': [0.25,0.5,0.75],
         'max_samples' : [0.25,0.5,0.75],
         'n_estimators' : [2,5]}

bagged = BaggingClassifier(DecisionTreeClassifier())

gs_bd_dt = GridSearchCV(bagged, params, n_jobs=2, cv=5)

In [90]:
gs_bd_dt.fit(X,Y)

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=2,
       param_grid={'n_estimators': [2, 5], 'max_samples': [0.25, 0.5, 0.75], 'base_estimator__min_samples_split': [2, 3, 5], 'base_estimator__max_depth': [3, 5, 8], 'bootstrap_features': [False, True], 'max_features': [0.25, 0.5, 0.75], 'base_estimator__min_samples_leaf': [1, 2, 3], 'base_estimator__max_features': [None, 'auto']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [91]:
gs_bd_dt.best_params_

{'base_estimator__max_depth': 8,
 'base_estimator__max_features': None,
 'base_estimator__min_samples_leaf': 3,
 'base_estimator__min_samples_split': 5,
 'bootstrap_features': False,
 'max_features': 0.75,
 'max_samples': 0.75,
 'n_estimators': 5}

In [92]:
gs_bd_dt.best_score_

0.96660808435852374

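The cells above only search the bagging classifier. Steps 1 to 4, the grid search over a single decision tree, could look like the sketch below. The parameter ranges and the fixed random_state are assumptions chosen to keep the search small and repeatable, not values prescribed by the lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid: 4 * 3 * 3 = 36 candidates, each fit on 5 folds.
tree_params = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 3, 5],
    "min_samples_split": [2, 5, 10],
}

gs_dt = GridSearchCV(DecisionTreeClassifier(random_state=0), tree_params, cv=5)
gs_dt.fit(X, y)

print(gs_dt.best_params_)
print(round(gs_dt.best_score_, 4))
```

The resulting best\_score\_ can be compared with the 0.9666 obtained above for the grid-searched bagging classifier. Keep in mind that best\_score\_ is selected on the same folds it is measured on, so it is slightly optimistic for both models.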
## 2. Diabetes and Regression

scikit-learn ships a dataset of diabetes patients taken from the following study:

- http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
- http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the comparison above, this time using a DecisionTreeRegressor and a BaggingRegressor instead of classifiers.

### 2.a Simple comparison
1. Load the data and create X and Y.
2. Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Use 5-fold cross-validation. What does the score mean? (Look at the documentation!)
3. Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Use 5-fold cross-validation.
4. Which score is better? Are the scores significantly different? How could/would you judge that?

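A minimal sketch of this comparison is shown below. When no `scoring` argument is given, cross_val_score calls the estimator's own `score` method, which for regressors is the coefficient of determination R². It is at most 1 and can be negative when the model does worse than predicting the mean. The choice of 50 estimators and the fixed random_state are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

tree = DecisionTreeRegressor(random_state=0)
bag = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                       n_estimators=50, random_state=0)

# Default scoring for regressors is R^2.
tree_r2 = cross_val_score(tree, X, y, cv=5)
bag_r2 = cross_val_score(bag, X, y, cv=5)

print(f"tree    R^2 per fold: {tree_r2.round(3)}  mean={tree_r2.mean():.3f}")
print(f"bagging R^2 per fold: {bag_r2.round(3)}  mean={bag_r2.mean():.3f}")
```

A fully grown single tree tends to overfit a dataset this small, so averaging many trees would be expected to help noticeably here.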
### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross-validation for the Decision Tree Regressor.
2. Search over a few parameter values to try to improve the score of the regressor.
3. Check the best\_score\_ once you've trained it. Is it better than before?
4. How does the score of the grid-searched DT compare with the score of the Bagging DT?
5. Initialize a GridSearchCV with 5-fold cross-validation for the Bagging Decision Tree Regressor.
6. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator.
    - Note that there are also additional parameters to change.
    - Note that you may end up with a grid too large to search in a short time.
    - Make use of the n_jobs parameter to speed up your grid search.
7. Does the score improve for the grid-searched Bagging Regressor?
8. Which score is better? Are the scores significantly different? How could/would you judge that?

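Steps 5 and 6 could be sketched as follows. The grid is deliberately tiny (16 candidates) and its values are assumptions, not recommendations. Because the nested-parameter prefix changed between scikit-learn releases, the sketch looks it up from the estimator instead of hard-coding it.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

bag = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)

# "estimator" in recent scikit-learn releases, "base_estimator" in older ones.
prefix = "estimator" if "estimator" in bag.get_params() else "base_estimator"

param_grid = {
    f"{prefix}__max_depth": [3, 5],
    f"{prefix}__min_samples_leaf": [1, 5],
    "max_samples": [0.5, 1.0],
    "n_estimators": [10, 20],
}

gs_bag = GridSearchCV(bag, param_grid, cv=5)
gs_bag.fit(X, y)

print(gs_bag.best_params_)
print(round(gs_bag.best_score_, 3))  # mean cross-validated R^2 of the best candidate
```

Passing `n_jobs=-1` to GridSearchCV would spread the fits over all available cores once the grid grows beyond a handful of candidates.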

## [BONUS]: Project 5 data

Repeat the appropriate analysis (classification/regression) for the Project 5 Dataset.