# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do so on a few datasets, starting with the ones bundled with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?
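
One rough way to judge whether two sets of fold scores differ is to compare the gap between their means with the fold-to-fold spread. The sketch below, in plain Python 3, computes a paired t-statistic on the per-fold differences, using the decision tree and bagging scores printed later in this section as input. It assumes both models were scored on the same five folds. Because cross-validation folds share training data, the test is only approximate. The threshold 2.776 is the two-sided 5% critical value of Student's t with 4 degrees of freedom.

```python
import math
import statistics

# Fold scores copied from the decision tree and bagging runs in this notebook
tree_scores = [0.93859649, 0.95614035, 0.93859649, 0.96491228, 0.92920354]
bag_scores = [0.96491228, 0.95614035, 0.92982456, 0.95614035, 0.95575221]


def paired_t_statistic(a, b):
    """Return the paired t-statistic for the per-fold differences b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


t_stat = paired_t_statistic(tree_scores, bag_scores)
T_CRITICAL_4_DOF = 2.776  # two-sided 5% critical value, 4 degrees of freedom

print("tree:    %.4f +/- %.4f" % (statistics.mean(tree_scores), statistics.stdev(tree_scores)))
print("bagging: %.4f +/- %.4f" % (statistics.mean(bag_scores), statistics.stdev(bag_scores)))
print("paired t = %.3f, significant at 5%%: %s" % (t_stat, abs(t_stat) > T_CRITICAL_4_DOF))
```

With these numbers the statistic stays well below the threshold, so the gap between the two means is within what fold-to-fold noise could produce.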

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn import datasets

In [3]:
data = datasets.load_breast_cancer()

In [4]:
data.keys()

['target_names', 'data', 'target', 'DESCR', 'feature_names']

In [5]:
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']

In [6]:
X.head(2)

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


In [7]:
from sklearn import tree, metrics, ensemble, cross_validation, linear_model

In [8]:
cv = cross_validation.KFold(len(y), n_folds=5, shuffle=True)

In [9]:
estimator = tree.DecisionTreeClassifier()

In [10]:
results = cross_validation.cross_val_score(estimator=estimator, X=X, y=y, cv=cv)
print results
print np.mean(results)

[ 0.93859649  0.95614035  0.93859649  0.96491228  0.92920354]
0.945489830772


In [11]:
bag_estimator = ensemble.BaggingClassifier(base_estimator=tree.DecisionTreeClassifier())

In [12]:
results = cross_validation.cross_val_score(estimator=bag_estimator, X=X, y=y, cv=cv)
print results
print np.mean(results)

[ 0.96491228  0.95614035  0.92982456  0.95614035  0.95575221]
0.95255395125


In [13]:
results = cross_validation.cross_val_score(estimator=linear_model.LogisticRegression(), X=X, y=y, cv=cv)
print results
print np.mean(results)

[ 0.96491228  0.93859649  0.92982456  0.97368421  0.94690265]
0.950784039745


In [14]:
bag_estimator = ensemble.BaggingClassifier(base_estimator=linear_model.LogisticRegression())
results = cross_validation.cross_val_score(estimator=bag_estimator, X=X, y=y, cv=cv)
print results
print np.mean(results)

[ 0.96491228  0.92105263  0.92982456  0.96491228  0.9380531 ]
0.943750970346


### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?
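
Decision trees choose splits by comparing feature values against thresholds, so a transformation that preserves the ordering of each feature, such as standardization, should not change which samples end up on each side of a split. The sketch below illustrates this in plain Python 3 with a hand-written Gini split search on one toy feature. The data, the labels, and the helper names (`gini`, `best_split_partition`) are made up for illustration and are not part of scikit-learn.

```python
import statistics


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_partition(values, labels):
    """Return the indices sent left by the threshold with the lowest weighted Gini."""
    best_score, best_left = None, None
    for threshold in sorted(set(values))[:-1]:
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best_score is None or score < best_score:
            best_score, best_left = score, left
    return best_left


# Toy feature on a large scale, with made-up binary labels
raw = [1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 520.0, 578.3]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

# Standardize the feature: subtract the mean, divide by the standard deviation
mean, std = statistics.mean(raw), statistics.pstdev(raw)
scaled = [(v - mean) / std for v in raw]

left_raw = best_split_partition(raw, labels)
left_scaled = best_split_partition(scaled, labels)
print(left_raw)
print(left_scaled)
```

Both calls select the same samples for the left branch. A model that is typically sensitive to feature scale, such as the logistic regression also evaluated in this notebook, can behave quite differently after scaling.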

In [15]:
from sklearn import pipeline, preprocessing

In [16]:
pipe_decision_tree = pipeline.make_pipeline(preprocessing.StandardScaler(),
                                            tree.DecisionTreeClassifier())

In [17]:
results = cross_validation.cross_val_score(estimator=pipe_decision_tree, X=X, y=y, cv=cv)
print results
print np.mean(results)

[ 0.95614035  0.96491228  0.9122807   0.96491228  0.92035398]
0.943719919267


In [18]:
bag_estimator = ensemble.BaggingClassifier(base_estimator=tree.DecisionTreeClassifier())
pipe_decision_bag = pipeline.make_pipeline(preprocessing.StandardScaler(),
                                            bag_estimator)

In [19]:
results = cross_validation.cross_val_score(estimator=pipe_decision_bag, X=X, y=y, cv=cv)
print results
print np.mean(results)

[ 0.97368421  0.97368421  0.92982456  0.96491228  0.94690265]
0.957801583605


In [20]:
pipe_decision_lr = pipeline.make_pipeline(preprocessing.StandardScaler(),
                                          linear_model.LogisticRegression())

In [21]:
results = cross_validation.cross_val_score(estimator=pipe_decision_lr, X=X, y=y, cv=cv)
print results
print np.mean(results)

[ 0.99122807  0.98245614  0.99122807  0.97368421  0.96460177]
0.980639652228


In [22]:
bag_estimator = ensemble.BaggingClassifier(base_estimator=linear_model.LogisticRegression())
pipe_bag_lr = pipeline.make_pipeline(preprocessing.StandardScaler(),
                                     bag_estimator)

In [23]:
# Score the bagged logistic regression pipeline built in the previous cell
results = cross_validation.cross_val_score(estimator=pipe_bag_lr, X=X, y=y, cv=cv)
print results
print np.mean(results)


### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search for a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?
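
Before launching a search it helps to estimate how many models it will fit: the number of candidates is the product of the lengths of all value lists, and each candidate is fitted once per fold. The sketch below, in plain Python 3, counts this for the two classifier grids used in this section. The helper `count_fits` is a made-up name for illustration, and the count ignores the final refit on the whole dataset.

```python
from itertools import product

# Parameter grids as defined in this section of the notebook
decision_tree_parameters = {
    'decisiontreeclassifier__max_depth': [3, 5, 7],
    'decisiontreeclassifier__max_features': [1, 5, 10, 0.1, 0.3, 0.5, 0.9, "sqrt", "log2", None],
    'decisiontreeclassifier__criterion': ['gini', 'entropy'],
}

decision_tree_bag_parameters = {
    'baggingclassifier__base_estimator__max_depth': [3, 5, 7],
    'baggingclassifier__base_estimator__max_features': [1, 5, 10, 0.1, 0.3, 0.5, 0.9, "sqrt", "log2", None],
    'baggingclassifier__base_estimator__criterion': ['gini'],
    'baggingclassifier__bootstrap': [False, True],
    'baggingclassifier__n_estimators': [5, 10, 100, 200],
}


def count_fits(param_grid, n_folds):
    """Return (number of candidate settings, number of cross-validation fits)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * n_folds


print(count_fits(decision_tree_parameters, n_folds=5))      # (60, 300)
print(count_fits(decision_tree_bag_parameters, n_folds=5))  # (240, 1200)
```

Every extra parameter multiplies the total, which is why the bagging grid grows so quickly once the ensemble's own parameters are added to those of the base estimator.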

In [24]:
from sklearn import grid_search

In [25]:
pipe_decision_tree = pipeline.make_pipeline(preprocessing.StandardScaler(),
                                            tree.DecisionTreeClassifier())

In [26]:
pipe_decision_tree

Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('decisiontreeclassifier', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])

In [27]:
decision_tree_parameters = {
    'decisiontreeclassifier__max_depth': [3, 5, 7],
    'decisiontreeclassifier__max_features': [1, 5, 10, 0.1, 0.3, 0.5, 0.9, "sqrt", "log2", None],
    'decisiontreeclassifier__criterion': ['gini', 'entropy']
}

In [28]:
gs = grid_search.GridSearchCV(estimator=pipe_decision_tree, param_grid=decision_tree_parameters, cv=cv, verbose=1, n_jobs=-1)

In [29]:
gs.fit(X, y)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    1.6s finished


GridSearchCV(cv=sklearn.cross_validation.KFold(n=569, n_folds=5, shuffle=True, random_state=None),
       error_score='raise',
       estimator=Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('decisiontreeclassifier', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'decisiontreeclassifier__max_depth': [3, 5, 7], 'decisiontreeclassifier__criterion': ['gini', 'entropy'], 'decisiontreeclassifier__max_features': [1, 5, 10, 0.1, 0.3, 0.5, 0.9, 'sqrt', 'log2', None]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [30]:
gs.best_estimator_

Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('decisiontreeclassifier', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=10, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])

In [31]:
bag_estimator = ensemble.BaggingClassifier(base_estimator=tree.DecisionTreeClassifier())
pipe_decision_bag = pipeline.make_pipeline(preprocessing.StandardScaler(),
                                            bag_estimator)

In [32]:
pipe_decision_bag

Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('baggingclassifier', BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            ...estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False))])

In [33]:
sorted(pipe_decision_bag.get_params().keys())

['baggingclassifier',
 'baggingclassifier__base_estimator',
 'baggingclassifier__base_estimator__class_weight',
 'baggingclassifier__base_estimator__criterion',
 'baggingclassifier__base_estimator__max_depth',
 'baggingclassifier__base_estimator__max_features',
 'baggingclassifier__base_estimator__max_leaf_nodes',
 'baggingclassifier__base_estimator__min_samples_leaf',
 'baggingclassifier__base_estimator__min_samples_split',
 'baggingclassifier__base_estimator__min_weight_fraction_leaf',
 'baggingclassifier__base_estimator__presort',
 'baggingclassifier__base_estimator__random_state',
 'baggingclassifier__base_estimator__splitter',
 'baggingclassifier__bootstrap',
 'baggingclassifier__bootstrap_features',
 'baggingclassifier__max_features',
 'baggingclassifier__max_samples',
 'baggingclassifier__n_estimators',
 'baggingclassifier__n_jobs',
 'baggingclassifier__oob_score',
 'baggingclassifier__random_state',
 'baggingclassifier__verbose',
 'baggingclassifier__warm_start',
 'standardscal

In [34]:
decision_tree_bag_parameters = {    
    'baggingclassifier__base_estimator__max_depth': [3, 5, 7],
    'baggingclassifier__base_estimator__max_features': [1, 5, 10, 0.1, 0.3, 0.5, 0.9, "sqrt", "log2", None],
    'baggingclassifier__base_estimator__criterion': ['gini'],
    'baggingclassifier__bootstrap': [False, True],
    'baggingclassifier__n_estimators': [5, 10, 100, 200]
}


In [35]:
gs = grid_search.GridSearchCV(estimator=pipe_decision_bag, param_grid=decision_tree_bag_parameters, 
                              cv=cv, verbose=1, n_jobs=-1)

In [36]:
gs.fit(X, y)

Fitting 5 folds for each of 240 candidates, totalling 1200 fits


[Parallel(n_jobs=-1)]: Done  92 tasks      | elapsed:   11.4s
[Parallel(n_jobs=-1)]: Done 242 tasks      | elapsed:   33.1s
[Parallel(n_jobs=-1)]: Done 492 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 842 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed:  3.2min finished


GridSearchCV(cv=sklearn.cross_validation.KFold(n=569, n_folds=5, shuffle=True, random_state=None),
       error_score='raise',
       estimator=Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('baggingclassifier', BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            ...estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'baggingclassifier__bootstrap': [False, True], 'baggingclassifier__base_estimator__max_depth': [3, 5, 7], 'baggingclassifier__base_estimator__criterion': ['gini'], 'baggingclassifier__n_estimators': [5, 10, 100, 200], 'baggingclassifier__base_estimator__max_features': [1, 5, 10, 0.1, 0.3, 0.5, 0.9, 'sqrt', 'log2', None]},
       pre_dispatch='2*n_jobs', refit=True, scoring=Non

In [37]:
gs.best_params_

{'baggingclassifier__base_estimator__criterion': 'gini',
 'baggingclassifier__base_estimator__max_depth': 7,
 'baggingclassifier__base_estimator__max_features': 0.3,
 'baggingclassifier__bootstrap': False,
 'baggingclassifier__n_estimators': 10}

In [38]:
decision_tree_bag_parameters = {    
    #'baggingclassifier__base_estimator__max_depth': [1, 2, 7, 10, 20, None],
    #'baggingclassifier__base_estimator__max_features': [1, "sqrt", None],
    #'baggingclassifier__n_estimators': [10, 100, 200, 400]
    'baggingclassifier__base_estimator__max_depth': [10],
    'baggingclassifier__base_estimator__max_features': ["sqrt"],
    'baggingclassifier__n_estimators': [600, 1000, 2000]
}


In [39]:
gs = grid_search.GridSearchCV(estimator=pipe_decision_bag, param_grid=decision_tree_bag_parameters, 
                              cv=cv, verbose=1, n_jobs=-1)

In [40]:
gs.fit(X, y)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   25.5s finished


GridSearchCV(cv=sklearn.cross_validation.KFold(n=569, n_folds=5, shuffle=True, random_state=None),
       error_score='raise',
       estimator=Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('baggingclassifier', BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            ...estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'baggingclassifier__base_estimator__max_depth': [10], 'baggingclassifier__n_estimators': [600, 1000, 2000], 'baggingclassifier__base_estimator__max_features': ['sqrt']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [41]:
gs.best_params_

{'baggingclassifier__base_estimator__max_depth': 10,
 'baggingclassifier__base_estimator__max_features': 'sqrt',
 'baggingclassifier__n_estimators': 1000}

In [42]:
gs.best_score_

0.96309314586994732

In [43]:
gs.best_params_

{'baggingclassifier__base_estimator__max_depth': 10,
 'baggingclassifier__base_estimator__max_features': 'sqrt',
 'baggingclassifier__n_estimators': 1000}

In [44]:
# Now with more trees!
gs.best_score_

0.96309314586994732

## 2. Diabetes and Regression

The scikit-learn library includes a dataset of diabetes patients taken from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.
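
As a reminder of what the bagging wrapper does for regression: it fits the same base model on many bootstrap resamples of the training rows and averages their predictions. The sketch below spells that out in plain Python 3, with a one-split regression stump standing in for a full decision tree. The toy data and all function names are invented for illustration, and scikit-learn's `BaggingRegressor` offers more options than this.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a one-split regression stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = statistics.mean(left), statistics.mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the overall mean
        overall = statistics.mean(ys)
        return (xs[0], overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagging(xs, ys, n_estimators, rng):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagging(stumps, x):
    """Average the predictions of all stumps."""
    return statistics.mean([predict_stump(s, x) for s in stumps])


# Toy data: a noisy increasing trend
rng = random.Random(0)
xs = [i / 10 for i in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

stumps = fit_bagging(xs, ys, n_estimators=25, rng=rng)
print(predict_bagging(stumps, 0.5), predict_bagging(stumps, 2.5))
```

Because each stump sees a slightly different resample, the split points differ from stump to stump, and the averaged prediction is smoother than that of any single stump.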

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?
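
The cells in this section score the regressors with `scoring='r2'`. As a reminder of what that number means, the sketch below computes the coefficient of determination by hand in plain Python 3 on made-up values. The function name `r2_score_manual` is hypothetical. The point is that R² equals 1 for perfect predictions, 0 for a model that always predicts the mean, and drops below 0 for a model that does worse than the mean, which is how a cross-validated score can come out negative.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets for illustration
y_true = [100.0, 150.0, 200.0, 250.0]

perfect = r2_score_manual(y_true, y_true)
mean_only = r2_score_manual(y_true, [175.0] * 4)
worse_than_mean = r2_score_manual(y_true, [250.0, 100.0, 250.0, 100.0])

print(perfect, mean_only, worse_than_mean)  # 1.0 0.0 -3.0
```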

In [45]:
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = data['data']
y = data['target']

In [46]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

def do_cross_val(model):
    scores = cross_validation.cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='r2')
    return scores.mean(), scores.std()

In [47]:
dtr = tree.DecisionTreeRegressor()
do_cross_val(dtr)

(-0.1381552312490964, 0.099290837989405342)

In [48]:
bdtr = ensemble.BaggingRegressor(DecisionTreeRegressor())
do_cross_val(bdtr)

(0.38933386026646832, 0.057585796826444401)

### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search for a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?
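
The cells in this notebook use the older `sklearn.cross_validation` and `sklearn.grid_search` modules. For readers on a recent scikit-learn release under Python 3, the following is a minimal sketch of the same kind of search with `sklearn.model_selection`. It assumes a recent release is installed. The grid is deliberately tiny and is not the one used in this lab, and the fixed `random_state` values are an addition for repeatability. The name of the nested-parameter prefix depends on the installed version, so the sketch looks it up instead of hard-coding it.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
bag = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)

# Recent releases name the wrapped model 'estimator'; older ones use 'base_estimator'
prefix = "estimator__" if "estimator" in bag.get_params() else "base_estimator__"

# Deliberately tiny grid: 2 x 2 = 4 candidates, 20 cross-validation fits
param_grid = {
    prefix + "max_depth": [3, 5],
    "n_estimators": [10, 20],
}

# Pass n_jobs=-1 to GridSearchCV to spread the fits over all available cores
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(bag, param_grid, cv=cv, scoring="r2")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```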


In [49]:
dtr = DecisionTreeRegressor()

In [50]:
params = {"splitter": ['best', 'random'],
          "max_depth": [3,5,10,20],
          "max_features": [None, "auto"],
          "min_samples_leaf": [1, 3, 5, 7, 10],
          "min_samples_split": [2, 5, 7]
         }
    

gsdtr = grid_search.GridSearchCV(dtr, params, n_jobs=-1, cv=5)

In [51]:
gsdtr.fit(X, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [None, 'auto'], 'splitter': ['best', 'random'], 'min_samples_split': [2, 5, 7], 'max_depth': [3, 5, 10, 20], 'min_samples_leaf': [1, 3, 5, 7, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [52]:
gsdtr.best_params_

{'max_depth': 3,
 'max_features': 'auto',
 'min_samples_leaf': 5,
 'min_samples_split': 5,
 'splitter': 'random'}

In [53]:
gsdtr.best_score_

0.38043699046389667

In [54]:
params = {"base_estimator__splitter": ['best', 'random'],
          "base_estimator__max_depth": [3,5,10,20],
          "base_estimator__max_features": [None, "auto"],
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }
    

gsbdtr = grid_search.GridSearchCV(bdtr, params, n_jobs=-1, cv=5)

In [56]:
gsbdtr.fit(X, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [2, 5, 10, 20], 'max_samples': [0.5, 0.7, 1.0], 'base_estimator__min_samples_split': [2, 5, 7], 'base_estimator__max_depth': [3, 5, 10, 20], 'bootstrap_features': [False, True], 'base_estimator__splitter': ['best', 'random'], 'max_features': [0.5, 0.7, 1.0], 'base_estimator__min_samples_leaf': [1, 3, 5, 7, 10], 'base_estimator__max_features': [None, 'auto']},
       pre_dispatch='2*

In [57]:
gsbdtr.best_params_

{'base_estimator__max_depth': 20,
 'base_estimator__max_features': 'auto',
 'base_estimator__min_samples_leaf': 5,
 'base_estimator__min_samples_split': 7,
 'base_estimator__splitter': 'random',
 'bootstrap_features': False,
 'max_features': 1.0,
 'max_samples': 0.7,
 'n_estimators': 20}

In [58]:
gsbdtr.best_score_

0.48394816295048837

## Bonus: Project 6 data

Repeat the analysis for the Project 6 dataset.