# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do this on a few datasets, starting with the ones bundled with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?
> Answer: compare the difference between the mean scores with the standard deviation of the fold scores returned by cross_val_score

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
data = load_breast_cancer()

In [None]:
data.keys()

In [None]:
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.Series(data['target'])

In [None]:
X.head()

In [None]:
y.value_counts()/y.count()

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20

In [None]:
dt = DecisionTreeClassifier()

In [None]:
def do_cross_val(model):
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
    return scores.mean(), scores.std()

do_cross_val(dt)

In [None]:
bdt = BaggingClassifier(DecisionTreeClassifier())

In [None]:
do_cross_val(bdt)
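
One way to judge whether the difference matters is to score both models on identical folds and compare the gap between the means with the fold-to-fold spread. The sketch below is a rough check rather than a formal significance test. The `StratifiedKFold` splitter, the fixed `random_state` values, and the names `X_bc`, `y_bc`, `tree_scores`, and `bag_scores` are assumptions added here for reproducibility; they are not part of the lab code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Use the same splitter for both models so the fold scores are directly comparable.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_bc, y_bc, cv=folds
)
bag_scores = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
    X_bc, y_bc, cv=folds,
)

gap = bag_scores.mean() - tree_scores.mean()
spread = max(tree_scores.std(), bag_scores.std())

print(f"tree:    {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"bagging: {bag_scores.mean():.3f} +/- {bag_scores.std():.3f}")
# Rule of thumb: a gap smaller than the fold spread is not convincing evidence.
print(f"gap = {gap:.3f}, largest fold std = {spread:.3f}")
```

With only five folds the standard deviation is itself a noisy estimate, so treat the comparison as indicative only.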

### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.
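- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?
> No, the scores are the same (within error): tree splits depend only on the ordering of each feature's values, which scaling preserves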


In [None]:
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline

In [None]:
pipedt = make_pipeline(RobustScaler(),
                       DecisionTreeClassifier())

pipebdt = make_pipeline(RobustScaler(),
                        BaggingClassifier(DecisionTreeClassifier()))

In [None]:
do_cross_val(pipedt)

In [None]:
do_cross_val(pipebdt)
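
Because a tree split only asks whether a feature is above or below a threshold, rescaling a feature is not expected to change which samples end up on each side. A minimal sketch to check this, assuming fixed `random_state` values (not used in the lab code) so that both runs make the same random choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

raw_tree = DecisionTreeClassifier(random_state=0)
scaled_tree = make_pipeline(RobustScaler(), DecisionTreeClassifier(random_state=0))

# cv=5 without shuffling gives both models the same deterministic folds.
raw_scores = cross_val_score(raw_tree, X_bc, y_bc, cv=5)
scaled_scores = cross_val_score(scaled_tree, X_bc, y_bc, cv=5)

# Small differences, if any, stem from numerical details such as tie handling,
# not from a genuine benefit of scaling.
print("raw:   ", raw_scores.round(3))
print("scaled:", scaled_scores.round(3))
print("max abs difference:", abs(raw_scores - scaled_scores).max())
```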

### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search over a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
> Yes
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
> Answer: they are the same (within error), Grid search improved the score of simple DT
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
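    - Note that you'll have to prefix the tree's parameter names so that they reach the wrapped estimator (`base_estimator__max_depth`, or `estimator__max_depth` from scikit-learn 1.2 on)
    - Note that the Bagging Classifier has additional parameters of its own to tune
    - Note that you may end up with a grid space too large to search in a short time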
    - Make use of the n_jobs parameter to speed up your grid search

- Does the score improve for the Grid-searched Bagging Classifier?
> Yes
- Which score is better? Are the scores significantly different? How can you judge that?
> The grid search bagging classifier is the best one
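
Before launching a search it can help to count how many fits the grid implies. The sketch below uses `ParameterGrid` from scikit-learn for that; the grid itself (`bagging_grid`) is a hypothetical, reduced example, and the `estimator__` prefix assumes scikit-learn 1.2 or later.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for illustration only.
bagging_grid = {
    "estimator__max_depth": [3, 5, 10, 20],
    "estimator__min_samples_leaf": [1, 3, 5, 7, 10],
    "bootstrap_features": [False, True],
    "max_samples": [0.5, 0.7, 1.0],
    "n_estimators": [2, 5, 10, 20],
}

n_candidates = len(ParameterGrid(bagging_grid))  # 4 * 5 * 2 * 3 * 4
n_fits = n_candidates * 5  # one fit per candidate per fold with cv=5

print(n_candidates, n_fits)  # 480 2400
```

Every extra parameter multiplies the count, which is why a full bagging grid grows large so quickly.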

In [None]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

In [None]:
params = {"max_depth": [3,5,10,20],
          # "sqrt" is what the removed "auto" alias meant for classifiers
          "max_features": [None, "sqrt"],
          "min_samples_leaf": [1, 3, 5, 7, 10],
          "min_samples_split": [2, 5, 7]
         }

gsdt = GridSearchCV(dt, params, n_jobs=-1, cv=5)

In [None]:
gsdt.fit(X, y)

In [None]:
gsdt.best_params_

In [None]:
gsdt.best_score_

In [None]:
bdt.get_params()

In [None]:
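# The "estimator__" prefix applies to scikit-learn >= 1.2; older releases use
# "base_estimator__". Check the names printed by bdt.get_params() above.
params = {"estimator__max_depth": [3,5,10,20],
          # "sqrt" is what the removed "auto" alias meant for classifiers
          "estimator__max_features": [None, "sqrt"],
          "estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }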

gsbdt = GridSearchCV(bdt, params, n_jobs=-1, cv=5)

In [None]:
gsbdt.fit(X, y)

In [None]:
gsbdt.best_params_

In [None]:
gsbdt.best_score_
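
`best_score_` is only a mean over the folds. To judge whether two searches differ significantly, the matching standard deviation can be read from `cv_results_`. A minimal sketch on a deliberately tiny grid; the names `search`, `best_mean`, and `best_std` and the fixed `random_state` are assumptions made here for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Tiny grid so the example runs quickly.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5]},
    cv=5,
)
search.fit(X_bc, y_bc)

# best_index_ points at the winning row of the cv_results_ arrays.
best = search.best_index_
best_mean = search.cv_results_["mean_test_score"][best]
best_std = search.cv_results_["std_test_score"][best]

print(f"best params: {search.best_params_}")
print(f"best score:  {best_mean:.3f} +/- {best_std:.3f}")
```

The same lookup works on `gsdt` and `gsbdt` once they have been fitted.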

## 2. Diabetes and Regression

scikit-learn ships a dataset of diabetes patients taken from this study:

- http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
- http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
> Answer: r2
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?
> The Bagging Regressor is better
> Answer: compare the difference between the mean scores with the standard deviation of the fold scores returned by cross_val_score


In [None]:
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = data['data']
y = data['target']

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

def do_cross_val(model):
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='r2')
    return scores.mean(), scores.std()


In [None]:
dtr = DecisionTreeRegressor()
do_cross_val(dtr)

In [None]:
bdtr = BaggingRegressor(DecisionTreeRegressor())
do_cross_val(bdtr)
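
R² is one reasonable choice of score; an error expressed in the units of the target is another. The sketch below computes both for a single tree, using the `neg_mean_squared_error` scorer and converting it to a root mean squared error by hand. The names `X_db`, `y_db`, and `rmse` and the fixed `random_state` are assumptions added for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X_db, y_db = load_diabetes(return_X_y=True)
tree = DecisionTreeRegressor(random_state=0)

r2_scores = cross_val_score(tree, X_db, y_db, cv=5, scoring="r2")

# scikit-learn scorers follow "higher is better", so the MSE comes back negated.
neg_mse = cross_val_score(tree, X_db, y_db, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-neg_mse)

print(f"R^2:  {r2_scores.mean():.3f} +/- {r2_scores.std():.3f}")
print(f"RMSE: {rmse.mean():.1f} +/- {rmse.std():.1f}")
```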

### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search over a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
> Yes
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
> Answer: they are the same (within error), Grid search improved the score of simple DT
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
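    - Note that you'll have to prefix the tree's parameter names so that they reach the wrapped estimator (`base_estimator__max_depth`, or `estimator__max_depth` from scikit-learn 1.2 on)
    - Note that the Bagging Regressor has additional parameters of its own to tune
    - Note that you may end up with a grid space too large to search in a short time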
    - Make use of the n_jobs parameter to speed up your grid search

- Does the score improve for the Grid-searched Bagging Regressor?
> Yes
- Which score is better? Are the scores significantly different? How can you judge that?
> The grid search bagging regressor is the best one

In [None]:
DecisionTreeRegressor()

In [None]:
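params = {"splitter": ['best', 'random'],
          "max_depth": [3,5,10,20],
          # For regressors the removed "auto" alias was identical to None,
          # so "sqrt" is used here as the second option.
          "max_features": [None, "sqrt"],
          "min_samples_leaf": [1, 3, 5, 7, 10],
          "min_samples_split": [2, 5, 7]
         }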

gsdtr = GridSearchCV(dtr, params, n_jobs=-1, cv=5)

In [None]:
gsdtr.fit(X, y)

In [None]:
gsdtr.best_params_

In [None]:
gsdtr.best_score_

In [None]:
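# The "estimator__" prefix applies to scikit-learn >= 1.2; older releases use
# "base_estimator__". Check bdtr.get_params() for the names in your version.
params = {"estimator__splitter": ['best', 'random'],
          "estimator__max_depth": [3,5,10,20],
          # For regressors the removed "auto" alias was identical to None,
          # so "sqrt" is used here as the second option.
          "estimator__max_features": [None, "sqrt"],
          "estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }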

gsbdtr = GridSearchCV(bdtr, params, n_jobs=-1, cv=5)

In [None]:
gsbdtr.fit(X, y)

In [None]:
gsbdtr.best_params_

In [None]:
gsbdtr.best_score_
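
When the full grid is too large to search in a short time, one alternative is a randomized search, which samples a fixed number of combinations instead of trying them all. This swaps in `RandomizedSearchCV` for the `GridSearchCV` used in the lab, so it is a different technique, sketched here under assumptions: the reduced parameter lists, `n_iter=10`, and the fixed `random_state` values are illustrative choices. The prefix lookup is one way to cope with the `base_estimator` to `estimator` rename across scikit-learn versions.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

X_db, y_db = load_diabetes(return_X_y=True)
bag = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)

# Read the nested-parameter prefix from get_params() instead of hard-coding it.
prefix = "estimator" if "estimator" in bag.get_params() else "base_estimator"

param_distributions = {
    f"{prefix}__max_depth": [3, 5, 10, 20],
    f"{prefix}__min_samples_leaf": [1, 3, 5, 7, 10],
    "max_samples": [0.5, 0.7, 1.0],
    "n_estimators": [5, 10, 20],
}

# n_iter=10 samples 10 of the 180 possible combinations.
random_search = RandomizedSearchCV(
    bag, param_distributions, n_iter=10, cv=5, random_state=0
)
random_search.fit(X_db, y_db)

print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

A randomized search may miss the best combination in the grid, so its score is a lower bound on what an exhaustive search over the same lists would find.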

## Bonus: Project 6 data

Repeat the analysis for the Project 6 Dataset