# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do so on a few datasets, starting with the ones offered by scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y.
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [19]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
# sklearn.cross_validation has been removed; these helpers live in sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn import datasets

import matplotlib.pyplot as plt
%matplotlib inline

In [195]:
X, y = datasets.load_breast_cancer(return_X_y=True)

# dataset = datasets.load_breast_cancer()
# dataset.keys()
# X, y = dataset['data'], dataset['target']

In [196]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [209]:
# 2. Decision tree classifier
dt = DecisionTreeClassifier(random_state=1)

# dt.fit(X_train, y_train)
scores = cross_val_score(dt, X_train, y_train, cv=5)
np.mean(scores)

0.90283663704716344

In [210]:
from sklearn.ensemble import BaggingClassifier

bagging = BaggingClassifier(dt, random_state=1)
bagging.fit(X_train, y_train)
scores = cross_val_score(bagging, X_train, y_train, cv=5)
np.mean(scores)

0.93691045796308958

#### Score with bagging is better
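Whether the gap is meaningful depends on how much the scores vary from fold to fold. The following is a minimal sketch, not part of the original solution, that compares the mean and standard deviation of the per-fold accuracies for both models. It assumes the whole `X, y` dataset and the same `random_state=1` as the cells above; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=1)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=1)

# Per-fold accuracies. With an integer cv and a classifier, the folds are
# stratified and not shuffled, so both models are scored on the same splits.
tree_scores = cross_val_score(tree, X, y, cv=5)
bagged_scores = cross_val_score(bagged, X, y, cv=5)

for name, s in [("tree", tree_scores), ("bagging", bagged_scores)]:
    print("%s: mean=%.3f std=%.3f" % (name, s.mean(), s.std()))

# Fold-by-fold differences (bagging minus tree) on identical splits.
diff = bagged_scores - tree_scores
print("mean difference: %.3f, std of differences: %.3f" % (diff.mean(), diff.std()))
```

If the mean difference is small relative to the spread across folds, five folds are probably not enough to separate the two models with confidence.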

### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, each with a scaling preprocessing step followed by either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from those on the non-scaled data?

In [218]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

pipe = make_pipeline(StandardScaler(),DecisionTreeClassifier(random_state=1))

pdt = pipe.fit(X_train,y_train)

scores = cross_val_score(pdt, X_train, y_train, cv=5)
np.mean(scores)

0.90283663704716344

In [219]:
pipe1 = make_pipeline(StandardScaler(),BaggingClassifier(dt))

bdt = pipe1.fit(X_train,y_train)

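# Note: this cell scores on X_test, y_test, while the cells above use X_train, y_train,
# and BaggingClassifier has no random_state here, so the result is not directly comparable.
scores = cross_val_score(bdt, X_test, y_test, cv=5)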
np.mean(scores)

0.94110223583907793

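### Scaling doesn't help here

The decision tree score is identical with and without scaling, as expected: tree splits depend only on the ordering of each feature's values, not on their scale. The bagging score is not directly comparable, because that cell was scored on X_test and without a fixed random_state.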
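A decision tree splits on thresholds, so a monotonic rescaling of a feature moves the thresholds but should leave the partition of the samples unchanged, up to floating-point ties. The sketch below checks this on the whole dataset. It assumes `random_state=1` for both trees so that they are comparable; the names `raw_tree` and `scaled_tree` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same tree settings, with and without standardization in front.
raw_tree = DecisionTreeClassifier(random_state=1)
scaled_tree = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))

# Passing the pipeline to cross_val_score refits the scaler inside each fold.
raw_score = cross_val_score(raw_tree, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled_tree, X, y, cv=5).mean()

print("raw:    %.4f" % raw_score)
print("scaled: %.4f" % scaled_score)
```

The two numbers should match or be very close, in contrast with distance-based or gradient-based models, where scaling often matters.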

### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross-validation for the Decision Tree Classifier.
- Search over a few values of the parameters in order to improve the score of the classifier.
- Use the whole X, y dataset for your test.
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross-validation for the Bagging Decision Tree Classifier.
- Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator.
    - Note that there are also additional parameters to change.
    - Note that you may end up with a grid space too large to search in a short time.
    - Make use of the n_jobs parameter to speed up your grid search.
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?
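The note about parameter names refers to the double-underscore convention: parameters of the wrapped tree are addressed through the bagging wrapper as `<wrapper parameter>__<tree parameter>`. A minimal sketch follows, with a deliberately tiny grid chosen only for illustration. The wrapper parameter is called `base_estimator` in older scikit-learn releases and `estimator` in newer ones, so the sketch looks the name up instead of hard-coding it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=1)

# Pick whichever name this scikit-learn version uses for the wrapped estimator.
prefix = "estimator" if "estimator" in bagging.get_params() else "base_estimator"

param_grid = {
    prefix + "__max_depth": [2, 4],  # parameter of the wrapped tree
    "n_estimators": [5, 10],         # parameters of the bagging wrapper
    "max_features": [0.5, 1.0],
}

# n_jobs=1 keeps the sketch portable; raise it (or use -1) to use more cores.
search = GridSearchCV(bagging, param_grid, cv=5, n_jobs=1)
search.fit(X, y)

print(search.best_score_)
print(search.best_params_)
```

The grid has 2 x 2 x 2 = 8 combinations, so with 5 folds it fits 40 models plus one final refit; the count multiplies with every value added, which is why large grids quickly become slow.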

In [100]:
from sklearn.model_selection import GridSearchCV

DT = DecisionTreeClassifier()

parameters = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'max_features': [1, 2, 3, 4],
              'max_leaf_nodes': [2, 3, 4, 5],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6],
              'min_samples_split': [2, 3, 4, 5],
              "criterion": ["gini", "entropy"]}
    
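# Note: cv is not set here, so the library default is used rather than
# the 5 folds the exercise asks for (the repr below shows cv=None).
gs = GridSearchCV(DT, parameters, n_jobs=4)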

gs.fit(X, y)

# tree_model = gs.best_estimator_
# print(gs.best_score_, gs.best_params_)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=4,
       param_grid={'max_leaf_nodes': [2, 3, 4, 5], 'min_samples_leaf': [1, 2, 3, 4, 5, 6], 'min_samples_split': [2, 3, 4, 5], 'criterion': ['gini', 'entropy'], 'max_features': [1, 2, 3, 4], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [106]:
print(gs.best_score_)
gs.best_params_

0.945518453427


{'criterion': 'gini',
 'max_depth': 5,
 'max_features': 3,
 'max_leaf_nodes': 4,
 'min_samples_leaf': 3,
 'min_samples_split': 5}

In [114]:
# Bagging Grid Search
BDT = BaggingClassifier(DT)
parameters = {"n_estimators": [1, 3, 5, 7, 9, 11],
              'max_features': [1, 2, 3, 4],
              "bootstrap": [True, False],
              "bootstrap_features": [True, False]}
    
gsb = GridSearchCV(BDT, parameters,cv=5, n_jobs=4)

gsb.fit(X, y)

print(gsb.best_score_)
gsb.best_params_


0.95079086116


{'bootstrap': False,
 'bootstrap_features': True,
 'max_features': 4,
 'n_estimators': 7}

#### The score for GridSearch with Bagging is the best score so far.

## 2. Diabetes and Regression

Scikit-learn includes a dataset of diabetic patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y.
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [263]:
X, y = datasets.load_diabetes(return_X_y= True)

In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [243]:
DTR = DecisionTreeRegressor(random_state=1)

scores = cross_val_score(DTR, X, y, cv=5)
np.mean(scores)

-0.14046274921096624

In [274]:
from sklearn.ensemble import BaggingRegressor
BDTR = BaggingRegressor(DTR, random_state=1)

scores = cross_val_score(BDTR,X, y, cv=5)
np.mean(scores)

0.36729430127225626

### Score with Bagging is Better
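On the question of which score to use: when no `scoring` argument is given, `cross_val_score` falls back on the regressor's own `score` method, which is R^2. R^2 becomes negative when a model does worse on a fold than predicting that fold's mean target, which is what the single tree's score above indicates. As an alternative, the sketch below reports the root mean squared error, in the units of the target. It assumes the `neg_mean_squared_error` scorer; the helper `cv_rmse` is hypothetical and written only for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

tree = DecisionTreeRegressor(random_state=1)
bagged = BaggingRegressor(DecisionTreeRegressor(random_state=1), random_state=1)

def cv_rmse(model):
    """Mean per-fold RMSE over 5 folds (illustrative helper)."""
    # scikit-learn scorers follow "higher is better", so MSE is returned negated.
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return np.sqrt(-neg_mse).mean()

tree_rmse = cv_rmse(tree)
bagged_rmse = cv_rmse(bagged)

print("tree RMSE:    %.2f" % tree_rmse)
print("bagging RMSE: %.2f" % bagged_rmse)
```

Lower is better for RMSE, the opposite of R^2, so take care when comparing it with the scores reported elsewhere in this lab.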

### 2.b Grid Search

Grid search is a great way to improve the performance of a regressor. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross-validation for the Decision Tree Regressor.
- Search over a few values of the parameters in order to improve the score of the regressor.
- Use the whole X, y dataset for your test.
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross-validation for the Bagging Decision Tree Regressor.
- Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator.
    - Note that there are also additional parameters to change.
    - Note that you may end up with a grid space too large to search in a short time.
    - Make use of the n_jobs parameter to speed up your grid search.
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?

In [276]:
DTR = DecisionTreeRegressor(random_state=1)

parameters = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8],
              'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'max_leaf_nodes': [10, 12, 14, 16, 20],
              'min_samples_leaf': [8, 10, 12, 14, 16, 18]}
    
gs = GridSearchCV(DTR, parameters, cv=5 , n_jobs=4)

gs.fit(X,y)

print(gs.best_score_)
gs.best_params_

0.387340157608


{'max_depth': 5,
 'max_features': 6,
 'max_leaf_nodes': 12,
 'min_samples_leaf': 14}

In [278]:
BDTR = BaggingRegressor(DTR)
parameters = {"n_estimators": [ 9, 11,13,15,17],
              'max_features': [1, 2, 3, 4,5,6,7,8,9,10],
              "bootstrap": [True, False],
              "bootstrap_features": [True, False]}
    
gsb = GridSearchCV(BDTR, parameters,cv=5, n_jobs=4)

gsb.fit(X, y)

print(gsb.best_score_)
gsb.best_params_


0.435245816963


{'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 9,
 'n_estimators': 15}

### The score for bagging with Grid Search CV is better.
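To judge whether a gap between two grid-searched scores is meaningful, it helps to look at the fold-to-fold spread that GridSearchCV records next to each mean. The sketch below reads it from `cv_results_` for the best parameter combination. The small grid is an assumption chosen to keep the example fast; it is not the grid used above.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Deliberately small, illustrative grid.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [8, 14]}
search = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=5)
search.fit(X, y)

# best_index_ points at the row of cv_results_ that produced best_score_.
best = search.best_index_
mean = search.cv_results_["mean_test_score"][best]
std = search.cv_results_["std_test_score"][best]

print("best params:", search.best_params_)
print("best R^2: %.3f +/- %.3f across folds" % (mean, std))
```

If the standard deviation is comparable to the difference between the two models' best scores, the ranking could plausibly flip on a different split of the data.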

## Bonus: Project 6 data

Repeat the analysis for the Project 6 Dataset