# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do so on a few datasets, starting with the ones that ship with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

In [3]:
breast_cancer = datasets.load_breast_cancer()

In [4]:
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
X.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [5]:
y = breast_cancer.target

In [6]:
breast_cancer.target_names

array(['malignant', 'benign'], 
      dtype='|S9')
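
The 1.a comparison itself is not run in the cells above. A minimal sketch of it, assuming default hyperparameters and 5-fold cross-validation as the exercise asks, could look like the following. The names `X_bc`, `y_bc`, `tree_scores` and `bagged_scores` are illustrative, and `random_state=0` is fixed only to make the sketch repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load features and target as plain arrays (illustrative names)
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# A single tree and a bagged ensemble of trees, both with default settings
tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# One accuracy value per fold; keep the arrays so mean and std come from the same run
tree_scores = cross_val_score(tree, X_bc, y_bc, cv=5)
bagged_scores = cross_val_score(bagged, X_bc, y_bc, cv=5)

print("tree:   %.3f +/- %.3f" % (tree_scores.mean(), tree_scores.std()))
print("bagged: %.3f +/- %.3f" % (bagged_scores.mean(), bagged_scores.std()))
```

Comparing the gap between the two means with the fold-to-fold standard deviation gives a first, informal sense of whether the difference is meaningful.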

### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, each with a scaling preprocessing step followed by either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?

In [7]:
p_one = make_pipeline(StandardScaler(), DecisionTreeClassifier())
p_two = make_pipeline(StandardScaler(), BaggingClassifier(DecisionTreeClassifier()))

In [10]:
cross_val_score(p_one, X, y, cv=10).mean()

0.91051443263330734

In [11]:
cross_val_score(p_one, X, y, cv=10).std()

0.03592690753476218

In [9]:
cross_val_score(p_two, X, y, cv=10).mean()

0.94757583614207941

In [12]:
cross_val_score(p_two, X, y, cv=10).std()

0.028894587032114023

In [13]:
cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()

0.91057600898798707

In [14]:
cross_val_score(DecisionTreeClassifier(), X, y, cv=10).std()

0.032175539738384877

In [15]:
cross_val_score(BaggingClassifier(DecisionTreeClassifier()), X, y, cv=10).mean()

0.94905366865439456

In [16]:
cross_val_score(BaggingClassifier(DecisionTreeClassifier()), X, y, cv=10).std()

0.03716585148054953
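
In the cells above, each mean and each standard deviation comes from a separate `cross_val_score` call, so the two numbers for a given model are computed on different random fits. The sketch below is one possible way to make the comparison tighter: it scores every model once on the same folds and then runs a paired t-test on the per-fold scores. The shuffled `StratifiedKFold`, the fixed seeds and the use of `scipy.stats.ttest_rel` are assumptions of this sketch, and a t-test on cross-validation folds is only a rough heuristic because the folds share training data.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# One splitter with a fixed seed, so every model is scored on identical folds
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "scaled tree": make_pipeline(StandardScaler(),
                                 DecisionTreeClassifier(random_state=0)),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), random_state=0),
}

# Call cross_val_score once per model and keep the per-fold scores
fold_scores = {name: cross_val_score(model, X_bc, y_bc, cv=folds)
               for name, model in models.items()}

for name, scores in fold_scores.items():
    print("%-12s mean=%.3f std=%.3f" % (name, scores.mean(), scores.std()))

# Paired t-test on the fold-by-fold differences between bagging and a single tree
t_stat, p_value = stats.ttest_rel(fold_scores["bagging"], fold_scores["tree"])
print("bagging vs tree: t=%.2f, p=%.3f" % (t_stat, p_value))
```

Tree splits are threshold comparisons on one feature at a time, so rescaling the features should leave a decision tree essentially unchanged. That is consistent with the nearly identical scaled and unscaled tree scores recorded above.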

### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search for a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?

In [21]:
gs = GridSearchCV(DecisionTreeClassifier(),
                  {'criterion': ['gini', 'entropy'], 
                  'max_depth': np.arange(1,15,1),
                  'min_samples_leaf': np.arange(15,50,5)},
                  cv=5,
                 verbose=1)
gs.fit(X,y)

Fitting 5 folds for each of 196 candidates, totalling 980 fits


[Parallel(n_jobs=1)]: Done 980 out of 980 | elapsed:    9.5s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'criterion': ['gini', 'entropy'], 'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]), 'min_samples_leaf': array([15, 20, 25, 30, 35, 40, 45])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [22]:
gs.best_score_

0.93145869947275928

In [23]:
gs2 = GridSearchCV(BaggingClassifier(DecisionTreeClassifier()),
                  {'n_estimators': np.arange(1,15,1), 
                  'max_samples': np.arange(0.1,1.0,0.1),
                  'max_features': np.arange(0.1,1.0,0.1),
                  "base_estimator__max_features": [None, "auto"],
                  "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
                  "base_estimator__min_samples_split": [2, 5, 7],
                  'bootstrap_features': [False, True]},
                  cv=5,
                  verbose=1)
gs2.fit(X,y)

Fitting 5 folds for each of 68040 candidates, totalling 340200 fits


KeyboardInterrupt: 
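
The search above was interrupted because the grid is very large. One way to check the size of a grid before fitting anything is `ParameterGrid`, as in this sketch. It rebuilds the same parameter dictionary without training any model; the names `bagging_grid`, `n_candidates` and `n_fits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same parameter grid as in the interrupted search above
bagging_grid = {
    "n_estimators": np.arange(1, 15, 1),
    "max_samples": np.arange(0.1, 1.0, 0.1),
    "max_features": np.arange(0.1, 1.0, 0.1),
    "base_estimator__max_features": [None, "auto"],
    "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
    "base_estimator__min_samples_split": [2, 5, 7],
    "bootstrap_features": [False, True],
}

# ParameterGrid enumerates every combination; len() gives the candidate count
n_candidates = len(ParameterGrid(bagging_grid))
n_fits = n_candidates * 5  # 5-fold cross-validation fits each candidate 5 times

print(n_candidates, n_fits)  # 68040 340200, matching the log above
```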

In [19]:
gs2.best_score_

NameError: name 'gs2' is not defined
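
Since the full search did not finish, `gs2` has no `best_score_` here. A much smaller grid combined with `n_jobs=-1` is one way to get a result in reasonable time. The sketch below is an assumption-laden example rather than the intended solution: the grid values are arbitrary, `gs_bag` and `small_grid` are illustrative names, and the prefix for the wrapped tree's parameters is looked up at run time because it is `base_estimator` in the scikit-learn version used in this notebook and `estimator` in newer releases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# Pick whichever name this scikit-learn version uses for the wrapped estimator
prefix = "estimator" if "estimator" in bagging.get_params() else "base_estimator"

# Deliberately small grid (illustrative values): 3 * 2 * 2 = 12 candidates
small_grid = {
    "n_estimators": [5, 10, 20],
    "max_samples": [0.5, 1.0],
    prefix + "__min_samples_leaf": [1, 5],
}

# n_jobs=-1 asks joblib to spread the fits over all available cores
gs_bag = GridSearchCV(bagging, small_grid, cv=5, n_jobs=-1)
gs_bag.fit(X_bc, y_bc)

print(gs_bag.best_params_)
print(gs_bag.best_score_)
```

If a wider search is still wanted, `RandomizedSearchCV` with a fixed `n_iter` is a common alternative to exhaustively enumerating the grid.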

## 2. Diabetes and Regression

scikit-learn ships a dataset of diabetes patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [16]:
diabetes = datasets.load_diabetes()

In [17]:
diabetes.data[0]

array([ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
       -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613])

In [18]:
X = pd.DataFrame(diabetes.data, columns=['age', 'sex', 'body_mass_index',\
                                         'average_blood_pressure', 'bsm1',\
                                         'bsm2', 'bsm3', 'bsm4', 'bsm5', 'bsm6'])

In [19]:
y = diabetes.target
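
The 2.a comparison is not run in the cells above. A possible sketch follows, assuming default hyperparameters and two common regression scores: R² (the default for regressors) and mean squared error (requested as `neg_mean_squared_error` and negated back). The names `X_db`, `y_db` and `results` are illustrative, and the seeds are fixed only for repeatability.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X_db, y_db = load_diabetes(return_X_y=True)

tree_reg = DecisionTreeRegressor(random_state=0)
bagged_reg = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)

results = {}
for name, model in [("tree", tree_reg), ("bagging", bagged_reg)]:
    # R^2: higher is better, can be negative for a poor model
    r2 = cross_val_score(model, X_db, y_db, cv=5, scoring="r2")
    # scikit-learn returns negated MSE so that "greater is better"; flip the sign back
    mse = -cross_val_score(model, X_db, y_db, cv=5,
                           scoring="neg_mean_squared_error")
    results[name] = {"r2": r2, "mse": mse}
    print("%-8s R2=%.3f +/- %.3f   MSE=%.1f" % (name, r2.mean(), r2.std(), mse.mean()))
```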

### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search for a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?


In [20]:
gs = GridSearchCV(DecisionTreeClassifier(),
                  {'criterion': ['gini', 'entropy'], 
                  'max_depth': np.arange(1,15,1),
                  'min_samples_leaf': np.arange(15,50,5)},
                  cv=5)
gs.fit(X,y)



GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'criterion': ['gini', 'entropy'], 'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]), 'min_samples_leaf': array([15, 20, 25, 30, 35, 40, 45])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [21]:
gs.best_score_

0.020361990950226245

In [22]:
gs2 = GridSearchCV(BaggingClassifier(DecisionTreeClassifier()),
                  {'n_estimators': np.arange(1,15,1), 
                  'max_samples': np.arange(0.1,1.0,0.1),
                  'max_features': np.arange(0.1,1.0,0.1)},
                  cv=5)
gs2.fit(X,y)

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
        ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]), 'max_samples': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9]), 'max_features': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [23]:
gs2.best_score_

0.022624434389140271
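
Both searches above fit classifiers (`DecisionTreeClassifier`, `BaggingClassifier`) on a continuous target. Every distinct target value is then treated as its own class and `best_score_` is classification accuracy, which would explain scores of about 0.02. A sketch of the same searches with regressors is shown below. It assumes the default R² scoring and uses illustrative grid values and names (`gs_reg`, `gs_bag_reg`); `criterion` is left out of the grid because its accepted values differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_db, y_db = load_diabetes(return_X_y=True)

# Grid search over a single regression tree, scored with R^2 by default
gs_reg = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": np.arange(1, 15, 1),
     "min_samples_leaf": np.arange(15, 50, 5)},
    cv=5,
)
gs_reg.fit(X_db, y_db)
print("tree:   ", gs_reg.best_params_, gs_reg.best_score_)

# Small illustrative grid for the bagged version of the same model
gs_bag_reg = GridSearchCV(
    BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0),
    {"n_estimators": [10, 20],
     "max_samples": [0.5, 1.0]},
    cv=5,
)
gs_bag_reg.fit(X_db, y_db)
print("bagging:", gs_bag_reg.best_params_, gs_bag_reg.best_score_)
```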

## Bonus: Project 6 data

Repeat the analysis for the Project 6 Dataset