# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109A Introduction to Data Science: 

## Homework 8 AC 209: Trees and ensemble methods


**Harvard University**<br/>
**Fall 2018**<br/>
**Instructors**: Pavlos Protopapas, Kevin Rader

<hr style="height:2pt">


In [1]:
# RUN THIS CELL FOR FORMAT
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

In [4]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets

%matplotlib inline

### Question 1

**1) Describe the main differences between bagging and adaptive boosting.**

Bagging is a parallel method: it trains N learners independently, each on a bootstrap sample of the training set, and then combines their predictions by voting or averaging. Adaptive boosting, on the other hand, works sequentially: at each step, a bootstrap sample is drawn from a weighted dataset, where the weight of each observation corresponds to its probability of being chosen. The weights are updated based on the performance of the previous learner, so that observations it misclassified become more likely to be drawn at the next step.
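The reweighting step described above can be sketched in a few lines. This is a minimal illustration of one round of the discrete AdaBoost update for binary labels in {-1, +1}, not the exact procedure used by any particular library; the function name `adaboost_reweight` and the toy arrays are hypothetical.

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the discrete AdaBoost weight update (labels in {-1, +1}).

    Assumes the weighted error is strictly between 0 and 1.
    """
    miss = (y_true != y_pred)
    # Weighted error of the current weak learner.
    err = np.sum(weights * miss) / np.sum(weights)
    # Vote of the learner: large when its weighted error is small.
    alpha = 0.5 * np.log((1.0 - err) / err)
    # Up-weight the mistakes, down-weight the correct predictions, renormalize.
    new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_weights / new_weights.sum(), alpha

# Toy example: five observations, uniform initial weights, one mistake (index 1).
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])
w0 = np.full(5, 0.2)
w1, alpha = adaboost_reweight(w0, y_true, y_pred)

# Weighted bootstrap sample for the next learner: the misclassified
# observation is now the most likely one to be drawn.
rng = np.random.default_rng(0)
resample_idx = rng.choice(len(w1), size=len(w1), p=w1)
```

In this toy case the misclassified observation ends up holding half of the total weight, while in bagging every observation would keep the same sampling probability at every step.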

**2) Why do we use the word "gradient" in gradient boosting?**

Gradient boosting performs gradient descent on the specified loss function, that is, it minimizes the empirical risk, but it does so in prediction space rather than in parameter space. At each step, we fit a learner $h_m$ to the negative gradient of the loss with respect to the current predictions (for squared-error loss, these are simply the residuals of the current best model), and we update our model with:

$$F_m(x)=F_{m-1}(x)+\eta h_m(x)$$

This update approximates a gradient descent step in prediction space, with the learning rate $\eta$ playing the role of the step size.
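The update rule can be made concrete with a small sketch. It assumes squared-error loss (so the negative gradient is the residual), a one-dimensional synthetic regression problem, and a hand-written depth-1 tree; `fit_stump` and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (one threshold) to targets r by least squares."""
    best = None
    for t in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left = x <= t
        left_mean, right_mean = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - left_mean) ** 2) + np.sum((r[~left] - right_mean) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda z: np.where(z <= t, left_mean, right_mean)

# Synthetic data: noisy sine curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + rng.normal(0, 0.1, 200)

eta = 0.1                                # learning rate (step size)
F = np.full_like(y, y.mean())            # F_0: constant prediction
losses = [np.mean((y - F) ** 2)]
for m in range(100):
    residuals = y - F                    # negative gradient of 1/2 (y - F)^2 w.r.t. F
    h = fit_stump(x, residuals)          # weak learner h_m fitted to the residuals
    F = F + eta * h(x)                   # F_m = F_{m-1} + eta * h_m
    losses.append(np.mean((y - F) ** 2))
```

Plotting `losses` shows the training error going down step by step, which is the behavior expected from a descent method with a small step size.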

**3) Describe three improvements of XGBoost over the conventional implementation of Boosted Trees.**

- Can use L1 or L2 regularization.
- Incorporates a sparsity-aware split finding algorithm to handle different types of sparsity patterns in the data.
- Uses a distributed weighted quantile sketch algorithm to effectively handle weighted data when proposing candidate split points.
- Uses a parallelized tree-construction algorithm that greatly reduces training time.
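As a sketch of what the regularization in the first bullet refers to, the objective minimized by XGBoost can be written as below. The symbols are not defined elsewhere in this notebook and follow the notation of the XGBoost paper: $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves and $w$ its vector of leaf weights. The constants $\gamma$, $\lambda$ and $\alpha$ correspond to the `gamma`, `reg_lambda` and `reg_alpha` parameters of `XGBClassifier`.

$$\mathcal{L}=\sum_i l(y_i,\hat{y}_i)+\sum_k \Omega(f_k),\qquad \Omega(f)=\gamma T+\frac{1}{2}\lambda\lVert w\rVert_2^2+\alpha\lVert w\rVert_1$$

The $\lambda$ term is the L2 penalty and the $\alpha$ term is the L1 penalty on the leaf weights, while $\gamma$ penalizes the number of leaves.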

### Question 2



Here, we will compare some of the top ensemble methods for classification. We will look at AdaBoost, XGBoost, LightGBM and CatBoost.

- To install XGBoost, run `pip install xgboost`.
- To install LightGBM, run `pip install lightgbm`.
- To install CatBoost, run `conda install -c conda-forge catboost` if using conda, or `pip install catboost` if not.

We will be using a different dataset than what we're used to, so as to test the capabilities of these advanced classifiers. We will be playing with the Forest Cover Type dataset, a classification dataset in which each observation is a 30 m × 30 m patch of forest labeled with the type of tree cover that grows there. We will try to predict the primary species of each patch from 54 predictors, e.g. elevation, slope, distance to water, etc.

Here are the main predictors of the dataset:
- Elevation
- Aspect
- Slope
- Horizontal_Distance_To_Hydrology
- Vertical_Distance_To_Hydrology
- Horizontal_Distance_To_Roadways
- Hillshade_9am
- Hillshade_Noon
- Hillshade_3pm
- Horizontal_Distance_To_Fire_Points
- Wilderness_Area (one-hot encoded, 4 binary columns)
- Soil_Type (one-hot encoded, 40 binary columns)

Response:
- Cover_Type (integer from 1 to 7, one value per cover type)

For more details on the dataset, visit http://archive.ics.uci.edu/ml/datasets/Covertype 

**1) Import the cover type dataset from sklearn.datasets with `datasets.fetch_covtype`. Use `return_X_y=True` and split the data into train and test sets. You can downsample the data to 10% of the full dataset if needed.**

**2) Train a DecisionTreeClassifier, RandomForestClassifier, AdaBoostClassifier, LGBMClassifier, XGBClassifier, and CatBoostClassifier on the data.**

**Make sure that you use the sklearn-like interfaces:**

- DecisionTreeClassifier, RandomForestClassifier and AdaBoostClassifier are given by sklearn
- XGBClassifier can be accessed with `from xgboost import XGBClassifier`
- LGBMClassifier can be accessed with `from lightgbm.sklearn import LGBMClassifier`
- CatBoostClassifier can be accessed with `from catboost import CatBoostClassifier`


**3) Time both training (`.fit` method) and inference (`.predict` method), and show the classification accuracy for all classifiers. For this dataset, subtract 1 from your array of labels so that the label format plays nicely with CatBoost. Comment on the results.**

In [2]:
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm.sklearn import LGBMClassifier
from catboost import CatBoostClassifier
from timeit import default_timer as timer

In [52]:
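# Load the Forest Cover Type dataset as (features, labels) arrays
X, y = datasets.fetch_covtype(return_X_y=True)

# Downsample: keep the first 20% of the rows
X = X[:int(len(X)*0.2)]
y = y[:int(len(y)*0.2)]

# Shift the labels from 1..7 to 0..6 so that they play nicely with CatBoost
y = y - 1

# Stratified split, with 33% of the observations held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)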
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(77855, 54) (77855,) (38347, 54) (38347,)


In [6]:
def fit_predict_time(clf, X_train, X_test, y_train, y_test):
    """Fit clf, time training and inference, and print the test accuracy.

    Returns (accuracy, training time in seconds, inference time in seconds).
    """
    # training
    start = timer()
    clf.fit(X_train, y_train)
    end = timer()
    train_time = end - start

    # inference (timed separately from the accuracy computation below)
    start = timer()
    clf.predict(X_test)
    end = timer()
    inf_time = end - start

    # accuracy on the test set
    acc = clf.score(X_test, y_test)
    print(clf.__class__.__name__)
    print('Accuracy: %.5f' % acc)
    print('Training time: %.5fs' % train_time)
    print('Inference time: %.5fs' % inf_time)

    return acc, train_time, inf_time

In [58]:
clf =  DecisionTreeClassifier()
fit_predict_time(clf, X_train, X_test, y_train, y_test);

DecisionTreeClassifier
Accuracy: 0.93371
Training time: 0.98405s
Inference time: 0.01637s


In [59]:
clf =  AdaBoostClassifier()
fit_predict_time(clf, X_train, X_test, y_train, y_test);

AdaBoostClassifier
Accuracy: 0.76903
Training time: 6.79260s
Inference time: 0.60482s


In [60]:
clf =  RandomForestClassifier()
fit_predict_time(clf, X_train, X_test, y_train, y_test);



RandomForestClassifier
Accuracy: 0.93548
Training time: 1.50544s
Inference time: 0.12063s


In [48]:
clf = XGBClassifier()
fit_predict_time(clf, X_train, X_test, y_train, y_test);

XGBClassifier
Accuracy: 0.98305
Training time: 0.13361s
Inference time: 0.00077s


In [61]:
clf = LGBMClassifier()
fit_predict_time(clf, X_train, X_test, y_train, y_test);

LGBMClassifier
Accuracy: 0.92112
Training time: 5.76897s
Inference time: 1.64111s


In [62]:
clf = CatBoostClassifier(loss_function='MultiClass', iterations=10, verbose=False)
fit_predict_time(clf, X_train, X_test, y_train, y_test);

CatBoostClassifier
Accuracy: 0.81159
Training time: 1.73230s
Inference time: 0.15873s
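The per-classifier cells above could also be driven by a single loop that collects the results for a side-by-side comparison. The following is only a sketch on a small synthetic dataset: the `benchmark` helper, the `*_demo` variable names and the choice of two sklearn classifiers are assumptions made for the illustration, and the same dictionary could hold the XGBoost, LightGBM and CatBoost estimators.

```python
from timeit import default_timer as timer

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def benchmark(classifiers, X_train, X_test, y_train, y_test):
    """Time fit and predict for each classifier; return rows sorted by accuracy."""
    rows = []
    for name, clf in classifiers.items():
        start = timer()
        clf.fit(X_train, y_train)
        train_time = timer() - start

        start = timer()
        clf.predict(X_test)
        inf_time = timer() - start

        acc = clf.score(X_test, y_test)
        rows.append((name, acc, train_time, inf_time))
    # Best accuracy first.
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Small synthetic 3-class problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=0)

results = benchmark(
    {"DecisionTree": DecisionTreeClassifier(random_state=0),
     "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0)},
    Xtr_demo, Xte_demo, ytr_demo, yte_demo)

for name, acc, train_time, inf_time in results:
    print('%-14s accuracy=%.5f  train=%.5fs  inference=%.5fs'
          % (name, acc, train_time, inf_time))
```

Keeping every classifier in one dictionary makes it harder to accidentally evaluate them on different splits, which matters when the comparison is about timing as well as accuracy.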


**4) Let's now play with a high-dimensional dataset. Load the Labeled Faces in the Wild dataset with `datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=20)`. Split the data into train and test sets (30% test). Use `random_state=209` for the train test split.**

In [12]:
X,y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=20)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify = y, random_state=209)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(2025, 2914) (2025,) (998, 2914) (998,)


**5) Again, train all the classifiers listed above and time both training and inference. With this dimensionality, you will have to increase the `n_jobs` parameter of the ensemble methods to reduce training time. Evaluate the algorithms with `n_estimators=10` and `n_jobs=10`. You should expect XGBoost to take a couple of minutes to run. If your machine cannot handle it, reduce the number of jobs and estimators to 5 (but make sure that you remain consistent across estimators). Comment on the results.**

In [13]:
clf =  DecisionTreeClassifier()
fit_predict_time(clf, X_train, X_test, y_train, y_test);

DecisionTreeClassifier
Accuracy: 0.19840
Training time: 13.64126s
Inference time: 0.00241s


In [17]:
clf =  AdaBoostClassifier(n_estimators=10)
fit_predict_time(clf, X_train, X_test, y_train, y_test);

AdaBoostClassifier
Accuracy: 0.18637
Training time: 7.01525s
Inference time: 0.03597s


In [10]:
clf =  RandomForestClassifier(n_estimators = 10)
fit_predict_time(clf, X_train, X_test, y_train, y_test);

RandomForestClassifier
Accuracy: 0.29559
Training time: 1.58273s
Inference time: 0.01612s


In [93]:
clf = XGBClassifier(n_estimators=10, n_jobs=20)
fit_predict_time(clf, X_train, X_test, y_train, y_test);

XGBClassifier
Accuracy: 0.39679
Training time: 48.13600s
Inference time: 0.56427s


In [94]:
clf = LGBMClassifier(n_estimators=10, n_jobs=20)
fit_predict_time(clf, X_train, X_test, y_train, y_test);

LGBMClassifier
Accuracy: 0.32665
Training time: 100.97716s
Inference time: 0.02705s


In [None]:
clf = CatBoostClassifier(loss_function='MultiClass', verbose=False, n_estimators=10)
fit_predict_time(clf, X_train, X_test, y_train, y_test);

**6) How did the high dimensionality affect each classifier? Comment on the results.**

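In general, the high dimensionality reduces the predictive power of our classifiers: with 2,914 features and only 2,025 training observations, the best test accuracy is about 0.40 (XGBoost). Training also becomes much slower, with XGBoost and LightGBM taking roughly 48 s and 101 s for only 10 estimators. Most of the ensemble methods still work better than the simple decision tree (0.198) in this case: the random forest reaches 0.296, LightGBM 0.327 and XGBoost 0.397. AdaBoost (0.186) is the exception.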