# Bagging and Boosting Lab

In this lab we will practice using bagged tree ensembles (random forest and extra trees regressors) and boosted tree classifiers (AdaBoost and gradient boosting) on the SF health code violation data.

---

## 1. Load and inspect the data

In the [asset folder](../../assets/datasets/violations_words.csv) you can find the dataset of SF health code violations and inspections.

The dataset has many columns, and you will not want or need to use many of them!

**NOTE: the last 373 columns are word appearances produced by `CountVectorizer` from the inspection/violation description.**
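One way to separate the word columns from the rest is positional slicing. The sketch below assumes the word columns really are the trailing columns of the loaded frame; the helper name `split_word_columns` and the tiny demo frame are illustrative and not part of the dataset.

```python
import pandas as pd

def split_word_columns(df, n_word_cols=373):
    """Split a frame into (leading metadata columns, trailing word columns)."""
    meta = df.iloc[:, :-n_word_cols]
    words = df.iloc[:, -n_word_cols:]
    return meta, words

# Made-up frame standing in for the real CSV: 2 metadata + 3 word columns.
demo = pd.DataFrame({
    "score": [90, 72],
    "neighborhood": ["A", "B"],
    "word_clean": [1, 0],
    "word_rodent": [0, 2],
    "word_sink": [1, 1],
})
meta, words = split_word_columns(demo, n_word_cols=3)
print(meta.columns.tolist())
print(words.columns.tolist())
```

With the real file, the same helper could be applied to `pd.read_csv("../../assets/datasets/violations_words.csv")` using the default of 373 word columns.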

---

We will be training several classification and regression models. For classification, you can choose to predict:

    neighborhood   :  the neighborhood of the business receiving the inspection/violation
    score_code     :  health code score category
    zip code       :  zip code the business is in

For the regression problem, you will predict either:

    score          :  the received health code score

**OR [BONUS]**: aggregate the total number of violations by `neighborhood` (using groupby). If you choose to use this as your dependent variable for regression, you must *adjust this count for the number of businesses and the population in the neighborhood!*
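The bonus target could be built roughly as follows. This is only a sketch of one possible normalization: the `business_id` column name, the example rows, and the population figures are all made up for illustration, and real population numbers would have to come from an outside source.

```python
import pandas as pd

# Made-up rows, one per violation. Column names are assumptions.
violations = pd.DataFrame({
    "business_id": [1, 1, 2, 3, 3, 3],
    "neighborhood": ["Mission", "Mission", "Mission", "Sunset", "Sunset", "Sunset"],
})
# Made-up population figures, indexed by neighborhood.
population = pd.Series({"Mission": 1000, "Sunset": 500}, name="population")

# Count violations and distinct businesses per neighborhood.
by_hood = violations.groupby("neighborhood").agg(
    n_violations=("business_id", "size"),
    n_businesses=("business_id", "nunique"),
)
by_hood = by_hood.join(population)

# Adjust the raw count for the number of businesses, then for population.
by_hood["violations_per_business"] = by_hood["n_violations"] / by_hood["n_businesses"]
by_hood["violations_per_business_per_1k"] = (
    by_hood["violations_per_business"] / (by_hood["population"] / 1000)
)
print(by_hood)
```

Other adjustments (for example, modelling the count with businesses and population as predictors) are equally defensible; the point is only that the raw count should not be used directly.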

---

### 1.1 Inspect the data and create your X, Y data for the classification and regression problems


    import pandas as pd
    import numpy as np
    import patsy

    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

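    import sklearn.datasets as dataloader

    # Load scikit-learn's built-in breast cancer dataset:
    # X is the feature matrix, Y the binary target (0 = malignant, 1 = benign).
    bcancer = dataloader.load_breast_cancer()
    X = bcancer.data
    Y = bcancer.target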

    X[0:2]

    array([[  1.79900000e+01,   1.03800000e+01,   1.22800000e+02,
              1.00100000e+03,   1.18400000e-01,   2.77600000e-01,
              3.00100000e-01,   1.47100000e-01,   2.41900000e-01,
              7.87100000e-02,   1.09500000e+00,   9.05300000e-01,
              8.58900000e+00,   1.53400000e+02,   6.39900000e-03,
              4.90400000e-02,   5.37300000e-02,   1.58700000e-02,
              3.00300000e-02,   6.19300000e-03,   2.53800000e+01,
              1.73300000e+01,   1.84600000e+02,   2.01900000e+03,
              1.62200000e-01,   6.65600000e-01,   7.11900000e-01,
              2.65400000e-01,   4.60100000e-01,   1.18900000e-01],
           [  2.05700000e+01,   1.77700000e+01,   1.32900000e+02,
              1.32600000e+03,   8.47400000e-02,   7.86400000e-02,
              8.69000000e-02,   7.01700000e-02,   1.81200000e-01,
              5.66700000e-02,   5.43500000e-01,   7.33900000e-01,
              3.39800000e+00,   7.40800000e+01,   5.22500000e-03,
              1.30800000e-02,   1.86000000e-02,   1.34000000e-02,

---

## 2. Decision Tree Regressor

1. Train a decision tree regressor on the regression problem.
2. Evaluate the score with 5-fold cross-validation.
3. Make a scatter plot of the predicted vs. actual scores for each of the 5 folds. Do they match?
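The steps above can be sketched as follows. Because the lab leaves the choice of features open, this example uses synthetic data from `make_regression` as a stand-in; `X_reg`, `y_reg`, and the `max_depth=5` setting are assumptions to be replaced with your own design matrix, the `score` target, and your own hyperparameters.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; swap in the lab's features and score target.
X_reg, y_reg = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(max_depth=5, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# R^2 per fold (the default scorer for regressors).
tree_scores = cross_val_score(tree, X_reg, y_reg, cv=cv)
print("R^2 per fold:", np.round(tree_scores, 3), "mean:", round(tree_scores.mean(), 3))

# One predicted-vs-actual panel per fold; points on the dashed diagonal
# are perfect predictions.
fig, axes = plt.subplots(1, 5, figsize=(20, 4), sharex=True, sharey=True)
lims = [y_reg.min(), y_reg.max()]
for ax, (train_idx, test_idx) in zip(axes, cv.split(X_reg)):
    tree.fit(X_reg[train_idx], y_reg[train_idx])
    y_pred = tree.predict(X_reg[test_idx])
    ax.scatter(y_reg[test_idx], y_pred, alpha=0.5)
    ax.plot(lims, lims, color="k", linestyle="--")
    ax.set_xlabel("actual")
axes[0].set_ylabel("predicted")
```

Reusing the same `KFold` object for scoring and plotting keeps the plotted folds identical to the scored ones.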


---

## 3. Random Forest Regressor

1. Train a random forest regressor on the regression problem and predict your dependent variable.
2. Evaluate the score with 5-fold cross-validation.
3. Make a scatter plot of the predicted vs. actual scores for each of the 5 folds. Do they match?
4. How does this fit compare with the decision tree fit?
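A side-by-side comparison might look like the sketch below. It again uses synthetic stand-in data (an assumption, not the lab's dataset) and scores a single tree and a forest on the same folds so the numbers are directly comparable; `n_estimators=100` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; swap in the lab's features and score target.
X_reg, y_reg = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean 5-fold R^2 for each model, computed on identical folds.
mean_r2 = {
    name: cross_val_score(model, X_reg, y_reg, cv=cv).mean()
    for name, model in models.items()
}
for name, score in mean_r2.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```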

---

## 4. Extra Trees Regressor

1. Train an extra trees regressor on the regression problem and predict your dependent variable.
2. Evaluate the score with 5-fold cross-validation.
3. Make a scatter plot of the predicted vs. actual scores for each of the 5 folds. Do they match?
4. How does this fit compare with the previous models?
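For the extra trees model, an alternative to per-fold scoring is to collect out-of-fold predictions with `cross_val_predict` and score them in one pass. The sketch below assumes the same synthetic stand-in data as a placeholder for the lab's features and target.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data; swap in the lab's features and score target.
X_reg, y_reg = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

extra = ExtraTreesRegressor(n_estimators=100, random_state=0)

# Each row is predicted by the model trained on the four folds that exclude it.
oof_pred = cross_val_predict(extra, X_reg, y_reg, cv=cv)

pooled_r2 = r2_score(y_reg, oof_pred)
pooled_mae = mean_absolute_error(y_reg, oof_pred)
print(f"pooled R^2 = {pooled_r2:.3f}, pooled MAE = {pooled_mae:.2f}")
```

Note that a pooled R^2 over all out-of-fold predictions is not the same quantity as the mean of the five per-fold R^2 values, so compare like with like when ranking models.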

---

## 5. AdaBoost Classifier

1. Train an AdaBoost classifier on your chosen classification problem.
2. Evaluate the classifier performance with 5-fold cross-validation.
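A minimal sketch of this step, using the breast cancer data loaded in the cell above as a stand-in classification problem. The choice of `n_estimators=100`, stratified shuffled folds, and accuracy as the metric are assumptions, not requirements of the lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in classification data; swap in your chosen target from the lab.
X_clf, y_clf = load_breast_cancer(return_X_y=True)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada_scores = cross_val_score(ada, X_clf, y_clf, cv=cv, scoring="accuracy")
print("accuracy per fold:", ada_scores.round(3), "mean:", round(ada_scores.mean(), 3))
```

For a multi-class target such as `neighborhood`, the same code applies, though accuracy alone can hide poor performance on rare classes.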

---

## 6. Gradient Boosted Trees Classifier

1. Train a gradient boosted trees classifier on your chosen classification problem.
2. Evaluate the score with 5-fold cross-validation.
3. Compare with the AdaBoost score.
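The comparison could be set up as below, scoring both boosted models on identical folds. As before, the breast cancer data is a stand-in for your chosen classification target, and the default hyperparameters are used only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in classification data; swap in your chosen target from the lab.
X_clf, y_clf = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

boosters = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold accuracy for each model, computed on the same folds.
mean_acc = {
    name: cross_val_score(model, X_clf, y_clf, cv=cv, scoring="accuracy").mean()
    for name, model in boosters.items()
}
for name, acc in mean_acc.items():
    print(f"{name}: mean accuracy = {acc:.3f}")
```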

---

## 7. [BONUS] Use grid search to fine-tune a model or models

1. What are the best parameters found with the grid search?
2. How does the best score compare to the score of the untuned model(s)?
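A small grid search might be sketched like this. The parameter grid is deliberately tiny and purely illustrative, and the breast cancer data again stands in for your own classification problem; widen the grid and swap the data for real tuning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Stand-in classification data; swap in your chosen target from the lab.
X_clf, y_clf = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: default hyperparameters, scored on the same folds.
baseline = cross_val_score(
    GradientBoostingClassifier(random_state=0), X_clf, y_clf, cv=cv, scoring="accuracy"
).mean()

# Illustrative grid: 2 x 2 = 4 candidate settings, each fit on 5 folds.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X_clf, y_clf)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3), "baseline:", round(baseline, 3))
```

Because `best_score_` is itself a cross-validated mean, it is best compared against a baseline scored on the same folds. It is also selected as the maximum over the grid, so it tends to be slightly optimistic; a held-out test set gives a fairer final estimate.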