Today, we design a new machine learning algorithm from scratch.
This algorithm learns how to correct for the mistakes it's made in the past by training a series of "base learners" one by one.

1) Load the California housing data and do a train-test split as below:

    import pandas as pd

    # Recent scikit-learn releases expose the loader directly from sklearn.datasets
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split

    housing_dataset = fetch_california_housing()
    X = pd.DataFrame(housing_dataset.data)
    X.columns = housing_dataset.feature_names
    y = housing_dataset.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2018)


2) Ok that wasn't hard. Fit a simple linear regression on train and score it on test as a baseline.

3) That also wasn't so bad. Now the hard part: create a Python class or series of functions that follows the steps below to build a predictive model from training data **X, y** and a hyperparameter **n_estimators**.

**A.** Set C = mean of y. This constant is our initial prediction. Track the current residuals y_i - C for all the target values.

**B.** Do the following n_estimators times: using sklearn DecisionTreeRegressor, fit a tree of max_depth 3 to (X, current residuals). Save the tree in a list, and update the residuals by subtracting the tree's predicted values on X from the current residuals.

**C.** To make predictions on new data, sum the predictions made by all of the trees in your list, then add C. Fit your model on the training data and predict on the test data. Score your model on the test data. Try to get above 0.70 R^2. n_estimators = 10 is a good starting point.

**D.** Time permitting, expand your model by adding the hyperparameters **max_depth**, which sets the max_depth of each tree, and **learning_rate**. With a learning rate, subtract learning_rate * tree.predict(X) when you update the residuals (what does this remind you of?). Also, when predicting, multiply the predictions made by each tree by the learning_rate.

Why do you think this works well? Where have we seen iterative mistake corrections with small step sizes come up before? Can you push your model to do even better?
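
One way to read the residual update, offered as a sketch rather than the official answer: under squared-error loss, the residuals are exactly the negative gradient of the loss with respect to the current predictions, so each round looks like a gradient-descent step taken in "prediction space". The symbols below are notation introduced here for illustration: $F_m$ is the model after $m$ trees, $h_m$ is the $m$-th tree, $r_i$ is the residual for row $i$, and $\eta$ stands for learning_rate.

```latex
\[
L(F) = \frac{1}{2} \sum_{i} \bigl(y_i - F(x_i)\bigr)^2,
\qquad
-\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i) = r_i
\]
\[
F_0(x) = C,
\qquad
F_m(x) = F_{m-1}(x) + \eta \, h_m(x),
\qquad
r_i^{(m)} = r_i^{(m-1)} - \eta \, h_m(x_i)
\]
```

Read this way, the tree $h_m$ approximates the negative gradient and $\eta$ plays the role of the step size.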

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load the data into a DataFrame with named feature columns
housing_dataset = fetch_california_housing()
X = pd.DataFrame(housing_dataset.data)
X.columns = housing_dataset.feature_names
y = housing_dataset.target

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2018)

### Create Baseline Model

In [6]:
# Initialize model object
lr = LinearRegression()

# Fit model
fitted_lr = lr.fit(X_train, y_train)

# Make predictions
y_pred = fitted_lr.predict(X_test)

In [9]:
# Score model
print("Linear Regression R^2 Score: ", round(fitted_lr.score(X_test, y_test), 3))

Linear Regression R^2 Score:  0.605


In [10]:
X

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [19]:
mean_y = y.sum() / len(y)

In [23]:
len(y)

20640

In [20]:
updated_residuals = y - mean_y

In [22]:
len(updated_residuals)

20640

In [None]:
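def xgboost(X, y, n_estimators):
    # Note: despite the name, this is the plain residual-fitting loop
    # from steps A-C, not the XGBoost library.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2018)

    # Step A: constant initial prediction and initial residuals
    mean_y = y_train.mean()
    residuals = y_train - mean_y
    y_pred = np.full(len(X_test), mean_y)

    # Step B: fit a fresh tree to the current residuals each round
    for _ in range(n_estimators):
        desc_tree_reg = DecisionTreeRegressor(max_depth=3)
        desc_tree_reg.fit(X_train, residuals)
        residuals = residuals - desc_tree_reg.predict(X_train)

        # Step C: add this tree's prediction to the running test prediction
        y_pred = y_pred + desc_tree_reg.predict(X_test)

    return y_test, y_pred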

# BOOSTING Class Solution

In [30]:
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor



# An analogy for boosting: taking exams.
# Imagine you got 20% of the questions wrong. What would you do?
# You would improve your score by focusing on your mistakes, the questions you got wrong.
# Boosting works the same way: each new model is trained on the errors
# left over by the models before it.

class IAmTheOneWhoBoosts():

    def __init__(self, n_estimators=10, max_depth=3, learning_rate=1):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate

    def fit(self, X, y):
        # Step A: constant initial prediction and initial residuals
        self.C = np.mean(y)
        self.estimators = []

        resids = y - self.C

        for _ in range(self.n_estimators):
            # Step B: fit a tree to the current residuals, then shrink them
            est = DecisionTreeRegressor(max_depth=self.max_depth)
            est.fit(X, resids)
            resids = resids - self.learning_rate * est.predict(X)

            self.estimators.append(est)

        return self

    def predict(self, X):
        # Step C: final prediction is C plus the scaled prediction of every tree
        return self.C + np.sum([self.learning_rate * est.predict(X)
                                for est in self.estimators], axis=0)

    def score(self, X, y):
        return r2_score(y, self.predict(X))

In [None]:
booster = IAmTheOneWhoBoosts(100, 3, 1)
booster.fit(X_train, y_train)
booster.score(X_test, y_test)
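
To see how the score evolves round by round, here is a minimal sketch of the same residual-fitting loop that records training MSE and test R^2 after each tree. It is not part of the class above: `boost_and_track` is a hypothetical helper, the data is a synthetic stand-in from `make_regression` (an assumption, so the block runs without downloading the housing set), and `learning_rate=0.5` with 50 rounds is just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def boost_and_track(X_train, y_train, X_test, y_test,
                    n_estimators=50, max_depth=3, learning_rate=0.5):
    """Hypothetical helper: run the residual-fitting loop and record
    training MSE and test R^2 after every round."""
    C = np.mean(y_train)
    resids = y_train - C                   # step A: initial residuals
    test_pred = np.full(len(y_test), C)    # running prediction on test rows
    train_mse, test_r2 = [], []
    for _ in range(n_estimators):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X_train, resids)          # step B: fit a tree to the residuals
        resids = resids - learning_rate * tree.predict(X_train)
        test_pred = test_pred + learning_rate * tree.predict(X_test)
        train_mse.append(np.mean(resids ** 2))
        test_r2.append(r2_score(y_test, test_pred))
    return train_mse, test_r2


# Synthetic stand-in data (an assumption; not the California housing set)
X_demo, y_demo = make_regression(n_samples=500, n_features=8,
                                 noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.2, random_state=2018)

train_mse, test_r2 = boost_and_track(Xtr, ytr, Xte, yte)
print("train MSE, first vs last round:", round(train_mse[0], 1), round(train_mse[-1], 1))
print("test R^2,  first vs last round:", round(test_r2[0], 3), round(test_r2[-1], 3))
```

Training error cannot go up from one round to the next with a learning rate in (0, 1], since each leaf moves its residuals toward zero. Test R^2 carries no such guarantee, which is why watching both curves is a reasonable way to pick n_estimators and learning_rate.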