# Ensemble Learning Exercise

## 1. Implementing Random Forest From Scratch (30 points)
In this exercise you will need to implement a simple version of Random Forest Regressor from scratch. Your model will handle **continuous input and output**. 

1.1. Compelete the skeleton class below (you should use scikit-learn's `DecisionTreeRegressor` model that the `TreeEnsemble` will use)
  - `X` is a matrix of data values (rows are samples, columns are attributes)
  - `y` is a vector of corresponding target values
  - `n_trees` is the number of trees to create
  - `sample_sz` is the size of the sample set to use of each of the trees in the forest (chose the samples randomly, with or without repetition)
  - `n_features` is the size of features to sample. This can be a natrual number > 0, or a ratio of the features as a number in range (0,1]
  - `min_leaf` is the minimal number of samples in each leaf node of each tree in the forest
  

1.2. The `predict` function will use mean of the target values of the trees. The result is a vector of predictions matching the number of rows in `X`.

1.3. The `oob_mse` function will compute the mean squared error over all **out of bag (oob)** samples. That is, for each sample calculate the squared error using  predictions from the trees that do not contain x in their respective bootstrap sample, then average this score for all samples. See:  [OOB Errors for Random Forests](https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html).

1.4. To check your random forest implementation, use the boston dataset (`from sklearn.datasets import load_boston`)

  - Use the following to estimate what are the best hyper parameters to use for your model
```
for n in [1,5,10,20,50,100]:
  for sz in [50,100,300,500]:
    for min_leaf in [1,5]:
      forest = TreeEnsemble(X, y, n, sz, min_leaf)
      mse = forest.oob_mse()
      print("n_trees:{0}, sz:{1}, min_leaf:{2} --- oob mse: {3}".format(n, sz, min_leaf, mse))
```
  
  - Using your chosen hyperparameters as a final model, plot the predictions vs. true values of all the samples in the training set . Use something like:
  ```
  y_hat = forest.predict(X)  # forest is the chosen model
  plt.scatter(y_hat, y)
  ```
 


In [30]:
from sklearn.datasets import load_boston
boston_dataset = load_boston()


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [36]:
import pandas as pd
from sklearn.model_selection import train_test_split
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
X = pd.DataFrame(boston)
Y = pd.DataFrame(boston_dataset.target)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=317)

In [43]:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor().fit(X_train, Y_train)
pred = tree.predict(X_test)

In [59]:
pred/3

array([ 6.8       ,  8.33333333,  6.76666667,  2.46666667, 16.66666667,
        7.93333333,  2.83333333,  6.36666667, 15.13333333,  6.9       ,
        4.86666667,  3.96666667,  7.73333333,  4.03333333,  7.23333333,
        9.16666667,  8.23333333, 14.5       ,  4.46666667,  4.66666667,
       11.06666667,  6.3       ,  8.16666667,  5.23333333,  6.26666667,
        9.7       ,  7.5       ,  3.4       ,  4.        , 11.13333333,
        5.13333333,  5.7       ,  2.33333333,  4.36666667,  5.36666667,
        8.23333333,  4.86666667,  6.43333333, 12.4       ,  4.86666667,
        4.36666667,  9.33333333,  9.56666667,  5.36666667,  4.5       ,
        5.13333333, 10.5       ,  5.        ,  6.4       ,  5.4       ,
       10.16666667,  2.76666667,  3.4       ,  5.36666667, 12.16666667,
        6.16666667,  7.23333333,  6.36666667,  6.46666667,  7.03333333,
        9.7       , 16.66666667,  7.23333333,  6.83333333,  4.46666667,
        4.6       ,  9.96666667,  5.8       ,  6.3       ,  2.9 

In [24]:

import numpy as np
s = boston.data[np.random.choice(boston.data.shape[0], 3, replace=True), :][:,np.random.choice(boston.data.shape[1], 2, replace=False)]
s

array([[0.718  , 7.02259],
       [0.538  , 1.61282],
       [0.4379 , 0.03466]])

In [29]:
round(0.157 * boston.data.shape[1]),0.157 * boston.data.shape[1]

(2, 2.041)

In [102]:
from sklearn.tree import DecisionTreeRegressor

class TreeEnsemble():
    def __init__(self, X, y, n_trees, sample_sz, n_features, min_leaf):
        # let get the number of features, not fraction:
        if n_features < 1:
            self.n_features = round(X.shape[1] * n_features)
        else:
            self.n_features = n_features
        self.trees = []
        self.features = []
        self.n_trees = n_trees
        for i in range(self.n_trees):
            ind_rows = np.random.choice(X.shape[0], sample_sz, replace=True)
            ind_features = np.random.choice(X.shape[1], self.n_features, replace=False)
            x_sample = X.iloc[ind_rows,:].iloc[:,ind_features]
            y_sample = y.iloc[ind_rows]
            tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(x_sample, y_sample)
            self.trees.append(tree)
            self.features.append(ind_features)


    def predict(self, X):
        predictions = np.zeros(X.shape[0])
        n_trees = len(self.trees)
        for i in range(self.n_trees):
            tree = self.trees[i]
            ind_features = self.features[i]
            prediction = tree.predict(X.iloc[:,ind_features])
            predictions += prediction / n_trees
        return prediction

    def oob_mse(self):
        pass



In [None]:
from sklearn.datasets import load_boston
boston_dataset = load_boston()
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
X = pd.DataFrame(boston)
Y = pd.DataFrame(boston_dataset.target)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=317)

In [90]:
ind_rows = np.random.choice(X_train.shape[0], 5, replace=True)
ind_features = np.random.choice(X_train.shape[1], 5, replace=False)
sample = X_train.iloc[ind_rows,:].iloc[:,ind_features]
# [:,ind_features]
sample
# X_train.iloc[ind_rows,:]
Y_train.loc[ind_rows]

Unnamed: 0,0
291,37.3
99,33.2
108,19.8
357,21.7
239,23.3


In [103]:
ens = TreeEnsemble(X_train, Y_train, n_trees=15, sample_sz=100, n_features=5, min_leaf=5)
ens.predict(X_test)

array([19.5375    , 36.15      , 20.7875    , 16.77142857, 47.98571429,
       29.48      ,  9.7       ,  9.7       , 47.98571429, 22.46      ,
       16.77142857, 22.46      , 22.55714286,  9.7       , 22.46      ,
       25.72      , 20.7875    , 36.15      , 16.77142857, 19.5375    ,
       36.15      , 20.02857143, 19.5375    , 20.02857143, 17.175     ,
       24.46666667, 22.46      ,  9.7       , 29.48      , 36.15      ,
       20.7875    , 16.77142857, 17.175     , 16.77142857, 16.77142857,
       20.7875    , 19.5375    , 19.5375    , 36.15      , 15.58333333,
       15.58333333, 25.72      , 22.55714286, 16.77142857, 19.5375    ,
       16.77142857, 30.55      , 36.15      , 17.175     , 20.02857143,
       30.55      ,  9.7       , 25.72      , 16.77142857, 36.15      ,
       19.5375    , 20.02857143,  9.7       , 17.175     , 20.7875    ,
       25.72      , 47.98571429, 19.5375    , 20.02857143, 16.77142857,
       15.58333333, 25.72      , 17.175     , 20.7875    ,  9.7 

## 2. Implementing AdaBoost From Scratch (30 points)


2.1.   Implement the AdaBoost algorithm for classification task. Your `AdaBoost` class should receive a method for creating a weak learner, which has a fit and predict methods (**hint**: you can simulate re-weighting of the samples by an appropriate re-sampling of the train set).

2.2.   Use your model to find a strong classifier on the sample set given below, using $n$ weak learners:

2.2.1. For the base weak learners, use any classifier you want (check different types with different configuration, keep them simple = Weak). 

2.2.2. Split the sample set into train and test sets.

2.2.3 Plot the final decision plane of your classifier for $n\in \{1, 2, 3, 5, 10, 50\}$, and visualize the final iteration weights of the samples in those plots.
    - How does the overall train set accuracy changes with $n$?
    - Does you model starts to overfit at some point?





In [None]:
from sklearn.datasets import make_circles
from matplotlib import pyplot
from pandas import DataFrame

# generate 2d classification dataset
X, y = make_circles(n_samples=1500, noise=0.2, random_state=101, factor=0.5)

# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()

## 3. Boosting Trees from Scratch (40 points)
1. Use the scikit-learn's DecisionTreeRegressor (again :) with `max_depth = 1` (stumps)  to write a L2Boost model which minimize the L2 square loss iteration by iteration.
Reminder: in each step, build a decision tree to minimize the error between the true label and the accumulated (sum) of the previous step predictions.
![alt text](https://explained.ai/gradient-boosting/images/latex-321A7951E78381FB73D2A6874916134D.svg)
2. Use the Boston dataset to plot the MSE as a function of the number of trees for a logspace of `n_trees` up to 1,000. What is the optimal value of `n_trees`? of learning rate?
3. Compare the performance with a deep DecisionTreeRegressor (find the optimal `max_depth`).  Who wins?
4. Add an early-stopping mechanisim to the GBTL2 model to use a validation set to detect over-fit.