# Assignment 7: Regularization - Bagging, Early Stopping and Dropout (deadline: 22 Dec, 23:59)

### Exercise 1. Regularization: Bagging (7 points)

**Goal:** Study the regularizing effect of **bagging** by comparing a bagged ensemble of decision trees against a single decision tree regressor.

Bagging, briefly mentioned in Lecture 6, is an ensemble machine learning method. The bagging scheme suggested for this exercise samples instances from the training data with replacement to create multiple training subsets. A new regressor is fitted internally on each subset, and the predictions of all regressors are combined (averaged, in the case of regression) to produce the final result. For more details, read:

1. Scikit Learn Documentation for Bagging Regressor. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#id6

2. Bootstrap Aggregating Wikipedia Article. https://en.wikipedia.org/wiki/Bootstrap_aggregating
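
The resampling-and-averaging scheme described above can be sketched in a few lines of NumPy. This is only an illustration, not the solution to the exercise: the names `bootstrap_indices`, `bagged_predict` and `one_nn_fit_predict` are hypothetical, and a 1-nearest-neighbour regressor stands in for the decision tree so that the sketch has no dependency beyond NumPy.

```python
import numpy as np


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap resample)."""
    return rng.integers(0, n_samples, size=n_samples)


def one_nn_fit_predict(X_tr, y_tr, X_te):
    """Hypothetical base model: predict the target of the nearest training point."""
    dists = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    return y_tr[np.argmin(dists, axis=1)]


def bagged_predict(fit_predict, X_train, y_train, X_test, n_models=10, seed=0):
    """Fit one base model per bootstrap resample and average their predictions."""
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(n_models):
        idx = bootstrap_indices(len(X_train), rng)
        predictions.append(fit_predict(X_train[idx], y_train[idx], X_test))
    return np.mean(predictions, axis=0)


# Toy usage on a noiseless sine curve
X = np.linspace(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
X_new = np.array([[0.1], [0.5], [0.9]])
y_bagged = bagged_predict(one_nn_fit_predict, X, y, X_new, n_models=25)
```

In the exercise itself, `BaggingRegressor` performs the resampling and averaging internally, so none of these helpers are needed there.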

Implement a bagging regularization scheme using ***DecisionTreeRegressor***, a decision-tree-based regressor from the Python package ***sklearn.tree***. To implement the bagging scheme, you can use ***BaggingRegressor*** from the Python package ***sklearn.ensemble***. Fill in the code pieces marked by "# TODO" in the following notebook to complete this assignment. Finally, comment on the results you obtain.

Note: to run the following code, you will need to download **data.csv** from the NNIA resource page on Piazza.

In [1]:
# Import Packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

`Define Estimators`: Create an array of two estimators. <br>
First, a "Tree" using `DecisionTreeRegressor`, and <br>
second, a regularized version obtained by applying `BaggingRegressor` to this tree, labelled "Bagging (Tree)". (2 points)

In [2]:
estimators = [("Tree", #TODO
               DecisionTreeRegressor()
              ),
              ("Bagging (Tree)", #TODO
               # Pass the tree explicitly as the base estimator
               BaggingRegressor(DecisionTreeRegressor(),
                                max_samples=1.0,
                                bootstrap=True)
              )]

In [3]:
n_estimators = len(estimators)
#print (n_estimators)
np.random.seed(0)


#Load Data
data = pd.read_csv('data.csv')

#Drop Player Field
data = data.drop('Player', axis=1)

#Set variable y with 'Salary' column and then drop it from data
y = data['Salary'].values
data = data.drop('Salary', axis=1)

#Convert data to a NumPy array x
x = data.values

# Split x in X_train (80 %) and X_test (20 %), while constructing the corresponding y_train and y_test
n_train = int(0.8 * len(x))
n_test = len(x) - n_train
X_train = x[:n_train,:]
y_train = y[:n_train]
X_test = x[n_train:,:]
y_test = y[n_train:]

In [4]:
# Plot Figures and report error using the different ensemble methods
fig = plt.figure(figsize=(10,3))
print(n_estimators)

# Loop over estimators to compare
for n, (name, estimator) in enumerate(estimators):
    # n is the index (0 or 1), name is "Tree" or "Bagging (Tree)", estimator is the model object

    # Train the estimator (1 point)
    # TODO
    estimator.fit(X_train, y_train)

    # Predict results using the estimator on X_test (1 point)
    # TODO
    y_predict = estimator.predict(X_test)

    # Compute the squared error using y_test and y_predict and store it in y_error (1 point)
    # TODO
    y_error = (y_test - y_predict) ** 2
    print("{0}: {1:.4f} (error)".format(name, np.mean(y_error)))
    
    # Plot the Result
    plt.subplot(1,n_estimators, n+1)
    plt.plot(np.arange(n_test), y_error, "r", label="$error(x)$")
    plt.ylim([0, 1300000])
    plt.title(name)

plt.show()


a) Explain the differences (in 2-3 sentences) between the plots you obtain for **Tree** and **Bagging (Tree)**. (2 points)

### Exercise 2. Regularization: Early Stopping (6 points)

**Goal:** To study how increasing the number of neurons in a neural network (model complexity) affects the early stopping threshold.

Download the **MNIST** dataset from the NNIA resource page on Piazza. First, update the feedforward neural network code from Assignment 6 to calculate the training and validation error every 100 iterations (the validation frequency) of the training scheme. To do so, modify the fit function so that its signature looks as follows. Note the extra arguments needed to calculate the validation error. (2 points)

In [None]:
    def fit(self, X, y, print_progress=False, validation_freq=0, X_val=None, y_val=None):
        """ Learn weights from training data.

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.
        y : array, shape = [n_samples]
            Target class labels.
        print_progress : bool (default: False)
            Prints progress as the number of epochs
            to stderr.
        validation_freq : int (default: 0)
            For the value "i" it takes, it calculates the 
            train set and validation set error every "ith" iteration
        X_val : array, shape = [n_validation_samples, n_features]
            the validation set X values, to be provided 
            when validation_freq > 0
        y_val : array, shape = [n_validation_samples]
            the validation set y values, to be provided
            when validation_freq > 0

        Returns:
        ----------
        self

        """

Then, using this code with different numbers of neurons (50, 100 and 200), plot the training and validation error at every 100th iteration, up to 1000 iterations. Label the axes and legends appropriately in the plots. (1.5 points)

Note: to calculate the validation error, use the test set as a proxy validation set.
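
As a rough illustration of the bookkeeping this involves, the sketch below lists the iterations at which errors would be recorded for a given `validation_freq`, and computes a misclassification rate. The helper names `iterations_to_validate` and `error_rate` are hypothetical and are not part of the Assignment 6 code; how the errors are stored inside `fit` is left to you.

```python
def iterations_to_validate(n_iterations, validation_freq):
    """Return the 1-based iteration numbers at which errors are recorded.

    A validation_freq of 0 (the default in the fit signature) disables recording.
    """
    if validation_freq <= 0:
        return []
    return [i for i in range(1, n_iterations + 1) if i % validation_freq == 0]


def error_rate(y_true, y_pred):
    """Fraction of misclassified samples."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


checkpoints = iterations_to_validate(1000, 100)
```

With 1000 iterations and a validation frequency of 100, this yields ten checkpoints, which become the x-axis values of the requested plots.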

a) Use these plots and the related variables from the code to suggest an early stopping criterion for each hidden layer size. (1.5 points)

b) As the number of neurons is increased, you will observe differences in the early stopping criterion for each hidden layer size. Why do you observe such differences? (1 point)
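
One common way to turn a recorded validation curve into a stopping point is a patience rule: stop once the validation error has not improved for a fixed number of consecutive checks. The sketch below is a minimal version of that idea and is only one possible criterion; the function name `early_stopping_index` and the parameters `patience` and `min_delta` are assumptions made for illustration, not part of the assignment code.

```python
def early_stopping_index(val_errors, patience=2, min_delta=0.0):
    """Return the index of the best validation error seen before stopping.

    Training is considered stopped once `patience` consecutive checks
    fail to improve on the best error by more than `min_delta`.
    Returns -1 for an empty list.
    """
    best_err = float("inf")
    best_idx = -1
    waited = 0
    for i, err in enumerate(val_errors):
        if err < best_err - min_delta:
            best_err, best_idx, waited = err, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx


# Made-up validation errors, one per validation check
val_errors = [0.30, 0.20, 0.15, 0.16, 0.17, 0.10]
stop_short = early_stopping_index(val_errors, patience=2)
stop_long = early_stopping_index(val_errors, patience=3)
```

The two calls show the trade-off: a small patience stops at the first plateau, while a larger one tolerates a temporary rise and can reach a later, lower error.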

### Exercise 3. Regularization: Dropout (7 points)

**Goal:** To implement and study dropout for neural networks.

Implement dropout for layer 2 in the three-layered network you developed for Exercise 2. A simple dropout implementation creates a mask ($r^{(l)}_j$) for every neuron $j$ of the hidden layer $l$ by drawing from a [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) with probability $p$, where $p$ is the probability that a neuron is kept active.
$$ r^{(l)}_j \sim Bernoulli(p) $$
This mask is then applied to the hidden layer output ($h^{(l)}$) to obtain the regularized hidden layer activation $\hat{h}^{(l)}$
$$ \hat{h}^{(l)} = r^{(l)} * h^{(l)}$$
However, such an implementation requires the output of layer $l$ to be multiplied by the dropout coefficient $p$ at evaluation time to compensate for the larger number of active units during testing.
$$ \hat{h}^{(l)} = p * h^{(l)}$$
Such an implementation requires the code to switch between different code blocks for the forward pass during training and testing. Hence, a smoother way to implement dropout is to use ***inverted dropout***, where the mask generated during training is multiplied by the inverse of the dropout coefficient.
$$ r^{(l)}_j \sim Bernoulli(p) * \frac{1}{p}$$
This scheme applies the scaling during training, so the hidden layer output needs no rescaling at evaluation time.
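
A short check of why the $\frac{1}{p}$ factor makes evaluation-time scaling unnecessary, assuming the mask is drawn independently of $h^{(l)}$ and $p > 0$. For the simple scheme, the expected masked activation is
$$ \mathbb{E}\left[r^{(l)}_j h^{(l)}_j\right] = p \, h^{(l)}_j $$
which is why the output is multiplied by $p$ at evaluation time. With the inverted mask $\tilde{r}^{(l)}_j = r^{(l)}_j / p$, the expectation becomes
$$ \mathbb{E}\left[\tilde{r}^{(l)}_j h^{(l)}_j\right] = \frac{1}{p} \, p \, h^{(l)}_j = h^{(l)}_j $$
so the unmasked activation $h^{(l)}$ can be used directly at evaluation time.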

Update the code from Exercise 2 to implement inverted dropout for a hidden layer size of 200 neurons. (4 points)
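
A minimal NumPy sketch of the inverted dropout mask, independent of the network code from Exercise 2. The function name `inverted_dropout` and its `training` flag are hypothetical. The sketch assumes `p` is the probability of keeping a neuron, as in the equations above, so it does not cover `p = 0`.

```python
import numpy as np


def inverted_dropout(h, p, rng, training=True):
    """Apply an inverted dropout mask to the activations h.

    p is assumed to be the keep probability (0 < p <= 1).
    At evaluation time (training=False) h is returned unchanged.
    """
    if not training:
        return h
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Bernoulli(p) mask scaled by 1/p: entries are either 0 or 1/p
    mask = rng.binomial(1, p, size=h.shape) / p
    return h * mask


rng = np.random.default_rng(0)
h = np.ones((1000, 200))  # made-up activations of a 200-neuron hidden layer
h_train = inverted_dropout(h, 0.5, rng)
h_eval = inverted_dropout(h, 0.5, rng, training=False)
```

Because surviving activations are scaled up by $\frac{1}{p}$, the mean of `h_train` stays close to the mean of `h`, which is the property that lets evaluation skip the rescaling step.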

a) Furthermore, update the code from Exercise 2 to plot the variation of training and test accuracies on MNIST for dropout values denoted by np.arange(0.0, 1.0, 0.1). (1.5 points)

b) Intuitively, L1 and L2 minimize the interdependence and the value of feature weights by penalising the loss function. In the same vein, what kind of interdependence does dropout affect? (0.5 points)

c) Why can Dropout be considered as an approximation to Bagging? (1 point)

---

## Submission instructions
You should provide a single Jupyter notebook as a solution. The naming should include the assignment number and matriculation IDs of all team members in the following format:
**assignment-7_matriculation1_matriculation2_matriculation3.ipynb** (in case of 3 team members). 
Make sure to keep the order matriculation1_matriculation2_matriculation3 the same for all assignments.

Please submit your solution to your tutor (with **[NNIA][assignment-7]** in the email subject):
1. Maksym Andriushchenko s8mmandr@stud.uni-saarland.de
2. Marius Mosbach s9msmosb@stud.uni-saarland.de
3. Rajarshi Biswas rbisw17@gmail.com
4. Marimuthu Kalimuthu s8makali@stud.uni-saarland.de

**If you are in a team, please submit only 1 solution to only 1 tutor.**