## Boosting Neural Networks

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pandas as pd
import numpy as np

In [3]:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.preprocessing import StandardScaler



In [4]:
PATH = Path("spamdata")

## Spam Data

In this notebook we work with a dataset of e-mails, some of which are spam and some of which are not. Here's a link to the source on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/spambase

### Predictors

The predictive features are as follows: 

48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail

### Target

Our target is whether an e-mail is spam. It's encoded in the final column of the dataset as:

1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.



## Read in the spambase data

And get it ready for model fitting.

In [5]:
def parse_spambase_data(filename):
    """ 
    Given a filename return X and Y numpy arrays.

    X is of size number of rows x num_features.
    Y is an array of size the number of rows.
    Y is the last element of each row. (Convert 0 to -1)
    """
    dataset = np.loadtxt(filename, delimiter=",")
    K = len(dataset[0])
    Y = dataset[:, K - 1]
    X = dataset[:, 0 : K - 1]
    Y = np.where(Y == 0., -1., 1.)  # vectorized: map label 0 to -1, everything else to 1
    return X, Y

In [6]:
parse_spambase_data("spamdata/spambase.train")

(array([[0.000e+00, 0.000e+00, 0.000e+00, ..., 3.850e+00, 1.900e+01,
         7.700e+01],
        [9.000e-02, 0.000e+00, 9.000e-02, ..., 3.704e+00, 4.800e+01,
         7.260e+02],
        [7.800e-01, 0.000e+00, 7.800e-01, ..., 2.555e+00, 2.200e+01,
         1.150e+02],
        ...,
        [3.700e-01, 1.800e-01, 1.800e-01, ..., 3.455e+00, 2.400e+01,
         3.870e+02],
        [0.000e+00, 0.000e+00, 8.000e-01, ..., 2.360e+00, 3.500e+01,
         5.900e+01],
        [0.000e+00, 0.000e+00, 0.000e+00, ..., 1.666e+00, 7.000e+00,
         4.500e+01]]),
 array([ 1., -1.,  1., ..., -1., -1., -1.]))

#### Define helper function for scaling

In [7]:
def normalize(X, X_val):
    """ Given X, X_val compute X_scaled, X_val_scaled
    
    return X_scaled, X_val_scaled
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_val_scaled = scaler.transform(X_val)

    return X_scaled, X_val_scaled
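
The helper fits the scaler on the training data only and reuses those statistics for the validation data. A minimal NumPy sketch of the same idea follows; `normalize_np` and the random toy arrays are hypothetical names introduced here for illustration, not part of the notebook.

```python
import numpy as np

def normalize_np(X, X_val):
    """Standardize both arrays with the TRAINING mean and std only."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)      # population std (ddof=0), which StandardScaler also uses
    sigma[sigma == 0] = 1.0    # leave constant columns unscaled instead of dividing by 0
    return (X - mu) / sigma, (X_val - mu) / sigma

# Toy data (assumption: any real-valued matrices with matching column counts)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_va = rng.normal(loc=5.0, scale=2.0, size=(20, 3))
X_tr_s, X_va_s = normalize_np(X_tr, X_va)
```

The training columns end up with mean 0 and unit variance. The validation columns generally do not, because no statistics are estimated from them.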

In [8]:
X, Y = parse_spambase_data(PATH/"spambase.train")
X_val, Y_val = parse_spambase_data(PATH/"spambase.test")
X, X_val = normalize(X, X_val)

## Define an example Neural Network Class 

### Small NNs w/ 1 hidden layer

In [9]:
class NN(nn.Module):
    def __init__(self, D, seed, hidden=10):
        super(NN, self).__init__()
        torch.manual_seed(seed) # this is for reproducibility
        self.linear1 = nn.Linear(D, hidden)
        self.linear2 = nn.Linear(hidden, 1)
        self.bn1 = nn.BatchNorm1d(num_features=hidden) #normalizes batches
        #forces the hidden layer features to have mean 0 and unit var
        
    def forward(self, x):
        #make an nn with 1 hidden layer
        #the hidden layer should be activated with ReLU
        #which can be found in F.relu
        #you can also batch normalize if you want
        
        
        #...
        
        
        
        return x
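
One possible way to fill in the forward pass is sketched below. This is only an illustration under stated assumptions: the class name `SketchNN` is hypothetical, and applying batch normalization after the ReLU (rather than before it, or not at all) is a choice made here, not something the notebook prescribes. The accuracies asserted in the sanity checks later on may depend on which variant is used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchNN(nn.Module):
    """Hypothetical completion of the one-hidden-layer network."""
    def __init__(self, D, seed, hidden=10):
        super().__init__()
        torch.manual_seed(seed)  # reproducible initial weights
        self.linear1 = nn.Linear(D, hidden)
        self.linear2 = nn.Linear(hidden, 1)
        self.bn1 = nn.BatchNorm1d(num_features=hidden)

    def forward(self, x):
        x = F.relu(self.linear1(x))  # hidden layer with ReLU activation
        x = self.bn1(x)              # optional batch norm (ordering is an assumption)
        return self.linear2(x)       # one real-valued output per row, shape (N, 1)

net = SketchNN(D=57, seed=1)
out = net(torch.randn(8, 57))        # batch of 8 rows with 57 features
```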

In [10]:
#list(NN(10,1).parameters())

Exercise: Sketch the NN diagram which corresponds to this class. Would we think of this network as a weak learner or a strong learner? 

## Preparing to Boost

If we want to do gradient boosting, we'll need pseudoresiduals. Recall that the general form of a pseudoresidual is:

$$r_i = - \frac{\partial L}{\partial f} \bigg| _{f = f_{m-1}(x_i)}$$

If we use logloss then $L(y, f) = \log(1+e^{-yf}).$
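
Differentiating this loss with respect to $f$ and flipping the sign gives the pseudoresidual in closed form (a short derivation using only the two definitions above):

$$\frac{\partial L}{\partial f} = \frac{-y\,e^{-yf}}{1+e^{-yf}} = \frac{-y}{1+e^{yf}} \quad\Longrightarrow\quad r_i = \frac{y_i}{1+e^{y_i f_{m-1}(x_i)}}$$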

### Calculate pseudoresidual for classification for $m^{th}$ stage

In [11]:
def compute_pseudo_residual(y, fm):
    """ 
    vectorized computation of the pseudoresidual for logloss
    """
    res = y/(1+np.exp(y*fm))
    return res
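
As a quick check, the closed form can be compared against a central finite difference of the logloss. This is a sketch: the helper `logloss` and the step size `eps` are introduced here for illustration, and the function is repeated so the block runs on its own.

```python
import numpy as np

def logloss(y, f):
    # elementwise L(y, f) = log(1 + exp(-y f))
    return np.log1p(np.exp(-y * f))

def compute_pseudo_residual(y, fm):
    # same formula as the notebook cell above
    return y / (1 + np.exp(y * fm))

y = np.array([-1.0, -1.0, 1.0, 1.0])
fm = np.array([-0.4, 0.1, -0.3, 2.0])

eps = 1e-6  # finite-difference step (assumption: small enough for this smooth loss)
numeric = -(logloss(y, fm + eps) - logloss(y, fm - eps)) / (2 * eps)
analytic = compute_pseudo_residual(y, fm)
```

The two arrays should agree to within numerical error, which confirms that the function returns the negative gradient of the loss.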

## Function which fits a Neural Network

In [12]:
def fitNN(model, X, r, epochs=20, lr=0.1, verbose = False):
    """ 
    Fit a regression model to the pseudo-residuals.
        
    Returns the fitted values on training data as a numpy array.
    Shape of the return should be (N,) not (N,1).
    """
    
    #get data as float tensors w proper dimensions
    x = torch.FloatTensor(X)
    y = torch.FloatTensor(r).unsqueeze(1)
    
    #specify Adam as optimizer
    #could add weight_decay for L2 regularization (it does not change the learning rate)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    
    
    for i in range(epochs):
        model.train() #training mode
        out = model(x) #fitted values for the pseudo-residuals, shape (N, 1)
        
        
        #calculate the regression loss between out and y (e.g. mean squared error)
        #...
        
        #grad descent
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if verbose:
            print("loss {:.3f}".format(loss.item()))
        out = out.view(-1)
        
    return out.detach().numpy()
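
The missing loss line has to produce a scalar `loss` tensor that compares `out` with the pseudo-residuals. A minimal sketch of such a training loop is shown below, assuming mean squared error as the regression loss. The small `nn.Sequential` model and the synthetic targets are stand-ins chosen here so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(64, 5)             # stand-in features
r = x[:, 0] - 0.5 * x[:, 1]        # stand-in pseudo-residuals (real-valued targets)
y = r.unsqueeze(1)                 # shape (N, 1) to match the model output

model = nn.Sequential(nn.Linear(5, 10), nn.ReLU(), nn.Linear(10, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

losses = []
for _ in range(20):
    model.train()
    out = model(x)
    loss = F.mse_loss(out, y)      # assumed regression loss: mean squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

fitted = out.view(-1).detach().numpy()  # shape (N,), as the docstring requires
```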

In [13]:
N, D = X.shape
N, D

(3600, 57)

In [14]:
model = NN(D = D, seed = 1)
model

NN(
  (linear1): Linear(in_features=57, out_features=10, bias=True)
  (linear2): Linear(in_features=10, out_features=1, bias=True)
  (bn1): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)

### Manual Boosting: First stage

In [15]:
nu = 0.1

In [16]:
y_mean = Y.mean()
f0 = np.log((1 + y_mean)/(1 - y_mean))
f0

-0.4019940876152855
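
The starting value `f0` is the constant score that minimizes the logloss. With $p$ the fraction of $+1$ labels, $\bar{y} = 2p - 1$, so $(1+\bar{y})/(1-\bar{y}) = p/(1-p)$ and `f0` is the log-odds of spam. The sketch below checks this numerically on a made-up label vector; the 40/60 split and the grid search are illustrative assumptions.

```python
import numpy as np

Y = np.array([1.0] * 40 + [-1.0] * 60)      # toy labels: 40% positive
y_mean = Y.mean()
f0 = np.log((1 + y_mean) / (1 - y_mean))    # equals log(40 / 60)

def mean_logloss(f):
    # average logloss when every example gets the same score f
    return np.mean(np.log1p(np.exp(-Y * f)))

# brute-force search over constant scores around f0
grid = np.linspace(f0 - 1, f0 + 1, 2001)
best = grid[np.argmin([mean_logloss(f) for f in grid])]
```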

In [17]:
r1 = Y/(1 + np.exp(Y*f0))

In [18]:
yhat1 = fitNN(model, X, r1, epochs=20, lr=0.1)
f1 = f0 + nu * yhat1

In [19]:
#then ...

r2 = Y/(1 + np.exp(Y*f1))

#...and so on until m = M

## Function which does the boosting automatically

Let's write a function to do the boosting for us.

In [20]:
def boostingNN(X, Y, num_iter, nu):
    """Given a numpy matrix X, an array Y,
    the number of iterations num_iter (aka M) and the shrinkage nu,
    return the initial score f0 and the fitted networks.
   
    Inputs: X, Y, num_iter, nu
    Outputs: f0, list of fitted regression models (one NN per stage)
    Assumes Y is in {-1, 1}
    """
    models = []
    N, D = X.shape
    seeds = [s+1 for s in range(num_iter)] # use these seeds to call the models

    #boost the NNs
    y_mean = Y.mean()
    f0 = np.log((1 + y_mean)/(1 - y_mean))
    fm = f0
    for t in range(num_iter):
        res = Y/(1+np.exp(Y*fm))
        
        
        #call the model with the t^th seed
        #...
        
        #fit it
        #...
        
        #additive stagewise modeling
        #...
        
        models.append(model)
    return f0, models
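
To illustrate the additive stagewise update without giving a full neural-network solution, the sketch below runs the same loop with a swapped-in base learner: ordinary linear least squares replaces the NN. The names `boost_linear` and `logloss` and the toy data are hypothetical. In the notebook's version, the least-squares fit would be replaced by creating `NN(D, seeds[t])` and calling `fitNN` on the residuals.

```python
import numpy as np

def logloss(y, f):
    return np.mean(np.log1p(np.exp(-y * f)))

def boost_linear(X, Y, num_iter, nu):
    """Gradient boosting for logloss with a linear least-squares base learner."""
    N = X.shape[0]
    A = np.column_stack([np.ones(N), X])          # design matrix with intercept
    y_mean = Y.mean()
    f0 = np.log((1 + y_mean) / (1 - y_mean))      # best constant score
    fm = np.full(N, f0)
    coefs = []
    for _ in range(num_iter):
        res = Y / (1 + np.exp(Y * fm))            # pseudo-residuals at the current scores
        beta, *_ = np.linalg.lstsq(A, res, rcond=None)  # fit base learner to residuals
        fm = fm + nu * (A @ beta)                 # additive stagewise update
        coefs.append(beta)
    return f0, coefs, fm

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
Y_toy = np.where(X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0, 1.0, -1.0)
f0, coefs, fm = boost_linear(X_toy, Y_toy, num_iter=50, nu=0.1)
loss_start = logloss(Y_toy, np.full(200, f0))
loss_end = logloss(Y_toy, fm)
```

The training logloss after boosting (`loss_end`) is lower than at the constant start (`loss_start`), because each stage moves the scores along a descent direction of the loss.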

### Convert to final prediction

In [21]:
def gradient_boosting_predict(X, f0, models, nu):
    """Given X, models, f0 and nu predict y_hat in {-1, 1}
    
    y_hat should be a numpy array with shape (N,)
    """
    y_hat = f0
    x = torch.FloatTensor(X)
    for model in models:
        model.eval() #eval mode
        out = model(x).view(-1)
        y_hat += nu*out.detach().numpy()

    y_hat = 2 * (y_hat > 0) - 1 #sign(y_hat), with a score of exactly 0 mapped to -1

    return y_hat
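
The thresholding step can be seen in isolation on a few made-up scores. Under the logloss used here, a score $f$ also corresponds to a spam probability $1/(1+e^{-f})$, so thresholding the score at 0 is the same as thresholding the probability at 0.5. The `scores` array below is an illustrative assumption.

```python
import numpy as np

scores = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])   # hypothetical boosted scores

labels = 2 * (scores > 0) - 1        # maps to {-1, 1}; a score of 0 becomes -1
probs = 1 / (1 + np.exp(-scores))    # implied P(y = 1 | x) under logloss
```

Unlike `np.sign`, which returns 0 for a zero score, this mapping always returns a label in {-1, 1}.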

## Testing & Sanity Checks

In [22]:
def accuracy(y, pred):
    return np.sum(y == pred) / float(y.shape[0]) 

In [23]:
X, Y = parse_spambase_data(PATH/"tiny.spam.train")
X_val, Y_val = parse_spambase_data(PATH/"tiny.spam.test")
X, X_val = normalize(X, X_val)

In [24]:
xx = np.around(X_val[0, :3],3)
assert(np.array_equal(xx, np.array([-0.433, -0.491, -0.947])))

In [25]:
X, Y = parse_spambase_data(PATH/"spambase.train")
X_val, Y_val = parse_spambase_data(PATH/"spambase.test")
X, X_val = normalize(X, X_val)

In [26]:
xx = np.around(X[0, :3],3)
assert(np.array_equal(xx, np.array([-0.343, -0.168, -0.556])))

In [27]:
y = np.array([-1, -1, 1, 1])
fm = np.array([-0.4, .1, -0.3 , 2])
res = compute_pseudo_residual(y, fm)
xx = np.around(res, 3)
actual = np.array([-0.401, -0.525,  0.574,  0.119])
assert(np.array_equal(xx, actual))

In [28]:
X, Y = parse_spambase_data(PATH/"tiny.spam.train")
X_val, Y_val = parse_spambase_data(PATH/"tiny.spam.test")
X, X_val = normalize(X, X_val)

In [29]:
nu = .1
f0, models = boostingNN(X, Y, num_iter=10, nu=nu)
y_hat = gradient_boosting_predict(X, f0, models, nu=nu)

In [30]:
acc_train = accuracy(Y, y_hat)
assert(acc_train==1)

In [31]:
y_hat = gradient_boosting_predict(X_val, f0, models, nu=nu)
acc_val = accuracy(Y_val, y_hat)
assert(acc_val==0.8)

In [32]:
X, Y = parse_spambase_data(PATH/"spambase.train")
X_val, Y_val = parse_spambase_data(PATH/"spambase.test")
X, X_val = normalize(X, X_val)

In [33]:
nu = .1
f0, models = boostingNN(X, Y, num_iter=100, nu=nu)
y_hat = gradient_boosting_predict(X, f0, models, nu=nu)

In [193]:
acc_train = accuracy(Y, y_hat)
assert(np.around(acc_train, decimals=4)==0.9697)

In [194]:
y_hat = gradient_boosting_predict(X_val, f0, models, nu=nu)
acc_val = accuracy(Y_val, y_hat)
assert(np.around(acc_val, decimals=4)==0.944)

## Exercises

1. Tune the boosting model to optimize accuracy on the validation set. What hyperparameters can we adjust?

2. Consider adding more layers to the NN. How would this affect the need for boosting? How would it change the ideal hyperparameters? Implement a deeper NN class and do some experiments to see how this change affects the boosting hyperparameters.

3. Generalize what we did with boosting NNs for classification to boosting NNs for regression. Consider: What changes? What remains the same? You can use the following dataset to guide your development: 
https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data
    

