# Dropout Regularization
In this assignment you will implement dropout regularization. See the class notes for background on how dropout works.

Let's import the required packages.

In [1]:
import numpy as np
import argparse # for command-line parsing
import matplotlib # for plotting
from matplotlib import pyplot as plt # for plotting
from abc import ABC, abstractmethod # for creating abstract classes

from assign2_utils import load_train_data, load_test_data, flatten, df 

%matplotlib inline 

We will move from the function-style implementations of the previous assignments to a class-style implementation in this assignment. The previous assignments should have made you a little more familiar with Python, and you should now feel comfortable with the basics of Python and NumPy. However, note that these assignments will not teach you Python. Wherever possible, we will try to touch on different aspects of Python.

# Design
We will create an abstract class called Layers. Every layer, whether linear, non-linear, dropout or batchnorm, should inherit from this class. Each concrete layer should override the forward and backward methods of the abstract class. A concrete layer should also override the update_params method if it has parameters, such as weights and biases, to be learnt during training. Further, we will make the layers callable so that calling a layer object with the arguments of its forward method automatically invokes forward.

So, here is the abstract class Layers.

In [2]:
class Layers(ABC): # Layers inherits from ABC. The ABC class in the abc module is required to make a class abstract,
                   # since Python has no dedicated syntax for abstract classes
    def __init__(self): 
        super().__init__() # calls the constructor of the super class ABC
    
    @abstractmethod
    def forward(self, *x): # should be overridden by every class that inherits from this class
        pass
    
    @abstractmethod
    def backward(self, *da): # should be overridden by every class that inherits from this class
        pass    
    
    def update_params(self, learning_rate = None): # should be overridden by those inheriting classes that have
                                                   # parameters to be learnt
        raise NotImplementedError
    
    def __call__(self, *x): # makes the inheriting classes callable, calling the forward method of the class
        return self.forward(*x)
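
To make the design concrete, here is a minimal sketch of the two mechanisms used above: an abstract base class cannot be instantiated, and `__call__` forwards to `forward`. The names `Base` and `Double` are hypothetical and exist only for this illustration; they are not part of the assignment code.

```python
from abc import ABC, abstractmethod

class Base(ABC):  # hypothetical stand-in for the Layers class
    @abstractmethod
    def forward(self, *x):
        pass

    def __call__(self, *x):
        # calling the object delegates to forward
        return self.forward(*x)

class Double(Base):  # hypothetical concrete layer that overrides forward
    def forward(self, x):
        return 2 * x

layer = Double()
result = layer(3)  # equivalent to layer.forward(3)

# Instantiating the abstract class directly raises TypeError,
# because forward has not been overridden.
try:
    Base()
    instantiated = True
except TypeError:
    instantiated = False
```
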

Now, let's create the Linear layer. When an object of this layer is instantiated, the number of incoming features/nodes, the number of outgoing activations/features, whether a bias is required at this layer, and whether regularization is required at this layer are initialized. The weight matrix and the bias vector (if required) are also initialized. Gradients of the loss with respect to the weights and biases are initialized to empty arrays. Then, forward, backward and update_params are overridden.

In [3]:
class Linear(Layers): # inherits from abstract class Layers
    def __init__(self, in_features, out_features, bias = True, regularization = None): # constructor
        super().__init__()
        self.in_features = in_features # initialize number of incoming features
        self.out_features = out_features # initialize number of outgoing features
        self.weight = np.random.randn(out_features, in_features) * .01 # initialize weight matrix
        if bias:
            self.bias = np.zeros((out_features, 1)) # initialize bias if required
        else:
            self.bias = None # if bias not required, set it to None
        self.regularization = regularization # initialize regularization
        if self.regularization is not None: 
            self.reg_penalty = 0 # if valid regularization at this layer, set the regularization penalty
                                 # at this layer initially to zero
            
        self.dw = np.empty_like(self.weight) # initialize dw to empty
        self.db = np.empty_like(self.bias) if self.bias is not None else None 
                                            # initialize db to empty if bias == True
        
    def forward(self, x): # forward method overridden; x is the incoming activation. 
                          # shape of x is (num of activations, num of samples)
            
        m = x.shape[1]    # number of training examples
        self.x = x # x is required for backward. We don't need a separate cache. We can store it in the object.
        
        output = self.weight @ x # computation of the linear part Wx                                 
        if self.bias is not None:            
            output += self.bias # add to Wx bias if bias == True
                                # Note that we don't apply non-linearity here as this layer computes only Wx+b
        
        #update regularization penalty at this layer
        if self.regularization is None:
            pass
        elif self.regularization == 'L2':
            self.reg_penalty = args.lamda/(2*m) * np.sum(self.weight**2)
        elif self.regularization == 'L1':
            self.reg_penalty = args.lamda/m * np.sum(np.abs(self.weight))
        else:
            raise ValueError(f'Regularization method {self.regularization} not defined')
            
        return output # return forward output as the next layer in the model will require it     
    
    # Backward of this layer will receive dz. Note that at this layer z = Wx+b. So backward will compute
    # dw, db and dx. To compute dW, x is required. That's why x was stored in forward. To compute dx, W is 
    # required. This is already available in self.weight
    
    def backward(self, dz): # backward method overridden
        m = dz.shape[1]     # number of training examples
        self.dw = dz @ self.x.T # compute dw
        
        # add derivative of regularization penalty at this layer w.r.t. w
        if self.regularization is None:
            pass
        elif self.regularization == 'L1':
            signw = np.sign(self.weight)
            signw[signw == 0] = 1
            self.dw += args.lamda/m * signw
        elif self.regularization == 'L2':
            self.dw += args.lamda/m * self.weight
        else:
            raise ValueError(f'Regularization method {self.regularization} not defined')
            
        if self.bias is not None: # compute db if bias == True
            self.db = np.sum(dz, axis = 1, keepdims = True) 
            
        dx = self.weight.T @ dz  # compute dx     
        return dx # we only return dx as this is required for the chain rule in the next layer 
                  # in the backward direction. dw and db are kept in this object and are used
                  # directly by update_params for updating weights and biases
    
    # update parameters in this layer 
    def update_params(self, learning_rate = 0.005):
        self.weight -= learning_rate*self.dw
        if self.bias is not None:
            self.bias -= learning_rate*self.db
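
As a sanity check on the backward formulas used above (`dw = dz @ x.T`, `db = sum of dz over samples`, `dx = W.T @ dz`), the following sketch compares them against finite differences. It is a standalone illustration, not part of the assignment: the shapes, the random seed and the surrogate loss `sum((Wx + b) * dz)` are assumptions chosen so that the gradient of the loss with respect to z is exactly `dz`.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed, chosen only for reproducibility
W = rng.standard_normal((3, 4))    # (out_features, in_features)
b = np.zeros((3, 1))
x = rng.standard_normal((4, 5))    # (in_features, num samples)
dz = rng.standard_normal((3, 5))   # pretend upstream gradient dL/dz

def surrogate_loss(W, b, x):
    # a scalar whose gradient w.r.t. z = Wx + b is exactly dz
    return np.sum((W @ x + b) * dz)

# analytic gradients, same formulas as in Linear.backward (without regularization)
dw = dz @ x.T
db = np.sum(dz, axis=1, keepdims=True)
dx = W.T @ dz

# central finite differences on one entry of W, b and x
eps = 1e-6
W_p, W_m = W.copy(), W.copy()
W_p[1, 2] += eps
W_m[1, 2] -= eps
num_dw = (surrogate_loss(W_p, b, x) - surrogate_loss(W_m, b, x)) / (2 * eps)

b_p, b_m = b.copy(), b.copy()
b_p[0, 0] += eps
b_m[0, 0] -= eps
num_db = (surrogate_loss(W, b_p, x) - surrogate_loss(W, b_m, x)) / (2 * eps)

x_p, x_m = x.copy(), x.copy()
x_p[3, 1] += eps
x_m[3, 1] -= eps
num_dx = (surrogate_loss(W, b, x_p) - surrogate_loss(W, b, x_m)) / (2 * eps)
```

The numerical and analytic values should agree to within floating-point rounding, which is a quick way to catch a transposed matrix or a missing sum in a hand-written backward pass.
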

Hopefully you now have the idea of how the different layers are built. Next, you will implement the NonLinear layer class. When an object of type NonLinear is instantiated, the name of the non-linearity, denoted by fname, is set to the one received by the constructor. This part is already done for you. You are required to override the forward and backward methods of the parent abstract class. 

forward method will receive input as x and return the non-linear mapping of x where the non-linearity is specified by fname. Note that this input x would have come from the previous Linear layer.

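backward method will receive dx, the gradient of the loss with respect to this layer's output (passed back from the layer that follows it in the forward direction), and return the gradient of the loss with respect to this layer's input.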

update_params does not need to be overridden here, as this layer has no parameters to be learnt.

In [4]:
class NonLinear(Layers):
    def __init__(self, fname='ReLU'):
        super().__init__()
        self.fname = fname
    
    def forward(self, x):
        self.x = x
        if self.fname == 'ReLU':
            return np.maximum(x, 0)
        elif self.fname == 'Sigmoid':
            return 1 / (1 + np.exp(-x))
        elif self.fname == 'Tanh':
            return np.tanh(x)
        else:
            raise ValueError('Unknown non-linear function error')
    
    def backward(self, dx):    # implemented instead of using df(...)
        if self.fname == 'ReLU':
            return dx * (self.x > 0)
        elif self.fname == 'Sigmoid':
            sigmoid_x = 1 / (1 + np.exp(-self.x))
            return dx * sigmoid_x * (1 - sigmoid_x)
        elif self.fname == 'Tanh':
            return dx * (1 - np.tanh(self.x)**2)
        else:
            raise ValueError('Unknown non-linear function error')
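
The derivative expressions used in backward can be checked numerically. The sketch below, which is only an illustration with hand-picked sample inputs, compares the closed-form sigmoid derivative s(x)(1 - s(x)) with a central finite difference and shows the 0/1 mask that the ReLU derivative produces.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([[-2.0, -0.5, 0.5, 3.0]])  # arbitrary sample inputs
eps = 1e-6

# sigmoid: analytic derivative vs. central finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
max_err = np.max(np.abs(numeric - analytic))

# ReLU: the derivative is 1 where x > 0 and 0 elsewhere
relu_grad = (x > 0).astype(float)
```
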

Similarly, you will implement Dropout layer class in the cell below.

In [5]:
class Dropout(Layers):
    def __init__(self, keep_prob = 0.8):
        super().__init__()
        self.keep_prob = keep_prob
    
    def forward(self, x, train=True):
        if train:    
            d = np.random.rand(*x.shape)
            d = (d < self.keep_prob)
            self.d = d
            x = x * d
            x = x / self.keep_prob
        return x
    
    def backward(self, dx):
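        return dx * self.d / self.keep_prob # forward scaled the kept activations by 1/keep_prob,
                                            # so the gradient is scaled the same way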
    
    def __call__(self, *x, train=True):
        return self.forward(*x, train=train)
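
The forward pass above masks activations and then divides by keep_prob (often called inverted dropout). A short sketch of why the division is there: it keeps the expected activation unchanged, so no rescaling is needed at test time. The array size and seed below are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
keep_prob = 0.8
x = np.ones((1000, 1000))       # all-ones activations make the averages easy to read

mask = rng.random(x.shape) < keep_prob  # True with probability keep_prob
dropped = x * mask / keep_prob          # kept units are scaled up by 1/keep_prob

kept_fraction = mask.mean()   # close to keep_prob
mean_after = dropped.mean()   # close to the original mean of 1.0
```

Roughly 80% of the units survive, each survivor is scaled to 1/0.8 = 1.25, and the overall mean stays near 1.
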

Now that all layers are built, we will build our model. Our model is a composition of these layers. Let's say our model is as follows:
        
        Ip layer(I)----->Linear(L1)------>Non-linear(R)------>Dropout(D)----->Linear(L2)----->Non-linear(S)
 
Note that the first group of Linear, Non-linear and Dropout layers comprise the first hidden layer. The second group of Linear and Non-linear layers forms the output layer.

- Ip (input) features shape: nx$^{[0]}$ x m (a batch of m samples, each of dimension nx$^{[0]}$). In this assignment, nx$^{[0]}$ will be 64\*64\*3 = 12288.
- weights between I and L1 have shape: nx$^{[1]}$ x nx$^{[0]}$. nx$^{[1]}$ = 32.
- bias vector at L1 has shape: nx$^{[1]}$ x 1
- non-linear layer R is ReLU.
- weights between D and L2 have shape: nx$^{[2]}$ x nx$^{[1]}$. nx$^{[2]}$ = 1.
- bias vector at L2 has shape: nx$^{[2]}$ x 1
- non-linear layer S is Sigmoid.
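
The shapes listed above can be traced with plain NumPy arrays. This is a sketch only: the batch size m = 5 and the all-zero parameters are assumptions chosen to keep the example small, and dropout is omitted because it does not change shapes.

```python
import numpy as np

n0, n1, n2 = 12288, 32, 1  # layer sizes from the list above
m = 5                      # hypothetical batch size

x = np.zeros((n0, m))                            # input batch
W1, b1 = np.zeros((n1, n0)), np.zeros((n1, 1))   # parameters of L1
W2, b2 = np.zeros((n2, n1)), np.zeros((n2, 1))   # parameters of L2

a1 = np.maximum(W1 @ x + b1, 0)           # Linear + ReLU: shape (32, m)
a2 = 1 / (1 + np.exp(-(W2 @ a1 + b2)))    # Linear + Sigmoid: shape (1, m)
```

Each bias of shape (n, 1) is broadcast across the m columns, so the output has one probability per sample.
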

The model is shown below. It is self-explanatory provided you go through the code carefully. We have also added a loss method that computes and returns the logistic loss.

In [6]:
class Model(Layers):
    def __init__(self, in_features):
        super().__init__()
        # self.fc1 = Linear(in_features, 32, regularization = 'L2')
        self.fc1 = Linear(in_features, 32)
        self.relu = NonLinear('ReLU')
        self.dp = Dropout()
        self.fc2 = Linear(32, 1)
        self.sigmoid = NonLinear('Sigmoid')
        
    def forward(self, x, train=True):
        x = self.fc1(x) # Note that we made the classes callable, so calling the object invokes its
                        # forward method. That's why we can write fc1(x) instead of fc1.forward(x),
                        # although fc1.forward(x) would also work. The calls below follow the
                        # same pattern.
        x = self.relu(x)
        x = self.dp(x,train=train)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x
    
    def loss(self, output, y):
        m = output.shape[1]
        L = -(1./m) * np.sum(y*np.log(output) + (1-y)*np.log(1-output)) # compute loss
        for att in self.__dict__.values(): # iterate over the layer objects, not the attribute names
            if hasattr(att, 'reg_penalty'):
                L += att.reg_penalty
        return L
    
    def backward(self, output, y):
        epsilon = 1e-6
        m = output.shape[1]
        d_output = (1./m) * (output-y) * (1./((output*(1-output))+epsilon)) # compute da        
        dz = self.sigmoid.backward(d_output)
        dx = self.fc2.backward(dz)
        dx = self.dp.backward(dx)
        dz = self.relu.backward(dx)
        dx = self.fc1.backward(dz)  
    
    def update_params(self, learning_rate = 0.005):
        self.fc1.update_params(learning_rate)
        self.fc2.update_params(learning_rate)
    
    def __call__(self, *x, train=True):
        return self.forward(*x, train=train)
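
In Model.backward, the gradient of the logistic loss with respect to the output is passed through the sigmoid's backward. The sketch below illustrates, on a few hand-picked values that are not taken from the assignment data, that this chain collapses to the familiar (output - y)/m. The epsilon used in the model for numerical safety is left out here so that the identity is exact.

```python
import numpy as np

z = np.array([[-1.5, 0.2, 2.0]])  # hypothetical pre-activations at the output layer
y = np.array([[0.0, 1.0, 1.0]])   # hypothetical labels
m = z.shape[1]

a = 1 / (1 + np.exp(-z))                        # sigmoid output
d_output = (1.0 / m) * (a - y) / (a * (1 - a))  # dL/da for the logistic loss
dz_chain = d_output * a * (1 - a)               # chain rule through the sigmoid
dz_direct = (a - y) / m                         # simplified closed form
```
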

The rest of the code, including the train, test and main functions, is shown below. The code is self-explanatory.

In [7]:
# instantiate the ArgumentParser object
parser = argparse.ArgumentParser(description='Train a fully connected network with regularization')
# add arguments
parser.add_argument('--miter', metavar='N', type=int, default=200, help='max number of iterations to train')
parser.add_argument('--alpha', metavar='LEARNING_RATE', type=float, default=0.001, help='initial learning rate')
parser.add_argument('--lamda', metavar='LAMBDA', type=float, default=1., help='regularization parameter')
parser.add_argument('--print_freq', metavar='N', type=int, default=300, help='print model loss every print_freq iterations')

# parse the arguments. 
# Since we cannot invoke the code written in jupyter directly from command-line, 
# we can specify the required arguments in the call to parse_args as shown below with other arguments 
# left out to use their default values.
args = parser.parse_args('--miter 3000 --alpha .008'.split()) # you may play with this code by changing
                                                                        # the arguments as required
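
As a small illustration of the technique used above, `parse_args` accepts an explicit list of strings in place of the real command line, and any option not in the list falls back to its default. The parser below is a reduced, hypothetical copy used only for this demonstration.

```python
import argparse

demo_parser = argparse.ArgumentParser(description='demo parser')
demo_parser.add_argument('--miter', type=int, default=200)
demo_parser.add_argument('--alpha', type=float, default=0.001)
demo_parser.add_argument('--lamda', type=float, default=1.0)

# Passing a list bypasses sys.argv, which is convenient inside a notebook.
demo_args = demo_parser.parse_args(['--miter', '3000', '--alpha', '.008'])
```

Here `demo_args.miter` and `demo_args.alpha` take the supplied values, while `demo_args.lamda` keeps its default.
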

In [8]:
def train(model, x, y):
    for i in range(args.miter):
        output = model(x) # model is a callable object with call to its forward method.
                          # we could also have written the rhs as model.forward(x)
        L = model.loss(output, y)
        model.backward(output, y)
        model.update_params(args.alpha)
        if not i%args.print_freq: # print loss every print_freq iterations
            print(f'Loss at iteration {i}:\t{float(L):.4f}') # np.asscalar was removed from NumPy; float() works here
                
def test_model(model, x, y):
    predictions = model(x, train=False)
    predictions[predictions > 0.5] = 1
    predictions[predictions <= 0.5] = 0
    acc = np.mean(predictions == y)
    acc = float(acc) # np.asscalar was removed from NumPy; float() gives a plain Python float
    return acc
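
The thresholding step in test_model can be illustrated on a handful of made-up probabilities. The values below are hypothetical; note that a probability of exactly 0.5 is mapped to class 0, matching the `<= 0.5` rule above.

```python
import numpy as np

probs = np.array([[0.1, 0.5, 0.51, 0.9]])  # hypothetical sigmoid outputs
labels = np.array([[0, 0, 1, 0]])          # hypothetical ground-truth labels

preds = (probs > 0.5).astype(int)          # threshold at 0.5
acc = float(np.mean(preds == labels))      # fraction of correct predictions
```
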

In [9]:
def main(): # main function to train and test the model    
    
    global args
    # load train data
    x, y = load_train_data()
    x = flatten(x)
    x = x/255. # normalize the data to [0, 1]     
    
    # Instantiate the model
    my_model = Model(x.shape[0])
    
    # train the model
    train(my_model, x, y)
    
    # test the model
    print(f'train accuracy: {test_model(my_model, x, y) * 100:.2f}%')

    x, y = load_test_data()
    x = flatten(x)
    x = x/255. # normalize the data to [0, 1]
    print(f'test accuracy: {test_model(my_model, x, y) * 100:.2f}%')
    
    return
    
if __name__ == '__main__':
    main()

Loss at iteration 0:	0.6963
Loss at iteration 300:	0.5974
Loss at iteration 600:	0.4506
Loss at iteration 900:	0.3342
Loss at iteration 1200:	0.2633
Loss at iteration 1500:	0.1985
Loss at iteration 1800:	0.1051
Loss at iteration 2100:	0.0779
Loss at iteration 2400:	0.0465
Loss at iteration 2700:	0.0385
train accuracy: 100.00%
test accuracy: 70.00%
