# Regularization

### Need for Regularization

I want to start the discussion with Occam's razor, which suggests that we choose the simplest model that works. Choosing a simple model is difficult for a neural network because it is inherently complex. During training, a neural network learns the distribution of the data so that it can perform well on new data drawn from the same distribution (measured by test accuracy). There is usually a slight drop in performance between training and testing, and this drop is called the generalization error. In some cases, when the training is too aggressive, the network starts fitting the individual training samples instead of learning the underlying data distribution, which results in poor performance at test time. This is known as overfitting, and it reflects poor generalizability of the network. We use several techniques to improve the generalization of a network, and these techniques are collectively called regularization. There are three main regularization techniques used in neural networks:
1. Classic Regularization ($L^1$ and $L^2$)
2. Dropout
3. Batch Normalization

## Classic Regularization (Weight Norm Penalties)
Regularization helps us simplify the final model even when the architecture is complex. One classic type of regularization is the weight penalty, which keeps the values of the weight vectors in check. To achieve this, we add a norm of the weight vector to the error function to obtain the final cost function. We can use any norm from $L^1$ to $L^\infty$, but the most widely used are $L^2$ and $L^1$.
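
As a rough illustration of the idea above, the sketch below adds a weighted norm of a toy weight vector to an error value. The helper name `penalized_cost`, the sample weights, the coefficient, and the error value are all made up for illustration; none of them come from YANN.

```python
import numpy as np

def penalized_cost(error, weights, coeff, order):
    """Return error + coeff * ||weights||_order (hypothetical helper)."""
    # Flatten so that a weight matrix is treated as one long weight vector.
    return error + coeff * np.linalg.norm(np.ravel(weights), ord=order)

w = np.array([3.0, -4.0, 0.0])   # toy weight vector
error = 1.0                      # made-up value of the unregularized error E(X)

cost_l1 = penalized_cost(error, w, 0.1, 1)        # ||w||_1   = 7
cost_l2 = penalized_cost(error, w, 0.1, 2)        # ||w||_2   = 5
cost_inf = penalized_cost(error, w, 0.1, np.inf)  # ||w||_inf = 4
print(cost_l1, cost_l2, cost_inf)
```

The same weights receive a different penalty under each norm, which is why the choice of norm changes what kind of solution the optimizer prefers.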

### $L^2$ Regularization
$L^2$ regularization is also known as ridge regression or Tikhonov regularization. Among the weight penalties, $L^2$ is the most widely used. It penalizes large weights: we add the square of the $L^2$ norm of the weights to the cost function. Mathematically, $L^2$ regularization is given by:
$$Cost = E(X) + \lambda \parallel W \parallel_2 ^ 2$$
The gradient $g$ of the regularized cost with respect to the weights $W$ is given by:
$$g = \frac{\partial E(X)}{\partial W} + 2 \lambda W$$

$\lambda$ is the regularization coefficient that can be used to control the level of regularization.
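
To make the gradient formula concrete, here is a minimal NumPy sketch that uses a toy squared error in place of $E(X)$ and checks the analytic gradient against finite differences. The function names, the random data, and the value of `lam` are assumptions chosen for illustration only.

```python
import numpy as np

def error(w, x, y):
    """Toy squared error standing in for E(X)."""
    return 0.5 * np.sum((x @ w - y) ** 2)

def error_grad(w, x, y):
    """Gradient of the toy error with respect to w."""
    return x.T @ (x @ w - y)

def l2_cost(w, x, y, lam):
    # Cost = E(X) + lambda * ||W||_2^2
    return error(w, x, y) + lam * np.sum(w ** 2)

def l2_grad(w, x, y, lam):
    # g = dE/dW + 2 * lambda * W
    return error_grad(w, x, y) + 2.0 * lam * w

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)
lam = 0.1

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = np.array([
    (l2_cost(w + eps * e, x, y, lam) - l2_cost(w - eps * e, x, y, lam)) / (2 * eps)
    for e in np.eye(3)
])
max_diff = np.max(np.abs(numeric - l2_grad(w, x, y, lam)))
print(max_diff)  # should be tiny if the formula is right
```

Because the penalty gradient $2 \lambda W$ is proportional to the weight itself, large weights are pulled toward zero harder than small ones.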

### $L^1$ Regularization

In $L^1$ regularization we add the $L^1$ norm of the weight vector to the cost function. $L^1$ regularization penalizes every weight that is not zero. It pushes weights toward zero, so the final parameters are sparse, with most of the weights being zero. Mathematically, $L^1$ regularization is given by:
$$Cost = E(X) + \lambda \parallel W \parallel_1$$
The gradient $g$ of the regularized cost with respect to the weights $W$ is given by:
$$g = \frac{\partial E(X)}{\partial W} + \lambda \operatorname{sign}(W)$$
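
The following sketch evaluates only the penalty part of the gradient for a toy weight vector and compares it with the $L^2$ case. The weights and the value of `lam` are made up. Note that the $L^1$ norm is not differentiable at zero; `np.sign` returns 0 there, which is one valid subgradient.

```python
import numpy as np

lam = 0.01
w = np.array([2.0, -0.5, 0.001, 0.0])   # toy weights, one of them already zero

l1_penalty = lam * np.sum(np.abs(w))     # lambda * ||W||_1
l1_penalty_grad = lam * np.sign(w)       # lambda * sign(W)
l2_penalty_grad = 2.0 * lam * w          # 2 * lambda * W, shown for comparison

print(l1_penalty_grad)
print(l2_penalty_grad)
```

The $L^1$ term pushes every nonzero weight with the same magnitude `lam`, however small the weight is, whereas the $L^2$ push fades as the weight shrinks. That constant push is what tends to drive small weights all the way to zero.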

#### Combination of Norm Penalties

We do not have to restrict ourselves to a single weight norm penalty for a parameter. We can combine more than one weight penalty, and the final model will then be shaped by the properties of all the regularizers involved. For example, if we use both $L^1$ and $L^2$ weight penalties in our model, the cost function becomes
$$Cost = E(X) + \lambda_2 \parallel W \parallel_2 ^ 2 + \lambda_1 \parallel W \parallel_1$$
The gradient $g$ of the regularized cost with respect to the weight vector $W$ is given by:
$$g = \frac{\partial E(X)}{\partial W} + 2 \lambda_2 W + \lambda_1 \operatorname{sign}(W)$$
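
A combined $L^1$ and $L^2$ penalty is often called an elastic net penalty. The sketch below applies one plain gradient descent step with both terms. The helper name `elastic_net_grad`, the gradient of the error term, the coefficients, and the learning rate are all assumed values for illustration.

```python
import numpy as np

def elastic_net_grad(error_grad, w, lam1, lam2):
    """Gradient of E + lam2 * ||w||_2^2 + lam1 * ||w||_1, given dE/dw."""
    return error_grad + 2.0 * lam2 * w + lam1 * np.sign(w)

w = np.array([1.0, -2.0, 0.0])       # toy weights
dE_dw = np.array([0.5, 0.5, 0.5])    # made-up gradient of the error term

g = elastic_net_grad(dE_dw, w, lam1=0.1, lam2=0.01)

# One vanilla gradient descent step using the regularized gradient.
learning_rate = 0.1
w_new = w - learning_rate * g
print(g, w_new)
```

Setting `lam1` or `lam2` to zero recovers the pure $L^2$ or pure $L^1$ gradient, so the two coefficients can be tuned independently.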

#### Regularization by Norm Penalties in YANN
YANN offers the flexibility of regularizing a selected layer or the entire network. To regularize a layer, set the following arguments of the ***`network.add_layer()`*** function:
<pre>
regularize – True if you want to apply regularization, False if not.
regularizer – coefficients for the L1 and L2 regularizers. Default is (0.001, 0.001).
</pre>
To set common regularization parameters for the entire network, pass the regularization argument in the optimizer parameters:
<pre>"regularization"    : (l1_coeff, l2_coeff). Default is (0.001, 0.001) </pre>
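
As a sketch of how these options might be written down, the snippet below only builds the argument dictionaries, so it runs without YANN installed. The regularization keys are the ones described above; the layer type, ids, sizes, and coefficient values are illustrative assumptions, and the final calls are shown as comments because they need a live `net` object.

```python
# Keyword arguments for a single regularized layer.
layer_kwargs = {
    "type": "dot_product",
    "origin": "input",
    "id": "dot_product_1",
    "num_neurons": 800,              # illustrative size
    "activation": "relu",
    "regularize": True,              # turn regularization on for this layer
    "regularizer": (0.001, 0.001),   # (L1 coefficient, L2 coefficient)
}

# Optimizer parameters carrying network-wide regularization coefficients.
optimizer_params = {
    "momentum_type": "nesterov",
    "momentum_params": (0.65, 0.97, 30),
    "regularization": (0.0001, 0.0001),   # (l1_coeff, l2_coeff), assumed values
    "optimizer_type": "rmsprop",
    "id": "main",
}

# With YANN installed and a network object `net` created, these would be used as:
#   net.add_module(type="optimizer", params=optimizer_params)
#   net.add_layer(**layer_kwargs)
```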
 
Let's see regularization in action:

In [None]:
from yann.network import network
from yann.utils.graph import draw_network
from yann.special.datasets import cook_mnist
def lenet5 ( dataset= None, verbose = 1, regularization = None ):             
    """
    This function is a demo example of LeNet-5 from the famous paper by Yann LeCun. 
    This is an example code. You should study this code rather than merely run it.  
    
    Warning:
        This is not the exact implementation but a modern re-incarnation.

    Args: 
        dataset: Supply a dataset.    
        verbose: Similar to the rest of the toolbox.
        regularization: Passed on to the objective layer as its ``regularization`` argument.
    """
    optimizer_params =  {        
                "momentum_type"       : 'nesterov',             
                "momentum_params"     : (0.65, 0.97, 30),      
                "optimizer_type"      : 'rmsprop',                
                "id"                  : "main"
                        }

    dataset_params  = {
                            "dataset"   : dataset,
                            "svm"       : False, 
                            "n_classes" : 10,
                            "id"        : 'data'
                      }

    visualizer_params = {
                    "root"       : 'lenet5',
                    "frequency"  : 1,
                    "sample_size": 144,
                    "rgb_filters": True,
                    "debug_functions" : False,
                    "debug_layers": False,  # Set to True to print everything.
                    "id"         : 'main'
                        }       

    # initialize the network
    net = network(   borrow = True,
                     verbose = verbose )                       
    
    # or you can add modules after you create the net.
    net.add_module ( type = 'optimizer',
                     params = optimizer_params, 
                     verbose = verbose )

    net.add_module ( type = 'datastream', 
                     params = dataset_params,
                     verbose = verbose )

    net.add_module ( type = 'visualizer',
                     params = visualizer_params,
                     verbose = verbose 
                    )
    # add an input layer 
    net.add_layer ( type = "input",
                    id = "input",
                    verbose = verbose, 
                    datastream_origin = 'data', # if you didn't add a dataset module, now is 
                                                 # the time. 
                    mean_subtract = False )
    
    # add first convolutional layer
    net.add_layer ( type = "conv_pool",
                    origin = "input",
                    id = "conv_pool_1",
                    num_neurons = 20,
                    filter_size = (5,5),
                    pool_size = (2,2),
                    activation = 'relu',
                    # regularize = True,
                    verbose = verbose
                    )

    net.add_layer ( type = "conv_pool",
                    origin = "conv_pool_1",
                    id = "conv_pool_2",
                    num_neurons = 50,
                    filter_size = (3,3),
                    pool_size = (2,2),
                    activation = 'relu',
                    # regularize = True,
                    verbose = verbose
                    )      


    net.add_layer ( type = "dot_product",
                    origin = "conv_pool_2",
                    id = "dot_product_1",
                    num_neurons = 1250,
                    activation = 'relu',
                    # regularize = True,
                    verbose = verbose
                    )

    net.add_layer ( type = "dot_product",
                    origin = "dot_product_1",
                    id = "dot_product_2",
                    num_neurons = 1250,                    
                    activation = 'relu',  
                    # regularize = True,    
                    verbose = verbose
                    ) 
    
    net.add_layer ( type = "classifier",
                    id = "softmax",
                    origin = "dot_product_2",
                    num_classes = 10,
                    # regularize = True,
                    activation = 'softmax',
                    verbose = verbose
                    )

    net.add_layer ( type = "objective",
                    id = "obj",
                    origin = "softmax",
                    objective = "nll",
                    datastream_origin = 'data', 
                    regularization = regularization,                
                    verbose = verbose
                    )
                    
    learning_rates = (0.05, .0001, 0.001)  
    net.pretty_print()  
    # draw_network(net.graph, filename = 'lenet.png')   

    net.cook()

    net.train( epochs = (20, 20), 
               validate_after_epochs = 1,
               training_accuracy = True,
               learning_rates = learning_rates,               
               show_progress = True,
               early_terminate = True,
               patience = 2,
               verbose = verbose)

    net.test(verbose = verbose)
data = cook_mnist()
dataset = data.dataset_location()
lenet5 ( dataset, verbose = 0)

. Setting up dataset 
.. setting up skdata
... Importing mnist from skdata
.. setting up dataset
.. training data
.. validation data 
.. testing data 
. Dataset 78347 is created.
. Time taken is 0.47504 seconds
.. This method will be deprecated with the implementation of a visualizer,also this works only for tree-like networks. This will cause errors in printing DAG-style networks.
 |-
 |-
 |-
 |- id: input
 |- type: input
 |- output shape: (500, 1, 28, 28)
 |------------------------------------
          |-
          |-
          |-
          |- id: conv_pool_1
          |- type: conv_pool
          |- output shape: (500, 20, 12, 12)
          |- batch norm is OFF
          |------------------------------------
          |- filter size [5 X 5]
          |- pooling size [2 X 2]
          |- stride size [1 X 1]
          |- input shape [28 28]
          |- input number of feature maps is 1
          |------------------------------------
                   |-
                   |-
      

| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               



                   |- batch norm is OFF
                   |------------------------------------
                   |- filter size [3 X 3]
                   |- pooling size [2 X 2]
                   |- stride size [1 X 1]
                   |- input shape [12 12]
                   |- input number of feature maps is 20
                   |------------------------------------
                            |-
                            |-
                            |-
                            |- id: 4
                            |- type: flatten
                            |- output shape: (500, 1250)
                            |------------------------------------
                                     |-
                                     |-
                                     |-
                                     |- id: dot_product_1
                                     |- type: dot_product
                                     |- output shape: (500, 1250)
                   

| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.536438
... Learning Rate       : 9.49999957811e-05
... Momentum            : 0.660666584969


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.301666
... Learning Rate       : 9.02499959921e-05
... Momentum            : 0.671333312988


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.191394
... Learning Rate       : 8.57374980114e-05
... Momentum            : 0.681999981403


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.135745
... Learning Rate       : 8.14506202005e-05
... Momentum            : 0.692666649818


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.105225
... Learning Rate       : 7.73780921008e-05
... Momentum            : 0.703333318233


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0858618
... Learning Rate       : 7.3509188951e-05
... Momentum            : 0.713999986649


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0719565
... Learning Rate       : 6.98337316862e-05
... Momentum            : 0.724666655064


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0611352
... Learning Rate       : 6.63420432829e-05
... Momentum            : 0.735333323479


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0528243
... Learning Rate       : 6.30249414826e-05
... Momentum            : 0.745999991894


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0460229
... Learning Rate       : 5.9873695136e-05
... Momentum            : 0.756666600704


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.040339
... Learning Rate       : 5.68800096516e-05
... Momentum            : 0.767333269119


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.035487
... Learning Rate       : 5.40360088053e-05
... Momentum            : 0.777999937534


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0312105
... Learning Rate       : 5.13342092745e-05
... Momentum            : 0.788666665554


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0273663
... Learning Rate       : 4.87674988108e-05
... Momentum            : 0.799333274364


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0241477
... Learning Rate       : 4.63291253254e-05
... Momentum            : 0.80999994278


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.021233
... Learning Rate       : 4.40126677859e-05
... Momentum            : 0.820666670799


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.018664
... Learning Rate       : 4.18120325776e-05
... Momentum            : 0.83133327961


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.016489
... Learning Rate       : 3.97214316763e-05
... Momentum            : 0.841999948025


| training  100% Time: 0:00:03                                                 
| validation  100% Time: 0:00:01                                               


.. Cost                : 0.0148402
... Learning Rate       : 3.77353608201e-05
... Momentum            : 0.85266661644
