# Incremental Learning

## Elastic Weight Consolidation

Let θ denote the parameters of a deep neural network (DNN). Training a DNN learns a mapping from the input distribution space to the target distribution space.

This is done by finding an optimum $\theta = \theta^*$ that gives the least error in the training objective. Earlier work has shown that such a mapping can be obtained with many configurations of $\theta^*$, as illustrated in the image below.

<div>
  <img src="Images/Untitled.png" alt="Untitled" style="width: 250px; height: 250px;">
</div>

These configurations form a region around the optimal $\theta$ in which the error of the learned mapping remains acceptable.

Let’s begin with a simple case of two tasks, task A and task B. To have a configuration of parameters that performs well for both A and B, the network should be able to pick θ from the overlapping region of the individual solution spaces. In the first instance, the network can learn any θ = $\theta_A$ that performs well for task A. But with the arrival of task B, the network should pick a θ = $\theta_{A,B}$ that works for both. The next question is how the network can learn a set of parameters that lies in this overlapping region.

To this end, EWC presents a method of selective regularization of parameters θ. After learning A, this regularization method identifies which parameters are important for A, and then penalizes any change made to the network parameters according to their importance while learning B.
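
The effect of such an importance-weighted penalty can be sketched on a toy problem. Everything below is hypothetical and chosen only for illustration: two parameters, a quadratic loss for task B, and hand-picked importance values standing in for what EWC would estimate after task A. The code minimizes a loss, so the penalty is added rather than subtracted.

```python
# Hypothetical toy setup: two parameters, quadratic losses, hand-picked importances.
theta_a_star = [1.0, 0.0]   # solution found for task A
importance_a = [10.0, 0.1]  # assumed importance of each parameter for task A
theta_b_star = [0.0, 2.0]   # optimum of task B on its own


def grad_task_b(theta):
    """Gradient of L_B(theta) = 0.5 * sum_i (theta_i - theta_b_star_i)^2."""
    return [t - tb for t, tb in zip(theta, theta_b_star)]


def grad_penalty(theta, lam):
    """Gradient of (lam / 2) * sum_i importance_i * (theta_i - theta_a_star_i)^2."""
    return [lam * f * (t - ta) for t, f, ta in zip(theta, importance_a, theta_a_star)]


def train_on_b(lam, steps=2000, lr=0.01):
    """Gradient descent on L_B plus the penalty, starting from the task-A solution."""
    theta = list(theta_a_star)
    for _ in range(steps):
        grads = [gb + gp for gb, gp in zip(grad_task_b(theta), grad_penalty(theta, lam))]
        theta = [t - lr * g for t, g in zip(theta, grads)]
    return theta


plain = train_on_b(lam=0.0)  # no penalty: both parameters drift to the task-B optimum
ewc = train_on_b(lam=1.0)    # penalty: the important first parameter stays close to 1.0
print("without penalty:", plain)
print("with penalty:   ", ewc)
```

With the penalty, the parameter that matters most for A barely moves, while the unimportant one is free to adapt to B.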

<div>
  <img src="Images/2tasks.png" alt="2tasks" style="width: 250px; height: 250px;">
    <img src="Images/4tasks.png" alt="4tasks" style="width: 250px; height: 250px;">
</div>

Let’s say the weights to be learned are $W = [w_1, w_2, w_3, \dots, w_n]$.

These weights may follow a particular distribution, such as:

1. Normal distribution
2. Uniform distribution
3. Laplace distribution
4. Dirichlet distribution

You can identify which distribution fits best using a visual check (for example, a histogram) or a statistical test.
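
As a rough illustration of a statistical check, the sketch below compares the sample excess kurtosis of a set of values with the theoretical values for three of the families above (uniform: -1.2, normal: 0, Laplace: 3). The helper names and the synthetic samples are assumptions made for this example. It is a crude moment-based heuristic, not a formal goodness-of-fit test, and it does not cover the Dirichlet case.

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    fourth = statistics.fmean((v - mean) ** 4 for v in values)
    return fourth / var ** 2 - 3.0


# Theoretical excess kurtosis of three candidate families.
REFERENCE = {"uniform": -1.2, "normal": 0.0, "laplace": 3.0}


def closest_family(values):
    """Return the candidate family whose excess kurtosis is nearest to the sample's."""
    k = excess_kurtosis(values)
    return min(REFERENCE, key=lambda name: abs(REFERENCE[name] - k))


rng = random.Random(0)
n = 20_000
samples = {
    "uniform": [rng.uniform(-1.0, 1.0) for _ in range(n)],
    "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
    # The difference of two i.i.d. exponential variables is Laplace-distributed.
    "laplace": [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)],
}
guesses = {name: closest_family(vals) for name, vals in samples.items()}
print(guesses)
```

In practice the same function could be applied to the flattened weights of a trained layer instead of synthetic samples.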

So what purpose do these distributions serve in training?

1. Before training
    
    Initializing the weights according to a suitable distribution can help convergence and training performance.
    
    Example: Xavier initialization draws the initial weights from a uniform or normal distribution.
    
2. After training
    
    Analyzing the distribution of the weights can reveal properties of the model such as sparsity, variance, or the presence of certain patterns.
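
Both uses can be sketched in a few lines of plain Python. The example assumes the uniform variant of Xavier initialization, which samples from $U(-a, a)$ with $a = \sqrt{6 / (fan_{in} + fan_{out})}$. The layer sizes, the `summarize` helper, and the near-zero threshold are hypothetical choices for illustration.

```python
import math
import random
import statistics


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]


def summarize(weights, near_zero=1e-2):
    """Mean, variance, and fraction of near-zero entries of a weight matrix."""
    flat = [w for row in weights for w in row]
    return {
        "mean": statistics.fmean(flat),
        "variance": statistics.pvariance(flat),
        "sparsity": sum(abs(w) < near_zero for w in flat) / len(flat),
    }


rng = random.Random(0)
fan_in, fan_out = 256, 128
W = xavier_uniform(fan_in, fan_out, rng)
stats = summarize(W)
limit = math.sqrt(6.0 / (fan_in + fan_out))
print(stats)
```

For this scheme the variance should be close to $2 / (fan_{in} + fan_{out})$, which gives a quick sanity check on the initializer.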
    

To formulate the objective we use a Bayesian approach to estimate the parameters $\theta$.

$P(\theta|\Sigma) = \frac{P(\Sigma|\theta)\,P(\theta)}{P(\Sigma)} \qquad (1)$

Here $\Sigma$ is the data, $P(\theta|\Sigma)$ is the posterior, $P(\Sigma|\theta)$ is the likelihood, and $P(\theta)$ is the prior.

We want to learn the posterior PDF $P(\theta|\Sigma)$.

Maximizing a function is equivalent to maximizing its logarithm, so we take the log of (1):

$\log P(\theta|\Sigma) = \log P(\Sigma|\theta) + \log P(\theta) - \log P(\Sigma) \qquad (2)$

Training a neural network then amounts to maximizing this log posterior:

$\arg\max\limits_{\theta} \{ l(\theta) = \log P(\theta|\Sigma) \}$

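Suppose the data $\Sigma$ consists of two tasks, A and B, learned in sequence. Then we can write (2) as

$\log p(\theta|\Sigma) = \log p(B|A, \theta) + \log p(\theta|A) - \log p(B|A) \qquad (3)$

Assuming the two tasks are independent, this simplifies to

$\log p(\theta|\Sigma) = \log p(B|\theta) + \log p(\theta|A) - \log p(B) \qquad (4)$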
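
The structure of (4) can be checked numerically on a small discrete example: the posterior after task A acts as the prior for task B, and the result matches a single update on both tasks at once. The coin-flip data, the grid of θ values, and the helper names below are assumptions made for this sketch.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def likelihood(theta, flips):
    """Bernoulli likelihood of a list of 0/1 flips under heads-probability theta."""
    p = 1.0
    for x in flips:
        p *= theta if x == 1 else 1.0 - theta
    return p


thetas = [i / 10 for i in range(1, 10)]  # candidate parameter values 0.1 .. 0.9
prior = normalize([1.0] * len(thetas))   # uniform prior P(theta)
task_a = [1, 1, 0, 1]                    # hypothetical data for task A
task_b = [1, 0, 1, 1, 1]                 # hypothetical data for task B

# Posterior after A, then used as the prior when B arrives: p(B|theta) * p(theta|A).
post_a = normalize([likelihood(t, task_a) * p for t, p in zip(thetas, prior)])
post_seq = normalize([likelihood(t, task_b) * p for t, p in zip(thetas, post_a)])

# Posterior computed from both tasks at once.
post_joint = normalize([likelihood(t, task_a + task_b) * p for t, p in zip(thetas, prior)])

# Maximum a posteriori estimate on the grid.
theta_map = thetas[max(range(len(thetas)), key=post_seq.__getitem__)]
print(theta_map)
```

The normalization step plays the role of the $\log p(B)$ term, which does not depend on θ and therefore does not affect the arg max.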

But $\log p(\theta|A)$ is intractable, so the EWC authors approximate it with a Laplace approximation around $\theta^*_A$, and the overall objective becomes

$l(\theta) = l_B(\theta) - \frac{\lambda}{2}(\theta - \theta^*_A)^T F_A (\theta - \theta^*_A) + \epsilon' \qquad (5)$

where $F_A = E\left[-\frac{\partial^2 \log p(\theta|A)}{\partial\theta^2}\Big|_{\theta^*_A}\right] = E\left[\left(\frac{\partial \log p(\theta|A)}{\partial\theta}\right)\left(\frac{\partial \log p(\theta|A)}{\partial\theta}\right)^T\Big|_{\theta^*_A}\right]$ is the Fisher Information Matrix (FIM).

The Fisher Information Matrix (FIM) encodes how important each weight is for the previous task, here task A.
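
The step from (4) to (5) can be sketched as a second-order Taylor expansion of $\log p(\theta|A)$ around $\theta^*_A$. The notation here is an assumption made for this sketch: $H_A$ is the Hessian of $\log p(\theta|A)$ at $\theta^*_A$, $F_A$ is the FIM of task A, and $c$ is a constant introduced only for illustration.

$\log p(\theta|A) \approx \log p(\theta^*_A|A) + \frac{1}{2}(\theta - \theta^*_A)^T H_A (\theta - \theta^*_A)$

The first-order term drops out because the gradient vanishes at the optimum $\theta^*_A$. Approximating $-H_A$ by the scaled FIM $\lambda F_A$ and substituting into (4) gives

$\log p(\theta|\Sigma) \approx \log p(B|\theta) - \frac{\lambda}{2}(\theta - \theta^*_A)^T F_A (\theta - \theta^*_A) + c, \qquad c = \log p(\theta^*_A|A) - \log p(B)$

This has the form of (5) if $l_B(\theta)$ is read as $\log p(B|\theta)$ and the constant $c$ is absorbed into $\epsilon'$.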

<div>
  <img src="Images/regularization.png" alt="regularization" style="width: 800px; height: 200px;">
</div>

### PyTorch Implementation

The implementation is fairly straightforward.

First, train the model on task A.

When training on task B, we need the FIM (the precision matrix) computed from the model trained on task A.

To estimate it, compute the gradient of the log-likelihood with respect to each parameter for every data instance of task A, square it, and average over the instances.

This gives an array holding the importance of each parameter.

While training on task B, add a penalty term: the squared difference between the current parameters and the task-A parameters, weighted by the corresponding FIM (precision) values.
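
A minimal sketch of these steps in PyTorch might look like the following. It is not the code from the repository linked below: the tiny linear model, the random data, the learning rates, and the value of `lam` are all assumptions for illustration. It also uses the dataset labels when computing gradients (an empirical Fisher estimate), which is one of several possible choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def diagonal_fisher(model, inputs, targets):
    """Diagonal FIM estimate: mean over samples of squared log-likelihood gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in zip(inputs, targets):
        model.zero_grad()
        log_probs = F.log_softmax(model(x.unsqueeze(0)), dim=1)
        F.nll_loss(log_probs, y.unsqueeze(0)).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    return {n: f / len(inputs) for n, f in fisher.items()}


def ewc_penalty(model, fisher, old_params):
    """Sum over parameters of fisher_i * (theta_i - theta_A_i)^2."""
    return sum((fisher[n] * (p - old_params[n]) ** 2).sum()
               for n, p in model.named_parameters())


torch.manual_seed(0)
model = nn.Linear(4, 3)  # hypothetical tiny classifier
xa, ya = torch.randn(32, 4), torch.randint(0, 3, (32,))  # stand-in data for task A
xb, yb = torch.randn(32, 4), torch.randint(0, 3, (32,))  # stand-in data for task B

# Step 1: train on task A.
opt_a = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(50):
    opt_a.zero_grad()
    F.cross_entropy(model(xa), ya).backward()
    opt_a.step()

# Step 2: store the task-A parameters and their estimated importance.
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = diagonal_fisher(model, xa, ya)

# Step 3: train on task B with the penalty added to the loss.
lam = 10.0  # assumed regularization strength
opt_b = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(100):
    opt_b.zero_grad()
    loss = F.cross_entropy(model(xb), yb) + (lam / 2) * ewc_penalty(model, fisher, old_params)
    loss.backward()
    opt_b.step()
```

Because the code minimizes a loss rather than maximizing $l(\theta)$, the penalty enters with a plus sign.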

https://github.com/moskomule/ewc.pytorch/tree/master

### Issues with EWC

1. Scalability issues.
    
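    Computing the FIM is expensive in both time and memory: the full matrix has $n^2$ entries for $n$ parameters, and even a diagonal estimate requires a gradient computation for every sample used.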
    
2. Assumption of parameter independence
    
    EWC assumes that the parameters are independent, meaning that it only considers the diagonal elements of the Fisher Information Matrix. This assumption ignores the potential correlations between different parameters, which can lead to suboptimal performance in preserving knowledge.
    
3. Task similarity and importance weighting
    
    The method relies on the Fisher Information Matrix to determine the importance of weights. However, this matrix might not capture the true importance of weights across very different tasks. If tasks are significantly different, the Fisher Information Matrix from previous tasks might not be a good indicator of weight importance for future tasks.
    
4. Storage of Past Information
    
    While EWC reduces the need to store entire datasets of past tasks, it still requires storing the learned parameters and their corresponding Fisher Information Matrices. For a large number of tasks, this can lead to significant storage requirements, which can become impractical over time.
    
5. Diminishing Effectiveness with Many Tasks
    
    As the number of tasks increases, the regularization terms from multiple tasks accumulate. This can lead to a scenario where the model becomes overly constrained, reducing its capacity to learn new tasks effectively. This is sometimes described as "**regularization collapse**."
    
6. Empirical Performance
    
    In practice, the performance of EWC can vary significantly depending on the specific tasks and datasets. Some studies have shown that EWC can be less effective compared to other continual learning methods, especially in scenarios where tasks are very diverse or when the number of tasks is large.
    
7. Complex Hyperparameter Tuning
    
    EWC introduces additional hyperparameters, such as the regularization strength $\lambda$. Tuning these hyperparameters can be complex and time-consuming, and the optimal values can vary widely depending on the tasks and datasets.
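
The scalability and storage points can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes 32-bit floats, a hypothetical model with one million parameters, and that one parameter snapshot plus one diagonal FIM is kept per task. The function names are made up for this example.

```python
def fisher_memory_bytes(num_params, bytes_per_value=4, diagonal=True):
    """Memory for one FIM: n entries if diagonal, n * n entries if full."""
    entries = num_params if diagonal else num_params ** 2
    return entries * bytes_per_value


def ewc_storage_bytes(num_params, num_tasks, bytes_per_value=4):
    """Per task: one parameter snapshot plus one diagonal FIM."""
    return num_tasks * 2 * num_params * bytes_per_value


n = 1_000_000  # hypothetical parameter count
diag_mb = fisher_memory_bytes(n) / 1e6                   # megabytes
full_tb = fisher_memory_bytes(n, diagonal=False) / 1e12  # terabytes
ten_tasks_mb = ewc_storage_bytes(n, num_tasks=10) / 1e6  # megabytes

print(f"diagonal FIM: {diag_mb} MB, full FIM: {full_tb} TB, 10 tasks: {ten_tasks_mb} MB")
```

The gap between the diagonal and full matrix is why the diagonal approximation is used despite ignoring parameter correlations, and the per-task total shows how storage grows linearly with the number of tasks.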