### Regularization techniques

### Introduction
Regularization techniques such as weight decay, dropout, and early stopping are widely used to prevent neural networks from overfitting. In computer vision, image augmentation techniques like Cutout, Mixup, and CutMix also act as regularizers and are crucial for improving a model's generalization ability. This notebook collects resources (papers and code) on these common regularization methods, aiming to help you understand their core implementation and adopt them easily in your own projects.

### 1.Cutout, [code](https://github.com/uoguelph-mlrg/Cutout/blob/master/util/cutout.py) 
\begin{align*} M_i &= \text{random}(h, w)\\ x &= x \odot M_i \end{align*}
where $M_i$ is a binary mask of size $h \times w$ that is 0 inside the $i$-th hole and 1 elsewhere, and $\odot$ denotes element-wise multiplication.
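
As a rough illustration of the two equations above, the sketch below builds a mask with square holes and multiplies it into a `(C, H, W)` tensor. The function name `cutout` and the parameters `n_holes`, `length`, and `rng` are assumptions made for this example; they are not the API of the linked repository.

```python
import numpy as np
import torch


def cutout(img, n_holes=1, length=8, rng=None):
    """Zero out `n_holes` square patches of side `length` in a (C, H, W) tensor."""
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = img.shape
    mask = torch.ones((h, w), dtype=img.dtype, device=img.device)

    for _ in range(n_holes):
        # the hole centre is sampled uniformly; the hole is clipped at the borders
        cy, cx = int(rng.integers(h)), int(rng.integers(w))
        y1, y2 = max(cy - length // 2, 0), min(cy + length // 2, h)
        x1, x2 = max(cx - length // 2, 0), min(cx + length // 2, w)
        mask[y1:y2, x1:x2] = 0.0

    # x = x * M, with the mask broadcast over the channel dimension
    return img * mask


img = torch.ones(3, 32, 32)
out = cutout(img, n_holes=1, length=8, rng=np.random.default_rng(0))
zeroed = int((out[0] == 0).sum())
print(out.shape, zeroed)  # at most 8 * 8 pixels are zeroed per channel
```

Because the hole is clipped at the image border, the erased area can be smaller than `length * length` when the centre falls near an edge.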

### 2.Mixup
\begin{align*} x &= \lambda x_i + (1 - \lambda)x_j, \qquad \text{where } x_i, x_j \text{ are raw input vectors}\\ y &= \lambda y_i + (1 - \lambda)y_j, \qquad \text{where } y_i, y_j \text{ are one-hot label encodings} \end{align*}

where $\lambda \in [0, 1]$ is the mixing rate, sampled from a $\text{Beta}(\alpha, \alpha)$ distribution.

```python
import numpy as np
import torch


class Mix_Up(object):
    """Mixup for a batch of images.

    Args:
        alpha (float): parameter of the Beta(alpha, alpha) distribution lambda is sampled from.
        p (float): probability of applying Mixup to a batch.
    """
    def __init__(self, alpha=1.0, p=0.5):
        super().__init__()
        self.alpha = alpha
        self.p = p

    def mix_batch(self, images, labels):
        # sample the mixing rate from a Beta distribution
        lam = float(np.random.beta(self.alpha, self.alpha))
        batch_size = images.shape[0]
        # random pairing of samples, on the same device as the batch
        indexs = torch.randperm(batch_size, device=images.device)

        # mix pairs of images
        images_mixed = images * lam + images[indexs] * (1 - lam)
        labels_a, labels_b = labels, labels[indexs]

        return images_mixed, labels_a, labels_b, lam

    def mix_criterion(self, criterion, pred, labels_a, labels_b, lam):
        # mix the two losses with the same lambda instead of mixing one-hot labels
        return criterion(pred, labels_a) * lam + criterion(pred, labels_b) * (1 - lam)

    def apply_mix(self):
        # decide whether Mixup is applied to the current batch
        return bool(np.random.random() < self.p)
```
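
The Mixup code returns two label tensors and mixes the two losses, while the equation mixes the labels themselves. For cross-entropy these are equivalent, because the loss is linear in the target. The check below is a small sketch with made-up logits and labels (all values are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, lam = 5, 0.3
logits = torch.randn(4, num_classes)
labels_a = torch.tensor([0, 1, 2, 3])
labels_b = torch.tensor([4, 3, 2, 1])

# (a) mix the two losses with lambda
loss_mixed = (lam * F.cross_entropy(logits, labels_a)
              + (1 - lam) * F.cross_entropy(logits, labels_b))

# (b) mix the one-hot labels first, then compute the cross-entropy by hand
y = (lam * F.one_hot(labels_a, num_classes).float()
     + (1 - lam) * F.one_hot(labels_b, num_classes).float())
loss_soft = -(y * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

print(torch.allclose(loss_mixed, loss_soft, atol=1e-6))
```

Mixing the losses avoids building one-hot tensors and works with any criterion that expects integer class labels.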

### 3.CutMix

\begin{align*} x &= M_i \odot x_i + (1 - M_i) \odot x_j, \qquad \text{where } x_i, x_j \text{ are raw input vectors}\\ y &= \lambda y_i + (1 - \lambda)y_j, \qquad\qquad\qquad \text{where } y_i, y_j \text{ are one-hot label encodings} \end{align*}

where $M_i$ is a binary mask that is 0 inside the cut box and 1 elsewhere, and $\lambda$ is the mix rate, i.e. the fraction of pixels kept from $x_i$.

``` python
import numpy as np
import torch


class Cut_Mix(object):
    """CutMix for a batch of images of shape (N, C, H, W)."""
    def __init__(self, alpha=1.0, p=0.5):
        super().__init__()
        self.alpha = alpha
        self.p = p

    def mix_batch(self, images, labels):
        # sample the mixing rate from a Beta distribution
        lam = float(np.random.beta(self.alpha, self.alpha))
        batch_size = images.shape[0]
        indexs = torch.randperm(batch_size, device=images.device)

        # mix pairs of images; the same box is used for the whole batch
        mask, lam = self.create_mask(images[0], lam)
        images_mixed = images * mask + images[indexs] * (1 - mask)

        labels_a, labels_b = labels, labels[indexs]
        return images_mixed, labels_a, labels_b, lam

    def create_mask(self, img, lam):
        C, H, W = img.shape

        mask = np.ones((H, W), np.float32)

        # cut size: the box covers a (1 - lam) fraction of the image area
        cut_rat = np.sqrt(1. - lam)
        cut_w = int(W * cut_rat)  # np.int was removed in NumPy 1.24
        cut_h = int(H * cut_rat)

        # box centre, sampled uniformly
        cx = np.random.randint(W)
        cy = np.random.randint(H)

        # clip the box to the image borders
        bbx1 = np.clip(cx - cut_w // 2, 0, W)
        bby1 = np.clip(cy - cut_h // 2, 0, H)
        bbx2 = np.clip(cx + cut_w // 2, 0, W)
        bby2 = np.clip(cy + cut_h // 2, 0, H)

        mask[bby1: bby2, bbx1: bbx2] = 0.

        # match the device and dtype of the images before broadcasting
        mask = torch.from_numpy(mask).to(device=img.device, dtype=img.dtype)
        mask = mask.expand_as(img)
        # compute the real lambda from the clipped box area
        lam = float(1 - (bbx2 - bbx1) * (bby2 - bby1) / (H * W))
        return mask, lam

    def mix_criterion(self, criterion, pred, labels_a, labels_b, lam):
        return criterion(pred, labels_a) * lam + criterion(pred, labels_b) * (1 - lam)

    def apply_mix(self):
        # decide whether CutMix is applied to the current batch
        return bool(np.random.random() < self.p)
```
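
The "real lambda" step matters because the box is clipped at the image border, so the cut area can be smaller than the sampled $1 - \lambda$. The following worked example fixes the box centre by hand (the helper names `box` and `real_lambda` and the chosen centres are assumptions for illustration) and reproduces the same arithmetic as `create_mask`:

```python
import numpy as np

H = W = 32
lam = 0.75                                          # sampled mixing rate
cut_rat = np.sqrt(1.0 - lam)                        # 0.5
cut_h, cut_w = int(H * cut_rat), int(W * cut_rat)   # 16 x 16 box


def box(cx, cy):
    """Corners (x1, y1, x2, y2) of the cut box, clipped to the image."""
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
    return int(x1), int(y1), int(x2), int(y2)


def real_lambda(cx, cy):
    """Fraction of pixels that survive from the first image."""
    x1, y1, x2, y2 = box(cx, cy)
    return 1 - (x2 - x1) * (y2 - y1) / (H * W)


print(real_lambda(16, 16))  # box fully inside the image: 0.75
print(real_lambda(2, 30))   # box clipped at two borders: 0.90234375
```

In the second case only a 10 x 10 patch is replaced, so weighting the labels with the sampled 0.75 would overstate the contribution of the second image.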

We can also combine Mixup and CutMix into a single, more powerful regularization function, for example:

``` python
class Cut_Mix_Up(object):
    def __init__(self, alpha=1.0, mixing_rate=0.3, p=0.5):
        super().__init__()
        self.alpha = alpha
        self.mixing_rate = mixing_rate  # probability of choosing Mixup over CutMix
        self.p = p

    def mix_batch(self, images, labels):
        ...
        # mixing pairs of images
        if np.random.random() < self.mixing_rate:  # apply Mixup
            images_mixed = images*lam + images[indexs,:]*(1-lam)
        else:  # apply CutMix
            mask, lam = self.create_mask(images[0], lam)
            images_mixed = images*mask + images[indexs,:]*(1-mask)
        ...

    def create_mask(self, img, lam):
        ...

    def mix_criterion(self, criterion, pred, labels_a, labels_b, lam):
        return criterion(pred, labels_a)*lam + criterion(pred, labels_b)*(1-lam)

    def apply_mix(self):
        return bool(np.random.random() < self.p)
```

### 4.Label Smoothing

\begin{align*}
 CELoss(y, p) &= \sum^{C}_{k} - y^{LS}_{k} \log(p_k) = (1 - \alpha)\sum^{C}_{k} - y_{k} \log(p_k) + \alpha / C \sum^{C}_{k} - \log(p_k) \\
 y^{LS}_{k} &=  y_k(1 - \alpha) + \alpha / C 
\end{align*}

where $y_k$ is 1 if class $k$ is the ground truth and 0 otherwise, $\alpha$ is the smoothing rate, and $C$ is the number of classes. Setting $\alpha = 0$ recovers the standard cross-entropy loss.
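
As a quick sanity check of the identity above, the sketch below evaluates both sides on random logits and compares them with PyTorch's built-in `label_smoothing` argument of `F.cross_entropy`. The logits, targets, and $\alpha$ are made-up values, and the built-in argument assumes PyTorch 1.10 or later.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
alpha, C = 0.1, 5
logits = torch.randn(4, C)
target = torch.tensor([0, 2, 1, 4])
log_p = F.log_softmax(logits, dim=-1)

# left-hand side: cross-entropy against the smoothed labels y^LS
y_ls = F.one_hot(target, C).float() * (1 - alpha) + alpha / C
lhs = -(y_ls * log_p).sum(dim=-1).mean()

# right-hand side: (1 - alpha) * CE + alpha / C * sum_k(-log p_k)
rhs = ((1 - alpha) * F.nll_loss(log_p, target, reduction='none')
       + alpha / C * (-log_p).sum(dim=-1)).mean()

# PyTorch's built-in label smoothing (assumes torch >= 1.10)
builtin = F.cross_entropy(logits, target, label_smoothing=alpha)

print(torch.allclose(lhs, rhs, atol=1e-6), torch.allclose(lhs, builtin, atol=1e-6))
```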

```python
import torch.nn as nn
import torch.nn.functional as F


class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, smoothing=0.1, reduction='mean'):
        super().__init__()
        self.reduction = reduction
        self.alpha = smoothing

    def forward(self, inputs, target):
        c = inputs.size()[-1]  # number of classes
        # log_softmax is numerically more stable than log(softmax(x))
        log_preds = F.log_softmax(inputs, dim=-1)

        # keep both terms per-sample so they are reduced in the same way
        uniform_loss = -log_preds.sum(dim=-1)  # sum over classes
        nll = F.nll_loss(log_preds, target, reduction='none')
        ls_celoss = (1 - self.alpha) * nll + self.alpha / c * uniform_loss

        if self.reduction == 'mean':
            return ls_celoss.mean()
        if self.reduction == 'sum':
            return ls_celoss.sum()
        return ls_celoss

```
Relevant articles:
> 1. [When Does Label Smoothing Help?](https://arxiv.org/abs/1906.02629)
> 2. [Label Smoothing](https://leimao.github.io/blog/Label-Smoothing/)
> 3. [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf), Section 7.
> 4. [Distilling the Knowledge in a Neural Network](https://arxiv.org/pdf/1503.02531.pdf), Section 6.

### 5.Confidence Penalty (focal loss)

[Focal loss](https://arxiv.org/abs/1708.02002) was first proposed to address class imbalance in object detection, where background examples vastly outnumber foreground objects. The authors argue that this imbalance is the central factor limiting the detector's accuracy.
Instead of directly forcing the model to focus on minority samples, they modify the standard cross-entropy loss with a modulating factor that down-weights the loss assigned to high-confidence (well-classified) examples.

Focal Loss is formally defined as 
$FL = \sum^C_k -y_k[\alpha_k(1- p_k)^\gamma \log(p_k)]$

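where $\alpha_k$ is a per-class weighting factor and $\gamma \ge 0$ controls the confidence penalty: the larger $\gamma$ is, the more the loss of well-classified examples is down-weighted. With $\gamma = 0$ the loss reduces to $\alpha$-weighted cross-entropy.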

``` python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, reduce=True):
        super().__init__()
        self.alpha = alpha    # scalar weight factor
        self.gamma = gamma    # confidence penalty (focusing) exponent
        self.reduce = reduce  # average over the batch if True

    def forward(self, inputs, targets):
        # per-sample cross-entropy, i.e. -log(p_k) of the true class
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        p_k = torch.exp(-ce_loss)  # compute confidence
        focal_loss = self.alpha * (1 - p_k)**self.gamma * ce_loss

        if self.reduce:
            focal_loss = torch.mean(focal_loss)

        return focal_loss
```
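
To get a feel for the modulating factor, the sketch below evaluates the focal loss of a single example at a few confidence levels. The helper `focal_term` and the probabilities are assumptions chosen for illustration; with $\gamma = 2$ the ratio to plain cross-entropy is $(1 - p)^2$.

```python
import math


def focal_term(p, gamma=2.0, alpha=1.0):
    """Focal loss of one example whose true class has probability p."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)


for p in (0.1, 0.5, 0.9):
    ce = -math.log(p)   # standard cross-entropy for this example
    fl = focal_term(p)
    print(f"p={p:.1f}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")
```

A well-classified example with $p = 0.9$ keeps only about 1% of its cross-entropy loss, while a hard example with $p = 0.1$ keeps about 81%, so hard examples dominate the gradient.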