In [1]:
import torch
a = torch.tensor([[0.,1.,2.,3.,4.],
                  [0.,2.,4.,6.,8.]]) 

### $L_p$ normalization
* <span style="font-size:1.5em">$ \upsilon = \frac{\upsilon}{\max(\lVert \upsilon \rVert_p, \epsilon)}$</span>

In [2]:
torch.nn.functional.normalize(a, p=1.0, dim=0, eps=1e-12)

tensor([[0.0000, 0.3333, 0.3333, 0.3333, 0.3333],
        [0.0000, 0.6667, 0.6667, 0.6667, 0.6667]])

In [3]:
torch.nn.functional.normalize(a, p=2.0, dim=1, eps=1e-12)

tensor([[0.0000, 0.1826, 0.3651, 0.5477, 0.7303],
        [0.0000, 0.1826, 0.3651, 0.5477, 0.7303]])
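To connect the formula to the two calls above, the following is a minimal sketch that recomputes the normalization by hand and compares it with `torch.nn.functional.normalize`. The helper name `lp_normalize` is hypothetical and introduced only for illustration; it assumes the norm is taken along `dim` and clamped from below by `eps`, as the formula states.

```python
import torch

a = torch.tensor([[0., 1., 2., 3., 4.],
                  [0., 2., 4., 6., 8.]])

def lp_normalize(v, p=2.0, dim=1, eps=1e-12):
    """Hypothetical helper: v / max(||v||_p, eps), computed along `dim`."""
    # Keep the reduced axis (size 1) so the norm broadcasts against v.
    norm = torch.linalg.vector_norm(v, ord=p, dim=dim, keepdim=True)
    # clamp_min implements max(||v||_p, eps) and avoids division by zero.
    return v / norm.clamp_min(eps)

# dim=0 normalizes each column; dim=1 normalizes each row.
manual_l1 = lp_normalize(a, p=1.0, dim=0)
manual_l2 = lp_normalize(a, p=2.0, dim=1)

builtin_l1 = torch.nn.functional.normalize(a, p=1.0, dim=0, eps=1e-12)
builtin_l2 = torch.nn.functional.normalize(a, p=2.0, dim=1, eps=1e-12)

print(torch.allclose(manual_l1, builtin_l1))
print(torch.allclose(manual_l2, builtin_l2))
```

The first column of `a` is all zeros, so its $L_1$ norm is 0; the `eps` clamp is what keeps that column at 0 instead of producing `nan`.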

### z-score normalization
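* mean $ \mu = \frac{1}{N} \sum_{i=1}^N x_i $
* standard deviation $ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N(x_i-\mu)^2}$ (the population form, which is what `unbiased=False` selects)
* z-score $ z_i = \frac{x_i-\mu}{\sigma} $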

In [4]:
a.mean(), a.std(unbiased=False)

(tensor(3.), tensor(2.4495))
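The cell above only computes $\mu$ and $\sigma$. As a minimal sketch, assuming the statistics are taken over the whole tensor as in that cell, the normalization itself subtracts the mean and divides by the standard deviation:

```python
import torch

a = torch.tensor([[0., 1., 2., 3., 4.],
                  [0., 2., 4., 6., 8.]])

mu = a.mean()                  # mean over all 10 elements
sigma = a.std(unbiased=False)  # population standard deviation (divide by N)

# z-score: shift to zero mean, scale to unit standard deviation.
z = (a - mu) / sigma

print(z)
print(z.mean(), z.std(unbiased=False))
```

After this transform the tensor should have mean 0 and standard deviation 1, up to floating-point error. To standardize each feature separately, pass `dim=0` and `keepdim=True` to both `mean` and `std`.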

### Batch Normalization

Batch normalization adds an operation to the model just before the activation function of each layer. The operation zero-centers and normalizes its inputs using the mean and variance of the current mini-batch, then scales and shifts the result with two learnable parameters per layer (one for scaling, one for shifting). In other words, it lets the model learn the optimal scale and mean of the inputs to each layer.<br>
<span style="font-size:1.5em">$y=\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}} \cdot \gamma+\beta $</span>, where $\gamma$ is the `weight` and $\beta$ is the `bias`
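The formula can be checked against `nn.BatchNorm1d` directly. The sketch below assumes the layer is in training mode, where it normalizes with the per-feature mean and biased variance of the current batch; the input `x` reuses the values of `a` from the first cell, read as a batch of 2 samples with 5 features.

```python
import torch
import torch.nn as nn

# Batch of N=2 samples with 5 features each.
x = torch.tensor([[0., 1., 2., 3., 4.],
                  [0., 2., 4., 6., 8.]])

bn = nn.BatchNorm1d(5, affine=True)
bn.train()  # training mode: use batch statistics, not running averages

# Per-feature statistics over the batch dimension.
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)  # biased variance is used for normalization

# y = (x - mu) / sqrt(var + eps) * gamma + beta
manual = (x - mu) / torch.sqrt(var + bn.eps) * bn.weight + bn.bias
builtin = bn(x)

print(torch.allclose(manual, builtin, atol=1e-6))
```

In evaluation mode (`bn.eval()`) the layer uses its running estimates of the mean and variance instead, so this comparison would no longer hold.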

In [5]:
import torch
import torch.nn as nn
n = nn.BatchNorm1d(5, affine=True) 
n.weight, n.bias, n.eps

(Parameter containing:
 tensor([1., 1., 1., 1., 1.], requires_grad=True),
 Parameter containing:
 tensor([0., 0., 0., 0., 0.], requires_grad=True),
 1e-05)

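With `affine=False` the layer has no learnable scale or shift (`weight` and `bias` are `None`), which is equivalent to fixing $\gamma=1, \beta=0$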

In [6]:
n = nn.BatchNorm1d(5, affine=False)
n.weight, n.bias, n.eps

(None, None, 1e-05)
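As a quick check of that equivalence, the following sketch compares a freshly constructed `affine=True` layer, whose `weight` is all ones and `bias` all zeros as printed earlier, with an `affine=False` layer on the same batch. Both layers are assumed to be in their default training mode.

```python
import torch
import torch.nn as nn

x = torch.tensor([[0., 1., 2., 3., 4.],
                  [0., 2., 4., 6., 8.]])

bn_affine = nn.BatchNorm1d(5, affine=True)   # gamma=1, beta=0 at initialization
bn_plain = nn.BatchNorm1d(5, affine=False)   # no gamma/beta at all

out_affine = bn_affine(x)
out_plain = bn_plain(x)

# Count learnable parameters in each layer.
n_params_affine = len(list(bn_affine.parameters()))
n_params_plain = len(list(bn_plain.parameters()))

print(torch.allclose(out_affine, out_plain))
print(n_params_affine, n_params_plain)
```

The outputs match only at initialization; once training updates `weight` and `bias`, the affine layer will diverge from the plain one.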