# Factorization Example

Large convolution and affine layers can be decomposed into smaller layers to reduce the number
of parameters and computations. 
NNabla introduces a set of factorized layer types which can approximate the functions of 
larger convolution or affine layers. The parameters of the factorized layers can be initialized 
by low rank approximation of the original layers. The factorized layers currently include:

## SVD Affine

SVD affine is a low rank approximation of the affine layer. It can be seen as 
two consecutive affine layers with a bottleneck.
It computes $\mathbf{y} = \mathbf{U} \mathbf{V} \mathbf{x} + \mathbf{b}$, where $\mathbf{x}$
and $\mathbf{y}$ are the inputs and outputs respectively, and $\mathbf{U}$, $\mathbf{V}$ and 
$\mathbf{b}$ are constants.

The weights $\mathbf{U}$ and $\mathbf{V}$ are obtained from the singular value decomposition (SVD) 
of the original weight matrix $\mathbf{W}$ by keeping the $R$ dominant singular 
values and the corresponding singular vectors, so that $\mathbf{W} \approx \mathbf{U} \mathbf{V}$. 
The rank $R$ is therefore the size of the bottleneck. 
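The truncation described above can be sketched in NumPy. This is a minimal illustration rather than the NNabla implementation: the helper name `svd_affine_factors`, the `(outputs, inputs)` weight layout, and the choice to split the singular values evenly between the two factors are all assumptions made for the example.

```python
import numpy as np

def svd_affine_factors(W, R):
    """Return (U, V) such that U @ V is a rank-R approximation of W.

    Hypothetical helper; W is assumed to have shape (outputs, inputs),
    matching y = W x + b.
    """
    U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep the R dominant singular values and split them evenly
    # between the two factors (an arbitrary but common choice).
    sqrt_s = np.sqrt(s[:R])
    U = U_full[:, :R] * sqrt_s        # shape (outputs, R)
    V = sqrt_s[:, None] * Vt[:R, :]   # shape (R, inputs)
    return U, V

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
b = rng.standard_normal(64)
x = rng.standard_normal(128)

U, V = svd_affine_factors(W, R=16)

y_full = W @ x + b        # original affine layer
y_low = U @ (V @ x) + b   # two affine layers with a bottleneck of size R

n_full = W.size           # parameters in the original weight matrix
n_low = U.size + V.size   # parameters in the two factors
print(n_full, n_low)
```

The factorized layer stores $R(M + N)$ weights instead of $MN$ for an $M \times N$ matrix, so it only saves parameters when $R < MN / (M + N)$.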

## SVD Convolution
  
SVD convolution is a low rank approximation of the convolution layer. It can
be seen as a depthwise convolution followed by a $1\times1$ convolution. 
The flattened kernels for the $i^{th}$ input map, $\mathbf{W}_{:,i,:}$, are approximated with the 
singular value decomposition (SVD) by keeping the $R$ dominant singular values and the
corresponding singular vectors:
  
  $$ \mathbf{W}_{:,i,:} \approx \mathbf{U}_i \mathbf{V}_i $$

$\mathbf{U}$ contains the weights of the depthwise convolution with multiplier $R$ and 
$\mathbf{V}$ contains the weights of the $1\times1$ convolution. 
  
Note that if $R=1$, the structure is equivalent to the depthwise separable convolution introduced in [1].
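As a rough sketch of the per-input-map factorization, the NumPy code below factorizes each slice of a kernel tensor and checks, on a single receptive field, that a depthwise step followed by a $1\times1$ step reproduces the approximated convolution. The `(O, I, K)` kernel layout with flattened kernel size `K`, the helper name `svd_conv_factors`, and the decision to absorb the singular values into the depthwise factor are assumptions for illustration, not NNabla's actual storage format.

```python
import numpy as np

def svd_conv_factors(W, R):
    """Factorize kernels W of shape (O, I, K) separately for each input map.

    Hypothetical helper. Returns
      U: shape (I, K, R), depthwise kernels with multiplier R
      V: shape (I, R, O), weights of the 1x1 convolution
    """
    O, I, K = W.shape
    U = np.empty((I, K, R))
    V = np.empty((I, R, O))
    for i in range(I):
        # Flattened kernels that read input map i, as a (K, O) matrix.
        Wi = W[:, i, :].T
        u, s, vt = np.linalg.svd(Wi, full_matrices=False)
        U[i] = u[:, :R] * s[:R]   # absorb the singular values into U
        V[i] = vt[:R, :]
    return U, V

rng = np.random.default_rng(0)
O, I, kh, kw = 8, 4, 3, 3
W = rng.standard_normal((O, I, kh * kw))
U, V = svd_conv_factors(W, R=2)

# Low rank kernel tensor implied by the factors.
W_approx = np.einsum('ikr,iro->oik', U, V)

# Evaluate both structures on one receptive field (one output pixel).
patch = rng.standard_normal((I, kh * kw))
z = np.einsum('ikr,ik->ir', U, patch)    # depthwise convolution, R maps per input
y_low = np.einsum('iro,ir->o', V, z)     # 1x1 convolution over the I * R maps
y_ref = np.einsum('oik,ik->o', W_approx, patch)
print(np.allclose(y_low, y_ref))
```

$R$ cannot exceed $\min(K, O)$ here, since each slice is a $K \times O$ matrix.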


## CP Convolution

CP convolution is a low rank approximation of the 3D kernel tensor of a convolution layer. It 
can be seen as linear combinations of the input feature maps to $R$ feature maps followed
by a depthwise convolution and followed by linear combinations of the feature maps to compute the
output feature maps. 
The CP decomposition approximates the kernel tensor by a sum of $R$ rank-$1$ tensors of the form:
  
  $$ \sum_{r=1}^{R} \lambda_r \mathbf{o}^{(r)}\otimes\mathbf{i}^{(r)}\otimes\mathbf{k}^{(r)} $$
  
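where $\lambda_r$ are the normalization coefficients, $\otimes$ is the outer product, and 
$\mathbf{o}^{(r)}$, $\mathbf{i}^{(r)}$ and $\mathbf{k}^{(r)}$ are the factor vectors along the 
output map, input map and kernel dimensions respectively.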
  
CP layers were introduced in [2], where the $4$D weight tensor is decomposed. Here, the 
CP convolution is instead initialized to approximate the $3$D weight tensor with reshaped kernels, as in [3].
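To make the three-stage structure concrete, the sketch below builds a kernel tensor from $R$ rank-$1$ terms and checks, on a single receptive field, that it matches a $1\times1$ step, a depthwise step and a second $1\times1$ step applied in sequence. The factors are random rather than fitted to a trained kernel, and the `(O, I, K)` layout and the variable names are assumptions for illustration. Fitting the factors to an existing kernel needs a CP solver, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
O, I, K, R = 8, 4, 9, 5            # output maps, input maps, kernel size, rank

lam = rng.standard_normal(R)         # normalization coefficients lambda_r
Of = rng.standard_normal((O, R))     # column r is o^(r)
If = rng.standard_normal((I, R))     # column r is i^(r)
Kf = rng.standard_normal((K, R))     # column r is k^(r)

# Sum of R rank-1 tensors: W[o, i, k] = sum_r lam[r] * Of[o, r] * If[i, r] * Kf[k, r]
W_cp = np.einsum('r,or,ir,kr->oik', lam, Of, If, Kf)

# One receptive field: I input maps, K kernel positions.
patch = rng.standard_normal((I, K))
y_direct = np.einsum('oik,ik->o', W_cp, patch)

a = np.einsum('ir,ik->rk', If, patch)   # combine the I input maps into R maps
d = np.einsum('kr,rk->r', Kf, a)        # depthwise convolution on the R maps
y_staged = Of @ (lam * d)               # combine the R maps into O output maps

n_full = O * I * K                      # weights in the dense kernel tensor
n_cp = R * (O + I + K) + R              # weights in the factors, plus lambda
print(np.allclose(y_direct, y_staged), n_full, n_cp)
```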

In this example, we show how we can reduce the size of a neural network using the factorized layers
initialized from the layers of a pre-trained network on CIFAR10. 

The example can be run as follows:

Train the original network first:

```sh
python classification.py -o './original_net' \
                         -d 0  -c cudnn \
                         --net 'cifar10_resnet23_prediction'
```

Train the reduced network with SVD affine and SVD convolutions (weight compression rate is at least 40%):

```sh
python classification.py -o './svd_net' \
                         -d 0  -c cudnn \
                         --model-load-path './original_net/params_224000.h5' \
                         --net 'cifar10_svd_factorized_resnet23_prediction' \
                         --compression_ratio 0.4
```

To use CP convolutions instead, train the reduced network with SVD affine and CP convolutions 
(weight compression rate is at least 40%):

```sh
python classification.py -o './cp_net' \
                         -d 0  -c cudnn \
                         --model-load-path './original_net/params_2240000.h5' \
                         --net 'cifar10_cpd3_factorized_resnet23_prediction' \
                         --compression_ratio 0.4
```
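How `classification.py` turns `--compression_ratio` into a rank for each layer is not described here. One plausible reading, assumed purely for illustration, is to pick the largest rank $R$ whose factors hold at most a fraction $1 - c$ of the original weights. The helper `rank_for_compression` below is hypothetical and applies that rule to an affine layer with an $M \times N$ weight matrix.

```python
import math

def rank_for_compression(m, n, compression_ratio):
    """Largest rank R with R * (m + n) <= (1 - compression_ratio) * m * n.

    Hypothetical helper: the rule is an assumption, not necessarily the one
    used by the example script. Returns at least 1.
    """
    budget = (1.0 - compression_ratio) * m * n   # weights we may keep
    return max(1, math.floor(budget / (m + n)))

m, n, c = 64, 128, 0.4
R = rank_for_compression(m, n, c)
kept = R * (m + n) / (m * n)   # fraction of the original weights that remain
print(R, kept)
```

Because the rank is rounded down, the achieved compression is at least the requested one, which would be consistent with the "at least 40%" wording above.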

## References

1. Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. 
"Mobilenets: Efficient convolutional neural networks for mobile vision applications." 
arXiv preprint arXiv:1704.04861 (2017). https://arxiv.org/pdf/1704.04861

2. Lebedev, Vadim, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. 
"Speeding-up convolutional neural networks using fine-tuned cp-decomposition." 
arXiv preprint arXiv:1412.6553 (2014).

3. Astrid, Marcella, and Seung-Ik Lee. 
"CP-decomposition with Tensor Power Method for Convolutional Neural Networks Compression." 
In Big Data and Smart Computing (BigComp), 
2017 IEEE International Conference on, pp. 115-118. IEEE, 2017.