# Structured Sparsity Example

Neural networks are well known to contain redundant filters. Removing them
reduces the number of output feature maps in convolution layers and the number
of output dimensions in affine layers, which lowers both the memory footprint
and the computational cost.

For example, in the 2D convolution case, one can obtain a slim network by removing unnecessary 3D kernels
$w_m \in \mathbb{R}^{N \times K_h \times K_w}$, where
$w_m$ denotes one 3D convolution filter,
$N$ is the number of input maps,
$K_h$ is the kernel height, and
$K_w$ is the kernel width.

This can be achieved by making the filters sparse. The sparsity is induced by
`Structured Sparsity Learning` (`SSL`), proposed in the following paper:

```
Wei Wen, et al.,
"Learning Structured Sparsity in Deep Neural Networks",
https://arxiv.org/abs/1608.03665
```

With `SSL`, the zeros in the weights are not scattered arbitrarily but follow a structure:
in this case, an entire filter $w_m$ can become zero, so such filters can be ignored.
Removing them yields a slim network.

Mathematically, two regularizers induce the sparsity, $R_f(W)$ and $R_c(W)$, defined as

$$R_f(W) = \sum_{m=1}^{M}\sqrt{\sum_{n,k_{h},k_{w}=1}^{N,K_{h},K_{w}}w_{m,n,k_{h},k_{w}}^{2}},$$

$$R_c(W) = \sum_{n=1}^{N}\sqrt{\sum_{m,k_{h},k_{w}=1}^{M,K_{h},K_{w}}w_{m,n,k_{h},k_{w}}^{2}},$$

where $R_f(W)$ induces *filter-wise* sparsity and $R_c(W)$ induces *channel-wise* sparsity. Note that $R_c(W)$ also induces *filter-wise* sparsity indirectly: in a neural network, each input map of a layer is an output map of the preceding layer, so *channel-wise* sparsity in one layer corresponds to *filter-wise* sparsity in the preceding layer.
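The two regularizers can be checked numerically on a toy tensor. The following is a minimal NumPy sketch, assuming the weights are stored with the layout `(M, N, K_h, K_w)` to match the index order $w_{m,n,k_h,k_w}$; the function names are hypothetical and are not taken from the scripts in this example.

```python
import numpy as np

def filter_wise_norm(w):
    """R_f(W): sum over output filters m of the L2 norm of w[m]."""
    return np.sqrt((w ** 2).sum(axis=(1, 2, 3))).sum()

def channel_wise_norm(w):
    """R_c(W): sum over input channels n of the L2 norm of w[:, n]."""
    return np.sqrt((w ** 2).sum(axis=(0, 2, 3))).sum()

# Toy weight tensor: M=2 filters, N=3 input maps, 1x1 kernels (assumed layout).
w = np.zeros((2, 3, 1, 1))
w[0, :, 0, 0] = [3.0, 4.0, 0.0]  # filter 0 has L2 norm 5; filter 1 is all zeros

r_f = filter_wise_norm(w)   # 5 + 0
r_c = channel_wise_norm(w)  # 3 + 4 + 0
```

The all-zero filter and the all-zero input channel each contribute nothing to their respective sums, which is why driving a whole group to zero is what the penalty rewards.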

Usually, $R_f(W)$ and $R_c(W)$ are applied together to every layer, giving the total penalty

$$\lambda_f \sum_{l=1}^{L} R_f(W^{l}) + \lambda_c \sum_{l=1}^{L} R_c(W^{l}),$$

where $\lambda_f$ and $\lambda_c$ are hyperparameters.
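As an illustration of the combined penalty, here is a short NumPy sketch that sums both terms over a list of layer weights. The helper `ssl_penalty`, the layer shapes, and the `(M, N, K_h, K_w)` layout are assumptions made for this sketch, and the value `5e-4` is only an illustrative choice for $\lambda_f$ and $\lambda_c$.

```python
import numpy as np

def ssl_penalty(weights, lambda_f, lambda_c):
    """Return lambda_f * sum_l R_f(W^l) + lambda_c * sum_l R_c(W^l)."""
    total = 0.0
    for w in weights:  # each w is assumed to have shape (M, N, K_h, K_w)
        r_f = np.sqrt((w ** 2).sum(axis=(1, 2, 3))).sum()  # filter-wise groups
        r_c = np.sqrt((w ** 2).sum(axis=(0, 2, 3))).sum()  # channel-wise groups
        total += lambda_f * r_f + lambda_c * r_c
    return total

# Two hypothetical convolution layers with random weights.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 3, 3, 3)), rng.standard_normal((16, 8, 3, 3))]
penalty = ssl_penalty(layers, lambda_f=5e-4, lambda_c=5e-4)
```

In training, a term like this would be added to the task loss, so larger $\lambda_f$ and $\lambda_c$ push more groups toward zero at the expense of fitting the data.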

Using `SSL` takes two steps:

1. Train a reference network with the structured-sparsity-inducing regularization
2. Fine-tune the reference network after removing the unnecessary filters
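The filter removal between the two steps can be pictured as follows. This is only a sketch under stated assumptions: it drops filters whose L2 norm falls below a hypothetical `threshold`, which is not necessarily how `finetuning.py` or its `--reduction-rate` option selects filters.

```python
import numpy as np

def prune_filters(w, threshold=1e-3):
    """Drop output filters whose L2 norm is at or below `threshold`.

    `w` is assumed to have shape (M, N, K_h, K_w). Returns the slimmed
    weights and the boolean mask of kept filters.
    """
    norms = np.sqrt((w ** 2).sum(axis=(1, 2, 3)))
    keep = norms > threshold
    return w[keep], keep

# Four filters, two of which were driven (almost) to zero during training.
w = np.ones((4, 3, 3, 3))
w[1] = 0.0
w[3] = 1e-6
slim_w, keep = prune_filters(w)
```

The same `keep` mask would also have to be applied to the input-channel axis of the following layer, since its input maps are the outputs of the filters removed here.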

To run this example, first train a network:

```sh
python classification.py -c "cudnn" \
    --monitor-path "monitor.filter.lambda-5e4" \
    --model-save-path "monitor.filter.lambda-5e4" \
    --filter-decay 5e-4 \
    --channel-decay 5e-4 \
    -d 0
```

Then fine-tune that network, replacing `${the best result}` with the file name of the best saved model:

```sh
python finetuning.py -c "cudnn" \
    --monitor-path "monitor.finetune.filter.lambda-5e4.rrate-025" \
    --model-save-path "monitor.finetune.filter.lambda-5e4.rrate-025" \
    --model-load-path "monitor.filter.lambda-5e4/${the best result}.h5" \
    --reduction-rate 0.25 \
    --val-interval 1000 \
    -d 0
```

## References
1. Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, "Learning Structured Sparsity in Deep Neural Networks", arXiv:1608.03665