<a href="https://colab.research.google.com/github/wtsyang/dl-reproducibility-project/blob/master/Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Striving for Simplicity: The All Convolutional Net
###### *Wei-Tse Yang, Sunwei Wang, and Qingyuan Cao*
###### 14/4/2020
In this notebook, we try to reproduce Table 3 of the [original paper](https://arxiv.org/abs/1412.6806). A PyTorch implementation of one of the models, together with a training procedure for Google Colab, is available on [GitHub](https://github.com/StefOe/all-conv-pytorch). **We adopt that training procedure and refactor it into a Python class. We also build all the models from scratch in PyTorch.**

---

# Brief Introduction
The paper shows that replacing max-pooling with a convolutional layer of increased stride can improve performance. The authors support this claim by training models with max-pooling, models with max-pooling removed, and models in which max-pooling is replaced by convolutional layers. The results show that the models replacing max-pooling with strided convolutional layers generally perform better. We explain the details below. The authors tested 12 networks, built from 3 model bases with 4 variants each (the original model and 3 branches).

## *Base: Model A, Model B, and Model C*
Since the design of the convolutional layers influences performance, the authors test three model bases. **Model A** uses 5x5 convolutions. **Model B** uses 5x5 convolutions, each followed by one additional 1x1 convolutional layer. **Model C** uses pairs of 3x3 convolutional layers.

<img src='https://drive.google.com/uc?id=1HKDGWePX-PkBqRbeb8J_mwO-DDULhU2q' width="600px"/>
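
As a rough illustration of how the three bases differ, the sketch below builds only the first convolutional stage of each base in PyTorch. The helper name `first_stage`, the padding choice, and the 96 output channels (our reading of the architecture table in the paper) are assumptions made for this example. It is not the code in our `Model.py`.

```python
import torch
from torch import nn

def first_stage(base: str, c_in: int = 3, c_out: int = 96) -> nn.Sequential:
    """First convolutional stage of base 'A', 'B', or 'C' (illustrative helper)."""
    if base == "A":    # one 5x5 convolution
        specs = [(c_in, c_out, 5)]
    elif base == "B":  # a 5x5 convolution followed by a 1x1 convolution
        specs = [(c_in, c_out, 5), (c_out, c_out, 1)]
    elif base == "C":  # two stacked 3x3 convolutions
        specs = [(c_in, c_out, 3), (c_out, c_out, 3)]
    else:
        raise ValueError(f"unknown base: {base}")
    layers = []
    for i, o, k in specs:
        # padding=k // 2 keeps the spatial size unchanged (an assumption)
        layers += [nn.Conv2d(i, o, kernel_size=k, padding=k // 2), nn.ReLU()]
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 32, 32)  # one CIFAR-10-sized input
shapes = {base: tuple(first_stage(base)(x).shape) for base in "ABC"}
```

With this padding, all three stages map a 32x32 image to a 96-channel 32x32 feature map, so the bases differ only in how the convolution is composed.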


## *Branch: Model, Strided-CNN, ConvPool-CNN, and ALL-CNN*
Each model base has one original model and three branches. **“Model”** is the model with max-pooling. **“Strided-CNN”** removes max-pooling and instead increases the stride of the preceding convolutional layer. **“All-CNN”** replaces each max-pooling layer with a convolutional layer of stride 2. The better performance of “All-CNN” might simply result from it having more parameters than “Model” and “Strided-CNN”. To control for this, “ConvPool-CNN” is proposed. **“ConvPool-CNN”** keeps max-pooling and adds one more convolutional layer before each pooling layer, so it has the same number of parameters as “All-CNN”. Therefore, if “All-CNN” performs better than “ConvPool-CNN”, the improvement cannot be attributed to the additional parameters alone.
The following image shows the architectures for the base of Model C.

<img src='https://drive.google.com/uc?id=1gzpTwoW_Xx8YrHZdvU0ZmFT3n-1ktx1v' width="600px"/>
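
The parameter argument can be checked with a little arithmetic. The sketch below counts the parameters of the first stage of base C for each branch. The helper names and the 96-channel width (our reading of the paper's architecture table) are assumptions for illustration, and only one stage is counted rather than the full network.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def first_stage_params(branch, c_in=3, c=96):
    """Parameter count of the first stage of base C for one branch."""
    base = conv_params(c_in, c, 3) + conv_params(c, c, 3)
    extra_conv = conv_params(c, c, 3)
    # Max-pooling has no parameters, and raising a stride adds none either.
    return {
        "Model": base,                      # two convs + max-pooling
        "Strided-CNN": base,                # two convs, the second with stride 2
        "ConvPool-CNN": base + extra_conv,  # three convs + max-pooling
        "ALL-CNN": base + extra_conv,       # two convs + a stride-2 conv instead of pooling
    }[branch]

branches = ("Model", "Strided-CNN", "ConvPool-CNN", "ALL-CNN")
counts = {b: first_stage_params(b) for b in branches}
```

Under these assumptions “ConvPool-CNN” and “ALL-CNN” have identical counts, while both exceed “Model” and “Strided-CNN” by exactly one 3x3 convolution.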

---
# Experiment Setup 
All 12 networks are trained on CIFAR-10 using stochastic gradient descent with a fixed momentum of 0.9 for 350 epochs. The learning rate γ is chosen from the set {0.25, 0.1, 0.05, 0.01}. It is multiplied by a fixed factor of 0.1 at each of the epochs S = [200, 250, 300]. The paper only reports the best performance among all learning rates. Based on the source code, the performance is evaluated directly on the CIFAR-10 test set. In other words, **the source code did not use cross-validation for hyper-parameter tuning!**
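
The learning-rate schedule can be written as a small function. This is a minimal sketch, assuming epochs are counted from 0 and the drop takes effect at the start of each milestone epoch. The name `lr_at_epoch` is ours for illustration. PyTorch's `torch.optim.lr_scheduler.MultiStepLR` applies the same rule to an optimizer.

```python
def lr_at_epoch(base_lr, epoch, milestones=(200, 250, 300), factor=0.1):
    """Learning rate in effect at `epoch` (0-indexed)."""
    # Count how many milestones have already been reached.
    drops = sum(epoch >= m for m in milestones)
    return base_lr * factor ** drops

# Schedule for every candidate learning rate at a few selected epochs
grid = [0.25, 0.1, 0.05, 0.01]
schedule = {lr: [lr_at_epoch(lr, e) for e in (0, 200, 250, 300)] for lr in grid}
```

For example, a run started at γ = 0.25 trains at 0.25 until epoch 199, and each milestone then divides the rate by 10.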

---

# Reproduction Results


| Model      |  Error Rate of Paper  | Error Rate of Ours| 
|-|-|-|
|   Model A   | 12.47%|19.27%|
|   Strided-CNN-A  |13.46% |20.27%|
|   **ConvPool-CNN-A**  |**10.21%**|**15.46%**|
|   ALL-CNN-A  |10.30% |15.60%|
| | ||
|   Model B | 10.20%|**17.01%**|
|   Strided-CNN-B  | 10.98%|23.20%|
|   ConvPool-CNN-B  | 9.33%|18.22%|
|   ALL-CNN-B   | ***9.10%***|*29.48%|
| | ||
|   Model C  | 9.74%|**13.07%**|
|   Strided-CNN-C  | 10.19%|15.49%|
|   ConvPool-CNN-C  | 9.31%|14.39%|
|   ALL-CNN-C  | ***9.08%***|17.89%|



```python
# Example of the training procedure
# (the Training class is defined in Training.py, see the Appendix)
# Choose the base: Model A
training = Training(baseModel=[True, False, False])
# Create the dataset
training.createDataset()
# Choose the branch: All-CNN
training.modifiedModel = [False, False, False, True]
# Start training
training.Procedure()
```

# Cross-Validation
Since the source code did not use a validation set for hyper-parameter tuning, we conduct cross-validation in this section. We hold out 5% of the training data as the validation set. The results are shown in the following table. The test error differs little from the corresponding validation error. However, the test error with cross-validation is generally higher than the original test error. This is because we now select the best model based on the validation error and only then evaluate it on the test set, instead of evaluating directly on the test set as before.

| Model     |  Validation Set Error | Test Error| Original Test Error|
|-|-|-|-|
|   Model A   | 21.20% |20.45%|19.27%|
|   Strided-CNN-A  |21.72% |21.38%|20.27%|
|   **ConvPool-CNN-A**  |?|?|**15.46%**|
|   ALL-CNN-A  |? |?|15.60%|
| | ||
|   Model B | 16.65%|17.81%|**17.01%**|
|   Strided-CNN-B  | 17.53%|18.68%|23.20%|
|   ConvPool-CNN-B  | 17.53%|17.51%|18.22%|
|   ALL-CNN-B   | *24.53% | *25.78%|*29.48%|
| | ||
|   **Model C**  | **14.13%**|**14.87%**|**13.07%**|
|   Strided-CNN-C  | 20.89%|21.67%| 15.49%| 
|   ConvPool-CNN-C  | 17.81%|17.60%| 14.39%|
|   ALL-CNN-C  | ?|?| 17.89%|

```python
# Cross-validation can be enabled by setting validation to True
training = Training(validation=True, bestModel_allLR=True, baseModel=[True, False, False])
```
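
The hold-out split and the model-selection rule can be sketched in plain Python. This is not the code in `Training.py`. The function names, the seed, and the error values in `history` are made up purely to show the mechanism, and the only facts used are the 5% split and the 50,000 CIFAR-10 training images.

```python
import random

def split_indices(n_train=50_000, val_fraction=0.05, seed=0):
    """Shuffle the training indices and hold out a fraction for validation."""
    rng = random.Random(seed)
    idx = list(range(n_train))
    rng.shuffle(idx)
    n_val = round(n_train * val_fraction)
    return idx[n_val:], idx[:n_val]

def select_checkpoint(history):
    """Return the epoch record with the lowest validation error."""
    return min(history, key=lambda rec: rec["val_err"])

train_idx, val_idx = split_indices()

# Made-up numbers, only to show the selection rule
history = [
    {"epoch": 1, "val_err": 0.30, "test_err": 0.31},
    {"epoch": 2, "val_err": 0.22, "test_err": 0.24},
    {"epoch": 3, "val_err": 0.23, "test_err": 0.21},
]
best = select_checkpoint(history)
```

In this toy history the checkpoint chosen on validation error is epoch 2, whose test error is higher than the lowest test error in the list. Selecting directly on the test set would report the lower number, which is the effect described above.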

# DropOut and Batch Normalization
Dropout is a simple method to prevent neural networks from overfitting. In the All Convolutional Net paper, the authors state that dropout was used to regularize all networks.
Dropout was almost essential in state-of-the-art networks before the introduction of batch normalization (BN) (Ioffe & Szegedy, 2015). Since then, BN has proven effective and practical in recent neural networks. However, there is evidence that when the two techniques are combined in one network, its performance actually becomes worse (Li et al., 2019). In our study, we investigate BN and Dropout independently as well as the effect of using both simultaneously.



### BatchNorm *only*

```python
model = Model(dropOut=False, BN=True)
```

<img src='https://drive.google.com/uc?id=17zb3ZUMTgRVLyRa1HYKm1SXqd-M_b6ai' width="500px"/>

### BatchNorm with Dropout

```python
model = Model(dropOut=True, BN=True)
```

<img src='https://drive.google.com/uc?id=11rOzypoJjhbfdQsD6KND2SyqtLfOuh2n' width="500px"/>

### Dropout *only*

```python
model = Model(dropOut=True, BN=False)
```

<img src='https://drive.google.com/uc?id=1eVqvPxWPCAovkP3lNT1G_odgyisbAlLr' width="500px"/>

### Without using BatchNorm or Dropout

```python
model = Model(dropOut=False, BN=False)
```

<img src='https://drive.google.com/uc?id=1uzW63lONO39_Sy9JGhzLc6Qqc976qe00' width="500px"/>

We compared the different combinations of these two techniques and summarize the results in the table below:

| Model     |  BN only | BN + Dropout |Dropout only| No BN, no Dropout |
|-|-|-|-|-|
|   Model A   | did not converge | did not converge | 19.27% | 14% |



As shown in the table above, we used the base Model A to study Batch Normalization (BN) and Dropout and whether they increase or decrease model performance in this case. We placed the BN layer between two convolutional layers, right before the ReLU activation. Dropout is also applied between convolutional layers, after the pooling layer.
The original paper states that only Dropout was used. We found that using BN without Dropout, or combining BN with Dropout, prevents our model from converging. Li et al. (2019) state that the worse performance of the combination may be the result of variance shift. Using Dropout only does let the model converge, giving an error rate of 19.27%, but this is still worse than the 14% obtained without BN or Dropout. A possible reason is that we did not have time to tune the dropout rate; we only used the values from the original paper: 20% for dropping out inputs and 50% otherwise.
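
The layer ordering described above can be sketched as follows. This is a simplified two-stage stand-in, not our `Model` class. The helper names `conv_stage` and `small_net`, the channel widths, and the padding are assumptions, while the `dropOut` and `BN` flags merely mirror the arguments used in the cells above.

```python
import torch
from torch import nn

def conv_stage(c_in, c_out, kernel=5, dropOut=True, BN=False, p=0.5):
    """Conv -> (BN) -> ReLU -> max-pool -> (Dropout)."""
    layers = [nn.Conv2d(c_in, c_out, kernel, padding=kernel // 2)]
    if BN:
        layers.append(nn.BatchNorm2d(c_out))  # normalise right before the ReLU
    layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
    if dropOut:
        layers.append(nn.Dropout(p))  # 50% dropout after the pooling layer
    return nn.Sequential(*layers)

def small_net(dropOut=True, BN=False):
    """Two stages, with 20% dropout on the input when dropOut is set."""
    layers = [nn.Dropout(0.2)] if dropOut else []
    layers += [conv_stage(3, 96, dropOut=dropOut, BN=BN),
               conv_stage(96, 192, dropOut=dropOut, BN=BN)]
    return nn.Sequential(*layers)

net = small_net(dropOut=True, BN=True).eval()  # eval() disables dropout
out = net(torch.randn(2, 3, 32, 32))
```

Switching the two flags reproduces the four configurations compared in the table, at least structurally.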

# Optimizer
The default optimizer used in the paper and in our reproduction is Stochastic Gradient Descent (SGD) with momentum. Since Adaptive Moment Estimation (Adam) is one of the most popular optimization algorithms, we also experimented with replacing SGD with Adam in our model. However, the model did not converge with Adam under the specific setting that we tried to reproduce. So even though Adam in theory combines the advantages of two SGD variants, RMSProp and AdaGrad, it did not reliably converge to a good solution in our setting.




Default SGD optimizer:

```python
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
```
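
For comparison, swapping the optimizer is a one-line change in PyTorch. The sketch below uses a hypothetical `make_optimizer` helper and a single convolution as a stand-in model. The Adam learning rate of `1e-3` is simply the PyTorch default and is a placeholder, not the value from our runs.

```python
import torch
from torch import nn, optim

def make_optimizer(model, name="sgd", lr=0.01):
    """Build SGD with momentum 0.9 or Adam for the given model (illustrative)."""
    if name == "sgd":
        return optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    if name == "adam":
        return optim.Adam(model.parameters(), lr=lr)
    raise ValueError(f"unknown optimizer: {name}")

model = nn.Conv2d(3, 96, kernel_size=5)  # stand-in for the real network
sgd = make_optimizer(model, "sgd", lr=0.01)
adam = make_optimizer(model, "adam", lr=1e-3)  # placeholder learning rate
```

Adam is often run with smaller step sizes than SGD, so the learning-rate grid used for SGD does not necessarily carry over to it.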


# Discussion


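# References

Springenberg, Jost Tobias, et al. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).


Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." International Conference on Machine Learning (2015).


Li, Xiang, et al. "Understanding the disharmony between dropout and batch normalization by variance shift." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019).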



# Appendix


*   [Github Repository](https://github.com/wtsyang/dl-reproducibility-project)
*   [A Python Class to build all Models](https://github.com/wtsyang/dl-reproducibility-project/blob/master/Model.py)
*   [A Python Class for Training Procedure](https://github.com/wtsyang/dl-reproducibility-project/blob/master/Training.py)
*   [The Notebooks for Model A](https://github.com/wtsyang/dl-reproducibility-project/tree/master/modelA)
*   [The Notebooks for Model B](https://github.com/wtsyang/dl-reproducibility-project/tree/master/modelB)
*   [The Notebooks for Model C](https://github.com/wtsyang/dl-reproducibility-project/tree/master/modelC)
*   [The Notebooks for Cross Validation](https://github.com/wtsyang/dl-reproducibility-project/tree/master/crossValidation)
*   [The Notebooks for DropOut and Batch Normalization](https://github.com/wtsyang/dl-reproducibility-project/tree/master/BN_Dropout)









