<a href="https://colab.research.google.com/github/wtsyang/dl-reproducibility-project/blob/master/Blog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Striving for Simplicity: The All Convolutional Net
###### *Wei-Tse Yang, Sunwei Wang, and Qingyuan Cao*
###### 14/4/2020
In this notebook, we try to reproduce Table 3 of the [original paper](https://arxiv.org/abs/1412.6806). The PyTorch source code for one of the models, together with a training procedure for Google Colab, can be found on [GitHub](https://github.com/StefOe/all-conv-pytorch). **We adopt that training procedure and wrap it in a Python class. We also build the models from scratch in PyTorch.**


---


# Brief Introduction
The paper shows that replacing max-pooling with a convolutional layer with increased stride can improve performance. The authors demonstrate this by training models with max-pooling, models with max-pooling removed, and models with max-pooling replaced by convolutional layers. The results show that the models that replace max-pooling with strided convolutional layers generally perform better.

## *Model, Strided-CNN, ConvPool-CNN, and ALL-CNN?*
Each model category has one original model and three branches. **“Model”** is the model with max-pooling. **“Strided-CNN”** is the model with max-pooling removed (the stride of the preceding convolutional layer is increased instead). **“All-CNN”** is the model with max-pooling replaced by a strided convolutional layer. The better performance of “All-CNN” might simply result from its having more parameters than “Model” and “Strided-CNN”. To control for this, “ConvPool-CNN” is proposed. **“ConvPool-CNN”** is the model with max-pooling and one more convolutional layer before the pooling. “ConvPool-CNN” should have the same number of parameters as “All-CNN”. Therefore, if “All-CNN” performs better than “ConvPool-CNN”, we can conclude that the improvement of “All-CNN” does not come from the additional parameters.
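
The parameter argument can be checked with a little arithmetic. The sketch below counts the extra trainable parameters that each branch adds around one pooling stage. The channel count of 96 and the 3x3 kernel are assumptions made for illustration, and `conv_params` is a hypothetical helper, not part of our code base.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of trainable parameters in a 2-D convolutional layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Assumption: 96 channels and 3x3 kernels around the pooling stage.
CHANNELS = 96

extra_params = {
    "Model": 0,        # max-pooling has no trainable parameters
    "Strided-CNN": 0,  # pooling removed; an existing conv only gets a larger stride
    "ConvPool-CNN": conv_params(3, CHANNELS, CHANNELS),  # extra conv before the pooling
    "All-CNN": conv_params(3, CHANNELS, CHANNELS),       # strided conv replaces the pooling
}

for name, count in extra_params.items():
    print(f"{name:>12}: {count} extra parameters")
```

Under these assumptions “ConvPool-CNN” and “All-CNN” add the same number of parameters per pooling stage, which is what makes the comparison between them fair.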

## *Model A, Model B, Model C?*
Since the design of the convolutional layers influences performance, the authors test three base models. Model A uses convolutional layers with 5x5 kernels. Model B uses 5x5 kernels as well, but adds a convolutional layer with 1x1 kernels after each of them. Model C uses two convolutional layers with 3x3 kernels.
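
One way to compare the three bases is through their receptive fields. The sketch below reads the 5x5, 1x1, and 3x3 sizes above as kernel sizes and assumes stride-1 convolutions; `receptive_field` and `first_block` are hypothetical names used only for this illustration.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the field by (kernel size - 1)
    return rf


# Kernel sizes of the first convolutional block of each base model.
first_block = {
    "A": [5],     # one 5x5 convolution
    "B": [5, 1],  # 5x5 convolution followed by a 1x1 convolution
    "C": [3, 3],  # two stacked 3x3 convolutions
}

for name, kernels in first_block.items():
    print(f"Model {name}: kernels {kernels}, receptive field {receptive_field(kernels)}")
```

All three blocks see the same input region, so the bases differ in depth and parameter count rather than in how much of the image each unit covers.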

---


![Comparison between the base and derived models on the CIFAR-10 dataset.](img/table1.png)


# Experiment Setup 

Three base models, each with the original model and three branches, give 12 models in total. These models are trained on CIFAR-10 using stochastic gradient descent with a fixed momentum of 0.9 for 350 epochs. The learning rate $\gamma$ is chosen from the set $\{0.25, 0.1, 0.05, 0.01\}$ and is multiplied by a fixed factor of 0.1 at the epochs $S = [200, 250, 300]$. The paper only reports the best performance among all learning rates. Based on the source code, performance is evaluated directly on the CIFAR-10 test set. No cross-validation is conducted to select the hyperparameters!
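
The setup above can be written down as a small plain-Python sketch. The constant names and `lr_at_epoch` are hypothetical, and treating epochs as 0-indexed (so the first drop takes effect in epoch 200) is an assumption. In PyTorch the same schedule is usually expressed with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200, 250, 300], gamma=0.1)`.

```python
from itertools import product

BASES = ["A", "B", "C"]
BRANCHES = ["Model", "Strided-CNN", "ConvPool-CNN", "All-CNN"]
LEARNING_RATES = [0.25, 0.1, 0.05, 0.01]
MILESTONES = [200, 250, 300]  # epochs at which the learning rate is reduced
DECAY = 0.1                   # multiplicative factor applied at each milestone
EPOCHS = 350


def lr_at_epoch(base_lr, epoch):
    """Learning rate in effect during `epoch` (0-indexed, an assumption)."""
    drops = sum(1 for m in MILESTONES if epoch >= m)
    return base_lr * DECAY ** drops


models = list(product(BASES, BRANCHES))
runs = list(product(models, LEARNING_RATES))
print(len(models), "models,", len(runs), "training runs")

for epoch in (0, 199, 200, 250, 300, EPOCHS - 1):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(0.25, epoch):.6f}")
```

Counting the runs this way also makes the cost of the full grid explicit: every model is trained once per candidate learning rate.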


## Results 

A Table

In [0]:
# Copy the training chunk here?

# Cross-Validation

In [0]:
# Copy the training chunk here?

# DropOut and Batch Normalization
Dropout is a simple method for preventing neural networks from overfitting. In the All Convolutional Net paper, the authors state that dropout was used to regularize all networks. 
Dropout was almost essential in state-of-the-art networks before the introduction of batch normalization (BN). Since its introduction, BN has proven effective and practical in recent neural networks. However, there is evidence that when the two techniques are combined in one network, the performance of the network actually becomes worse (Ioffe & Szegedy, 2015). In our study, we investigate the results of using BN and Dropout independently, as well as the effect of using both simultaneously.


In [0]:
model = Model(dropOut=False, BN=True)
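
The four combinations we want to compare can be enumerated as keyword arguments. This is only a sketch: it assumes that `Model` accepts the boolean flags `dropOut` and `BN` exactly as in the call above, and `configs` is a hypothetical name.

```python
from itertools import product

# Every on/off combination of Dropout and Batch Normalization.
configs = [
    {"dropOut": drop_out, "BN": bn}
    for drop_out, bn in product([True, False], repeat=2)
]

for kwargs in configs:
    # Each dict could be forwarded to the constructor as Model(**kwargs).
    print(kwargs)
```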


![Model with BN without Dropout](img/model_BN_withoutDropout.png)

# Optimizer
The default optimizer used in the paper and in our reproduction is Stochastic Gradient Descent (SGD) with momentum. We also experimented with a different optimizer: since Adaptive Moment Estimation (Adam) is one of the most popular optimization algorithms, we ran Adam in place of SGD in our model. However, the model did not converge with Adam under the specific setting that we tried to reproduce. So even though, in theory, Adam combines the advantages of two SGD variants, RMSProp and AdaGrad, it did not reliably converge to a good solution in our setting.

Default SGD optimizer:

In [0]:
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
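
To make the update concrete, here is a plain-Python sketch of one SGD-with-momentum step for a single scalar parameter. It follows the update rule documented for `torch.optim.SGD` with zero dampening and no Nesterov momentum. The function name `sgd_momentum_step` and the toy objective $f(x) = x^2$ are assumptions made for illustration and are unrelated to our CIFAR-10 training.

```python
def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter."""
    velocity = momentum * velocity + grad  # accumulate past gradients
    param = param - lr * velocity          # move against the accumulated direction
    return param, velocity


# Toy problem (an assumption for illustration): minimise f(x) = x**2.
x, v = 1.0, 0.0
history = []
for _ in range(500):
    grad = 2.0 * x  # derivative of x**2
    x, v = sgd_momentum_step(x, grad, v, lr=0.1)
    history.append(x)

print("after 1 step:", history[0])
print("after 2 steps:", history[1])
print("after 500 steps:", history[-1])
```

The velocity term is the only difference from plain SGD: with `momentum=0` the function reduces to `param - lr * grad`.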


# Softmax?

In [0]:
# Copy the training chunk here?

# Appendix


*   [Github Repository](https://github.com/wtsyang/dl-reproducibility-project)
*   [A Python Class to build all Models](https://github.com/wtsyang/dl-reproducibility-project/blob/master/Model.py)
*   [A Python Class for Training Procedure](https://github.com/wtsyang/dl-reproducibility-project/blob/master/Training.py)
*   [The Notebook for Model A]()







