https://arxiv.org/abs/1608.06993

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
from fastai.conv_learner import *
from fastai.plots import *
from utils import *

In [3]:
print_info()

Last run on: 2018-03-29
PyTorch version: 0.3.1.post2
fastai version: 0.6


In [4]:
PATH = 'data/'
get_cifar10(PATH)

In [5]:
stats = (np.array([ 0.4914 ,  0.48216,  0.44653]), np.array([ 0.24703,  0.24349,  0.26159]))

In [6]:
create_sample(os.path.join(PATH, 'cifar10/train'), 0.05)
create_sample(os.path.join(PATH, 'cifar10/test'), 0.05)

The authors trained on the entire training set, so I will try to mimic their approach. I imagine they didn't use the test set to monitor progress, but since I am only trying to reimplement their architecture with the hyperparameters they provide, I think it is fine for me to do so.

In [7]:
def get_data(sz, bs, sample=False):
    tfms = tfms_from_stats(stats, sz, aug_tfms=[RandomFlip()], pad=sz//8, pad_mode=cv2.BORDER_CONSTANT)
    return ImageClassifierData.from_paths(
        f'{PATH}cifar10/',
        trn_name='train' if not sample else 'train_sample',
        val_name='test' if not sample else 'test_sample',
        tfms=tfms,
        bs=bs,
        num_workers=12
    )

In [8]:
bs = 64
sample = False
sz = 32

In [9]:
data = get_data(sz, bs, sample)

A nice reference implementation in PyTorch by the authors can be found here: https://github.com/gpleiss/efficient_densenet_pytorch

## Small DenseNet

Based on the code and information I found, I was unable to work out the exact architecture of a DenseNet with 40 layers and a growth rate of 12. I am guessing it consists of 3 blocks with 12 layers each, but I might be wrong.

Nonetheless, the plan is to implement something resembling a DenseNet, train it, and see if the results I am getting are in the correct ballpark.
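
The guess about layers per block can be checked with a little arithmetic. This is a minimal sketch, assuming the depth convention I believe the authors' released code uses: the depth counts the initial convolution, the two transition convolutions and the final linear layer (4 layers in total), and the rest is split evenly over 3 blocks. With bottlenecks, each dense layer holds two convolutions, so the count per block is halved. The function name `layers_per_block` is made up for this illustration.

```python
def layers_per_block(depth, bottleneck=False):
    """Dense layers per block for a 3-block DenseNet of the given depth.

    Assumption: depth = 3 * (convs per block) + 4, where the 4 extra layers
    are the initial conv, two transition convs and the final linear layer.
    """
    n = (depth - 4) // 3
    # a bottleneck layer contains a 1x1 and a 3x3 conv, so it counts twice
    return n // 2 if bottleneck else n


print(layers_per_block(40))                    # plain layers per block
print(layers_per_block(40, bottleneck=True))   # bottleneck layers per block
```

Under this assumption a 40-layer network has 12 plain layers per block, or 6 bottleneck layers per block. By the same counting, 12 bottleneck layers per block would correspond to a depth of 3 * 12 * 2 + 4 = 76.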

In [10]:
def get_param_count(m):
    return np.sum([o.numel() for o in m.parameters()])

In [11]:
class DenseLayer(nn.Sequential):
    def __init__(self, ni, gr):
        ''' gr: growth rate '''
        super().__init__()
        self.add_module('bn0', nn.BatchNorm2d(ni))
        self.add_module('relu0', nn.ReLU(inplace=True))
        # the 4x width is what one of the implementations I looked at uses; the paper's bottleneck layers also output 4*gr feature maps
        self.add_module('conv0', nn.Conv2d(ni, 4*gr, 1, padding=0, bias=False)) 
        self.add_module('bn1', nn.BatchNorm2d(4*gr))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv1', nn.Conv2d(4*gr, gr, 3, padding=1, bias=False))
        
    def forward(self, x):
        new_f = super().forward(x)
        return torch.cat([x, new_f], 1)

In [12]:
class DenseBlock(nn.Sequential):
    def __init__(self, ni, gr, nl):
        super().__init__()
        for i in range(nl):
            self.add_module(f'dense{i}', DenseLayer(ni + gr * i, gr))

In [13]:
class Transition(nn.Sequential):
    def __init__(self, ni, comp=0.5):
        super().__init__()
        self.add_module('bn', nn.BatchNorm2d(ni))
        self.add_module('relu', nn.ReLU(inplace=True))
        self.add_module('conv', nn.Conv2d(ni, int(ni*comp), 1, bias=False))
        self.add_module('avg_pool', nn.AvgPool2d(2, 2))

In [14]:
def ni_per_block(i, no, nl, gr, comp=0.5):
    ni = no
    for _ in range(i):  # avoid shadowing the block index argument
        ni += nl * gr
        ni = int(comp * ni)
    return ni
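
To make the channel bookkeeping concrete, here is a small worked example of `ni_per_block` for the configuration used in this notebook (16 initial features, 12 layers per block, growth rate 12, compression 0.5). The function is repeated so that the snippet runs on its own; the variable names outside it are just for illustration.

```python
def ni_per_block(i, no, nl, gr, comp=0.5):
    """Number of input channels of dense block `i` (0-based)."""
    ni = no
    for _ in range(i):
        ni += nl * gr        # each of the nl layers appends gr feature maps
        ni = int(comp * ni)  # the transition then compresses the channels
    return ni


no, nl, gr = 16, 12, 12
block_inputs = [ni_per_block(i, no, nl, gr) for i in range(3)]
final_features = block_inputs[-1] + nl * gr

print(block_inputs)     # [16, 80, 112]
print(final_features)   # 256
```

So the three blocks see 16, 80 and 112 input channels, and 256 features reach the final batch norm and classifier.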

In [16]:
class DenseNet40_12(nn.Module):
    def __init__(self, c):
        super().__init__()
        no = 16 # count of output features from the initial convolution
        nb = 3 # count of blocks
        nl = 12 # count of layers per block
        gr = 12 # growth rate, amount of features output by each layer
        comp=0.5
        
        self.conv0 = nn.Conv2d(3, no, 3, padding=1, bias=False)
        
        self.conv_blocks = nn.ModuleList()
        for i in range(nb):
            self.conv_blocks.add_module(f'block_{i}', DenseBlock(ni_per_block(i, no, nl, gr, comp), gr, nl))
            
        
        self.trans = nn.ModuleList([
            Transition(ni_per_block(i, no, nl, gr, comp) + nl * gr) for i in range(nb-1)
        ])
        
        n_f_final = ni_per_block(nb-1, no, nl, gr, comp) + nl * gr
        self.bn_final = nn.BatchNorm2d(n_f_final)
        self.classifier = nn.Linear(n_f_final, c)
        
    def forward(self, x):
        x = self.conv0(x)
        
        for i, b in enumerate(self.conv_blocks):
            if i != 0: x = self.trans[i-1](x)
            x = b(x)
            
        x = self.bn_final(x)
      
        x = x.view(x.shape[0], x.shape[1], -1).mean(2)
        x = self.classifier(x)
        
        return F.log_softmax(x, dim=1)

In [17]:
learn = ConvLearner.from_model_data(DenseNet40_12(10), data, opt_fn=SGD_Momentum(0.9))

In [18]:
get_param_count(learn.model)

475850

The paper reports that the DenseNet I tried to recreate has 1 million parameters, roughly twice what I get. I was unable to pinpoint in the paper what could be the reason for such a big difference in parameter count.
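
As a sanity check on the number above, the parameter count can be reproduced with plain arithmetic, without building the model. This is a sketch that mirrors the layer definitions in this notebook (bias-free convolutions, BatchNorm with a weight and a bias per channel, a linear classifier with bias); the helper names are made up for this illustration.

```python
def dense_layer_params(ni, gr):
    bn0 = 2 * ni                  # BatchNorm weight + bias
    conv0 = ni * 4 * gr           # 1x1 bottleneck conv, no bias
    bn1 = 2 * 4 * gr
    conv1 = 4 * gr * gr * 3 * 3   # 3x3 conv, no bias
    return bn0 + conv0 + bn1 + conv1


def transition_params(ni, comp):
    return 2 * ni + ni * int(ni * comp)   # BatchNorm + 1x1 conv, no bias


def densenet_params(c, no=16, nb=3, nl=12, gr=12, comp=0.5):
    total = 3 * no * 3 * 3        # initial 3x3 conv on RGB input
    ni = no
    for b in range(nb):
        for l in range(nl):
            total += dense_layer_params(ni + gr * l, gr)
        ni += nl * gr             # channels at the end of the block
        if b < nb - 1:            # no transition after the last block
            total += transition_params(ni, comp)
            ni = int(ni * comp)
    total += 2 * ni               # final BatchNorm
    total += ni * c + c           # linear classifier with bias
    return total


print(densenet_params(10))   # 475850
```

The result matches `get_param_count`, so the gap to the paper's figure comes from the architecture itself rather than from a counting mistake.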

In [19]:
wds = 10e-4 # i.e. 1e-3; the paper writes the wd as 10^-4, so the intended value is probably 1e-4
            # (it references the training of ResNets, where 1e-4 was used)

The paper suggests training for 300 epochs, starting with lr 0.1 and dividing it by 10 at 50% and 75% of training. But as I have a smaller model, I think I should be okay training for fewer epochs.
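
The schedule described above can be written as a small function. This is only a sketch of the step decay; `step_lr` and its arguments are names made up for this illustration, not part of fastai or PyTorch.

```python
from collections import Counter


def step_lr(epoch, n_epochs, base_lr=0.1, milestones=(0.5, 0.75), factor=0.1):
    """Learning rate for a 0-based epoch under step decay."""
    lr = base_lr
    for m in milestones:
        if epoch >= int(m * n_epochs):   # past this fraction of training
            lr *= factor
    return lr


# number of epochs spent at each learning rate for a 120-epoch run
n_epochs = 120
counts = Counter(round(step_lr(e, n_epochs), 6) for e in range(n_epochs))
print(counts)
```

For 120 epochs this gives 60 epochs at 0.1, 30 at 0.01 and 30 at 0.001, which is the split used in the three `learn.fit` calls that follow.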

In [20]:
%%time
learn.fit(1e-1, 60, wds=wds)

epoch      trn_loss   val_loss   accuracy                   
    0      1.311386   1.274939   0.532643  
    1      1.037473   1.401766   0.523587                   
    2      0.92245    1.369378   0.591162                    
    3      0.880907   1.360747   0.581111                    
    4      0.804627   1.027813   0.656847                    
    5      0.800998   1.177325   0.62918                     
    6      0.766326   1.048657   0.648189                    
    7      0.752539   1.077507   0.650179                    
    8      0.76203    0.913143   0.692178                    
    9      0.729287   1.052956   0.65824                     
    10     0.772939   1.009882   0.653463                    
    11     0.761346   0.918471   0.698846                    
    12     0.742009   1.012124   0.660231                    
    13     0.759809   1.062543   0.656051                    
    14     0.749707   1.032562   0.657245                    
    15     0.754506   1.0021

[1.1993911, 0.6039012738853503]

In [21]:
%%time
learn.fit(1e-2, 30, wds=wds)

epoch      trn_loss   val_loss   accuracy                    
    0      0.447679   0.461868   0.841063  
    1      0.419699   0.423838   0.855494                    
    2      0.403035   0.469202   0.839371                    
    3      0.374985   0.425033   0.8542                      
    4      0.374666   0.438647   0.850318                    
    5      0.404271   0.459892   0.846935                    
    6      0.384292   0.450538   0.842058                    
    7      0.381985   0.456771   0.845442                    
    8      0.386762   0.442477   0.846338                    
    9      0.372147   0.502554   0.830016                    
    10     0.365654   0.580266   0.807325                    
    11     0.394753   0.498774   0.833798                    
    12     0.395801   0.451188   0.84365                     
    13     0.377641   0.507878   0.829319                    
    14     0.362975   0.534201   0.82295                     
    15     0.359549   0.43

[0.4226373, 0.8564888535031847]

In [22]:
%%time
learn.fit(1e-3, 30, wds=wds)

epoch      trn_loss   val_loss   accuracy                    
    0      0.224567   0.291461   0.898985  
    1      0.217853   0.284613   0.900478                    
    2      0.19508    0.280558   0.902568                    
    3      0.190859   0.278302   0.901971                    
    4      0.181993   0.280493   0.906051                    
    5      0.187025   0.279654   0.904359                    
    6      0.176685   0.281269   0.904857                    
    7      0.189919   0.280698   0.904558                    
    8      0.17278    0.282082   0.905852                    
    9      0.1589     0.271804   0.908639                    
    10     0.157584   0.276442   0.908141                    
    11     0.161629   0.277764   0.907942                    
    12     0.155887   0.280883   0.906648                    
    13     0.148017   0.283008   0.905852                    
    14     0.150339   0.287987   0.901971                    
    15     0.161001   0.29


    22     0.133391   0.289553   0.905752                    
    23     0.142564   0.285936   0.906449                    
    24     0.130048   0.291292   0.904857                    
    25     0.127702   0.293783   0.905255                    
    26     0.137578   0.29019    0.907146                    
    27     0.141066   0.293704   0.903762                    
    28     0.134391   0.300314   0.899881                    
    29     0.123537   0.290525   0.90426                     

CPU times: user 35min 33s, sys: 6min 13s, total: 41min 47s
Wall time: 33min 9s


[0.29052508, 0.9042595541401274]

In [30]:
accuracy_np(*learn.predict_with_targs())

0.9038

In [31]:
log_preds,y = learn.TTA(True)
preds = np.mean(np.exp(log_preds),0)
accuracy_np(preds,y)

                                             

0.9111

The network is much smaller than the one described in the paper, and there may be other differences that I have missed.

Just to experiment more with the architecture, I will try to train a bigger model. I might also add dropout as we seem to be overfitting.

In [15]:
class CustomDenseNet(nn.Module):
    def __init__(self, c, no=16, nb=3, nl=12, gr=12, comp=0.5):
        super().__init__()
        
        self.conv0 = nn.Conv2d(3, no, 3, padding=1, bias=False)
        
        self.conv_blocks = nn.ModuleList()
        for i in range(nb):
            self.conv_blocks.add_module(f'block_{i}', DenseBlock(ni_per_block(i, no, nl, gr, comp), gr, nl))
            
        
        self.trans = nn.ModuleList([
            Transition(ni_per_block(i, no, nl, gr, comp) + nl * gr) for i in range(nb-1)
        ])
        
        n_f_final = ni_per_block(nb-1, no, nl, gr, comp) + nl * gr
        self.bn_final = nn.BatchNorm2d(n_f_final)
        self.classifier = nn.Linear(n_f_final, c)
        
    def forward(self, x):
        x = self.conv0(x)
        
        for i, b in enumerate(self.conv_blocks):
            if i != 0: x = self.trans[i-1](x)
            x = b(x)
            
        x = self.bn_final(x)
      
        x = x.view(x.shape[0], x.shape[1], -1).mean(2)
        x = self.classifier(x)
        
        return F.log_softmax(x, dim=1)

In [16]:
learn = ConvLearner.from_model_data(CustomDenseNet(10, 24, 4, 15, 12), data, opt_fn=SGD_Momentum(0.9))

In [17]:
get_param_count(learn.model)

1007332

In [18]:
wds=1e-3

In [19]:
%%time
%%capture
learn.fit(1e-1, 150, wds=wds)

In [20]:
%%time
%%capture
learn.fit(1e-2, 75, wds=wds)

In [21]:
%%time
%%capture
learn.fit(1e-3, 75, wds=wds)

In [22]:
accuracy_np(*learn.predict_with_targs())

0.8917

In [23]:
learn.save('densenet_cifar_fully_trained')

In [24]:
%%time
learn.fit(1e-3, 1, wds=wds)

epoch      trn_loss   val_loss   accuracy                    
    0      0.163783   0.367921   0.888834  

CPU times: user 1min 33s, sys: 9.88 s, total: 1min 43s
Wall time: 1min 28s


[0.36792144, 0.8888335987261147]

In [27]:
preds, targs = predict_with_targs(learn.model, learn.data.trn_dl)

In [28]:
accuracy_np(preds, targs)

0.95372

I am without a doubt overfitting severely here. The model I was trying to recreate reportedly achieves an error rate of 5.24% with data augmentation (which we do here).
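
To put numbers on the overfitting, the accuracies measured above can be converted to error rates and compared with the paper's figure. This is a small arithmetic sketch using only values already shown in this notebook; the variable names are made up for this illustration.

```python
def error_rate(acc):
    """Error rate in percent for an accuracy given as a fraction."""
    return 100.0 * (1.0 - acc)


train_err = error_rate(0.95372)   # accuracy on the training set
test_err = error_rate(0.8917)     # accuracy on the test set
paper_err = 5.24                  # error rate reported in the paper

print(f'train error: {train_err:.2f}%')
print(f'test error:  {test_err:.2f}%')
print(f'train/test gap: {test_err - train_err:.2f} points')
print(f'gap to paper:   {test_err - paper_err:.2f} points')
```

The test error is roughly twice the paper's figure, while the training error is already below it.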

Going forward there are a couple of things that could be attempted here:
* logging the validation accuracy for each epoch and seeing if at any point we get close to the error rate from the paper
* checking the model architecture again against the paper (not sure that would get me far though)
* adding dropout (despite it being stated in the paper that dropout was not used when training with data augmentation)
* create a training schedule that allows for much quicker training using the ideas from the Leslie Smith [paper](https://arxiv.org/abs/1803.09820) (this sounds like a very fun and educational project!)
* implement the extreme bottlenecking and depth (100 layers with 0.8 million params!)

I reread the paper and now I think that it doesn't use the 1x1 convolution layers! This might mean that more information is retained, which could improve results (going so many times from 200+ channels to 48 sounds a bit crazy). I do not know what this will do to the parameter count, but I think this is the next thing I should explore if I continue working on this.

It is quite interesting, though, that this is not the architecture implemented in the PyTorch repo by the authors of the paper.
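
On the parameter-count question, a rough estimate for a variant without the 1x1 bottleneck convolutions can be worked out by hand. This is a sketch under my own assumptions, not a confirmed description of the paper's model: each dense layer is BatchNorm, ReLU and a bias-free 3x3 convolution producing `gr` feature maps; transitions keep the channel count (no compression); the initial convolution outputs 16 features. The function name is made up for this illustration.

```python
def basic_densenet_params(c=10, no=16, nb=3, nl=12, gr=12):
    """Parameter count of a DenseNet without bottlenecks or compression."""
    total = 3 * no * 3 * 3                     # initial 3x3 conv on RGB input
    ni = no
    for b in range(nb):
        for _ in range(nl):
            total += 2 * ni + ni * gr * 3 * 3  # BatchNorm + 3x3 conv
            ni += gr                           # concatenation grows channels
        if b < nb - 1:
            total += 2 * ni + ni * ni          # transition: BatchNorm + 1x1 conv
    total += 2 * ni                            # final BatchNorm
    total += ni * c + c                        # linear classifier with bias
    return total


print(basic_densenet_params())   # 1019722
```

Under these assumptions the count comes out at about 1.02 million, which is close to the 1 million parameters reported in the paper.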