I was looking at the bytenet code and noticed that the implementation differs from the "Neural Machine Translation in Linear Time" paper.
From my understanding of the paper, each atrous (dilated) conv layer is wrapped in its own ResnetBlock, e.g.:
single repeat = [Resnet(conv layer, dilated=1), Resnet(conv layer, dilated=2), Resnet(conv layer, dilated=4)]
while in the current implementation it looks like a single ResnetBlock is composed of several dilated conv layers, e.g.:
single repeat = Resnet([(conv layer, dilated=1), (conv layer, dilated=4), (conv layer, dilated=8)])
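To make sure I'm describing the difference correctly, here is a minimal PyTorch-style sketch of the two arrangements as I understand them. The class names, channel sizes, kernel sizes, and normalization choice are my own placeholders for illustration, not taken from this repo or from the paper:

```python
import torch
import torch.nn as nn

class ResidualDilatedConv(nn.Module):
    """Paper reading: one dilated conv wrapped in its own residual block."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):
        # skip connection around a single dilated conv
        return x + torch.relu(self.norm(self.conv(x)))

class ResidualStack(nn.Module):
    """Implementation reading: several dilated convs inside one block,
    with a single skip connection around the whole stack."""
    def __init__(self, channels, dilations):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x):
        out = x
        for conv in self.convs:
            out = torch.relu(conv(out))
        # one skip connection spanning all the dilated convs
        return x + out

# Paper: single repeat = [Resnet(dilated=1), Resnet(dilated=2), Resnet(dilated=4)]
paper_repeat = nn.Sequential(
    ResidualDilatedConv(64, dilation=1),
    ResidualDilatedConv(64, dilation=2),
    ResidualDilatedConv(64, dilation=4),
)

# Implementation (as I read it): single repeat = Resnet([dilated=1, 4, 8])
impl_repeat = ResidualStack(64, dilations=[1, 4, 8])

x = torch.randn(2, 64, 100)  # (batch, channels, time)
print(paper_repeat(x).shape, impl_repeat(x).shape)  # both preserve the shape
```

The main behavioral difference I'm asking about is the number of skip connections per repeat: one around every dilated conv (paper) versus one around the whole stack (current code).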
Is this difference from the paper intentional?