I tried to implement SqueezeNet as a torchvision model and train it with the ImageNet example, and found that it doesn't converge as is. The reference code differs in two aspects:
- All convolutions except the last are initialized with the Xavier (Glorot) initializer; the last is initialized from a normal distribution with stddev 0.01.
- The learning rate is decreased linearly (a polynomial schedule with power=1).
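For reference, here is a minimal sketch of how both aspects could be expressed in PyTorch. The `init_squeezenet` helper and the use of `LambdaLR` for the linear decay are my own illustration, not the reference implementation itself:

```python
import torch
import torch.nn as nn

def init_squeezenet(model, final_conv):
    """Hypothetical helper: Xavier init for every conv except the final
    classifier conv, which is drawn from N(0, 0.01)."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            if m is final_conv:
                nn.init.normal_(m.weight, mean=0.0, std=0.01)
            else:
                nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# Toy stand-in for the network; the real model has Fire modules.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 10, 1))
init_squeezenet(model, final_conv=model[2])

# Linear decay (polynomial schedule with power=1) over `total_epochs`,
# expressed as a LambdaLR multiplier on the base learning rate.
total_epochs = 10
optimizer = torch.optim.SGD(model.parameters(), lr=0.04)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: 1.0 - epoch / total_epochs)
```

Calling `scheduler.step()` once per epoch then drives the learning rate from the base value down to zero by the final epoch.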
In PyTorch these aspects are hard-coded inside the ImageNet example, but I think it makes sense to make them part of the model definition in torchvision. What's your position on this?