alexnet [with cifar-10, batch size 256, workers 6]: training loss is NaN #152

Closed
Foristkirito opened this issue Apr 25, 2017 · 7 comments

Comments

Foristkirito

No description provided.


wangg12 commented Apr 25, 2017

@Foristkirito Could you provide a small snippet to reproduce your bug?

Foristkirito (Author)

@wangg12 Of course. I'm just using my own modified version of the code, run with the command `python main.py -a alexnet -j 6 --resume ./alexnet_cp --epochs 90 -b 256 ./data`. I think the problem is that the loss grows so large that it overflows.
I also ran with `--pretrained`; the loss is fine then, but after 90 epochs the accuracy had hardly changed, as shown below:

 * Prec@1 10.000 Prec@5 50.000
Epoch: [89][0/196]      Time 1.455 (1.455)      Data 1.072 (1.072)      Loss 2.3118 (2.3118)    Prec@1 9.375 (9.375)    Prec@5 49.219 (49.219)
Epoch: [89][10/196]     Time 0.406 (0.507)      Data 0.001 (0.100)      Loss 2.3118 (2.3118)    Prec@1 10.547 (10.085)  Prec@5 51.172 (50.604)
Epoch: [89][20/196]     Time 0.408 (0.460)      Data 0.001 (0.053)      Loss 2.3118 (2.3118)    Prec@1 8.594 (10.212)   Prec@5 51.172 (50.930)
Epoch: [89][30/196]     Time 0.401 (0.444)      Data 0.001 (0.036)      Loss 2.3119 (2.3118)    Prec@1 8.594 (9.929)    Prec@5 44.922 (50.441)
Epoch: [89][40/196]     Time 0.004 (0.436)      Data 0.001 (0.027)      Loss 2.3118 (2.3118)    Prec@1 10.156 (9.861)   Prec@5 55.859 (50.210)
Epoch: [89][50/196]     Time 0.410 (0.431)      Data 0.001 (0.022)      Loss 2.3119 (2.3118)    Prec@1 10.547 (9.934)   Prec@5 49.609 (50.444)
Epoch: [89][60/196]     Time 0.415 (0.428)      Data 0.001 (0.019)      Loss 2.3117 (2.3118)    Prec@1 11.719 (10.028)  Prec@5 55.078 (50.506)
Epoch: [89][70/196]     Time 0.407 (0.426)      Data 0.001 (0.016)      Loss 2.3119 (2.3118)    Prec@1 8.594 (10.030)   Prec@5 49.219 (50.539)
Epoch: [89][80/196]     Time 0.393 (0.428)      Data 0.001 (0.014)      Loss 2.3118 (2.3118)    Prec@1 6.641 (9.968)    Prec@5 50.781 (50.236)
Epoch: [89][90/196]     Time 0.392 (0.426)      Data 0.001 (0.013)      Loss 2.3119 (2.3118)    Prec@1 8.984 (10.045)   Prec@5 49.219 (50.206)
Epoch: [89][100/196]    Time 0.591 (0.425)      Data 0.001 (0.011)      Loss 2.3118 (2.3118)    Prec@1 10.156 (9.998)   Prec@5 50.000 (50.085)
Epoch: [89][110/196]    Time 0.399 (0.423)      Data 0.001 (0.011)      Loss 2.3118 (2.3118)    Prec@1 13.672 (10.015)  Prec@5 51.953 (50.070)
Epoch: [89][120/196]    Time 0.395 (0.422)      Data 0.001 (0.010)      Loss 2.3119 (2.3118)    Prec@1 8.203 (9.985)    Prec@5 48.438 (49.913)
Epoch: [89][130/196]    Time 0.389 (0.422)      Data 0.001 (0.009)      Loss 2.3118 (2.3118)    Prec@1 10.938 (9.951)   Prec@5 50.781 (49.860)
Epoch: [89][140/196]    Time 0.404 (0.421)      Data 0.001 (0.008)      Loss 2.3119 (2.3118)    Prec@1 8.984 (9.912)    Prec@5 50.000 (49.986)
Epoch: [89][150/196]    Time 0.397 (0.421)      Data 0.001 (0.008)      Loss 2.3119 (2.3118)    Prec@1 7.422 (9.910)    Prec@5 49.609 (50.000)
Epoch: [89][160/196]    Time 0.408 (0.419)      Data 0.001 (0.008)      Loss 2.3119 (2.3118)    Prec@1 9.766 (9.899)    Prec@5 47.266 (49.939)
Epoch: [89][170/196]    Time 0.399 (0.419)      Data 0.001 (0.007)      Loss 2.3119 (2.3118)    Prec@1 12.109 (9.875)   Prec@5 48.438 (49.836)
Epoch: [89][180/196]    Time 0.405 (0.419)      Data 0.001 (0.007)      Loss 2.3119 (2.3118)    Prec@1 11.328 (9.874)   Prec@5 48.828 (49.767)
Epoch: [89][190/196]    Time 0.398 (0.419)      Data 0.000 (0.006)      Loss 2.3119 (2.3118)    Prec@1 8.203 (9.835)    Prec@5 49.219 (49.691)
Test: [0/40]    Time 0.845 (0.845)      Loss 2.3119 (2.3119)    Prec@1 8.984 (8.984)    Prec@5 46.875 (46.875)
Test: [10/40]   Time 0.155 (0.264)      Loss 2.3118 (2.3118)    Prec@1 10.547 (10.298)  Prec@5 54.688 (49.503)
Test: [20/40]   Time 0.168 (0.214)      Loss 2.3118 (2.3118)    Prec@1 10.938 (10.305)  Prec@5 53.125 (50.186)
Test: [30/40]   Time 0.338 (0.203)      Loss 2.3117 (2.3118)    Prec@1 9.766 (10.131)   Prec@5 57.812 (50.101)
 * Prec@1 10.000 Prec@5 50.000
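
A minimal sketch (not the actual code from main.py) of how the suspected overflow could be caught early, by checking the loss before backprop inside the training loop:

```python
import math

def train_step(model, criterion, optimizer, images, target):
    """One training step that stops as soon as the loss overflows."""
    output = model(images)
    loss = criterion(output, target)

    # Abort immediately if the loss has become NaN/Inf instead of
    # silently continuing to train on garbage gradients.
    if not math.isfinite(loss.item()):
        raise RuntimeError(
            f"Loss exploded to {loss.item()}; "
            "try a lower learning rate or check the input data.")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```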

Foristkirito (Author)

@wangg12 I figured it out. I made a mistake. Problem solved, thank you.


wangg12 commented Apr 25, 2017

@Foristkirito What is the problem?

Foristkirito (Author)

@wangg12 The problem was the directories. You have to keep the project's directory structure intact and put it right in your home directory. I don't understand why, but it does not work if you just put the imagenet folder in your home directory. However, resnet always works fine; I will spend some time figuring out the real problem.
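
For context, the example script builds its datasets with torchvision's ImageFolder, so the data directory passed on the command line (`./data` above) has to contain `train/` and `val/` sub-folders with one directory per class. A rough sketch of that loading, assuming a recent torchvision (transforms simplified, normalization omitted):

```python
import os
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Expected layout (class labels are inferred from the folder names):
#   data/train/<class_name>/*.png
#   data/val/<class_name>/*.png
data = './data'
traindir = os.path.join(data, 'train')
valdir = os.path.join(data, 'val')

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]))
val_dataset = datasets.ImageFolder(
    valdir,
    transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]))
```

If the folder layout is off, ImageFolder either fails or builds an unexpected class index, which could explain the ~10% top-1 accuracy stuck at chance level for 10 classes in the log above.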


wangg12 commented Apr 25, 2017

@Foristkirito Do you get better results with alexnet on cifar-10 now?

Also, IMO alexnet is not really suitable for cifar10. The architecture is designed for much larger images, not 32x32 cifar-10 images.
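
If you want to try it anyway, one common workaround (just a sketch, assuming a recent torchvision; the root path is only an example) is to upsample the 32x32 images to the 224x224 input alexnet expects and use the CIFAR10 dataset class directly:

```python
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Upsample CIFAR-10's 32x32 images to AlexNet's expected 224x224 input.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.CIFAR10(root='./cifar10', train=True,
                             download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=256,
                          shuffle=True, num_workers=6)
```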

Besides, if you do not use pre-trained weights, you should be careful with the learning rate and the weight initialization (the random initialization can be made reproducible with `torch.manual_seed(seed)`).
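
Something like this, as a minimal sketch (the seed and learning rate values are only placeholders, not recommendations):

```python
import torch
import torch.nn as nn
import torchvision.models as models

torch.manual_seed(42)                    # make the random initialization reproducible
model = models.alexnet(num_classes=10)   # randomly initialized, no pre-trained weights

criterion = nn.CrossEntropyLoss()
# A smaller learning rate than the ImageNet default is often safer
# when training from scratch on a different dataset.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=1e-4)
```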

Foristkirito (Author)

@wangg12 The accuracy of alexnet is still low. That sounds like a good solution; I will try it.
