alexnet [with cifar-10, batch size 256, workers 6]: training loss is NaN #152

Closed
Foristkirito opened this issue Apr 25, 2017 · 7 comments

Comments

Foristkirito

No description provided.


wangg12 commented Apr 25, 2017

@Foristkirito Could you provide a small snippet to reproduce your bug?

Foristkirito (Author)

@wangg12 Of course. I'm just using my own modified version of the code, run with the command `python main.py -a alexnet -j 6 --resume ./alexnet_cp --epochs 90 -b 256 ./data`. I think the problem is that the loss grows so large that it overflows.
I also ran with `--pretrained`; the loss is fine then, but after 90 epochs the accuracy had hardly changed, as shown below:

 * Prec@1 10.000 Prec@5 50.000
Epoch: [89][0/196]      Time 1.455 (1.455)      Data 1.072 (1.072)      Loss 2.3118 (2.3118)    Prec@1 9.375 (9.375)    Prec@5 49.219 (49.219)
Epoch: [89][10/196]     Time 0.406 (0.507)      Data 0.001 (0.100)      Loss 2.3118 (2.3118)    Prec@1 10.547 (10.085)  Prec@5 51.172 (50.604)
Epoch: [89][20/196]     Time 0.408 (0.460)      Data 0.001 (0.053)      Loss 2.3118 (2.3118)    Prec@1 8.594 (10.212)   Prec@5 51.172 (50.930)
Epoch: [89][30/196]     Time 0.401 (0.444)      Data 0.001 (0.036)      Loss 2.3119 (2.3118)    Prec@1 8.594 (9.929)    Prec@5 44.922 (50.441)
Epoch: [89][40/196]     Time 0.004 (0.436)      Data 0.001 (0.027)      Loss 2.3118 (2.3118)    Prec@1 10.156 (9.861)   Prec@5 55.859 (50.210)
Epoch: [89][50/196]     Time 0.410 (0.431)      Data 0.001 (0.022)      Loss 2.3119 (2.3118)    Prec@1 10.547 (9.934)   Prec@5 49.609 (50.444)
Epoch: [89][60/196]     Time 0.415 (0.428)      Data 0.001 (0.019)      Loss 2.3117 (2.3118)    Prec@1 11.719 (10.028)  Prec@5 55.078 (50.506)
Epoch: [89][70/196]     Time 0.407 (0.426)      Data 0.001 (0.016)      Loss 2.3119 (2.3118)    Prec@1 8.594 (10.030)   Prec@5 49.219 (50.539)
Epoch: [89][80/196]     Time 0.393 (0.428)      Data 0.001 (0.014)      Loss 2.3118 (2.3118)    Prec@1 6.641 (9.968)    Prec@5 50.781 (50.236)
Epoch: [89][90/196]     Time 0.392 (0.426)      Data 0.001 (0.013)      Loss 2.3119 (2.3118)    Prec@1 8.984 (10.045)   Prec@5 49.219 (50.206)
Epoch: [89][100/196]    Time 0.591 (0.425)      Data 0.001 (0.011)      Loss 2.3118 (2.3118)    Prec@1 10.156 (9.998)   Prec@5 50.000 (50.085)
Epoch: [89][110/196]    Time 0.399 (0.423)      Data 0.001 (0.011)      Loss 2.3118 (2.3118)    Prec@1 13.672 (10.015)  Prec@5 51.953 (50.070)
Epoch: [89][120/196]    Time 0.395 (0.422)      Data 0.001 (0.010)      Loss 2.3119 (2.3118)    Prec@1 8.203 (9.985)    Prec@5 48.438 (49.913)
Epoch: [89][130/196]    Time 0.389 (0.422)      Data 0.001 (0.009)      Loss 2.3118 (2.3118)    Prec@1 10.938 (9.951)   Prec@5 50.781 (49.860)
Epoch: [89][140/196]    Time 0.404 (0.421)      Data 0.001 (0.008)      Loss 2.3119 (2.3118)    Prec@1 8.984 (9.912)    Prec@5 50.000 (49.986)
Epoch: [89][150/196]    Time 0.397 (0.421)      Data 0.001 (0.008)      Loss 2.3119 (2.3118)    Prec@1 7.422 (9.910)    Prec@5 49.609 (50.000)
Epoch: [89][160/196]    Time 0.408 (0.419)      Data 0.001 (0.008)      Loss 2.3119 (2.3118)    Prec@1 9.766 (9.899)    Prec@5 47.266 (49.939)
Epoch: [89][170/196]    Time 0.399 (0.419)      Data 0.001 (0.007)      Loss 2.3119 (2.3118)    Prec@1 12.109 (9.875)   Prec@5 48.438 (49.836)
Epoch: [89][180/196]    Time 0.405 (0.419)      Data 0.001 (0.007)      Loss 2.3119 (2.3118)    Prec@1 11.328 (9.874)   Prec@5 48.828 (49.767)
Epoch: [89][190/196]    Time 0.398 (0.419)      Data 0.000 (0.006)      Loss 2.3119 (2.3118)    Prec@1 8.203 (9.835)    Prec@5 49.219 (49.691)
Test: [0/40]    Time 0.845 (0.845)      Loss 2.3119 (2.3119)    Prec@1 8.984 (8.984)    Prec@5 46.875 (46.875)
Test: [10/40]   Time 0.155 (0.264)      Loss 2.3118 (2.3118)    Prec@1 10.547 (10.298)  Prec@5 54.688 (49.503)
Test: [20/40]   Time 0.168 (0.214)      Loss 2.3118 (2.3118)    Prec@1 10.938 (10.305)  Prec@5 53.125 (50.186)
Test: [30/40]   Time 0.338 (0.203)      Loss 2.3117 (2.3118)    Prec@1 9.766 (10.131)   Prec@5 57.812 (50.101)
 * Prec@1 10.000 Prec@5 50.000
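
A minimal sketch (not the actual code from main.py) of how the suspected overflow could be caught early, by checking the loss before backprop inside the training loop:

```python
import math

def train_step(model, criterion, optimizer, images, target):
    """One training step that stops as soon as the loss overflows."""
    output = model(images)
    loss = criterion(output, target)

    # Abort immediately if the loss has become NaN/Inf instead of
    # silently continuing to train on garbage gradients.
    if not math.isfinite(loss.item()):
        raise RuntimeError(
            f"Loss exploded to {loss.item()}; "
            "try a lower learning rate or check the input data.")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```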

Foristkirito (Author)

@wangg12 I figured it out. I made a mistake. Problem solved, thank you.


wangg12 commented Apr 25, 2017

@Foristkirito What is the problem?

Foristkirito (Author)

@wangg12 The problem was the directories. You have to keep the project's directory structure intact and put it right in your home directory. I don't understand why, but it does not work if you just put the imagenet folder in your home directory. However, resnet always works fine; I will spend some time figuring out the real problem.
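
For context, the example script builds its datasets with torchvision's ImageFolder, so the data directory passed on the command line (`./data` above) has to contain `train/` and `val/` sub-folders with one directory per class. A rough sketch of that loading, assuming a recent torchvision (transforms simplified, normalization omitted):

```python
import os
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Expected layout (class labels are inferred from the folder names):
#   data/train/<class_name>/*.png
#   data/val/<class_name>/*.png
data = './data'
traindir = os.path.join(data, 'train')
valdir = os.path.join(data, 'val')

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]))
val_dataset = datasets.ImageFolder(
    valdir,
    transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]))
```

If the folder layout is off, ImageFolder either fails or builds an unexpected class index, which could explain the ~10% top-1 accuracy stuck at chance level for 10 classes in the log above.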


wangg12 commented Apr 25, 2017

@Foristkirito Do you get better results with alexnet on cifar-10 now?

Also, IMO alexnet is not really suitable for cifar10. The architecture is designed for much larger images, not 32x32 cifar-10 images.
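
If you want to try it anyway, one common workaround (just a sketch, assuming a recent torchvision; the root path is only an example) is to upsample the 32x32 images to the 224x224 input alexnet expects and use the CIFAR10 dataset class directly:

```python
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Upsample CIFAR-10's 32x32 images to AlexNet's expected 224x224 input.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.CIFAR10(root='./cifar10', train=True,
                             download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=256,
                          shuffle=True, num_workers=6)
```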

Besides, if you do not use pre-trained weights, you should be careful with the learning rate and the weight initialization (the random initialization can be made reproducible with `torch.manual_seed(seed)`).
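
Something like this, as a minimal sketch (the seed and learning rate values are only placeholders, not recommendations):

```python
import torch
import torch.nn as nn
import torchvision.models as models

torch.manual_seed(42)                    # make the random initialization reproducible
model = models.alexnet(num_classes=10)   # randomly initialized, no pre-trained weights

criterion = nn.CrossEntropyLoss()
# A smaller learning rate than the ImageNet default is often safer
# when training from scratch on a different dataset.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=1e-4)
```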

Foristkirito (Author)

@wangg12 The accuracy of alexnet is still low. That sounds like a good solution; I will try it.
