
Question about the number of epochs needed for convergence #4

Closed
xuyifeng-nwpu opened this issue May 13, 2017 · 6 comments


@xuyifeng-nwpu

Using the same hyperparameters (26 2x32d), the model converges at about epoch 1850.

After several training runs with other hyperparameters, however, training with that same 26 2x32d configuration now takes more than 3500 epochs to converge.

The two experiments have the same hyperparameters and the same code. The only difference between the second experiment and the first is that I trained the model several times in between.

@xgastaldi
Owner

xgastaldi commented May 13, 2017

Can you give me a bit more information? I'm not sure I understand your exact problem.
What exactly do you mean by converging? And what do you mean by having trained the data several times?

@xuyifeng-nwpu
Author

xuyifeng-nwpu commented May 13, 2017

@xgastaldi Thank you for your reply.

First image: test_top1

The image named "test_top1" shows the test top-1 error for two experiments that used the same 26 2x32d hyperparameters. I ran the command "CUDA_VISIBLE_DEVICES=0 th main.lua -dataset cifar10 -nGPU 1 -batchSize 64 -depth 26 -shareGradInput false -optnet true -nEpochs 1800 -netType shakeshake -lrShape cosine -widenFactor 2 -LR 0.1 -forwardShake true -backwardShake true -shakeImage true", copied from your readme.md. Why do two identical experiments differ so much?


Second image: loss

The image named "loss" shows the loss value for the same two experiments with the same hyperparameters and the same code. The blue line decreases more quickly.

The first experiment, whose result is shown by the blue line, reproduced the result from your paper. Then I ran the code while changing several parameters such as depth, width, shortcut mode, and so on. Finally, I was surprised to find that I could not reproduce the result of the first experiment. In the last experiment I changed the parameters back to those of the first experiment, and I used a code-comparison tool to check that the code of the two experiments is identical.

Does the training data change after several training runs?

@xgastaldi
Owner

The training data does not change.
My guess is that you changed -nEpochs to something other than 1800. If it had been 1800, training would have stopped at 1800. If you use more epochs, the learning rate decay schedule is "stretched": the learning rate is based on the progress percentage (epoch / nEpochs) rather than the absolute epoch number:
lr at epoch 400 when -nEpochs 1800 = lr at epoch 800 when -nEpochs 3600
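
For illustration only (this is not the repository's train.lua code, and cosineLR is just a hypothetical helper name), a minimal sketch of a progress-based cosine schedule shows the stretching:

```lua
-- Sketch: the learning rate depends only on the progress ratio epoch/nEpochs,
-- so doubling nEpochs stretches the same schedule over twice as many epochs.
local function cosineLR(baseLR, epoch, nEpochs)
   return 0.5 * baseLR * (1 + math.cos(math.pi * epoch / nEpochs))
end

print(cosineLR(0.1, 400, 1800))  -- ~0.088
print(cosineLR(0.1, 800, 3600))  -- identical value: same progress ratio (2/9)
```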

@xuyifeng-nwpu
Author

xuyifeng-nwpu commented May 14, 2017

@xgastaldi Thank you.

Epochs and training time

I think you are right. In the second experiment, shown by the red line, nEpochs was indeed set to 3600. I am now running the code with nEpochs 1800 and hope the results will match your paper. The full 1800-epoch training will take more than 2 days on my PC, whose graphics card is an NVIDIA 980 Ti. Can you tell me roughly how long training (1800 epochs, 26 2x32d) takes on your machine, and which graphics card you use?

The best number of epochs

If I set nEpochs to 900, can I still achieve the result shown in your paper? My second question is whether 1800 is the best value for nEpochs. Was 1800 chosen based on multiple experiments?

Learning rate

Does the learning rate decrease with the epochs? Linearly or non-linearly?

@xgastaldi
Owner

Training time:
1 TITAN X (Pascal), 26 2x32d, -LR 0.1 -batchSize 64: 51s per epoch

Number of epochs:
Due to the ICLR deadline, I simply had to choose a conservative estimate that would give me the best possible result at 96d. I could not do that based on 32d tests because the higher capacity of 96d models also changes the ideal training time. It could very well be that even at 96d there is no need for so many epochs, but I had neither the time nor the GPUs to run 3 tests at 900 epochs, 3 tests at 1200 epochs and 3 tests at 1500 epochs.

Learning rate:
The model uses a cosine annealing function as described in https://arxiv.org/abs/1608.03983.
You can find the code at the end of train.lua.
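
For reference, the annealing formula in that paper (written out here as a sketch, not quoted from train.lua) has the form

```
lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
```

with t the current epoch and T the total number of epochs, so the decay is non-linear: almost flat at the start, fastest around the middle of training, and flat again near the end.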

@xgastaldi
Owner

I will close this issue. Feel free to comment if you still have this problem.
