Training problem when epoch changing #7

Closed
ltcs11 opened this issue Aug 23, 2018 · 13 comments

@ltcs11 commented Aug 23, 2018

No description provided.

@ltcs11 (Author) commented Aug 23, 2018

When I train with the refined MS1M dataset, the loss changes drastically whenever the epoch number changes.

Attached are my TensorBoard results.
[TensorBoard screenshots: training loss curves with a spike at each epoch boundary]

From the plots above you can see the jump in the loss.

This becomes a serious problem once the learning rate is decayed, since it greatly increases the time needed for training to converge.

Have you ever met this problem when training the net? If so, how did you solve it?

Thanks a lot.

ltcs11 changed the title from "Training problem when epoch" to "Training problem when epoch changing" on Aug 23, 2018
@muyoucun

I have the same problem.

@sirius-ai (Owner)

@ltcs11 @muyoucun Thanks for pointing out the bug. It is likely caused by dataset.shuffle: since (total samples) % (shuffle buffer size) = remainder, if the remainder is far smaller than the shuffle buffer size, dataset.shuffle will repeat those remaining samples until the buffer is full, so the end of every epoch leads to overfitting and the loss increases when the next epoch begins (see https://stackoverflow.com/questions/46928328/why-training-loss-is-increased-at-the-beginning-of-each-epoch).
A temporary workaround is to comment out dataset.shuffle if your dataset is already shuffled well enough, or to set the dataset.shuffle buffer size to len(dataset).
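
A minimal sketch of those two workarounds, assuming a TF 1.x tf.data pipeline; the filename, record count, and batch size below are illustrative, not taken from this repo's code:

```python
import tensorflow as tf  # TF 1.x

# Assumed input pipeline: a TFRecordDataset over the training file discussed here.
dataset = tf.data.TFRecordDataset("tran.tfrecord")

# Workaround 1: if the records were already written in random order,
# simply comment the shuffle step out.
# dataset = dataset.shuffle(buffer_size=10000)

# Workaround 2: make the shuffle buffer cover the whole dataset
# (very memory-hungry for a dataset the size of MS1M).
total_records = 3804846  # illustrative; use your own record count
dataset = dataset.shuffle(buffer_size=total_records)

dataset = dataset.batch(90)  # illustrative batch size
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
```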


@zfs1993 commented Sep 18, 2018

Can you provide the hyperparameters you set? My inference loss doesn't converge.

@zfs1993 commented Sep 18, 2018

Did you just comment out dataset.shuffle? I used the dataset provided here; I don't know whether it is already in random order.

@ltcs11 (Author) commented Sep 19, 2018

> Did you just comment out dataset.shuffle? I used the dataset provided here; I don't know whether it is already in random order.

The provided dataset is ordered by label number.
I just randomly split the single tfrecord file into many smaller files and set the dataset.shuffle buffer size large enough to cover the longest of those files (see the sketch after this comment).

I used the default hyperparameters, and I didn't reach the 99.2+ result either.
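
A rough sketch of that splitting-plus-shuffling idea, assuming TF 1.x; the shard filenames, cycle length, and buffer size are illustrative, not the values ltcs11 actually used:

```python
import glob
import tensorflow as tf  # TF 1.x

# Hypothetical shard files produced by randomly splitting the original tfrecord.
shard_files = glob.glob("tfrecords/train_shard_*.tfrecord")

# Shuffle the order of the shard files, then interleave their records, so a
# shuffle buffer that covers one shard is enough to mix labels well.
files = tf.data.Dataset.from_tensor_slices(shard_files).shuffle(len(shard_files))
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)

# Buffer size chosen to be at least as large as the biggest shard.
dataset = dataset.shuffle(buffer_size=60000)
dataset = dataset.batch(90)
```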

@muyoucun

I have 5822653 images and I use a shuffle buffer size of 35504, so the remainder is 5822653 % 35504 = 35501.
According to @sirius-ai, only 3 images would be repeated at the end, since 35504 - 35501 = 3. If my understanding is right, the remainder is so close to the shuffle buffer size that it should hardly overfit.
But the loss still changed greatly when the first epoch ended. (By the way, when the second or third epoch ends, the loss does not increase much.)

Maybe I should split the single tfrecord file into many smaller files.
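
Just to make that arithmetic explicit (plain Python, following sirius-ai's description of how the tail of the epoch gets repeated):

```python
total_images = 5822653
buffer_size = 35504

remainder = total_images % buffer_size   # -> 35501
repeated = buffer_size - remainder       # -> 3 samples repeated at the end
print(remainder, repeated)
```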

@zfs1993 commented Sep 20, 2018

> I have 5822653 images and I use a shuffle buffer size of 35504, so the remainder is 5822653 % 35504 = 35501. [...] Maybe I should split the single tfrecord file into many smaller files.

I counted the records in tran.tfrecord (generated from the default dataset) and, unless I made a mistake, there are only 3804846 pictures in it. I changed the buffer size to 8747 (3804846 % 8747 = 8648, so only 99 images should be repeated), but the loss increase still occurs, so maybe this is not a good idea. Did you try the method ltcs11 described? I am going to try it.
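
For reference, counting the records in a tfrecord can be done like this in TF 1.x (a small sketch; the filename is the one mentioned above):

```python
import tensorflow as tf  # TF 1.x

count = 0
for _ in tf.python_io.tf_record_iterator("tran.tfrecord"):
    count += 1
print("records in tran.tfrecord:", count)
```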

@muyoucun commented Sep 20, 2018

The insightface author has updated his datasets; you can download them from his GitHub.
As for this issue, I now believe it is entirely the SHUFFLE's problem (see https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle?noredirect=1&lq=1).

I tested it like this:

import numpy as np
import tensorflow as tf  # TF 1.x

a = np.arange(1, 60)                      # ordered data, like the label-sorted tfrecord
dataset = tf.data.Dataset.from_tensor_slices(a)
dataset = dataset.shuffle(10)             # shuffle buffer much smaller than the dataset
dataset = dataset.batch(9)
el = dataset.make_one_shot_iterator().get_next()

I print el (via sess.run(el)) each step; one run looks like this:
[ 5 7 8 4 1 3 14 10 17]
[ 6 16 15 20 13 24 21 2 9]
[18 19 23 25 28 12 29 30 32]
[36 33 27 39 26 31 11 34 45]
[43 40 41 38 42 50 35 22 53]
[54 48 52 49 55 56 51 58 46]
[44 37 47 57]
You can see that it tends to emit the smaller numbers first.
Since the provided dataset is ordered by label number, the batches drawn are not really random.
So now I am shuffling the data and generating the tfrecord file again (a sketch of that step follows this comment).
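
A rough sketch of that regeneration step, assuming the samples are available in memory as (encoded image bytes, label) pairs; serialize_example, the feature keys, and the placeholder data are hypothetical stand-ins for however the original tfrecord was written:

```python
import random
import tensorflow as tf  # TF 1.x

def serialize_example(img_bytes, label):
    # Hypothetical serializer: stores the encoded image bytes and an integer label.
    feature = {
        "image_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Placeholder data: replace with the real (encoded_image_bytes, label) pairs,
# which in the provided file are sorted by label.
samples = [(b"\x00" * 8, 0), (b"\x01" * 8, 1)]

random.shuffle(samples)  # shuffle once, before writing, so the file itself is in random order

with tf.python_io.TFRecordWriter("train_shuffled.tfrecord") as writer:
    for img_bytes, label in samples:
        writer.write(serialize_example(img_bytes, label))
```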

@zfs1993 commented Sep 20, 2018

> The insightface author has updated his datasets; you can download them from his GitHub. [...] So now I am shuffling the data and generating the tfrecord file again.

I followed your advice and found that if I comment out dataset.shuffle, an error occurs:

Traceback (most recent call last):
  File "train_nets4.py", line 262, in <module>
    eer = brentq(lambda x: 1. - x - interpolate.interp1d(fpr, tpr)(x), 0., 1.)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/optimize/zeros.py", line 442, in brentq
    r = _zeros._brentq(f,a,b,xtol,rtol,maxiter,args,full_output,disp)
  File "train_nets4.py", line 262, in <lambda>
    eer = brentq(lambda x: 1. - x - interpolate.interp1d(fpr, tpr)(x), 0., 1.)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/polyint.py", line 79, in __call__
    y = self._evaluate(x)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/interpolate.py", line 610, in _evaluate
    below_bounds, above_bounds = self._check_bounds(x_new)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/interpolate.py", line 642, in _check_bounds
    raise ValueError("A value in x_new is above the interpolation "
ValueError: A value in x_new is above the interpolation range.

I don't know why this happened; did you encounter this problem?
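
Not from this repo, but for context: the traceback means interp1d(fpr, tpr) is being evaluated at an x outside the observed fpr range (brentq probes the endpoints 0 and 1 first). A defensive variant of that EER computation, shown purely as a sketch, clamps out-of-range queries instead of raising:

```python
from scipy import interpolate
from scipy.optimize import brentq

def compute_eer(fpr, tpr):
    # Assumes fpr/tpr come sorted by increasing fpr (as from sklearn's roc_curve).
    # Clamp queries outside the observed fpr range to the boundary tpr values
    # instead of raising, so brentq can evaluate at x = 0 and x = 1.
    roc = interpolate.interp1d(fpr, tpr, bounds_error=False,
                               fill_value=(tpr[0], tpr[-1]))
    return brentq(lambda x: float(1. - x - roc(x)), 0., 1.)
```

This only avoids the crash; the underlying cause (for example degenerate scores early in training, or the pretrained-weight issue sirius-ai points to below) still needs to be addressed.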

@sirius-ai (Owner)

@muyoucun Thank you!
@zfs1993 Don't use the pretrained model to initialize the weights; try retraining from scratch.

@zfs1993 commented Sep 21, 2018

> @zfs1993 Don't use the pretrained model to initialize the weights; try retraining from scratch.

I trained the model from scratch. The accuracy on LFW reached 98.5% and the validation rate is 95%, but the result on AgeDB is really bad, especially the validation rate, which is about 30%; that is a huge gap.
