Training problem when epoch changing #7

Closed
ltcs11 opened this issue Aug 23, 2018 · 13 comments

@ltcs11 commented Aug 23, 2018

No description provided.

@ltcs11 (Author) commented Aug 23, 2018

When I train with the refined MS1M dataset, the loss changes drastically whenever the epoch number changes.

Attached are my TensorBoard results.
[TensorBoard screenshots: training loss curves with a spike at each epoch boundary]

From the plots above you can see the jump in the loss.

This becomes a serious problem once the learning rate is decayed, since it greatly increases the time needed for training to converge.

Have you ever met this problem when training the net? If so, how did you solve it?

Thanks a lot.

ltcs11 changed the title from "Training problem when epoch" to "Training problem when epoch changing" on Aug 23, 2018
@muyoucun

I have the same problem.

@sirius-ai (Owner)

@ltcs11 @muyoucun Thanks for pointing out the bug. It is likely caused by dataset.shuffle: since (total samples) % (shuffle buffer size) = remainder, if the remainder is far smaller than the shuffle buffer size, dataset.shuffle will repeat those remaining samples until the buffer is full, so the end of every epoch leads to overfitting and the loss increases when the next epoch begins (see https://stackoverflow.com/questions/46928328/why-training-loss-is-increased-at-the-beginning-of-each-epoch).
A temporary workaround is to comment out dataset.shuffle if your dataset is already shuffled well enough, or to set the dataset.shuffle buffer size to len(dataset).
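
A minimal sketch of those two workarounds, assuming a TF 1.x tf.data pipeline; the filename, record count, and batch size below are illustrative, not taken from this repo's code:

```python
import tensorflow as tf  # TF 1.x

# Assumed input pipeline: a TFRecordDataset over the training file discussed here.
dataset = tf.data.TFRecordDataset("tran.tfrecord")

# Workaround 1: if the records were already written in random order,
# simply comment the shuffle step out.
# dataset = dataset.shuffle(buffer_size=10000)

# Workaround 2: make the shuffle buffer cover the whole dataset
# (very memory-hungry for a dataset the size of MS1M).
total_records = 3804846  # illustrative; use your own record count
dataset = dataset.shuffle(buffer_size=total_records)

dataset = dataset.batch(90)  # illustrative batch size
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
```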


@zfs1993 commented Sep 18, 2018

Can you provide the hyperparameters you set? My inference loss doesn't converge.

@zfs1993 commented Sep 18, 2018

Did you just comment out dataset.shuffle? I used the dataset provided here; I don't know whether it is already in random order.

@ltcs11 (Author) commented Sep 19, 2018

> Did you just comment out dataset.shuffle? I used the dataset provided here; I don't know whether it is already in random order.

The provided dataset is ordered by label number.
I just randomly split the single tfrecord file into many smaller files and set the dataset.shuffle buffer size large enough to cover the longest of those files (see the sketch after this comment).

I used the default hyperparameters, and I didn't reach the 99.2+ result either.
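
A rough sketch of that splitting-plus-shuffling idea, assuming TF 1.x; the shard filenames, cycle length, and buffer size are illustrative, not the values ltcs11 actually used:

```python
import glob
import tensorflow as tf  # TF 1.x

# Hypothetical shard files produced by randomly splitting the original tfrecord.
shard_files = glob.glob("tfrecords/train_shard_*.tfrecord")

# Shuffle the order of the shard files, then interleave their records, so a
# shuffle buffer that covers one shard is enough to mix labels well.
files = tf.data.Dataset.from_tensor_slices(shard_files).shuffle(len(shard_files))
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)

# Buffer size chosen to be at least as large as the biggest shard.
dataset = dataset.shuffle(buffer_size=60000)
dataset = dataset.batch(90)
```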

@muyoucun

I have 5822653 images and I use a shuffle buffer size of 35504, so the remainder is 5822653 % 35504 = 35501.
According to @sirius-ai, only 3 images would be repeated at the end, since 35504 - 35501 = 3. If my understanding is right, the remainder is so close to the shuffle buffer size that it should hardly overfit.
But the loss still changed greatly when the first epoch ended. (By the way, when the second or third epoch ends, the loss does not increase much.)

Maybe I should split the single tfrecord file into many smaller files.
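
Just to make that arithmetic explicit (plain Python, following sirius-ai's description of how the tail of the epoch gets repeated):

```python
total_images = 5822653
buffer_size = 35504

remainder = total_images % buffer_size   # -> 35501
repeated = buffer_size - remainder       # -> 3 samples repeated at the end
print(remainder, repeated)
```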

@zfs1993 commented Sep 20, 2018

> I have 5822653 images and I use a shuffle buffer size of 35504, so the remainder is 5822653 % 35504 = 35501. [...] Maybe I should split the single tfrecord file into many smaller files.

I counted the records in tran.tfrecord (generated from the default dataset) and, unless I made a mistake, there are only 3804846 pictures in it. I changed the buffer size to 8747 (3804846 % 8747 = 8648, so only 99 images should be repeated), but the loss increase still occurs, so maybe this is not a good idea. Did you try the method ltcs11 described? I am going to try it.
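
For reference, counting the records in a tfrecord can be done like this in TF 1.x (a small sketch; the filename is the one mentioned above):

```python
import tensorflow as tf  # TF 1.x

count = 0
for _ in tf.python_io.tf_record_iterator("tran.tfrecord"):
    count += 1
print("records in tran.tfrecord:", count)
```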

@muyoucun commented Sep 20, 2018

The insightface author has updated his datasets; you can download them from his GitHub.
As for this issue, I now believe it is entirely the SHUFFLE's problem (see https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle?noredirect=1&lq=1).

I tested it like this:

import numpy as np
import tensorflow as tf  # TF 1.x

a = np.arange(1, 60)                      # ordered data, like the label-sorted tfrecord
dataset = tf.data.Dataset.from_tensor_slices(a)
dataset = dataset.shuffle(10)             # shuffle buffer much smaller than the dataset
dataset = dataset.batch(9)
el = dataset.make_one_shot_iterator().get_next()

I print el (via sess.run(el)) each step; one run looks like this:
[ 5 7 8 4 1 3 14 10 17]
[ 6 16 15 20 13 24 21 2 9]
[18 19 23 25 28 12 29 30 32]
[36 33 27 39 26 31 11 34 45]
[43 40 41 38 42 50 35 22 53]
[54 48 52 49 55 56 51 58 46]
[44 37 47 57]
You can see that it tends to emit the smaller numbers first.
Since the provided dataset is ordered by label number, the batches drawn are not really random.
So now I am shuffling the data and generating the tfrecord file again (a sketch of that step follows this comment).
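
A rough sketch of that regeneration step, assuming the samples are available in memory as (encoded image bytes, label) pairs; serialize_example, the feature keys, and the placeholder data are hypothetical stand-ins for however the original tfrecord was written:

```python
import random
import tensorflow as tf  # TF 1.x

def serialize_example(img_bytes, label):
    # Hypothetical serializer: stores the encoded image bytes and an integer label.
    feature = {
        "image_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Placeholder data: replace with the real (encoded_image_bytes, label) pairs,
# which in the provided file are sorted by label.
samples = [(b"\x00" * 8, 0), (b"\x01" * 8, 1)]

random.shuffle(samples)  # shuffle once, before writing, so the file itself is in random order

with tf.python_io.TFRecordWriter("train_shuffled.tfrecord") as writer:
    for img_bytes, label in samples:
        writer.write(serialize_example(img_bytes, label))
```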

@zfs1993 commented Sep 20, 2018

> The insightface author has updated his datasets; you can download them from his GitHub. [...] So now I am shuffling the data and generating the tfrecord file again.

I followed your advice and found that if I comment out dataset.shuffle, an error occurs:

Traceback (most recent call last):
  File "train_nets4.py", line 262, in <module>
    eer = brentq(lambda x: 1. - x - interpolate.interp1d(fpr, tpr)(x), 0., 1.)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/optimize/zeros.py", line 442, in brentq
    r = _zeros._brentq(f,a,b,xtol,rtol,maxiter,args,full_output,disp)
  File "train_nets4.py", line 262, in <lambda>
    eer = brentq(lambda x: 1. - x - interpolate.interp1d(fpr, tpr)(x), 0., 1.)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/polyint.py", line 79, in __call__
    y = self._evaluate(x)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/interpolate.py", line 610, in _evaluate
    below_bounds, above_bounds = self._check_bounds(x_new)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/interpolate.py", line 642, in _check_bounds
    raise ValueError("A value in x_new is above the interpolation "
ValueError: A value in x_new is above the interpolation range.

I don't know why this happened; did you encounter this problem?
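
Not from this repo, but for context: the traceback means interp1d(fpr, tpr) is being evaluated at an x outside the observed fpr range (brentq probes the endpoints 0 and 1 first). A defensive variant of that EER computation, shown purely as a sketch, clamps out-of-range queries instead of raising:

```python
from scipy import interpolate
from scipy.optimize import brentq

def compute_eer(fpr, tpr):
    # Assumes fpr/tpr come sorted by increasing fpr (as from sklearn's roc_curve).
    # Clamp queries outside the observed fpr range to the boundary tpr values
    # instead of raising, so brentq can evaluate at x = 0 and x = 1.
    roc = interpolate.interp1d(fpr, tpr, bounds_error=False,
                               fill_value=(tpr[0], tpr[-1]))
    return brentq(lambda x: float(1. - x - roc(x)), 0., 1.)
```

This only avoids the crash; the underlying cause (for example degenerate scores early in training, or the pretrained-weight issue sirius-ai points to below) still needs to be addressed.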

@sirius-ai (Owner)

@muyoucun Thank you!
@zfs1993 Don't use the pretrained model to initialize the weights; try retraining from scratch.

@zfs1993 commented Sep 21, 2018

> @zfs1993 Don't use the pretrained model to initialize the weights; try retraining from scratch.

I trained the model from scratch. The accuracy on LFW reached 98.5% and the validation rate is 95%, but the result on AgeDB is really bad, especially the validation rate, which is about 30%; that is a huge gap.
