Accuracy does not match the log when loading the model #17

Open · CaffreyR opened this issue Jul 22, 2022 · 10 comments

CaffreyR commented Jul 22, 2022

Hi @muqeeth @dptam @craffel, when I set eval_epoch_interval=1, I get accuracy values in my log, and I save the model and checkpoints. But when I reload a saved model, its accuracy does not match the one in the log.
[screenshots]
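
For context, this is the behaviour I expect from a save/load round trip (a toy sanity check, not the repo's model):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 2)
x = torch.randn(4, 8)

# save and reload the weights exactly as during training
torch.save(model.state_dict(), "ckpt.pt")
reloaded = torch.nn.Linear(8, 2)
reloaded.load_state_dict(torch.load("ckpt.pt"))
reloaded.eval()

# identical weights must give identical outputs; a mismatch would point
# at evaluation-time state (dropout mode, precision, data order) instead
assert torch.equal(model(x), reloaded(x))
```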

dptam (Collaborator) commented Jul 22, 2022

Hello,

To clarify, are you loading the model at step 67? Is the performance of the model when you load the checkpoint 53? And is the performance of the checkpoint in the log 58?

CaffreyR (Author) commented Jul 22, 2022

Hi @dptam, the step is actually 75. As you can see from the log here, line 20 (epoch 19) reports 0.5812.
[screenshot]
And when I run this code, the resulting accuracy is 0.5848.
[screenshot]

BTW, step 79 gives 0.5631.

The same thing happens on the COPA dataset: line 221 reports 0.62.
[screenshot]
But when I run steps 883 and 887, the result is 0.54, and step 879 gives 0.55.
[screenshots]

dptam (Collaborator) commented Jul 22, 2022

I'm not sure what the issue is. If you don't mind, could you rerun and add self.global_step to the metrics dictionary here? That should output a global step in the log that matches the global step used to save the model, just to make sure the line number corresponds to the correct checkpoint.
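
Something like this (a rough sketch, assuming your validation hook assembles a metrics dict before logging; the hook and helper names may differ from the repo):

```python
# Sketch: tag every validation log line with the global step so each log
# entry can be matched to the checkpoint saved at that step.
# `compute_metrics` is a placeholder for the existing metric computation.
def validation_epoch_end(self, outputs):
    metrics = self.compute_metrics(outputs)
    metrics["global_step"] = self.global_step  # LightningModule attribute
    self.log_dict(metrics)
```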

CaffreyR (Author) commented

Hi @dptam, actually when I try to run finish.pt, its accuracy does not match the last accuracy in the log.

[screenshots]

CaffreyR (Author) commented

Is there something wrong with the code? @muqeeth @jmohta @HaokunLiu

CaffreyR (Author) commented

@dptam I have added the global step as you suggested, but it still does not match.
[screenshots]

HaokunLiu (Collaborator) commented Jul 22, 2022

What is in pl_test.py? Would you mind sharing what you have there?

dptam (Collaborator) commented Jul 23, 2022

Hello,

Thanks for rerunning the code. I'm still not sure why loading and rerunning the model doesn't match the log performance - could you share the command used to train the model?

Regarding the issue of finish.pt not matching the last accuracy in the log, see #11 for more details on why.

CaffreyR (Author) commented

Hi @HaokunLiu @dptam, pl_test.py is actually just a copy of pl_train.py, except for the loading method. I used both your save-model method and the PyTorch Lightning checkpoint method (see the sketch at the end of this comment).
[screenshot]
And I changed encoderdecoder.py a little bit:
[screenshot]

But here is the thing: the train command is as below.
[screenshot]

And the test command is as below; pl_train and pl_test actually give the same result.
[screenshot]

And here is the log, using not finish.pt but the step-51 checkpoint, as @dptam suggested:
[screenshot]
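
Roughly, the two loading paths look like this (a sketch; the paths are placeholders from my local setup, and the EncoderDecoder constructor arguments are elided):

```python
import torch

# 1) the repo's own save format: a plain state dict written with torch.save
model = EncoderDecoder(config)  # same config as training; extra args elided
model.load_state_dict(torch.load("exp_out/global_step51.pt"))

# 2) a PyTorch Lightning checkpoint
model = EncoderDecoder.load_from_checkpoint("exp_out/last.ckpt")
```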

dptam (Collaborator) commented Jul 26, 2022

Hi,

I tried to look into it a bit and couldn't figure out the cause, but I did find one issue, at least for me (not sure if it will be the same for you). Sorry I don't have more time to look into it right now, but maybe you can.

When using t5-small and printing the norm of self.model.lm_head.weight, it is 94070 in the training_step function but 94072 in the predict function. This is due to precision issues when moving from CPU to GPU, and one remedy was adding self.weight = torch.clone(self.model.lm_head.weight).double().cuda().float() at the end of the __init__ function of EncoderDecoder.py and adding self.model.lm_head.weight = torch.nn.Parameter(self.weight) at the beginning of the training_step function.

This makes the norm of self.model.lm_head.weight consistently 94070 in both training_step and predict, but the accuracy from the log and from loading a validation checkpoint still does not match. I'm not sure why, but one avenue for further analysis is to look at the model's other weights.
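
Concretely, the remedy looks like this (a sketch of the two additions inside EncoderDecoder; the surrounding code is elided):

```python
# at the end of __init__: round-trip the lm_head weight through double
# precision on the GPU and keep the copy
self.weight = torch.clone(self.model.lm_head.weight).double().cuda().float()

# at the beginning of training_step: restore the stashed copy so
# training_step and predict see identical values
self.model.lm_head.weight = torch.nn.Parameter(self.weight)
```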
