
how to evaluate and test afterwards #4

Closed · jackfeinmann5 opened this issue Sep 9, 2021 · 10 comments
@jackfeinmann5

Hi,
I see the running commands in the repo, but I am not sure how to automatically evaluate the checkpoint selected by the loss on the test set and report the results. Could you kindly provide one example of training and then testing on the best checkpoint automatically? Thanks.

@shmsw25 (Owner) commented Sep 9, 2021

Hi @jackfeinmann5, thanks for the question. Most hyperparams can be used with their default values. The only hyperparam chosen based on the training loss is the learning rate, and I chose it manually because there are just three choices. So I ran the training command with three different learning rate values and read the log files to see which one should be chosen -- I actually wrote a script to automate this; it's not part of the released code, but it should be super easy to write. Other things, like testing on the test set, are automatic. Let me know if any steps are unclear - would love to help!
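
For example, a minimal sketch of such a sweep (not the script actually used; the --lr flag is inferred from the Namespace dump later in this thread, and the three learning-rate values, output directories, and log paths are placeholders, using TREC as an example task):

# Hypothetical sweep: adapt the learning-rate grid and paths to your setup.
for lr in 1e-5 5e-5 1e-4; do
  mkdir -p out_lr${lr}
  python main.py --task trec --split test --data_dir data --out_dir out_lr${lr} \
      --gpt2 gpt2 --method channel --prompt_tune --do_train --lr ${lr} \
      > out_lr${lr}/log.txt 2>&1
done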

@jackfeinmann5 (Author)

Thanks. Could you kindly let me know how you would test on the selected checkpoint? Shall I run the training command and then add --do_check --split test to get the results? Thanks a lot.

@shmsw25 (Owner) commented Sep 9, 2021

Exactly! And in fact, even without --do_check it will automatically evaluate on the test set once training is finished. So what I did, just for efficiency, was as follows: (1) run training three times with three learning rates, which automatically evaluates on the test data as well; (2) write a script that reads the log files, looks at the training loss at global_step=100, finds the one with the lowest value, and returns that run's result on the test data. Does that make sense? Let me know if anything is unclear.
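
A rough sketch of step (2), assuming each run's log was redirected to <out_dir>/log.txt as in the sweep sketch above and that it prints a line containing "global_step=100" and "loss=<value>" (the pattern must be adapted to whatever main.py actually logs):

# Hypothetical log scan: the file name and the "loss=" pattern are assumptions.
for d in out_lr*; do
  loss=$(grep "global_step=100" "$d/log.txt" | grep -o "loss=[0-9.]*" | head -n 1 | cut -d= -f2)
  echo "$loss $d"
done | sort -n | head -n 1
# The test result for the winning run is reported near the end of that same log.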

@jackfeinmann5 (Author)

Hi,
Thank you so much. It looks to me like the code is written so that only the train/dev sets are considered, and that one should not set --do_train and --do_check together. I also need to compute the test performance at the same time (so train/dev/test), and most of the code (e.g. prepare_data) seems to handle either test or dev. I was not sure what the cleanest way is to get all three accuracies. Thank you so much again.

@shmsw25 (Owner) commented Sep 9, 2021

OK, so there are a few corrections I should make:

  1. It is possible to (and in fact, you must) specify --do_train and --do_check together. --do_train just means you are using the trained model (not the original GPT-2 checkpoint). If you specify --do_train but not --do_check, as in the command line in the README, it will train the model on the train data and then evaluate it on the test data. Once you have trained the model, you can add --do_check to the same command line without changing anything else (so keep specifying --do_train) in order to load the trained model and just run inference, instead of re-training the model. In other words, if you only need to run inference once right after training, you don't have to care about --do_check at all. Please use the command in the README as it is.

  2. We are not using the dev data - we only use the train data and the test data. This is mentioned in Section 5.2 of the paper, along with the detailed motivation. Here is a summary of that motivation: when you are given 16 examples, you can split them into training and validation sets as you wish (assuming a separate validation set is not a true few-shot setup, as claimed in Perez et al. 2021). But to have validation data in a 16-shot setup, the training data has to contain fewer than 16 examples, and previous work (e.g. Perez et al. 2021) has found that choosing hyperparams based on cross-validation is not much better than choosing them randomly. We therefore thought that using more datapoints for training would be much better than holding out a validation set, especially since having more training examples is so crucial for training the model (as shown in Figure 4). This is also the reason we choose the learning rate based on the training loss, not the validation loss or validation accuracy.

Given all this, the answer to your question ("what is the cleanest way to get all accuracies") is: just use the command line in the README - it will train the model on the training data and give you the test result.
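
Concretely, the flow looks like this (illustrated with the TREC channel prompt-tuning flags used elsewhere in this thread):

# Step 1: train; when training finishes, it automatically evaluates on the test split.
python main.py --task trec --split test --data_dir data --out_dir out1 \
    --gpt2 gpt2 --method channel --prompt_tune --do_train

# Step 2 (optional): re-run inference only, re-using the checkpoint saved in step 1.
python main.py --task trec --split test --data_dir data --out_dir out1 \
    --gpt2 gpt2 --method channel --prompt_tune --do_train --do_check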

@jackfeinmann5 (Author)

Hi,
Thanks a lot. Sorry, but I am really confused: from what I see, specifying both --do_train and --do_check gives an error. Thanks.

@shmsw25 (Owner) commented Sep 9, 2021 via email

@jackfeinmann5 (Author)

Hi,
Please find below the command and the output I get when using both --do_train and --do_check:

python main.py --task trec --split test --data_dir data --out_dir out1 --gpt2 gpt2 --do_train --method channel --prompt_tune --do_check
09/09/2021 20:21:23 - INFO - __main__ - Namespace(batch_size=32, data_dir='data', do_check=True, do_train=True, do_zeroshot=False, ensemble=False, gpt2='gpt2', head_tune=False, k='16', log_file=None, lr=1e-05, method='channel', n_prefix=20, out_dir='out1', prompt_tune=True, seed='100', split='test', task='trec', train_seed=1, train_task=None, transform_tune=False, use_calibration=False, use_demonstrations=False, warmup_steps=0)
09/09/2021 20:21:24 - INFO - __main__ - channel trec
09/09/2021 20:21:25 - INFO - __main__ - Checking the first example...
09/09/2021 20:21:25 - INFO - __main__ - Input:
09/09/2021 20:21:25 - INFO - __main__ - <TASK00> <TASK01> <TASK02> <TASK03> <TASK04> <TASK05> <TASK06> <TASK07> <TASK08> <TASK09> <TASK10> <TASK11> <TASK12> <TASK13> <TASK14> <TASK15> <TASK16> <TASK17> <TASK18> <TASK19> Description :
09/09/2021 20:21:25 - INFO - __main__ - Output:
09/09/2021 20:21:25 - INFO - __main__ -  How far is it from Denver to Aspen?<|endoftext|>
09/09/2021 20:21:25 - INFO - __main__ - out1/gpt2-prompt-ft/channel/trec/BS=32-k=16-t=0-seed=100-tseed=1-lr=1e-05/cache-test-100.pkl
09/09/2021 20:21:25 - INFO - __main__ - checkpoint out1/gpt2-prompt-ft/channel/trec/BS=32-k=16-t=0-seed=100-tseed=1-lr=1e-05/model-100.pt not found...
Traceback (most recent call last):
  File "main.py", line 413, in <module>
    main(logger, args)
  File "main.py", line 77, in main
    acc = run(logger, args.do_train, args.do_zeroshot,
  File "main.py", line 305, in run
    assert False
AssertionError

@shmsw25 (Owner) commented Sep 9, 2021

The log prints the error message checkpoint out1/gpt2-prompt-ft/channel/trec/BS=32-k=16-t=0-seed=100-tseed=1-lr=1e-05/model-100.pt not found..., so it looks like you have not yet run the command without --do_check before running this one. As I mentioned, you must first run the command line with --do_train but without --do_check to train the model. Once that process has completely finished, you may add --do_check in order to re-run the inference --- this is completely optional, and you should be able to get all the results you want without specifying --do_check at all.

@jackfeinmann5 (Author)

Thank you so much, I managed to run them now.
