
how to evaluate and test afterwards #4

Closed · jackfeinmann5 opened this issue Sep 9, 2021 · 10 comments
@jackfeinmann5

Hi,
I see the running commands in the repo, but I am not sure how to automatically evaluate the checkpoint selected by the loss on the test set and report the results. Could you kindly provide one example of training and then testing on the best checkpoint automatically? Thanks.

@shmsw25 (Owner) commented Sep 9, 2021

Hi @jackfeinmann5, thanks for the question. Most hyperparams can be used with their default values. The only hyperparam chosen based on the training loss is the learning rate, and I chose it manually because there are just three choices. So I ran the training command with three different learning rate values and read the log files to see which one should be chosen -- I actually wrote a script to automate this; it's not part of the released code, but it should be super easy to write. Other things, like testing on the test set, are automatic. Let me know if any steps are unclear - would love to help!
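
For example, a minimal sketch of such a sweep (not the script actually used; the --lr flag is inferred from the Namespace dump later in this thread, and the three learning-rate values, output directories, and log paths are placeholders, using TREC as an example task):

# Hypothetical sweep: adapt the learning-rate grid and paths to your setup.
for lr in 1e-5 5e-5 1e-4; do
  mkdir -p out_lr${lr}
  python main.py --task trec --split test --data_dir data --out_dir out_lr${lr} \
      --gpt2 gpt2 --method channel --prompt_tune --do_train --lr ${lr} \
      > out_lr${lr}/log.txt 2>&1
done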

@jackfeinmann5 (Author)

Thanks. Could you kindly let me know how you would test on the selected checkpoint? Shall I run the training command and then add --do_check --split test to get the results? Thanks a lot.

@shmsw25 (Owner) commented Sep 9, 2021

Exactly! And in fact, even without --do_check it will automatically evaluate on the test set once training is finished. So what I did, just for efficiency, was as follows: (1) run training three times with three learning rates, which automatically evaluates on the test data as well; (2) write a script that reads the log files, looks at the training loss at global_step=100, finds the one with the lowest value, and returns that run's result on the test data. Does that make sense? Let me know if anything is unclear.
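
A rough sketch of step (2), assuming each run's log was redirected to <out_dir>/log.txt as in the sweep sketch above and that it prints a line containing "global_step=100" and "loss=<value>" (the pattern must be adapted to whatever main.py actually logs):

# Hypothetical log scan: the file name and the "loss=" pattern are assumptions.
for d in out_lr*; do
  loss=$(grep "global_step=100" "$d/log.txt" | grep -o "loss=[0-9.]*" | head -n 1 | cut -d= -f2)
  echo "$loss $d"
done | sort -n | head -n 1
# The test result for the winning run is reported near the end of that same log.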

@jackfeinmann5 (Author)

Hi,
Thank you so much. It looks to me like the code is written so that only the train/dev sets are considered, and that one should not set --do_train and --do_check together. I also need to compute the test performance at the same time (so train/dev/test), and most of the code (e.g. prepare_data) seems to handle either test or dev. I was not sure what the cleanest way is to get all three accuracies. Thank you so much again.

@shmsw25 (Owner) commented Sep 9, 2021

OK, so there are a few corrections I should make:

  1. It is possible to (and in fact, you must) specify --do_train and --do_check together. --do_train just means you are using the trained model (not the original GPT-2 checkpoint). If you specify --do_train but not --do_check, as in the command line in the README, it will train the model on the train data and then evaluate it on the test data. Once you have trained the model, you can add --do_check to the same command line without changing anything else (so keep specifying --do_train) in order to load the trained model and just run inference, instead of re-training the model. In other words, if you only need to run inference once right after training, you don't have to care about --do_check at all. Please use the command in the README as it is.

  2. We are not using the dev data - we only use the train data and the test data. This is mentioned in Section 5.2 of the paper, along with the detailed motivation. Here is a summary of that motivation: when you are given 16 examples, you can split them into training and validation sets as you wish (assuming a separate validation set is not a true few-shot setup, as claimed in Perez et al. 2021). But to have validation data in a 16-shot setup, the training data has to contain fewer than 16 examples, and previous work (e.g. Perez et al. 2021) has found that choosing hyperparams based on cross-validation is not much better than choosing them randomly. We therefore thought that using more datapoints for training would be much better than holding out a validation set, especially since having more training examples is so crucial for training the model (as shown in Figure 4). This is also the reason we choose the learning rate based on the training loss, not the validation loss or validation accuracy.

Given all this, the answer to your question ("what is the cleanest way to get all accuracies") is: just use the command line in the README - it will train the model on the training data and give you the test result.
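
Concretely, the flow looks like this (illustrated with the TREC channel prompt-tuning flags used elsewhere in this thread):

# Step 1: train; when training finishes, it automatically evaluates on the test split.
python main.py --task trec --split test --data_dir data --out_dir out1 \
    --gpt2 gpt2 --method channel --prompt_tune --do_train

# Step 2 (optional): re-run inference only, re-using the checkpoint saved in step 1.
python main.py --task trec --split test --data_dir data --out_dir out1 \
    --gpt2 gpt2 --method channel --prompt_tune --do_train --do_check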

@jackfeinmann5 (Author)

Hi,
Thanks a lot. Sorry, but I am really confused: from what I see, specifying both --do_train and --do_check gives an error. Thanks.

@shmsw25 (Owner) commented Sep 9, 2021 via email

@jackfeinmann5 (Author)

Hi,
Please find below the command and the output I get when using both --do_train and --do_check:

python main.py --task trec --split test --data_dir data --out_dir out1 --gpt2 gpt2 --do_train --method channel --prompt_tune --do_check
09/09/2021 20:21:23 - INFO - __main__ - Namespace(batch_size=32, data_dir='data', do_check=True, do_train=True, do_zeroshot=False, ensemble=False, gpt2='gpt2', head_tune=False, k='16', log_file=None, lr=1e-05, method='channel', n_prefix=20, out_dir='out1', prompt_tune=True, seed='100', split='test', task='trec', train_seed=1, train_task=None, transform_tune=False, use_calibration=False, use_demonstrations=False, warmup_steps=0)
09/09/2021 20:21:24 - INFO - __main__ - channel trec
09/09/2021 20:21:25 - INFO - __main__ - Checking the first example...
09/09/2021 20:21:25 - INFO - __main__ - Input:
09/09/2021 20:21:25 - INFO - __main__ - <TASK00> <TASK01> <TASK02> <TASK03> <TASK04> <TASK05> <TASK06> <TASK07> <TASK08> <TASK09> <TASK10> <TASK11> <TASK12> <TASK13> <TASK14> <TASK15> <TASK16> <TASK17> <TASK18> <TASK19> Description :
09/09/2021 20:21:25 - INFO - __main__ - Output:
09/09/2021 20:21:25 - INFO - __main__ -  How far is it from Denver to Aspen?<|endoftext|>
09/09/2021 20:21:25 - INFO - __main__ - out1/gpt2-prompt-ft/channel/trec/BS=32-k=16-t=0-seed=100-tseed=1-lr=1e-05/cache-test-100.pkl
09/09/2021 20:21:25 - INFO - __main__ - checkpoint out1/gpt2-prompt-ft/channel/trec/BS=32-k=16-t=0-seed=100-tseed=1-lr=1e-05/model-100.pt not found...
Traceback (most recent call last):
  File "main.py", line 413, in <module>
    main(logger, args)
  File "main.py", line 77, in main
    acc = run(logger, args.do_train, args.do_zeroshot,
  File "main.py", line 305, in run
    assert False
AssertionError

@shmsw25 (Owner) commented Sep 9, 2021

The log prints the error message checkpoint out1/gpt2-prompt-ft/channel/trec/BS=32-k=16-t=0-seed=100-tseed=1-lr=1e-05/model-100.pt not found..., so it looks like you have not yet run the command without --do_check before running this one. As I mentioned, you must first run the command line with --do_train but without --do_check to train the model. Once that process has completely finished, you may add --do_check in order to re-run the inference --- this is completely optional, and you should be able to get all the results you want without specifying --do_check at all.

@jackfeinmann5 (Author)

Thank you so much, I managed to run them now.
