
commands for generation #46

Closed
Harry-hash opened this issue Aug 22, 2021 · 3 comments

@Harry-hash

How should I run the code for generation tasks such as cnn-dailymail?

@timoschick
Owner

Hi @Harry-hash, first, you'll need to check out the feature/genpet branch for that. There are a couple of new features that GenPET uses which, unfortunately, are not mentioned in the arXiv version of the paper due to an ongoing anonymity period. To train a model (with all of those features enabled) using the same hyperparameters as in the paper, you can use the following command:

python3 cli.py \
	--method pet \
	--wrapper_type generative \
	--pattern_ids 2 3 4 5 \
	--data_dir . \
	--model_type pegasus \
	--model_name_or_path google/pegasus-large \
	--task_name ${TASK} \
	--output_dir ${OUTPUT_DIR} \
	--train_examples ${NUM_EXAMPLES} \
	--test_examples 10000 \
	--unlabeled_examples 1000 \
	--do_eval \
	--learning_rate 1e-4 \
	--eval_set test \
	--pet_per_gpu_eval_batch_size 32 \
	--pet_per_gpu_train_batch_size 2 \
	--pet_gradient_accumulation_steps 4 \
	--output_max_seq_length ${OUTPUT_MAX_SEQ_LENGTH} \
	--pet_max_steps 250 \
	--pet_max_seq_length 512 \
	--sc_per_gpu_train_batch_size 2 \
	--sc_gradient_accumulation_steps 4 \
	--sc_per_gpu_eval_batch_size 32 \
	--sc_max_steps 250 \
	--sc_max_seq_length 512 \
	--optimizer adafactor \
	--epsilon 0.1 \
	--do_train \
	--pet_repetitions 1 \
	--train_data_seed ${TRAIN_DATA_SEED} \
	--multi_pattern_training \
	--untrained_model_scoring \
	--cutoff_percentage 0.2

Here,

  • ${TASK} is the name of the task (e.g., cnn-dailymail, see here);
  • ${OUTPUT_DIR} is the output directory;
  • ${NUM_EXAMPLES} is the number of training examples to use (in the paper, we experimented with 0, 10 and 100);
  • ${OUTPUT_MAX_SEQ_LENGTH} is the maximum length of the generated output sequence (32 for aeslc and gigaword, 64 for xsum and 128 for all other tasks);
  • ${TRAIN_DATA_SEED} is the seed used for initializing the RNG that selects the ${NUM_EXAMPLES} training examples. In the paper, we've used 0, 42 and 100.
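As a concrete illustration (the output directory below is just an arbitrary example path), the placeholders for cnn-dailymail with 100 training examples could be set like this in the shell before running the command above:

TASK=cnn-dailymail
OUTPUT_DIR=./output/cnn-dailymail-100    # any writable directory works
NUM_EXAMPLES=100
OUTPUT_MAX_SEQ_LENGTH=128                # 32 for aeslc/gigaword, 64 for xsum
TRAIN_DATA_SEED=42                       # 0, 42 and 100 were used in the paper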

If you don't want to use the new features mentioned above, simply remove the last three lines (i.e., do not use --multi_pattern_training or --untrained_model_scoring, and do not provide a --cutoff_percentage).

@Harry-hash
Author

Thank you very much for your detailed instructions! @timoschick

But when I run the code, a lot of error messages appear in the terminal saying: "Token indices sequence length is longer than the specified maximum sequence length for this model (1070 > 1024). Running this sequence through the model will result in indexing errors". Is this because the max_length parameter is not specified somewhere during tokenization? I am using transformers==3.3.1.

@timoschick
Owner

If everything else works as expected, you can safely ignore this message. PET has its own truncation logic to ensure that the mask token and the pattern are never truncated. Before that logic is applied, the entire sequence is tokenized without any truncation, which is why some of the resulting sequences are longer than the model's maximum sequence length.
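For illustration, here is a minimal sketch (not PET's actual code, and assuming transformers plus sentencepiece are installed) of where the message comes from: the Hugging Face tokenizer prints it whenever an input is encoded without truncation and the result exceeds the tokenizer's model_max_length, even if the sequence is shortened again afterwards.

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

# An input that is clearly longer than the 1024-token limit of pegasus-large.
long_article = "This is a sentence from a very long article. " * 300

# Encoding without truncation prints the "Token indices sequence length is
# longer than the specified maximum sequence length ..." warning.
input_ids = tokenizer.encode(long_article)
print(len(input_ids), ">", tokenizer.model_max_length)

# The warning is harmless as long as the sequence is shortened before it is
# fed to the model. PET does this with its own logic so that the pattern and
# the mask token are preserved; the plain head truncation below is only a
# stand-in for that logic.
input_ids = input_ids[: tokenizer.model_max_length]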
