Question about the back translations. #13

Closed
callmeYe opened this issue Nov 18, 2020 · 9 comments

Comments

@callmeYe

Can I skip data augmentation on the unlabeled data?

@jiaaoc
Member

jiaaoc commented Nov 18, 2020

We use back-translation to create paraphrases for unlabeled data and perform consistency training. You could use other ways to generate paraphrases.
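
For a concrete starting point, here is a minimal back-translation sketch in the spirit of the approach described above, not the repository's exact script. The fairseq WMT19 model names loaded through torch.hub and the sampling settings are assumptions; swap in whatever translation models you prefer.

import torch

# Pretrained English<->German translation models from torch.hub
# (requires fairseq, fastBPE and sacremoses to be installed).
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model',
                       tokenizer='moses', bpe='fastbpe')
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en.single_model',
                       tokenizer='moses', bpe='fastbpe')

def back_translate(sentence, temperature=0.9):
    # Sampling with a temperature (instead of plain beam search) tends to give
    # more varied paraphrases; the exact settings here are assumptions.
    german = en2de.translate(sentence, sampling=True, temperature=temperature)
    return de2en.translate(german, sampling=True, temperature=temperature)

print(back_translate("Can I skip data augmentation on the unlabeled data?"))

More diverse paraphrases generally give a stronger consistency-training signal, which is why sampling is used here rather than greedy decoding.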

@callmeYe
Author

callmeYe commented Nov 19, 2020

So I have to create paraphrases, right? Also, when I look at the code, I see that only the first 100,000 examples in the dataset have been back-translated. Do I not need to perform back translation for the entire dataset?

@jiaaoc
Member

jiaaoc commented Nov 19, 2020

It depends on the size of the unlabeled set you are going to use. In this work we used 100,000 unlabeled examples, so we only did back translation on those, not on the whole dataset.
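
As a hypothetical illustration of restricting the work to the subset you actually train on, the sketch below back-translates only the first 100,000 rows of the Yahoo Answers training CSV. The file path and column layout are assumptions based on the standard Yahoo Answers CSV format, not the repository's exact preprocessing.

import pandas as pd

# Only the rows you will use as unlabeled training data need paraphrases.
df = pd.read_csv('./data/yahoo_answers_csv/train.csv', header=None)
unlabeled = df.iloc[:100000]  # first 100,000 rows (assumed layout: label, title, content, answer)
texts = (unlabeled[1].fillna('') + ' ' + unlabeled[2].fillna('')).tolist()
# ...run back_translate() over `texts` and store the paraphrases for training...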

@callmeYe
Author

Sorry, I'm still a little confused.
When I test with:
python ./code/train.py --gpu 0,1 --n-labeled 10 \
    --data-path ./data/yahoo_answers_csv/ --batch-size 2 --batch-size-u 4 --epochs 20 --val-iteration 1000 \
    --lambda-u 1 --T 0.5 --alpha 16 --mix-layers-set 7 9 12 \
    --lrmain 0.000005 --lrlast 0.0005
The number of unlabeled examples per class seems to be 5,000. Do they add up to exactly 100,000?
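
As a quick back-of-the-envelope check (an assumption about the split, not taken from the repository): Yahoo Answers has 10 classes, so 5,000 unlabeled examples per class would be 10 × 5,000 = 50,000 in total, which is within the 100,000 examples that were back-translated.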

@jiaaoc
Member

jiaaoc commented Nov 19, 2020

You could use up to 100,000

@jiaaoc
Member

jiaaoc commented Nov 19, 2020

10,000

@jiaaoc
Member

jiaaoc commented Nov 19, 2020

In any case, the number of examples you need to paraphrase depends only on the number of unlabeled examples you are going to use.

@callmeYe
Author

Are they in one-to-one correspondence?

@jiaaoc
Member

jiaaoc commented Nov 19, 2020

One unlabeled example can be associated with multiple paraphrases. Please refer to the paper/code for details.
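
To make "multiple paraphrases per unlabeled example" concrete, below is a rough sketch of one common way to combine them for consistency training: average the model's predictions over the original sentence and its back-translated versions, then sharpen the result with the temperature T from the command line. This is an assumed simplification, not the repository's exact code (the paper may weight the predictions differently), and `model` returning class logits is also an assumption.

import torch
import torch.nn.functional as F

def guess_label(model, original, paraphrases, T=0.5):
    # `original` is a batch of token ids; `paraphrases` is a list of batches,
    # one per back-translated version of the same unlabeled examples.
    with torch.no_grad():
        versions = [original] + list(paraphrases)
        probs = torch.stack([F.softmax(model(x), dim=-1) for x in versions])
        avg = probs.mean(dim=0)                      # average over all versions
        sharpened = avg ** (1.0 / T)                 # temperature sharpening
        return sharpened / sharpened.sum(dim=-1, keepdim=True)

The sharpened distribution then serves as the guessed label for the unlabeled example and all of its paraphrases in the consistency loss.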

@jiaaoc jiaaoc closed this as completed Nov 19, 2020