Experiments with NLP Data Augmentation
(some ideas from this thread in fast.ai forums)
- Noun, Verb, and Adjective replacement in the IMDB dataset, from the EDA paper (a sketch follows this list)
- Pronoun replacement to come soon
- "Back Translation": Translating from one language to another and then back to the original to utilize the “noise” in back translation as augmented text.
Character Perturbation for augmentation (more from Mike)
Surface pattern perturbation: because ULMFiT doesn't model characters, I will switch to perturbing surface patterns of word tokens instead. UNK-token perturbation will be committed soon.
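A minimal sketch of the UNK-token perturbation mentioned above, operating on fastai-style word tokens; the 5% default rate and the list of special tokens to skip are assumptions.

```python
import random

UNK = 'xxunk'  # fastai v1's unknown-token marker
# fastai v1 special tokens left untouched (list assumed from the default tokenizer rules)
SPECIALS = {'xxbos', 'xxeos', 'xxfld', 'xxpad', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', UNK}

def unk_perturb(tokens, p=0.05):
    """Randomly replace a fraction p of ordinary tokens with the UNK token."""
    return [UNK if tok not in SPECIALS and random.random() < p else tok
            for tok in tokens]

tokens = 'xxbos xxmaj this movie was surprisingly good'.split()
print(unk_perturb(tokens, p=0.2))
```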
Is CUDA dropout non-deterministic? (more from Mike)
Most cuDNN nondeterminism issues are handled implicitly by fastai.
- For dropout, a deterministic usage is what is done at fastai-1.0.55: awd_lstm.py#L105. A non-deterministic way is to apply
torch.nn.LSTM(..., dropout=dropout_prob_here,...)
directly, although it is much faster.
- Other cuDNN/PyTorch reproducibility issues are addressed in #1-control_random_factors/imdb.ipynb. They require explicit measures, as briefly described at https://docs.fast.ai/dev/test.html#getting-reproducible-results.
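A minimal sketch of the explicit measures that link describes (seeding plus cuDNN flags); the exact set of seeds a notebook needs depends on which libraries it touches.

```python
import random

import numpy as np
import torch

def make_reproducible(seed=42):
    """Seed Python, NumPy, and PyTorch, and force deterministic cuDNN behaviour."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True  # pick deterministic kernels where available
    torch.backends.cudnn.benchmark = False     # disable autotuning, which can vary between runs

make_reproducible(42)
```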
- Hyperparameters
- from the ULMFiT paper (using pre-1.0 fastai):
We use the AWD-LSTM language model (Merity et al., 2017a) with an embedding size of 400, 3 layers, 1150 hidden activations per layer, and a BPTT batch size of 70. We apply dropout of 0.4 to layers, 0.3 to RNN layers, 0.4 to input embedding layers, 0.05 to embedding layers, and weight dropout of 0.5 to the RNN hidden-to-hidden matrix. The classifier has a hidden layer of size 50. We use Adam with β1 = 0.7 instead of the default β1 = 0.9 and β2 = 0.99, similar to (Dozat and Manning, 2017). We use a batch size of 64, a base learning rate of 0.004 and 0.01 for fine-tuning the LM and the classifier respectively, and tune the number of epochs on the validation set of each task.
On small datasets such as TREC-6, we fine-tune the LM only for 15 epochs without overfitting, while we can fine-tune longer on larger datasets. We found 50 epochs to be a good default for fine-tuning the classifier.
- from fastai/course-v3/nbs/dl1/lesson3-imdb.ipynb (May 3, 2019)

| learner | bs | lr   | bptt | wd   | clip | drop_mult | to_fp16() |
|---------|----|------|------|------|------|-----------|-----------|
| lm      | 48 | 1e-2 | 70   | 0.01 | None | 0.3       | No        |
| cf      | 48 | 2e-2 | 70   | 0.01 | None | 0.5       | No        |
- from fastai/fastai/examples/ULMFit.ipynb (June 11, 2019)

Fine-tuning a forward and a backward language model to get to 95.4% accuracy on the IMDB movie reviews dataset. This tutorial is done with fastai v1.0.53.
The example was run on a Titan RTX (24 GB of RAM), so you will probably need to adjust the batch size accordingly. If you divide it by 2, don't forget to also divide the learning rate by 2 in the following cells. You can also reduce the bptt a little to save some memory.
| learner | bs  | lr   | bptt | wd   | clip | drop_mult | to_fp16() |
|---------|-----|------|------|------|------|-----------|-----------|
| lm      | 256 | 2e-2 | 80   | 0.1  | 0.1  | 1.0       | Yes       |
| cf      | 128 | 1e-1 | 80   | 0.1† | None | 0.5       | No        |

† the forward cf's wd used the default value 0.01

- from fastai/course-nlp/nn-imdb-more.ipynb (June 12, 2019)
| learner | bs  | lr         | bptt | wd   | clip | drop_mult | to_fp16() |
|---------|-----|------------|------|------|------|-----------|-----------|
| lm      | 128 | 1e-2*bs/48 | 70   | 0.01 | None | 1.0       | Yes       |
| cf      | 128 | 2e-2*bs/48 | 70   | 0.01 | None | 0.5       | Yes       |
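A minimal sketch of how hyperparameters like those tabulated above map onto fastai v1 calls, using roughly the lesson3-imdb values; `data_lm` and `data_clas` are assumed to be IMDB DataBunch objects already built with the listed bs/bptt, and the epoch counts are placeholders.

```python
from fastai.text import *  # fastai v1.0.x

# Language model fine-tuning: lr=1e-2, wd=0.01, drop_mult=0.3
# (append .to_fp16() to the learner for the mixed-precision rows).
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7), wd=0.01)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7), wd=0.01)
learn_lm.save_encoder('fine_tuned_enc')

# Classifier fine-tuning: lr=2e-2, wd=0.01, drop_mult=0.5, with gradual unfreezing as in ULMFiT.
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('fine_tuned_enc')
learn_clas.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7), wd=0.01)
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(2e-2 / (2.6 ** 4), 2e-2), moms=(0.8, 0.7), wd=0.01)
```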
- Use the data augmentation techniques in succession on the IMDB classification task and report performance
- Make use of a baseline translation model from OpenNMT for the "back translation" task (a rough sketch appears after this list)
- Blog post on research directions for data augmentation in NLP
- For visual diff between Jupyter notebooks, try https://app.reviewnb.com
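A rough sketch of the "back translation" augmentation from the list above. Purely for illustration it uses the Helsinki-NLP MarianMT checkpoints from Hugging Face transformers rather than an OpenNMT baseline, so the model names and the transformers dependency are assumptions.

```python
from transformers import MarianMTModel, MarianTokenizer

def load_pair(name):
    """Load a translation model and its tokenizer from the Hugging Face hub."""
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

def translate(texts, tok, model):
    batch = tok(texts, return_tensors='pt', padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tok.decode(g, skip_special_tokens=True) for g in generated]

# English -> French -> English round trip; the "noise" introduced is the augmented text.
en_fr_tok, en_fr = load_pair('Helsinki-NLP/opus-mt-en-fr')
fr_en_tok, fr_en = load_pair('Helsinki-NLP/opus-mt-fr-en')

reviews = ["I was pleasantly surprised by how good this movie turned out to be."]
augmented = translate(translate(reviews, en_fr_tok, en_fr), fr_en_tok, fr_en)
print(augmented)
```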