Expose BacktranslationDataset from fairseq.data #324
Closed
dpacgopinath wants to merge 1 commit into facebookresearch:master from dpacgopinath:export-D10412717
Conversation
Summary: BacktranslationDataset was introduced recently but was not exposed as part of the fairseq.data module.
Reviewed By: liezl200
Differential Revision: D10412717
fbshipit-source-id: 55df9c2491f5583e807f9c6c6f3b37ddf7622e63
jxhe pushed a commit to salesforce/ctrl-sum that referenced this pull request on Nov 3, 2020
Architecture settings and readme updates
More fixes
Small fix
Better training support when GPUs are in "exclusive mode"
Issue #2, Checking size attribute of dst when dst is None
Fix generation when vocabulary is small relative to beam size (fixes #7)
Fix handling of partially-empty initial batch (#11)
Refactor PaddingCollater
Update progress_bar to be more robust to changes in tqdm (#21)
Fix call ordering to ATen addmm and sum (#22)
BPE transformation for IWSLT
Don't generate during training, add --quiet to generate.py
Ignore invalid sentences in test and valid
Update README.md
Update PyTorch install instructions
Update README.md
Don't suggest Miniconda (see #24)
Fix --no-progress-bar option in generate.py (#115)
Update En2Fr model
Move helper functions from generate.py to fairseq/dictionary.py
Support configurable BPE symbol
Fix flake8 warnings
Don't save/restore convolutional layers in incremental inference
Allow --max-len-a to be a float
Add optimizer history to checkpoints (and rearrange criterions slightly)
Better logging from criterions
Add support for NCCL v2
Add attention matrix to output of SequenceGenerator
Ignore generated files for temporal convolution tbc
Fix smoothed (sentence-level) BLEU calculation
More flexible gradient normalization
Refactor model saving/loading to be more reusable
Refactor code in Tokenizer
Add support for additional optimizers
Simplify deps of build_model to only depend on dict (instead of dataset)
Fix language inference in generate.py
Fix handling of continuation tokens that precede <unk> in generate.py
Prevent math overflow when loss is too high
Set seed after each epoch to improve consistency when resuming
Fix for building under clang: specify C++ build and use C++ linkage (#42)
Update README with note about Docker (#49)
Force UTF-8 encoding for dictionary files (#41)
Only save most recent optimizer state in checkpoints (#53)
Only consider EOS in beam search if it's among top-k candidates
Fix description for `--sample-without-replacement` option
Support custom dictionary in preprocess.py
Add `--curriculum` option
Refactor model definitions
* Move some functionality out of FConvModel into FairseqModel base class
* Move incremental decoding functionality into FairseqIncrementalDecoder module
* Refactor positional embeddings to be more specific to FConvModel
Added -unkpen flag to generate.py following logic of Lua/Torch version
Support different max_source_positions and max_target_positions
Fix call to non-existing to_string method
Fix seed so that data is properly shuffled between epochs
Upgrade args with max_source_positions and max_target_positions
Refactor generation
* Split generate.py into generate.py and interactive.py and refactor code
The main motivation behind these changes is to decouple these use
cases in order to implement future improvements such as unk replacement
with the original string during evaluation on test, and writing predictions
to an output file.
The previous implementation worked well, but I found it difficult to
integrate these future improvements.
* Add --replace-unk arg to be used without align dict
Replacing <unk> tokens can be beneficial even without an alignment
dictionary.
Left pad source and right pad target
Improvements to data loader
Fix interactive.py
Use `--lrshrink` as the reduction factor in ReduceLROnPlateau
Fix flake8 lint
Add dim to F.softmax calls
Update README with interactive.py and fix it
Add --max-sentences option for batching based on # sentences
Loop over evaluation dataloader in descending order
Replace unk with original string
* Add <eos> for unk replacement
* Add IndexedRawTextDataset to load raw text files
* Replace unk with original string
* Add load_raw_text_dataset() and --output-format
* Move has_binary_files to data.py
Revert `dim` in `F.softmax` for backwards compatibility
Rename LabelSmoothedCrossEntropy to LabelSmoothedNLLLoss
Add LSTM
Don't call forward directly (prefer module(x) to module.forward(x))
Add `--log-format` option and JSON logger
Fix max_positions_valid in train.py
Fixes for `--log-format`
Fix all-reduce for new versions of PyTorch
We previously assumed that once a model parameter's gradient buffer was allocated, it stayed fixed during training.
However, this assumption is violated in recent versions of PyTorch (i.e., the gradient buffer may be reallocated during
training), and it's no longer a safe assumption to make.
This is primarily relevant when we do the all-reduce, since we all-reduce a flattened (i.e., contiguous) copy of the
gradients. We can make this more robust by copying the result of the all-reduce back into the model parameter's gradient
buffers after each update. Intra-device copies are cheap, so this doesn't affect performance.
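A minimal sketch of this copy-back pattern (illustrative only; it assumes torch.distributed is already initialized and differs from fairseq's actual trainer code in detail):

```python
import torch
import torch.distributed as dist

def all_reduce_and_copy_back(params):
    # flatten all gradients into one contiguous buffer and all-reduce it
    grads = [p.grad.data for p in params if p.grad is not None]
    flat = torch.cat([g.view(-1) for g in grads])
    dist.all_reduce(flat)
    # copy the reduced values back into each parameter's gradient buffer so
    # later reallocations of .grad cannot silently drop the synchronized values
    offset = 0
    for g in grads:
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()
```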
Version 0.1.0 -> 0.2.0
Release notes:
- 5c7f495: Added simple LSTM model with input feeding and attention
- 6e4b7e2: Refactored model definitions and incremental generation to be cleaner
- 7ae79c1: Split interactive generation out of generate.py and into a new binary: interactive.py
- 19a3865: Subtle correctness fix in beam search decoder. Previously, for a beam size of k, we might emit a hypothesis
if the <eos> was among the top 2*k candidates. Now we only emit hypotheses for which the <eos> is among the
top-k candidates. This may subtly change generation results, and in the case of k=1 we will now produce
strictly greedy outputs.
- 97d7fcb: Fixed bug in padding direction, where previously we right-padded the source and left-padded the target. We
now left-pad the source and right-pad the target. This should not affect existing trained models, but may
change (usually improves) the quality of new models.
- f442f89: Add support for batching based on the number of sentences (`--max-sentences`) in addition to the number of
tokens (`--max-tokens`). When batching by the number of sentences, one can optionally normalize the gradients
by the number of sentences with `--sentence-avg` (the default is to normalize by the number of tokens).
- c6d6256: Add `--log-format` option and JSON logger
Fallback to `--log-format=simple` for non-TTY terminals
Fix Flake8
Flush non-TTY logging output after each log interval
Make LSTM backwards compatible and fix incremental generation
Remove Python 3.6 format strings (fixes #55)
Remove more Python 3.6 format strings (fixes #57) (#58)
Remove Python3.6 format string from preprocess.py (fixes #60) (#61)
Update requirements.txt and fix flake8 (#62)
fix bug in lstm model (#68)
Fixed 2 typos (#75)
Improve error when resuming training with a different model architecture
Improve memory handling (recover from OOM and periodically empty caching allocator)
Allow --lr to specify a fixed learning rate schedule
Prefer command-line configuration over checkpoint for optimizer state
Save number of GPUs in args (and checkpoints)
Fix weight norm dimension in decoder (fixes #73)
Rebuild optimizer when loading checkpoints
Fix conv padding for even kernel widths
Directly decay weight instead of L2 penalty (#157)
See https://arxiv.org/pdf/1711.05101.pdf
Fix generation bug with large beam sizes (>50)
Add support for sharded generation
Fix BeamableMM
Better error message for --decoder-attention
Minor fix for strip_pad functions
Support deprecation of volatile Variables in latest PyTorch
Add explicit dimension to softmax calls
Output number of model parameters in train.py
Raise FileNotFoundError if dictionary files don't exist
Add reduce kwarg to criterions
Streamline data formatting utils
Add --max-sentences-valid to train.py
Add option to SequenceGenerator to retain dropout
Fix warning about deprecated `volatile` kwarg for Variables
Move positional embeddings into LearnedPositionalEmbedding module
Move normalization of model output (e.g., via LSM) into model definition
Fix LearnedPositionalEmbedding
Fix gradient clipping when --clip-norm=0
Save dictionary in model base classes
Fix training
Better support for torch.no_grad (since volatile is deprecated)
Share input/output embed
Report log likelihood for label smoothing
Momentum correction
ATen Fix
Better warning message for inputs that are too long
Fix max_positions calculation in train.py
Output correct perplexity when training with --sentence-avg
Fix tests
Fixed Weight Decay Regularization in Adam
See https://arxiv.org/abs/1711.05101
Ratio should be predlen/reflen not reflen/predlen
To be compatible with multi-bleu.
This seems to only affect the result_string.
Prepare scripts for WMT14
Switch to news-commentary-v12
Adding README and more parameters to En2De script
Update README with new models
spelling
Adjust weight decay by the current learning rate to make it work correctly during annealing
Allow larger maxlen (fixes #100) (#101)
fairseq-py goes distributed (#106)
This PR includes breaking API changes to modularize fairseq-py and adds support for distributed training across multiple nodes.
Changes:
- c7033ef: add support for distributed training! See updated README for usage.
- e016299: modularize fairseq-py, adding support for register_model, register_criterion, register_optimizer, etc.
- 154e440: update LSTM implementation to use PackedSequence objects in the encoder, better following best practices and improving perf
- 90c2973 and 1da6265: improve unit test coverage
Add OOM counter back to logging output
Fix tests and flake8
More unit test fixes
Add support for prefixes (#221)
* Add prefix
* Fixes
* Keep original scores with prefix
* Improve prefix code
* Replace 'repeat' with 'expand'
pytorch update: no need to rewrap variable in backward()
Fix LabelSmoothedCrossEntropy test
Refactor incremental generation to be more explicit and less magical (#222)
Making our code compatible with the latest pytorch (#223)
* Making our code compatible with the latest pytorch
* revert
* torch.nn.utils.clip_grad_norm now returns tensor
More fixes for recent PyTorch (incl. topk issue) (#113)
More updates for PyTorch (#114)
Use ATen built-in conv_tbc method (#66)
Remove custom ConvTBC code
Small fixes
Filter padding properly in LabelSmoothedCrossEntropyCriterion (#229)
Allow more flexible pre-processing and generation (#227)
* Allow more flexible pre-processing and generation
* Addressing CR comments
* small fix
Enforce upper-bound on maximum generation length (#121)
fix typo in data/README
Change "awailable" to "available".
fix typo in data/README (#131)
Change "awailable" to "available".
Update training commands
Update training commands in data/README to match the latest version of this project according to #132.
- Motivation: in the previous data/README, the commands are obsolete and will cause the error "unrecognized arguments: --label-smoothing 0.1 --force-anneal 50".
- What's changed: add arguments "--criterion label_smoothed_cross_entropy" and "--lr-scheduler fixed" to the training commands of all 3 datasets.
- Result: the new commands run without error on all 3 datasets.
Update training commands
Update training commands in data/README to match the latest version of this project according to #132.
Continue from 3c07295885c6283def573e7a6811464f250c3b28: add omitted "\".
Update training command for IWSLT14
specify a single GPU setup for IWSLT14
Merge internal changes (#136)
Changes:
- 7d19e36: Add `--sampling` flag to generate.py to sample instead of doing beam search
- c777340: Add `scripts/average_checkpoints.py` to average multiple checkpoints into a combined model
- 3ea882c: Add `--max-update` option to train.py to stop training after a given number of updates
- small bugfixes for distributed training, LSTM, inverse square root LR scheduler
make interactive mode print out alignment nicely
Disallow --batch-size in interactive.py
Update README.md
use implicit padding when possible (#152)
Add pretrained embedding support (#151)
Flake8
Fix old model checkpoints after #151 (fixes #156) (#157)
Update dataset code for use by https://github.com/pytorch/translate/pull/62 (#161)
Merge internal changes (#163)
0.4.0 -> 0.5.0
Remove sweep_log prefix from json progress bar
Faster fconv generation
Fix LSTM
fix optim history
address comments
Add Transformer model
Remove Google batching strategy (it's not needed)
Pass args around to cleanup parameter lists
Bug fixes
Fix flake8
Fix buffers in sinusoidal positional embeddings
caching v3 (cache keys, values, process only last time step) (#241)
- process only last time step during generation
- cache keys and values
- don't apply masking during generation
smarter way to avoid applying encoder key mask
Use PyTorch LayerNorm and improve weight init
More improvements to weight init and FP16 support
Simulated big batches
Use FP32 for multi-head attention softmax
better batching
Improve dataloader speed and deprecate concept of batch_offset (use --sample-without-replacement instead)
Allow schedule for update-freq
Fix batching during generation
Add FP16 support
Make dictionary size a multiple of 8
Revert "Make dictionary size a multiple of 8"
This reverts commit b2e119c209363e6ff6d2878a69c7d1a507a2e9be.
Pad dictionary to be a multiple of 8 in preprocessing
No more magical --fp16
remove completed sentences from batch
remove completed sentences from batch and allow batching uneven lengths (with fixes to make padded sequences work correctly in all models)
Fix Flake8
Small optimization for LSTM
Fix preprocess.py
Use eval() to parse args.lr
Fix embedding initialization for padding
Simplify train.py (merge with singleprocess_train.py)
Save and restore wall time in checkpoints
Support --warmup-updates with fixed LR schedule
Fix tests
Remove src-padding from generation output
make sure tensor used to index is cuda if on gpu
Fix --prefix-size
fix to adding tokens to dictionary while thresholding
make attn dropout 0.1 default for big en-de transformer
add support for averaging last n checkpoints
fix flag copy paste (decoder-normalize-before)
Remove padding from --score-reference
Fix --remove-bpe to strip trailing BPE symbols
Sampling doesn't work with interactive
implement batching in interactive mode
fix alignment when using uneven batches and left pad
Support integer learning rates
allow specifying max_tokens for generation
also report sentence/s timing when generating
default dropout to correct value for big transformer
ability to checkpoint when reaching certain number of updates
All-reduce in FP16
remove unused verbose option & make arguments to averaging script nicer
allow overwriting args for different architectures
Fix tests
use implicit padding when possible
Add pretrained embedding support
Flake8
Fix old model checkpoints
Merge OSS + internal changes
Conv lm implementation
This implements convolutional language model from https://arxiv.org/pdf/1612.08083.pdf
There are 3 modes for constructing batches (a small sketch follows the list):
- token block: fill each sample with a specified number of tokens without regard for sentence delimiters - this is what was used for training in the paper
- complete: fill each sample with a specified number of tokens but make sure it contains only complete sentences (i.e. if next sentence goes over token block limit, move it to the next sample) - this was used for evaluation in the paper
- eos: one sentence per sample (skip blank lines)
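A toy sketch of these three modes over pre-tokenized sentences (illustrative; not the actual fairseq batching code):

```python
def make_samples(sentences, block_size, mode):
    if mode == "eos":
        # one sentence per sample, skipping blank lines
        return [s for s in sentences if s]
    if mode == "token_block":
        # fill fixed-size blocks without regard for sentence boundaries
        stream = [tok for s in sentences for tok in s]
        return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
    if mode == "complete":
        # fill blocks with whole sentences only; an overflowing sentence starts a new sample
        samples, current = [], []
        for s in sentences:
            if current and len(current) + len(s) > block_size:
                samples.append(current)
                current = []
            current = current + s
        if current:
            samples.append(current)
        return samples
    raise ValueError(mode)
```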
some results:
GCNN-13 - GBW - 37.46
GCNN-14B - GBW - 33.88
GCNN-8 - Wiki103 - 43.76
GCNN-14 - Wiki103 - 35.66
train:
python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 35 --save-interval 1 --save-interval-updates 1000 --keep-interval-updates 25 --arch fconv_lm --optimizer nag --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 --decoder-embed-dim 280 --decoder-layers '[(850, 6)] * 3 + [(850,1)] + [(850,5)] * 4 + [(850,1)] + [(850,4)] * 3 + [(1024,4)] + [(2048, 4)]' --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion cross_entropy --max-tokens 1024 --max-target-positions 1024 --seed 1 --log-format json --log-interval 500
eval:
python eval_lm.py ~abaevski/data/wiki103 --path '/checkpoint02/abaevski/2018-04-27/lm_wiki.fp16.mxup300000.fconv.adam.lrs=reduce_lr_on_plateau.emb280.layers(850,6)*3+(850,1)+(850,5)*4+(850,1)+(850,4)*3+(1024,1)+(2048,4).lr0.0005.clp0.1.drp0.3.wd0.0.crt=cross_entropy.mxtk2048.smptk256.seed1.ngpu8/checkpoint_last.pt'
default normalization constant for older models
add big en_fr transformer architecture
Generalize eval_str_list
fix restoring from middle of epoch; fix defaulting transformer dropout params
record end_of_epoch in checkpoint
use adaptive softmax only with adaptive loss
fix default params
added multiscale gated self attention layer with multiple heads, and pretrained fusion models
modified writing prompts model parameters to make readme cleaner
minor parameter fixes for stories model
save best val loss in checkpoint
save best val loss in checkpoint and also print best so far
this way, when training continues from an existing checkpoint, we don't immediately override checkpoint_best with a worse loss
fix model loading in eval_lm
Nits
Migrate all binaries to use options.parse_args_and_arch
Unify various sharding into ShardedIterator
Use symlinks for redundant checkpoints
Merge validate and val_loss functions (simplify train.py)
create examples dir and add conv lm + stories readme
Small fixes
Suppress stdout in test_train
Add more integration tests (LM, stories, transformer, lstm)
Update README.md
build optimizer only once, otherwise it leaks cuda memory
initialize normalization constant for fconv_lm
Fix length penalty when combined with --no-early-stop
Co-authored-by: pmichel31415 <pmichel@fb.com>
torch.arange's default return type was changed in the latest pytorch version: https://github.com/pytorch/pytorch/pull/7016
Add FairseqTask
A Task defines the data format, stores shared state (e.g., dictionaries) and provides helpers for building the model/criterion and calculating the loss.
Changes:
- Add TranslationTask and LanguageModelingTask. New tasks can be registered with the @register_task decorator (a minimal sketch follows this list).
- Add EpochBatchIterator to encapsulate batching and saving/restoring dataloader position
- Remove LEFT_PAD_* constants and make them configurable per task
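A minimal sketch of registering a new task under this API (the decorator and base class come from fairseq; the method bodies and argument names here are illustrative):

```python
from fairseq.tasks import FairseqTask, register_task

@register_task("my_translation")
class MyTranslationTask(FairseqTask):
    @staticmethod
    def add_args(parser):
        # task-specific command-line arguments
        parser.add_argument("data", help="path to the data directory")

    @classmethod
    def setup_task(cls, args, **kwargs):
        # load dictionaries and other shared state here
        return cls(args)
```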
Updates for latest PyTorch
Fix tests
Fix bidirectional lstm
Faster generation when using a single model (rather than ensemble)
Change --path to be colon-separated instead of comma-separated
add default architecture for gbw fconv lm
add links to pretrained language models
Update README.md
Add transformer models and replace list with table
Update README.md
Fix preprocessed test set download links
add count for padding words (#180)
gzip instead of bzip change to stories download
added clarification on the newline token we model
fixed newline word not appearing
Fix translation README (fixes #186) (#189)
Fix `--output-format raw` option to preprocess.py (Fixes #188) (#190)
Two tiny changes to train/eval_lm. For train, fix an off-by-one; for eval_lm, make it work when the task is translation.
Ignore files in examples (other than .sh and .md)
When downloading files in examples directory (e.g. when running
`prepare-iwslt14.sh`), git sees them as untracked files but they should
not be committed.
Add a .gitignore file that ignores everything in the examples
subdirectories except for .sh and .md files.
Better failure message when loss explodes during FP16 training
Store full checkpoints instead of symlinking
Fix a bug with using GloVe 840B tokens for initialization.
respect max tokens and ignore invalid inputs when evaluating lm
sort descending when evaluating lm because it is faster (17k wps vs 11k) and will fail early if oom
Support pretrained embeddings for Transformer.
Also show a nicer error message.
Support FP16 during inference
fix sinusoidal embedding init size
add sinusoidal pos initialization
Fix interpretation of --max-epoch
Move reorder_encoder_out to FairseqEncoder and fix non-incremental decoding
Add steps to reproduce WMT En-De results from Scaling NMT paper
Fix typo
Fix for Dictionary.finalize
Misc changes for pytorch-translate
Fix attention order in unit tests (fixes #195) (#197)
Remove more Variable() calls (#198)
Remove unnecessary assert (fixes #199) (#200)
Fix preprocessing for WMT14 En-De to replicate Scaling NMT paper (#203)
adding pretrained stories model
fix decoder_normalize_before typo (#205)
add model override argument from load_ensemble_for_inference at generation time, updating readme for stories
adding model arg override at generation time for interactive.py
assert that vocab size >= adaptive softmax cutoff (#214)
Fix up model defaults (#211)
Pass sampling-temperature through to the generator in interactive.py
stories data preprocessing needs padding factor 1 to match pretrained model, updating readme
fixed output_proj's input_dim in attention (#226)
fix raw text for language modeling
make model access saner
fix token block rotation
Support tied embeddings in LSTM encoder/decoder
disable printing alignment by default (for perf) and add a flag to enable it
default need_attn to False
Fix bug when --share-all-embeddings but no --encoder-embed-path
Iterate on need_attn and fix tests
Output positional scores in interactive.py
Don't compute unnecessary attention averages during training
Transformer lm
This implements a transformer-based language model. It already obtains better perplexity on wikitext103 without any tuning. I will also train it on gbw, where I expect to get better ppl.
Example training command:
python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 80 --save-interval 1 --arch transformer_lm --task language_modeling --optimizer nag --lr 0.008 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.6 --dropout 0.2 --criterion adaptive_loss --adaptive-softmax-cutoff 10000,50000,200000 --max-tokens 512 --tokens-per-sample 512 --seed 1 --sample-break-mode none --log-format json --log-interval 50 --save-interval-updates 2500 --keep-interval-updates 25
small transformer got to 31.3 ppl on wiki text 103 (compared to 35 with fconv) while @myleott got a big transformer lm to 27 something ppl on wiki text 103
remove right-to-left lm support
default decoder_learned_pos for lm
option to print language model words and their log probs during evaluation
Update IWSLT configuration for transformer
Don't use 0-dimensional buffers in sinusoidal positional embeddings
Fix comment
Merge internal changes
Add load_optim option to load checkpoint but not optimizer state (#229)
Correct path in the pre-processing example (#230)
Correct the help name of the prefixes arguments (#234)
Fix bug when training with FP32 and --update-freq (#236)
Add ensemble for different architectures (#235)
add end-of-stack normalizations in case normalize_before has been set (#244)
Fix comment
Fix bidirectional LSTM concatenation (#249)
fix adaptive softmax indexing
option for a smaller adaptive softmax
character token embeddings for word level predictions
remove unneeded defaults
Always smaller soft
no need to have half-size option as behavior can be reproduced with existing flags
make adaptive softmax dropout an optional arg
add flag that allows keeping optimizer config
adds --reset-optimizer, --reset-lr-scheduler, and --optimizer-overrides flags
fix tests
make batching faster for monolingual dataset
load args from model for eval_lm
parameters to separate input/inner/out dims
cosine + triangular lr scheduler
Factor out search logic in SequenceGenerator
Reset gnorm after each epoch
fix tests
Increase max buffer size in all_gather_list
script to read binarized data
Move read_binarized.py to scripts/
Warn when using FP16 on pre-Volta GPUs
add warmup support back to cosine lr sched (important for mt)
Diverse Beam Search
Remove --normalization-constant from fconv
Fix adaptive softmax cutoff comment
disable final layer norm for transformer decoder as it makes things worse
Add training wall time meter
Old checkpoints can't be loaded because of a new meter
word stats in eval_lm
Merge internal changes
Fix FP16 version comparison
don't send dummy batch when reloading from checkpoint
also don't crash if a param does not receive grads
Add adaptive softmax changes for lstm model
Add --upsample-primary
Clean up FairseqTask so that it's easier to extend/add new tasks
fix max_positions comparison
Fix comment
Further generalize EpochBatchIterator and move iterators into new file
fix cosine lr sched for t_mult=1 with warmup
Test max_positions
Misc changes to simplify upcoming tutorial
Add documentation
Update documentation
modified stories readme to include sample preprocessing code to split stories to 1k tokens
Fix readme
Fix docs
Readme fix
Update readme with WMT'18 model (#433)
Generator: net_input instead of manual src_tokens.
Sequence generator bug fix.
Revert sequence generator changes
Switch to DistributedDataParallelC10d and bump version 0.5.0 -> 0.6.0
- no more FP16Trainer, we just have an FP16Optimizer wrapper
- most of the distributed code is moved to a new wrapper class called DistributedFairseqModel, which behaves like DistributedDataParallel and a FairseqModel at the same time
- Trainer now requires an extra dummy_batch argument at initialization, which we do fwd/bwd on when there's an uneven number of batches per worker. We hide the gradients from these dummy batches by multiplying the loss by 0 (see the toy sketch after this list)
- Trainer.train_step now takes a list of samples, which will allow cleaner --update-freq
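A toy, self-contained illustration of that dummy-batch trick (not the Trainer code itself):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 2)
dummy_input = torch.zeros(1, 8)
dummy_target = torch.zeros(1, dtype=torch.long)

loss = F.cross_entropy(model(dummy_input), dummy_target)
# scaling by zero makes the backward pass produce all-zero gradients; the worker
# still participates in the gradient all-reduce and stays in sync with its peers
(loss * 0.0).backward()
```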
Disable c10d for AdaptiveLoss
Update LM test with --no-c10d
Pass encoder_input to generator, rather than src_tokens/src_lengths.
Fix validation loss
Add unit test to verify reproducibility after reloading checkpoints
Fix adaptive loss logging
Parallel preprocessing
Fix type of c10d bucket size
Better support for various c10d API changes
core changes to support latte collab
fix issue with truncated dict
Merge internal changes
Add back secondary set
Online backtranslation module
Co-authored-by: liezl200 <lie@fb.com>
fbshipit-source-id: 6a835d32f9dc5e0de118f1b46d365d0e0cc85e11
fbshipit-source-id: 17992f6a5908f078942544b769eda7a340a5e359
Merge internal changes (#295)
Summary:
Changelog:
- `90f52a1`: Support loading subsets of the data on each worker with the `--fix-batches-to-gpus` flag. This should fix #217 and #266.
- `6eda0a9`: Update README for replicating the "Scaling Neural Machine Translation" paper
- `b14c7cf`: Fallback to no_c10d backend for pytorch 0.4.1 (fixes #294)
Pull Request resolved: https://github.com/pytorch/fairseq/pull/295
Differential Revision: D10121559
Pulled By: myleott
fbshipit-source-id: 41c84d0ee4cdd113544b5d3aa38ae8b23acc2c27
Merge internal changes
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/296
Differential Revision: D10121830
Pulled By: alexeib
fbshipit-source-id: 1b73430bdfdcb20a9a6123abfca3472a0d307b3b
Explicitly list out generation args for backtranslation dataset
Summary:
Using argparse Namespace hides the actual args that are expected and makes code harder to read.
Note the difference in style for the args list:
def __init__(
    self,
    tgt_dataset,
    tgt_dict,
    backtranslation_model,
    unkpen,
    sampling,
    beam,
    max_len_a,
    max_len_b,
):
instead of
def __init__(
    self, tgt_dataset, tgt_dict, backtranslation_model, unkpen, sampling,
    beam, max_len_a, max_len_b,
):
Reviewed By: dpacgopinath
Differential Revision: D10152331
fbshipit-source-id: 6539ccba09d48acf23759996b7e32fb329b3e3f6
Update README.md
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/300
Differential Revision: D10154711
Pulled By: edunov
fbshipit-source-id: 859d1ac59923b67c1547b6f7acb94f801b0c3318
Pass in kwargs and SequenceGenerator class to init BacktranslationDataset
Summary: This generalizes BacktranslationDataset to allow us to use any SequenceGenerator class. For example, if we want to use this model in PyTorch Translate, we can pass the following to BacktranslationDataset init: (1) a PyTorch Translate SequenceGenerator class as generator_class and (2) the appropriate args for initializing that class as kwargs.
Reviewed By: xianxl
Differential Revision: D10156552
fbshipit-source-id: 0495d825bf4727da96d0d9a40dc434135ff3486c
Fix proxying in DistributedFairseqModel
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/302
Differential Revision: D10174608
Pulled By: myleott
fbshipit-source-id: 4e2dfc76eae97afc5488f29b47e74f9897a643ff
Option to remove EOS at source in backtranslation dataset
Summary:
If we want our parallel data to have EOS at the end of source, we keep the EOS at the end of the generated source dialect backtranslation.
If we don't want our parallel data to have EOS at the end of source, we **remove** the EOS at the end of the generated source dialect backtranslation.
Note: we always want EOS at the end of our target / reference in parallel data so our model can learn to generate a sentence of arbitrary length. So we make sure that the original target has an EOS before returning a batch of {generated src, original target}. If our original targets in the tgt dataset don't have an EOS, we append EOS to each tgt sample before collating.
We only do this for the purpose of collating a {generated src, original tgt} batch AFTER generating the backtranslations. We don't enforce any EOS before passing tgt to the tgt->src model for generating the backtranslation. The users of this dataset are expected to format tgt dataset examples in the correct format that the tgt->src model expects.
Reviewed By: jmp84
Differential Revision: D10157725
fbshipit-source-id: eb6a15f13c651f7c435b8db28103c9a8189845fb
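A small sketch of the EOS convention described above (illustrative; the real BacktranslationDataset also handles batching and collating):

```python
import torch

def format_pair(generated_src, original_tgt, eos, keep_src_eos):
    # generated_src / original_tgt are 1-D LongTensors of token ids
    if not keep_src_eos and generated_src[-1] == eos:
        generated_src = generated_src[:-1]  # drop EOS from the backtranslated source
    if original_tgt[-1] != eos:
        # the target / reference always ends in EOS so the model learns when to stop
        original_tgt = torch.cat([original_tgt, original_tgt.new([eos])])
    return {"source": generated_src, "target": original_tgt}
```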
multihead_attention: pre-transpose incremental state (#232)
Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/232
Though transpose operations are essentially free during PyTorch execution, they can result in costly operations when exported to Caffe2 inference nets via ONNX tracing, especially when applied repeatedly to large tensors.
For this reason, we update `MultiheadAttention` to store its incremental state with shape (bsz, num_heads, seq_len, head_dim), that is after transposing the projected input. This should result in non-trivially faster exported models without changing the semantics or speed of PyTorch execution.
Reviewed By: myleott
Differential Revision: D10186506
fbshipit-source-id: 8a42712423ee767ea49ed88d2a4653f900d14fba
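A shape-only sketch of the cached layout described above (sizes and values are placeholders):

```python
import torch

bsz, num_heads, head_dim = 2, 4, 16
# cached keys are stored already transposed to (bsz, num_heads, seq_len, head_dim),
# so each incremental decoding step only appends along the time dimension
prev_k = torch.zeros(bsz, num_heads, 0, head_dim)   # cache before this step
new_k = torch.randn(bsz, num_heads, 1, head_dim)    # this step's projected keys
cached_k = torch.cat([prev_k, new_k], dim=2)
```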
Have noising account for sentences with and without EOS (#305)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/305
Previously, noising code assumed that every sentence had an EOS which had to be excluded from noising operations (since we shouldn't drop, blank, or shuffle EOS). This logic allows the noising module to handle sentences with EOS and without EOS
Reviewed By: xianxl
Differential Revision: D10114425
fbshipit-source-id: 04ec8547343eb94266bda1ac7fca3d8a1991c9f4
Add denoising dataset for denoising autoencoder (#306)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/306
This uses a source dataset to generate a batch of {source: noisy source, target: original clean source} which allows us to train a denoising autoencoding component as part of a seq2seq model.
Reviewed By: xianxl
Differential Revision: D10078981
fbshipit-source-id: 026225984d4a97062ac05dc3a36e79b5c841fe9c
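A toy sketch of how such a {noisy source, clean target} pair could be built (illustrative; not the DenoisingDataset API):

```python
import random

def make_denoising_example(tokens, drop_prob=0.1, max_shuffle_distance=3):
    # word dropout: randomly remove tokens
    noisy = [t for t in tokens if random.random() > drop_prob]
    # local shuffle: perturb each position by a bounded random offset and re-sort
    keys = [i + random.uniform(0, max_shuffle_distance) for i in range(len(noisy))]
    noisy = [t for _, t in sorted(zip(keys, noisy))]
    return {"source": noisy, "target": list(tokens)}
```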
fix make_positions() typo (#316)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/316
This code should actually be keeping the padded positions as `padding_idx` (though note that this is on the ONNX export path, and it has no effect in the most common case when using the exported network to do un-batched inference).
Reviewed By: myleott
Differential Revision: D10431872
fbshipit-source-id: 79fe4ac27cafcd4701e0f2a90e29d1b7362dc6f8
Update upgrade_state_dict in transformer.py to upgrade_state_dict_named (#317)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/317
When upgrading the `state_dict` variable, the `upgrade_state_dict` function in TransformerEncoder/TransformerDecoder doesn't handle multiple encoders/decoders; however, D10052908 introduces that case.
Before this change, we hit error message [1] when loading a checkpoint for the multilingual_transformer model in D10052908. This diff fixes it.
Reviewed By: myleott, liezl200
Differential Revision: D10375418
fbshipit-source-id: 7104c1a463e78f3fa33d8479a37c51608be50610
Manually port pull request 385
Summary:
Manually port fairinternal fairseq-py pull request #385 [1] to fbcode.
Resolve the merge conflict of removing fp16_trainer per offline discussion with Myle. Also updated the code to make generate.py work.
[1] https://github.com/fairinternal/fairseq-py/pull/385/commits/18fa6e154781cf0c4b1596429dba7e753a545069
Reviewed By: liezl200
Differential Revision: D10052908
fbshipit-source-id: c3c378d78dc1e9ac087c815f359e78c0048ff2f5
Fix another distributed syncing issue
Summary:
This is another failure due to distributed GPUs getting out of sync.
We run save_and_eval (which has the inter-GPU communication calls) by
looking at the number of updates. But the number of updates means weight updates. Whenever
there is an issue in training and weights can't be updated, nodes go
out of sync and start failing. So we should check the number of iterations instead.
I am, again, making a small change to save the day, but we should decouple/refactor the
save_and_eval logic from training to have fewer headaches in the future.
I plan to work on that later. But this should solve some of the
issues for now.
Reviewed By: jhcross
Differential Revision: D10478427
fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c
Expose BacktranslationDataset from fairseq.data (#324)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/324
BacktranslationDataset was introduced recently but was not exposed as part of the fairseq.data module
Reviewed By: liezl200
Differential Revision: D10412717
fbshipit-source-id: 8a9d4ecd43fd376e895c450d00e765a869c95eff
Add size method to BacktranslationDataset + misc fixes (#325)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/325
RoundRobinZipDataset requires size(index) method implemented in every dataset used. Also added missing return statements in a few methods.
Reviewed By: liezl200
Differential Revision: D10457159
fbshipit-source-id: 01856eb455f2f3a21e7fb723129ff35fbe29e0ae
make fairseq models compatible with character inputs and use character inputs for elmo in pytext
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/321
Reviewed By: alexeib
Differential Revision: D10430186
fbshipit-source-id: 9cc8fe0f202cc49370cecf36312bcc9bf0b4deee
LanguagePairDataset and BacktranslationDataset changes for semi supervised task setup (#330)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/330
As part of the semi-supervised task setup (https://github.com/pytorch/translate/pull/243), this diff adds the ability for LanguagePairDataset to remove EOS from source or append EOS to target. This functionality is required by BacktranslationDataset to use translations as source data.
Also added changes to BacktranslationDataset to make it work on GPU. We needed to transfer back-translated sentences back to CPU for the LanguagePairDataset to collate.
Reviewed By: liezl200
Differential Revision: D10846294
fbshipit-source-id: b015ecb5fcef26fba507c30f8a4992bdbc54899f
Fix print & add more informative logging
Summary: Fix fairseq's `force` option for disabling print suppression (otherwise, `print(..., force=True)` fails on master since the force kwarg gets passed to the builtin print).
Reviewed By: dpacgopinath
Differential Revision: D10522058
fbshipit-source-id: bbc10c021a7d21396ebfbb1bf007f6b9b162f4fd
Extend WordShuffle noising function to apply to non-bpe tokens
Summary:
We'd like to reuse the noising functions and DenoisingDataset in
adversarial training. However, the current noising functions assume the inputs are
subword tokens. The goal of this diff is to extend them so the noising can be
applied to word tokens. Since we're mostly interested in the word shuffle
noising, I only modified the WordShuffle class.
Reviewed By: liezl200
Differential Revision: D10523177
fbshipit-source-id: 1e5d27362850675010e73cd38850c890d42652ab
transformer onnx trace: skip no-op transpose (#333)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/333
A tiny hack to speed up inference slightly for transformer beam search after export to graph mode. Specifically, there is no need to transpose a dimension with size 1 (the sequence length of a single decoder time step during beam search) with its neighbor immediately before a view/reshape.
Reviewed By: jmp84
Differential Revision: D12833011
fbshipit-source-id: f9c344a9ad595e6e48a8a65b31cf2b1392f9b938
match examples/stories/writingPrompts scripts to correct folder
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/290
Differential Revision: D12876759
Pulled By: myleott
fbshipit-source-id: 9f6d1c9de27dad29368a7edb923dfcf770355938
Update bleu.py (#320)
Summary:
Modify the error message of bleu.
Fix the issue: https://github.com/pytorch/fairseq/issues/284
Pull Request resolved: https://github.com/pytorch/fairseq/pull/320
Differential Revision: D12876721
Pulled By: myleott
fbshipit-source-id: df25885a94a584cbf4b86a1665e3e513c7eb8e9a
Fix tests + style nits + Python 3.5 compat
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/336
Differential Revision: D12876709
Pulled By: myleott
fbshipit-source-id: a31536e2eb93f752600b9940c28e9b9fcefc8b86
Denoising autoencoder task (#251)
Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/251
We should use shared encoder and separate decoders as in:
https://fb.facebook.com/groups/2156114531381111/permalink/2169028113423086/
Generation is a hack, ideally the net input should have the lang pair info so that when we pass the sample to the model, it can select the correct encoder/decoder pair.
diff [2/2] will be for flow integration for basic experimentation
TODO in a future diff: figure out how to generalize this so export will work??
This works with vocab reduction, but we only support vocab reduction for src-tgt, not src-src model. A future (lowpri) task could be to add word prediction vocab reduction for src-src model to speed up training.
Reviewed By: xianxl
Differential Revision: D10512576
fbshipit-source-id: 545d96cad8e814b9da7be102a48cc5cac358b758
Move fairseq part of D10478427 directly into pytorch-translate (#337)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/337
Pull Request resolved: https://github.com/pytorch/translate/pull/250
Reviewed By: akinh
Differential Revision: D12880352
fbshipit-source-id: 61e9888a9cc3df07e805820b74a5fcf359dfe0ea
Fix "ignore-case" behavior (#339)
Summary:
Currently, if `ignore-case` is set, the same line will be yielded twice - once as the lower-cased version, once as the original version - leading to lower-than-expected uncased scores.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/339
Differential Revision: D12890386
Pulled By: myleott
fbshipit-source-id: 0570e5f6e8f848f2c6439d615e70aca6df097eef
Black formatting in fairseq/test_noising (#341)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/341
Use black formatting in test_noising.py
Reviewed By: xianxl
Differential Revision: D12810285
fbshipit-source-id: 5517dd5d2f086831f487d88acf6bc2fa18820297
Refactor fairseq/test_noising with a word shuffle helper function (#340)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/340
This allows us to do a lot less copy paste when adding new word shuffle function tests
Reviewed By: xianxl
Differential Revision: D12810304
fbshipit-source-id: a56b5df093d17be2b73837897c526978cab92b70
Support BPE end of word marker suffix in fairseq noising module
Summary:
There are 2 ways to implement BPE:
1. use a continuation marker suffix to indicate that there is at least one more subtoken left in the word
2. use an end-of-word marker suffix to indicate that there are no more subtokens left in the word
This adds some logic to account for either kind of BPE marker suffix. This diff adds a corresponding test. I also refactored the test setup to reduce the number of boolean args when setting up test data.
Reviewed By: xianxl
Differential Revision: D12919428
fbshipit-source-id: 405e9f346dce6e736c1305288721dfc7b63e872a
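A tiny illustration of the two conventions (the "@@" and "</w>" markers below are common choices, not fixed by fairseq):

```python
# continuation-marker style: "@@" means more subwords of the same word follow
continuation_style = ["hel@@", "lo", "wor@@", "ld"]
# end-of-word-marker style: "</w>" marks the last subword of a word
end_of_word_style = ["hel", "lo</w>", "wor", "ld</w>"]

def words_from_continuation(tokens, marker="@@"):
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(marker):
            current += tok[:-len(marker)]
        else:
            words.append(current + tok)
            current = ""
    return words

assert words_from_continuation(continuation_style) == ["hello", "world"]
```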
Merge internal changes
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/352
Differential Revision: D12956930
Pulled By: myleott
fbshipit-source-id: 39334a79544bac570feb04be9103269d7c1563f9
Fix error when training multilingual_translation task with multi-GPU
Summary:
D10052908 introduce multilingual_translation task, but it raises exception when training with multiple-GPUs: P60202593
With Myle's help, we found that it is because of an improperly handled dummy batch data type, which causes optimizer.backward() to not be executed the same number of times across different GPUs.
Reviewed By: xianxl
Differential Revision: D12964263
fbshipit-source-id: 4991039030bf373f0c484e131acc4736487be4d8
pipeline for LM training
Summary:
Step 2 of the pipeline for LM training:
assumes tokenized text data as input, splits it into train/validation/test, and runs binarization
(step a_ii in https://fb.quip.com/kazzAxvZHBj9)
Reviewed By: borguz
Differential Revision: D10454705
fbshipit-source-id: 74e8679041f5507c4e404c1b719547c2ae9ed983
Support for BPE vocabs + denoising autoencoder in PyTorch Translate (#362)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/362
Pull Request resolved: https://github.com/pytorch/translate/pull/254
This actually uses the fairseq logic which supports BPE cont / end word marker suffixes.
Reviewed By: xianxl
Differential Revision: D12952766
fbshipit-source-id: 35a1bbc38240e4145bec0fc419f2d0a6a73ae2e5
Fix dummy batch when --max-tokens is small (fixes #347)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/366
Differential Revision: D13058513
Pulled By: myleott
fbshipit-source-id: a146d2cfb345d404775ed8d6b8e4a4ad4e7a33b4
make dictionary optional
Reviewed By: jingfeidu
Differential Revision: D13104360
fbshipit-source-id: 9636f5ee2721818f98b33af559fa24292534a72f
Add LegacyDistributedDataParallel in place of no_c10d (#370)
Summary:
This should bring back the speedup with --update-freq that we reported in the Scaling Neural Machine Translation paper.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/370
Differential Revision: D13100281
Pulled By: myleott
fbshipit-source-id: 4a81b51bb7390a197add314a4be5512bbf68c085
Fix build for docs
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/372
Differential Revision: D13114426
Pulled By: myleott
fbshipit-source-id: 6c24b96a3556a0ecd3d1f350642a884254a40bd3
Merge small fixes from internal
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/374
Differential Revision: D13116074
Pulled By: myleott
fbshipit-source-id: 485724cc5a40e8360d21e4bf9c35821baa0ddc57
Protect against failures in case of OOMs
Summary: Fixing some distributed failures that happen when OOMs are observed.
Reviewed By: myleott
Differential Revision: D13121054
fbshipit-source-id: f71a0a695332acbaa1797e89887b8b7c7ddaa727
Refactor BacktranslationDataset to be more reusable (#354)
Summary:
- generalize AppendEosDataset -> TransformEosDataset
- remove EOS logic from BacktranslationDataset (use TransformEosDataset instead)
- BacktranslationDataset takes a backtranslation_fn instead of building the SequenceGenerator itself
Pull Request resolved: https://github.com/pytorch/fairseq/pull/354
Reviewed By: liezl200
Differential Revision: D12970233
Pulled By: myleott
fbshipit-source-id: d5c5b0e0a75eca1bd3a50382ac24621f35c32f36
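A hedged usage sketch of the new interface (argument names and the generator call are illustrative, not exact fairseq signatures):

```python
def make_backtranslation_fn(generator, tgt_to_src_model):
    def backtranslation_fn(sample):
        # run the reverse (tgt->src) model on a batch of monolingual target data
        return generator.generate([tgt_to_src_model], sample)
    return backtranslation_fn

# dataset = BacktranslationDataset(
#     tgt_dataset=monolingual_tgt_dataset,
#     backtranslation_fn=make_backtranslation_fn(generator, tgt_to_src_model),
# )
```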
Fix some recursive functions (e.g., reorder_incremental_state) to only touch each module once (#379)
Summary:
This can happen if a module is registered in more than one place in the network.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/379
Differential Revision: D13154498
Pulled By: myleott
fbshipit-source-id: a35575d1956a46cd35ac8b16a719ad20ac3e380a
onnx bi-transformer (#385)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/385
Pull Request resolved: https://github.com/facebookresearch/pytext/pull/6
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14292
Reviewed By: jingfeidu
Differential Revision: D10517864
fbshipit-source-id: 81008b5cc6aab70e23329c187392fb72ee057d78
Decoder embedding sharing in PyTorch Translate for denoising autoencoder (#386)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/386
Pull Request resolved: https://github.com/pytorch/translate/pull/266
This allows decoder embedding sharing for denoising autoencoder modules with different decoders (one for src decoding and one for tgt decoding)
Reviewed By: dpacgopinath
Differential Revision: D13133015
fbshipit-source-id: 3c98be639d705744ccf5ba3a8fd7d10ddc7aef4a
Fix --ddp-backend=no_c10d for params that don't require grads
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/388
Reviewed By: theweiho
Differential Revision: D13244869
fbshipit-source-id: d22c18f63f9a691ccc7245e06bc9a5b776a192b5
fixes on bi-transformer onnx
Summary: replace dynamic index put with copying and creating a new tensor
Reviewed By: wanchaol
Differential Revision: D13244573
fbshipit-source-id: 909f7913ad579ed035f29bb52321ff01e09a2c60
fixed torch 0.4.0 "RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.cuda.FloatTensor for argument #3 'index'" error (#393)
Summary:
In torch.__version__ == 0.4.0,
new_order = torch.arange(bsz).view(-1, 1).repeat(1, beam_size).view(-1)
returns a float dtype Tensor, and when line 321 of fairseq/fairseq/models/fconv.py executes, it throws a RuntimeError.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/393
Differential Revision: D13276496
Pulled By: myleott
fbshipit-source-id: e7986246fbe2c79fff61bcab0e5bec9dd63e0afd
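A hedged sketch of the dtype issue (the explicit cast below illustrates one way to avoid the error; it is not a quote of the actual patch):

```python
import torch

bsz, beam_size = 4, 5
# under torch 0.4.0, torch.arange could return a float tensor by default, but
# index tensors must be LongTensor, hence the explicit cast
new_order = torch.arange(bsz).view(-1, 1).repeat(1, beam_size).view(-1).long()
```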
Better error message if workers fall out of sync (#396)
Summary:
This kind of issue should be rare, but the exception that was thrown before ("UnpicklingError: invalid load key") was very opaque, so let's use something a bit clearer.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/396
Differential Revision: D13325600
Pulled By: myleott
fbshipit-source-id: 2e7093752d45d6b04a3d506aca8d5694b72ab638
Enable check_reduction for imagenet flow and fairseq
Summary:
As the title says, it is better to enable this for certain use cases to make
sure things are right.
Reviewed By: myleott, pietern
Differential Revision: D13351753
fbshipit-source-id: cf495960fda71ebd679c23212e19703c93a9dbdc
Add check that --encoder-layers matches --decoder-layers for LSTM (fixes #394)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/398
Differential Revision: D13358876
Pulled By: myleott
fbshipit-source-id: 57673f2643aac01492cb8f5728bb9f1a34ba6aa7
Fix arg formatting in preprocess.py and add fmt control for black formatting (#399)
Summary:
Not switching to Black formatting just yet, but adding fmt: off directives in case we decide to later.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/399
Differential Revision: D13364674
Pulled By: myleott
fbshipit-source-id: a20a11a18be3d583ee30eff770278fb4bd05b93c
Warn when using --update-freq on a single machine and --ddp-backend != no_c10d
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/400
Differential Revision: D13366996
Pulled By: myleott
fbshipit-source-id: b4907815e7cc1b4a2aceab11210bf64cb3d814c9
Take a dummy train step under OOM to keep multiprocessing in sync
Summary: This is not a guaranteed solution (since processes may still get out of sync if OOM happens after an all_gather/all_reduce has been done) - but should still make multiprocessing training more robust in practice since it seems we usually OOM early enough.
Reviewed By: myleott
Differential Revision: D13086018
fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee
Add --fp16-scale-tolerance (#397)
Summary:
Let's only decrease the loss scale if a large enough percentage of batches overflow.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/397
Differential Revision: D13355159
Pulled By: myleott
fbshipit-source-id: e17dde73d34a639519b4348c013fdd19d2b314e6
fix data checking report bug (#403)
Summary:
The original code reports the size of a valid sample instead of an invalid one when raising an Exception, which is confusing.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/403
Differential Revision: D13391431
Pulled By: myleott
fbshipit-source-id: 4642ed027c0f664424fc5a9baf4363791144feaf
Loading PreTrained Models (#406)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/406
Static helper function in TranslationTask to load pretrained models
Reviewed By: myleott
Differential Revision: D13345276
fbshipit-source-id: 3a675ee1a144ceb8b010f30e1a6163ef670b53f3
data per gpu change
Summary: Avoid loading the entire dataset per GPU to reduce memory footprint
Reviewed By: rutyrinott
Differential Revision: D13163548
fbshipit-source-id: 4ba717c8021ba5723d02225bae5782e2c3a18640
Add BufferedIterator (#419)
Summary:
This improves performance for datasets that load data lazily. Enabled by default since it shouldn't compromise performance for non-lazy datasets.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/419
Differential Revision: D13546585
Pulled By: myleott
fbshipit-source-id: f6152e2047291b0d68cd7506cd772b0caafe95be
Improve memory efficiency of FP16 optimization (#404)
Summary:
Previously when training with --fp16, we stored a copy of the model parameters in FP32 for optimization, which consumed a lot of memory. An alternative is to just do the conversions to FP32 on the fly, which allows the caching allocator to reuse/save some memory.
This reduces peak memory usage by ~20% with a negligible reduction in training speed (~2% slower) when training a big transformer on 8 GPUs on wmt en-de with --update-freq=16.
This does not affect convergence, i.e., models will train exactly as they did before.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/404
Differential Revision: D13394376
Pulled By: myleott
fbshipit-source-id: 2b9f808548df4782110513c9cfc9f7c6159bcbbf
Add option to disable positional embeddings in TransformerModel (#421)
Summary:
Add argument `--no-token-positional-embeddings` to TransformerModel (currently only available in TransformerLanguageModel) to disable positional embeddings.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/421
Differential Revision: D13548450
Pulled By: myleott
fbshipit-source-id: b352c702ed1609e3b84d9a8404941d3274a7f883
Merge internal changes (#422)
Summary:
- 04cc608: Add `--match-source-len` option to generate.py for sequence-tagging tasks
- 19f1a40: Add `--no-repeat-ngram-size` option to generate.py for ngram blocking
Pull Request resolved: https://github.com/pytorch/fairseq/pull/422
Differential Revision: D13548445
Pulled By: myleott
fbshipit-source-id: 26d1ae83993e428fcb020dac5ae358b0e36233d9
Fix backtranslation dataset on IndexedCachedDataset (#410)
Summary:
BacktranslationDataset would throw an error when the underlying dataset was an IndexedCachedDataset because prefetching was not handled correctly. This fixes the error.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/410
Differential Revision: D13557539
Pulled By: myleott
fbshipit-source-id: 398ab59a3ebdbf1c666d862b9f905654eece800c
Fix resuming from FP16 checkpoints (#424)
Summary:
This was broken in 03a57de.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/424
Differential Revision: D13557540
Pulled By: myleott
fbshipit-source-id: 62deda5353032aff20d35d046b0bb843da44d27c
Make multiprocessing_train.py work with multi-node setups
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/425
Differential Revision: D13558340
Pulled By: myleott
fbshipit-source-id: dff8c77027e821d8c80bfbd6a6ccce9ca1a44b78
Merge internal changes (#283)
Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/283
Pull Request resolved: https://github.com/pytorch/fairseq/pull/428
Differential Revision: D13564190
Pulled By: myleott
fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5
rm fb_train.py (#432)
Cleanup more files
Update docs for --lazy-load and torch.distributed.launch
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/433
Differential Revision: D13588032
Pulled By: myleott
fbshipit-source-id: 0e5ff361e27b206c4490264f0f51863367499e81
Fix broken link in README.md (#436)
Summary:
https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset is no longer valid; it redirects to a blog post listing page.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/436
Differential Revision: D13607961
Pulled By: myleott
fbshipit-source-id: 1a1074ffcbc454e29bc9d5aed84fdf2089a224bc
Misc fixes
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/439
Differential Revision: D13608151
Pulled By: myleott
fbshipit-source-id: 198b84995a6329f8329829cc91184d88f1eab947
Make error message for trying to train after make_generation_fast work correctly
Summary: https://github.com/pytorch/fairseq/blob/master/fairseq/trainer.py#L164 calls `train()` without any argument
Reviewed By: myleott
Differential Revision: D13599203
fbshipit-source-id: 3a096a6dd35a7a3f8309fbda3b54a36f606475e3
Fixes (#442)
Summary:
minor fixes:
1- adding fairseq logo
2- encoder padding for fconv self att
3- legacy ddp change
Pull Request resolved: https://github.com/pytorch/fairseq/pull/442
Differential Revision: D13651715
Pulled By: myleott
fbshipit-source-id: ac93c80f1dbffdfe03fbd4b8a8ea527aecb576a7
New command line option '--user-dir' (#440)
Summary:
Following discussion on official fairseq (https://github.com/pytorch/fairseq/issues/438), I added the `--user-dir` option to the command line. The user can now specify a path in order to import a custom module with proprietary tasks, architectures and so on.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/440
Differential Revision: D13651721
Pulled By: myleott
fbshipit-source-id: 38b87454487f1ffa5eaf19c4bcefa0b3b15a8f43
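A hypothetical layout for such a module (names are illustrative): the directory passed to `--user-dir` contains an `__init__.py` that imports the files carrying the registration decorators.

```python
# my_fairseq_ext/__init__.py
# importing these submodules runs their @register_task / @register_model
# decorators so fairseq can discover them when --user-dir points at this directory
from . import my_task   # noqa: F401
from . import my_model  # noqa: F401
```

Training can then be launched with something like `python train.py <data> --user-dir my_fairseq_ext --task my_task`.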
'--user-dir' documentation (correct) (#447)
Summary:
Command line option --user-dir documented in docs/overview.rst
Pull Request resolved: https://github.com/pytorch/fairseq/pull/447
Differential Revision: D13674744
Pulled By: myleott
fbshipit-source-id: 17049ee5c9f692f5298ef9fa7381ee583f269cde
Fixed wrong help message shown on '--help' (#446)
Summary:
The correct help message was obfuscated by the transient `ArgumentParser` used only to eagerly read the `--user-dir` flag.
To reproduce just try:
```bash
python3 train.py --help
```
Pull Request resolved: https://github.com/pytorch/fairseq/pull/446
Differential Revision: D13674731
Pulled By: myleott
fbshipit-source-id: b9503a4d7ef26405be630d31c0ca02386d783031
optimizations for token_block_dataset
Summary:
Optimize memory use of token_block_dataset by replacing Python data structures with numpy arrays.
Applies the needed parts from D13498973 instead of rebasing it on these changes.
Reviewed By: edunov
Differential Revision: D13678485
fbshipit-source-id: c0c827a8b95834a6a5456476040ebdc8e42136d4
Add --checkpoint-upper-bound to average_checkpoints.py (#452)
Summary:
This is useful for averaging the last N checkpoints, ending at some "best" checkpoint.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/452
Differential Revision: D13695407
Pulled By: myleott
fbshipit-source-id: 5d9d2bff3706834f01501e9259834c77fb335817
FIX: '--user-dir' on multi-gpu (#449)
Summary:
In a multi-GPU training scenario, the `train.py` script spawns new processes with `torch.multiprocessing.spawn`. Unfortunately those child processes don't inherit the modules imported with `--user-dir`.
This pull request fixes the problem: the custom module import is now explicit in every `main()` function.
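A minimal sketch of the shape of the fix (the helper name `import_user_module` is illustrative): each spawned worker re-imports the user directory at the top of its own `main()`, because processes started by `torch.multiprocessing.spawn` begin from a fresh interpreter.
```python
import importlib
import os
import sys

import torch.multiprocessing as mp


def import_user_module(user_dir):
    # Illustrative helper: put the parent of `user_dir` on sys.path and
    # import it as a module so its @register_* decorators run.
    if user_dir is None or not os.path.isdir(user_dir):
        return
    parent, name = os.path.split(os.path.abspath(user_dir))
    if name not in sys.modules:
        sys.path.insert(0, parent)
        importlib.import_module(name)


def main(rank, user_dir):
    # Spawned children do not inherit dynamic imports from the parent, so
    # the import has to be repeated explicitly in every worker.
    import_user_module(user_dir)
    print(f"worker {rank} has custom modules registered")


if __name__ == "__main__":
    user_dir = "my_plugins"       # hypothetical --user-dir value
    import_user_module(user_dir)  # for the parent process itself
    mp.spawn(main, args=(user_dir,), nprocs=2)
```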
Pull Request resolved: https://github.com/pytorch/fairseq/pull/449
Differential Revision: D13676922
Pulled By: myleott
fbshipit-source-id: 520358d66155697885b878a37e7d0484bddbc1c6
Fix initial learning rate (#453)
Summary:
There was a very subtle bug here 😢. When we recently removed this line (7633129ba8d5f0e28bd6b6d6027b14352482ef31), it meant that the learning rate scheduler didn't get initialized until after the first update. Unfortunately PyTorch optimizers store the learning rate in their internal state, so some learning rate schedulers use their `__init__` method to reset the learning rate to a sane initial value. This is especially problematic for LR schedulers that include a warmup, where the Optimizer is likely to contain the peak learning rate at initialization, and it's only in the LR scheduler's `__init__` that the (much smaller) warmup value is set.
For example, the inverse_sqrt scheduler resets the learning rate upon initialization:
https://github.com/pytorch/fairseq/blob/7853818c2e33a63ec17a31bcfe20e4fc75d94130/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py#L48-L50
**Impact:** For the last ~1.5 weeks, the first training update would use the optimizer's default learning rate instead of the initial rate set by the LR scheduler. All subsequent updates used the correct learning rates. This primarily affects LR schedulers with warmups.
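For concreteness, a simplified sketch of why the scheduler's `__init__` matters (modeled loosely on the inverse_sqrt schedule; the parameter names and formula are illustrative): the constructor overwrites the optimizer's learning rate with the small warmup value, so skipping construction until after the first update means that update runs at the peak rate.
```python
class InverseSqrtSchedule:
    """Simplified warmup-then-inverse-sqrt schedule (illustrative only)."""

    def __init__(self, optimizer, warmup_updates=4000,
                 warmup_init_lr=1e-7, peak_lr=5e-4):
        self.optimizer = optimizer
        self.warmup_updates = warmup_updates
        self.warmup_init_lr = warmup_init_lr
        self.lr_step = (peak_lr - warmup_init_lr) / warmup_updates
        self.decay_factor = peak_lr * warmup_updates ** 0.5
        # Crucial: overwrite whatever LR the optimizer was constructed with.
        # If the scheduler is only built after the first update, that first
        # step runs at the optimizer's (peak) LR instead of warmup_init_lr.
        self.set_lr(warmup_init_lr)

    def set_lr(self, lr):
        for group in self.optimizer.param_groups:
            group["lr"] = lr

    def step_update(self, num_updates):
        if num_updates < self.warmup_updates:
            lr = self.warmup_init_lr + num_updates * self.lr_step
        else:
            lr = self.decay_factor * num_updates ** -0.5
        self.set_lr(lr)
        return lr
```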
Pull Request resolved: https://github.com/pytorch/fairseq/pull/453
Differential Revision: D13704453
Pulled By: myleott
fbshipit-source-id: a946da30100f837c66bdc6b9b77b014ab4eb8764
Fix stories generation
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/454
Differential Revision: D13708565
Pulled By: myleott
fbshipit-source-id: 5cd0e07e3e1885eef14e3a5e8074f24cf4bde632
Fix iteration bug in GroupedIterator. Correct sent size filter. (#455)
Summary:
Fix the bug where the GroupedIterator iterated from the beginning when being initialized. (https://github.com/pytorch/fairseq/issues/441)
Correct the filter criterion for dict-type sentence sizes. (https://github.com/pytorch/fairseq/issues/451)
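A minimal sketch of the intended behaviour (illustrative, not the fairseq class itself): the wrapper should consume the underlying iterator lazily in chunks, and must not touch it at construction time.
```python
import itertools


class GroupedIterator:
    """Wrap an iterator and yield lists of `chunk_size` consecutive items,
    consuming nothing at construction time."""

    def __init__(self, iterable, chunk_size):
        self._itr = iter(iterable)
        self.chunk_size = chunk_size

    def __iter__(self):
        return self

    def __next__(self):
        chunk = list(itertools.islice(self._itr, self.chunk_size))
        if not chunk:
            raise StopIteration
        return chunk


# Grouping an already partially-consumed iterator keeps its position:
itr = iter(range(10))
next(itr)                               # consume 0
print(list(GroupedIterator(itr, 4)))    # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```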
Pull Request resolved: https://github.com/pytorch/fairseq/pull/455
Differential Revision: D13725646
Pulled By: myleott
fbshipit-source-id: e698fa6f9b45460f95a75c9e9976a3aa3b6aa523
change f"{args}" to "{}".format(args) (#467)
Summary:
Although both are supported by Python 3.6, I think it would be better to unify the usage of the string formatting function.
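The two equivalent forms in question (the second also runs on Python versions older than 3.6):
```python
args = {"lr": 0.25, "max_tokens": 4000}

print(f"{args}")           # f-string, Python >= 3.6 only
print("{}".format(args))   # str.format, works on older Python 3 too
```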
Pull Request resolved: https://github.com/pytorch/fairseq/pull/467
Differential Revision: D13802506
Pulled By: myleott
fbshipit-source-id: 5c4877547b1c4ca806ab54c80ae483cfbaa7827a
Better error message for improperly formatted dictionaries
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/468
Differential Revision: D13802590
Pulled By: myleott
fbshipit-source-id: e374e38e74dc91bda0579ae41e26289fb0ba56a2
Enforce UTF-8 when open() text files (#460)
Summary:
When opening text files without specifying the encoding (i.e. `open(path, "r")` or `open(path, "w")`), python3 will use the preferred locale encoding (`locale.getpreferredencoding()`) so the result is platform dependent and can change from one machine to another.
I believe fairseq should enforce its own standard (UTF-8 seems like the best choice to me). This pull request explicitly specifies UTF-8 encoding when opening text files.
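The change boils down to passing the encoding explicitly (file names here are illustrative):
```python
# Platform dependent: falls back to locale.getpreferredencoding(), e.g.
# cp1252 on many Windows setups, so the same dictionary file can be read
# differently on different machines.
with open("dict.en.txt", "r") as f:
    lines = f.readlines()

# Deterministic across platforms:
with open("dict.en.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write("hello wörld\n")
```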
Pull Request resolved: https://github.com/pytorch/fairseq/pull/460
Differential Revision: D13802525
Pulled By: myleott
fbshipit-source-id: 672fd55707ee559ab36d74bc1c24026166ea2367
Print model and number of trained params
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/469
Differential Revision: D13802945
Pulled By: myleott
fbshipit-source-id: b6976506a8336b96ee40505c4a7638541cc99c95
LSTM improvements (fixes #414)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/470
Differential Revision: D13803964
Pulled By: myleott
fbshipit-source-id: 91b66599e9a539833fcedea07c608b349ba3b449
Only use c10d distributed primitives
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/471
Differential Revision: D13818918
Pulled By: myleott
fbshipit-source-id: d3b8dc50e81ee1d2dcc5efc5815998be8461085f
Adafactor Optimizer (#472)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/472
Implementation of "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost" (https://arxiv.org/abs/1804.04235)
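The paper's core trick, sketched for a single 2-D parameter (a simplified illustration that omits update clipping and relative step sizes, and is not the fairseq implementation): instead of a full second-moment matrix as in Adam, Adafactor keeps only per-row and per-column moving averages of the squared gradient and reconstructs a rank-1 approximation when scaling the update.
```python
import torch


def adafactor_step(param, grad, row_avg, col_avg, step,
                   lr=1e-2, decay=0.8, eps=1e-30):
    """One simplified Adafactor update for a 2-D parameter.
    row_avg has shape (n,), col_avg has shape (m,) for an (n, m) parameter."""
    sq = grad ** 2 + eps
    beta = 1.0 - step ** (-decay)              # decay schedule from the paper
    row_avg.mul_(beta).add_(sq.mean(dim=1), alpha=1 - beta)
    col_avg.mul_(beta).add_(sq.mean(dim=0), alpha=1 - beta)

    # Rank-1 reconstruction of the second moment: V ≈ outer(R, C) / mean(R)
    v_hat = torch.outer(row_avg, col_avg) / row_avg.mean()
    param.add_(grad / v_hat.sqrt(), alpha=-lr)


p = torch.zeros(4, 6)
adafactor_step(p, torch.randn(4, 6), torch.zeros(4), torch.zeros(6), step=1)
```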
Differential Revision: D13388049
fbshipit-source-id: 24ad30f4bac248e6aeaced5064bb83784058f03d
Refactor AdversarialTrainer: factor out helper functions
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/474
Reviewed By: theweiho, akinh
Differential Revision: D13701447
fbshipit-source-id: 34036dce7601835b605e3b169210edc7a6715de6
Add code for "Pay Less Attention with Lightweight and Dynamic Convolutions" (#473)
Summary:
Changelog:
- `e330f56`: Add code for the "Pay Less Attention with Lightweight and Dynamic Convolutions" paper
- `5e3b98c`: Add scripts for computing tokenized BLEU with compound splitting and sacrebleu
- update READMEs
- misc fixes
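The lightweight convolution at the heart of the paper can be sketched as a depthwise 1-D convolution whose kernels are softmax-normalized and shared across groups of channels (a simplified, non-causal illustration; the actual module also supports dynamic, per-timestep kernels and weight dropout):
```python
import torch
import torch.nn.functional as F


def lightweight_conv(x, weight, num_heads):
    """x: (batch, channels, time); weight: (num_heads, kernel_size).
    Each kernel is softmax-normalized over the kernel dimension and shared
    by channels // num_heads channels (simplified, non-causal padding)."""
    batch, channels, time = x.shape
    kernel_size = weight.size(1)
    w = F.softmax(weight, dim=-1)                          # normalize kernels
    w = w.repeat_interleave(channels // num_heads, dim=0)  # (channels, K)
    w = w.unsqueeze(1)                                     # (channels, 1, K)
    return F.conv1d(x, w, padding=kernel_size // 2, groups=channels)


x = torch.randn(2, 8, 16)      # batch=2, channels=8, time=16
weight = torch.randn(4, 3)     # 4 heads, kernel size 3
y = lightweight_conv(x, weight, num_heads=4)
print(y.shape)                 # torch.Size([2, 8, 16])
```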
Pull Request resolved: https://github.com/pytorch/fairseq/pull/473
Differential Revision: D13819717
Pulled By: myleott
fbshipit-source-id: f2dc12ea89a436b950cafec3593ed1b04af808e9
make dictionary class as input for fairseq preprocess functions (#482)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/482
With this change, we can use different dictionary classes when calling build_dictionary and build_and_save_dictionary
Reviewed By: liaimi
Differential Revision: D13855100
fbshipit-source-id: 62e6db310b5f078e05c547d2671252233be7b7f0
Merge internal changes (#483)
Summary:
Changelog:
- `4889802`: can now detokenize sentencepiece output with `--remove-bpe=sentencepiece` (fixes #331). Also added `--sacrebleu` for computing detokenized BLEU.
- `0d76427`: fix assertion error when training language model with dataset containing empty sentences
- minor bug and style fixes
Pull Request resolved: https://github.com/pytorch/fairseq/pull/483
Differential Revision: D13867899
Pulled By: myleott
fbshipit-source-id: 25c940b847fe270262ac8f5ac838407b3977fdda
Add --input option to interactive.py to support reading from file
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/484
Differential Revision: D13880636
Pulled By: myleott
fbshipit-source-id: 984b2e1c3b281c28243102eb971ea45ec891d94e
Do distributed init after data loading
Summary:
FACEBOOK
This switches back to torch.multiprocessing.spawn, instead of directly calling fb_train.par using a subprocess.Process. This has the advantage that exceptions are propagated properly. It also moves the distributed_init part to happen after data loading, which gets around the timeout issue.
The downside of this approach is that it's not so easy to pipe stdout to multiple places, which was nice when using the sweep.py scripts. I'm still working on a fix for that.
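A rough sketch of the ordering described above (function names and the init method are illustrative, not fairseq's actual distributed setup): each spawned worker loads its data first and only then joins the process group, so slow data loading no longer trips the rendezvous timeout, and exceptions propagate back through `mp.spawn`.
```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def load_data():
    # Stand-in for (potentially slow) dataset/dictionary loading.
    return list(range(1000))


def worker(rank, world_size):
    data = load_data()  # 1) load data first, before any collective setup

    # 2) only now join the process group; exceptions raised here propagate
    #    back through mp.spawn instead of dying silently in a subprocess.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    print(f"rank {rank} ready with {len(data)} examples")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```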
Reviewed By: rutyrinott, ngoyal2707
Differential Revision: D13873224
fbshipit-source-id: 08d593233b8d23590c01c723363630a79804a8b0
Support custom Dictionary implementations in 'preprocess.py' (#448)
Summary:
The `preprocess.py` script has been refactored in order to:
1. Use the `options` module for command line argument parsing. This gives `preprocess.py` the ability to load custom modules with the `--user-dir` flag (already implemented for all other binaries).
2. Dictionary loading and building code has moved into the Task implementation. This allows custom Dictionary classes to be used during the data generation step.
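A hedged sketch of what moving dictionary handling into the task enables (the custom classes below are hypothetical; the method signatures approximate the fairseq task API): a custom task overrides how dictionaries are loaded and built, and the preprocessing step simply asks the task for them.
```python
from fairseq.data import Dictionary
from fairseq.tasks import FairseqTask, register_task


class CharDictionary(Dictionary):
    """Hypothetical custom dictionary that indexes characters, not words."""


@register_task("char_level_task")
class CharLevelTask(FairseqTask):
    @classmethod
    def load_dictionary(cls, filename):
        return CharDictionary.load(filename)

    @classmethod
    def build_dictionary(cls, filenames, workers=1, threshold=-1,
                         nwords=-1, padding_factor=8):
        d = CharDictionary()
        for filename in filenames:
            with open(filename, "r", encoding="utf-8") as f:
                for line in f:
                    for ch in line.strip():
                        d.add_symbol(ch)
        d.finalize(threshold=threshold, nwords=nwords,
                   padding_factor=padding_factor)
        return d
```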
Pull Request resolved: https://github.com/pytorch/fairseq/pull/448
Differential Revision: D13674819
Pulled By: myleott
fbshipit-source-id: b40648a98ed6c08284577e5ec25876e018d8c822
Add standalone binaries
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/489
Differential Revision: D13956810
Pulled By: myleott
fbshipit-source-id: 61ace179d1d3790226c38b3f3e47f5452b5ec514
Add CheckpointManager to keep avg checkpoint weights in memory to reduce disk read when averaging + various checkpoint refactoring
Summary: Pull Request resolved: https://github.com/pytorch/translate/pull/315
Reviewed By: akinh
Differential Revision: D13510446
fbshipit-source-id: 22a6594af9253130a93e638285a47183a974e0de
stitch preprocessing pipeline
Summary:
1. add call to binarization to com…
Harleen8118
pushed a commit
to Harleen8118/IBERT
that referenced
this pull request
Jun 26, 2025
Summary: Pull Request resolved: facebookresearch/fairseq#324 BacktranslationDataset was introduced recently but was not exposed as part of the fairseq.data module Reviewed By: liezl200 Differential Revision: D10412717 fbshipit-source-id: 8a9d4ecd43fd376e895c450d00e765a869c95eff