Merge internal changes #428

Closed
wants to merge 12 commits

Conversation

myleott
Contributor

@myleott commented Jan 1, 2019

No description provided.

@facebook-github-bot left a comment

@myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit to pytorch/translate that referenced this pull request Jan 5, 2019
Summary:
Pull Request resolved: #283

Pull Request resolved: facebookresearch/fairseq#428

Differential Revision: D13564190

Pulled By: myleott

fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5
@myleott deleted the merge_internal branch January 8, 2019 17:55
jxhe pushed a commit to salesforce/ctrl-sum that referenced this pull request Nov 3, 2020
Architecture settings and readme updates

More fixes

Small fix

Better training support when GPUs are in "exclusive mode"

Issue #2, Checking size attribute of dst when dst is None

Fix generation when vocabulary is small relative to beam size (fixes #7)

Fix handling of partially-empty initial batch (#11)

Refactor PaddingCollater

Update progress_bar to be more robust to changes in tqdm (#21)

Fix call ordering to ATen addmm and sum (#22)

BPE transformation for IWSLT

Don't generate during training, add --quiet to generate.py

Ignore invalid sentences in test and valid

Update README.md

Update PyTorch install instructions

Update README.md

Don't suggest Miniconda (see #24)

Fix --no-progress-bar option in generate.py (#115)

Update En2Fr model

Move helper functions from generate.py to fairseq/dictionary.py

Support configurable BPE symbol

Fix flake8 warnings

Don't save/restore convolutional layers in incremental inference

Allow --max-len-a to be a float

Add optimizer history to checkpoints (and rearrange criterions slightly)

Better logging from criterions

Add support for NCCL v2

Add attention matrix to output of SequenceGenerator

Ignore generated files for temporal convolution tbc

Fix smoothed (sentence-level) BLEU calculation

More flexible gradient normalization

Refactor model saving/loading to be more reusable

Refactor code in Tokenizer

Add support for additional optimizers

Simplify deps of build_model to only depend on dict (instead of dataset)

Fix language inference in generate.py

Fix handling of continuation tokens that precede <unk> in generate.py

Prevent math overflow when loss is too high

Set seed after each epoch to improve consistency when resuming

Fix for building under clang: specify C++ build and use C++ linkage (#42)

Update README with note about Docker (#49)

Force UTF-8 encoding for dictionary files ( #41 )

Only save most recent optimizer state in checkpoints (#53)

Only consider EOS in beam search if it's among top-k candidates

Fix description for `--sample-without-replacement` option

Support custom dictionary in preprocess.py

Add `--curriculum` option

Refactor model definitions

* Move some functionality out of FConvModel into FairseqModel base class
* Move incremental decoding functionality into FairseqIncrementalDecoder module
* Refactor positional embeddings to be more specific to FConvModel

Added -unkpen flag to generate.py following logic of Lua/Torch version

Support different max_source_positions and max_target_positions

Fix call to non-existing to_string method

Fix seed so that data is properly shuffled between epochs

Upgrade args with max_source_positions and max_target_positions

Refactor generation

* Split generate.py to generate.py and interactive.py and refactor code

The main motivation behind these changes is to decouple the use cases, in order
to enable future improvements such as unk replacement with the original string
during evaluation on test and writing predictions to an output file.
The previous implementation worked well, but I found it difficult to
integrate these future improvements into it.

* Add --replace-unk arg to be used without align dict

Replacing <unk> tokens can be beneficial even without an alignment
dictionary.

Left pad source and right pad target

Improvements to data loader

Fix interactive.py

Use `--lrshrink` as the reduction factor in ReduceLROnPlateau

Fix flake8 lint

Add dim to F.softmax calls

Update README with interactive.py and fix it

Add --max-sentence option for batching based on # sentences

Loop over evaluation dataloader in descending order

Replace unk with original string

* Add <eos> for unk replacement
* Add IndexedRawTextDataset to load raw text files
* Replace unk with original string
* Add load_raw_text_dataset() and --output-format
* Move has_binary_files to data.py

Revert `dim` in `F.softmax` for backwards compatibility

Rename LabelSmoothedCrossEntropy to LabelSmoothedNLLLoss

Add LSTM

Don't call forward directly (prefer module(x) to module.forward(x))

Add `--log-format` option and JSON logger

Fix max_positions_valid in train.py

Fixes for `--log-format`

Fix all-reduce for new versions of PyTorch

We previously assumed that once a model parameter's gradient buffer was allocated, it stayed fixed during training.
However, this assumption is violated in recent versions of PyTorch (i.e., the gradient buffer may be reallocated during
training), and it's no longer a safe assumption to make.

This is primarily relevant when we do the all-reduce, since we all-reduce a flattened (i.e., contiguous) copy of the
gradients. We can make this more robust by copying the result of the all-reduce back into the model parameter's gradient
buffers after each update. Intra-device copies are cheap, so this doesn't affect performance.
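A minimal sketch of that copy-back step, assuming `torch.distributed` is already initialized; the helper name and the use of PyTorch's private flatten/unflatten utilities are illustrative, not fairseq's actual trainer code:

```python
import torch
import torch.distributed as dist

def all_reduce_and_copy_back(params, world_size):
    # Collect the current gradient buffers (they may have been reallocated
    # since the last step, which is exactly why we copy back at the end).
    grads = [p.grad.data for p in params if p.grad is not None]
    # All-reduce a single flattened, contiguous copy of the gradients.
    flat = torch._utils._flatten_dense_tensors(grads)
    dist.all_reduce(flat)
    flat.div_(world_size)
    # Copy the reduced values back into each parameter's .grad buffer.
    for buf, reduced in zip(grads, torch._utils._unflatten_dense_tensors(flat, grads)):
        buf.copy_(reduced)
```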

Version 0.1.0 -> 0.2.0

Release notes:
- 5c7f495: Added simple LSTM model with input feeding and attention
- 6e4b7e2: Refactored model definitions and incremental generation to be cleaner
- 7ae79c1: Split interactive generation out of generate.py and into a new binary: interactive.py
- 19a3865: Subtle correctness fix in beam search decoder. Previously, for a beam size of k, we might emit a hypothesis
           if the <eos> was among the top 2*k candidates. Now we only emit hypotheses for which the <eos> is among the
           top-k candidates. This may subtly change generation results, and in the case of k=1 we will now produce
           strictly greedy outputs.
- 97d7fcb: Fixed bug in padding direction, where previously we right-padded the source and left-padded the target. We
           now left-pad the source and right-pad the target. This should not affect existing trained models, but may
           change (usually improves) the quality of new models.
- f442f89: Add support for batching based on the number of sentences (`--max-sentences`) in addition to the number of
           tokens (`--max-tokens`). When batching by the number of sentences, one can optionally normalize the gradients
           by the number of sentences with `--sentence-avg` (the default is to normalize by the number of tokens).
- c6d6256: Add `--log-format` option and JSON logger

Fallback to `--log-format=simple` for non-TTY terminals

Fix Flake8

Flush non-TTY logging output after each log interval

Make LSTM backwards compatible and fix incremental generation

Remove Python 3.6 format strings (fixes #55)

Remove more Python 3.6 format strings (fixes #57) (#58)

Remove Python3.6 format string from preprocess.py (fixes #60) (#61)

Update requirements.txt and fix flake8 (#62)

fix bug in lstm model (#68)

Fixed 2 typos (#75)

Improve error when resuming training with a different model architecture

Improve memory handling (recover from OOM and periodically empty caching allocator)

Allow --lr to specify a fixed learning rate schedule

Prefer command-line configuration over checkpoint for optimizer state

Save number of GPUs in args (and checkpoints)

Fix weight norm dimension in decoder (fixes #73)

Rebuild optimizer when loading checkpoints

Fix conv padding for even kernel widths

Directly decay weight instead of L2 penalty (#157)

See https://arxiv.org/pdf/1711.05101.pdf

Fix generation bug with large beam sizes (>50)

Add support for sharded generation

Fix BeamableMM

Better error message for --decoder-attention

Minor fix for strip_pad functions

Support deprecation of volatile Variables in latest PyTorch

Add explicit dimension to softmax calls

Output number of model parameters in train.py

Raise FileNotFoundError if dictionary files don't exist

Add reduce kwarg to criterions

Streamline data formatting utils

Add --max-sentences-valid to train.py

Add option to SequenceGenerator to retain dropout

Fix warning about deprecated `volatile` kwarg for Variables

Move positional embeddings into LearnedPositionalEmbedding module

Move normalization of model output (e.g., via LSM) into model definition

Fix LearnedPositionalEmbedding

Fix gradient clipping when --clip-norm=0

Save dictionary in model base classes

Fix training

Better support for torch.no_grad (since volatile is deprecated)

Share input/output embed

Report log likelihood for label smoothing

Momentum correction

ATen Fix

Better warning message for inputs that are too long

Fix max_positions calculation in train.py

Output correct perplexity when training with --sentence-avg

Fix tests

Fixed Weight Decay Regularization in Adam

See https://arxiv.org/abs/1711.05101

Ratio should be predlen/reflen not reflen/predlen

To be compatible with multi-bleu.
This seems to only affect the result_string.

Prepare scripts for WMT14

Switch to news-commentary-v12

Adding README and more parameters to En2De script

Update README with new models

spelling

Adjust weight decay by the current learning rate to make it work correctly during annealing

Allow larger maxlen (fixes #100) (#101)

fairseq-py goes distributed (#106)

This PR includes breaking API changes to modularize fairseq-py and adds support for distributed training across multiple nodes.

Changes:
- c7033ef: add support for distributed training! See updated README for usage.
- e016299: modularize fairseq-py, adding support for register_model, register_criterion, register_optimizer, etc.
- 154e440: update LSTM implementation to use PackedSequence objects in the encoder, better following best practices and improving perf
- 90c2973 and 1da6265: improve unit test coverage

Add OOM counter back to logging output

Fix tests and flake8

More unit test fixes

Add support to prefixes (#221)

* Add prefix

* Fixes

* Keep original scores with prefix

* Improve prefix code

* Replace 'repeat' with 'expand'

pytorch update: no need to rewrap variable in backward()

Fix LabelSmoothedCrossEntropy test

Refactor incremental generation to be more explicit and less magical (#222)

Making our code compatible with the latest pytorch (#223)

* Making our code compatible with the latest pytorch

* revert

* torch.nn.utils.clip_grad_norm now returns tensor

More fixes for recent PyTorch (incl. topk issue) (#113)

More updates for PyTorch (#114)

Use ATen built-in conv_tbc method (#66)

Remove custom ConvTBC code

Small fixes

Filter padding properly in LabelSmoothedCrossEntropyCriterion (#229)

Allow more flexible pre-processing and generation (#227)

* Allow more flexible pre-processing and generation

* Addressing CR comments

* small fix

Enforce upper-bound on maximum generation length (#121)

fix typo in data/README

Change "awailable" to "available".

fix typo in data/README (#131)

Change "awailable" to "available".

Update training commands

Update training commands in data/README to match the latest version of this project according to #132.

- Motivation: in the previous data/README, the commands are obsolete and will cause the error "unrecognized arguments: --label-smoothing 0.1 --force-anneal 50".
- What's changed: add arguments "--criterion label_smoothed_cross_entropy" and "--lr-scheduler fixed" to the training commands of all 3 datasets.
- Result: the new commands run without error on all 3 datasets.

Update training commands

Update training commands in data/README to match the latest version of this project according to #132.

Continue from 3c07295885c6283def573e7a6811464f250c3b28: add omitted "\".

Update training command for IWSLT14

specify a single GPU setup for IWSLT14

Merge internal changes (#136)

Changes:
- 7d19e36: Add `--sampling` flag to generate.py to sample instead of doing beam search
- c777340: Add `scripts/average_checkpoints.py` to average multiple checkpoints into a combined model
- 3ea882c: Add `--max-update` option to train.py to stop training after a given number of updates
- small bugfixes for distributed training, LSTM, inverse square root LR scheduler

make interactive mode print out alignment nicely

Disallow --batch-size in interactive.py

Update README.md

use implicit padding when possible (#152)

Add pretrained embedding support (#151)

Flake8

Fix old model checkpoints after #151 (fixes #156) (#157)

Update dataset code for use by https://github.com/pytorch/translate/pull/62 (#161)

Merge internal changes (#163)

0.4.0 -> 0.5.0

Remove sweep_log prefix from json progress bar

Faster fconv generation

Fix LSTM

fix optim history

address comments

Add Transformer model

Remove Google batching strategy (it's not needed)

Pass args around to cleanup parameter lists

Bug fixes

Fix flake8

Fix buffers in sinusoidal positional embeddings

caching v3 (cache keys, values, process only last time step) (#241)

- process only last time step during generation
- cache keys and values
- don't apply masking during generation

smarter way to avoid applying encoder key mask

Use PyTorch LayerNorm and improve weight init

More improvements to weight init and FP16 support

Simulated big batches

Use FP32 for multi-head attention softmax

better batching

Improve dataloader speed and deprecate concept of batch_offset (use --sample-without-replacement instead)

Allow schedule for update-freq

Fix batching during generation

Add FP16 support

Make dictionary size a multiple of 8

Revert "Make dictionary size a multiple of 8"

This reverts commit b2e119c209363e6ff6d2878a69c7d1a507a2e9be.

Pad dictionary to be a multiple of 8 in preprocessing

No more magical --fp16

remove completed sentences from batch

remove completed sentences from batch and allow batching uneven lengths (with fixes to make padded sequences work correctly in all models)

Fix Flake8

Small optimization for LSTM

Fix preprocess.py

Use eval() to parse args.lr

Fix embedding initialization for padding

Simplify train.py (merge with singleprocess_train.py)

Save and restore wall time in checkpoints

Support --warmup-updates with fixed LR schedule

Fix tests

Remove src-padding from generation output

make sure tensor used to index is cuda if on gpu

Fix --prefix-size

fix to adding tokens to dictionary while thresholding

make attn dropout 0.1 default for big en-de transformer

add support for averaging last n checkpoints

fix flag copy paste (decoder-normalize-before)

Remove padding from --score-reference

Fix --remove-bpe to strip trailing BPE symbols

Sampling doesn't work with interactive

implement batching in interactive mode

fix alignment when using uneven batches and left pad

Support integer learning rates

allow specifying max_tokens for generation

also report sentence/s timing when generating

default dropout to correct value for big transformer

ability to checkpoint when reaching certain number of updates

All-reduce in FP16

remove unused verbose option & make arguments to averaging script nicer

allow overwriting args for different architectures

Fix tests

use implicit padding when possible

Add pretrained embedding support

Flake8

Fix old model checkpoints

Merge OSS + internal changes

Conv lm implementation

This implements the convolutional language model from https://arxiv.org/pdf/1612.08083.pdf

There are 3 modes for constructing batches (see the sketch after this list):

- token block: fill each sample with a specified number of tokens without regard for sentence delimiters - this is what was used for training in the paper
- complete: fill each sample with a specified number of tokens but make sure it contains only complete sentences (i.e. if next sentence goes over token block limit, move it to the next sample) - this was used for evaluation in the paper
- eos: one sentence per sample (skip blank lines)
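A minimal, illustrative sketch of the three modes, assuming sentences are plain Python lists of token IDs (this is not the actual TokenBlockDataset code):

```python
def make_samples(sentences, block_size, mode):
    """sentences: a list of token-ID lists, each ending with an EOS token."""
    if mode == "eos":
        # One sentence per sample; skip blank lines.
        return [s for s in sentences if len(s) > 0]
    samples, current = [], []
    for sent in sentences:
        if mode == "complete" and current and len(current) + len(sent) > block_size:
            # Only complete sentences per sample: start a new sample rather
            # than splitting the sentence that would overflow the block.
            samples.append(current)
            current = []
        current.extend(sent)
        while mode == "token" and len(current) >= block_size:
            # Fill each sample with exactly block_size tokens, ignoring
            # sentence boundaries.
            samples.append(current[:block_size])
            current = current[block_size:]
    if current:
        samples.append(current)
    return samples
```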

some results:

GCNN-13 - GBW - 37.46
GCNN-14B - GBW - 33.88
GCNN-8 - Wiki103 - 43.76
GCNN-14 - Wiki103 - 35.66

train:

python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 35 --save-interval 1 --save-interval-updates 1000 --keep-interval-updates 25 --arch fconv_lm --optimizer nag --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 --decoder-embed-dim 280 --decoder-layers '[(850, 6)] * 3 + [(850,1)] + [(850,5)] * 4 + [(850,1)] + [(850,4)] * 3 + [(1024,4)] + [(2048, 4)]' --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion cross_entropy --max-tokens 1024 --max-target-positions 1024 --seed 1 --log-format json --log-interval 500

eval:

python eval_lm.py ~abaevski/data/wiki103 --path '/checkpoint02/abaevski/2018-04-27/lm_wiki.fp16.mxup300000.fconv.adam.lrs=reduce_lr_on_plateau.emb280.layers(850,6)*3+(850,1)+(850,5)*4+(850,1)+(850,4)*3+(1024,1)+(2048,4).lr0.0005.clp0.1.drp0.3.wd0.0.crt=cross_entropy.mxtk2048.smptk256.seed1.ngpu8/checkpoint_last.pt'

default normalization constant for older models

add big en_fr transformer architecture

Generalize eval_str_list

fix restoring from middle of epoch; fix defaulting transformer dropout params

record end_of_epoch in checkpoint

use adaptive softmax only with adaptive loss

fix default params

added multiscale gated self attention layer with multiple heads, and pretrained fusion models

modified writing prompts model parameters to make readme cleaner

minor parameter fixes for stories model

save best val loss in checkpoint

save best val loss in checkpoint and also print best so far

this way, when training continues from an existing checkpoint, we don't immediately override checkpoint_best with a worse loss

fix model loading in eval_lm

Nits

Migrate all binaries to use options.parse_args_and_arch

Unify various sharding into ShardedIterator

Use symlinks for redundant checkpoints

Merge validate and val_loss functions (simplify train.py)

create examples dir and add conv lm + stories readme

Small fixes

Suppress stdout in test_train

Add more integration tests (LM, stories, transformer, lstm)

Update README.md

build optimizer only once, otherwise it leaks cuda memory

initialize normalization constant for fconv_lm

Fix length penalty when combined with --no-early-stop

Co-authored-by: pmichel31415 <pmichel@fb.com>

torch.arange default return type is changed in the latest pytorch version https://github.com/pytorch/pytorch/pull/7016

Add FairseqTask

A Task defines the data format, stores shared state (e.g., dictionaries) and provides helpers for building the model/criterion and calculating the loss.

Changes:
- Add TranslationTask and LanguageModelingTask. New tasks can be registered with the @register_task decorator (see the sketch after this list).
- Add EpochBatchIterator to encapsulate batching and saving/restoring dataloader position
- Remove LEFT_PAD_* constants and make them configurable per task
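A minimal sketch of what registering a new task looks like, assuming fairseq's `register_task` decorator and `FairseqTask` base class; the task name `dummy_copy` and the `load_dictionary_somehow` helper are hypothetical:

```python
from fairseq.tasks import FairseqTask, register_task

@register_task('dummy_copy')
class DummyCopyTask(FairseqTask):
    """A toy task: shared state (the dictionary) lives on the task object."""

    @staticmethod
    def add_args(parser):
        parser.add_argument('--data', help='path to data directory')

    @classmethod
    def setup_task(cls, args, **kwargs):
        dictionary = load_dictionary_somehow(args.data)  # hypothetical helper
        return cls(args, dictionary)

    def __init__(self, args, dictionary):
        super().__init__(args)
        self.dictionary = dictionary

    @property
    def source_dictionary(self):
        return self.dictionary

    @property
    def target_dictionary(self):
        return self.dictionary
```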

Updates for latest PyTorch

Fix tests

Fix bidirectional lstm

Faster generation when using a single model (rather than ensemble)

Change --path to be colon-separated instead of comma-separated

add default architecture for gbw fconv lm

add links to pretrained language models

Update README.md

Add transformer models and replace list with table

Update README.md

Fix preprocessed test set download links

add count for padding words (#180)

gzip instead of bzip change to stories download

added clarification on the newline token we model

fixed newline word not appearing

Fix translation README (fixes #186) (#189)

Fix `--output-format raw` option to preprocess.py (Fixes #188) (#190)

Two tiny changes to train/eval_lm. For train, fix an off-by-one; for eval_lm, make it work when the task is translation.

Ignore files in examples (other than .sh and .md)

When downloading files in examples directory (e.g. when running
`prepare-iwslt14.sh`), git sees them as untracked files but they should
not be committed.
Add a .gitignore script that ignore everything in the examples
subdirectories except for .sh and .md files.

Better failure message when loss explodes during FP16 training

Store full checkpoints instead of symlinking

Fix a bug with using GloVe 840B tokens for initialization.

respect max tokens and ignore invalid inputs when evaluating lm

sort descending when evaluating lm because it is faster (17k wps vs 11k) and will fail early if oom

Support pretrained embeddings for Transformer.

Also show a nicer error message.

Support FP16 during inference

fix sinusoidal embedding init size

add sinusoidal pos initialization

Fix interpretation of --max-epoch

Move reorder_encoder_out to FairseqEncoder and fix non-incremental decoding

Add steps to reproduce WMT En-De results from Scaling NMT paper

Fix typo

Fix for Dictionary.finalize

Misc changes for pytorch-translate

Fix attention order in unit tests (fixes #195) (#197)

Remove more Variable() calls (#198)

Remove unnecessary assert (fixes #199) (#200)

Fix preprocessing for WMT14 En-De to replicate Scaling NMT paper (#203)

adding pretrained stories model

fix decoder_normalize_before typo (#205)

add model override argument from load_ensemble_for_inference at generation time, updating readme for stories

adding model arg override at generation time for interactive.py

assert that vocab size >= adaptive softmax cutoff (#214)

Fix up model defaults (#211)

Pass sampling-temperature trough to the generator in interactive.py

stories data preprocessing needs padding factor 1 to match pretrained model, updating readme

fixed output_proj's input_dim in attention (#226)

fix raw text for language modeling

make model access saner

fix token block rotation

Support tied embeddings in LSTM encoder/decoder

disable printing alignment by default (for perf) and add a flag to enable it

default need_attn to False

Fix bug when --share-all-embeddings but no --encoder-embed-path

Iterate on need_attn and fix tests

Output positional scores in interactive.py

Don't compute unnecessary attention averages during training

Transformer lm

This implements a transformer-based language model. It already obtains better perplexity on WikiText-103 without any tuning. I will also train it on GBW, where I also expect to get better ppl.

Example training command:

python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 80 --save-interval 1 --arch transformer_lm --task language_modeling --optimizer nag --lr 0.008 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.6 --dropout 0.2 --criterion adaptive_loss --adaptive-softmax-cutoff 10000,50000,200000 --max-tokens 512 --tokens-per-sample 512 --seed 1 --sample-break-mode none --log-format json --log-interval 50 --save-interval-updates 2500 --keep-interval-updates 25
A small transformer got to 31.3 ppl on WikiText-103 (compared to 35 with fconv), while @myleott got a big transformer LM to around 27 ppl on WikiText-103.

remove right-to-left lm support

default decoder_learned_pos for lm

option to print language model words and their log probs during evaluation

Update IWSLT configuration for transformer

Don't use 0-dimensional buffers in sinusoidal positional embeddings

Fix comment

Merge internal changes

Add load_optim option to load checkpoint but not optimizer state (#229)

Correct path in the pre-processing example (#230)

Correct the help name of the prefixes arguments (#234)

Fix bug when training with FP32 and --update-freq (#236)

Add ensemble for different architectures (#235)

add end-of-stack normalizations in case normalize_before has been set (#244)

Fix comment

Fix bidirectional LSTM concatenation (#249)

fix adaptive softmax indexing

option for a smaller adaptive softmax

character token embeddings for word level predictions

remove unneeded defaults

Always smaller soft

no need to have half-size option as behavior can be reproduced with existing flags

make adaptive softmax dropout an optional arg

add flag that allows keeping optimizer config

adds -reset-optimizer, --reset-lr-scheduler, and --optimizer-overrides flags

fix tests

make batching faster for monolingual dataset

load args from model for eval_lm

parameters to separate input/inner/out dims

cosine + triangular lr scheduler

Factor out search logic in SequenceGenerator

Reset gnorm after each epoch

fix tests

Increase max buffer size in all_gather_list

script to read binarized data

Move read_binarized.py to scripts/

Warn when using FP16 on pre-Volta GPUs

add warmup support back to cosine lr sched (important for mt)

Diverse Beam Search

Remove --normalization-constant from fconv

Fix adaptive softmax cutoff comment

disable final layer norm for transformer decoder as it makes things worse

Add training wall time meter

Old checkpoints can't be loaded because of a new meter

word stats in eval_lm

Merge internal changes

Fix FP16 version comparison

dont send dummy batch when reloading from checkpoint

also don't crash if a param does not receive grads

Add adaptive softmax changes for lstm model

Add --upsample-primary

Clean up FairseqTask so that it's easier to extend/add new tasks

fix max_positions comparison

Fix comment

Further generalize EpochBatchIterator and move iterators into new file

fix cosine lr sched for t_mult=1 with warmup

Test max_positions

Misc changes to simplify upcoming tutorial

Add documentation

Update documentation

modified stories readme to include sample preprocessing code to split stories to 1k tokens

Fix readme

Fix docs

Readme fix

Update readme with WMT'18 model (#433)

Generator: net_input instead of manual src_tokens.

Sequence generator bug fix.

Revert sequence generator changes

Switch to DistributedDataParallelC10d and bump version 0.5.0 -> 0.6.0

- no more FP16Trainer, we just have an FP16Optimizer wrapper
- most of the distributed code is moved to a new wrapper class called DistributedFairseqModel, which behaves like DistributedDataParallel and a FairseqModel at the same time
- Trainer now requires an extra dummy_batch argument at initialization, which we do fwd/bwd on when there's an uneven number of batches per worker. We hide the gradients from these dummy batches by multiplying the loss by 0 (see the sketch after this list)
- Trainer.train_step now takes a list of samples, which will allow cleaner --update-freq
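A minimal sketch of that dummy-batch trick, with hypothetical names (this is not the actual Trainer code):

```python
def train_step(model, criterion, sample, is_dummy_batch):
    # Every worker must run the same number of forward/backward passes, or the
    # gradient all-reduce will hang. A dummy batch goes through the same path...
    net_output = model(**sample['net_input'])
    loss = criterion(net_output, sample['target'])
    if is_dummy_batch:
        loss = loss * 0.0  # ...but its gradients are hidden by zeroing the loss
    loss.backward()
```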

Disable c10d for AdaptiveLoss

Update LM test with --no-c10d

Pass encoder_input to generator, rather than src_tokens/src_lengths.

Fix validation loss

Add unit test to verify reproducibility after reloading checkpoints

Fix adaptive loss logging

Parallel preprocessing

Fix type of c10d bucket size

Better support for various c10d API changes

core changes to support latte collab

fix issue with truncated dict

Merge internal changes

Add back secondary set

Online backtranslation module

Co-authored-by: liezl200 <lie@fb.com>

fbshipit-source-id: 6a835d32f9dc5e0de118f1b46d365d0e0cc85e11

fbshipit-source-id: 17992f6a5908f078942544b769eda7a340a5e359

Merge internal changes (#295)

Summary:
Changelog:
- `90f52a1`: Support loading subsets of the data on each worker with the `--fix-batches-to-gpus` flag. This should fix #217 and #266.
- `6eda0a9`: Update README for replicating the "Scaling Neural Machine Translation" paper
- `b14c7cf`: Fallback to no_c10d backend for pytorch 0.4.1 (fixes #294)
Pull Request resolved: https://github.com/pytorch/fairseq/pull/295

Differential Revision: D10121559

Pulled By: myleott

fbshipit-source-id: 41c84d0ee4cdd113544b5d3aa38ae8b23acc2c27

Merge internal changes

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/296

Differential Revision: D10121830

Pulled By: alexeib

fbshipit-source-id: 1b73430bdfdcb20a9a6123abfca3472a0d307b3b

Explicitly list out generation args for backtranslation dataset

Summary:
Using argparse Namespace hides the actual args that are expected and makes code harder to read.

Note the difference in style for the args list

    def __init__(
        self,
        tgt_dataset,
        tgt_dict,
        backtranslation_model,
        unkpen,
        sampling,
        beam,
        max_len_a,
        max_len_b,
    ):

instead of

    def __init__(
        self, tgt_dataset, tgt_dict, backtranslation_model, unkpen, sampling,
        beam,  max_len_a, max_len_b,
    ):

Reviewed By: dpacgopinath

Differential Revision: D10152331

fbshipit-source-id: 6539ccba09d48acf23759996b7e32fb329b3e3f6

Update README.md

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/300

Differential Revision: D10154711

Pulled By: edunov

fbshipit-source-id: 859d1ac59923b67c1547b6f7acb94f801b0c3318

Pass in kwargs and SequenceGenerator class to init BacktranslationDataset

Summary: This generalizes BacktranslationDataset to allow us to use any SequenceGenerator class. For example, if we want to use this model in PyTorch Translate, we can pass the following to BacktranslationDataset init: (1) a PyTorch Translate SequenceGenerator class as generator_class and (2) the appropriate args for initializing that class as kwargs.

Reviewed By: xianxl

Differential Revision: D10156552

fbshipit-source-id: 0495d825bf4727da96d0d9a40dc434135ff3486c

Fix proxying in DistributedFairseqModel

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/302

Differential Revision: D10174608

Pulled By: myleott

fbshipit-source-id: 4e2dfc76eae97afc5488f29b47e74f9897a643ff

Option to remove EOS at source in backtranslation dataset

Summary:
If we want our parallel data to have EOS at the end of source, we keep the EOS at the end of the generated source dialect backtranslation.
If we don't want our parallel data to have EOS at the end of source, we **remove** the EOS at the end of the generated source dialect backtranslation.

Note: we always want EOS at the end of our target / reference in parallel data so our model can learn to generate a sentence of arbitrary length. So we make sure that the original target has an EOS before returning a batch of {generated src, original target}. If the original targets in the tgt dataset don't have an EOS, we append EOS to each tgt sample before collating.
We only do this for the purpose of collating a {generated src, original tgt} batch AFTER generating the backtranslations. We don't enforce any EOS before passing tgt to the tgt->src model for generating the backtranslation. Users of this dataset are expected to format tgt dataset examples in the format that the tgt->src model expects.
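A small illustrative sketch of those two EOS rules (a hypothetical helper, not the BacktranslationDataset code itself):

```python
import torch

def prepare_pair(generated_src, original_tgt, eos, remove_source_eos):
    # Optionally strip EOS from the generated source dialect backtranslation.
    if remove_source_eos and generated_src[-1].item() == eos:
        generated_src = generated_src[:-1]
    # Always make sure the original target ends with EOS before collating.
    if original_tgt[-1].item() != eos:
        original_tgt = torch.cat([original_tgt, original_tgt.new_tensor([eos])])
    return {'source': generated_src, 'target': original_tgt}
```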

Reviewed By: jmp84

Differential Revision: D10157725

fbshipit-source-id: eb6a15f13c651f7c435b8db28103c9a8189845fb

multihead_attention: pre-transpose incremental state (#232)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/232

Though transpose operations are essentially free during PyTorch execution, they can result in costly operations when exported to Caffe2 inference nets via ONNX tracing, especially when applied repeatedly to large tensors.

For this reason, we update `MultiheadAttention` to store its incremental state with shape (bsz, num_heads, seq_len, head_dim), that is, after transposing the projected input. This should result in non-trivially faster exported models without changing the semantics or speed of PyTorch execution.
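An illustrative sketch of the shape bookkeeping described above (not the actual MultiheadAttention code):

```python
import torch

bsz, num_heads, head_dim, step_len = 2, 4, 16, 1
# Projected keys for the current step: (step_len, bsz * num_heads, head_dim)
k = torch.randn(step_len, bsz * num_heads, head_dim)

# Old layout: keep (bsz * num_heads, seq_len, head_dim) and transpose on use.
# New layout: transpose once and cache as (bsz, num_heads, seq_len, head_dim),
# so the exported graph never has to transpose the growing cache again.
k_cached = k.transpose(0, 1).contiguous().view(bsz, num_heads, step_len, head_dim)

# Appending the next step only needs a concat along the seq_len dimension.
k_next = torch.randn(1, bsz * num_heads, head_dim)
k_cached = torch.cat(
    [k_cached, k_next.transpose(0, 1).contiguous().view(bsz, num_heads, 1, head_dim)],
    dim=2,
)
print(k_cached.shape)  # torch.Size([2, 4, 2, 16])
```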

Reviewed By: myleott

Differential Revision: D10186506

fbshipit-source-id: 8a42712423ee767ea49ed88d2a4653f900d14fba

Have noising account for sentences with and without EOS (#305)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/305

Previously, noising code assumed that every sentence had an EOS which had to be excluded from noising operations (since we shouldn't drop, blank, or shuffle EOS). This logic allows the noising module to handle sentences with EOS and without EOS

Reviewed By: xianxl

Differential Revision: D10114425

fbshipit-source-id: 04ec8547343eb94266bda1ac7fca3d8a1991c9f4

Add denoising dataset for denoising autoencoder (#306)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/306

This uses a source dataset to generate a batch of {source: noisy source, target: original clean source} which allows us to train a denoising autoencoding component as part of a seq2seq model.

Reviewed By: xianxl

Differential Revision: D10078981

fbshipit-source-id: 026225984d4a97062ac05dc3a36e79b5c841fe9c

fix make_positions() typo (#316)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/316

This code should actually be keeping the padded positions as `padding_idx` (though note that this is on the ONNX export path, and it has no effect in the most common case when using the exported network to do un-batched inference).

Reviewed By: myleott

Differential Revision: D10431872

fbshipit-source-id: 79fe4ac27cafcd4701e0f2a90e29d1b7362dc6f8

Update upgrade_state_dict in transformer.py to upgrade_state_dict_named (#317)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/317

When upgrading the `state_dict` variable, the `upgrade_state_dict` function in TransformerEncoder/TransformerDecoder doesn't handle multiple encoders/decoders; however, D10052908 introduces exactly that case.

Before this change, we hit error message [1] when loading a checkpoint for the multilingual_transformer model in D10052908. This diff fixes it.

Reviewed By: myleott, liezl200

Differential Revision: D10375418

fbshipit-source-id: 7104c1a463e78f3fa33d8479a37c51608be50610

Manually port pull request 385

Summary:
Manually port fairinternal fairseq-py pull request #385 [1] to fbcode.

Resolve the merge conflict of removing fp16_trainer per offline discussion with Myle. Also updated the code to make generate.py work.

[1] https://github.com/fairinternal/fairseq-py/pull/385/commits/18fa6e154781cf0c4b1596429dba7e753a545069

Reviewed By: liezl200

Differential Revision: D10052908

fbshipit-source-id: c3c378d78dc1e9ac087c815f359e78c0048ff2f5

Fix another distributed syncing issue

Summary:
This is another failure due to distributed GPUs getting out of sync.
We run save_and_eval (which has the inter-GPU communication calls) based on
the number of updates. But the number of updates means weight updates: whenever
there is an issue in training and weights can't be updated, nodes go
out of sync and start failing. So we should check the number of iterations instead.

I am, again, making a small change to save the day, but we should decouple/refactor
the save_and_eval logic from training, to have fewer headaches in the future.
I plan to work on that later, but this should solve some of the
issues for now.

Reviewed By: jhcross

Differential Revision: D10478427

fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c

Expose BacktranslationDataset from fairseq.data (#324)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/324

BacktranslationDataset was introduced recently but was not exposed as part of the fairseq.data module

Reviewed By: liezl200

Differential Revision: D10412717

fbshipit-source-id: 8a9d4ecd43fd376e895c450d00e765a869c95eff

Add size method to BacktranslationDataset + misc fixes (#325)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/325

RoundRobinZipDataset requires size(index) method implemented in every dataset used. Also added missing return statements in a few methods.

Reviewed By: liezl200

Differential Revision: D10457159

fbshipit-source-id: 01856eb455f2f3a21e7fb723129ff35fbe29e0ae

make fairseq models compatible with character inputs and use character inputs for elmo in pytext

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/321

Reviewed By: alexeib

Differential Revision: D10430186

fbshipit-source-id: 9cc8fe0f202cc49370cecf36312bcc9bf0b4deee

LanguagePairDataset and BacktranslationDataset changes for semi supervised task setup (#330)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/330

As part of the semi-supervised task setup (https://github.com/pytorch/translate/pull/243), this diff adds the ability for LanguagePairDataset to remove EOS from the source or append EOS to the target. This functionality is required by BacktranslationDataset to use translations as source data.

Also added changes to BacktranslationDataset to make it work on GPU. We needed to transfer back-translated sentences back to CPU for the LanguagePairDataset to collate.

Reviewed By: liezl200

Differential Revision: D10846294

fbshipit-source-id: b015ecb5fcef26fba507c30f8a4992bdbc54899f

Fix print & add more informative logging

Summary: Fix fairseq's `force` option for disabling print suppression (otherwise, `print(..., force=True)` fails on master since the force kwarg gets passed to the builtin print).

Reviewed By: dpacgopinath

Differential Revision: D10522058

fbshipit-source-id: bbc10c021a7d21396ebfbb1bf007f6b9b162f4fd

Extend WordShuffle noising function to apply to non-bpe tokens

Summary:
We'd like to reuse the noising functions and DenoisingDataset in
adversarial training. However, the current noising functions assume the input
consists of subword tokens. The goal of this diff is to extend them so the noising
can be applied to word tokens. Since we're mostly interested in word shuffle
noising, I only modified the WordShuffle class.

Reviewed By: liezl200

Differential Revision: D10523177

fbshipit-source-id: 1e5d27362850675010e73cd38850c890d42652ab

transformer onnx trace: skip no-op transpose (#333)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/333

A tiny hack to speed up inference slightly for transformer beam search after export to graph mode. Specifically, there is no need to transpose a dimension with size 1 (the sequence length of a single decoder time step during beam search) with its neighbor immediately before a view/reshape.
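A tiny illustrative sketch of the idea (hypothetical helper, not the actual transformer code): when the traced dimension has size 1, skipping the transpose produces the same reshaped result, so the op never appears in the exported graph.

```python
import torch

def flatten_step(x):
    # x: (seq_len, bsz, embed_dim); during incremental beam search seq_len == 1
    if x.size(0) != 1:
        x = x.transpose(0, 1)            # (bsz, seq_len, embed_dim)
    return x.reshape(-1, x.size(-1))     # identical result when seq_len == 1
```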

Reviewed By: jmp84

Differential Revision: D12833011

fbshipit-source-id: f9c344a9ad595e6e48a8a65b31cf2b1392f9b938

match examples/stories/writingPrompts scripts to correct folder

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/290

Differential Revision: D12876759

Pulled By: myleott

fbshipit-source-id: 9f6d1c9de27dad29368a7edb923dfcf770355938

Update bleu.py (#320)

Summary:
Modify Error message of bleu.
Fix the issue:  https://github.com/pytorch/fairseq/issues/284
Pull Request resolved: https://github.com/pytorch/fairseq/pull/320

Differential Revision: D12876721

Pulled By: myleott

fbshipit-source-id: df25885a94a584cbf4b86a1665e3e513c7eb8e9a

Fix tests + style nits + Python 3.5 compat

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/336

Differential Revision: D12876709

Pulled By: myleott

fbshipit-source-id: a31536e2eb93f752600b9940c28e9b9fcefc8b86

Denoising autoencoder task (#251)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/251

We should use shared encoder and separate decoders as in:

https://fb.facebook.com/groups/2156114531381111/permalink/2169028113423086/

Generation is a hack; ideally, the net input should have the lang-pair info so that when we pass the sample to the model, it can select the correct encoder/decoder pair.

diff [2/2] will be for flow integration for basic experimentation

TODO in a future diff: figure out how to generalize this so export will work??

This works with vocab reduction, but we only support vocab reduction for src-tgt, not src-src model. A future (lowpri) task could be to add word prediction vocab reduction for src-src model to speed up training.

Reviewed By: xianxl

Differential Revision: D10512576

fbshipit-source-id: 545d96cad8e814b9da7be102a48cc5cac358b758

Move fairseq part of D10478427 directly into pytorch-translate (#337)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/337

Pull Request resolved: https://github.com/pytorch/translate/pull/250

Reviewed By: akinh

Differential Revision: D12880352

fbshipit-source-id: 61e9888a9cc3df07e805820b74a5fcf359dfe0ea

Fix "ignore-case" behavior (#339)

Summary:
Currently, if `ignore-case` is set, the same line will be yielded twice (once as the lower-cased version, once as the original version), leading to lower-than-expected uncased scores.
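An illustrative sketch of the intended behavior (a hypothetical reader, not the actual scoring code): each line is yielded exactly once, lower-cased when `ignore-case` is set.

```python
def read_lines(path, ignore_case):
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            # Yield one version per input line, never both.
            yield line.lower() if ignore_case else line
```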
Pull Request resolved: https://github.com/pytorch/fairseq/pull/339

Differential Revision: D12890386

Pulled By: myleott

fbshipit-source-id: 0570e5f6e8f848f2c6439d615e70aca6df097eef

Black formatting in fairseq/test_noising (#341)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/341

Use black formatting in test_noising.py

Reviewed By: xianxl

Differential Revision: D12810285

fbshipit-source-id: 5517dd5d2f086831f487d88acf6bc2fa18820297

Refactor fairseq/test_noising with a word shuffle helper function (#340)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/340

This allows us to do a lot less copy paste when adding new word shuffle function tests

Reviewed By: xianxl

Differential Revision: D12810304

fbshipit-source-id: a56b5df093d17be2b73837897c526978cab92b70

Support BPE end of word marker suffix in fairseq noising module

Summary:
There are 2 ways to implement BPE:
1. use a continuation marker suffix to indicate that there is at least one more subtoken left in the word
2. use an end-of-word marker suffix to indicate that there are no more subtokens left in the word

This adds some logic to account for either kind of BPE marker suffix. This diff adds a corresponding test. I also refactored the test setup to reduce the number of boolean args when setting up test data.
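A small illustrative helper showing how word boundaries can be recovered under either marker style; the marker strings "@@" and "</w>" are just example conventions, not necessarily what the test data uses.

```python
def word_lengths(subtokens, marker, marker_is_end_of_word):
    """Return the number of subtokens in each word."""
    lengths, count = [], 0
    for tok in subtokens:
        count += 1
        if marker_is_end_of_word:
            # End-of-word marker: the suffix means the word is complete.
            ends_word = tok.endswith(marker)
        else:
            # Continuation marker: the suffix means more subtokens follow.
            ends_word = not tok.endswith(marker)
        if ends_word:
            lengths.append(count)
            count = 0
    if count:
        lengths.append(count)
    return lengths

print(word_lengths(["he@@", "llo", "world"], "@@", marker_is_end_of_word=False))      # [2, 1]
print(word_lengths(["he", "llo</w>", "world</w>"], "</w>", marker_is_end_of_word=True))  # [2, 1]
```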

Reviewed By: xianxl

Differential Revision: D12919428

fbshipit-source-id: 405e9f346dce6e736c1305288721dfc7b63e872a

Merge internal changes

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/352

Differential Revision: D12956930

Pulled By: myleott

fbshipit-source-id: 39334a79544bac570feb04be9103269d7c1563f9

Fix error when training multilingual_translation task with multi-GPU

Summary:
D10052908 introduced the multilingual_translation task, but it raises an exception when training with multiple GPUs: P60202593

With Myle's help, we found that it is caused by an improperly handled dummy batch data type, which causes optimizer.backward() to not be executed the same number of times across different GPUs.

Reviewed By: xianxl

Differential Revision: D12964263

fbshipit-source-id: 4991039030bf373f0c484e131acc4736487be4d8

pipeline for LM training

Summary:
Step 2 of the pipeline for LM training.
Assumes tokenized text data as input; splits it into train/validation/test and runs binarization
(step a_ii in https://fb.quip.com/kazzAxvZHBj9)

Reviewed By: borguz

Differential Revision: D10454705

fbshipit-source-id: 74e8679041f5507c4e404c1b719547c2ae9ed983

Support for BPE vocabs + denoising autoencoder in PyTorch Translate (#362)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/362

Pull Request resolved: https://github.com/pytorch/translate/pull/254

This actually uses the fairseq logic which supports BPE cont / end word marker suffixes.

Reviewed By: xianxl

Differential Revision: D12952766

fbshipit-source-id: 35a1bbc38240e4145bec0fc419f2d0a6a73ae2e5

Fix dummy batch when --max-tokens is small (fixes #347)

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/366

Differential Revision: D13058513

Pulled By: myleott

fbshipit-source-id: a146d2cfb345d404775ed8d6b8e4a4ad4e7a33b4

make dictionary optional

Reviewed By: jingfeidu

Differential Revision: D13104360

fbshipit-source-id: 9636f5ee2721818f98b33af559fa24292534a72f

Add LegacyDistributedDataParallel in place of no_c10d (#370)

Summary:
This should bring back the speedup with --update-freq that we reported in the Scaling Neural Machine Translation paper.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/370

Differential Revision: D13100281

Pulled By: myleott

fbshipit-source-id: 4a81b51bb7390a197add314a4be5512bbf68c085

Fix build for docs

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/372

Differential Revision: D13114426

Pulled By: myleott

fbshipit-source-id: 6c24b96a3556a0ecd3d1f350642a884254a40bd3

Merge small fixes from internal

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/374

Differential Revision: D13116074

Pulled By: myleott

fbshipit-source-id: 485724cc5a40e8360d21e4bf9c35821baa0ddc57

Protect against failures in case of OOMs

Summary: Fixing some distributed failures that happen when OOMs are observed.

Reviewed By: myleott

Differential Revision: D13121054

fbshipit-source-id: f71a0a695332acbaa1797e89887b8b7c7ddaa727

Refactor BacktranslationDataset to be more reusable (#354)

Summary:
- generalize AppendEosDataset -> TransformEosDataset
- remove EOS logic from BacktranslationDataset (use TransformEosDataset instead)
- BacktranslationDataset takes a backtranslation_fn instead of building the SequenceGenerator itself
Pull Request resolved: https://github.com/pytorch/fairseq/pull/354

Reviewed By: liezl200

Differential Revision: D12970233

Pulled By: myleott

fbshipit-source-id: d5c5b0e0a75eca1bd3a50382ac24621f35c32f36

Fix some recursive functions (e.g., reorder_incremental_state) to only touch each module once (#379)

Summary:
This can happen if a module is registered in more than one place in the network.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/379

Differential Revision: D13154498

Pulled By: myleott

fbshipit-source-id: a35575d1956a46cd35ac8b16a719ad20ac3e380a

onnx bi-transformer (#385)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/385

Pull Request resolved: https://github.com/facebookresearch/pytext/pull/6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/14292

Reviewed By: jingfeidu

Differential Revision: D10517864

fbshipit-source-id: 81008b5cc6aab70e23329c187392fb72ee057d78

Decoder embedding sharing in PyTorch Translate for denoising autoencoder (#386)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/386

Pull Request resolved: https://github.com/pytorch/translate/pull/266

This allows decoder embedding sharing for denoising autoencoder modules with different decoders (one for src decoding and one for tgt decoding)

Reviewed By: dpacgopinath

Differential Revision: D13133015

fbshipit-source-id: 3c98be639d705744ccf5ba3a8fd7d10ddc7aef4a

Fix --ddp-backend=no_c10d for params that don't require grads

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/388

Reviewed By: theweiho

Differential Revision: D13244869

fbshipit-source-id: d22c18f63f9a691ccc7245e06bc9a5b776a192b5

fixes on bi-transformer onnx

Summary: replace dynamic index put with copying and creating a new tensor

Reviewed By: wanchaol

Differential Revision: D13244573

fbshipit-source-id: 909f7913ad579ed035f29bb52321ff01e09a2c60

fixed torch 0.4.0 , "RuntimeError: Expected object of type torch.cuda… (#393)

Summary:
….LongTensor but found type torch.cuda.FloatTensor for argument #3 'index'" error.

With torch.__version__ == 0.4.0,
new_order = torch.arange(bsz).view(-1, 1).repeat(1, beam_size).view(-1)
returns a float dtype Tensor; when line 321 of fairseq/fairseq/models/fconv.py is executed, this throws a RuntimeError.
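A minimal sketch of the fix idea: make the index dtype explicit so the code works both on torch 0.4.0 (float default) and on later versions.

```python
import torch

bsz, beam_size = 3, 5
# Cast to long explicitly; index_select requires an integer index tensor.
new_order = torch.arange(bsz).view(-1, 1).repeat(1, beam_size).view(-1).long()
encoder_out = torch.randn(bsz, 7)
expanded = encoder_out.index_select(0, new_order)  # (bsz * beam_size, 7)
```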
Pull Request resolved: https://github.com/pytorch/fairseq/pull/393

Differential Revision: D13276496

Pulled By: myleott

fbshipit-source-id: e7986246fbe2c79fff61bcab0e5bec9dd63e0afd

Better error message if workers fall out of sync (#396)

Summary:
This kind of issue should be rare, but the exception that was thrown before ("UnpicklingError: invalid load key") was very opaque, so let's use something a bit clearer.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/396

Differential Revision: D13325600

Pulled By: myleott

fbshipit-source-id: 2e7093752d45d6b04a3d506aca8d5694b72ab638

Enable check_reduction for imagenet flow and fairseq

Summary:
As the title says, better to enable this for certain use cases to make
sure things are right

Reviewed By: myleott, pietern

Differential Revision: D13351753

fbshipit-source-id: cf495960fda71ebd679c23212e19703c93a9dbdc

Add check that --encoder-layers matches --decoder-layers for LSTM (fixes #394)

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/398

Differential Revision: D13358876

Pulled By: myleott

fbshipit-source-id: 57673f2643aac01492cb8f5728bb9f1a34ba6aa7

Fix arg formatting in preprocess.py and add fmt control for black formatting (#399)

Summary:
Not switching to Black formatting just yet, but adding fmt: off directives in case we decide to later.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/399

Differential Revision: D13364674

Pulled By: myleott

fbshipit-source-id: a20a11a18be3d583ee30eff770278fb4bd05b93c

Warn when using --update-freq on a single machine and --ddp-backend != no_c10d

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/400

Differential Revision: D13366996

Pulled By: myleott

fbshipit-source-id: b4907815e7cc1b4a2aceab11210bf64cb3d814c9

Take a dummy train step under OOM to keep multiprocessing in sync

Summary: This is not a guaranteed solution (since processes may still get out of sync if OOM happens after an all_gather/all_reduce has been done) - but should still make multiprocessing training more robust in practice since it seems we usually OOM early enough.

Reviewed By: myleott

Differential Revision: D13086018

fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee

Add --fp16-scale-tolerance (#397)

Summary:
Let's only decrease the loss scale if a large enough percentage of batches overflow.
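A minimal, illustrative loss-scaler sketch of that tolerance rule (not fairseq's actual DynamicLossScaler; the default constants are placeholders):

```python
class TolerantLossScaler:
    def __init__(self, init_scale=2.0 ** 7, scale_window=256, tolerance=0.05):
        self.scale = init_scale
        self.scale_window = scale_window
        self.tolerance = tolerance
        self._iter = 0
        self._last_rescale_iter = -1
        self._overflows_since_rescale = 0

    def update_scale(self, overflow):
        self._iter += 1
        if overflow:
            self._overflows_since_rescale += 1
            pct = self._overflows_since_rescale / (self._iter - self._last_rescale_iter)
            # Only halve the scale once enough recent batches have overflowed.
            if pct >= self.tolerance:
                self.scale /= 2.0
                self._last_rescale_iter = self._iter
                self._overflows_since_rescale = 0
        elif (self._iter - self._last_rescale_iter) % self.scale_window == 0:
            # No overflow for a full window: try a larger scale again.
            self.scale *= 2.0
```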
Pull Request resolved: https://github.com/pytorch/fairseq/pull/397

Differential Revision: D13355159

Pulled By: myleott

fbshipit-source-id: e17dde73d34a639519b4348c013fdd19d2b314e6

fix data checking report bug (#403)

Summary:
The original code reports the size of a valid sample instead of an invalid one when raising an Exception, which is confusing.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/403

Differential Revision: D13391431

Pulled By: myleott

fbshipit-source-id: 4642ed027c0f664424fc5a9baf4363791144feaf

Loading PreTrained Models (#406)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/406

Static helper function in TranslationTask to load pretrained models

Reviewed By: myleott

Differential Revision: D13345276

fbshipit-source-id: 3a675ee1a144ceb8b010f30e1a6163ef670b53f3

data per gpu change

Summary: Avoid loading entire data set per gpu to reduce memory footprint

Reviewed By: rutyrinott

Differential Revision: D13163548

fbshipit-source-id: 4ba717c8021ba5723d02225bae5782e2c3a18640

Add BufferedIterator (#419)

Summary:
This improves performance for datasets that load data lazily. Enabled by default since it shouldn't compromise performance for non-lazy datasets.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/419

Differential Revision: D13546585

Pulled By: myleott

fbshipit-source-id: f6152e2047291b0d68cd7506cd772b0caafe95be

Improve memory efficiency of FP16 optimization (#404)

Summary:
Previously when training with --fp16, we stored a copy of the model parameters in FP32 for optimization, which consumed a lot of memory. An alternative is to just do the conversions to FP32 on the fly, which allows the caching allocator to reuse/save some memory.

This reduces peak memory usage by ~20% with a negligible reduction in training speed (~2% slower) when training a big transformer on 8 GPUs on wmt en-de with --update-freq=16.

This does not affect convergence, i.e., models will train exactly as they did before.
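An illustrative sketch of the on-the-fly conversion idea using plain SGD (not the actual fairseq FP16 optimizer):

```python
import torch

def sgd_step_fp16(params, lr, grad_scale):
    for p in params:
        if p.grad is None:
            continue
        p32 = p.data.float()                      # temporary FP32 copy
        g32 = p.grad.data.float().div_(grad_scale)
        p32.add_(g32, alpha=-lr)                  # the update happens in FP32
        p.data.copy_(p32)                         # write back into FP16 storage
        # p32 / g32 go out of scope here, so the caching allocator can reuse
        # their memory instead of holding a persistent FP32 master copy.
```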
Pull Request resolved: https://github.com/pytorch/fairseq/pull/404

Differential Revision: D13394376

Pulled By: myleott

fbshipit-source-id: 2b9f808548df4782110513c9cfc9f7c6159bcbbf

Add option to disable positional embeddings in TransformerModel (#421)

Summary:
Add argument `--no-token-positional-embeddings` to TransformerModel (currently only available in TransformerLanguageModel) to disable positional embeddings.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/421

Differential Revision: D13548450

Pulled By: myleott

fbshipit-source-id: b352c702ed1609e3b84d9a8404941d3274a7f883

Merge internal changes (#422)

Summary:
- 04cc608: Add `--match-source-len` option to generate.py for sequence-tagging tasks
- 19f1a40: Add `--no-repeat-ngram-size` option to generate.py for ngram blocking
Pull Request resolved: https://github.com/pytorch/fairseq/pull/422

Differential Revision: D13548445

Pulled By: myleott

fbshipit-source-id: 26d1ae83993e428fcb020dac5ae358b0e36233d9

Fix backtranslation dataset on IndexedCachedDataset (#410)

Summary:
BacktranslationDataset would throw an error when the underlying dataset was an IndexedCachedDataset because prefetching was not handled correctly. This fixes the error.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/410

Differential Revision: D13557539

Pulled By: myleott

fbshipit-source-id: 398ab59a3ebdbf1c666d862b9f905654eece800c

Fix resuming from FP16 checkpoints (#424)

Summary:
This was broken in 03a57de.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/424

Differential Revision: D13557540

Pulled By: myleott

fbshipit-source-id: 62deda5353032aff20d35d046b0bb843da44d27c

Make multiprocessing_train.py work with multi-node setups

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/425

Differential Revision: D13558340

Pulled By: myleott

fbshipit-source-id: dff8c77027e821d8c80bfbd6a6ccce9ca1a44b78

Merge internal changes (#283)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/283

Pull Request resolved: https://github.com/pytorch/fairseq/pull/428

Differential Revision: D13564190

Pulled By: myleott

fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5

rm fb_train.py (#432)

Cleanup more files

Update docs for --lazy-load and torch.distributed.launch

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/433

Differential Revision: D13588032

Pulled By: myleott

fbshipit-source-id: 0e5ff361e27b206c4490264f0f51863367499e81

Fix broken link in README.md (#436)

Summary:
https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset is no longer valid; it redirects to a blog post listing page.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/436

Differential Revision: D13607961

Pulled By: myleott

fbshipit-source-id: 1a1074ffcbc454e29bc9d5aed84fdf2089a224bc

Misc fixes

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/439

Differential Revision: D13608151

Pulled By: myleott

fbshipit-source-id: 198b84995a6329f8329829cc91184d88f1eab947

Make error message for trying to train after make_generation_fast work correctly

Summary: https://github.com/pytorch/fairseq/blob/master/fairseq/trainer.py#L164 calls `train()` without any argument

Reviewed By: myleott

Differential Revision: D13599203

fbshipit-source-id: 3a096a6dd35a7a3f8309fbda3b54a36f606475e3

Fixes (#442)

Summary:
minor fixes:
1- adding fairseq logo
2- encoder padding for fconv self att
3- legacy ddp change
Pull Request resolved: https://github.com/pytorch/fairseq/pull/442

Differential Revision: D13651715

Pulled By: myleott

fbshipit-source-id: ac93c80f1dbffdfe03fbd4b8a8ea527aecb576a7

New command line option '--user-dir' (#440)

Summary:
Following discussion on official fairseq (https://github.com/pytorch/fairseq/issues/438), I added the `--user-dir` option to the command line. The user can now specify a path in order to import a custom module with proprietary tasks, architectures and so on.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/440

Differential Revision: D13651721

Pulled By: myleott

fbshipit-source-id: 38b87454487f1ffa5eaf19c4bcefa0b3b15a8f43

'--user-dir' documentation (correct) (#447)

Summary:
Command line option --user-dir documented in docs/overview.rst
Pull Request resolved: https://github.com/pytorch/fairseq/pull/447

Differential Revision: D13674744

Pulled By: myleott

fbshipit-source-id: 17049ee5c9f692f5298ef9fa7381ee583f269cde

Fixed wrong help message shown on '--help' (#446)

Summary:
The correct help message was obscured by the transient `ArgumentParser` used only to eagerly read the `--user-dir` flag.

To reproduce just try:
```bash
python3 train.py --help
```
Pull Request resolved: https://github.com/pytorch/fairseq/pull/446

Differential Revision: D13674731

Pulled By: myleott

fbshipit-source-id: b9503a4d7ef26405be630d31c0ca02386d783031
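
For context, a common pattern for eagerly reading one flag without clobbering `--help` is a throwaway parser created with `add_help=False`; a minimal sketch under that assumption (not the actual fairseq options code):

```python
import argparse

def preparse_user_dir(argv):
    # The transient parser must not define -h/--help, otherwise it swallows
    # the help request before the real parser can print the full usage.
    pre = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    pre.add_argument("--user-dir", default=None)
    args, remaining = pre.parse_known_args(argv)
    return args.user_dir, remaining
```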

optimizations for token_block_dataset

Summary:
Optimize memory use of token_block_dataset by replacing Python data structures with numpy arrays.
This applies the needed parts from D13498973 instead of rebasing it on recent changes.

Reviewed By: edunov

Differential Revision: D13678485

fbshipit-source-id: c0c827a8b95834a6a5456476040ebdc8e42136d4
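
A rough illustration of the memory-saving idea (function and variable names are hypothetical; the real token_block_dataset logic is more involved): keep the (start, end) slice of every block in a single numpy array rather than a Python list of tuples.

```python
import numpy as np

def block_slices(total_tokens, block_size):
    """Return an int64 array of [start, end) offsets covering the token
    stream in fixed-size blocks; one numpy row per block instead of a
    per-block Python tuple object."""
    starts = np.arange(0, total_tokens, block_size, dtype=np.int64)
    ends = np.minimum(starts + block_size, total_tokens)
    return np.stack([starts, ends], axis=1)

# e.g. 10 tokens in blocks of 4 -> [[0, 4], [4, 8], [8, 10]]
print(block_slices(10, 4))
```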

Add --checkpoint-upper-bound to average_checkpoints.py (#452)

Summary:
This is useful for averaging the last N checkpoints, ending at some "best" checkpoint.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/452

Differential Revision: D13695407

Pulled By: myleott

fbshipit-source-id: 5d9d2bff3706834f01501e9259834c77fb335817
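
A hedged sketch of the averaging idea behind average_checkpoints.py, assuming each checkpoint stores its parameters under a "model" key (a simplification of the real script, which also handles the rest of the checkpoint state):

```python
import torch

def average_model_params(paths):
    """Element-wise average of the model parameters across checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v.div_(len(paths)) for k, v in avg.items()}
```

With `--checkpoint-upper-bound`, the list of paths would simply be the last N checkpoints up to and including the chosen "best" one.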

FIX: '--user-dir' on multi-gpu (#449)

Summary:
On a multi-gpu training scenario, the `train.py` script spawns new processes with `torch.multiprocessing.spawn`. Unfortunately those child processes don't inherit the modules imported with `--user-dir`.

This pull request fixes the problem: the custom module import is now explicit in every `main()` function.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/449

Differential Revision: D13676922

Pulled By: myleott

fbshipit-source-id: 520358d66155697885b878a37e7d0484bddbc1c6
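
The fix amounts to repeating the dynamic import inside each spawned worker; a minimal sketch of the pattern with a hypothetical `import_user_module` helper (not the exact fairseq code):

```python
import importlib
import os
import sys

def import_user_module(user_dir):
    if user_dir is None:
        return
    user_dir = os.path.abspath(user_dir)
    parent, name = os.path.split(user_dir)
    if name not in sys.modules:
        sys.path.insert(0, parent)
        importlib.import_module(name)

def worker_main(rank, args):
    # Spawned child processes do not inherit modules that were imported
    # dynamically in the parent, so the --user-dir import must happen again here.
    import_user_module(args.user_dir)
    ...
```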

Fix initial learning rate (#453)

Summary:
There was a very subtle bug here 😢 When we recently removed this line (7633129ba8d5f0e28bd6b6d6027b14352482ef31), it meant that the learning rate scheduler didn't get initialized until after the first update. Unfortunately PyTorch optimizers store the learning rate in their internal state, so some learning rate schedulers use their `__init__` method to reset the learning rate to some sane initial value. This is especially problematic for LR schedulers that include a warmup, where the Optimizer is likely to contain the peak learning rate at initialization, and it's only in the LR scheduler's `__init__` that the (much smaller) warmup value is set.

For example, the inverse_sqrt scheduler resets the learning rate upon initialization:
https://github.com/pytorch/fairseq/blob/7853818c2e33a63ec17a31bcfe20e4fc75d94130/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py#L48-L50

**Impact:** For the last ~1.5 weeks, the first training update would use the optimizer's default learning rate instead of the initial rate set by the LR scheduler. All subsequent updates used the correct learning rates. This primarily affects LR schedulers with warmups.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/453

Differential Revision: D13704453

Pulled By: myleott

fbshipit-source-id: a946da30100f837c66bdc6b9b77b014ab4eb8764
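
A small sketch of why the construction order matters for warmup schedulers; the numbers and the scheduler class are hypothetical (the pattern mirrors inverse_sqrt, but this is not the fairseq class itself):

```python
import torch

model = torch.nn.Linear(8, 8)
# The optimizer is created with the *peak* learning rate...
opt = torch.optim.SGD(model.parameters(), lr=5e-4)

class WarmupScheduler:
    def __init__(self, optimizer, warmup_init_lr=1e-7):
        self.optimizer = optimizer
        # ...so the scheduler's __init__ must immediately reset it to the
        # (much smaller) warmup value. If the scheduler is only built after
        # the first update, that first update runs at the peak rate.
        for group in optimizer.param_groups:
            group["lr"] = warmup_init_lr

WarmupScheduler(opt)
print(opt.param_groups[0]["lr"])  # 1e-07, not 5e-04
```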

Fix stories generation

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/454

Differential Revision: D13708565

Pulled By: myleott

fbshipit-source-id: 5cd0e07e3e1885eef14e3a5e8074f24cf4bde632

Fix iteration bug in GroupedIterator. Correct sent size filter. (#455)

Summary:
Fix a bug where iteration restarted from the beginning when initializing the GroupedIterator (https://github.com/pytorch/fairseq/issues/441).
Correct the filter criterion for dict-type sentence sizes (https://github.com/pytorch/fairseq/issues/451).
Pull Request resolved: https://github.com/pytorch/fairseq/pull/455

Differential Revision: D13725646

Pulled By: myleott

fbshipit-source-id: e698fa6f9b45460f95a75c9e9976a3aa3b6aa523
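
For reference, a minimal grouped iterator that chunks an existing iterator without consuming it from the beginning might look like this (illustrative only, not the fairseq class):

```python
import itertools

def grouped(iterable, chunk_size):
    it = iter(iterable)  # keep the caller's position; do not restart
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# e.g. list(grouped(range(7), 3)) == [[0, 1, 2], [3, 4, 5], [6]]
```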

change f"{args}" to "{}".format(args) (#467)

Summary:
Although both are supported by Python 3.6, I think it would be better to unify the usage of the string formatting functions.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/467

Differential Revision: D13802506

Pulled By: myleott

fbshipit-source-id: 5c4877547b1c4ca806ab54c80ae483cfbaa7827a
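
Both spellings produce identical output; the change just standardizes on `.format()`, e.g.:

```python
args = {"lr": 0.25}
assert f"{args}" == "{}".format(args)  # both render the same string
```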

Better error message for improperly formatted dictionaries

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/468

Differential Revision: D13802590

Pulled By: myleott

fbshipit-source-id: e374e38e74dc91bda0579ae41e26289fb0ba56a2

Enforce UTF-8 when open() text files (#460)

Summary:
When opening text files without specifying the encoding (i.e. `open(path, "r")` or `open(path, "w")`), python3 will use the preferred locale encoding (`locale.getpreferredencoding()`) so the result is platform dependent and can change from one machine to another.

I believe fairseq should enforce its own standard (UTF-8 seems like the best choice to me). This pull request explicitly specifies UTF-8 encoding when opening text files.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/460

Differential Revision: D13802525

Pulled By: myleott

fbshipit-source-id: 672fd55707ee559ab36d74bc1c24026166ea2367
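
The change boils down to always passing an explicit encoding ("dict.txt" below is just a placeholder path):

```python
# Platform dependent: uses locale.getpreferredencoding(), e.g. cp1252 on some Windows setups
with open("dict.txt") as f:
    lines = f.readlines()

# Deterministic across machines: always decode as UTF-8
with open("dict.txt", encoding="utf-8") as f:
    lines = f.readlines()
```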

Print model and number of trained params

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/469

Differential Revision: D13802945

Pulled By: myleott

fbshipit-source-id: b6976506a8336b96ee40505c4a7638541cc99c95
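
The reporting itself is a one-liner in PyTorch; a sketch of the kind of summary being added (the `describe` helper and the exact output format are hypothetical):

```python
def describe(model):
    num_params = sum(p.numel() for p in model.parameters())
    num_trained = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(model)
    print("| num. model params: {} (num. trained: {})".format(num_params, num_trained))
```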

LSTM improvements (fixes #414)

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/470

Differential Revision: D13803964

Pulled By: myleott

fbshipit-source-id: 91b66599e9a539833fcedea07c608b349ba3b449

Only use c10d distributed primitives

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/471

Differential Revision: D13818918

Pulled By: myleott

fbshipit-source-id: d3b8dc50e81ee1d2dcc5efc5815998be8461085f

Adafactor Optimizer (#472)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/472

Implementation of "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost" (https://arxiv.org/abs/1804.04235)

Differential Revision: D13388049

fbshipit-source-id: 24ad30f4bac248e6aeaced5064bb83784058f03d
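
The memory saving in Adafactor comes from storing per-row and per-column second-moment accumulators instead of a full matrix; a sketch of that rank-1 reconstruction (core idea only, without the moving averages, decay schedule, or update clipping of the full optimizer):

```python
import torch

def factored_second_moment(row_sums, col_sums):
    """Reconstruct an approximate per-element second moment V from its row
    sums R and column sums C: V ~= (R C^T) / sum(R)."""
    return row_sums.unsqueeze(1) * col_sums.unsqueeze(0) / row_sums.sum()

g = torch.randn(4, 3)
# Only the 4-element row accumulator and 3-element column accumulator need to
# be kept between steps; the 4x3 matrix is rebuilt transiently.
v_approx = factored_second_moment((g ** 2).sum(dim=1), (g ** 2).sum(dim=0))
update = g / (v_approx.sqrt() + 1e-30)  # analogue of Adam's g / sqrt(v)
```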

Refactor AdversarialTrainer: factor out helper functions

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/474

Reviewed By: theweiho, akinh

Differential Revision: D13701447

fbshipit-source-id: 34036dce7601835b605e3b169210edc7a6715de6

Add code for "Pay Less Attention with Lightweight and Dynamic Convolutions" (#473)

Summary:
Changelog:
- `e330f56`: Add code for the "Pay Less Attention with Lightweight and Dynamic Convolutions" paper (a sketch of the lightweight-convolution idea follows this entry)
- `5e3b98c`: Add scripts for computing tokenized BLEU with compound splitting and sacrebleu
- update READMEs
- misc fixes
Pull Request resolved: https://github.com/pytorch/fairseq/pull/473

Differential Revision: D13819717

Pulled By: myleott

fbshipit-source-id: f2dc12ea89a436b950cafec3593ed1b04af808e9
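
For orientation, the lightweight-convolution building block from the paper is essentially a depthwise 1-D convolution whose kernel is softmax-normalized and shared across heads; a minimal sketch under that reading (no GLU and no dynamic weight generation; names are illustrative, not the fairseq module):

```python
import torch
import torch.nn.functional as F

def lightweight_conv(x, head_weights):
    """x: (batch, channels, time); head_weights: (num_heads, kernel_size)."""
    B, C, T = x.shape
    H, K = head_weights.shape
    w = F.softmax(head_weights, dim=-1)            # normalize each head's kernel
    w = w.repeat_interleave(C // H, dim=0)         # share each head over C // H channels
    w = w.unsqueeze(1)                             # (C, 1, K) depthwise filters
    return F.conv1d(x, w, padding=K // 2, groups=C)[..., :T]

x = torch.randn(2, 8, 16)
weights = torch.randn(4, 3)                        # 4 heads, kernel size 3
out = lightweight_conv(x, weights)                 # (2, 8, 16)
```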

Make dictionary class an input for fairseq preprocess functions (#482)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/482

With this change, we can use different dictionary classes when calling build_dictionary and build_and_save_dictionary

Reviewed By: liaimi

Differential Revision: D13855100

fbshipit-source-id: 62e6db310b5f078e05c547d2671252233be7b7f0

Merge internal changes (#483)

Summary:
Changelog:
- `4889802`: can now detokenize sentencepiece output with `--remove-bpe=sentencepiece` (fixes #331). Also added `--sacrebleu` for computing detokenized BLEU (see the sketch after this entry).
- `0d76427`: fix assertion error when training language model with dataset containing empty sentences
- minor bug and style fixes
Pull Request resolved: https://github.com/pytorch/fairseq/pull/483

Differential Revision: D13867899

Pulled By: myleott

fbshipit-source-id: 25c940b847fe270262ac8f5ac838407b3977fdda
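
A rough sketch of what sentencepiece removal plus detokenized BLEU amounts to; it assumes the sacrebleu package, and the post-processing shown is the standard "▁" marker convention rather than necessarily fairseq's exact code path:

```python
import sacrebleu

def remove_sentencepiece(text):
    # Collapse sentencepiece pieces back into words: drop the spaces between
    # pieces, then turn the '▁' word-boundary marker back into a space.
    return text.replace(" ", "").replace("\u2581", " ").strip()

hyp = remove_sentencepiece("▁Hello ▁wor ld ▁!")
print(hyp)                                              # "Hello world !"
print(sacrebleu.corpus_bleu([hyp], [["Hello world !"]]).score)
```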

Add --input option to interactive.py to support reading from file

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/484

Differential Revision: D13880636

Pulled By: myleott

fbshipit-source-id: 984b2e1c3b281c28243102eb971ea45ec891d94e

Do distributed init after data loading

Summary:
FACEBOOK

This switches back to torch.multiprocessing.spawn, instead of directly calling fb_train.par using a subprocess.Process. This has the advantage that exceptions are propagated properly. It also moves the distributed_init part to happen after data loading, which gets around the timeout issue.

The downside of this approach is that it's not so easy to pipe stdout to multiple places, which was nice when using the sweep.py scripts. I'm still working on a fix for that.

Reviewed By: rutyrinott, ngoyal2707

Differential Revision: D13873224

fbshipit-source-id: 08d593233b8d23590c01c723363630a79804a8b0
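
The spawn-based pattern referred to here looks roughly like this self-contained toy example (gloo backend and localhost rendezvous are assumptions for illustration; it is not the fairseq training loop):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Heavy per-process data loading can happen here, *before*
    # init_process_group, so slow preprocessing does not hit the rendezvous timeout.
    dist.init_process_group(
        backend="gloo", init_method="tcp://127.0.0.1:29500",
        world_size=world_size, rank=rank)
    t = torch.ones(1) * rank
    dist.all_reduce(t)                     # exceptions raised here propagate to the parent
    print("rank", rank, "sum of ranks:", t.item())

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```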

Support custom Dictionary implementations in 'preprocess.py' (#448)

Summary:
The `preprocess.py` script has been refactored in order to:

1. Use the `options` module for command-line argument parsing. This gives `preprocess.py` the ability to load custom modules with the `--user-dir` flag (already implemented for all other binaries).
2. Dictionary loading and building code has moved to the Task implementation. This allows custom Dictionary classes to be used during the data generation step.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/448

Differential Revision: D13674819

Pulled By: myleott

fbshipit-source-id: b40648a98ed6c08284577e5ec25876e018d8c822

Add standalone binaries

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/489

Differential Revision: D13956810

Pulled By: myleott

fbshipit-source-id: 61ace179d1d3790226c38b3f3e47f5452b5ec514

Add CheckpointManager to keep avg checkpoint weights in memory to reduce disk read when averaging + various checkpoint refactoring

Summary: Pull Request resolved: https://github.com/pytorch/translate/pull/315

Reviewed By: akinh

Differential Revision: D13510446

fbshipit-source-id: 22a6594af9253130a93e638285a47183a974e0de

stitch preprocessing pipeline

Summary:
1. add call to binarization to complete preprocessing pipeline
2. add ability to sp…
yzpang pushed a commit to yzpang/gold-off-policy-text-gen-iclr21 that referenced this pull request Feb 19, 2021
Summary:
Pull Request resolved: pytorch/translate#283

Pull Request resolved: facebookresearch/fairseq#428

Differential Revision: D13564190

Pulled By: myleott

fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5
yfyeung pushed a commit to yfyeung/fairseq that referenced this pull request Dec 6, 2023
…ix Chinese chars and English BPE) (facebookresearch#428)

* add pruned transducer stateless5 recipe for tal_csasr

* do some changes for merging

* change for conformer.py

* add wer and cer for Chinese and English respectively

* fix an error for conformer.py