Merge internal changes #428

Closed
wants to merge 12 commits

Conversation

myleott
Contributor

@myleott commented Jan 1, 2019

No description provided.

@facebook-github-bot left a comment

@myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit to pytorch/translate that referenced this pull request Jan 5, 2019
Summary:
Pull Request resolved: #283

Pull Request resolved: facebookresearch/fairseq#428

Differential Revision: D13564190

Pulled By: myleott

fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5
@myleott deleted the merge_internal branch January 8, 2019 17:55
jxhe pushed a commit to salesforce/ctrl-sum that referenced this pull request Nov 3, 2020
Architecture settings and readme updates

More fixes

Small fix

Better training support when GPUs are in "exclusive mode"

Issue #2, Checking size attribute of dst when dst is None

Fix generation when vocabulary is small relative to beam size (fixes #7)

Fix handling of partially-empty initial batch (#11)

Refactor PaddingCollater

Update progress_bar to be more robust to changes in tqdm (#21)

Fix call ordering to ATen addmm and sum (#22)

BPE transformation for IWSLT

Don't generate during training, add --quiet to generate.py

Ignore invalid sentences in test and valid

Update README.md

Update PyTorch install instructions

Update README.md

Don't suggest Miniconda (see #24)

Fix --no-progress-bar option in generate.py (#115)

Update En2Fr model

Move helper functions from generate.py to fairseq/dictionary.py

Support configurable BPE symbol

Fix flake8 warnings

Don't save/restore convolutional layers in incremental inference

Allow --max-len-a to be a float

Add optimizer history to checkpoints (and rearrange criterions slightly)

Better logging from criterions

Add support for NCCL v2

Add attention matrix to output of SequenceGenerator

Ignore generated files for temporal convolution tbc

Fix smoothed (sentence-level) BLEU calculation

More flexible gradient normalization

Refactor model saving/loading to be more reusable

Refactor code in Tokenizer

Add support for additional optimizers

Simplify deps of build_model to only depend on dict (instead of dataset)

Fix language inference in generate.py

Fix handling of continuation tokens that precede <unk> in generate.py

Prevent math overflow when loss is too high

Set seed after each epoch to improve consistency when resuming

Fix for building under clang: specify C++ build and use C++ linkage (#42)

Update README with note about Docker (#49)

Force UTF-8 encoding for dictionary files ( #41 )

Only save most recent optimizer state in checkpoints (#53)

Only consider EOS in beam search if it's among top-k candidates

Fix description for `--sample-without-replacement` option

Support custom dictionary in preprocess.py

Add `--curriculum` option

Refactor model definitions

* Move some functionality out of FConvModel into FairseqModel base class
* Move incremental decoding functionality into FairseqIncrementalDecoder module
* Refactor positional embeddings to be more specific to FConvModel

Added -unkpen flag to generate.py following logic of Lua/Torch version

Support different max_source_positions and max_target_positions

Fix call to non-existing to_string method

Fix seed so that data is properly shuffled between epochs

Upgrade args with max_source_positions and max_target_positions

Refactor generation

* Split generate.py to generate.py and interactive.py and refactor code

The main motivation behind these changes is to decouple the use cases, in order
to enable future improvements such as unk replacement with the original string
during evaluation on test and writing predictions to an output file.
The previous implementation worked well, but I found it difficult to
integrate these future improvements into it.

* Add --replace-unk arg to be used without align dict

Replacing <unk> tokens can be beneficial even without an alignment
dictionary.

Left pad source and right pad target

Improvements to data loader

Fix interactive.py

Use `--lrshrink` as the reduction factor in ReduceLROnPlateau

Fix flake8 lint

Add dim to F.softmax calls

Update README with interactive.py and fix it

Add --max-sentence option for batching based on # sentences

Loop over evaluation dataloader in descending order

Replace unk with original string

* Add <eos> for unk replacement
* Add IndexedRawTextDataset to load raw text files
* Replace unk with original string
* Add load_raw_text_dataset() and --output-format
* Move has_binary_files to data.py

Revert `dim` in `F.softmax` for backwards compatibility

Rename LabelSmoothedCrossEntropy to LabelSmoothedNLLLoss

Add LSTM

Don't call forward directly (prefer module(x) to module.forward(x))

Add `--log-format` option and JSON logger

Fix max_positions_valid in train.py

Fixes for `--log-format`

Fix all-reduce for new versions of PyTorch

We previously assumed that once a model parameter's gradient buffer was allocated, it stayed fixed during training.
However, this assumption is violated in recent versions of PyTorch (i.e., the gradient buffer may be reallocated during
training), and it's no longer a safe assumption to make.

This is primarily relevant when we do the all-reduce, since we all-reduce a flattened (i.e., contiguous) copy of the
gradients. We can make this more robust by copying the result of the all-reduce back into the model parameter's gradient
buffers after each update. Intra-device copies are cheap, so this doesn't affect performance.
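A minimal sketch of that copy-back step, assuming `torch.distributed` is already initialized; the helper name and the use of PyTorch's private flatten/unflatten utilities are illustrative, not fairseq's actual trainer code:

```python
import torch
import torch.distributed as dist

def all_reduce_and_copy_back(params, world_size):
    # Collect the current gradient buffers (they may have been reallocated
    # since the last step, which is exactly why we copy back at the end).
    grads = [p.grad.data for p in params if p.grad is not None]
    # All-reduce a single flattened, contiguous copy of the gradients.
    flat = torch._utils._flatten_dense_tensors(grads)
    dist.all_reduce(flat)
    flat.div_(world_size)
    # Copy the reduced values back into each parameter's .grad buffer.
    for buf, reduced in zip(grads, torch._utils._unflatten_dense_tensors(flat, grads)):
        buf.copy_(reduced)
```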

Version 0.1.0 -> 0.2.0

Release notes:
- 5c7f495: Added simple LSTM model with input feeding and attention
- 6e4b7e2: Refactored model definitions and incremental generation to be cleaner
- 7ae79c1: Split interactive generation out of generate.py and into a new binary: interactive.py
- 19a3865: Subtle correctness fix in beam search decoder. Previously, for a beam size of k, we might emit a hypothesis
           if the <eos> was among the top 2*k candidates. Now we only emit hypotheses for which the <eos> is among the
           top-k candidates. This may subtly change generation results, and in the case of k=1 we will now produce
           strictly greedy outputs.
- 97d7fcb: Fixed bug in padding direction, where previously we right-padded the source and left-padded the target. We
           now left-pad the source and right-pad the target. This should not affect existing trained models, but may
           change (usually improves) the quality of new models.
- f442f89: Add support for batching based on the number of sentences (`--max-sentences`) in addition to the number of
           tokens (`--max-tokens`). When batching by the number of sentences, one can optionally normalize the gradients
           by the number of sentences with `--sentence-avg` (the default is to normalize by the number of tokens).
- c6d6256: Add `--log-format` option and JSON logger

Fallback to `--log-format=simple` for non-TTY terminals

Fix Flake8

Flush non-TTY logging output after each log interval

Make LSTM backwards compatible and fix incremental generation

Remove Python 3.6 format strings (fixes #55)

Remove more Python 3.6 format strings (fixes #57) (#58)

Remove Python3.6 format string from preprocess.py (fixes #60) (#61)

Update requirements.txt and fix flake8 (#62)

fix bug in lstm model (#68)

Fixed 2 typos (#75)

Improve error when resuming training with a different model architecture

Improve memory handling (recover from OOM and periodically empty caching allocator)

Allow --lr to specify a fixed learning rate schedule

Prefer command-line configuration over checkpoint for optimizer state

Save number of GPUs in args (and checkpoints)

Fix weight norm dimension in decoder (fixes #73)

Rebuild optimizer when loading checkpoints

Fix conv padding for even kernel widths

Directly decay weight instead of L2 penalty (#157)

See https://arxiv.org/pdf/1711.05101.pdf

Fix generation bug with large beam sizes (>50)

Add support for sharded generation

Fix BeamableMM

Better error message for --decoder-attention

Minor fix for strip_pad functions

Support deprecation of volatile Variables in latest PyTorch

Add explicit dimension to softmax calls

Output number of model parameters in train.py

Raise FileNotFoundError if dictionary files don't exist

Add reduce kwarg to criterions

Streamline data formatting utils

Add --max-sentences-valid to train.py

Add option to SequenceGenerator to retain dropout

Fix warning about deprecated `volatile` kwarg for Variables

Move positional embeddings into LearnedPositionalEmbedding module

Move normalization of model output (e.g., via LSM) into model definition

Fix LearnedPositionalEmbedding

Fix gradient clipping when --clip-norm=0

Save dictionary in model base classes

Fix training

Better support for torch.no_grad (since volatile is deprecated)

Share input/output embed

Report log likelihood for label smoothing

Momentum correction

ATen Fix

Better warning message for inputs that are too long

Fix max_positions calculation in train.py

Output correct perplexity when training with --sentence-avg

Fix tests

Fixed Weight Decay Regularization in Adam

See https://arxiv.org/abs/1711.05101

Ratio should be predlen/reflen not reflen/predlen

To be compatible with multi-bleu.
This seems to only affect the result_string.

Prepare scripts for WMT14

Switch to news-commentary-v12

Adding README and more parameters to En2De script

Update README with new models

spelling

Adjust weight decay by the current learning rate to make it work correctly during annealing

Allow larger maxlen (fixes #100) (#101)

fairseq-py goes distributed (#106)

This PR includes breaking API changes to modularize fairseq-py and adds support for distributed training across multiple nodes.

Changes:
- c7033ef: add support for distributed training! See updated README for usage.
- e016299: modularize fairseq-py, adding support for register_model, register_criterion, register_optimizer, etc.
- 154e440: update LSTM implementation to use PackedSequence objects in the encoder, better following best practices and improving perf
- 90c2973 and 1da6265: improve unit test coverage

Add OOM counter back to logging output

Fix tests and flake8

More unit test fixes

Add support to prefixes (#221)

* Add prefix

* Fixes

* Keep original scores with prefix

* Improve prefix code

* Replace 'repeat' with 'expand'

pytorch update: no need to rewrap variable in backward()

Fix LabelSmoothedCrossEntropy test

Refactor incremental generation to be more explicit and less magical (#222)

Making our code compatible with the latest pytorch (#223)

* Making our code compatible with the latest pytorch

* revert

* torch.nn.utils.clip_grad_norm now returns tensor

More fixes for recent PyTorch (incl. topk issue) (#113)

More updates for PyTorch (#114)

Use ATen built-in conv_tbc method (#66)

Remove custom ConvTBC code

Small fixes

Filter padding properly in LabelSmoothedCrossEntropyCriterion (#229)

Allow more flexible pre-processing and generation (#227)

* Allow more flexible pre-processing and generation

* Addressing CR comments

* small fix

Enforce upper-bound on maximum generation length (#121)

fix typo in data/README

Change "awailable" to "available".

fix typo in data/README (#131)

Change "awailable" to "available".

Update training commands

Update training commands in data/README to match the latest version of this project according to #132.

- Motivation: in the previous data/README, the commands are obsolete and will cause the error "unrecognized arguments: --label-smoothing 0.1 --force-anneal 50".
- What's changed: add arguments "--criterion label_smoothed_cross_entropy" and "--lr-scheduler fixed" to the training commands of all 3 datasets.
- Result: the new commands run without error on all 3 datasets.

Update training commands

Update training commands in data/README to match the latest version of this project according to #132.

Continue from 3c07295885c6283def573e7a6811464f250c3b28: add omitted "\".

Update training command for IWSLT14

specify a single GPU setup for IWSLT14

Merge internal changes (#136)

Changes:
- 7d19e36: Add `--sampling` flag to generate.py to sample instead of doing beam search
- c777340: Add `scripts/average_checkpoints.py` to average multiple checkpoints into a combined model
- 3ea882c: Add `--max-update` option to train.py to stop training after a given number of updates
- small bugfixes for distributed training, LSTM, inverse square root LR scheduler

make interactive mode print out alignment nicely

Disallow --batch-size in interactive.py

Update README.md

use implicit padding when possible (#152)

Add pretrained embedding support (#151)

Flake8

Fix old model checkpoints after #151 (fixes #156) (#157)

Update dataset code for use by https://github.com/pytorch/translate/pull/62 (#161)

Merge internal changes (#163)

0.4.0 -> 0.5.0

Remove sweep_log prefix from json progress bar

Faster fconv generation

Fix LSTM

fix optim history

address comments

Add Transformer model

Remove Google batching strategy (it's not needed)

Pass args around to cleanup parameter lists

Bug fixes

Fix flake8

Fix buffers in sinusoidal positional embeddings

caching v3 (cache keys, values, process only last time step) (#241)

- process only last time step during generation
- cache keys and values
- don't apply masking during generation

smarter way to avoid applying encoder key mask

Use PyTorch LayerNorm and improve weight init

More improvements to weight init and FP16 support

Simulated big batches

Use FP32 for multi-head attention softmax

better batching

Improve dataloader speed and deprecate concept of batch_offset (use --sample-without-replacement instead)

Allow schedule for update-freq

Fix batching during generation

Add FP16 support

Make dictionary size a multiple of 8

Revert "Make dictionary size a multiple of 8"

This reverts commit b2e119c209363e6ff6d2878a69c7d1a507a2e9be.

Pad dictionary to be a multiple of 8 in preprocessing

No more magical --fp16

remove completed sentences from batch

remove completed sentences from batch and allow batching uneven lengths (with fixes to make padded sequences work correctly in all models)

Fix Flake8

Small optimization for LSTM

Fix preprocess.py

Use eval() to parse args.lr

Fix embedding initialization for padding

Simplify train.py (merge with singleprocess_train.py)

Save and restore wall time in checkpoints

Support --warmup-updates with fixed LR schedule

Fix tests

Remove src-padding from generation output

make sure tensor used to index is cuda if on gpu

Fix --prefix-size

fix to adding tokens to dictionary while thresholding

make attn dropout 0.1 default for big en-de transformer

add support for averaging last n checkpoints

fix flag copy paste (decoder-normalize-before)

Remove padding from --score-reference

Fix --remove-bpe to strip trailing BPE symbols

Sampling doesn't work with interactive

implement batching in interactive mode

fix alignment when using uneven batches and left pad

Support integer learning rates

allow specifying max_tokens for generation

also report sentence/s timing when generating

default dropout to correct value for big transformer

ability to checkpoint when reaching certain number of updates

All-reduce in FP16

remove unused verbose option & make arguments to averaging script nicer

allow overwriting args for different architectures

Fix tests

use implicit padding when possible

Add pretrained embedding support

Flake8

Fix old model checkpoints

Merge OSS + internal changes

Conv lm implementation

This implements the convolutional language model from https://arxiv.org/pdf/1612.08083.pdf

There are 3 modes for constructing batches (see the sketch after this list):

- token block: fill each sample with a specified number of tokens without regard for sentence delimiters - this is what was used for training in the paper
- complete: fill each sample with a specified number of tokens but make sure it contains only complete sentences (i.e. if next sentence goes over token block limit, move it to the next sample) - this was used for evaluation in the paper
- eos: one sentence per sample (skip blank lines)
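A minimal, illustrative sketch of the three modes, assuming sentences are plain Python lists of token IDs (this is not the actual TokenBlockDataset code):

```python
def make_samples(sentences, block_size, mode):
    """sentences: a list of token-ID lists, each ending with an EOS token."""
    if mode == "eos":
        # One sentence per sample; skip blank lines.
        return [s for s in sentences if len(s) > 0]
    samples, current = [], []
    for sent in sentences:
        if mode == "complete" and current and len(current) + len(sent) > block_size:
            # Only complete sentences per sample: start a new sample rather
            # than splitting the sentence that would overflow the block.
            samples.append(current)
            current = []
        current.extend(sent)
        while mode == "token" and len(current) >= block_size:
            # Fill each sample with exactly block_size tokens, ignoring
            # sentence boundaries.
            samples.append(current[:block_size])
            current = current[block_size:]
    if current:
        samples.append(current)
    return samples
```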

some results:

GCNN-13 - GBW - 37.46
GCNN-14B - GBW - 33.88
GCNN-8 - Wiki103 - 43.76
GCNN-14 - Wiki103 - 35.66

train:

python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 35 --save-interval 1 --save-interval-updates 1000 --keep-interval-updates 25 --arch fconv_lm --optimizer nag --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 --decoder-embed-dim 280 --decoder-layers '[(850, 6)] * 3 + [(850,1)] + [(850,5)] * 4 + [(850,1)] + [(850,4)] * 3 + [(1024,4)] + [(2048, 4)]' --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion cross_entropy --max-tokens 1024 --max-target-positions 1024 --seed 1 --log-format json --log-interval 500

eval:

python eval_lm.py ~abaevski/data/wiki103 --path '/checkpoint02/abaevski/2018-04-27/lm_wiki.fp16.mxup300000.fconv.adam.lrs=reduce_lr_on_plateau.emb280.layers(850,6)*3+(850,1)+(850,5)*4+(850,1)+(850,4)*3+(1024,1)+(2048,4).lr0.0005.clp0.1.drp0.3.wd0.0.crt=cross_entropy.mxtk2048.smptk256.seed1.ngpu8/checkpoint_last.pt'

default normalization constant for older models

add big en_fr transformer architecture

Generalize eval_str_list

fix restoring from middle of epoch; fix defaulting transformer dropout params

record end_of_epoch in checkpoint

use adaptive softmax only with adaptive loss

fix default params

added multiscale gated self attention layer with multiple heads, and pretrained fusion models

modified writing prompts model parameters to make readme cleaner

minor parameter fixes for stories model

save best val loss in checkpoint

save best val loss in checkpoint and also print best so far

this way, when training continues from an existing checkpoint, we don't immediately override checkpoint_best with a worse loss

fix model loading in eval_lm

Nits

Migrate all binaries to use options.parse_args_and_arch

Unify various sharding into ShardedIterator

Use symlinks for redundant checkpoints

Merge validate and val_loss functions (simplify train.py)

create examples dir and add conv lm + stories readme

Small fixes

Suppress stdout in test_train

Add more integration tests (LM, stories, transformer, lstm)

Update README.md

build optimizer only once, otherwise it leaks cuda memory

initialize normalization constant for fconv_lm

Fix length penalty when combined with --no-early-stop

Co-authored-by: pmichel31415 <pmichel@fb.com>

torch.arange default return type is changed in the latest pytorch version https://github.com/pytorch/pytorch/pull/7016

Add FairseqTask

A Task defines the data format, stores shared state (e.g., dictionaries) and provides helpers for building the model/criterion and calculating the loss.

Changes:
- Add TranslationTask and LanguageModelingTask. New tasks can be registered with the @register_task decorator (see the sketch after this list).
- Add EpochBatchIterator to encapsulate batching and saving/restoring dataloader position
- Remove LEFT_PAD_* constants and make them configurable per task
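A minimal sketch of what registering a new task looks like, assuming fairseq's `register_task` decorator and `FairseqTask` base class; the task name `dummy_copy` and the `load_dictionary_somehow` helper are hypothetical:

```python
from fairseq.tasks import FairseqTask, register_task

@register_task('dummy_copy')
class DummyCopyTask(FairseqTask):
    """A toy task: shared state (the dictionary) lives on the task object."""

    @staticmethod
    def add_args(parser):
        parser.add_argument('--data', help='path to data directory')

    @classmethod
    def setup_task(cls, args, **kwargs):
        dictionary = load_dictionary_somehow(args.data)  # hypothetical helper
        return cls(args, dictionary)

    def __init__(self, args, dictionary):
        super().__init__(args)
        self.dictionary = dictionary

    @property
    def source_dictionary(self):
        return self.dictionary

    @property
    def target_dictionary(self):
        return self.dictionary
```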

Updates for latest PyTorch

Fix tests

Fix bidirectional lstm

Faster generation when using a single model (rather than ensemble)

Change --path to be colon-separated instead of comma-separated

add default architecture for gbw fconv lm

add links to pretrained language models

Update README.md

Add transformer models and replace list with table

Update README.md

Fix preprocessed test set download links

add count for padding words (#180)

gzip instead of bzip change to stories download

added clarification on the newline token we model

fixed newline word not appearing

Fix translation README (fixes #186) (#189)

Fix `--output-format raw` option to preprocess.py (Fixes #188) (#190)

Two tiny changes to train/eval_lm. For train, fix an off-by-one; for eval_lm, make it work when the task is translation.

Ignore files in examples (other than .sh and .md)

When downloading files in examples directory (e.g. when running
`prepare-iwslt14.sh`), git sees them as untracked files but they should
not be committed.
Add a .gitignore script that ignore everything in the examples
subdirectories except for .sh and .md files.

Better failure message when loss explodes during FP16 training

Store full checkpoints instead of symlinking

Fix a bug with using GloVe 840B tokens for initialization.

respect max tokens and ignore invalid inputs when evaluating lm

sort descending when evaluating lm because it is faster (17k wps vs 11k) and will fail early if oom

Support pretrained embeddings for Transformer.

Also show a nicer error message.

Support FP16 during inference

fix sinusoidal embedding init size

add sinusoidal pos initialization

Fix interpretation of --max-epoch

Move reorder_encoder_out to FairseqEncoder and fix non-incremental decoding

Add steps to reproduce WMT En-De results from Scaling NMT paper

Fix typo

Fix for Dictionary.finalize

Misc changes for pytorch-translate

Fix attention order in unit tests (fixes #195) (#197)

Remove more Variable() calls (#198)

Remove unnecessary assert (fixes #199) (#200)

Fix preprocessing for WMT14 En-De to replicate Scaling NMT paper (#203)

adding pretrained stories model

fix decoder_normalize_before typo (#205)

add model override argument from load_ensemble_for_inference at generation time, updating readme for stories

adding model arg override at generation time for interactive.py

assert that vocab size >= adaptive softmax cutoff (#214)

Fix up model defaults (#211)

Pass sampling-temperature trough to the generator in interactive.py

stories data preprocessing needs padding factor 1 to match pretrained model, updating readme

fixed output_proj's input_dim in attention (#226)

fix raw text for language modeling

make model access saner

fix token block rotation

Support tied embeddings in LSTM encoder/decoder

disable printing alignment by default (for perf) and add a flag to enable it

default need_attn to False

Fix bug when --share-all-embeddings but no --encoder-embed-path

Iterate on need_attn and fix tests

Output positional scores in interactive.py

Don't compute unnecessary attention averages during training

Transformer lm

This implements a transformer-based language model. It already obtains better perplexity on WikiText-103 without any tuning. I will also train it on GBW, where I also expect to get better ppl.

Example training command:

python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 80 --save-interval 1 --arch transformer_lm --task language_modeling --optimizer nag --lr 0.008 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.6 --dropout 0.2 --criterion adaptive_loss --adaptive-softmax-cutoff 10000,50000,200000 --max-tokens 512 --tokens-per-sample 512 --seed 1 --sample-break-mode none --log-format json --log-interval 50 --save-interval-updates 2500 --keep-interval-updates 25
A small transformer got to 31.3 ppl on WikiText-103 (compared to 35 with fconv), while @myleott got a big transformer LM to around 27 ppl on WikiText-103.

remove right-to-left lm support

default decoder_learned_pos for lm

option to print language model words and their log probs during evaluation

Update IWSLT configuration for transformer

Don't use 0-dimensional buffers in sinusoidal positional embeddings

Fix comment

Merge internal changes

Add load_optim option to load checkpoint but not optimizer state (#229)

Correct path in the pre-processing example (#230)

Correct the help name of the prefixes arguments (#234)

Fix bug when training with FP32 and --update-freq (#236)

Add ensemble for different architectures (#235)

add end-of-stack normalizations in case normalize_before has been set (#244)

Fix comment

Fix bidirectional LSTM concatenation (#249)

fix adaptive softmax indexing

option for a smaller adaptive softmax

character token embeddings for word level predictions

remove unneeded defaults

Always smaller soft

no need to have half-size option as behavior can be reproduced with existing flags

make adaptive softmax dropout an optional arg

add flag that allows keeping optimizer config

adds -reset-optimizer, --reset-lr-scheduler, and --optimizer-overrides flags

fix tests

make batching faster for monolingual dataset

load args from model for eval_lm

parameters to separate input/inner/out dims

cosine + triangular lr scheduler

Factor out search logic in SequenceGenerator

Reset gnorm after each epoch

fix tests

Increase max buffer size in all_gather_list

script to read binarized data

Move read_binarized.py to scripts/

Warn when using FP16 on pre-Volta GPUs

add warmup support back to cosine lr sched (important for mt)

Diverse Beam Search

Remove --normalization-constant from fconv

Fix adaptive softmax cutoff comment

disable final layer norm for transformer decoder as it makes things worse

Add training wall time meter

Old checkpoints can't be loaded because of a new meter

word stats in eval_lm

Merge internal changes

Fix FP16 version comparison

dont send dummy batch when reloading from checkpoint

also don't crash if a param does not receive grads

Add adaptive softmax changes for lstm model

Add --upsample-primary

Clean up FairseqTask so that it's easier to extend/add new tasks

fix max_positions comparison

Fix comment

Further generalize EpochBatchIterator and move iterators into new file

fix cosine lr sched for t_mult=1 with warmup

Test max_positions

Misc changes to simplify upcoming tutorial

Add documentation

Update documentation

modified stories readme to include sample preprocessing code to split stories to 1k tokens

Fix readme

Fix docs

Readme fix

Update readme with WMT'18 model (#433)

Generator: net_input instead of manual src_tokens.

Sequence generator bug fix.

Revert sequence generator changes

Switch to DistributedDataParallelC10d and bump version 0.5.0 -> 0.6.0

- no more FP16Trainer, we just have an FP16Optimizer wrapper
- most of the distributed code is moved to a new wrapper class called DistributedFairseqModel, which behaves like DistributedDataParallel and a FairseqModel at the same time
- Trainer now requires an extra dummy_batch argument at initialization, which we do fwd/bwd on when there's an uneven number of batches per worker. We hide the gradients from these dummy batches by multiplying the loss by 0 (see the sketch after this list)
- Trainer.train_step now takes a list of samples, which will allow cleaner --update-freq
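A minimal sketch of that dummy-batch trick, with hypothetical names (this is not the actual Trainer code):

```python
def train_step(model, criterion, sample, is_dummy_batch):
    # Every worker must run the same number of forward/backward passes, or the
    # gradient all-reduce will hang. A dummy batch goes through the same path...
    net_output = model(**sample['net_input'])
    loss = criterion(net_output, sample['target'])
    if is_dummy_batch:
        loss = loss * 0.0  # ...but its gradients are hidden by zeroing the loss
    loss.backward()
```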

Disable c10d for AdaptiveLoss

Update LM test with --no-c10d

Pass encoder_input to generator, rather than src_tokens/src_lengths.

Fix validation loss

Add unit test to verify reproducibility after reloading checkpoints

Fix adaptive loss logging

Parallel preprocessing

Fix type of c10d bucket size

Better support for various c10d API changes

core changes to support latte collab

fix issue with truncated dict

Merge internal changes

Add back secondary set

Online backtranslation module

Co-authored-by: liezl200 <lie@fb.com>

fbshipit-source-id: 6a835d32f9dc5e0de118f1b46d365d0e0cc85e11

fbshipit-source-id: 17992f6a5908f078942544b769eda7a340a5e359

Merge internal changes (#295)

Summary:
Changelog:
- `90f52a1`: Support loading subsets of the data on each worker with the `--fix-batches-to-gpus` flag. This should fix #217 and #266.
- `6eda0a9`: Update README for replicating the "Scaling Neural Machine Translation" paper
- `b14c7cf`: Fallback to no_c10d backend for pytorch 0.4.1 (fixes #294)
Pull Request resolved: https://github.com/pytorch/fairseq/pull/295

Differential Revision: D10121559

Pulled By: myleott

fbshipit-source-id: 41c84d0ee4cdd113544b5d3aa38ae8b23acc2c27

Merge internal changes

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/296

Differential Revision: D10121830

Pulled By: alexeib

fbshipit-source-id: 1b73430bdfdcb20a9a6123abfca3472a0d307b3b

Explicitly list out generation args for backtranslation dataset

Summary:
Using argparse Namespace hides the actual args that are expected and makes code harder to read.

Note the difference in style for the args list

    def __init__(
        self,
        tgt_dataset,
        tgt_dict,
        backtranslation_model,
        unkpen,
        sampling,
        beam,
        max_len_a,
        max_len_b,
    ):

instead of

    def __init__(
        self, tgt_dataset, tgt_dict, backtranslation_model, unkpen, sampling,
        beam,  max_len_a, max_len_b,
    ):

Reviewed By: dpacgopinath

Differential Revision: D10152331

fbshipit-source-id: 6539ccba09d48acf23759996b7e32fb329b3e3f6

Update README.md

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/300

Differential Revision: D10154711

Pulled By: edunov

fbshipit-source-id: 859d1ac59923b67c1547b6f7acb94f801b0c3318

Pass in kwargs and SequenceGenerator class to init BacktranslationDataset

Summary: This generalizes BacktranslationDataset to allow us to use any SequenceGenerator class. For example, if we want to use this model in PyTorch Translate, we can pass the following to BacktranslationDataset init: (1) a PyTorch Translate SequenceGenerator class as generator_class and (2) the appropriate args for initializing that class as kwargs.

Reviewed By: xianxl

Differential Revision: D10156552

fbshipit-source-id: 0495d825bf4727da96d0d9a40dc434135ff3486c

Fix proxying in DistributedFairseqModel

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/302

Differential Revision: D10174608

Pulled By: myleott

fbshipit-source-id: 4e2dfc76eae97afc5488f29b47e74f9897a643ff

Option to remove EOS at source in backtranslation dataset

Summary:
If we want our parallel data to have EOS at the end of source, we keep the EOS at the end of the generated source dialect backtranslation.
If we don't want our parallel data to have EOS at the end of source, we **remove** the EOS at the end of the generated source dialect backtranslation.

Note: we always want EOS at the end of our target / reference in parallel data so our model can learn to generate a sentence of arbitrary length. So we make sure that the original target has an EOS before returning a batch of {generated src, original target}. If the original targets in the tgt dataset don't have an EOS, we append EOS to each tgt sample before collating.
We only do this for the purpose of collating a {generated src, original tgt} batch AFTER generating the backtranslations. We don't enforce any EOS before passing tgt to the tgt->src model for generating the backtranslation. Users of this dataset are expected to format tgt dataset examples in the format that the tgt->src model expects.
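A small illustrative sketch of those two EOS rules (a hypothetical helper, not the BacktranslationDataset code itself):

```python
import torch

def prepare_pair(generated_src, original_tgt, eos, remove_source_eos):
    # Optionally strip EOS from the generated source dialect backtranslation.
    if remove_source_eos and generated_src[-1].item() == eos:
        generated_src = generated_src[:-1]
    # Always make sure the original target ends with EOS before collating.
    if original_tgt[-1].item() != eos:
        original_tgt = torch.cat([original_tgt, original_tgt.new_tensor([eos])])
    return {'source': generated_src, 'target': original_tgt}
```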

Reviewed By: jmp84

Differential Revision: D10157725

fbshipit-source-id: eb6a15f13c651f7c435b8db28103c9a8189845fb

multihead_attention: pre-transpose incremental state (#232)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/232

Though transpose operations are essentially free during PyTorch execution, they can result in costly operations when exported to Caffe2 inference nets via ONNX tracing, especially when applied repeatedly to large tensors.

For this reason, we update `MultiheadAttention` to store its incremental state with shape (bsz, num_heads, seq_len, head_dim), that is, after transposing the projected input. This should result in non-trivially faster exported models without changing the semantics or speed of PyTorch execution.
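An illustrative sketch of the shape bookkeeping described above (not the actual MultiheadAttention code):

```python
import torch

bsz, num_heads, head_dim, step_len = 2, 4, 16, 1
# Projected keys for the current step: (step_len, bsz * num_heads, head_dim)
k = torch.randn(step_len, bsz * num_heads, head_dim)

# Old layout: keep (bsz * num_heads, seq_len, head_dim) and transpose on use.
# New layout: transpose once and cache as (bsz, num_heads, seq_len, head_dim),
# so the exported graph never has to transpose the growing cache again.
k_cached = k.transpose(0, 1).contiguous().view(bsz, num_heads, step_len, head_dim)

# Appending the next step only needs a concat along the seq_len dimension.
k_next = torch.randn(1, bsz * num_heads, head_dim)
k_cached = torch.cat(
    [k_cached, k_next.transpose(0, 1).contiguous().view(bsz, num_heads, 1, head_dim)],
    dim=2,
)
print(k_cached.shape)  # torch.Size([2, 4, 2, 16])
```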

Reviewed By: myleott

Differential Revision: D10186506

fbshipit-source-id: 8a42712423ee767ea49ed88d2a4653f900d14fba

Have noising account for sentences with and without EOS (#305)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/305

Previously, noising code assumed that every sentence had an EOS which had to be excluded from noising operations (since we shouldn't drop, blank, or shuffle EOS). This logic allows the noising module to handle sentences with EOS and without EOS

Reviewed By: xianxl

Differential Revision: D10114425

fbshipit-source-id: 04ec8547343eb94266bda1ac7fca3d8a1991c9f4

Add denoising dataset for denoising autoencoder (#306)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/306

This uses a source dataset to generate a batch of {source: noisy source, target: original clean source} which allows us to train a denoising autoencoding component as part of a seq2seq model.

Reviewed By: xianxl

Differential Revision: D10078981

fbshipit-source-id: 026225984d4a97062ac05dc3a36e79b5c841fe9c

fix make_positions() typo (#316)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/316

This code should actually be keeping the padded positions as `padding_idx` (though note that this is on the ONNX export path, and it has no effect in the most common case when using the exported network to do un-batched inference).

Reviewed By: myleott

Differential Revision: D10431872

fbshipit-source-id: 79fe4ac27cafcd4701e0f2a90e29d1b7362dc6f8

Update upgrade_state_dict in transformer.py to upgrade_state_dict_named (#317)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/317

When upgrading the `state_dict` variable, the `upgrade_state_dict` function in TransformerEncoder/TransformerDecoder doesn't handle multiple encoders/decoders; however, D10052908 introduces exactly that case.

Before this change, we hit error message [1] when loading a checkpoint for the multilingual_transformer model in D10052908. This diff fixes it.

Reviewed By: myleott, liezl200

Differential Revision: D10375418

fbshipit-source-id: 7104c1a463e78f3fa33d8479a37c51608be50610

Manually port pull request 385

Summary:
Manually port fairinternal fairseq-py pull request #385 [1] to fbcode.

Resolve the merge conflict of removing fp16_trainer per offline discussion with Myle. Also updated the code to make generate.py work.

[1] https://github.com/fairinternal/fairseq-py/pull/385/commits/18fa6e154781cf0c4b1596429dba7e753a545069

Reviewed By: liezl200

Differential Revision: D10052908

fbshipit-source-id: c3c378d78dc1e9ac087c815f359e78c0048ff2f5

Fix another distributed syncing issue

Summary:
This is another failure due to distributed GPUs getting out of sync.
We run save_and_eval (which has the inter-GPU communication calls) based on
the number of updates. But the number of updates means weight updates: whenever
there is an issue in training and weights can't be updated, nodes go
out of sync and start failing. So we should check the number of iterations instead.

I am, again, making a small change to save the day, but we should decouple/refactor
the save_and_eval logic from training, to have fewer headaches in the future.
I plan to work on that later, but this should solve some of the
issues for now.

Reviewed By: jhcross

Differential Revision: D10478427

fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c

Expose BacktranslationDataset from fairseq.data (#324)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/324

BacktranslationDataset was introduced recently but was not exposed as part of the fairseq.data module

Reviewed By: liezl200

Differential Revision: D10412717

fbshipit-source-id: 8a9d4ecd43fd376e895c450d00e765a869c95eff

Add size method to BacktranslationDataset + misc fixes (#325)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/325

RoundRobinZipDataset requires size(index) method implemented in every dataset used. Also added missing return statements in a few methods.

Reviewed By: liezl200

Differential Revision: D10457159

fbshipit-source-id: 01856eb455f2f3a21e7fb723129ff35fbe29e0ae

make fairseq models compatible with character inputs and use character inputs for elmo in pytext

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/321

Reviewed By: alexeib

Differential Revision: D10430186

fbshipit-source-id: 9cc8fe0f202cc49370cecf36312bcc9bf0b4deee

LanguagePairDataset and BacktranslationDataset changes for semi supervised task setup (#330)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/330

As part of the semi-supervised task setup (https://github.com/pytorch/translate/pull/243), this diff adds the ability for LanguagePairDataset to remove EOS from the source or append EOS to the target. This functionality is required by BacktranslationDataset to use translations as source data.

Also added changes to BacktranslationDataset to make it work on GPU. We needed to transfer back-translated sentences back to CPU for the LanguagePairDataset to collate.

Reviewed By: liezl200

Differential Revision: D10846294

fbshipit-source-id: b015ecb5fcef26fba507c30f8a4992bdbc54899f

Fix print & add more informative logging

Summary: Fix fairseq's `force` option for disabling print suppression (otherwise, `print(..., force=True)` fails on master since the force kwarg gets passed to the builtin print).

Reviewed By: dpacgopinath

Differential Revision: D10522058

fbshipit-source-id: bbc10c021a7d21396ebfbb1bf007f6b9b162f4fd

Extend WordShuffle noising function to apply to non-bpe tokens

Summary:
We'd like to reuse the noising functions and DenoisingDataset in
adversarial training. However, the current noising functions assume the input
consists of subword tokens. The goal of this diff is to extend them so the noising
can be applied to word tokens. Since we're mostly interested in word shuffle
noising, I only modified the WordShuffle class.

Reviewed By: liezl200

Differential Revision: D10523177

fbshipit-source-id: 1e5d27362850675010e73cd38850c890d42652ab

transformer onnx trace: skip no-op transpose (#333)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/333

A tiny hack to speed up inference slightly for transformer beam search after export to graph mode. Specifically, there is no need to transpose a dimension with size 1 (the sequence length of a single decoder time step during beam search) with its neighbor immediately before a view/reshape.
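A tiny illustrative sketch of the idea (hypothetical helper, not the actual transformer code): when the traced dimension has size 1, skipping the transpose produces the same reshaped result, so the op never appears in the exported graph.

```python
import torch

def flatten_step(x):
    # x: (seq_len, bsz, embed_dim); during incremental beam search seq_len == 1
    if x.size(0) != 1:
        x = x.transpose(0, 1)            # (bsz, seq_len, embed_dim)
    return x.reshape(-1, x.size(-1))     # identical result when seq_len == 1
```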

Reviewed By: jmp84

Differential Revision: D12833011

fbshipit-source-id: f9c344a9ad595e6e48a8a65b31cf2b1392f9b938

match examples/stories/writingPrompts scripts to correct folder

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/290

Differential Revision: D12876759

Pulled By: myleott

fbshipit-source-id: 9f6d1c9de27dad29368a7edb923dfcf770355938

Update bleu.py (#320)

Summary:
Modify Error message of bleu.
Fix the issue:  https://github.com/pytorch/fairseq/issues/284
Pull Request resolved: https://github.com/pytorch/fairseq/pull/320

Differential Revision: D12876721

Pulled By: myleott

fbshipit-source-id: df25885a94a584cbf4b86a1665e3e513c7eb8e9a

Fix tests + style nits + Python 3.5 compat

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/336

Differential Revision: D12876709

Pulled By: myleott

fbshipit-source-id: a31536e2eb93f752600b9940c28e9b9fcefc8b86

Denoising autoencoder task (#251)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/251

We should use shared encoder and separate decoders as in:

https://fb.facebook.com/groups/2156114531381111/permalink/2169028113423086/

Generation is a hack; ideally, the net input should have the lang-pair info so that when we pass the sample to the model, it can select the correct encoder/decoder pair.

diff [2/2] will be for flow integration for basic experimentation

TODO in a future diff: figure out how to generalize this so export will work??

This works with vocab reduction, but we only support vocab reduction for src-tgt, not src-src model. A future (lowpri) task could be to add word prediction vocab reduction for src-src model to speed up training.

Reviewed By: xianxl

Differential Revision: D10512576

fbshipit-source-id: 545d96cad8e814b9da7be102a48cc5cac358b758

Move fairseq part of D10478427 directly into pytorch-translate (#337)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/337

Pull Request resolved: https://github.com/pytorch/translate/pull/250

Reviewed By: akinh

Differential Revision: D12880352

fbshipit-source-id: 61e9888a9cc3df07e805820b74a5fcf359dfe0ea

Fix "ignore-case" behavior (#339)

Summary:
Currently, if `ignore-case` is set, the same line will be yielded twice (once as the lower-cased version, once as the original version), leading to lower-than-expected uncased scores.
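An illustrative sketch of the intended behavior (a hypothetical reader, not the actual scoring code): each line is yielded exactly once, lower-cased when `ignore-case` is set.

```python
def read_lines(path, ignore_case):
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            # Yield one version per input line, never both.
            yield line.lower() if ignore_case else line
```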
Pull Request resolved: https://github.com/pytorch/fairseq/pull/339

Differential Revision: D12890386

Pulled By: myleott

fbshipit-source-id: 0570e5f6e8f848f2c6439d615e70aca6df097eef

Black formatting in fairseq/test_noising (#341)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/341

Use black formatting in test_noising.py

Reviewed By: xianxl

Differential Revision: D12810285

fbshipit-source-id: 5517dd5d2f086831f487d88acf6bc2fa18820297

Refactor fairseq/test_noising with a word shuffle helper function (#340)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/340

This allows us to do a lot less copy paste when adding new word shuffle function tests

Reviewed By: xianxl

Differential Revision: D12810304

fbshipit-source-id: a56b5df093d17be2b73837897c526978cab92b70

Support BPE end of word marker suffix in fairseq noising module

Summary:
There are 2 ways to implement BPE:
1. use a continuation marker suffix to indicate that there is at least one more subtoken left in the word
2. use an end-of-word marker suffix to indicate that there are no more subtokens left in the word

This adds some logic to account for either kind of BPE marker suffix. This diff adds a corresponding test. I also refactored the test setup to reduce the number of boolean args when setting up test data.
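A small illustrative helper showing how word boundaries can be recovered under either marker style; the marker strings "@@" and "</w>" are just example conventions, not necessarily what the test data uses.

```python
def word_lengths(subtokens, marker, marker_is_end_of_word):
    """Return the number of subtokens in each word."""
    lengths, count = [], 0
    for tok in subtokens:
        count += 1
        if marker_is_end_of_word:
            # End-of-word marker: the suffix means the word is complete.
            ends_word = tok.endswith(marker)
        else:
            # Continuation marker: the suffix means more subtokens follow.
            ends_word = not tok.endswith(marker)
        if ends_word:
            lengths.append(count)
            count = 0
    if count:
        lengths.append(count)
    return lengths

print(word_lengths(["he@@", "llo", "world"], "@@", marker_is_end_of_word=False))      # [2, 1]
print(word_lengths(["he", "llo</w>", "world</w>"], "</w>", marker_is_end_of_word=True))  # [2, 1]
```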

Reviewed By: xianxl

Differential Revision: D12919428

fbshipit-source-id: 405e9f346dce6e736c1305288721dfc7b63e872a

Merge internal changes

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/352

Differential Revision: D12956930

Pulled By: myleott

fbshipit-source-id: 39334a79544bac570feb04be9103269d7c1563f9

Fix error when training multilingual_translation task with multi-GPU

Summary:
D10052908 introduced the multilingual_translation task, but it raises an exception when training with multiple GPUs: P60202593

With Myle's help, we found that it is caused by an improperly handled dummy batch data type, which causes optimizer.backward() to not be executed the same number of times across different GPUs.

Reviewed By: xianxl

Differential Revision: D12964263

fbshipit-source-id: 4991039030bf373f0c484e131acc4736487be4d8

pipeline for LM training

Summary:
Step 2 of the pipeline for LM training.
Assumes tokenized text data as input; splits it into train/validation/test and runs binarization
(step a_ii in https://fb.quip.com/kazzAxvZHBj9)

Reviewed By: borguz

Differential Revision: D10454705

fbshipit-source-id: 74e8679041f5507c4e404c1b719547c2ae9ed983

Support for BPE vocabs + denoising autoencoder in PyTorch Translate (#362)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/362

Pull Request resolved: https://github.com/pytorch/translate/pull/254

This actually uses the fairseq logic which supports BPE cont / end word marker suffixes.

Reviewed By: xianxl

Differential Revision: D12952766

fbshipit-source-id: 35a1bbc38240e4145bec0fc419f2d0a6a73ae2e5

Fix dummy batch when --max-tokens is small (fixes #347)

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/366

Differential Revision: D13058513

Pulled By: myleott

fbshipit-source-id: a146d2cfb345d404775ed8d6b8e4a4ad4e7a33b4

make dictionary optional

Reviewed By: jingfeidu

Differential Revision: D13104360

fbshipit-source-id: 9636f5ee2721818f98b33af559fa24292534a72f

Add LegacyDistributedDataParallel in place of no_c10d (#370)

Summary:
This should bring back the speedup with --update-freq that we reported in the Scaling Neural Machine Translation paper.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/370

Differential Revision: D13100281

Pulled By: myleott

fbshipit-source-id: 4a81b51bb7390a197add314a4be5512bbf68c085

Fix build for docs

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/372

Differential Revision: D13114426

Pulled By: myleott

fbshipit-source-id: 6c24b96a3556a0ecd3d1f350642a884254a40bd3

Merge small fixes from internal

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/374

Differential Revision: D13116074

Pulled By: myleott

fbshipit-source-id: 485724cc5a40e8360d21e4bf9c35821baa0ddc57

Protect against failures in case of OOMs

Summary: Fixing some distributed failures that happen when OOMs are observed.

Reviewed By: myleott

Differential Revision: D13121054

fbshipit-source-id: f71a0a695332acbaa1797e89887b8b7c7ddaa727

Refactor BacktranslationDataset to be more reusable (#354)

Summary:
- generalize AppendEosDataset -> TransformEosDataset
- remove EOS logic from BacktranslationDataset (use TransformEosDataset instead)
- BacktranslationDataset takes a backtranslation_fn instead of building the SequenceGenerator itself
Pull Request resolved: https://github.com/pytorch/fairseq/pull/354

Reviewed By: liezl200

Differential Revision: D12970233

Pulled By: myleott

fbshipit-source-id: d5c5b0e0a75eca1bd3a50382ac24621f35c32f36

Fix some recursive functions (e.g., reorder_incremental_state) to only touch each module once (#379)

Summary:
This can happen if a module is registered in more than one place in the network.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/379

Differential Revision: D13154498

Pulled By: myleott

fbshipit-source-id: a35575d1956a46cd35ac8b16a719ad20ac3e380a

onnx bi-transformer (#385)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/385

Pull Request resolved: https://github.com/facebookresearch/pytext/pull/6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/14292

Reviewed By: jingfeidu

Differential Revision: D10517864

fbshipit-source-id: 81008b5cc6aab70e23329c187392fb72ee057d78

Decoder embedding sharing in PyTorch Translate for denoising autoencoder (#386)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/386

Pull Request resolved: https://github.com/pytorch/translate/pull/266

This allows decoder embedding sharing for denoising autoencoder modules with different decoders (one for src decoding and one for tgt decoding)

Reviewed By: dpacgopinath

Differential Revision: D13133015

fbshipit-source-id: 3c98be639d705744ccf5ba3a8fd7d10ddc7aef4a

Fix --ddp-backend=no_c10d for params that don't require grads

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/388

Reviewed By: theweiho

Differential Revision: D13244869

fbshipit-source-id: d22c18f63f9a691ccc7245e06bc9a5b776a192b5

fixes on bi-transformer onnx

Summary: replace dynamic index put with copying and creating a new tensor

Reviewed By: wanchaol

Differential Revision: D13244573

fbshipit-source-id: 909f7913ad579ed035f29bb52321ff01e09a2c60

fixed torch 0.4.0 , "RuntimeError: Expected object of type torch.cuda… (#393)

Summary:
….LongTensor but found type torch.cuda.FloatTensor for argument #3 'index'" error.

With torch.__version__ == 0.4.0,
new_order = torch.arange(bsz).view(-1, 1).repeat(1, beam_size).view(-1)
returns a float dtype Tensor; when line 321 of fairseq/fairseq/models/fconv.py is executed, this throws a RuntimeError.
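A minimal sketch of the fix idea: make the index dtype explicit so the code works both on torch 0.4.0 (float default) and on later versions.

```python
import torch

bsz, beam_size = 3, 5
# Cast to long explicitly; index_select requires an integer index tensor.
new_order = torch.arange(bsz).view(-1, 1).repeat(1, beam_size).view(-1).long()
encoder_out = torch.randn(bsz, 7)
expanded = encoder_out.index_select(0, new_order)  # (bsz * beam_size, 7)
```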
Pull Request resolved: https://github.com/pytorch/fairseq/pull/393

Differential Revision: D13276496

Pulled By: myleott

fbshipit-source-id: e7986246fbe2c79fff61bcab0e5bec9dd63e0afd

Better error message if workers fall out of sync (#396)

Summary:
This kind of issue should be rare, but the exception that was thrown before ("UnpicklingError: invalid load key") was very opaque, so let's use something a bit clearer.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/396

Differential Revision: D13325600

Pulled By: myleott

fbshipit-source-id: 2e7093752d45d6b04a3d506aca8d5694b72ab638

Enable check_reduction for imagenet flow and fairseq

Summary:
As the title says, better to enable this for certain use cases to make
sure things are right

Reviewed By: myleott, pietern

Differential Revision: D13351753

fbshipit-source-id: cf495960fda71ebd679c23212e19703c93a9dbdc

Add check that --encoder-layers matches --decoder-layers for LSTM (fixes #394)

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/398

Differential Revision: D13358876

Pulled By: myleott

fbshipit-source-id: 57673f2643aac01492cb8f5728bb9f1a34ba6aa7

Fix arg formatting in preprocess.py and add fmt control for black formatting (#399)

Summary:
Not switching to Black formatting just yet, but adding fmt: off directives in case we decide to later.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/399

Differential Revision: D13364674

Pulled By: myleott

fbshipit-source-id: a20a11a18be3d583ee30eff770278fb4bd05b93c

Warn when using --update-freq on a single machine and --ddp-backend != no_c10d

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/400

Differential Revision: D13366996

Pulled By: myleott

fbshipit-source-id: b4907815e7cc1b4a2aceab11210bf64cb3d814c9

Take a dummy train step under OOM to keep multiprocessing in sync

Summary: This is not a guaranteed solution (since processes may still get out of sync if OOM happens after an all_gather/all_reduce has been done) - but should still make multiprocessing training more robust in practice since it seems we usually OOM early enough.

Reviewed By: myleott

Differential Revision: D13086018

fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee

Add --fp16-scale-tolerance (#397)

Summary:
Let's only decrease the loss scale if a large enough percentage of batches overflow.
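A minimal, illustrative loss-scaler sketch of that tolerance rule (not fairseq's actual DynamicLossScaler; the default constants are placeholders):

```python
class TolerantLossScaler:
    def __init__(self, init_scale=2.0 ** 7, scale_window=256, tolerance=0.05):
        self.scale = init_scale
        self.scale_window = scale_window
        self.tolerance = tolerance
        self._iter = 0
        self._last_rescale_iter = -1
        self._overflows_since_rescale = 0

    def update_scale(self, overflow):
        self._iter += 1
        if overflow:
            self._overflows_since_rescale += 1
            pct = self._overflows_since_rescale / (self._iter - self._last_rescale_iter)
            # Only halve the scale once enough recent batches have overflowed.
            if pct >= self.tolerance:
                self.scale /= 2.0
                self._last_rescale_iter = self._iter
                self._overflows_since_rescale = 0
        elif (self._iter - self._last_rescale_iter) % self.scale_window == 0:
            # No overflow for a full window: try a larger scale again.
            self.scale *= 2.0
```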
Pull Request resolved: https://github.com/pytorch/fairseq/pull/397

Differential Revision: D13355159

Pulled By: myleott

fbshipit-source-id: e17dde73d34a639519b4348c013fdd19d2b314e6

fix data checking report bug (#403)

Summary:
The original code reports the size of a valid sample instead of an invalid one when raising an Exception, which is confusing.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/403

Differential Revision: D13391431

Pulled By: myleott

fbshipit-source-id: 4642ed027c0f664424fc5a9baf4363791144feaf

Loading PreTrained Models (#406)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/406

Static helper function in TranslationTask to load pretrained models

Reviewed By: myleott

Differential Revision: D13345276

fbshipit-source-id: 3a675ee1a144ceb8b010f30e1a6163ef670b53f3

data per gpu change

Summary: Avoid loading entire data set per gpu to reduce memory footprint

Reviewed By: rutyrinott

Differential Revision: D13163548

fbshipit-source-id: 4ba717c8021ba5723d02225bae5782e2c3a18640

Add BufferedIterator (#419)

Summary:
This improves performance for datasets that load data lazily. Enabled by default since it shouldn't compromise performance for non-lazy datasets.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/419

Differential Revision: D13546585

Pulled By: myleott

fbshipit-source-id: f6152e2047291b0d68cd7506cd772b0caafe95be

Improve memory efficiency of FP16 optimization (#404)

Summary:
Previously when training with --fp16, we stored a copy of the model parameters in FP32 for optimization, which consumed a lot of memory. An alternative is to just do the conversions to FP32 on the fly, which allows the caching allocator to reuse/save some memory.

This reduces peak memory usage by ~20% with a negligible reduction in training speed (~2% slower) when training a big transformer on 8 GPUs on wmt en-de with --update-freq=16.

This does not affect convergence, i.e., models will train exactly as they did before.
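An illustrative sketch of the on-the-fly conversion idea using plain SGD (not the actual fairseq FP16 optimizer):

```python
import torch

def sgd_step_fp16(params, lr, grad_scale):
    for p in params:
        if p.grad is None:
            continue
        p32 = p.data.float()                      # temporary FP32 copy
        g32 = p.grad.data.float().div_(grad_scale)
        p32.add_(g32, alpha=-lr)                  # the update happens in FP32
        p.data.copy_(p32)                         # write back into FP16 storage
        # p32 / g32 go out of scope here, so the caching allocator can reuse
        # their memory instead of holding a persistent FP32 master copy.
```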
Pull Request resolved: https://github.com/pytorch/fairseq/pull/404

Differential Revision: D13394376

Pulled By: myleott

fbshipit-source-id: 2b9f808548df4782110513c9cfc9f7c6159bcbbf

Add option to disable positional embeddings in TransformerModel (#421)

Summary:
Add argument `--no-token-positional-embeddings` to TransformerModel (currently only available in TransformerLanguageModel) to disable positional embeddings.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/421

Differential Revision: D13548450

Pulled By: myleott

fbshipit-source-id: b352c702ed1609e3b84d9a8404941d3274a7f883

Merge internal changes (#422)

Summary:
- 04cc608: Add `--match-source-len` option to generate.py for sequence-tagging tasks
- 19f1a40: Add `--no-repeat-ngram-size` option to generate.py for ngram blocking
Pull Request resolved: https://github.com/pytorch/fairseq/pull/422

Differential Revision: D13548445

Pulled By: myleott

fbshipit-source-id: 26d1ae83993e428fcb020dac5ae358b0e36233d9

Fix backtranslation dataset on IndexedCachedDataset (#410)

Summary:
BacktranslationDataset would throw an error when the underlying dataset was an IndexedCachedDataset because prefetching was not handled correctly. This fixes the error.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/410

Differential Revision: D13557539

Pulled By: myleott

fbshipit-source-id: 398ab59a3ebdbf1c666d862b9f905654eece800c

Fix resuming from FP16 checkpoints (#424)

Summary:
This was broken in 03a57de.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/424

Differential Revision: D13557540

Pulled By: myleott

fbshipit-source-id: 62deda5353032aff20d35d046b0bb843da44d27c

Make multiprocessing_train.py work with multi-node setups

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/425

Differential Revision: D13558340

Pulled By: myleott

fbshipit-source-id: dff8c77027e821d8c80bfbd6a6ccce9ca1a44b78

Merge internal changes (#283)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/283

Pull Request resolved: https://github.com/pytorch/fairseq/pull/428

Differential Revision: D13564190

Pulled By: myleott

fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5

rm fb_train.py (#432)

Cleanup more files

Update docs for --lazy-load and torch.distributed.launch

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/433

Differential Revision: D13588032

Pulled By: myleott

fbshipit-source-id: 0e5ff361e27b206c4490264f0f51863367499e81

Fix broken link in README.md (#436)

Summary:
https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset is no longer valid; it redirects to a blog post listing page.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/436

Differential Revision: D13607961

Pulled By: myleott

fbshipit-source-id: 1a1074ffcbc454e29bc9d5aed84fdf2089a224bc

Misc fixes

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/439

Differential Revision: D13608151

Pulled By: myleott

fbshipit-source-id: 198b84995a6329f8329829cc91184d88f1eab947

Make error message for trying to train after make_generation_fast work correctly

Summary: https://github.com/pytorch/fairseq/blob/master/fairseq/trainer.py#L164 calls `train()` without any argument

Reviewed By: myleott

Differential Revision: D13599203

fbshipit-source-id: 3a096a6dd35a7a3f8309fbda3b54a36f606475e3

Fixes (#442)

Summary:
minor fixes:
1- adding fairseq logo
2- encoder padding for fconv self att
3- legacy ddp change
Pull Request resolved: https://github.com/pytorch/fairseq/pull/442

Differential Revision: D13651715

Pulled By: myleott

fbshipit-source-id: ac93c80f1dbffdfe03fbd4b8a8ea527aecb576a7

New command line option '--user-dir' (#440)

Summary:
Following discussion on official fairseq (https://github.com/pytorch/fairseq/issues/438), I added the `--user-dir` option to the command line. The user can now specify a path in order to import a custom module with proprietary tasks, architectures and so on.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/440

Differential Revision: D13651721

Pulled By: myleott

fbshipit-source-id: 38b87454487f1ffa5eaf19c4bcefa0b3b15a8f43

'--user-dir' documentation (correct) (#447)

Summary:
Command line option --user-dir documented in docs/overview.rst
Pull Request resolved: https://github.com/pytorch/fairseq/pull/447

Differential Revision: D13674744

Pulled By: myleott

fbshipit-source-id: 17049ee5c9f692f5298ef9fa7381ee583f269cde

Fixed wrong help message shown on '--help' (#446)

Summary:
The correct help message was obscured by the transient `ArgumentParser` used only to eagerly read the `--user-dir` flag.

To reproduce just try:
```bash
python3 train.py --help
```
Pull Request resolved: https://github.com/pytorch/fairseq/pull/446

Differential Revision: D13674731

Pulled By: myleott

fbshipit-source-id: b9503a4d7ef26405be630d31c0ca02386d783031
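
For context, a common pattern for eagerly reading one flag without clobbering `--help` is a throwaway parser created with `add_help=False`; a minimal sketch under that assumption (not the actual fairseq options code):

```python
import argparse

def preparse_user_dir(argv):
    # The transient parser must not define -h/--help, otherwise it swallows
    # the help request before the real parser can print the full usage.
    pre = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    pre.add_argument("--user-dir", default=None)
    args, remaining = pre.parse_known_args(argv)
    return args.user_dir, remaining
```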

optimizations for token_block_dataset

Summary:
Optimize memory use of token_block_dataset by replacing Python data structures with numpy arrays.
This applies the needed parts from D13498973 instead of rebasing it on recent changes.

Reviewed By: edunov

Differential Revision: D13678485

fbshipit-source-id: c0c827a8b95834a6a5456476040ebdc8e42136d4
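
A rough illustration of the memory-saving idea (function and variable names are hypothetical; the real token_block_dataset logic is more involved): keep the (start, end) slice of every block in a single numpy array rather than a Python list of tuples.

```python
import numpy as np

def block_slices(total_tokens, block_size):
    """Return an int64 array of [start, end) offsets covering the token
    stream in fixed-size blocks; one numpy row per block instead of a
    per-block Python tuple object."""
    starts = np.arange(0, total_tokens, block_size, dtype=np.int64)
    ends = np.minimum(starts + block_size, total_tokens)
    return np.stack([starts, ends], axis=1)

# e.g. 10 tokens in blocks of 4 -> [[0, 4], [4, 8], [8, 10]]
print(block_slices(10, 4))
```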

Add --checkpoint-upper-bound to average_checkpoints.py (#452)

Summary:
This is useful for averaging the last N checkpoints, ending at some "best" checkpoint.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/452

Differential Revision: D13695407

Pulled By: myleott

fbshipit-source-id: 5d9d2bff3706834f01501e9259834c77fb335817
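
A hedged sketch of the averaging idea behind average_checkpoints.py, assuming each checkpoint stores its parameters under a "model" key (a simplification of the real script, which also handles the rest of the checkpoint state):

```python
import torch

def average_model_params(paths):
    """Element-wise average of the model parameters across checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v.div_(len(paths)) for k, v in avg.items()}
```

With `--checkpoint-upper-bound`, the list of paths would simply be the last N checkpoints up to and including the chosen "best" one.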

FIX: '--user-dir' on multi-gpu (#449)

Summary:
On a multi-gpu training scenario, the `train.py` script spawns new processes with `torch.multiprocessing.spawn`. Unfortunately those child processes don't inherit the modules imported with `--user-dir`.

This pull request fixes the problem: the custom module import is now explicit in every `main()` function.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/449

Differential Revision: D13676922

Pulled By: myleott

fbshipit-source-id: 520358d66155697885b878a37e7d0484bddbc1c6
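
The fix amounts to repeating the dynamic import inside each spawned worker; a minimal sketch of the pattern with a hypothetical `import_user_module` helper (not the exact fairseq code):

```python
import importlib
import os
import sys

def import_user_module(user_dir):
    if user_dir is None:
        return
    user_dir = os.path.abspath(user_dir)
    parent, name = os.path.split(user_dir)
    if name not in sys.modules:
        sys.path.insert(0, parent)
        importlib.import_module(name)

def worker_main(rank, args):
    # Spawned child processes do not inherit modules that were imported
    # dynamically in the parent, so the --user-dir import must happen again here.
    import_user_module(args.user_dir)
    ...
```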

Fix initial learning rate (#453)

Summary:
There was a very subtle bug here 😢 When we recently removed this line (7633129ba8d5f0e28bd6b6d6027b14352482ef31), it meant that the learning rate scheduler didn't get initialized until after the first update. Unfortunately PyTorch optimizers store the learning rate in their internal state, so some learning rate schedulers use their `__init__` method to reset the learning rate to some sane initial value. This is especially problematic for LR schedulers that include a warmup, where the Optimizer is likely to contain the peak learning rate at initialization, and it's only in the LR scheduler's `__init__` that the (much smaller) warmup value is set.

For example, the inverse_sqrt scheduler resets the learning rate upon initialization:
https://github.com/pytorch/fairseq/blob/7853818c2e33a63ec17a31bcfe20e4fc75d94130/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py#L48-L50

**Impact:** For the last ~1.5 weeks, the first training update would use the optimizer's default learning rate instead of the initial rate set by the LR scheduler. All subsequent updates used the correct learning rates. This primarily affects LR schedulers with warmups.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/453

Differential Revision: D13704453

Pulled By: myleott

fbshipit-source-id: a946da30100f837c66bdc6b9b77b014ab4eb8764
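
A small sketch of why the construction order matters for warmup schedulers; the numbers and the scheduler class are hypothetical (the pattern mirrors inverse_sqrt, but this is not the fairseq class itself):

```python
import torch

model = torch.nn.Linear(8, 8)
# The optimizer is created with the *peak* learning rate...
opt = torch.optim.SGD(model.parameters(), lr=5e-4)

class WarmupScheduler:
    def __init__(self, optimizer, warmup_init_lr=1e-7):
        self.optimizer = optimizer
        # ...so the scheduler's __init__ must immediately reset it to the
        # (much smaller) warmup value. If the scheduler is only built after
        # the first update, that first update runs at the peak rate.
        for group in optimizer.param_groups:
            group["lr"] = warmup_init_lr

WarmupScheduler(opt)
print(opt.param_groups[0]["lr"])  # 1e-07, not 5e-04
```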

Fix stories generation

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/454

Differential Revision: D13708565

Pulled By: myleott

fbshipit-source-id: 5cd0e07e3e1885eef14e3a5e8074f24cf4bde632

Fix iteration bug in GroupedIterator. Correct sent size filter. (#455)

Summary:
Fix a bug where iteration restarted from the beginning when initializing the GroupedIterator (https://github.com/pytorch/fairseq/issues/441).
Correct the filter criterion for dict-type sentence sizes (https://github.com/pytorch/fairseq/issues/451).
Pull Request resolved: https://github.com/pytorch/fairseq/pull/455

Differential Revision: D13725646

Pulled By: myleott

fbshipit-source-id: e698fa6f9b45460f95a75c9e9976a3aa3b6aa523
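
For reference, a minimal grouped iterator that chunks an existing iterator without consuming it from the beginning might look like this (illustrative only, not the fairseq class):

```python
import itertools

def grouped(iterable, chunk_size):
    it = iter(iterable)  # keep the caller's position; do not restart
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# e.g. list(grouped(range(7), 3)) == [[0, 1, 2], [3, 4, 5], [6]]
```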

change f"{args}" to "{}".format(args) (#467)

Summary:
Although both are supported by Python 3.6, I think it would be better to unify the usage of the string formatting functions.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/467

Differential Revision: D13802506

Pulled By: myleott

fbshipit-source-id: 5c4877547b1c4ca806ab54c80ae483cfbaa7827a
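
Both spellings produce identical output; the change just standardizes on `.format()`, e.g.:

```python
args = {"lr": 0.25}
assert f"{args}" == "{}".format(args)  # both render the same string
```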

Better error message for improperly formatted dictionaries

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/468

Differential Revision: D13802590

Pulled By: myleott

fbshipit-source-id: e374e38e74dc91bda0579ae41e26289fb0ba56a2

Enforce UTF-8 when open() text files (#460)

Summary:
When opening text files without specifying the encoding (i.e. `open(path, "r")` or `open(path, "w")`), python3 will use the preferred locale encoding (`locale.getpreferredencoding()`) so the result is platform dependent and can change from one machine to another.

I believe fairseq should enforce its own standard (UTF-8 seems like the best choice to me). This pull request explicitly specifies UTF-8 encoding when opening text files.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/460

Differential Revision: D13802525

Pulled By: myleott

fbshipit-source-id: 672fd55707ee559ab36d74bc1c24026166ea2367
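
The change boils down to always passing an explicit encoding ("dict.txt" below is just a placeholder path):

```python
# Platform dependent: uses locale.getpreferredencoding(), e.g. cp1252 on some Windows setups
with open("dict.txt") as f:
    lines = f.readlines()

# Deterministic across machines: always decode as UTF-8
with open("dict.txt", encoding="utf-8") as f:
    lines = f.readlines()
```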

Print model and number of trained params

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/469

Differential Revision: D13802945

Pulled By: myleott

fbshipit-source-id: b6976506a8336b96ee40505c4a7638541cc99c95
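
The reporting itself is a one-liner in PyTorch; a sketch of the kind of summary being added (the `describe` helper and the exact output format are hypothetical):

```python
def describe(model):
    num_params = sum(p.numel() for p in model.parameters())
    num_trained = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(model)
    print("| num. model params: {} (num. trained: {})".format(num_params, num_trained))
```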

LSTM improvements (fixes #414)

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/470

Differential Revision: D13803964

Pulled By: myleott

fbshipit-source-id: 91b66599e9a539833fcedea07c608b349ba3b449

Only use c10d distributed primitives

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/471

Differential Revision: D13818918

Pulled By: myleott

fbshipit-source-id: d3b8dc50e81ee1d2dcc5efc5815998be8461085f

Adafactor Optimizer (#472)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/472

Implementation of "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost" (https://arxiv.org/abs/1804.04235)

Differential Revision: D13388049

fbshipit-source-id: 24ad30f4bac248e6aeaced5064bb83784058f03d
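
The memory saving in Adafactor comes from storing per-row and per-column second-moment accumulators instead of a full matrix; a sketch of that rank-1 reconstruction (core idea only, without the moving averages, decay schedule, or update clipping of the full optimizer):

```python
import torch

def factored_second_moment(row_sums, col_sums):
    """Reconstruct an approximate per-element second moment V from its row
    sums R and column sums C: V ~= (R C^T) / sum(R)."""
    return row_sums.unsqueeze(1) * col_sums.unsqueeze(0) / row_sums.sum()

g = torch.randn(4, 3)
# Only the 4-element row accumulator and 3-element column accumulator need to
# be kept between steps; the 4x3 matrix is rebuilt transiently.
v_approx = factored_second_moment((g ** 2).sum(dim=1), (g ** 2).sum(dim=0))
update = g / (v_approx.sqrt() + 1e-30)  # analogue of Adam's g / sqrt(v)
```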

Refactor AdversarialTrainer: factor out helper functions

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/474

Reviewed By: theweiho, akinh

Differential Revision: D13701447

fbshipit-source-id: 34036dce7601835b605e3b169210edc7a6715de6

Add code for "Pay Less Attention with Lightweight and Dynamic Convolutions" (#473)

Summary:
Changelog:
- `e330f56`: Add code for the "Pay Less Attention with Lightweight and Dynamic Convolutions" paper (a sketch of the lightweight-convolution idea follows this entry)
- `5e3b98c`: Add scripts for computing tokenized BLEU with compound splitting and sacrebleu
- update READMEs
- misc fixes
Pull Request resolved: https://github.com/pytorch/fairseq/pull/473

Differential Revision: D13819717

Pulled By: myleott

fbshipit-source-id: f2dc12ea89a436b950cafec3593ed1b04af808e9
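
For orientation, the lightweight-convolution building block from the paper is essentially a depthwise 1-D convolution whose kernel is softmax-normalized and shared across heads; a minimal sketch under that reading (no GLU and no dynamic weight generation; names are illustrative, not the fairseq module):

```python
import torch
import torch.nn.functional as F

def lightweight_conv(x, head_weights):
    """x: (batch, channels, time); head_weights: (num_heads, kernel_size)."""
    B, C, T = x.shape
    H, K = head_weights.shape
    w = F.softmax(head_weights, dim=-1)            # normalize each head's kernel
    w = w.repeat_interleave(C // H, dim=0)         # share each head over C // H channels
    w = w.unsqueeze(1)                             # (C, 1, K) depthwise filters
    return F.conv1d(x, w, padding=K // 2, groups=C)[..., :T]

x = torch.randn(2, 8, 16)
weights = torch.randn(4, 3)                        # 4 heads, kernel size 3
out = lightweight_conv(x, weights)                 # (2, 8, 16)
```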

Make dictionary class an input for fairseq preprocess functions (#482)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/482

With this change, we can use different dictionary classes when calling build_dictionary and build_and_save_dictionary

Reviewed By: liaimi

Differential Revision: D13855100

fbshipit-source-id: 62e6db310b5f078e05c547d2671252233be7b7f0

Merge internal changes (#483)

Summary:
Changelog:
- `4889802`: can now detokenize sentencepiece output with `--remove-bpe=sentencepiece` (fixes #331). Also added `--sacrebleu` for computing detokenized BLEU (see the sketch after this entry).
- `0d76427`: fix assertion error when training language model with dataset containing empty sentences
- minor bug and style fixes
Pull Request resolved: https://github.com/pytorch/fairseq/pull/483

Differential Revision: D13867899

Pulled By: myleott

fbshipit-source-id: 25c940b847fe270262ac8f5ac838407b3977fdda
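
A rough sketch of what sentencepiece removal plus detokenized BLEU amounts to; it assumes the sacrebleu package, and the post-processing shown is the standard "▁" marker convention rather than necessarily fairseq's exact code path:

```python
import sacrebleu

def remove_sentencepiece(text):
    # Collapse sentencepiece pieces back into words: drop the spaces between
    # pieces, then turn the '▁' word-boundary marker back into a space.
    return text.replace(" ", "").replace("\u2581", " ").strip()

hyp = remove_sentencepiece("▁Hello ▁wor ld ▁!")
print(hyp)                                              # "Hello world !"
print(sacrebleu.corpus_bleu([hyp], [["Hello world !"]]).score)
```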

Add --input option to interactive.py to support reading from file

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/484

Differential Revision: D13880636

Pulled By: myleott

fbshipit-source-id: 984b2e1c3b281c28243102eb971ea45ec891d94e

Do distributed init after data loading

Summary:
FACEBOOK

This switches back to torch.multiprocessing.spawn, instead of directly calling fb_train.par using a subprocess.Process. This has the advantage that exceptions are propagated properly. It also moves the distributed_init part to happen after data loading, which gets around the timeout issue.

The downside of this approach is that it's not so easy to pipe stdout to multiple places, which was nice when using the sweep.py scripts. I'm still working on a fix for that.

Reviewed By: rutyrinott, ngoyal2707

Differential Revision: D13873224

fbshipit-source-id: 08d593233b8d23590c01c723363630a79804a8b0
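
The spawn-based pattern referred to here looks roughly like this self-contained toy example (gloo backend and localhost rendezvous are assumptions for illustration; it is not the fairseq training loop):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Heavy per-process data loading can happen here, *before*
    # init_process_group, so slow preprocessing does not hit the rendezvous timeout.
    dist.init_process_group(
        backend="gloo", init_method="tcp://127.0.0.1:29500",
        world_size=world_size, rank=rank)
    t = torch.ones(1) * rank
    dist.all_reduce(t)                     # exceptions raised here propagate to the parent
    print("rank", rank, "sum of ranks:", t.item())

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```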

Support custom Dictionary implementations in 'preprocess.py' (#448)

Summary:
The `preprocess.py` script has been refactored in order to:

1. Use the `options` module for command-line argument parsing. This gives `preprocess.py` the ability to load custom modules with the `--user-dir` flag (already implemented for all other binaries).
2. Dictionary loading and building code has moved to the Task implementation. This allows custom Dictionary classes to be used during the data generation step.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/448

Differential Revision: D13674819

Pulled By: myleott

fbshipit-source-id: b40648a98ed6c08284577e5ec25876e018d8c822

Add standalone binaries

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/489

Differential Revision: D13956810

Pulled By: myleott

fbshipit-source-id: 61ace179d1d3790226c38b3f3e47f5452b5ec514

Add CheckpointManager to keep avg checkpoint weights in memory to reduce disk read when averaging + various checkpoint refactoring

Summary: Pull Request resolved: https://github.com/pytorch/translate/pull/315

Reviewed By: akinh

Differential Revision: D13510446

fbshipit-source-id: 22a6594af9253130a93e638285a47183a974e0de

stitch preprocessing pipeline

Summary:
1. add call to binarization to complete preprocessing pipeline
2. add ability to sp…
yzpang pushed a commit to yzpang/gold-off-policy-text-gen-iclr21 that referenced this pull request Feb 19, 2021
Summary:
Pull Request resolved: pytorch/translate#283

Pull Request resolved: facebookresearch/fairseq#428

Differential Revision: D13564190

Pulled By: myleott

fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5
yfyeung pushed a commit to yfyeung/fairseq that referenced this pull request Dec 6, 2023
…ix Chinese chars and English BPE) (facebookresearch#428)

* add pruned transducer stateless5 recipe for tal_csasr

* do some changes for merging

* change for conformer.py

* add wer and cer for Chinese and English respectively

* fix an error for conformer.py