The Natural Language Decathlon: A Multitask Challenge for NLP
Clone or download

decaNLP Logo

Build Status

The Natural Language Decathlon is a multitask challenge that spans ten tasks: question answering (SQuAD), machine translation (IWSLT), summarization (CNN/DM), natural language inference (MNLI), sentiment analysis (SST), semantic role labeling(QA‑SRL), zero-shot relation extraction (QA‑ZRE), goal-oriented dialogue (WOZ, semantic parsing (WikiSQL), and commonsense reasoning (MWSC). Each task is cast as question answering, which makes it possible to use our new Multitask Question Answering Network (MQAN). This model jointly learns all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. For a more thorough introduction to decaNLP and the tasks, see the main website, our blog post, or the paper.

While the research direction associated with this repository focused on multitask learning, the framework itself is designed in a way that should make single-task training, transfer learning, and zero-shot evaluation simple. Similarly, the paper focused on multitask learning as a form of question answering, but this framework can be easily adapted for different approached to single-task or multitask learning.


MQAN 590.5 74.4 18.6 24.3 71.5 87.4 78.4 37.6 84.8 64.8 48.7
S2S 513.6 47.5 14.2 25.7 60.9 85.9 68.7 28.5 84.0 45.8 52.4

Getting Started

First, make sure you have docker and nvidia-docker installed. Then build the docker image:

cd dockerfiles && docker build -t decanlp . && cd -

You will also need to make a .data directory and move the examples for the Winograd Schemas into it:

mkdir -p .data/schema
cp local_data/schema.txt .data/schema/

You can run a command inside the docker image using

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) decanlp bash -c "COMMAND"


For example, to train a Multitask Question Answering Network (MQAN) on the Stanford Question Answering Dataset (SQuAD):

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) decanlp bash -c "python /decaNLP/ --train_tasks squad --gpu DEVICE_ID"

To multitask with the fully joint, round-robin training described in the paper, you can add multiple tasks:

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) decanlp bash -c "python /decaNLP/ --train_tasks squad --train_iterations 1 --gpu DEVICE_ID"

To train on the entire Natural Language Decathlon:

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) decanlp bash -c "python /decaNLP/ --train_tasks squad cnn_dailymail sst srl zre woz.en wikisql schema --train_iterations 1 --gpu DEVICE_ID"

To pretrain on n_jump_start=1 tasks for jump_start=75000 iterations before switching to round-robin sampling of all tasks in the Natural Language Decathlon:

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) decanlp bash -c "python /decaNLP/ --n_jump_start 1 --jump_start 75000 --train_tasks squad cnn_dailymail sst srl zre woz.en wikisql schema --train_iterations 1 --gpu DEVICE_ID"

This jump starting (or pretraining) on a subset of tasks can be done for any set of tasks, not only the entirety of decaNLP.


If you would like to make use of tensorboard, run (typically in a tmux pane or equivalent):

docker run -it --rm -p -v `pwd`:/decaNLP/ decanlp bash -c "tensorboard --logdir /decaNLP/results"

If you are running the server on a remote machine, you can run the following on your local machine to forward to http://localhost:6006/:

ssh -4 -N -f -L 6006: YOUR_REMOTE_IP

If you are having trouble with the specified port on either machine, run lsof -if:6006 and kill the process if it is unnecessary. Otherwise, try changing the port numbers in the commands above. The first port number is the port the local machine tries to bind to, and and the second port is the one exposed by the remote machine (or docker container).

Notes on Training

  • On a single NVIDIA Volta GPU, the code should take about 3 days to complete 500k iterations. These should be sufficient to approximately reproduce the experiments in the paper. Training for about 7 days should be enough to fully replicate those scores, which should be only a few points higher than what is achieved by 500k iterations.
  • The model can be resumed using stored checkpoints using --load <PATH_TO_CHECKPOINT> and --resume. By default, models are stored every --save_every iterations in the results/ folder tree.
  • During training, validation can be slow! Especially when computing ROUGE scores. Use the --val_every flag to change the frequency of validation.
  • If you run out of GPU memory, reduce --train_batch_tokens and --val_batch_size.
  • If you run out of CPU memory, make sure that you are running the most recent version of the code that interns strings; if you are still running out of CPU memory, post an issue with the command you ran and your peak memory usage.
  • The first time you run, the code will download and cache all considered datasets. Please be advised that this might take a while, especially for some of the larger datasets.

Notes on Cached Data

  • In order to make data loading much quicker for repeated experiments, datasets are cached using code in text/torchtext/datasets/
  • If there is an update to this repository that touches any files in text/, then it might have changed the way a dataset is cached. If this is the case, then you'll need to delete all relevant cached files or you will not see the changes.
  • Paths to cached files should be printed out when a dataset is loaded, either in training or in prediction. Search the text logged to stdout for Loading cached data from or Caching data to in order to locate the relevant path names for data caches.


You can evaluate a model for a specific task with EVALUATION_TYPE as validation or test:

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) decanlp bash -c "python /decaNLP/ --evaluate EVALUATION_TYPE --path PATH_TO_CHECKPOINT_DIRECTORY --gpu DEVICE_ID --tasks squad"

or evaluate on the entire decathlon by removing any task specification:

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) decanlp bash -c "python /decaNLP/ --evaluate EVALUATION_TYPE --path PATH_TO_CHECKPOINT_DIRECTORY --gpu DEVICE_ID"

For test performance, please use the original SQuAD, MultiNLI, and WikiSQL evaluation systems. For WikiSQL, there is a detailed walk-through of how to get test numbers in the section of this document concerning pretrained models.

Pretrained Models

This model is the best MQAN trained on decaNLP so far. It was trained first on SQuAD and then on all of decaNLP. You can obtain this model and run it on the validation sets with the following.

tar -xvzf mqan_decanlp_qa_first.tar.gz
nvidia-docker run -it --rm -v `pwd`:/decaNLP/  decanlp bash -c "python /decaNLP/ --evaluate validation --path /decaNLP/mqan_decanlp_qa_first --checkpoint_name model.pth --gpu 0"

This model is the best MQAN trained on WikiSQL alone, which established a new state-of-the-art performance by several points on that task: 73.2 / 75.4 / 81.4 (ordered test logical form accuracy, unordered test logical form accuracy, test execution accuracy).

tar -xvzf mqan_wikisql.tar.gz
nvidia-docker run -it --rm -v `pwd`:/decaNLP/  decanlp bash -c "python /decaNLP/ --evaluate validation --path /decaNLP/mqan_wikisql --checkpoint_name model.pth --gpu 0 --tasks wikisql"
nvidia-docker run -it --rm -v `pwd`:/decaNLP/  decanlp bash -c "python /decaNLP/ --evaluate test --path /decaNLP/mqan_wikisql --checkpoint_name model.pth --gpu 0 --tasks wikisql"
docker run -it --rm -v `pwd`:/decaNLP/  decanlp bash -c "python /decaNLP/ /decaNLP/.data/ /decaNLP/mqan_wikisql/model/validation/wikisql.txt /decaNLP/mqan_wikisql/model/validation/wikisql.ids.txt /decaNLP/mqan_wikisql/model/validation/wikisql_logical_forms.jsonl valid"
docker run -it --rm -v `pwd`:/decaNLP/  decanlp bash -c "python /decaNLP/ /decaNLP/.data/ /decaNLP/mqan_wikisql/model/test/wikisql.txt /decaNLP/mqan_wikisql/model/test/wikisql.ids.txt /decaNLP/mqan_wikisql/model/test/wikisql_logical_forms.jsonl test"
git clone for ssh
docker run -it --rm -v `pwd`:/decaNLP/  decanlp bash -c "python /decaNLP/WikiSQL/ /decaNLP/.data/wikisql/data/dev.jsonl /decaNLP/.data/wikisql/data/dev.db /decaNLP/mqan_wikisql/model/validation/wikisql_logical_forms.jsonl" # assumes that you have data stored in .data
docker run -it --rm -v `pwd`:/decaNLP/  decanlp bash -c "python /decaNLP/WikiSQL/ /decaNLP/.data/wikisql/data/test.jsonl /decaNLP/.data/wikisql/data/test.db /decaNLP/mqan_wikisql/model/test/wikisql_logical_forms.jsonl" # assumes that you have data stored in .data

Inference on a Custom Dataset

Using a pretrained model or a model you have trained yourself, you can run on new, custom datasets easily by following the instructions below. In this example, we use the checkpoint for the best MQAN trained on the entirety of decaNLP (see the section on Pretrained Models to see how to get this checkpoint) to run on my_custom_dataset.

mkdir -p .data/my_custom_dataset/
touch .data/my_custom_dataset/val.jsonl
echo '{"context": "The answer is answer.", "question": "What is the answer?", "answer": "answer"}' >> .data/my_custom_dataset/val.jsonl 
# TODO add your own examples line by line to val.jsonl in the form of a JSON dictionary, as demonstrated above.
# Make sure to delete the first line if you don't want the demonstrated example.
nvidia-docker run -it --rm -v `pwd`:/decaNLP/  decanlp bash -c "python /decaNLP/ --evaluate valid --path /decaNLP/mqan_decanlp_qa_first --checkpoint_name model.pth --gpu 0 --tasks my_custom_dataset"

You should get output that ends with something like this:

**  /decaNLP/mqan_decanlp_qa_first/model/valid/my_custom_dataset.txt  already exists -- this is where predictions are stored **
**  /decaNLP/mqan_decanlp_qa_first/model/valid/  already exists -- this is where ground truth answers are stored **
**  /decaNLP/mqan_decanlp_qa_first/model/valid/my_custom_dataset.results.txt  already exists -- this is where metrics are stored **

{'em': 0.0, 'nf1': 100.0, 'nem': 100.0}
Prediction: the answer
Answer: answer

From this output, you can see where predictions are stored along with ground truth outputs and metrics. If you want to rerun using this model checkpoint on this particular dataset, you'll need to pass the --overwrite_predictions argument to If you do not want predictions and answers printed to stdout, then pass the --silent argument to

The metrics dictionary should have printed something like {'em': 0.0, 'nf1': 100.0, 'nem': 100.0}. Here em stands for exact match. This is the percentage of predictions that had every token match the ground truth answer exactly. The normalized version, nem, lowercases and strips punctuation -- all of our models are trained on lowercased data, so nem is a more accurate representation of performance than em for our models. For tasks that are typically treated as classification problems, these exact match scores should correspond to accuracy. nf1 is a normalized (lowercased; punctuation stripped) F1 score over the predicted and ground truth sequences. If you would like to add additional metrics that are already implemented you can try adding --bleu (the typical metric for machine translation) and --rouge (the typical metric for summarization). Other metrics can be implemented following the patterns in


If you use this in your work, please cite The Natural Language Decathlon: Multitask Learning as Question Answering.

  title={The Natural Language Decathlon: Multitask Learning as Question Answering},
  author={Bryan McCann and Nitish Shirish Keskar and Caiming Xiong and Richard Socher},
  journal={arXiv preprint arXiv:1806.08730},


Contact: and