Single-hop Reading Comprehension Model

This code is for the following paper:

Sewon Min*, Eric Wallace*, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, Luke Zettlemoyer. Compositional Questions Do Not Necessitate Multi-hop Reasoning In: Proceedings of ACL (short). Florence, Italy. 2019.

@inproceedings{ min2019compositional,
    title = { Compositional Questions Do Not Necessitate Multi-hop Reasoning },
    author = { Min, Sewon and Wallace, Eric and Singh, Sameer and Gardner, Matt and Hajishirzi, Hannaneh and Zettlemoyer, Luke },
    booktitle = { ACL },
    year = { 2019 }
}

This is a general-purpose reading comprehension model based on BERT, which takes a set of paragraphs as an input but is incapable of cross-paragraph reasoning.
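One way to picture "a set of paragraphs without cross-paragraph reasoning" is a reader that scores every paragraph on its own and then keeps the answer from the paragraph it considers most likely to contain one. The sketch below illustrates that idea only. The `ParagraphOutput` class, the `select_answer` function, and the convention that a lower no-answer score means more confidence are assumptions made for this example, not the repository's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParagraphOutput:
    """Hypothetical output of running the reader on ONE paragraph."""
    answer: str             # best answer found in this paragraph alone
    no_answer_score: float  # assumed: higher means "no answer in this paragraph"


def select_answer(outputs: List[ParagraphOutput]) -> ParagraphOutput:
    """Keep the output of the paragraph with the lowest no-answer score.

    Each paragraph was scored independently, so no information is
    shared across paragraphs at any point.
    """
    if not outputs:
        raise ValueError("need at least one paragraph output")
    return min(outputs, key=lambda o: o.no_answer_score)


# Toy scores, invented for illustration.
outputs = [
    ParagraphOutput("Paris", no_answer_score=2.5),
    ParagraphOutput("Florence", no_answer_score=-1.0),
    ParagraphOutput("Rome", no_answer_score=0.3),
]
best = select_answer(outputs)
print(best.answer)
```

The paragraph that `select_answer` returns corresponds to what the prediction files call the evidence.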


This code is primarily intended for HotpotQA. However, you can use it for any task whose input is a question plus one or more paragraphs and whose output is the answer to the question (a span from a paragraph, yes, or no).

For any questions, please contact Sewon Min and Eric Wallace.


This code is based on a PyTorch version of Google's pretrained BERT model, from an earlier version of Hugging Face's PyTorch BERT.


  • Python 3.5+
  • PyTorch 1.1.0
  • TensorFlow 1.3.0 (only for tokenization)


  1. Download Pretrained BERT and Convert to PyTorch

There are multiple BERT models: BERT-Base Uncased, BERT-Large Uncased, BERT-Base Cased and BERT-Large Cased. This code is tested on BERT-Base Uncased. Using the larger model may improve results.

First, download the pre-trained BERT TensorFlow model from here. This is the uncased base version, converted from Google's release. Unzip the file and rename the directory to bert.

  2. Convert HotpotQA into SQuAD style

To use the same model in all the different datasets, we convert datasets into the SQuAD format. To run on HotpotQA, create a directory and download the training and validation sets into the directory. Then run:

python --data_dir PATH_TO_DATA_DIR --task hotpot-all

You can try different task settings using the --task flag (see the code for details).
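To make the conversion concrete, here is a minimal sketch of turning HotpotQA-style examples into a SQuAD-style dictionary. It is not the repository's conversion script: the function name `hotpot_to_squad`, the use of the first string match as the answer position, and the `answer_start` value of -1 for yes/no answers are all assumptions made for illustration. The field names follow the public HotpotQA and SQuAD JSON layouts.

```python
from typing import Any, Dict, List


def hotpot_to_squad(examples: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Convert HotpotQA-style examples into a SQuAD-style dictionary.

    Every HotpotQA paragraph becomes its own SQuAD paragraph that
    carries the same question, so a reader sees one paragraph at a time.
    """
    data = []
    for ex in examples:
        answer = ex["answer"]
        paragraphs = []
        for _title, sentences in ex["context"]:
            context = "".join(sentences)
            answers = []
            if answer in ("yes", "no"):
                # Assumed convention: yes/no answers have no span position.
                answers.append({"text": answer, "answer_start": -1})
            else:
                start = context.find(answer)  # first exact match only
                if start >= 0:
                    answers.append({"text": answer, "answer_start": start})
            paragraphs.append({
                "context": context,
                "qas": [{
                    "id": ex["_id"],
                    "question": ex["question"],
                    "answers": answers,  # empty if the answer is absent
                }],
            })
        data.append({"title": ex["_id"], "paragraphs": paragraphs})
    return {"data": data}


# Toy example in the HotpotQA layout (content invented for illustration).
example = {
    "_id": "q1",
    "question": "Which city is the capital of Tuscany?",
    "answer": "Florence",
    "context": [
        ["Florence", ["Florence is a city in Italy.", " It is the capital of Tuscany."]],
        ["Rome", ["Rome is the capital of Italy."]],
    ],
}
squad = hotpot_to_squad([example])
print(squad["data"][0]["paragraphs"][0]["qas"][0]["answers"])
```

Paragraphs that do not contain the answer string end up with an empty `answers` list, which is how a no-answer training signal could be derived in a setup like this.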


The main files used for training are:

  • main code for training and inference
  • preprocessing of the data.
  • actual pytorch model
  • to predict the span based on the logits from the model, and compute the F1 score
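The last item in the list covers two steps that are easy to sketch: choosing a span from start and end logits, and scoring it with token-level F1. The code below is an illustrative version, not the repository's implementation. The function names `best_span`, `normalize`, and `f1_score` are hypothetical, and the normalization follows the common SQuAD-style recipe (lowercase, strip punctuation, drop articles), which is assumed rather than confirmed for this code base.

```python
import re
import string
from collections import Counter
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_length: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token pair with the highest summed logit.

    Only spans with start <= end and at most max_answer_length tokens
    are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for i, start in enumerate(start_logits):
        last = min(i + max_answer_length, len(end_logits))
        for j in range(i, last):
            score = start + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


def normalize(text: str) -> List[str]:
    """Lowercase, remove punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    num_same = sum((Counter(pred) & Counter(gold)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy logits, invented for illustration.
tokens = ["the", "capital", "is", "Florence", "Italy"]
start, end = best_span([0.1, 0.2, 0.0, 3.0, 0.5], [0.0, 0.1, 0.2, 2.5, 1.0])
prediction = " ".join(tokens[start:end + 1])
print(prediction, f1_score(prediction, "Florence"))
```

The exhaustive double loop is quadratic in sequence length but bounded by `max_answer_length`, which keeps it cheap for typical inputs.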

To train the model, run:

python --do_train --output_dir out/hotpot \
          --train_file PATH_TO_TRAIN_FILE \
          --predict_file PATH_TO_DEV_FILE \
          --init_checkpoint PATH_TO_PRETRAINED_BERT \
          --bert_config_file PATH_TO_BERT_CONFIG_FILE \
          --vocab_file PATH_TO_BERT_VOCAB_FILE

Make sure PATH_TO_TRAIN_FILE and PATH_TO_DEV_FILE point to the output from the script (usually data/hotpot-all/train.json for the training file). The code stores the best model in out/hotpot/.

To run inference:

python --do_predict --output_dir out/hotpot \
        --predict_file PATH_TO_DEV_FILE \
        --init_checkpoint out/hotpot/ \
        --predict_batch_size 32 --max_seq_length 300 --prefix dev_

This stores dev_predictions.json and dev_nbest_predictions.json in the out/hotpot directory. The --prefix value is prepended to the names of the output files.

PREFIX_predictions.json is a dictionary whose keys are example ids and whose values contain the model's prediction and the ground-truth answer. PREFIX_nbest_predictions.json has the same keys, but each value holds the model's top-k predictions together with their logit values, probability values, the no-answer value, and the evidence (the paragraph that the answer comes from). You can also adjust which values to store in

Other potentially useful flags:

  • --train_batch_size, --predict_batch_size: batch size for training and evaluation.
  • --n_best_size: number of top k answers to store during inference. (default: 3)
  • --max_answer_length: maximum length of the answer during inference. (default: 30)
  • --eval_period: interval to test on the dev data during training. Please adjust this value depending on your batch_size. (default: 1000)
  • --debug: run on a subset of the data, which is useful for checking that the code runs without errors before starting a full run.

When running with --do_predict, you can use an ensemble of models by passing several comma-separated model paths to --init_checkpoint, e.g., --init_checkpoint out/hotpot1/,out/hotpot2/. The code then runs inference with every model and votes over their predictions to produce the final output.
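As a rough picture of what voting over several models can look like, the sketch below applies a plain majority vote to per-model answer strings. The exact voting scheme used by the code is not described here, so treat this as one possible scheme. The function name `vote` and the input layout (one dictionary from example id to answer string per model) are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def vote(predictions_per_model: List[Dict[str, str]]) -> Dict[str, str]:
    """Majority vote over answer strings, one example id at a time.

    Example ids are taken from the first model. Models that lack an id
    simply do not vote on it. Tie-breaking is left to
    Counter.most_common.
    """
    if not predictions_per_model:
        raise ValueError("need predictions from at least one model")
    final = {}
    for example_id in predictions_per_model[0]:
        answers = [preds[example_id]
                   for preds in predictions_per_model
                   if example_id in preds]
        final[example_id] = Counter(answers).most_common(1)[0][0]
    return final


# Toy predictions from three hypothetical checkpoints.
model_a = {"q1": "Florence", "q2": "yes"}
model_b = {"q1": "Florence", "q2": "no"}
model_c = {"q1": "Rome", "q2": "no"}
ensemble = vote([model_a, model_b, model_c])
print(ensemble)
```

A score-based alternative would sum or average per-answer probabilities across models instead of counting answer strings.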

Similarly, you can combine two or more datasets for training and inference by passing comma-separated files. For example, to train on SQuAD and HotpotQA jointly, use --train_file SQUAD_TRAIN_FILE,HOTPOT_TRAIN_FILE --predict_file SQUAD_DEV_FILE,HOTPOT_DEV_FILE. The code then trains and tests the model on the combined data.


An original implementation of ACL 2019, "Compositional Questions Do Not Necessitate Multi-hop Reasoning" (Single-hop Reading Comprehension Model based on BERT)


