This repo provides the source code & data of our paper GreaseLM: Graph REASoning Enhanced Language Models for Question Answering (ICLR 2022 spotlight). If you use any of our code, processed data or pretrained models, please cite:
@inproceedings{zhang2021greaselm,
title={GreaseLM: Graph REASoning Enhanced Language Models},
author={Zhang, Xikun and Bosselut, Antoine and Yasunaga, Michihiro and Ren, Hongyu and Liang, Percy and Manning, Christopher D and Leskovec, Jure},
booktitle={International Conference on Learning Representations},
year={2022}
}
- Python == 3.8
- PyTorch == 1.8.0
- transformers == 3.4.0
- torch-geometric == 1.7.0
Run the following commands to create a conda environment (assuming CUDA 10.1):
conda create -y -n greaselm python=3.8
conda activate greaselm
pip install numpy==1.18.3 tqdm
pip install torch==1.8.0+cu101 torchvision -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==3.4.0 nltk spacy
pip install wandb
conda install -y -c conda-forge tensorboardx
conda install -y -c conda-forge tensorboard
# for torch-geometric
pip install torch-scatter==2.0.7 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
pip install torch-cluster==1.5.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
pip install torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
pip install torch-spline-conv==1.2.1 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
pip install torch-geometric==1.7.0 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
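After installation, a quick sanity check can confirm that the installed packages match the versions pinned above. The following is a minimal sketch and not part of this repo: the helper name check_versions is hypothetical, and it assumes the packages are registered under the distribution names torch, transformers, and torch-geometric.

```python
from importlib import metadata

# Pins taken from the requirements list above.
EXPECTED = {
    "torch": "1.8.0",
    "transformers": "3.4.0",
    "torch-geometric": "1.7.0",
}


def check_versions(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package.

    check_versions is a hypothetical helper, not something this repo ships.
    """
    problems = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = (want, None)
            continue
        # CUDA wheels carry a local suffix such as "+cu101"; drop it before comparing.
        if found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found or 'not installed'}")
```

Running the script inside the greaselm environment prints one line per mismatch and nothing if all three pins are satisfied.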
Preprocessing the data yourself may take a long time, so if you want to directly download preprocessed data, please jump to the next subsection.
Download the raw ConceptNet, CommonsenseQA, and OpenBookQA data by using
./download_raw_data.sh
You can preprocess the raw data by running
CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes>
You can specify the GPU to use by setting CUDA_VISIBLE_DEVICES=... at the beginning of the command. The script will:
- Set up ConceptNet (e.g., extract English relations from ConceptNet, merge the original 42 relation types into 17 types)
- Convert the QA datasets into .jsonl files (e.g., stored in data/csqa/statement/)
- Identify all the mentioned concepts in the questions and answers
- Extract subgraphs for each q-a pair
The script to download and preprocess the MedQA-USMLE data and the biomedical knowledge graph based on Disease Database and DrugBank is provided in utils_biomed/.
For your convenience, if you don't want to preprocess the data yourself, you can download all the preprocessed data here. Download them into the top-level directory of this repo and unzip them. Move the medqa_usmle and ddb folders into the data/ directory.
The resulting file structure should look like this:
.
├── README.md
├── data/
    ├── cpnet/                 (preprocessed ConceptNet)
    ├── csqa/
        ├── train_rand_split.jsonl
        ├── dev_rand_split.jsonl
        ├── test_rand_split_no_answers.jsonl
        ├── statement/             (converted statements)
        ├── grounded/              (grounded entities)
        ├── graphs/                (extracted subgraphs)
        ├── ...
    ├── obqa/
    ├── medqa_usmle/
    └── ddb/
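Before training, it can help to confirm that the unzipped data matches the layout shown above. The following is a minimal sketch and not part of this repo: the helper name missing_paths is hypothetical, and the path list covers only the entries named in the tree (the contents of cpnet/, obqa/, medqa_usmle/, and ddb/ are not checked).

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = [
    "data/cpnet",
    "data/csqa/train_rand_split.jsonl",
    "data/csqa/dev_rand_split.jsonl",
    "data/csqa/test_rand_split_no_answers.jsonl",
    "data/csqa/statement",
    "data/csqa/grounded",
    "data/csqa/graphs",
    "data/obqa",
    "data/medqa_usmle",
    "data/ddb",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root.

    missing_paths is a hypothetical helper, not something this repo ships.
    """
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    # Run from the top-level directory of the repo.
    missing = missing_paths(".")
    if missing:
        print("Missing:")
        for p in missing:
            print(f"  {p}")
    else:
        print("All expected paths are present.")
```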
To train GreaseLM on CommonsenseQA, run
CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh csqa --data_dir data/
You can specify up to 2 GPUs by setting CUDA_VISIBLE_DEVICES=... at the beginning of the command.
Similarly, to train GreaseLM on OpenBookQA, run
CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh obqa --data_dir data/
To train GreaseLM on MedQA-USMLE, run
CUDA_VISIBLE_DEVICES=0 ./run_greaselm__medqa_usmle.sh
You can download a pretrained GreaseLM model on CommonsenseQA here, which achieves an IH-dev acc. of 79.0 and an IH-test acc. of 74.0.
You can also download a pretrained GreaseLM model on OpenBookQA here, which achieves a test acc. of 84.8.
You can also download a pretrained GreaseLM model on MedQA-USMLE here, which achieves a test acc. of 38.5.
To evaluate a pretrained GreaseLM model checkpoint on CommonsenseQA, run
CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh csqa --data_dir data/ --load_model_path /path/to/checkpoint
Again, you can specify up to 2 GPUs by setting CUDA_VISIBLE_DEVICES=... at the beginning of the command.
Similarly, to evaluate a pretrained GreaseLM model checkpoint on OpenBookQA, run
CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh obqa --data_dir data/ --load_model_path /path/to/checkpoint
To evaluate a pretrained GreaseLM model checkpoint on MedQA-USMLE, run
INHERIT_BERT=1 CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh medqa_usmle --data_dir data/ --load_model_path /path/to/checkpoint
To use your own dataset:
- Convert your dataset to {train,dev,test}.statement.jsonl in .jsonl format (see data/csqa/statement/train.statement.jsonl)
- Create a directory in data/{yourdataset}/ to store the .jsonl files
- Modify preprocess.py and perform subgraph extraction for your data
- Modify utils/parser_utils.py to support your own dataset
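Since the first step points to data/csqa/statement/train.statement.jsonl as the reference format, one way to check a converted file is to compare its top-level keys against that reference. The following is a minimal sketch and not part of this repo: the helper names first_record_keys and same_schema are hypothetical, and the check looks only at the top-level keys of the first record, so nested fields still need a manual look.

```python
import json


def first_record_keys(path):
    """Return the sorted top-level keys of the first non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                return sorted(json.loads(line).keys())
    return []


def same_schema(reference_path, candidate_path):
    """True if both files expose the same top-level keys in their first record.

    Both helpers are hypothetical and only illustrate the comparison.
    """
    return first_record_keys(reference_path) == first_record_keys(candidate_path)
```

For example, same_schema("data/csqa/statement/train.statement.jsonl", "data/{yourdataset}/statement/train.statement.jsonl") returns False when the converted file is missing a key that the CommonsenseQA statements carry. The second path here is an assumption that mirrors the csqa layout.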
This repo is built upon the following work:
QA-GNN: Question Answering using Language Models and Knowledge Graphs
https://github.com/michiyasunaga/qagnn
Many thanks to the authors and developers!