This is the official dataset repository for SciReviewGen: A Large-scale Dataset for Automatic Literature Review Generation, published in Findings of ACL 2023.
- split_survey_df: The split version of SciReviewGen, which aims to generate literature review chapters
- original_survey_df: The original version of SciReviewGen, which aims to generate the entire text of literature reviews
- summarization_csv: CSV files suitable for the summarization task. You can use them with HuggingFace's official example scripts
- Row:
- literature review chapter or the entire text of literature review
- Column:
- paper_id: paper_id used in S2ORC
- title: title of the literature review
- abstract: abstract of the literature review
- section: chapter title
- text: body text of literature review chapter or literature review paper
- n_bibs: number of the cited papers that can be used as inputs
- n_nonbibs: number of the cited papers that cannot be used as inputs
- bib_titles: titles of the cited papers
- bib_abstracts: abstracts of the cited papers
- bib_citing_sentences: citing sentences that cite the cited papers
- split: train/val/test split
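As a rough illustration of how the columns above relate to each other, the sketch below works on rows represented as plain dictionaries keyed by the column names listed above. The helper names (`split_sizes`, `usable_citation_ratio`) are hypothetical and not part of this repository, and the sketch assumes that `n_bibs` and `n_nonbibs` are integers and that `split` holds the split name as a string.

```python
from collections import Counter


def split_sizes(records):
    """Count how many rows fall into each train/val/test split."""
    return Counter(record["split"] for record in records)


def usable_citation_ratio(record):
    """Fraction of a row's cited papers that can be used as inputs."""
    total = record["n_bibs"] + record["n_nonbibs"]
    return record["n_bibs"] / total if total else 0.0


# Toy rows for illustration only; these values are made up, not dataset content.
toy_records = [
    {"paper_id": "0", "split": "train", "n_bibs": 3, "n_nonbibs": 1},
    {"paper_id": "1", "split": "train", "n_bibs": 0, "n_nonbibs": 0},
    {"paper_id": "2", "split": "test", "n_bibs": 2, "n_nonbibs": 2},
]

sizes = split_sizes(toy_records)
ratios = [usable_citation_ratio(r) for r in toy_records]
```

With pandas installed, real rows could presumably be obtained with `pandas.read_pickle(...)` on split_survey_df.pkl followed by `.to_dict("records")`, assuming the pickle loads as a DataFrame.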
- Row:
- literature review chapter
- Column:
- reference:
literature review title <s> chapter title <s> abstract of cited paper 1 <s> BIB001 </s> literature review title <s> chapter title <s> abstract of cited paper 2 <s> BIB002 </s> ...
- target: literature review chapter
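To make the reference layout above concrete, here is a minimal sketch that assembles such a string from a review title, a chapter title, and a list of cited-paper abstracts. The function name `build_reference` is hypothetical, and the exact whitespace and the zero-padded, 1-based `BIB` numbering are assumptions read off the example line; make_summarization_csv.py remains the authoritative implementation.

```python
def build_reference(review_title, chapter_title, bib_abstracts):
    """Concatenate one segment per cited paper in the layout shown above."""
    segments = []
    for index, abstract in enumerate(bib_abstracts, start=1):
        # Assumption: identifiers are "BIB" plus a zero-padded, 1-based index.
        bib_id = f"BIB{index:03d}"
        segments.append(
            f"{review_title} <s> {chapter_title} <s> {abstract} <s> {bib_id} </s>"
        )
    return " ".join(segments)


# Toy inputs for illustration only.
example = build_reference("Survey T", "Chapter C", ["Abstract A.", "Abstract B."])
```

Each cited paper gets its own segment that repeats the review and chapter titles, so a model that encodes segments independently still sees the query alongside every abstract.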
- reference:
- Python 3.9
- Run the following commands to clone the repository and install the required packages:
git clone https://github.com/tetsu9923/SciReviewGen.git
cd SciReviewGen
pip install -r requirements.txt
- Download S2ORC (we use the version released on 2020-07-05, which contains papers up to 2020-04-14)
- Run the following command:
python json_to_df.py \
-s2orc_path <Path to the S2ORC full dataset directory (Typically ".../s2orc/full/20200705v1/full")> \
-dataset_path <Path to the generated dataset> \
--field <Optional: the field of the literature reviews (mag_field_of_study in S2ORC, default="Computer Science")>
The metadata and PDF parses of the candidate literature reviews and their cited papers are stored in dataset_path as pandas DataFrames.
- Run the following command:
python make_section_df.py \
-dataset_path <Path to the generated dataset> \
--version <Optional: the version of SciReviewGen ("split" or "original", default="split")>
The SciReviewGen dataset (split_survey_df.pkl or original_survey_df.pkl) is stored in dataset_path as a pandas DataFrame. filtered_dict.pkl gives the list of literature reviews that remain after filtering by the SciBERT-based classifier (Section 3.2).
- Run the following command:
python make_summarization_csv.py \
-dataset_path <Path to the generated dataset>
The csv files for summarization (train.csv, val.csv, and test.csv) are stored in dataset_path.
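The generated files can be inspected with the standard library alone. The sketch below is one possible reader, assuming each CSV has a header row with the `reference` and `target` columns described earlier and is UTF-8 encoded; `read_pairs` is a hypothetical helper, not a function from this repository.

```python
import csv

# Concatenated abstracts can exceed the csv module's default per-field limit,
# so raise it before reading.
csv.field_size_limit(10**9)


def read_pairs(csv_path):
    """Return (reference, target) tuples from a summarization CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [(row["reference"], row["target"]) for row in reader]
```

For example, `read_pairs("<dataset_path>/train.csv")` would return the training pairs as a list, with `<dataset_path>` standing for the directory passed to the script.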
If you train QFiD on the generated CSV files, add the --for_qfid argument as below.
python make_summarization_csv.py \
-dataset_path <Path to the generated dataset> \
--for_qfid
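The three scripts above can also be chained from Python. The following is only a sketch: the script names and flags are copied from the commands above, while `build_commands` and `run_pipeline` are hypothetical helpers, and the sketch assumes it is run from the repository root with the required packages installed.

```python
import subprocess
import sys


def build_commands(s2orc_path, dataset_path, field=None, version=None, for_qfid=False):
    """Return the argument lists for the three scripts, in execution order."""
    json_to_df = [sys.executable, "json_to_df.py",
                  "-s2orc_path", s2orc_path, "-dataset_path", dataset_path]
    if field is not None:
        json_to_df += ["--field", field]

    make_section_df = [sys.executable, "make_section_df.py",
                       "-dataset_path", dataset_path]
    if version is not None:
        make_section_df += ["--version", version]

    make_csv = [sys.executable, "make_summarization_csv.py",
                "-dataset_path", dataset_path]
    if for_qfid:
        make_csv.append("--for_qfid")

    return [json_to_df, make_section_df, make_csv]


def run_pipeline(commands):
    """Run each step in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Leaving `field` and `version` as `None` omits the optional flags, so the scripts fall back to their own defaults ("Computer Science" and "split").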
We trained the SciBERT-based literature review classifier. The model weights are available here.
We propose Query-weighted Fusion-in-Decoder (QFiD), which explicitly considers the relevance of each input document to the queries.
You can train QFiD on the SciReviewGen CSV data (make sure that you passed the --for_qfid argument when executing make_summarization_csv.py).
- Modify qfid/train.sh (CUDA_VISIBLE_DEVICES, csv file path, output_dir, and num_train_epochs)
- Run the following command:
cd qfid
./train.sh
- Modify qfid/test.sh (CUDA_VISIBLE_DEVICES, csv file path, output_dir, and num_train_epochs). Set num_train_epochs to the total number of epochs you trained
- Run the following command:
./test.sh
- SciReviewGen is released under CC BY-NC 4.0. You can use SciReviewGen only for non-commercial purposes.
- SciReviewGen is built on S2ORC. Note that S2ORC is released under CC BY-NC 4.0, which allows users to copy and redistribute it only for non-commercial purposes.