GitHub - ruc-datalab/PASTA: This repository contains source code for the PASTA model, a pre-trained language model for table-based fact verification.

Introduction

This repository contains source code for the EMNLP'2022 paper "Table-Operations Aware Fact Verification via Sentence-Table Cloze Pre-training". In this paper we introduce PASTA, a novel state-of-the-art framework for table-based fact verification via pre-training with synthesized sentence–table cloze questions. In particular, we design six types of common sentence–table cloze tasks, including Filter, Aggregation, Superlative, Comparative, Ordinal, and Unique, based on which we synthesize a large corpus consisting of 1.2 million sentence–table pairs from WikiTables.

Code Structure

|-- data_cache # save the processed data cache for tabfact/semtabfacts
|-- datasets # save the official datasets of tabfact/semtabfacts
|-- experimental_results # save the evaluation results (acc|f1_micro) of tabfact/semtabfacts
|-- pretrained_models # the pre-trained model to be adapted to the downstream tasks
|-- save_checkpoints # save the fine-tuned checkpoints of tabfact/semtabfacts
|-- src # code for fine-tune/pre-train
    |-- scripts # configs for fine-tune/pre-train
        |-- train_semtabfacts.json # you need to modify the path
        |-- train_tabfact.json # you need to modify the path
        |-- pretrain.json # you need to modify the path
    |-- utils
        |-- args.py # Definition of different parameters
        |-- dataset.py # the class to load and preprocess the dataset
        |-- linearize.py # the class to flatten a table into a linearized form; adapted from https://github.com/microsoft/Table-Pretraining/blob/0b87efa253232d4aafa52c1f4725cb4f6e027877/tapex/processor/table_linearize.py.
        |-- entitylink.py # the class to identify entity links between the sentence and the table; adapted from https://github.com/wenhuchen/Table-Fact-Checking/blob/5ea13b8f6faf11557eec728c5f132534e7a22bf7/code/preprocess_data.py.
        |-- pasta_mlm_model.py # the class to define the operation-aware pre-training model
        |-- TabFV_model.py # the class to define the table-based fact verification model
    |-- run_finetune.py # the script to fine-tune the table-based fact verification model
    |-- run_pretrain.py # the script to pre-train the PASTA model from scratch

Requirements

Before running the code, please make sure your Python version is above 3.8. Then install the necessary packages by:

pip install -r requirements.txt

Datasets Preparation

TabFact

Please download the TabFact dataset from the official GitHub repository and put it under the folder PASTA/datasets.

git clone git@github.com:wenhuchen/Table-Fact-Checking.git
mv Table-Fact-Checking tabfact
mv tabfact PASTA/datasets

SEM-TAB-FACTS

Please download the sem-tab-facts dataset from the official website:

Then refer to this repository for standardizing the table header, or you can directly download the dataset we have processed.

Finally, put the processed dataset under the folder PASTA/datasets, and name it semtabfacts.
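Before launching a run, it can help to confirm that the folders described above are in place. The following is a minimal sketch, not part of the repository: the helper name missing_dirs and the constant EXPECTED_DIRS are hypothetical, and only the folder names themselves come from this README.

```python
from pathlib import Path

# Folder names are taken from this README; the check itself is only an
# illustration and is not shipped with the repository.
EXPECTED_DIRS = (
    "datasets/tabfact",
    "datasets/semtabfacts",
    "pretrained_models",
)


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # "PASTA" is assumed to be the path of the cloned repository.
    for rel in missing_dirs("PASTA"):
        print(f"missing: PASTA/{rel}")
```

An empty result means every listed folder exists; the check says nothing about whether the files inside are complete.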

Quick Start

Download PASTA

Download the PASTA model and put it under the folder pretrained_models/.

Run TabFact

Fine-tune on the TabFact dataset with the following command.

python src/run_finetune.py src/scripts/train_tabfact.json

Note that you need to modify the paths in the .json file to your own paths.
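If the paths in a config all share a common prefix, editing them by hand can be replaced by a small script. The sketch below is an assumption-laden illustration: it does not rely on any particular key names in the .json files (which this README does not list), and the helpers rebase_paths and rebase_config_file are hypothetical names, not part of the repository.

```python
import json


def rebase_paths(config, old_prefix, new_prefix):
    """Return a copy of config where every string value starting with
    old_prefix has that prefix replaced by new_prefix.

    Nested dicts and lists are walked recursively; other values are kept.
    """
    if isinstance(config, dict):
        return {k: rebase_paths(v, old_prefix, new_prefix) for k, v in config.items()}
    if isinstance(config, list):
        return [rebase_paths(v, old_prefix, new_prefix) for v in config]
    if isinstance(config, str) and config.startswith(old_prefix):
        return new_prefix + config[len(old_prefix):]
    return config


def rebase_config_file(src_path, dst_path, old_prefix, new_prefix):
    """Read a JSON config, rebase its path-like values, and write a new file."""
    with open(src_path, encoding="utf-8") as f:
        config = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(rebase_paths(config, old_prefix, new_prefix), f, indent=2)
```

The rewritten file can then be passed to run_finetune.py in place of the original config. It is worth inspecting the output once, since any string that merely happens to start with the old prefix is rewritten as well.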

Run SEM-TAB-FACTS

Following LKA, we initialize the training on SEM-TAB-FACTS from the model trained on TabFact. Therefore, you need to train on the TabFact dataset first to obtain a checkpoint, or you can directly download the checkpoint we provide and put it under /save_checkpoints. Then, fine-tune on the SEM-TAB-FACTS dataset with the following command.

python src/run_finetune.py src/scripts/train_semtabfacts.json

Note that you need to modify the paths in the .json file to your own paths.

Pre-training From Scratch

Pre-training Corpus

Download the pre-training corpus, which consists of 1.2 million sentence-table cloze samples. Here is an example.

Input: Operation-aware Masked Sentence + Linearized Table

# Mask all of the tokens corresponding to a span at once.
[MASK] [MASK] [MASK] has the highest attendance of all date [Header] date | visitor | score | home | leading scorer | attendance | record [Row] 1 april 2008 | knicks | 115 - 119 | bucks | quentin richardson (22) | 13579 | 20 - 54 [Row] ……

Output: Answer

8 april 2008
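The example above can be reproduced with a short sketch. This is not the repository's linearize.py or its masking code: the function names linearize_table and mask_span are hypothetical, and splitting on whitespace is an assumption made for illustration (the actual pipeline may mask at the sub-word level). Only the header and the first row shown in the example are used, since the remaining rows are elided there.

```python
def linearize_table(header, rows):
    """Flatten a table into '[Header] h1 | h2 [Row] c1 | c2 [Row] ...'."""
    parts = ["[Header] " + " | ".join(header)]
    parts.extend("[Row] " + " | ".join(row) for row in rows)
    return " ".join(parts)


def mask_span(sentence, answer, mask_token="[MASK]"):
    """Replace the first occurrence of answer in sentence with one mask
    token per whitespace-separated token, masking the whole span at once."""
    sent_tokens = sentence.split()
    ans_tokens = answer.split()
    n = len(ans_tokens)
    if n == 0:
        raise ValueError("answer must contain at least one token")
    for i in range(len(sent_tokens) - n + 1):
        if sent_tokens[i:i + n] == ans_tokens:
            return " ".join(sent_tokens[:i] + [mask_token] * n + sent_tokens[i + n:])
    raise ValueError("answer span not found in sentence")


# Header and first row copied from the example above; later rows are elided there.
header = ["date", "visitor", "score", "home", "leading scorer", "attendance", "record"]
first_row = ["1 april 2008", "knicks", "115 - 119", "bucks",
             "quentin richardson (22)", "13579", "20 - 54"]

masked = mask_span("8 april 2008 has the highest attendance of all date", "8 april 2008")
model_input = masked + " " + linearize_table(header, [first_row])
print(model_input)
```

The target for this input is the masked span itself, "8 april 2008", so the model has to apply a Superlative operation over the attendance column to recover it.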

Put the unzipped dataset under the folder PASTA/datasets, and name it pasta.

Run Pre-training

Pre-train the PASTA model with the following command.

python src/run_pretrain.py src/scripts/pretrain.json

Note that you need to modify the paths in the .json file to your own paths.

Reference

@article{DBLP:journals/corr/abs-2211-02816,
  author    = {Zihui Gu and
               Ju Fan and
               Nan Tang and
               Preslav Nakov and
               Xiaoman Zhao and
               Xiaoyong Du},
  title     = {{PASTA:} Table-Operations Aware Fact Verification via Sentence-Table
               Cloze Pre-training},
  journal   = {CoRR},
  volume    = {abs/2211.02816},
  year      = {2022},
  url       = {https://doi.org/10.48550/arXiv.2211.02816},
  doi       = {10.48550/arXiv.2211.02816},
  eprinttype = {arXiv},
  eprint    = {2211.02816},
  timestamp = {Wed, 09 Nov 2022 17:33:26 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2211-02816.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
