TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables

This repository contains source code for the TaBERT model, a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. TaBERT is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and can be used as a drop-in replacement for a semantic parser's original encoder to compute representations for utterances and table schemas (columns).

Installation

First, install the conda environment tabert with supporting libraries.

bash scripts/setup_env.sh

Once the conda environment is created, install TaBERT using the following command:

conda activate tabert
pip install --editable .

Integration with HuggingFace's pytorch-transformers library is still a work in progress. While all the pre-trained models were developed with the older pytorch-pretrained-bert library, they are compatible with the latest version of transformers. The conda environment installs both libraries, and TaBERT uses pytorch-pretrained-bert by default. You can uninstall pytorch-pretrained-bert if you prefer to use TaBERT with the latest version of transformers.

Pre-trained Models

To be released.

Using a Pre-trained Model

To load a pre-trained model from a checkpoint file:

from table_bert import TableBertModel

model = TableBertModel.from_pretrained(
    'path/to/pretrained/model/checkpoint.bin',
)

To produce representations of natural language text and its associated table:

from table_bert import Table, Column

table = Table(
    id='List of countries by GDP (PPP)',
    header=[
        Column('Nation', 'text', sample_value='United States'),
        Column('Gross Domestic Product', 'real', sample_value='21,439,453')
    ],
    data=[
        ['United States', '21,439,453'],
        ['China', '27,308,857'],
        ['European Union', '22,774,165'],
    ]
).tokenize(model.tokenizer)

# To visualize table in an IPython notebook:
# display(table.to_data_frame(), detokenize=True)

context = 'show me countries ranked by GDP'

# model takes batched, tokenized inputs
context_encoding, column_encoding, info_dict = model.encode(
    contexts=[model.tokenizer.tokenize(context)],
    tables=[table]
)

In the returned tuple, context_encoding and column_encoding are PyTorch tensors representing utterances and table columns, respectively. info_dict contains useful meta information (e.g., context/table masks, the original input tensors to BERT) for downstream applications.

context_encoding.shape
>>> torch.Size([1, 7, 768])

column_encoding.shape
>>> torch.Size([1, 2, 768])
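
The table in the example above is written by hand. When rows come from a CSV file or a database dump instead, the column type and sample value have to be derived from the data. The following is a minimal sketch of one way to do that, not part of the TaBERT API: looks_numeric and infer_column_specs are hypothetical helper names, and the rule "a column is 'real' only if every non-empty cell parses as a number" is an assumption. It uses only the standard library, so it runs without a checkpoint.

```python
from typing import List, Sequence, Tuple


def looks_numeric(value: str) -> bool:
    """Return True if value parses as a float once thousands separators are removed."""
    try:
        float(value.replace(',', ''))
        return True
    except ValueError:
        return False


def infer_column_specs(
    names: Sequence[str], rows: Sequence[Sequence[str]]
) -> List[Tuple[str, str, str]]:
    """Return one (name, type, sample_value) triple per column.

    Assumed rule: a column is typed 'real' only if it has at least one
    non-empty cell and every non-empty cell looks numeric; otherwise 'text'.
    The sample value is the first non-empty cell, or '' if there is none.
    """
    specs = []
    for index, name in enumerate(names):
        cells = [row[index] for row in rows if row[index].strip()]
        is_real = bool(cells) and all(looks_numeric(cell) for cell in cells)
        column_type = 'real' if is_real else 'text'
        sample_value = cells[0] if cells else ''
        specs.append((name, column_type, sample_value))
    return specs


names = ['Nation', 'Gross Domestic Product']
rows = [
    ['United States', '21,439,453'],
    ['China', '27,308,857'],
    ['European Union', '22,774,165'],
]
specs = infer_column_specs(names, rows)
print(specs)
```

Each triple can then be passed on in the same form as the hand-written header above, i.e. Column(name, column_type, sample_value=sample_value).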

Use Vanilla BERT

To initialize a TaBERT model from the parameters of BERT:

from table_bert import TableBertModel

model = TableBertModel.from_pretrained('bert-base-uncased')

Example Applications

TaBERT can be used as a general-purpose representation learning layer for semantic parsing tasks over database tables. Example applications can be found under the examples folder.
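
As a rough illustration of what "representation layer" can mean in practice, a downstream parser might score each table column against the utterance. The sketch below mean-pools the token vectors and ranks columns by dot product. This is an assumed scoring scheme chosen for brevity, not the method used in the examples folder. The helper names (mean_pool, dot, rank_columns) are hypothetical, and small nested lists with made-up values stand in for one batch element of context_encoding and column_encoding so that the code runs without PyTorch or a checkpoint.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(token_vectors: Sequence[Vector]) -> List[float]:
    """Average a sequence of token vectors into a single vector."""
    count = len(token_vectors)
    return [sum(dim) / count for dim in zip(*token_vectors)]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_columns(
    token_vectors: Sequence[Vector], column_vectors: Sequence[Vector]
) -> List[int]:
    """Return column indices ordered from highest to lowest score."""
    query = mean_pool(token_vectors)
    scores = [dot(query, column) for column in column_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins: 2 utterance tokens and 2 columns with hidden size 3.
# The real encodings shown earlier have hidden size 768.
toy_context = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_columns = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
ranking = rank_columns(toy_context, toy_columns)
print(ranking)
```

With the actual tensors, the same scores could be computed for the first batch element as context_encoding[0].mean(dim=0) @ column_encoding[0].T, which yields one score per column.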

Reference

If you plan to use TaBERT in your project, please consider citing our paper:

@inproceedings{yin20acl,
    title = {Ta{BERT}: Pretraining for Joint Understanding of Textual and Tabular Data},
    author = {Pengcheng Yin and Graham Neubig and Wen-tau Yih and Sebastian Riedel},
    booktitle = {Annual Conference of the Association for Computational Linguistics (ACL)},
    month = {July},
    year = {2020}
}

License

TaBERT is CC-BY-NC 4.0 licensed as of now.
