Tabula

This repository contains the code for the paper TabuLa: Harnessing Language Models for Tabular Data Synthesis. Tabula improves tabular data synthesis by leveraging language model structures without the burden of pre-trained model weights. It preprocesses tabular data to shorten the token sequence, which sharply reduces training time while consistently delivering higher-quality synthetic data.

Prerequisite

Tabula requires Python >= 3.9 and the following minimum library versions:

datasets >= 2.5.2
numpy >= 1.24.2
pandas >= 1.4.4
scikit_learn >= 1.1.1
torch >= 1.10.2
tqdm >= 4.64.1
transformers >= 4.22.1
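
To confirm that an environment meets the minimums listed above, a small check along these lines may help. This is a minimal sketch and not part of the Tabula code: the helper names `parse_version` and `check_requirements` are hypothetical, the version parsing is deliberately simplified (it only compares the leading numeric components), and `scikit-learn` is assumed to be the pip distribution name for the `scikit_learn` entry.

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the list above.
# Assumption: "scikit-learn" is the installed distribution name for scikit_learn.
MIN_VERSIONS = {
    "datasets": "2.5.2",
    "numpy": "1.24.2",
    "pandas": "1.4.4",
    "scikit-learn": "1.1.1",
    "torch": "1.10.2",
    "tqdm": "4.64.1",
    "transformers": "4.22.1",
}


def parse_version(text):
    """Turn '1.24.2' or '2.1.0+cu118' into a tuple of ints (simplified)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric components such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(minimums=MIN_VERSIONS):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python >= 3.9 is required")
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if parse_version(installed) < parse_version(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for line in check_requirements() or ["All prerequisites satisfied."]:
        print(line)
```

Running the script prints one line per missing or outdated package, or a single confirmation line when nothing is wrong.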

Tabula quickstart

Follow the Python notebook Tabula_on_insurance_dataset.ipynb for a training example on the Insurance dataset. The Insurance dataset is included in this repository. We do not hold the copyright of the dataset; the original dataset can also be downloaded here. The pre-trained model on the intrusion dataset, as used in the paper, can be downloaded here. Do not forget to create a folder named pretrained-model and put the downloaded model inside.
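
The folder step can also be scripted. The following is a minimal sketch, not taken from the repository: the function name `place_pretrained_model` is hypothetical, the file name in the usage comment is a placeholder for whatever the download is actually called, and it assumes the pretrained-model folder belongs in the directory from which the notebook is run.

```python
import shutil
from pathlib import Path


def place_pretrained_model(downloaded_file, repo_root="."):
    """Create pretrained-model/ under repo_root and move the download into it."""
    target_dir = Path(repo_root) / "pretrained-model"
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    destination = target_dir / Path(downloaded_file).name
    shutil.move(str(downloaded_file), str(destination))
    return destination


# Usage (placeholder path; substitute the real downloaded file):
# place_pretrained_model("/path/to/downloaded_model_file")
```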

Acknowledgement

Our code adapts the training structure of GReaT. We also thank HuggingFace for their LLM model.

Citation

Please use the following BibTeX entry to cite this paper:

@misc{zhao2023tabula,
      title={TabuLa: Harnessing Language Models for Tabular Data Synthesis}, 
      author={Zilong Zhao and Robert Birke and Lydia Chen},
      year={2023},
      eprint={2310.12746},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
