Generation of Realistic Tabular data with pretrained Transformer-based language models


Our GReaT framework utilizes the capabilities of pretrained large language Transformer models to synthesize realistic tabular data. New samples are generated with just a few lines of code, following an easy-to-use API. Please see our publication for more details.
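
Under the hood, GReaT serializes each table row into a short textual sentence of "column is value" clauses (with the feature order randomly permuted during training), fine-tunes a pretrained language model on these sentences, and parses generated sentences back into table rows. The following minimal sketch (not part of the library; the function name is hypothetical) illustrates such an encoding:

import random
import pandas as pd

def encode_row(row: pd.Series, shuffle: bool = True) -> str:
    # Turn one table row into a "column is value" sentence, e.g.
    # "HouseAge is 41.0, Latitude is 37.88, ..."
    pairs = [f"{col} is {val}" for col, val in row.items()]
    if shuffle:
        random.shuffle(pairs)  # random feature order, as described in the GReaT paper
    return ", ".join(pairs)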

GReaT Installation

The GReaT framework can be easily installed with pip and requires Python >= 3.9:

pip install be-great

GReaT Quickstart

In the example below, we show how the GReaT approach is used to generate synthetic tabular data for the California Housing dataset.

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True).frame

# Fine-tune a distilled GPT-2 model on the table and generate 100 new rows
model = GReaT(llm='distilgpt2', epochs=50)
model.fit(data)
synthetic_data = model.sample(n_samples=100)
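
As a quick plausibility check (not part of the GReaT API), one way to judge the sample is to train the same simple regressor once on the real data and once on the synthetic data and compare their scores on a held-out real test set. The sketch below uses scikit-learn only and assumes the data and synthetic_data variables from the quickstart above.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hold out part of the real data for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=0)
target = 'MedHouseVal'  # target column of the California Housing frame

# One regressor fitted on real rows, one on synthetic rows
real_model = RandomForestRegressor(random_state=0).fit(train.drop(columns=target), train[target])
synth_model = RandomForestRegressor(random_state=0).fit(synthetic_data.drop(columns=target), synthetic_data[target])

# Similar scores on the real test set suggest the synthetic data is realistic
print('real:     ', r2_score(test[target], real_model.predict(test.drop(columns=target))))
print('synthetic:', r2_score(test[target], synth_model.predict(test.drop(columns=target))))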

GReaT Citation

If you use GReaT, please link or cite our work:

@article{borisov2022language,
  title={Language Models are Realistic Tabular Data Generators},
  author={Borisov, Vadim and Se{\ss}ler, Kathrin and Leemann, Tobias and Pawelczyk, Martin and Kasneci, Gjergji},
  journal={arXiv preprint arXiv:2210.06280},
  year={2022}
}

GReaT Acknowledgements

We sincerely thank the HuggingFace 🤗 framework.
