Easy Synthetic Text Generator (aka ESTGen)

Easy Synthetic Text Generator (ESTGen) helps you create quality synthetic text data. Think of it as a toolkit for generating synthetic text data with an LLM, based on categories you provide.

What is it?

ESTGen is a simple script execution framework that comes with three Python scripts:

gen_prompts.py: Generates prompts from prompt_template and prompt_categories.json so that gen_data.py can use them to generate synthetic data with an LLM
gen_data.py: Generates data by executing the pre-generated prompts against a HuggingChat model or ChatGPT
filter_data.py: Filters the dataset by cosine similarity and analyzes the generated data
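
To make the cosine similarity filtering idea concrete, here is a minimal sketch. It is not the code in filter_data.py: the bag-of-words vectors, the greedy keep-or-drop loop, the 0.8 threshold, and the function names are all assumptions chosen for illustration, and the real script may use a different text representation.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts (an assumed stand-in for the real text vectors)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(texts, threshold=0.8):
    """Keep a text only if it stays below the threshold against every kept text."""
    kept, kept_vectors = [], []
    for text in texts:
        vector = bag_of_words(text)
        if all(cosine_similarity(vector, other) < threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


if __name__ == "__main__":
    samples = [
        "I have not been sleeping well lately",
        "I have not been sleeping well lately at all",
        "Work has been going great this week",
    ]
    for text in filter_by_similarity(samples, threshold=0.8):
        print(text)
```

With these samples, the second sentence is dropped as a near-duplicate of the first. Raising the threshold keeps more of the data, and lowering it keeps only the more distinct entries.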

ESTGen execution is configured in estgen_config.json. gen_prompts and gen_data are supported natively, and other Python scripts can be added to the execution steps. When a new Python script is added, it is invoked with its JSON parameters passed as command-line arguments.

For example,

        {
            "script": "convert_to_prompt_categories.py",
            "description": "Convert scenarios to category seeds in JSON format",
            "input_filename": "scenarios.csv",
            "output_filename": "prompt_categories_seeded.json"
        },

would be executed as

python convert_to_prompt_categories.py --input_filename scenarios.csv --output_filename prompt_categories_seeded.json
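
The mapping from a JSON step to a command line can be sketched roughly as follows. This is an illustration and not the code in run_estgen.py: the build_command helper is hypothetical, and it assumes that "script" and "description" describe the step itself while every other key is forwarded as a --key value pair, which is what the example above suggests.

```python
import sys

# Assumption: these keys describe the step and are not forwarded as arguments.
NON_ARGUMENT_KEYS = {"script", "description"}


def build_command(step):
    """Turn one execution-step dict into an argv list suitable for subprocess.run."""
    command = [sys.executable, step["script"]]
    for key, value in step.items():
        if key in NON_ARGUMENT_KEYS:
            continue
        command.extend([f"--{key}", str(value)])
    return command


if __name__ == "__main__":
    step = {
        "script": "convert_to_prompt_categories.py",
        "description": "Convert scenarios to category seeds in JSON format",
        "input_filename": "scenarios.csv",
        "output_filename": "prompt_categories_seeded.json",
    }
    # Show the command without running it; subprocess.run(command, check=True)
    # would execute it.
    print(" ".join(build_command(step)))
```

Building the command as a list, instead of a single string, avoids shell quoting problems when a filename contains spaces.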

Why should I care?

If you are looking to train an AI model to classify text communication data, then ESTGen can help you obtain your first set of labeled synthetic data fast. It saves you time and manual processing by automating many mundane tasks, such as generating prompts, monitoring whether all prompts have been executed, and re-running the data generation process.

It offers the following features:

Prompt generation: Generate prompts from a template, with variable replacement and seeding
Data generation: Run against multiple models, pick up where it left off, and reuse the earlier prompt response cache
Multiple language model support: Out-of-the-box support for the free HuggingChat APIs, which offer 10 different LLMs including meta-llama/Llama-3.3-70B-Instruct and CohereForAI/c4ai-command-r-plus-08-2024, and the ChatGPT APIs, which offer GPT-4o and GPT-4o-Mini
Anti-hallucination support: Built-in retry logic (3 retries) to recover from invalid JSON and from an incorrect number of elements in the JSON array
Cosine similarity filtering and data analysis reporting: A script that filters the dataset by a cosine similarity threshold so that it consists of sufficiently distinct data, and a report that shows the shortest, median, and longest data along with a cosine similarity threshold sensitivity analysis
Custom scripts with input and output: A placeholder for introducing any custom Python script that takes input and generates output
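
The anti-hallucination retry can be pictured with the sketch below. It is not the code in gen_data.py: generate_with_retry and call_model are hypothetical names, and treating the retry limit as at most three attempts in total is an assumption. The idea is simply to reject any response that is not valid JSON, or that is not an array of the expected length, and ask again.

```python
import json


def generate_with_retry(call_model, prompt, expected_count, max_attempts=3):
    """Call the model until it returns a JSON array with expected_count elements.

    call_model is any function that takes a prompt string and returns the raw
    response text (a hypothetical stand-in for the real model client).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        try:
            items = json.loads(raw)
        except json.JSONDecodeError as error:
            last_error = f"attempt {attempt}: invalid JSON ({error.msg})"
            continue
        if not isinstance(items, list) or len(items) != expected_count:
            last_error = f"attempt {attempt}: expected a list of {expected_count} items"
            continue
        return items
    raise ValueError(f"no valid response after {max_attempts} attempts; {last_error}")


if __name__ == "__main__":
    # A fake model that fails twice before returning a well-formed answer.
    canned = iter(["not json", '["only one"]', '["first", "second"]'])
    print(generate_with_retry(lambda _prompt: next(canned), "demo prompt", expected_count=2))
```

Passing the model call in as a function keeps the validation logic independent of which backend (HuggingChat or ChatGPT) produced the response.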

How do I use it?

You can run ESTGen in three easy steps:

Step 1: Clone the repo https://github.com/jhkdes/Easy-Synthetic-Text-Generator
Step 2: Obtain your HuggingChat login credentials and a ChatGPT API key, and populate them in the estgen_config.json file
- login_email: Login email for HuggingChat
- login_passwd: Login password for HuggingChat
- api_key: API key for ChatGPT
Step 3: Install the required packages, and run

pip install -r requirements.txt
python run_estgen.py

It will start generating synthetic text conversations using the 9 depression categories outlined in the prompt_categories.json file. A copy of my synthetic dataset, generated using the same configuration, is available on HuggingFace.
