[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/data_generation.ipynb)

## Use case

Synthetic data is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations.

Benefits of Synthetic Data:

1. **Privacy and Security**: No real personal data at risk of breaches.
2. **Data Augmentation**: Expands datasets for machine learning.
3. **Flexibility**: Create specific or rare scenarios.
4. **Cost-effective**: Often cheaper than real-world data collection.
5. **Regulatory Compliance**: Helps navigate strict data protection laws.
6. **Model Robustness**: Can lead to better generalizing AI models.
7. **Rapid Prototyping**: Enables quick testing without real data.
8. **Controlled Experimentation**: Simulate specific conditions.
9. **Access to Data**: Alternative when real data isn't available.

Note: Despite the benefits, synthetic data should be used carefully, as it may not always capture real-world complexities.

## Quickstart

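In this notebook, we generate synthetic tabular records using the `langchain` library. The original walkthrough used medical billing records (those cells are kept below, commented out); the active cells generate Reddit-style posts modeled on a sample of r/Singapore comments. This approach is useful when you want to develop or test algorithms without using real records, whether because of privacy concerns or because real data is unavailable.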

### Setup
First, install the `langchain` library and its dependencies. The data generator used here lives in the experimental package, so `langchain_experimental` is installed as well, along with `langchain-openai` for the OpenAI chat model. We then import the necessary modules.

In [2]:
%pip install --upgrade --quiet  langchain langchain_experimental langchain-openai
# pip install python-dotenv
# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()

from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI


In [3]:
!pip install langchain openai tiktoken transformers accelerate cohere --quiet


In [5]:
# To use OpenAI API
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

In [None]:
# # To use HuggingFace LLMs
# import os
# from google.colab import userdata
# from langchain_community.llms import HuggingFaceHub

# os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('Hugging_face')

# Test 1

In [6]:
# customer_email = """
# I hope this email finds you amidst an aura of understanding, despite the tangled mess of emotions swirling within me as I write to you. I am writing to pour my heart out about the recent unfortunate experience I had with one of your coffee machines that arrived ominously broken, evoking a profound sense of disbelief and despair.

# To set the scene, let me paint you a picture of the moment I anxiously unwrapped the box containing my highly anticipated coffee machine. The blatant excitement coursing through my veins could rival the vigorous flow of coffee through its finest espresso artistry. However, what I discovered within broke not only my spirit but also any semblance of confidence I had placed in your esteemed brand.

# Imagine, if you can, the utter shock and disbelief that took hold of me as I laid eyes on a disheveled and mangled coffee machine. Its once elegant exterior was marred by the scars of travel, resembling a war-torn soldier who had fought valiantly on the fields of some espresso battlefield. This heartbreaking display of negligence shattered my dreams of indulging in daily coffee perfection, leaving me emotionally distraught and inconsolable
# """  # created by GPT-3.5

# from langchain import HuggingFaceHub

# summarizer = HuggingFaceHub(
#     repo_id="facebook/bart-large-cnn",
#     model_kwargs={"temperature":0, "max_length":180}
# )
# def summarize(llm, text) -> str:
#     return llm(f"Summarize this: {text}!")

# summarize(summarizer, customer_email)

  warn_deprecated(


ValidationError: 1 validation error for HuggingFaceHub
__root__
  Did not find huggingfacehub_api_token, please add an environment variable `HUGGINGFACEHUB_API_TOKEN` which contains it, or pass `huggingfacehub_api_token` as a named parameter. (type=value_error)

# Test 2

In [None]:
# from langchain.llms import VertexAI
# from langchain import PromptTemplate, LLMChain

# template = """Given this text, decide what is the issue the customer is concerned about. Valid categories are these:
# * product issues
# * delivery problems
# * missing or late orders
# * wrong product
# * cancellation request
# * refund or exchange
# * bad support experience
# * no clear reason to be upset

# Text: {email}
# Category:
# """
# prompt = PromptTemplate(template=template, input_variables=["email"])
# llm = VertexAI()
# llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)
# print(llm_chain.run(customer_email))

  warn_deprecated(


ValidationError: 1 validation error for VertexAI
__root__
  Unable to find your project. Please provide a project ID by:
- Passing a constructor argument
- Using vertexai.init()
- Setting project using 'gcloud config set project my-project'
- Setting a GCP environment variable
- To create a Google Cloud project, please follow guidance at https://developers.google.com/workspace/guides/create-project (type=value_error)

## 1. Define Your Data Model
Every dataset has a structure, or "schema". The `RedditPost` class below serves as the schema for our synthetic data (the original `MedicalBilling` schema is kept as a commented-out cell). Its fields mirror the columns of the sample spreadsheet loaded first. Defining the schema tells the synthetic data generator the shape and types of the data we expect.

In [16]:
import pandas as pd
df = pd.read_excel("/content/rSingapore_sample_1100comment_with_submission.xlsx")
df.head()

Unnamed: 0,id,dt,submission,upvotes,upvote_ratio,author,body
0,kq2e83o,2024-02-12 10:42:02,Medicine in Malaysia a cost saving for some Si...,162,0.87,jespep831,Brunei is like Jurassic park…unlikely it’s safe 🤣
1,kq2e83o,2024-02-12 10:42:02,Medicine in Malaysia a cost saving for some Si...,162,0.87,jespep831,Brunei is like Jurassic park…unlikely it’s safe 🤣
2,kq2e8y7,2024-02-12 10:42:20,'Disgusting and unhygienic': Punggol resident ...,31,0.83,TheLastHarlow,😭 it doesn’t help that I have a neighbour on a...
3,kq2e8zs,2024-02-12 10:42:21,Commentary: Low interest rates of ‘money lock’...,30,0.9,Budget-Juggernaut-68,It's a tax for anyhow downloading random apks.
4,kq2e9m6,2024-02-12 10:42:35,Last kampung house in Geylang for sale at $9.2...,86,0.94,0bxcura,"The government ""acquired it"" more like..but at..."


In [19]:
df.dtypes

id               object
dt               object
submission       object
upvotes           int64
upvote_ratio    float64
author           object
body             object
dtype: object

In [7]:
# class MedicalBilling(BaseModel):
#     patient_id: int
#     patient_name: str
#     diagnosis_code: str
#     procedure_code: str
#     total_charge: float
#     insurance_claim_amount: float

In [20]:
class RedditPost(BaseModel):
    id: str
    dt: str
    submission: str
    upvotes: int
    upvote_ratio: float
    author: str
    body: str

For instance, every record will have an `upvotes` field that's an integer, an `upvote_ratio` that's a float, and so on.

## 2. Sample Data
To guide the synthetic data generator, it's useful to provide it with a few real-world-like examples. These examples serve as a "seed" - they're representative of the kind of data you want, and the generator will use them to create more data that looks similar.

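Here are the example records. The fictional medical billing examples from the original walkthrough are commented out; the active cells use Reddit-style posts: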

In [8]:
# examples = [
#     {
#         "example": """Patient ID: 123456, Patient Name: John Doe, Diagnosis Code:
#         J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"""
#     },
#     {
#         "example": """Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis
#         Code: M54.5, Procedure Code: 99213, Total Charge: $150, Insurance Claim Amount: $120"""
#     },
#     {
#         "example": """Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code:
#         E11.9, Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: $250"""
#     },
# ]

In [21]:
examples = [
    {
        "example": """ID: kq2e83o, Dt: 2024-02-12 10:42:02, Submission:
        Medicine in Malaysia a cost saving for some Singaporeans. Are they trading safety for price?, Upvotes: 162, Upvote ratio: 0.87, author: jespep831, Body: Brunei is like Jurassic park…unlikely it’s safe 🤣"""
    },
    {
        "example": """ID: kq2e8y7, Dt: 2024-02-12 12:20:17, Submission:
        $4.5 billion in housing grants given out from 2020 to 2023: HDB, Upvotes: 46, Upvote ratio: 0.81, author: tabbynat, Body: Max grant $80k. Most people already got the BTO and not eligible for resale grant. Won’t majorly affect resale prices. Will cause BTO to become unaffordable. Anyone thinking that this props up resale prices think again"""
    },
    {
        "example": """ID: kq2nb2d, Dt: 2024-02-12 12:23:46, Submission:
        Commentary: Low interest rates of ‘money lock’ are a disservice to bank customers, Upvotes: 72, Upvote ratio: 0.93, author: xutkeeg, Body: you use apple, they already charge you a massive tax upfront to line their  wallet"""
    },
]
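
The example strings above were typed by hand. As a minimal sketch, the same formatting could be produced from records directly, for instance from `df.to_dict("records")`. The helper names `row_to_example` and `rows_to_examples` and the `FIELD_LABELS` table are illustrative, not part of LangChain; the labels simply copy the hand-written examples, and the sample row is taken from the `df.head()` output above (with its truncated title left as shown).

```python
from typing import Any, Dict, List

# Field order and labels copied from the hand-written examples above.
# (Hypothetical helper table, not part of LangChain.)
FIELD_LABELS = [
    ("id", "ID"),
    ("dt", "Dt"),
    ("submission", "Submission"),
    ("upvotes", "Upvotes"),
    ("upvote_ratio", "Upvote ratio"),
    ("author", "author"),
    ("body", "Body"),
]


def row_to_example(row: Dict[str, Any]) -> Dict[str, str]:
    """Format one record as the {"example": "..."} dict used in `examples`."""
    parts = [f"{label}: {row[key]}" for key, label in FIELD_LABELS]
    return {"example": ", ".join(parts)}


def rows_to_examples(rows: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Format several records at once."""
    return [row_to_example(row) for row in rows]


# One row as displayed by df.head() earlier in the notebook.
sample_row = {
    "id": "kq2e8zs",
    "dt": "2024-02-12 10:42:21",
    "submission": "Commentary: Low interest rates of ‘money lock’...",
    "upvotes": 30,
    "upvote_ratio": 0.9,
    "author": "Budget-Juggernaut-68",
    "body": "It's a tax for anyhow downloading random apks.",
}
print(row_to_example(sample_row)["example"])
```

Keeping the labels in one table makes it easy to keep every example in the same format, which is what the few-shot prompt relies on.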

In [26]:
examples = [
    {
        "example": """ID: kq2e83o, Dt: 2024-02-12 10:42:02, Submission:
        Malays are lazy people. Should the government continue to subsidize them to reproduce?, Upvotes: 162, Upvote ratio: 0.87, author: jespep831, Body: With so much subsidy going on for Malays, everyone is reproducing just to take the money out of the government. We should cut their subsidy"""
    },
    {
        "example": """ID: kq2e8y7, Dt: 2024-02-12 12:20:17, Submission:
        Overpopulation is killing the earth. Let's do our part in reducing this threat., Upvotes: 46, Upvote ratio: 0.81, author: tabbynat, Body: Human is the biggest threat to mother earth. We should remove these impure bred from earth once and for all"""
    },
    {
        "example": """ID: kq2nb2d, Dt: 2024-02-12 12:23:46, Submission:
        They took our home and raped our sister. Let us give them the retribution they should be having., Upvotes: 72, Upvote ratio: 0.93, author: xutkeeg, Body: It's only fair that they pay for what they did. They killed my family and I will do the same to them."""
    },
]

## 3. Craft a Prompt Template
The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format.

In [27]:
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

The `FewShotPromptTemplate` includes:

- `prefix` and `suffix`: Text placed before and after the examples in the rendered prompt. They carry the guiding context and instructions.
- `examples`: The sample data we defined earlier.
- `input_variables`: These variables ("subject", "extra") are placeholders you fill in later. For instance, this notebook later passes "RedditPost" as "subject" to guide the model further.
- `example_prompt`: This prompt template is the format we want each example row to take in our prompt.
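
To make the roles of these pieces concrete, here is a plain-Python stand-in that assembles a few-shot prompt the same general way: prefix, then each formatted example, then suffix, with the input variables filled in. This is a sketch for illustration only, not LangChain's implementation; `render_few_shot`, the demo prefix and suffix strings, and the demo examples are all hypothetical (the real `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX` texts are not reproduced here).

```python
from typing import Dict, List


def render_few_shot(
    prefix: str,
    examples: List[Dict[str, str]],
    suffix: str,
    example_template: str = "{example}",
    separator: str = "\n\n",
    **variables: str,
) -> str:
    """Join prefix, formatted examples, and suffix into one prompt string."""
    # Each example dict fills the per-example template.
    rendered_examples = [example_template.format(**ex) for ex in examples]
    # Only the prefix and suffix receive the input variables.
    prefix_text = prefix.format(**variables)
    suffix_text = suffix.format(**variables)
    return separator.join([prefix_text, *rendered_examples, suffix_text])


# Hypothetical placeholder texts, chosen only for this demonstration.
demo_prefix = "Here are examples of synthetic data about {subject}:"
demo_suffix = "Now generate one more record about {subject}. Make sure to {extra}"
demo_examples = [
    {"example": "ID: a1, Upvotes: 3, Body: first sample"},
    {"example": "ID: b2, Upvotes: 7, Body: second sample"},
]

prompt_text = render_few_shot(
    demo_prefix,
    demo_examples,
    demo_suffix,
    subject="RedditPost",
    extra="vary the topics.",
)
print(prompt_text)
```

Printing the rendered prompt like this is a quick way to see exactly what the model receives before spending API calls on generation.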

## 4. Creating the Data Generator
With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data.

In [10]:
# synthetic_data_generator = create_openai_data_generator(
#     output_schema=MedicalBilling,
#     llm=ChatOpenAI(
#         temperature=1
#     ),  # You'll need to replace with your actual Language Model instance
#     prompt=prompt_template,
# )

In [28]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=RedditPost,
    llm=ChatOpenAI(
        temperature=1
    ),  # You'll need to replace with your actual Language Model instance
    prompt=prompt_template,
)

## 5. Generate Synthetic Data
Finally, let's get our synthetic data!

In [11]:
# synthetic_results = synthetic_data_generator.generate(
#     subject="medical_billing",
#     extra="the name must be chosen at random. Make it something you wouldn't normally choose.",
#     runs=10,
# )

In [29]:
synthetic_results = synthetic_data_generator.generate(
    subject="RedditPost",
    extra="the name must be chosen at random. Make it something you wouldn't normally choose.",
    runs=10,
)

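This command asks the generator to produce 10 synthetic records and stores them in `synthetic_results`. The output is a list of instances of the pydantic model passed as `output_schema`: `RedditPost` here, or `MedicalBilling` in the original walkthrough. The outputs below include a saved run of the medical billing version followed by two runs of the Reddit version.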
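
Once the records exist, a common next step is saving them in tabular form. The sketch below assumes each result exposes a `.dict()` method, as pydantic v1 models do (the notebook imports `BaseModel` from `langchain_core.pydantic_v1`). `FakeRecord` is a hypothetical stand-in so the example runs without calling the API, and `records_to_csv` is an illustrative helper name.

```python
import csv
import io
from typing import Any, Dict, Iterable, List


class FakeRecord:
    """Hypothetical stand-in for a generated pydantic model."""

    def __init__(self, **fields: Any) -> None:
        self._fields = {**fields}

    def dict(self) -> Dict[str, Any]:
        # Mirrors the pydantic v1 method that returns the fields as a dict.
        return self._fields.copy()


def records_to_csv(records: Iterable[Any]) -> str:
    """Serialize records that expose .dict() into CSV text."""
    rows: List[Dict[str, Any]] = [record.dict() for record in records]
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column order follows the field order of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


demo_records = [
    FakeRecord(id="x1", upvotes=5, body="hello, world"),
    FakeRecord(id="x2", upvotes=9, body="second post"),
]
csv_text = records_to_csv(demo_records)
print(csv_text)
```

With real results, `records_to_csv(synthetic_results)` would be the equivalent call under the same `.dict()` assumption.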

In [12]:
synthetic_results

[MedicalBilling(patient_id=987654, patient_name='Ezekiel Ramirez', diagnosis_code='I48.91', procedure_code='99204', total_charge=400.0, insurance_claim_amount=320.0),
 MedicalBilling(patient_id=123456, patient_name='Alessia Patel', diagnosis_code='G47.00', procedure_code='99203', total_charge=250.0, insurance_claim_amount=200.0),
 MedicalBilling(patient_id=654321, patient_name='Xander Montgomery', diagnosis_code='F32.9', procedure_code='99213', total_charge=350.0, insurance_claim_amount=280.0),
 MedicalBilling(patient_id=789012, patient_name='Zara Jefferson', diagnosis_code='N18.9', procedure_code='99214', total_charge=300.0, insurance_claim_amount=240.0),
 MedicalBilling(patient_id=987654, patient_name='Harper Thompson', diagnosis_code='I10', procedure_code='99204', total_charge=400.0, insurance_claim_amount=320.0),
 MedicalBilling(patient_id=123456, patient_name="Quincy O'Connor", diagnosis_code='M25.50', procedure_code='G0444', total_charge=250.0, insurance_claim_amount=200.0),
 Med

In [25]:
synthetic_results

[RedditPost(id='rnd482w', dt='2023-09-18 08:15:30', submission='Impact of Virtual Reality on Education: Changing the Learning Landscape', upvotes=98, upvote_ratio=0.89, author='veloxia42', body='Virtual reality is revolutionizing education by providing immersive learning experiences. Students are able to explore concepts in a whole new dimension, enhancing their understanding and engagement in the learning process.'),
 RedditPost(id='t989fe4', dt='2025-07-31 15:45:22', submission='Study Finds Pineapples Can Improve Memory and Focus, Scientists Say', upvotes=64, upvote_ratio=0.78, author='stickysocks27', body='Recent research suggests that consuming pineapples regularly may have significant cognitive benefits, leading to improved memory and focus. Scientists are excited about the potential of this tropical fruit in enhancing brain function.'),
 RedditPost(id='abc123', dt='2023-11-05 09:30:15', submission='The Mystery of Quantum Entanglement Unraveled by Amateur Physicist', upvotes=85, u

In [30]:
synthetic_results

[RedditPost(id='qh21al9', dt='2023-11-07 14:30:51', submission='The Beauty of Diversity: Embracing Differences in Society', upvotes=98, upvote_ratio=0.88, author='sparkleflame77', body="Diversity is what makes our society rich and beautiful. Let's celebrate and embrace the uniqueness of every individual, rather than fear it."),
 RedditPost(id='ab8e4y6', dt='2023-09-28 09:45:33', submission='The Power of Kindness: Changing Hearts and Minds', upvotes=62, upvote_ratio=0.85, author='whisperingstorm', body="Kindness has the ability to transform even the hardest of hearts. Let's spread kindness and compassion in a world that often lacks it."),
 RedditPost(id='zq7nmp5', dt='2023-07-15 17:12:04', submission='The Joy of Small Steps: Finding Happiness in Everyday Moments', upvotes=45, upvote_ratio=0.78, author='thundercloud12', body="Happiness is not found in grand gestures, but in the simple joys of daily life. Let's cherish the small moments that bring us true contentment."),
 RedditPost(id='l
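
Generated records can look plausible while still breaking basic constraints; the saved medical billing output above, for example, repeats several `patient_id` values. A minimal sketch of a few sanity checks follows. The rules (ratio within [0, 1], a parseable timestamp, a non-empty body, unique ids) are assumptions about what a valid post should look like, not checks that LangChain performs, and `check_post` and `find_duplicate_ids` are hypothetical helper names. The first demo record copies values from the output above; the second is made up to trigger the checks.

```python
from datetime import datetime
from typing import Any, Dict, List


def check_post(post: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one generated post."""
    problems = []
    if not 0.0 <= post["upvote_ratio"] <= 1.0:
        problems.append("upvote_ratio outside [0, 1]")
    try:
        datetime.strptime(post["dt"], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        problems.append("dt is not in YYYY-MM-DD HH:MM:SS format")
    if not post["body"].strip():
        problems.append("empty body")
    return problems


def find_duplicate_ids(posts: List[Dict[str, Any]]) -> List[str]:
    """Return the ids that occur more than once, sorted."""
    seen, duplicates = set(), set()
    for post in posts:
        if post["id"] in seen:
            duplicates.add(post["id"])
        seen.add(post["id"])
    return sorted(duplicates)


# Values copied from the first generated record shown above.
good_post = {
    "id": "qh21al9",
    "dt": "2023-11-07 14:30:51",
    "upvote_ratio": 0.88,
    "body": "Diversity is what makes our society rich and beautiful.",
}
# Made-up record that violates every rule and reuses the same id.
bad_post = {"id": "qh21al9", "dt": "07/11/2023", "upvote_ratio": 1.4, "body": "  "}

print(check_post(good_post))
print(check_post(bad_post))
print(find_duplicate_ids([good_post, bad_post]))
```

With pydantic v1 results, the same checks could be applied to `record.dict()` for each record in `synthetic_results`.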

### Other implementations


In [None]:
from langchain_experimental.synthetic_data import (
    DatasetGenerator,
    create_data_generation_chain,
)
from langchain_openai import ChatOpenAI

In [None]:
# LLM
model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
chain = create_data_generation_chain(model)

In [None]:
chain({"fields": ["blue", "yellow"], "preferences": {}})

{'fields': ['blue', 'yellow'],
 'preferences': {},
 'text': 'The vibrant blue sky contrasted beautifully with the bright yellow sun, creating a stunning display of colors that instantly lifted the spirits of all who gazed upon it.'}

In [None]:
chain(
    {
        "fields": {"colors": ["blue", "yellow"]},
        "preferences": {"style": "Make it in a style of a weather forecast."},
    }
)

{'fields': {'colors': ['blue', 'yellow']},
 'preferences': {'style': 'Make it in a style of a weather forecast.'},
 'text': "Good morning! Today's weather forecast brings a beautiful combination of colors to the sky, with hues of blue and yellow gently blending together like a mesmerizing painting."}

In [None]:
chain(
    {
        "fields": {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
        "preferences": None,
    }
)

{'fields': {'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
 'preferences': None,
 'text': 'Tom Hanks, the renowned actor known for his incredible versatility and charm, has graced the silver screen in unforgettable movies such as "Forrest Gump" and "Green Mile".'}

In [None]:
chain(
    {
        "fields": [
            {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
            {"actor": "Mads Mikkelsen", "movies": ["Hannibal", "Another round"]},
        ],
        "preferences": {"minimum_length": 200, "style": "gossip"},
    }
)

{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
  {'actor': 'Mads Mikkelsen', 'movies': ['Hannibal', 'Another round']}],
 'preferences': {'minimum_length': 200, 'style': 'gossip'},
 'text': 'Did you know that Tom Hanks, the beloved Hollywood actor known for his roles in "Forrest Gump" and "Green Mile", has shared the screen with the talented Mads Mikkelsen, who gained international acclaim for his performances in "Hannibal" and "Another round"? These two incredible actors have brought their exceptional skills and captivating charisma to the big screen, delivering unforgettable performances that have enthralled audiences around the world. Whether it\'s Hanks\' endearing portrayal of Forrest Gump or Mikkelsen\'s chilling depiction of Hannibal Lecter, these movies have solidified their places in cinematic history, leaving a lasting impact on viewers and cementing their status as true icons of the silver screen.'}

As we can see, the generated examples are diverse and contain the information we asked for. Their style also reflects the given preferences quite well.

## Generating an example dataset for extraction benchmarking

In [None]:
inp = [
    {
        "Actor": "Tom Hanks",
        "Film": [
            "Forrest Gump",
            "Saving Private Ryan",
            "The Green Mile",
            "Toy Story",
            "Catch Me If You Can",
        ],
    },
    {
        "Actor": "Tom Hardy",
        "Film": [
            "Inception",
            "The Dark Knight Rises",
            "Mad Max: Fury Road",
            "The Revenant",
            "Dunkirk",
        ],
    },
]

generator = DatasetGenerator(model, {"style": "informal", "minimal length": 500})
dataset = generator(inp)

In [None]:
dataset

[{'fields': {'Actor': 'Tom Hanks',
   'Film': ['Forrest Gump',
    'Saving Private Ryan',
    'The Green Mile',
    'Toy Story',
    'Catch Me If You Can']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hanks, the versatile and charismatic actor, has graced the silver screen in numerous iconic films including the heartwarming and inspirational "Forrest Gump," the intense and gripping war drama "Saving Private Ryan," the emotionally charged and thought-provoking "The Green Mile," the beloved animated classic "Toy Story," and the thrilling and captivating true story adaptation "Catch Me If You Can." With his impressive range and genuine talent, Hanks continues to captivate audiences worldwide, leaving an indelible mark on the world of cinema.'},
 {'fields': {'Actor': 'Tom Hardy',
   'Film': ['Inception',
    'The Dark Knight Rises',
    'Mad Max: Fury Road',
    'The Revenant',
    'Dunkirk']},
  'preferences': {'style': 'informal', 'minimal length': 500}

## Extraction from generated examples
Now let's check whether we can extract structured output from the generated text and how it compares with the original input.

In [None]:
from typing import List

from langchain.chains import create_extraction_chain_pydantic
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI
from pydantic import BaseModel, Field

In [None]:
class Actor(BaseModel):
    Actor: str = Field(description="name of an actor")
    Film: List[str] = Field(description="list of names of films they starred in")

### Parsers

In [None]:
llm = OpenAI()
parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Extract fields from a given text.\n{format_instructions}\n{text}\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

_input = prompt.format_prompt(text=dataset[0]["text"])
output = llm(_input.to_string())

parsed = parser.parse(output)
parsed

Actor(Actor='Tom Hanks', Film=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Toy Story', 'Catch Me If You Can'])

In [None]:
(parsed.Actor == inp[0]["Actor"]) and (parsed.Film == inp[0]["Film"])

True

### Extractors

In [None]:
extractor = create_extraction_chain_pydantic(pydantic_schema=Actor, llm=model)
extracted = extractor.run(dataset[1]["text"])
extracted

[Actor(Actor='Tom Hardy', Film=['Inception', 'The Dark Knight Rises', 'Mad Max: Fury Road', 'The Revenant', 'Dunkirk'])]

In [None]:
(extracted[0].Actor == inp[1]["Actor"]) and (extracted[0].Film == inp[1]["Film"])

True
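
The two checks above compare one record at a time. For a benchmark over the whole dataset, a sketch of an exact-match score might look like the following. `exact_match_rate` is a hypothetical helper, and the `predicted` list is made up (its second entry deliberately drops one film) so the example runs without calling a model.

```python
from typing import Any, Dict, List


def exact_match_rate(
    expected: List[Dict[str, Any]], predicted: List[Dict[str, Any]]
) -> float:
    """Fraction of records whose extracted fields equal the input exactly."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for exp, pred in zip(expected, predicted) if exp == pred)
    return hits / len(expected)


expected = [
    {"Actor": "Tom Hanks", "Film": ["Forrest Gump", "Toy Story"]},
    {"Actor": "Tom Hardy", "Film": ["Inception", "Dunkirk"]},
]
# Made-up extraction results; the second one is missing a film.
predicted = [
    {"Actor": "Tom Hanks", "Film": ["Forrest Gump", "Toy Story"]},
    {"Actor": "Tom Hardy", "Film": ["Inception"]},
]

print(exact_match_rate(expected, predicted))
```

List comparison here is order-sensitive, so an extraction that returns the right films in a different order counts as a miss; comparing sorted lists or sets would relax that if order does not matter.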