- https://python.langchain.com/v0.1/docs/use_cases/data_generation/
- https://github.com/sudarshan-koirala/youtube-stuffs/blob/main/langchain/synthetic_data_generation.ipynb

# Synthetic Data Generation

## Setup
First, install the langchain library and its dependencies. Because we use the OpenAI generator chain, we also install `langchain-openai`, and because the synthetic data tooling lives in an experimental package, we include `langchain_experimental` as well. We then import the necessary modules.

In [1]:
%pip install --upgrade --quiet  langchain langchain_experimental langchain-openai

# import os
# os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # never commit a real key

from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI
import pandas as pd

print('Done.')

Note: you may need to restart the kernel to use updated packages.
done
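The commented-out lines in the cell above set `OPENAI_API_KEY` through the environment. As a minimal sketch, assuming the key is exported before the notebook starts, a small guard can fail early with a clear message instead of failing later inside the generator. The helper name `require_api_key` is hypothetical and not part of langchain.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return key
```

Calling `require_api_key()` once near the top of the notebook keeps the secret out of the saved cells and their outputs.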


In [2]:
# Set display options to show the full width of the DataFrame
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.expand_frame_repr', False)  # Don't wrap DataFrame representation
pd.set_option('display.max_colwidth', None)  # Show full content of each column

print('Done.')

done


## Positive Reviews

### 1. Define Your Data Model
Every dataset has a structure or a "schema". The `AmazonReviews` class below serves as our schema for the synthetic data. By defining this, we're informing our synthetic data generator about the shape and nature of data we expect.

In [3]:
class AmazonReviews(BaseModel):
    amazon_review: str

print('Done.')

done


### 2. Sample Data
To guide the synthetic data generator, it's useful to provide it with a few real-world-like examples. These examples serve as a "seed" - they're representative of the kind of data you want, and the generator will use them to create more data that looks similar.

Here are some fictional Amazon review records:

In [4]:
examples = [
    {
        "example": """This product has greatly minimized my environmental impact, and the nature-friendly packaging is a huge plus."""
    },
    {
        "example": """I love how this product is crafted from renewable resources. It's perfect for anyone aiming to be eco-conscious."""
    },
    {
        "example": """The biodegradable materials used in this product make it an excellent choice for environmentally conscious consumers."""
    },
    {
        "example": """Choosing this item was a step towards being climate-friendly. I am impressed by its low-carbon production process."""
    },
    {
        "example": """The eco-friendly components and recyclable packaging make this product a winner in my book."""
    },
    {
        "example": """I appreciate this product's commitment to reducing waste with its compostable materials."""
    },
    {
        "example": """This item is both eco-safe and highly functional. Its green credentials are top-notch."""
    },
    {
        "example": """I'm thrilled with this product's low-impact design. It truly aligns with my values of environmental responsibility."""
    },
    {
        "example": """This product is a great example of earth-friendly innovation. Its energy-efficient operation is commendable."""
    },
    {
        "example": """The natural ingredients in this product are fantastic. It's nice to see a company that cares about the environment."""
    },
    {
        "example": """This product’s zero-waste approach is impressive. I love that it’s made with clean, non-polluting processes."""
    },
    {
        "example": """I’m so happy to have found a product that uses organic materials and supports a low-carbon lifestyle."""
    },
    {
        "example": """The use of upcycled materials in this product is fantastic. It’s a great choice for anyone wanting to reduce their footprint."""
    },
    {
        "example": """This item is incredibly planet-friendly, from its recyclable components to its eco-friendly production."""
    },
    {
        "example": """I'm pleased with this product's commitment to being climate-friendly. Its eco-conscious design is exactly what I was looking for."""
    },
    {
        "example": """This product's biodegradable packaging is a major plus. It’s great to find something so environmentally friendly."""
    },
    {
        "example": """The energy-efficient design of this product helps me feel good about my purchase. It's both practical and eco-friendly."""
    },
    {
        "example": """I love that this product is made with non-polluting materials. It’s perfect for an eco-conscious lifestyle."""
    },
    {
        "example": """This product’s eco-safe materials and production processes make it a standout choice for green living."""
    },
    {
        "example": """I appreciate the low-carbon footprint of this product. It's an excellent addition to my environmentally conscious home."""
    },
    {
        "example": """The compostable packaging of this product is fantastic. It’s great to see companies prioritizing the environment."""
    },
    {
        "example": """This product's recyclable materials make it an eco-friendly choice that I feel good about using."""
    },
    {
        "example": """I love how this product supports a zero-waste lifestyle. Its natural ingredients are a big plus."""
    },
    {
        "example": """The eco-conscious design of this product is impressive. It’s made with the planet in mind, and it shows."""
    },
    {
        "example": """This product's earth-friendly materials and low-impact design are exactly what I was looking for."""
    },
    {
        "example": """I’m very happy with this product’s green credentials. It’s made from renewable resources and is very eco-friendly."""
    },
    {
        "example": """The organic components of this product make it an excellent choice for those looking to reduce their environmental footprint."""
    },
    {
        "example": """I appreciate this product’s commitment to sustainability with its recyclable and compostable materials."""
    },
    {
        "example": """This product’s eco-friendly production methods are impressive. It’s a great choice for anyone wanting to live more sustainably."""
    },
    {
        "example": """I’m thrilled with this product’s climate-friendly design. It’s both practical and environmentally responsible."""
    },
    {
        "example": """This product's natural and biodegradable materials make it a fantastic choice for eco-conscious consumers."""
    },
    {
        "example": """I love the green materials used in this product. It’s perfect for anyone aiming to be more environmentally friendly."""
    },
    {
        "example": """This product is an excellent example of how to be eco-conscious. Its low-impact design is very impressive."""
    },
    {
        "example": """I’m very pleased with this product’s eco-safe ingredients. It’s a great addition to my environmentally conscious lifestyle."""
    },
    {
        "example": """The planet-friendly packaging of this product is fantastic. It’s great to see companies taking responsibility for the environment."""
    },
    {
        "example": """I love the energy-efficient design of this product. It’s perfect for anyone looking to reduce their carbon footprint."""
    },
    {
        "example": """This product’s recyclable materials are a major plus. It’s a great choice for eco-conscious consumers."""
    },
    {
        "example": """I appreciate this product’s biodegradable components. It’s an excellent example of earth-friendly innovation."""
    },
    {
        "example": """This product’s natural ingredients and low-impact production are impressive. It’s perfect for an eco-conscious lifestyle."""
    },
    {
        "example": """The climate-friendly design of this product is fantastic. It’s both practical and environmentally responsible."""
    },
    {
        "example": """I’m thrilled with this product’s eco-friendly packaging. It’s great to find something so environmentally conscious."""
    },
    {
        "example": """This product's clean and non-polluting materials make it a standout choice for green living."""
    },
    {
        "example": """I love how this product is made with renewable resources. It’s perfect for anyone wanting to live more sustainably."""
    },
    {
        "example": """This product's eco-conscious design is very impressive. It's made with the planet in mind and it shows."""
    },
    {
        "example": """I’m very pleased with this product’s green credentials. It’s made from organic materials and is very eco-friendly."""
    },
    {
        "example": """The environmentally friendly production of this product is fantastic. It’s a great choice for anyone wanting to reduce their footprint."""
    },
    {
        "example": """I appreciate this product's commitment to being low-carbon. It’s a perfect addition to my eco-friendly home."""
    },
    {
        "example": """This product's eco-safe and biodegradable materials make it a great choice for those wanting to be environmentally conscious."""
    },
    {
        "example": """I love the nature-friendly design of this product. It’s perfect for anyone aiming to reduce their environmental impact."""
    }
]


print('Done.')

done


### 3. Craft a Prompt Template
The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format.

In [4]:
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

print('Done.')

done

The FewShotPromptTemplate includes:

- `prefix` and `suffix`: These likely contain guiding context or instructions.
- `examples`: The sample data we defined earlier.
- `input_variables`: These variables ("subject", "extra") are placeholders you fill in when generating. In this notebook, "subject" is filled with "consumer_products" and "extra" carries the detailed instructions for the reviews.
- `example_prompt`: This prompt template is the format we want each example row to take in our prompt.
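To make the roles of these pieces concrete, here is a rough stdlib sketch of how a few-shot prompt is assembled: the prefix, each rendered example, and the suffix are joined with blank lines, and the placeholders are filled in. This is a simplification for illustration, not the library's implementation. The prefix and suffix strings below are hypothetical stand-ins, not the actual contents of `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX`.

```python
def build_few_shot_prompt(prefix, examples, suffix, **variables):
    """Join prefix, rendered examples, and the filled-in suffix with blank lines."""
    # Each example is rendered as its bare text, mirroring template="{example}".
    rendered = [item["example"] for item in examples]
    return "\n\n".join([prefix, *rendered, suffix.format(**variables)])


# Hypothetical prefix and suffix, for illustration only.
HYPOTHETICAL_PREFIX = "Here are some example reviews:"
HYPOTHETICAL_SUFFIX = "Now write one new review about {subject}. {extra}"

seed = [
    {"example": "Great recyclable packaging."},
    {"example": "Made from renewable materials."},
]
prompt = build_few_shot_prompt(
    HYPOTHETICAL_PREFIX,
    seed,
    HYPOTHETICAL_SUFFIX,
    subject="consumer_products",
    extra="Keep it short.",
)
print(prompt)
```

Because every example is included in every prompt, the length of the seed list directly affects the size of each request.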

### 4. Creating the Data Generator
With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data.

In [6]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=AmazonReviews,
    llm=ChatOpenAI(temperature=1),  # reads OPENAI_API_KEY from the environment
    prompt=prompt_template,
)

print('Done.')

done


### 5. Generate Synthetic Data
Finally, let's get our synthetic data!

In [7]:
synthetic_results_1 = synthetic_data_generator.generate(
    subject="consumer_products",
    extra="Generate human-like Amazon customer reviews that positively highlight the product's sustainable and eco-friendly practices. Use terms like 'eco-friendly', 'green', 'environmentally friendly', 'renewable', 'nature friendly', 'earth friendly', 'ecological', 'low impact', 'non polluting', 'clean', 'organic', 'biodegradable', 'natural', 'environmentally conscious', 'eco conscious', 'environment friendly', 'climate friendly', 'recyclable', 'carbon neutral', 'zero waste', 'low carbon', 'energy efficient', 'eco safe', 'planet friendly', 'compostable', 'reusable', 'solar powered', 'wind powered', 'recycled', 'upcycled', and similar, but do not overuse the word 'sustainable'. Ensure the reviews sound natural and genuine and make sure each review is unique (no duplicates).",
    runs=1500,
)

print(synthetic_results_1)

# Create a list of dictionaries from the objects
synthetic_data_1 = []

for item in synthetic_results_1:
    synthetic_data_1.append({
        'amazon_review': item.amazon_review
    })

# Create a Pandas DataFrame from the list of dictionaries
synthetic_df_1 = pd.DataFrame(synthetic_data_1)

# Label encoding
synthetic_df_1['attribute'] = 1
synthetic_df_1['attribute_sentiment'] = 1

# Save dataset
synthetic_df_1.to_csv('synthetic_data_sustainability_positive.csv', index=False)
print('Saved as CSV.')

print('Done.')

[AmazonReviews(amazon_review='This product is a game-changer for those looking to embrace an eco-conscious lifestyle. The renewable resources used in its production make it a standout choice for environmentally conscious consumers.'), AmazonReviews(amazon_review="I love how this product is crafted from renewable resources. It's perfect for anyone aiming to be eco-conscious."), AmazonReviews(amazon_review="I am thrilled with this product's eco-friendly design. Its low-impact production process aligns perfectly with my values of environmental responsibility."), AmazonReviews(amazon_review="I am impressed by this product's commitment to being eco-friendly. Its use of renewable and recyclable materials sets it apart as a top choice for environmentally conscious consumers."), AmazonReviews(amazon_review='This product is a game-changer for those looking to embrace an eco-conscious lifestyle. The renewable resources used in its production make it a standout choice for environmentally consciou
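The printed output above shows that the "no duplicates" instruction in the prompt is not enforced: the second generated review repeats one of the seed examples verbatim, and the fifth appears to begin the same way as the first. A minimal post-processing sketch, assuming the reviews have been pulled out into a plain list of strings, keeps the first occurrence of each review and drops verbatim seed copies. The helper names `normalize` and `filter_reviews` are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def filter_reviews(generated, seed_examples):
    """Keep the first occurrence of each review that is not a copy of a seed example."""
    seen = {normalize(item["example"]) for item in seed_examples}
    kept = []
    for review in generated:
        key = normalize(review)
        if key not in seen:
            seen.add(key)
            kept.append(review)
    return kept
```

For example, `filter_reviews([item.amazon_review for item in synthetic_results_1], examples)` would return the cleaned list before it is turned into a DataFrame. This only catches exact matches after normalization; near-duplicates with small wording changes would need a similarity measure.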

## Negative Reviews

In [5]:
examples = [
    {
        "example": """This product is highly polluting, with a lot of unnecessary packaging that harms the environment."""
    },
    {
        "example": """I was disappointed to find out that this product is made from harmful, non-renewable materials."""
    },
    {
        "example": """The unsustainable practices used in the production of this item are concerning. It's not a good choice for the environment."""
    },
    {
        "example": """This product relies heavily on extractive processes that deplete natural resources and harm ecosystems."""
    },
    {
        "example": """Unfortunately, this item uses a lot of fossil fuels in its production, which is very climate-damaging."""
    },
    {
        "example": """The high-impact manufacturing of this product is very destructive to the environment. I wouldn't recommend it."""
    },
    {
        "example": """This product's synthetic materials are very harmful and non-biodegradable, causing long-term environmental damage."""
    },
    {
        "example": """I regret purchasing this item. Its processed and non-organic ingredients are not eco-friendly at all."""
    },
    {
        "example": """The packaging of this product is non-recyclable and very wasteful. It's a bad choice for anyone concerned about the planet."""
    },
    {
        "example": """This product's carbon-intensive production process is extremely high-carbon and environmentally damaging."""
    },
    {
        "example": """I was disappointed by the use of artificial and unnatural components in this product. It feels very neglectful of the environment."""
    },
    {
        "example": """This item is very energy-inefficient and eco-hazardous. It has a huge negative impact on the environment."""
    },
    {
        "example": """The persistent and non-biodegradable materials in this product make it very harmful to the planet."""
    },
    {
        "example": """This product's single-use and disposable nature are very wasteful. It’s not a good choice for anyone wanting to be eco-conscious."""
    },
    {
        "example": """The non-compostable packaging of this product is very disappointing. It contributes to landfill waste."""
    },
    {
        "example": """This item relies on virgin materials, which is very unsustainable and environmentally damaging."""
    },
    {
        "example": """The downcycled materials in this product are of poor quality and still harmful to the environment."""
    },
    {
        "example": """This product's environmentally damaging production processes are very concerning. I wouldn't recommend it to anyone eco-conscious."""
    },
    {
        "example": """I regret buying this product. Its climate-damaging and high-carbon footprint is very harmful to the environment."""
    },
    {
        "example": """The synthetic and non-organic components of this product make it very damaging to the environment."""
    },
    {
        "example": """This item is very wasteful and not eco-friendly at all. Its non-recyclable packaging is a big drawback."""
    },
    {
        "example": """The use of non-renewable materials in this product is very concerning. It's not a good choice for the environment."""
    },
    {
        "example": """I was disappointed by the high-impact and destructive production methods used for this product."""
    },
    {
        "example": """The artificial ingredients in this product are very harmful and not eco-conscious at all."""
    },
    {
        "example": """This product's non-biodegradable packaging is very wasteful and damaging to the environment."""
    },
    {
        "example": """The fossil fuels used in the production of this product are very harmful and climate-damaging."""
    },
    {
        "example": """I regret buying this item. Its non-recyclable materials are very wasteful and damaging to the planet."""
    },
    {
        "example": """The environmentally damaging practices used to create this product are very concerning. It's not eco-friendly at all."""
    },
    {
        "example": """This product's synthetic materials are very harmful and persistent, causing long-term environmental damage."""
    },
    {
        "example": """The energy-inefficient production of this product is very eco-hazardous and damaging to the planet."""
    },
    {
        "example": """This item is very wasteful and its single-use nature is not good for the environment."""
    },
    {
        "example": """The high-carbon footprint of this product is very concerning. It's not a good choice for anyone wanting to reduce their impact."""
    },
    {
        "example": """I was disappointed by the non-renewable and extractive materials used in this product. It's very damaging to the environment."""
    },
    {
        "example": """This product's non-biodegradable and non-compostable packaging is very wasteful and harmful to the environment."""
    },
    {
        "example": """The synthetic and non-organic components of this item make it very damaging to the planet."""
    },
    {
        "example": """This product's high-impact and destructive production methods are very concerning. It's not eco-friendly at all."""
    },
    {
        "example": """The non-recyclable materials used in this product are very wasteful and environmentally damaging."""
    },
    {
        "example": """I regret buying this product. Its energy-inefficient production is very harmful to the environment."""
    },
    {
        "example": """The persistent and synthetic materials in this product are very harmful and not eco-conscious at all."""
    },
    {
        "example": """This item is very wasteful and its single-use nature is not good for the environment."""
    },
    {
        "example": """The high-carbon production process of this product is very concerning. It's not a good choice for the environment."""
    },
    {
        "example": """I was disappointed by the use of non-renewable and extractive materials in this product. It's very harmful to the planet."""
    },
    {
        "example": """This product's non-biodegradable and non-compostable packaging is very wasteful and environmentally damaging."""
    },
    {
        "example": """The synthetic and artificial components of this product make it very damaging to the environment."""
    },
    {
        "example": """This product's high-impact and destructive production methods are very concerning. It's not eco-friendly at all."""
    },
    {
        "example": """The non-recyclable materials used in this product are very wasteful and harmful to the environment."""
    },
    {
        "example": """I regret buying this item. Its energy-inefficient production is very eco-hazardous and damaging to the planet."""
    },
    {
        "example": """The persistent and non-organic materials in this product are very harmful and not eco-conscious at all."""
    }
]

print('Done.')

done
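The negative seed list above contains entries that appear more than once, for example "This item is very wasteful and its single-use nature is not good for the environment." Repeated seeds add prompt length without adding variety. A small sketch for spotting them, using only the standard library (the function name `find_repeated_examples` is hypothetical):

```python
from collections import Counter


def find_repeated_examples(examples):
    """Return {text: count} for every seed text that occurs more than once."""
    counts = Counter(item["example"] for item in examples)
    return {text: n for text, n in counts.items() if n > 1}
```

Running `find_repeated_examples(examples)` on the list above would report the repeated entries so they can be removed or reworded before generation.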


In [6]:
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

print('Done.')

done


In [7]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=AmazonReviews,
    llm=ChatOpenAI(temperature=1),  # reads OPENAI_API_KEY from the environment
    prompt=prompt_template,
)

print('Done.')

done


In [8]:
synthetic_results_2 = synthetic_data_generator.generate(
    subject="consumer_products",
    extra="Generate human-like Amazon customer reviews that criticize the product for unsustainable and environmentally harmful practices. Use terms like 'polluting', 'harmful', 'unsustainable', 'extractive', 'depleting', 'non-renewable', 'fossil fuels', 'high impact', 'destructive', 'dirty', 'contaminated', 'synthetic', 'non-organic', 'processed', 'non-biodegradable', 'persistent', 'artificial', 'unnatural', 'unconscious', 'neglectful', 'environmentally damaging', 'climate-damaging', 'non-recyclable', 'high-carbon', 'carbon-intensive', 'wasteful', 'energy-inefficient', 'eco-hazardous', 'planet-damaging', 'non-compostable', 'disposable', 'single-use', 'virgin', 'downcycled'. Ensure the reviews sound natural and genuine and make sure each review is unique (no duplicates).",
    runs=1500,
)

print(synthetic_results_2)

# Create a list of dictionaries from the objects
synthetic_data_2 = []

for item in synthetic_results_2:
    synthetic_data_2.append({
        'amazon_review': item.amazon_review
    })

# Create a Pandas DataFrame from the list of dictionaries
synthetic_df_2 = pd.DataFrame(synthetic_data_2)

# Label encoding
synthetic_df_2['attribute'] = 1
synthetic_df_2['attribute_sentiment'] = -1

# Save dataset
synthetic_df_2.to_csv('synthetic_data_sustainability_negative.csv', index=False)
print('Saved as CSV.')

[AmazonReviews(amazon_review='This product is extremely wasteful and environmentally damaging. The high-impact production methods used are concerning and not eco-friendly at all.'), AmazonReviews(amazon_review="This product's high-impact production methods are extremely polluting and harmful to the environment. I was disappointed to discover its unsustainable and extractive practices. The use of non-renewable materials in its production is very concerning and non-eco-friendly. This item's dirty and synthetic components are damaging and non-biodegradable. The high-carbon footprint of this product is very climate-damaging and wasteful. Its non-recyclable packaging is planet-damaging and neglectful of environmental concerns. I regret purchasing this product as it is energy-inefficient and environmentally damaging."), AmazonReviews(amazon_review="This product's manufacturing process is highly damaging to the environment. The extractive methods used are unsustainable and harmful. I regret b

In [9]:
# Display the DataFrame
synthetic_df_2

done
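The two saved files share the same columns (`amazon_review`, `attribute`, `attribute_sentiment`), so they can be combined into one labelled dataset. The sketch below is one possible way to do this with the standard library's `csv` module rather than pandas; it reads rows from any number of open CSV files and shuffles them with a fixed seed so that positive and negative reviews are interleaved. The function name `merge_labelled_csvs` is hypothetical, and the in-memory files stand in for the real CSVs.

```python
import csv
import io
import random


def merge_labelled_csvs(sources, seed=0):
    """Read rows from several CSV file objects and return them as one shuffled list."""
    rows = []
    for handle in sources:
        rows.extend(csv.DictReader(handle))
    # A fixed seed keeps the shuffled order reproducible between runs.
    random.Random(seed).shuffle(rows)
    return rows


# In-memory stand-ins for the positive and negative CSV files.
positive = io.StringIO("amazon_review,attribute,attribute_sentiment\nGreat,1,1\n")
negative = io.StringIO("amazon_review,attribute,attribute_sentiment\nBad,1,-1\n")
combined = merge_labelled_csvs([positive, negative])
print(len(combined))
```

Note that `csv.DictReader` returns every value as a string, so the label columns would need converting to integers before training a model on them.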
