# Using SingleClassClassifierSyntheticDataGenerator

This notebook demonstrates how to use the `SingleClassClassifierSyntheticDataGenerator` to create synthetic labeled datasets for single-class classification tasks, where each input is assigned exactly one category.

### Import the Zenbase Library

In [1]:
import sys
import subprocess

def install_package(package):
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    except subprocess.CalledProcessError as e:
        print(f"Failed to install {package}: {e}")
        raise

def install_packages(packages):
    for package in packages:
        install_package(package)

try:
    # Check if running in Google Colab
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # Install the zenbase package if running in Google Colab
    # install_package('zenbase')
    # Install the zenbase package from a GitHub branch if running in Google Colab
    install_package('git+https://github.com/zenbase-ai/lib.git@main#egg=zenbase&subdirectory=py')

    # List of other packages to install in Google Colab
    additional_packages = [
        'python-dotenv',
        'openai',
        'langchain',
        'langchain_openai',
        'instructor',
        'matplotlib',
        'pandas',  # imported in Step 6
    ]

    # Install additional packages
    install_packages(additional_packages)

# Now import the zenbase library
try:
    import zenbase
except ImportError as e:
    print("Failed to import zenbase: ", e)
    raise
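
Outside Colab, the cell above installs nothing, so it can help to check which dependencies are importable before continuing. A minimal sketch using only the standard library; `missing_packages` and the `required` mapping are hypothetical names introduced here, not part of zenbase:

```python
import importlib.util

def missing_packages(import_to_pip: dict[str, str]) -> list[str]:
    """Return the pip names whose import name cannot be found in this environment."""
    return [
        pip_name
        for import_name, pip_name in import_to_pip.items()
        if importlib.util.find_spec(import_name) is None
    ]

# Import name -> pip name; the two differ for python-dotenv.
required = {
    "dotenv": "python-dotenv",
    "openai": "openai",
    "instructor": "instructor",
    "matplotlib": "matplotlib",
}
print("Missing packages:", missing_packages(required))
```

An empty list means every listed package can be imported; anything else names what to pass to `pip install`.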

### Configure the Environment

In [2]:
from pathlib import Path
from dotenv import load_dotenv

# Alternatively, set the API key directly instead of loading it from a file:
# import os
#
# os.environ["OPENAI_API_KEY"] = "..."

load_dotenv(Path("../../.env.test"), override=True)

True
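
`load_dotenv` returning `True` does not by itself confirm that the key the OpenAI client needs is present. A minimal sketch of an explicit check, assuming the key lives in the `OPENAI_API_KEY` environment variable (the name used in the commented-out lines of the cell above); `require_env` is a hypothetical helper, and the demo uses a throwaway variable so it runs without real credentials:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it")
    return value

# Throwaway variable for demonstration; in the notebook you would check OPENAI_API_KEY
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

Failing early here gives a clearer error than an authentication failure during generation.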

In [3]:
# Import necessary libraries
import instructor
from openai import OpenAI
# The module path is spelled `syntethic_data` in this import; keep it as written
from zenbase.predefined.syntethic_data.single_class_classifier import SingleClassClassifierSyntheticDataGenerator

## Step 1: Set up the OpenAI client and Instructor

In [4]:
# Initialize the OpenAI client
openai_client = OpenAI()

# Initialize the Instructor client
instructor_client = instructor.from_openai(openai_client)

## Step 2: Define the classification task

In [5]:
# Define the prompt for the classification task
prompt_definition = """Your task is to accurately categorize each incoming news article into one of the given categories based on its title and content."""

# Define the class dictionary
class_dict = {
    "Automobiles": "Discussions and news about automobiles, including car maintenance, driving experiences, and the latest automotive technology.",
    "Computers": "Topics related to computer hardware, software, graphics, cryptography, and operating systems, including troubleshooting and advancements.",
    "Science": "News and discussions about scientific topics including space exploration, medicine, and electronics.",
    "Politics": "Debates and news about political topics, including gun control, Middle Eastern politics, and miscellaneous political discussions.",
}
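
Because every label and description is sent to the model, a quick sanity check on the dictionary can catch mistakes before any API call is made. A minimal sketch; `validate_class_dict` is a hypothetical helper and the rules it enforces are assumptions chosen for illustration, not checks that zenbase performs:

```python
def validate_class_dict(class_dict: dict[str, str]) -> None:
    """Raise ValueError if labels or descriptions look unusable (hypothetical helper)."""
    if len(class_dict) < 2:
        raise ValueError("A classification task needs at least two classes.")
    seen = set()
    for label, description in class_dict.items():
        normalized = label.strip().lower()
        if not normalized:
            raise ValueError("Class labels must not be empty.")
        # Labels that differ only by case or surrounding whitespace are easy to confuse
        if normalized in seen:
            raise ValueError(f"Duplicate label after normalization: {label!r}")
        seen.add(normalized)
        if not description.strip():
            raise ValueError(f"Class {label!r} has an empty description.")

demo_classes = {
    "Automobiles": "News about cars and driving.",
    "Computers": "Hardware and software topics.",
}
validate_class_dict(demo_classes)
print("class_dict looks usable")
```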

## Step 3: Create the SingleClassClassifierSyntheticDataGenerator

In [6]:
# Create the generator
generator = SingleClassClassifierSyntheticDataGenerator(
    instructor_client=instructor_client,
    prompt=prompt_definition,
    class_dict=class_dict,
    model="gpt-4o-mini"
)

## Step 4: Generate synthetic data

In [7]:
# Generate examples
examples_per_category = 5
examples = generator.generate_examples(examples_per_category)

# Display the first few examples
print(f"Generated {len(examples)} examples in total.\n")
for i, example in enumerate(examples[:10]):
    print(f"Example {i+1}:")
    print(f"Input: {example.inputs}")
    print(f"Output: {example.outputs}\n")

Generated 20 examples in total.

Example 1:
Input: With the rise of electric vehicles, many car manufacturers are racing to develop new battery technologies that allow for faster charging and longer ranges. Tesla and Rivian are leading the charge, but traditional companies like Ford and General Motors are also making significant investments in this area. What are the implications of these advancements?
Output: Automobiles

Example 2:
Input: Today, I took my brand new sedan for a drive along the coastal highway. The smooth acceleration and responsive steering made for an exhilarating experience. I can't wait to take it on a road trip this weekend!
Output: Automobiles

Example 3:
Input: Maintaining your vehicle is crucial for its longevity. Regular oil changes, brake inspections, and tire rotations are essential. Many drivers often neglect these services, but they can save you significant money in the long run. Here are some tips to keep your car in top shape throughout the years.
Output
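
Printing the first ten examples covers only the first categories, so a per-label count is a quick way to confirm that every class received its share. A minimal sketch, assuming each example exposes `.inputs` and `.outputs` as in the loop above and that `.outputs` renders as the label text; the `SimpleNamespace` objects are stand-ins so the snippet runs without calling the API:

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in examples with the same attribute names used in the loop above
stand_in_examples = [
    SimpleNamespace(inputs="Tips for rotating your tires.", outputs="Automobiles"),
    SimpleNamespace(inputs="A new GPU driver was released.", outputs="Computers"),
    SimpleNamespace(inputs="How to check your brake pads.", outputs="Automobiles"),
]

def label_counts(examples) -> Counter:
    """Count how many examples carry each label."""
    return Counter(str(example.outputs) for example in examples)

for label, count in label_counts(stand_in_examples).most_common():
    print(f"{label}: {count}")
```

In the notebook, `label_counts(examples)` could be called on the real list instead of the stand-ins.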

## Step 5: Generate and save CSV

In [None]:
# Generate CSV content
csv_content = generator.generate_csv(examples_per_category)

# Display the first few lines of the CSV content
print("First few lines of the generated CSV:")
print("\n".join(csv_content.split("\n")[:6]))

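# Save the CSV to a file.
# Note: save_csv takes examples_per_category rather than csv_content, so it
# appears to run a fresh generation; the saved rows may differ from the preview above.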
filename = "synthetic_dataset.csv"
generator.save_csv(filename, examples_per_category)
print(f"\nCSV file saved as: {filename}")

First few lines of the generated CSV:
inputs,outputs
"The latest electric cars are making waves in the auto industry! With advancements in battery technology, manufacturers are now able to produce vehicles with longer ranges and faster charging times. Brands like Tesla and Rivian are leading the charge, but traditional automakers are stepping up their game too. Have you experienced the thrill of driving an electric vehicle yet?",Automobiles
"Regular maintenance is crucial for your car's longevity. From changing the oil to checking tire pressure, these small tasks can keep your vehicle running smoothly. This weekend, I plan to give my car a thorough check-up and even polish it for that showroom shine!",Automobiles
"I recently took a road trip across the country, and my new SUV made the journey so much more enjoyable. The advanced GPS system helped me navigate through unfamiliar roads, and the spacious interior provided comfort for my family. What features do you look for in a family 

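Splitting on `"\n"` previews raw lines, which can cut a record in two if a quoted field contains a line break. A minimal sketch of a row-based preview with the standard `csv` module; `preview_rows` is a hypothetical helper and `sample_csv` is stand-in text that reuses the `inputs,outputs` header shown above:

```python
import csv
import io
from itertools import islice

# Stand-in CSV text; the first row has a comma and a line break inside a quoted field
sample_csv = (
    "inputs,outputs\n"
    '"My new SUV handled the trip well,\nand the GPS never failed.",Automobiles\n'
    '"A driver update fixes several rendering bugs.",Computers\n'
)

def preview_rows(csv_text: str, limit: int = 5) -> list[dict[str, str]]:
    """Parse CSV text and return up to `limit` rows as dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return list(islice(reader, limit))

for row in preview_rows(sample_csv):
    print(row["outputs"], "|", row["inputs"].replace("\n", " "))
```

In the notebook, `preview_rows(csv_content)` would return whole records regardless of how many physical lines each one spans.
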
## Step 6: Analyze the generated dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file
df = pd.read_csv(filename)

# Display basic statistics
print("Dataset statistics:")
print(df['outputs'].value_counts())

# Visualize the distribution of categories
plt.figure(figsize=(10, 6))
df['outputs'].value_counts().plot(kind='bar')
plt.title('Distribution of Categories in the Synthetic Dataset')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Conclusion

This notebook demonstrated how to use the `SingleClassClassifierSyntheticDataGenerator` to create a synthetic dataset for a single-class classification task. We covered the process of setting up the generator, creating examples, saving them to a CSV file, and performing a basic analysis of the generated data.

You can now use this synthetic dataset for various purposes, such as:
- Training and evaluating machine learning models
- Testing data processing pipelines
- Exploring different classification algorithms

Adjust `examples_per_category` and `class_dict` to suit your task, and raise `examples_per_category` to generate larger datasets.
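
For the training and evaluation use case, the rows are typically split so that every category appears in both parts. A minimal sketch of a stratified split using only the standard library; `stratified_split`, its parameters, and the stand-in rows are hypothetical and not part of zenbase, and the rows are assumed to be dictionaries keyed by the CSV header (`inputs`, `outputs`):

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key="outputs", test_fraction=0.2, seed=0):
    """Split rows into (train, test) while keeping each label's share roughly equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        # Hold out at least one row per label when the label has more than one row
        n_test = max(1, round(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Stand-in rows: 4 labels x 5 rows, mirroring examples_per_category = 5
labels = ["Automobiles", "Computers", "Science", "Politics"]
demo_rows = [
    {"inputs": f"{label} text {i}", "outputs": label}
    for label in labels
    for i in range(5)
]

train_rows, test_rows = stratified_split(demo_rows)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

With five rows per label and a 20% test fraction, this holds out one row per label, giving 16 training rows and 4 test rows.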