# Single Class Classifier with Synthetic Data Generation

This notebook demonstrates how to use the `SingleClassClassifierSyntheticDataGenerator` to create a synthetic dataset, and then use that dataset to train and test a `SingleClassClassifier`.

## Setup and Imports

### Install and Import the Zenbase Library

In [8]:
import sys
import subprocess

def install_package(package):
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    except subprocess.CalledProcessError as e:
        print(f"Failed to install {package}: {e}")
        raise

def install_packages(packages):
    for package in packages:
        install_package(package)

try:
    # Check if running in Google Colab
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # Install the zenbase package if running in Google Colab
    # install_package('zenbase')
    # Install the zenbase package from a GitHub branch if running in Google Colab
    install_package('git+https://github.com/zenbase-ai/lib.git@main#egg=zenbase&subdirectory=py')

    # List of other packages to install in Google Colab
    additional_packages = [
        'python-dotenv',
        'openai',
        'langchain',
        'langchain_openai',
        'instructor',
        'datasets'
    ]
    
    # Install additional packages
    install_packages(additional_packages)

# Now import the zenbase library
try:
    import zenbase
except ImportError as e:
    print("Failed to import zenbase: ", e)
    raise

### Configure the Environment

In [9]:
from pathlib import Path
from dotenv import load_dotenv

# Alternatively, set the API key directly instead of loading it from a .env file:
# import os
# os.environ["OPENAI_API_KEY"] = "..."

load_dotenv(Path("../../.env.test"), override=True)

True
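
The `True` above only indicates that `load_dotenv` found and set at least one variable; it does not confirm that the key the OpenAI client reads (`OPENAI_API_KEY`) is among them. A minimal sketch of an explicit check follows. The helper name `require_env` is hypothetical and not part of Zenbase or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for this notebook; not part of any library.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it "
            "before running the rest of the notebook."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv(...)` would fail fast with a readable message instead of surfacing later as an authentication error from the API client.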

In [10]:
import instructor
from openai import OpenAI
from zenbase.core.managers import ZenbaseTracer
from zenbase.predefined.single_class_classifier import SingleClassClassifier
from zenbase.predefined.syntethic_data.single_class_classifier import SingleClassClassifierSyntheticDataGenerator

# Set up OpenAI and Instructor clients
openai_client = OpenAI()
instructor_client = instructor.from_openai(openai_client)
zenbase_tracer = ZenbaseTracer()

## Define Classification Task

In [11]:
prompt_definition = """Your task is to accurately categorize each incoming news article into one of the given categories based on its title and content."""

class_dict = {
    "Automobiles": "Discussions and news about automobiles, including car maintenance, driving experiences, and the latest automotive technology.",
    "Computers": "Topics related to computer hardware, software, graphics, cryptography, and operating systems, including troubleshooting and advancements.",
    "Science": "News and discussions about scientific topics including space exploration, medicine, and electronics.",
    "Politics": "Debates and news about political topics, including gun control, Middle Eastern politics, and miscellaneous political discussions.",
}

## Generate Synthetic Data

In [12]:
# Set up the generator
generator = SingleClassClassifierSyntheticDataGenerator(
    instructor_client=instructor_client,
    prompt=prompt_definition,
    class_dict=class_dict,
    model="gpt-4o-mini"
)

# Define the number of examples per category for each set
train_examples_per_category = 10
val_examples_per_category = 3
test_examples_per_category = 3

# Generate train set
train_examples = generator.generate_examples(train_examples_per_category)
print(f"Generated {len(train_examples)} examples for the train set.\n")

# Generate validation set
val_examples = generator.generate_examples(val_examples_per_category)
print(f"Generated {len(val_examples)} examples for the validation set.\n")

# Generate test set
test_examples = generator.generate_examples(test_examples_per_category)
print(f"Generated {len(test_examples)} examples for the test set.\n")

Generated 40 examples for the train set.

Generated 12 examples for the validation set.

Generated 12 examples for the test set.
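
The counts above are consistent with the generator producing the requested number of examples for each of the four classes. As a quick sanity check, the expected split sizes can be computed from the class names and the per-category counts defined earlier. This sketch only reproduces the arithmetic and makes no calls to the generator.

```python
# Class names mirror the keys of class_dict defined in this notebook.
class_names = ["Automobiles", "Computers", "Science", "Politics"]

# Per-category counts mirror the values used in the generation cell.
per_category = {"train": 10, "validation": 3, "test": 3}

# Expected size of each split = examples per category * number of categories.
expected_sizes = {
    split: count * len(class_names) for split, count in per_category.items()
}
print(expected_sizes)  # {'train': 40, 'validation': 12, 'test': 12}
```

In the notebook itself, comparing `len(train_examples)` against `expected_sizes["train"]` would catch a generation call that silently returned fewer examples than requested.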



## Create and Train the Classifier

In [13]:
classifier = SingleClassClassifier(
    instructor_client=instructor_client,
    prompt=prompt_definition,
    class_dict=class_dict,
    model="gpt-4o-mini",
    zenbase_tracer=zenbase_tracer,
    training_set=train_examples,
    validation_set=val_examples,
    test_set=test_examples,
    samples=20,
)

result = classifier.perform()

## Analyze Results

In [14]:
print("Base Evaluation Score:", classifier.base_evaluation.evals['score'])
print("Best Evaluation Score:", classifier.best_evaluation.evals['score'])

print("\nBest function:", result.best_function)
print("Number of candidate results:", len(result.candidate_results))
print("Best candidate result:", result.best_candidate_result.evals)

Base Evaluation Score: 0.9166666666666666
Best Evaluation Score: 0.9166666666666666

Best function: <zenbase.types.LMFunction object at 0x14ac40760>
Number of candidate results: 20
Best candidate result: {'score': 0.9166666666666666}
Number of traces: 264
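
A score of 0.9166666666666666 is what 11 correct predictions out of 12 would give. The sketch below assumes the score is plain exact-match accuracy over a 12-example evaluation set, which the notebook does not confirm. The `accuracy` function and the label lists are illustrative, not Zenbase's own evaluator or data.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Illustrative labels only: 12 examples, one of which is misclassified.
expected = ["Computers"] * 12
predicted = ["Computers"] * 11 + ["Science"]

print(accuracy(predicted, expected))  # 0.9166666666666666
```

With only 12 evaluation examples, each misclassification moves the score by about 0.083, so identical base and best scores here are not strong evidence either way about whether optimization helped.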


## Test the Classifier

In [15]:
new_article = """
title: Revolutionary Quantum Computer Achieves Milestone in Cryptography
content: Scientists at a leading tech company have announced a breakthrough in quantum computing, 
demonstrating a quantum computer capable of solving complex cryptographic problems in record time. 
This development has significant implications for data security and could revolutionize fields 
ranging from finance to national security. However, experts warn that it also poses potential 
risks to current encryption methods.
"""

classification = result.best_function(new_article)
print(f"The article is classified as: {classification.class_label.name}")

The article is classified as: Computers
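
To classify several articles at once, the same call can be wrapped in a small loop. The sketch below assumes only what the cell above shows: the classifier is a callable that takes article text and returns an object with `class_label.name`. `classify_many` and `stub_classifier` are hypothetical names, and the stub stands in for `result.best_function` so the example runs without API access.

```python
from collections import Counter
from types import SimpleNamespace


def classify_many(classify_fn, articles):
    """Apply a classifier callable to each article and collect the label names."""
    labels = []
    for article in articles:
        prediction = classify_fn(article)
        labels.append(prediction.class_label.name)
    return labels


def stub_classifier(article):
    """Hypothetical stand-in for result.best_function (keyword match only)."""
    name = "Computers" if "quantum" in article.lower() else "Politics"
    return SimpleNamespace(class_label=SimpleNamespace(name=name))


articles = [
    "title: Quantum Computer Milestone\ncontent: A breakthrough in cryptography.",
    "title: Senate Vote\ncontent: Lawmakers debated a new bill on Tuesday.",
]

labels = classify_many(stub_classifier, articles)
print(Counter(labels))
```

In the notebook, passing `result.best_function` in place of `stub_classifier` would issue one model call per article.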


## Conclusion

In this notebook, we've demonstrated how to:
1. Generate synthetic data for a single-class classification task
2. Prepare the synthetic data for training and testing
3. Create and train a `SingleClassClassifier` using the synthetic data
4. Analyze the results of the classifier
5. Use the trained classifier to categorize new input

This approach allows for rapid prototyping and testing of classification models, especially in scenarios where real-world labeled data might be scarce or difficult to obtain.