# Generate from API
A template for getting started with simple calls through `generate`.

In [1]:
# Assumes an Ollama server is already running (`ollama serve`).
import json
from typing import Any, Optional

from ollama import ChatResponse, chat
from pydantic.json_schema import JsonSchemaValue

model = "hf.co/unsloth/Llama-3.2-3B-Instruct-GGUF:Q6_K"

def generate(
    system_prompt: str,
    prompt: str,
    model: str,
    schema: Optional[JsonSchemaValue] = None,
    parse: bool = True,
    num_ctx: int = 48000,
    num_predict: int = 4000,
    temperature: float = 0.0,
) -> Any:
    """Send one system + user turn to the model.

    Returns the raw reply text, or, when `parse` is set and a `schema` is
    given, the decoded JSON object (None if the reply is not valid JSON).
    """
    response: ChatResponse = chat(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        options={
            "num_ctx": num_ctx,
            "num_predict": num_predict,
            "top_k": 100,
            "top_p": 0.8,
            "temperature": temperature,
            "seed": 0,  # not needed when temperature is 0
            "repeat_penalty": 1.3,  # from experience, keep this at the default for JSON outputs
        },
        format=schema,
        stream=False,
    )
    res = response.message.content
    if parse and schema:
        # The reply is JSON text: decode it with json.loads instead of eval,
        # which fails on true/false/null and would execute model output.
        try:
            res = json.loads(res)
        except (json.JSONDecodeError, TypeError):
            res = None
    return res
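
Structured replies arrive as JSON text. Below is a minimal sketch of decoding such a reply with the standard library's `json.loads`, which (unlike `eval`) understands `true`, `false` and `null` and never executes the text. `parse_reply` is a hypothetical helper name, and the reply string is made up for illustration.

```python
import json
from typing import Any, Optional


def parse_reply(text: Optional[str]) -> Optional[Any]:
    """Decode a JSON reply; return None when it is missing or malformed."""
    if text is None:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Made-up reply, shaped like a structured-output response.
reply = '{"labels": [{"name": "Polarity", "is_binary": true, "note": null}]}'
parsed = parse_reply(reply)
print(parsed["labels"][0]["is_binary"])  # JSON true is decoded to Python True

# eval fails on the same text, because `true` and `null` are not Python names.
try:
    eval(reply)
except NameError as err:
    print(f"eval failed: {err}")
```

Returning `None` on malformed input matches how `generate` signals a failed parse, so callers can keep a single `if output:` check.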

In [2]:
out = generate(system_prompt="", prompt="create a dataset of 5 samples for sentiment analysis", model=model)
print(out)

Here is an example of a small dataset with 5 sample reviews, each labeled as either positive or negative:

**Dataset: Sentiment Analysis**

| **Sample ID** | **Review Text** | **Sentiment (Positive/Negative)** |
| --- | --- | --- |
| 1 | "I loved the new restaurant in town! The food was amazing and the service was top-notch." | Positive |
| 2 | "The hotel room was small, but it had a great view of the city. Overall, I would not recommend this place to anyone else." | Negative |
| 3 | "This product is so easy to use and has really improved my daily routine! Highly recommended!" | Positive |
| 4 | "I'm extremely disappointed with the customer service at this company. They were completely unhelpful when I needed assistance." | Negative |
| 5 | "The concert was incredible, but unfortunately it started late due to technical issues. Still had a great time overall though!" | Neutral |

Note that in real-world datasets, you would typically have more data points and labels (e.g., positive/negat

This is exactly why we're interested in controllability. What fields do we have? What are the boundary conditions? How long should the text be, and which labels are allowed?
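
One way to make those questions concrete is to write the constraints down and check samples against them. The following is a minimal sketch using only the standard library; the field names, length bounds and label set are illustrative assumptions, not values taken from this notebook.

```python
from typing import Any

# Hypothetical constraints for a single sample.
ALLOWED_LABELS = {"positive", "negative"}
MIN_CHARS, MAX_CHARS = 20, 200


def violations(sample: dict[str, Any]) -> list[str]:
    """Return human-readable constraint violations (empty list if the sample is valid)."""
    missing = {"text", "label"} - sample.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]

    problems = []
    length = len(sample["text"])
    if not MIN_CHARS <= length <= MAX_CHARS:
        problems.append(f"text length {length} outside [{MIN_CHARS}, {MAX_CHARS}]")
    if sample["label"].lower() not in ALLOWED_LABELS:
        problems.append(f"unexpected label: {sample['label']!r}")
    return problems


sample = {"text": "The concert was incredible, but it started late.", "label": "Neutral"}
print(violations(sample))
```

With these assumed constraints, the fifth row of the free-form output above would be flagged: it is labeled `Neutral` even though the table header only promises positive or negative.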

## Agent basics

An agent, in this context, is a service that performs a single operation, optionally with access to a selection of tools. For now, we limit it to a simple task.

It's useful to define an interface for future agents:

In [3]:
from abc import ABC, abstractmethod
from typing import Any

class Agent(ABC):
    def __init__(self, model: str):
        self.generate = generate  # here, generate is found above, this can also be imported in the agent interface
        self.model = model

    @abstractmethod
    def system(self) -> str:
        pass

    @abstractmethod
    def prompt(self, *args, **kwargs) -> str:
        pass

    @abstractmethod
    def schema(self, *args, **kwargs) -> dict[str, Any]:
        pass

    @abstractmethod
    def __call__(self, *args, **kwargs) -> Any:
        pass
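
A minimal sketch of how a concrete agent might fill in this interface. To keep it runnable without a server, the interface is repeated in compact form and `fake_generate` is a hypothetical stand-in for `generate`; `KeywordAgent`, its prompt and its schema are likewise made up for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Optional


def fake_generate(
    system_prompt: str, prompt: str, model: str, schema: Optional[dict] = None, **options: Any
) -> Any:
    """Hypothetical stand-in for `generate`: returns a canned structured reply."""
    return {"keywords": ["agents", "schemas"]}


class Agent(ABC):
    """Compact copy of the interface, with the generate function injectable."""

    def __init__(self, model: str, generate_fn: Callable[..., Any] = fake_generate):
        self.generate = generate_fn
        self.model = model

    @abstractmethod
    def system(self) -> str: ...

    @abstractmethod
    def prompt(self, *args, **kwargs) -> str: ...

    @abstractmethod
    def schema(self, *args, **kwargs) -> dict[str, Any]: ...

    @abstractmethod
    def __call__(self, *args, **kwargs) -> Any: ...


class KeywordAgent(Agent):
    def __init__(self, model: str, text: str):
        super().__init__(model)
        self.text = text

    def system(self) -> str:
        return "You extract keywords from text."

    def prompt(self) -> str:
        return f"List the keywords in: {self.text}. Output in JSON."

    def schema(self) -> dict[str, Any]:
        return {
            "type": "object",
            "properties": {"keywords": {"type": "array", "items": {"type": "string"}}},
            "required": ["keywords"],
        }

    def __call__(self, output_key: str = "keywords") -> Any:
        output = self.generate(
            system_prompt=self.system(),
            prompt=self.prompt(),
            schema=self.schema(),
            model=self.model,
        )
        return output[output_key] if output else None


print(KeywordAgent(model="stub", text="Agents and schemas")())
```

Injecting the generate function is a design choice of this sketch: it lets the agent logic be exercised in tests without a running model.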

## What's your use case?

For a dataset generation system, the use case determines which labels (targets) we're interested in, along with the boundary conditions for the raw data, e.g., the text.

We can create two agents:
- a labeling agent, which defines a set of N labels (binary, multi-label, ...)
- a generation agent, which generates text and takes the output of the labeling agent as its input


In [4]:
from pydantic import BaseModel, Field, conlist
from typing import List, Dict, Any, Literal

def filter_labels(filtered_datasets: List[Dict[str, Any]]):
    filtered_labels = set()
    for dataset in filtered_datasets:
        for label in dataset["labels"]:
            filtered_labels.add(label["label"].lower())

    return list(filtered_labels)

class LabelerAgent(Agent):
    def __init__(
        self, model, topic: str, num_labels: int
    ):
        super().__init__(model)
        self.topic = topic
        self.num_labels = num_labels

    def system(self):
        return f"You are an assistant that labels datasets based on the topic of {self.topic}."

    def prompt(self):
        return f"Create a list of {self.num_labels} label categories for a task related to {self.topic}. Determine the name, a description, and list of possible values. There must be exactly {self.num_labels} values. Output in JSON."

    def schema(self):
        class DatasetLabel(BaseModel):
            name: str
            description: str
            possible_values: List[str]

        class LabelerSchema(BaseModel):
            # labels: List[DatasetLabel]
            labels: conlist(DatasetLabel, min_length=self.num_labels, max_length=self.num_labels)

        return LabelerSchema.model_json_schema()

    def __call__(self, output_key: str = "labels"):
        output = self.generate(
            system_prompt=self.system(),
            prompt=self.prompt(),
            schema=self.schema(),
            model=self.model,
            num_ctx=200,
            temperature=0.0,
        )
        if output:
            return output[output_key]

labels = LabelerAgent(model=model, topic="sentiment analysis", num_labels=4)()
for label in labels:
    print(label)
    print()


{'name': 'Positive Sentiment', 'description': 'Indicates that the text expresses positive emotions or opinions.', 'possible_values': ['Very Positive', 'Somewhat Positive', 'Neutral', 'Negative']}

{'name': 'Negative Sentiment', 'description': 'Indicates that the text expresses negative emotions or opinions.', 'possible_values': ['Strongly Negative', 'Moderately Negative', 'Mildly Negative', 'Very Mildly Negative']}

{'name': 'Neutral Sentiment', 'description': 'Indicates that the text does not express a clear positive, neutral, or negative opinion.', 'possible_values': ['Somewhat Neutral', 'Generally Neutral', 'Strongly Neutral', 'Not Applicable']}

{'name': 'Ambiguous Sentiment', 'description': 'Indicates that the sentiment of the text is unclear due to ambiguity in language or context.', 'possible_values': ['Uncertain', 'Inconclusive', 'Insufficient Information', 'Unable To Determine']}



In [6]:
from content.src.dynamic_schema import dynamic_schema
# convert the generated labels into a pydantic model, so that we can generate data that conforms to its schema

labels_schema = dynamic_schema(labels)
labels_schema.model_json_schema()

{'properties': {'Positive_Sentiment': {'description': 'Indicates that the text expresses positive emotions or opinions.',
   'enum': ['Very Positive', 'Somewhat Positive', 'Neutral', 'Negative'],
   'title': 'Positive Sentiment',
   'type': 'string'},
  'Negative_Sentiment': {'description': 'Indicates that the text expresses negative emotions or opinions.',
   'enum': ['Strongly Negative',
    'Moderately Negative',
    'Mildly Negative',
    'Very Mildly Negative'],
   'title': 'Negative Sentiment',
   'type': 'string'},
  'Neutral_Sentiment': {'description': 'Indicates that the text does not express a clear positive, neutral, or negative opinion.',
   'enum': ['Somewhat Neutral',
    'Generally Neutral',
    'Strongly Neutral',
    'Not Applicable'],
   'title': 'Neutral Sentiment',
   'type': 'string'},
  'Ambiguous_Sentiment': {'description': 'Indicates that the sentiment of the text is unclear due to ambiguity in language or context.',
   'enum': ['Uncertain',
    'Inconclusive',
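
`dynamic_schema` is a project-local helper whose source is not shown here. The following is a minimal sketch of what such a helper could look like, assuming pydantic v2 and label dictionaries shaped like the `LabelerAgent` output; `dynamic_schema_sketch` and the model name `DynamicLabels` are hypothetical, and the real implementation may differ.

```python
from typing import Any, Literal

from pydantic import BaseModel, Field, ValidationError, create_model


def dynamic_schema_sketch(labels: list[dict[str, Any]]) -> type[BaseModel]:
    """Build a pydantic model with one Literal-typed field per label category."""
    fields: dict[str, Any] = {}
    for label in labels:
        # Field names must be valid identifiers, so spaces become underscores.
        field_name = label["name"].replace(" ", "_")
        # Literal[("a", "b")] is equivalent to Literal["a", "b"], so the field
        # accepts only the listed values.
        allowed = Literal[tuple(label["possible_values"])]
        fields[field_name] = (allowed, Field(description=label["description"]))
    return create_model("DynamicLabels", **fields)


example_labels = [
    {
        "name": "Positive Sentiment",
        "description": "Indicates that the text expresses positive emotions or opinions.",
        "possible_values": ["Very Positive", "Somewhat Positive", "Neutral", "Negative"],
    }
]
LabelModel = dynamic_schema_sketch(example_labels)
print(LabelModel.model_json_schema()["properties"]["Positive_Sentiment"]["enum"])

# Values outside the enum are rejected at validation time.
try:
    LabelModel(Positive_Sentiment="Ecstatic")
except ValidationError:
    print("rejected a value outside the enum")
```

Because the result is a model class rather than a plain dict, it can be nested inside another model (as `DataGenerationAgent.schema` does with `List[self.label_schema]`) and still expose `model_json_schema()` for the prompt.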


In [10]:
class DataGenerationAgent(Agent):
    def __init__(
        self,
        model,
        topic: str,
        label_schema: type[BaseModel],  # pydantic model class, e.g. the result of dynamic_schema
        num_samples: int = 10,  # number of samples to generate
        text_strategy: str = "sentences",
    ):
        super().__init__(model)
        self.topic = topic
        self.label_schema = label_schema
        self.num_samples = num_samples
        self.text_strategy = text_strategy

    def system(self):
        return ""  # no system prompt for this agent; the interface expects a str

    def prompt(self):
        return f"Generate {self.num_samples} unique sample {self.text_strategy} as if from realistic sources. The samples should mimic data extracted from social media posts, official documents or statements online. It should be connected to the topic of '{self.topic}'. Use the annotation definitions below: {self.label_schema.model_json_schema()}. Think creatively, and avoid similar language. Output in JSON."

    def schema(self):
        class DataSample(BaseModel):
            text: str
            # text: str = Field(
            #     title="Text",
            #     description="The text of the data sample on the topic of {TOPIC}.",
            #     # min_length=200,
            #     # max_length=500,
            # )
            labels: List[self.label_schema]

        class DataGenerationSchema(BaseModel):
            samples: List[DataSample]

        return DataGenerationSchema.model_json_schema()

    def __call__(
        self,
        output_key: str = "samples",
    ):
        output = self.generate(
            system_prompt=self.system(),
            prompt=self.prompt(),
            schema=self.schema(),
            model=self.model,
            num_ctx=4000,
            num_predict=3000,
            temperature=1.0,  # we keep a high temp for more "creative" text generation
        )
        if output:
            return output[output_key]


generated_data = DataGenerationAgent(
    model=model,
    topic="sentiment analysis",
    label_schema=labels_schema,
    num_samples=5,
    text_strategy="sentences",
)()
generated_data

[{'text': 'I am beyond thrilled with the new restaurant opening downtown! The service was top-notch.',
  'labels': [{'Positive_Sentiment': 'Very Positive',
    'Negative_Sentiment': 'Mildly Negative',
    'Neutral_Sentiment': 'Not Applicable',
    'Ambiguous_Sentiment': 'Uncertain'}]},
 {'text': 'The new policy has caused a significant amount of inconvenience to our team. We are disappointed with the lack of communication.',
  'labels': [{'Positive_Sentiment': 'Very Positive',
    'Negative_Sentiment': 'Strongly Negative',
    'Neutral_Sentiment': 'Not Applicable',
    'Ambiguous_Sentiment': 'Insufficient Information'}]},
 {'text': '#JusticeForClimate: A group of activists gathered outside the parliament to raise awareness about climate change.',
  'labels': [{'Positive_Sentiment': 'Very Positive',
    'Negative_Sentiment': 'Strongly Negative',
    'Neutral_Sentiment': 'Not Applicable',
    'Ambiguous_Sentiment': 'Inconclusive'}]},
 {'text': "'We are committed to transparency and accou

In [11]:
import pandas as pd

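# Column order: the sample's own keys (minus "labels"), then one column per label category.
columns = list(generated_data[0].keys()) + list(generated_data[0]["labels"][0].keys())
columns.remove("labels")

# Flatten: one row per (sample, label dict) pair.
data = []
for sample in generated_data:
    for label in sample["labels"]:
        data.append({**sample, **label})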
df = pd.DataFrame(data, columns=columns)
df

| | text | Positive_Sentiment | Negative_Sentiment | Neutral_Sentiment | Ambiguous_Sentiment |
| --- | --- | --- | --- | --- | --- |
| 0 | I am beyond thrilled with the new restaurant o... | Very Positive | Mildly Negative | Not Applicable | Uncertain |
| 1 | The new policy has caused a significant amount... | Very Positive | Strongly Negative | Not Applicable | Insufficient Information |
| 2 | #JusticeForClimate: A group of activists gathe... | Very Positive | Strongly Negative | Not Applicable | Inconclusive |
| 3 | 'We are committed to transparency and accounta... | Somewhat Positive | Very Mildly Negative | Strongly Neutral | Unable To Determine |
| 4 | "It's a tough time for our company, but we are... | Neutral | Mildly Negative | Somewhat Neutral | Uncertain |
