# Audio and speech

The OpenAI API provides a range of audio capabilities. Once we know what we want to build, we only need to call the corresponding endpoint through the SDK.

## Text to speech

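To turn text into speech, use the Audio API ``audio/speech`` endpoint. This endpoint works with dedicated text-to-speech models such as ``tts-1`` (used in the cell below), ``tts-1-hd``, and ``gpt-4o-mini-tts``, not with general chat models such as gpt-4o or gpt-4.1. With ``gpt-4o-mini-tts``, we can also ask the model to speak a certain way or with a certain tone of voice.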
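One way to steer the speaking style is sketched below. It assumes the ``instructions`` parameter of the speech endpoint and the ``gpt-4o-mini-tts`` model, neither of which appears in the cells of this notebook. The helper name ``speak_with_style`` is hypothetical, and ``client`` is expected to be an ``OpenAI()`` instance.

```python
def speak_with_style(client, text, style, path="styled.mp3"):
    """Generate speech for `text`, steering the delivery with free-form `style` instructions."""
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",  # assumed model that honors `instructions`
        voice="alloy",
        input=text,
        instructions=style,       # e.g. "Speak in a calm, reassuring tone."
    )
    # The endpoint returns mp3 by default, so the file name uses that extension
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```

With a configured client, ``speak_with_style(OpenAI(), "Welcome back!", "Speak in a cheerful tone.")`` would write the audio to ``styled.mp3``.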

### Add audio to our existing application

Models such as GPT-4o and GPT-4.1 are natively multimodal, meaning that they can understand and generate multiple modalities as input and output.

If we already have a text-based LLM application built on the Chat Completions endpoint, we may want to add audio capabilities. For example, if our chat application supports text input, we can add audio input and output: include ``audio`` in the ``modalities`` array and use an audio-capable model, such as ``gpt-4o-audio-preview``.
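A minimal sketch of that change follows. It assumes the Chat Completions ``modalities`` and ``audio`` request parameters, the ``gpt-4o-audio-preview`` model, and a reply that carries base64-encoded audio in ``message.audio.data``. The helper names ``build_audio_request``, ``decode_audio``, and ``ask_with_audio`` are hypothetical.

```python
import base64

def build_audio_request(prompt, voice="alloy", audio_format="wav"):
    """Build keyword arguments for a Chat Completions call that returns text and audio."""
    return {
        "model": "gpt-4o-audio-preview",  # assumed audio-capable model
        "modalities": ["text", "audio"],  # ask for both a transcript and audio
        "audio": {"voice": voice, "format": audio_format},
        "messages": [{"role": "user", "content": prompt}],
    }

def decode_audio(audio_b64):
    """Decode the base64 audio payload of a reply into raw bytes."""
    return base64.b64decode(audio_b64)

def ask_with_audio(client, prompt, path="answer.wav"):
    """Send the request with an OpenAI client and save the spoken reply to `path`."""
    completion = client.chat.completions.create(**build_audio_request(prompt))
    audio_bytes = decode_audio(completion.choices[0].message.audio.data)
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return path
```

With a configured client, ``ask_with_audio(OpenAI(), "Is a golden retriever a good family dog?")`` would save the spoken answer to ``answer.wav``.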

#### Audio output from model

In [1]:
import base64
from openai import OpenAI

In [None]:
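import os
from getpass import getpass

# Never hard-code a real key in a notebook; prompt only when it is not already set
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")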

In [3]:
client = OpenAI()

In [7]:
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # voices: alloy, echo, fable, onyx, nova, shimmer
    input="Is a golden retriever a good family dog?",
    response_format="wav",  # the default is mp3, which would not match the .wav file name
)

# Save the audio bytes directly
with open("dog.wav", "wb") as f:
    f.write(response.content)

## Structured Outputs

As we know, JSON is one of the most widely used formats in the world for applications to exchange data.

Structured Outputs is a feature that ensures the model will always generate responses that adhere to our supplied JSON Schema.

In addition to supporting JSON Schema in the API, the OpenAI SDKs for Python and JavaScript make it easy to define object schemas using Pydantic and Zod, respectively. Below, we will see how to extract information from unstructured text in code. Note that the cells in this section request JSON through the system prompt and then validate the reply with Pydantic, which checks the output but does not by itself guarantee schema adherence.

### Chain of thought

We want to ask the model to output an answer in a structured, step-by-step way, to guide us (i.e., the users) through the solution.

In [18]:
from openai import OpenAI
from pydantic import BaseModel
from typing import List

client = OpenAI()

class Step(BaseModel):
    explanation: str
    output: str

class MathReasoning(BaseModel):
    steps: List[Step]
    final_answer: str

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful math tutor. Guide the user through the solution step by step. "
                "Return the result strictly as JSON in the format: "
                "{'steps': [{'explanation': str, 'output': str}], 'final_answer': str}"
            )
        },
        {
            "role": "user",
            "content": "How can I solve 8x + 7 = -23?"
        }
    ]
)

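# Extract JSON text
text_output = response.choices[0].message.content.strip()

# Optional: strip a Markdown-style code fence around the JSON
if text_output.startswith("```json"):
    text_output = text_output[len("```json"):]
if text_output.endswith("```"):
    text_output = text_output[:-3]
text_output = text_output.strip()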

# Parse JSON string into Pydantic model
math_reasoning = MathReasoning.model_validate_json(text_output)

# Print result
print(math_reasoning)


steps=[Step(explanation='First, subtract 7 from both sides of the equation to isolate terms with x.', output='8x + 7 - 7 = -23 - 7'), Step(explanation='Simplifying both sides gives:', output='8x = -30'), Step(explanation='Now, divide both sides by 8 to solve for x.', output='x = -30 / 8'), Step(explanation='Simplify the fraction.', output='x = -15 / 4')] final_answer='-15/4'
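The cell above asks for JSON in the prompt and validates the reply afterwards. To have the API itself enforce the schema, the request can carry a ``response_format`` of type ``json_schema``. The following is a sketch under that assumption: ``MATH_REASONING_SCHEMA``, ``build_response_format``, and ``solve_with_schema`` are illustrative names, and the schema mirrors the ``MathReasoning`` model by hand.

```python
import json

# JSON Schema mirroring the Step / MathReasoning models above.
# Strict mode is assumed to need every property listed in "required"
# and "additionalProperties": False on every object.
MATH_REASONING_SCHEMA = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "explanation": {"type": "string"},
                    "output": {"type": "string"},
                },
                "required": ["explanation", "output"],
                "additionalProperties": False,
            },
        },
        "final_answer": {"type": "string"},
    },
    "required": ["steps", "final_answer"],
    "additionalProperties": False,
}

def build_response_format(name, schema):
    """Wrap a JSON Schema in the structure expected by `response_format`."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }

def solve_with_schema(client, question, model="gpt-4.1"):
    """Ask the tutor a question and return the schema-constrained reply as a dict."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful math tutor. "
                                          "Guide the user through the solution step by step."},
            {"role": "user", "content": question},
        ],
        response_format=build_response_format("math_reasoning", MATH_REASONING_SCHEMA),
    )
    return json.loads(completion.choices[0].message.content)
```

The returned dictionary can still be passed to ``MathReasoning.model_validate`` to get a typed object. Recent versions of the Python SDK also offer a ``parse`` helper that accepts the Pydantic class directly as ``response_format``; check the installed SDK version before relying on it.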


### Structured data extraction

We can also define structured fields to extract from unstructured input data, such as research papers.

In [20]:
from openai import OpenAI
from pydantic import BaseModel
from typing import List

# Initialize OpenAI client
client = OpenAI()

# Define Pydantic model for structured data
class ResearchPaperExtraction(BaseModel):
    title: str
    authors: List[str]
    abstract: str
    keywords: List[str]

# Call the model to extract structured data
response = client.chat.completions.create(
    model="gpt-4.1",  # or "gpt-4", "gpt-4-1106-preview", etc.
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert at structured data extraction. "
                "You will be given unstructured text from a research paper "
                "and should extract the following fields in JSON: "
                "title, authors (list of names), abstract, keywords (list of terms)."
            )
        },
        {
            "role": "user",
            "content": """
In this paper, we present a novel transformer-based approach to climate pattern forecasting.
The study is conducted by Jane Doe and John Smith from the Institute of Atmospheric Science.
Our method significantly improves long-range climate event prediction.
Keywords include transformer, climate modeling, time-series forecasting.
"""
        }
    ]
)

# Get model output
text_output = response.choices[0].message.content.strip()

# Parse output to Pydantic model
research_paper = ResearchPaperExtraction.model_validate_json(text_output)

# Print structured output
print(research_paper)

title='A Novel Transformer-Based Approach to Climate Pattern Forecasting' authors=['Jane Doe', 'John Smith'] abstract='In this paper, we present a novel transformer-based approach to climate pattern forecasting. Our method significantly improves long-range climate event prediction.' keywords=['transformer', 'climate modeling', 'time-series forecasting']
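The cells in this section parse the model's text directly, and some of them first strip a Markdown code fence by slicing the string. Since a model may or may not wrap its JSON in such a fence, a small helper can normalize the text before validation. This is a sketch using only the standard library; ``extract_json_text`` and ``parse_model_json`` are hypothetical names.

```python
import json
import re

# Markdown code-fence marker, built from single backticks so this snippet stays fence-safe
FENCE = "`" * 3

# An optional language tag after the opening fence, then the payload, then the closing fence
_FENCED = re.compile(
    r"^" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)

def extract_json_text(raw):
    """Return the JSON text of a model reply, without any surrounding code fence."""
    text = raw.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1).strip()
    return text

def parse_model_json(raw):
    """Parse a model reply into Python data; raises json.JSONDecodeError on invalid JSON."""
    return json.loads(extract_json_text(raw))
```

The cleaned string can be passed to ``ResearchPaperExtraction.model_validate_json`` in place of the raw ``text_output``.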


### Moderation

Structured output is also useful for moderation: the model classifies an input against several categories and returns its verdict in a fixed schema.

In [21]:
from enum import Enum
from typing import Optional
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Category(str, Enum):
    violence = "violence"
    sexual = "sexual"
    self_harm = "self_harm"

class ContentCompliance(BaseModel):
    is_violating: bool
    category: Optional[Category]
    explanation_if_violating: Optional[str]

# Call OpenAI model
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "system",
            "content": "Determine if the user input violates specific guidelines and explain if they do. Respond in JSON with fields: is_violating (true/false), category (violence/sexual/self_harm/null), explanation_if_violating (string/null)."
        },
        {
            "role": "user",
            "content": "How do I prepare for a job interview?"
        }
    ]
)

# Extract and clean up the raw JSON string
text_output = response.choices[0].message.content.strip()
if text_output.startswith("```json"):
    text_output = text_output[7:]
if text_output.endswith("```"):
    text_output = text_output[:-3]

# Convert string to Pydantic model
compliance = ContentCompliance.model_validate_json(text_output)

# Print result
print(compliance)


is_violating=False category=None explanation_if_violating=None


## Conversation State

OpenAI provides a few ways to manage conversation state, which is important for preserving information across multiple messages or turns in a conversation.

### Manually manage conversation state

While each text generation request is independent and stateless (unless we are using the Assistants API), we can still implement multi-turn conversations by providing additional messages as parameters to our text generation request.

In [22]:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model = "gpt-4.1",
    messages = [
        {"role":"user","content":"knock knock."},
        {"role":"assistant","content":"Who's there?"},
        {"role":"user", "content": "Tom,"},
    ],
)

In [23]:
print(response.choices[0].message.content)

Tom, who?


In [25]:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model = "gpt-4.1",
    messages= [
        {"role":"system", "content": "You are a helpful assistant."},
        {"role":"user","content":"Tell me a joke."}
    ]
)

In [26]:
print(response.choices[0].message.content)

Why don't skeletons fight each other?

They don't have the guts!
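Because each request is stateless, a multi-turn chat needs the full history resent on every turn. A minimal sketch of that bookkeeping follows. The ``Conversation`` class and the ``send`` callable are hypothetical; ``send`` stands in for whatever function forwards the message list to the model. A canned reply function replaces the API call so the sketch runs offline.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]

class Conversation:
    """Keeps the running message list and resends it on every turn."""

    def __init__(self, send: Callable[[List[Message]], str], system: Optional[str] = None):
        self.send = send
        self.messages: List[Message] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, user_text: str) -> str:
        # Add the new user turn, send the whole history, then record the reply
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send(list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply

def canned_model(messages: List[Message]) -> str:
    """Offline stand-in for the model: reports how many messages it received."""
    return f"(saw {len(messages)} messages)"

chat = Conversation(canned_model, system="You are a helpful assistant.")
print(chat.ask("Tell me a joke."))             # history sent: system + user
print(chat.ask("Explain why this is funny."))  # history sent: system + user + assistant + user
```

With the real API, ``send`` could be a thin wrapper such as ``lambda msgs: client.chat.completions.create(model="gpt-4.1", messages=msgs).choices[0].message.content``.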


### OpenAI APIs for conversation state

OpenAI's APIs make it easier to manage conversation state automatically, so we do not have to pass earlier inputs manually with each turn of a conversation.

We can share context across generated responses with the ``previous_response_id`` parameter of the Responses API. This parameter lets us chain responses and create a threaded conversation.

In the following example, we ask the model to tell a joke. Separately, we ask the model to explain why it is funny, and the model has all the necessary context to deliver a good response. Note that the cell below reproduces this flow with the Chat Completions endpoint, so it resends the earlier turns in ``messages`` rather than using ``previous_response_id``.

In [29]:
from openai import OpenAI

client = OpenAI()

# Step 1: First message creation
response = client.chat.completions.create(
    model = "gpt-4.1",
    messages= [
        {"role":"user","content":"Tell me a joke."}
    ]
)
print(response.choices[0].message.content)

# Step 2: Follow-up message with context
second_response = client.chat.completions.create(
    model = "gpt-4.1",
    messages = [
        {"role":"user","content":"Tell me a joke."},
        {"role":"assistant","content":response.choices[0].message.content},
        {"role":"user","content":"Explain why this is funny."}
    ]
)
print(second_response.choices[0].message.content)

Why don’t skeletons fight each other?

They don’t have the guts!
Absolutely! This joke is based on a play on words—**a pun**—involving the phrase “don’t have the guts.”

- **Literal meaning:** Skeletons, by definition, are just bones. They literally have no organs, including "guts" (internal organs).
- **Figurative meaning:** The phrase “don’t have the guts” is commonly used to mean someone doesn’t have the courage to do something.
- **Humor:** The joke is funny because it combines these two meanings—skeletons physically lack guts *and* therefore “don’t have the guts” (courage) to fight.

So, the humor comes from the double meaning and the surprise of connecting a well-known idiom to a literal image involving skeletons.
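The ``previous_response_id`` parameter described in this section belongs to the Responses API rather than to Chat Completions. A minimal sketch of the same joke-and-explanation flow with it follows, assuming an SDK version that exposes ``client.responses.create`` and the ``output_text`` convenience property. The function name ``joke_and_explanation`` is hypothetical, and ``client`` is expected to be an ``OpenAI()`` instance.

```python
def joke_and_explanation(client, model="gpt-4.1"):
    """Chain two Responses API calls so the second one sees the first as context."""
    first = client.responses.create(
        model=model,
        input="Tell me a joke.",
    )
    second = client.responses.create(
        model=model,
        previous_response_id=first.id,  # links this turn to the stored first response
        input="Explain why this is funny.",
    )
    return first.output_text, second.output_text
```

Here the second request carries only the new user message; the earlier turn is referenced by its ID instead of being resent in full.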
