# Audio Capabilities with Azure OpenAI's gpt-4o-audio-preview Model

The **gpt-4o-audio-preview** and **gpt-4o-mini-audio-preview** models add the audio modality to the existing `/chat/completions` API. They open up text-based and voice-based interactions as well as audio analysis. Both models support three modalities: text, audio, and text + audio.

Here's a table of the supported modalities with example use cases:


| Modality input | Modality output | Example use case                           |
|----------------|-----------------|--------------------------------------------|
| Text           | Text + audio    | Text to speech, audio book generation      |
| Audio          | Text + audio    | Audio transcription, audio book generation |
| Audio          | Text            | Audio transcription                        |
| Text + audio   | Audio           | Audio book generation                      |
| Text + audio   | Text            | Audio transcription                        |

By using audio generation capabilities, you can achieve more dynamic and interactive AI applications. Models that support audio inputs and outputs allow you to generate spoken audio responses to prompts and use audio inputs to prompt the model.
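
As a rough sketch of how the requested output modality maps onto request parameters, the helper below assembles the `modalities` and `audio` arguments used in the cells that follow. `build_audio_kwargs` is a hypothetical name, not part of the OpenAI SDK, and the default voice and format are assumptions chosen for illustration.

```python
def build_audio_kwargs(want_audio, voice="alloy", audio_format="wav"):
    """Return keyword arguments for chat.completions.create (hypothetical helper)."""
    if not want_audio:
        # Text-only output: no audio settings are needed.
        return {"modalities": ["text"]}
    # Spoken output: request both text and audio, plus a voice and a file format.
    return {
        "modalities": ["text", "audio"],
        "audio": {"voice": voice, "format": audio_format},
    }


print(build_audio_kwargs(True, voice="echo", audio_format="mp3"))
```

The returned dictionary could be unpacked into a request with `client.chat.completions.create(model=..., messages=..., **kwargs)`.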

> https://learn.microsoft.com/en-us/azure/ai-services/openai/audio-completions-quickstart?tabs=keyless%2Cwindows%2Ctypescript-keyless&pivots=programming-language-python

In [42]:
import base64
import datetime
import openai
import os
import random
import sys

from openai import AzureOpenAI
from dotenv import load_dotenv
from IPython.display import Audio

In [43]:
print(f"Python version: {sys.version}")
print(f"OpenAI version: {openai.__version__}")

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
OpenAI version: 1.60.2


In [40]:
print(f"Today is: {datetime.datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is: 07-Feb-2025 11:14:46


## Settings

In [2]:
load_dotenv("azure.env")

api_key = os.getenv("api_key")
endpoint = os.getenv("endpoint")

In [3]:
client = AzureOpenAI(api_version="2025-01-01-preview",
                     api_key=api_key,
                     azure_endpoint=endpoint)
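
If `azure.env` is missing or incomplete, `os.getenv` returns `None` and the failure only surfaces later, at request time. A minimal sketch of an early check follows; `require_settings` is a hypothetical helper, and the variable names `api_key` and `endpoint` are simply the ones this notebook uses.

```python
import os


def require_settings(names, env=None):
    """Return the named settings, raising early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Demo with an in-memory mapping instead of the real environment.
demo_env = {"api_key": "demo-key", "endpoint": "https://example.invalid"}
settings = require_settings(["api_key", "endpoint"], env=demo_env)
print(sorted(settings))
```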

In [4]:
voices = ["alloy", "echo", "shimmer"]
voices

['alloy', 'echo', 'shimmer']

## Text to Text/Audio

In [5]:
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": random.choice(voices),
        "format": "wav"
    },
    messages=[{
        "role": "user",
        "content": "Where is Microsoft HQ located?"
    }],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

In [6]:
print(completion.choices[0].message.audio.transcript)

Microsoft's headquarters is located in Redmond, Washington, USA. This campus serves as the main hub for the company's operations, housing numerous buildings and thousands of employees. The address is One Microsoft Way, Redmond, WA 98052.


In [7]:
audio_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("msft.wav", "wb") as f:
    f.write(audio_bytes)
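
The decode-and-write step recurs throughout this notebook, so it can be wrapped in a small function. The sketch below uses a hypothetical helper name, `save_base64_audio`, and demonstrates it on placeholder bytes rather than a real model response.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_audio(b64_data, path):
    """Decode base64 audio data, write it to path, and return the byte count."""
    audio_bytes = base64.b64decode(b64_data)
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)


# Demo with placeholder bytes standing in for message.audio.data.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.bin"
    fake_payload = base64.b64encode(b"placeholder audio").decode("utf-8")
    written = save_base64_audio(fake_payload, target)
    print(written, target.exists())
```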

In [8]:
!ls msft.wav -lh

-rwxrwxrwx 1 root root 889K Feb  7 11:08 msft.wav


In [9]:
Audio("msft.wav", autoplay=False)
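
Besides checking the file size with `ls`, the duration of a WAV file can be read with Python's standard `wave` module. The sketch below assumes the file carries a standard WAV header with a valid frame count, which has not been verified here for the service's output; it is demonstrated on a locally generated silent clip, and both function names are hypothetical.

```python
import io
import wave


def wav_duration_seconds(wav_bytes):
    """Return the duration of in-memory WAV data in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silent_wav(seconds, rate=8000):
    """Build a mono, 16-bit silent WAV clip for demonstration purposes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


print(wav_duration_seconds(make_silent_wav(1.5)))
```

For a file on disk, the bytes can be obtained with `open("msft.wav", "rb").read()` before calling `wav_duration_seconds`.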

In [10]:
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": random.choice(voices),
        "format": "wav"
    },
    messages=[{
        "role": "user",
        "content": "Quelle est la capitale de la France?"
    }],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

In [11]:
print(completion.choices[0].message.audio.transcript)

La capitale de la France est Paris. C'est une ville connue pour son histoire, sa culture et ses monuments emblématiques comme la Tour Eiffel et le Louvre.


In [12]:
audio_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("paris.wav", "wb") as f:
    f.write(audio_bytes)

In [13]:
!ls paris.wav -lh

-rwxrwxrwx 1 root root 479K Feb  7 11:08 paris.wav


In [14]:
Audio("paris.wav", autoplay=False)

## Audio to Text/Audio

In [15]:
!ls callcenter.mp3 -lh

-rwxrwxrwx 1 root root 503K Feb  6 08:11 callcenter.mp3


In [16]:
Audio('callcenter.mp3', autoplay=False)

In [17]:
# Read the MP3 file and base64-encode it for the request payload
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": random.choice(voices),
        "format": "mp3"
    },
    messages=[
        {
            "role":
            "user",
            "content": [{
                "type": "text",
                "text": "Generate a transcript of all the discussion"
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

Good day. Welcome to Contoso. My name is John Doe. How can I help you today?
Yes, good day. My name is Maria Smith. I would like to inquire about my current point balance.
No problem. I am happy to help. I need your date of birth to confirm your identity.
It is April 19th, 1988.
Great. Your current point balance is 599 points. Do you need any more information?
No, thank you. That was all. Goodbye.
You're welcome. Goodbye at Contoso.
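
The user message in the request above pairs a text instruction with a base64-encoded audio part. A small sketch of a builder for that structure follows; `build_audio_message` is a hypothetical helper, and the demo bytes are placeholders rather than real audio.

```python
import base64


def build_audio_message(prompt, audio_bytes, audio_format="mp3"):
    """Build a user message that combines a text prompt with input audio."""
    encoded = base64.b64encode(audio_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            },
        ],
    }


message = build_audio_message("Generate a transcript of all the discussion", b"abc")
print(message["content"][1]["input_audio"]["format"])
```

The result can be passed as `messages=[message]`, which keeps later cells from repeating the nested dictionary.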


In [18]:
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": random.choice(voices),
        "format": "mp3"
    },
    messages=[
        {
            "role":
            "user",
            "content": [{
                "type": "text",
                "text": "What is the ask? The people names?"
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

It seems like this is a conversation between a customer and a representative of a company named Contoso. The representative, John Doe, assists the customer, Maria Smith, in checking her point balance after confirming her identity with her date of birth. The interaction ends with Maria thanking John for his help and saying goodbye.


In [19]:
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": random.choice(voices),
        "format": "mp3"
    },
    messages=[
        {
            "role":
            "user",
            "content": [{
                "type": "text",
                "text": "What is the sentiment analysis of this interaction?"
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

The sentiment of this interaction is polite and positive. Both parties communicate respectfully and the conversation is smooth, with clear and helpful responses.


### We can save the completion into an audio file

In [20]:
# Read and encode audio file
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": random.choice(voices),
        "format": "wav"
    },
    messages=[
        {
            "role":
            "user",
            "content": [{
                "type":
                "text",
                "text":
                "Describe in detail this call center discussion."
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

# Write the output audio data to a file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("callcenterdetails.wav", "wb") as f:
    f.write(wav_bytes)

In this call center discussion, John Doe, a representative of Contoso, greets the caller, Maria Smith. Maria wants to inquire about her current point balance. John asks for her date of birth to verify her identity. After providing her date of birth, Maria is informed that she has 599 points. She confirms that she doesn't need any more information, and they both exchange goodbyes, concluding the conversation.


In [21]:
!ls callcenterdetails.wav -lh

-rwxrwxrwx 1 root root 1.3M Feb  7 11:09 callcenterdetails.wav


In [22]:
Audio("callcenterdetails.wav", autoplay=False)

## Audio to audio translation

In [23]:
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": random.choice(voices),
        "format": "wav"
    },
    messages=[
        {
            "role":
            "user",
            "content": [{
                "type":
                "text",
                "text":
                "Translate this text into French text."
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

# Write the output audio data to a file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("callcentertranslation.wav", "wb") as f:
    f.write(wav_bytes)

Bonjour. Bienvenue chez Contoso. Mon nom est John Doe. Comment puis-je vous aider aujourd'hui ?  
Oui, bonjour. Je m'appelle Maria Smith. J'aimerais me renseigner sur le solde actuel de mes points.  
Pas de problème. Je suis heureux de vous aider. J'ai besoin de votre date de naissance pour confirmer votre identité.  
C'est le 19 avril 1988.  
Très bien. Votre solde actuel est de 599 points. Avez-vous besoin de plus d'informations ?  
Non merci. C'est tout. Au revoir.  
Je vous en prie. Au revoir chez Contoso.


In [24]:
!ls callcentertranslation.wav -lh

-rwxrwxrwx 1 root root 1.5M Feb  7 11:09 callcentertranslation.wav


In [25]:
Audio("callcentertranslation.wav", autoplay=False)

## Podcast generation

In [26]:
with open('podcast.txt', 'r') as file:
    podcast_txt = file.read()

print(podcast_txt)

Building on the powerful yet flexible Assistants API, Azure AI Agent Service has built-in memory management and a sophisticated interface to seamlessly integrate with popular compute platforms, bridging LLM capabilities with general purpose, programmatic actions.

Enable your agent to take actions with 1400+ Azure Logic Apps connectors: Leverage a wide ecosystem of connectors in Logic Apps to enable your agent to complete tasks and take actions on behalf of your users. With Logic apps, you simply need to define the business logic for your workflow in Azure Portal to connect your agent to external systems, tools and APIs.

Examples of connectors include Microsoft products such as Azure App Service, Dynamics365 Customer Voice, Microsoft Teams, M365 Excel, and leading enterprise services such as MongoDB, Dropbox, Jira, Gmail, Twilio, SAP, Stripe, ServiceNow and many more.

Think beyond chat mode by implement stateless or stateful code-based actions with Azure Functions: Enable your agent 

In [27]:
PROMPT = f"""
Craft an engaging podcast script featuring a conversation between two people based on the provided text. Use informal 
language to make the dialogue feel natural and human-like.

# Steps

1. **Review the Document(s) and Podcast Title**: Understand the main themes, key points, technical points, interesting facts, and overall tone.
2. **Adjust for Podcast Duration**: Generate a long conversation.
3. **Character Development**: Create two distinct personalities for the hosts.
4. **Script Structure**: Balance detailed explanations with engaging dialogue
5. **Use Informal Language**: Incorporate expressions and fillers to create a natural flow in the dialogue.
6. **Add analysis, interpretations, or expert opinions based on the document.**: Address potential applications, challenges, or implications of the information. Include rhetorical questions, thought-provoking ideas, or challenges for the audience to reflect on. Suggest related topics for future exploration.
7. **Add Humor and Emotion**: Include laughter and emotional responses to make the conversation lively. Consider how the hosts would react to the content to keep it engaging.
8. **Summarize the key takeaways from the document**: End with a call to action or resources for further learning.

This is the text to analyse: {podcast_txt}
"""

In [28]:
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": random.choice(voices),
        "format": "mp3"
    },
    messages=[{
        "role": "user",
        "content": PROMPT
    }],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

In [29]:
print(completion.choices[0].message.audio.transcript)

**Podcast Title: Tech Talk with Alex and Jamie**

**Intro Music Fades Out**

**Alex**: Hey there, tech enthusiasts! Welcome back to another episode of Tech Talk. I'm Alex, your go-to person for all things code and coffee.

**Jamie**: And I'm Jamie, the one who tries to keep up with Alex's tech lingo and, you know, bring a bit of common sense to the table. So, what's on the menu today, Alex?

**Alex**: Oh, Jamie, today we've got a feast! We're talking about the Azure AI Agent Service. Picture this: a service that brings together memory management, a swanky interface, and lets you integrate with over 1400 Azure Logic Apps connectors.

**Jamie**: Wow, that's a mouthful! So, basically, it's like giving your AI a Swiss Army knife, right? Whatever you need, there's a tool for that?

**Alex**: Exactly! You can set it up to handle tasks across various platforms, from Microsoft Teams to Gmail. The idea is to let your AI not just chat, but actually do stuff.

**Jamie**: So it's like having an as

In [30]:
audio_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("podcast.mp3", "wb") as f:
    f.write(audio_bytes)

In [31]:
Audio("podcast.mp3", autoplay=False)

## Generate audio and use multi-turn chat completions

In [32]:
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

messages = [{
    "role":
    "user",
    "content": [{
        "type": "text",
        "text": "Describe in detail the spoken audio input."
    }, {
        "type": "input_audio",
        "input_audio": {
            "data": encoded_string,
            "format": "mp3"
        }
    }]
}]

# Get the first turn's response

completion = client.chat.completions.create(model="gpt-4o-audio-preview",
                                            modalities=["text", "audio"],
                                            audio={
                                                "voice": random.choice(voices),
                                                "format": "mp3"
                                            },
                                            messages=messages)

print("Get the first turn's response:")
print(completion.choices[0].message.audio.transcript)

Get the first turn's response:
The spoken audio involves a customer service interaction. A male speaker, welcoming the customer to Contoso, introduces himself as John Doe and asks how he can assist. A female speaker, Maria Smith, inquires about her point balance. John Doe asks for her date of birth to confirm her identity, and then he informs her that she has a current point balance of 599 points. The call ends courteously after Maria expresses that she doesn't need further information.


In [33]:
print("Audio ID of the first turn's response:")
print(completion.choices[0].message.audio.id)

Audio ID of the first turn's response:
audio_67a5ea94b7048190aa5b901e8720666f


In [34]:
# Add a history message referencing the first turn's audio by ID
messages.append({
    "role": "assistant",
    "audio": {
        "id": completion.choices[0].message.audio.id
    }
})

# Add the next turn's user message
messages.append({"role": "user", "content": "Summarize the discussion."})
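
The two appends above can be folded into one step. The sketch below uses a hypothetical helper, `add_follow_up`, which returns a new list instead of mutating the existing history; the audio ID in the demo is a placeholder.

```python
def add_follow_up(messages, audio_id, user_text):
    """Return a new history with the assistant's audio reference and the next user turn."""
    return messages + [
        # Reference the previous audio response by ID instead of resending the audio.
        {"role": "assistant", "audio": {"id": audio_id}},
        {"role": "user", "content": user_text},
    ]


history = [{"role": "user", "content": "Describe in detail the spoken audio input."}]
history = add_follow_up(history, "audio_placeholder_id", "Summarize the discussion.")
print(len(history))
```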

In [35]:
messages

[{'role': 'user',
  'content': [{'type': 'text',
    'text': 'Describe in detail the spoken audio input.'},
   {'type': 'input_audio',
    'input_audio': {'data': '//uQZAAA8uNSwDGCGIIAAA0gAAABDNEbDCeIYAgAADSAAAAECCAgkhWkcYAB... [base64 audio data truncated]

In [36]:
# Send the follow-up request with the accumulated messages
completion = client.chat.completions.create(model="gpt-4o-audio-preview",
                                            messages=messages)

print("\nSummarize the discussion")
print(completion.choices[0].message.content)


Summarize the discussion
{"summary": "A customer named Maria Smith contacted Contoso's customer service to inquire\nabout her point balance. The representative, John Doe, confirmed her identity\nusing her date of birth and informed her that her current point balance is 599\npoints. The conversation concluded courteously, with Maria indicating she\nneeded no further information."}
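
In the cells above, responses that include audio are read from `message.audio.transcript`, while this text-only follow-up is read from `message.content`. A small accessor can cover both cases. The sketch below uses a hypothetical helper name and stand-in objects built with `SimpleNamespace` rather than real SDK responses.

```python
from types import SimpleNamespace


def response_text(message):
    """Return the audio transcript if one is present, otherwise the text content."""
    audio = getattr(message, "audio", None)
    if audio is not None and getattr(audio, "transcript", None):
        return audio.transcript
    return message.content


# Stand-ins that mimic the two response shapes seen in this notebook.
audio_reply = SimpleNamespace(content=None, audio=SimpleNamespace(transcript="spoken answer"))
text_reply = SimpleNamespace(content="written answer", audio=None)

print(response_text(audio_reply))
print(response_text(text_reply))
```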
