# Audio Capabilities with Azure OpenAI's gpt-4o-audio-preview Model

The gpt-4o-audio-preview model adds the audio modality to the existing /chat/completions API. It extends AI applications to text and voice interactions as well as audio analysis. The model supports three modalities: text, audio, and text + audio.

> https://learn.microsoft.com/en-us/azure/ai-services/openai/audio-completions-quickstart?tabs=keyless%2Cwindows%2Ctypescript-keyless&pivots=programming-language-python

In [1]:
import base64
import os

from dotenv import load_dotenv
from IPython.display import Audio
from openai import AzureOpenAI

## Settings

In [2]:
load_dotenv("azure.env")

api_key = os.getenv("api_key")
endpoint = os.getenv("endpoint")
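Since `os.getenv` returns `None` for a missing variable, a typo in `azure.env` would only show up later as a failed request. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper name, not part of the original notebook or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Treat both "unset" and "empty string" as a configuration error.
        raise RuntimeError(f"Missing environment variable: {name!r}")
    return value
```

It could be used as `api_key = require_env("api_key")` and `endpoint = require_env("endpoint")` after `load_dotenv("azure.env")`.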

In [3]:
client = AzureOpenAI(api_version="2025-01-01-preview",
                     api_key=api_key,
                     azure_endpoint=endpoint)

## Text to Text/Audio

In [4]:
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "alloy",
        "format": "mp3"
    },
    messages=[{
        "role": "user",
        "content": "Where is Microsoft HQ located?"
    }])

In [5]:
print(completion.choices[0].message.audio.transcript)

Microsoft's headquarters is located in Redmond, Washington, USA. The address is One Microsoft Way, Redmond, WA 98052. It's a large campus that serves as the main hub for Microsoft's operations.
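This notebook reads the reply from `message.audio.transcript` when audio output is requested and from `message.content` in the text-only follow-up of the multi-turn section. The sketch below wraps that choice in one function; `get_reply_text` is a hypothetical helper, and the assumption is that `message.audio` is absent or `None` whenever no audio was returned.

```python
from typing import Optional


def get_reply_text(completion) -> Optional[str]:
    """Return the audio transcript if audio is present, else the text content."""
    message = completion.choices[0].message
    # Assumption: responses without audio expose no `audio` attribute or set it to None.
    audio = getattr(message, "audio", None)
    if audio is not None:
        return audio.transcript
    return message.content
```

With this helper, `print(get_reply_text(completion))` would work for both kinds of request shown in this notebook.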


In [6]:
audio_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("answer1.mp3", "wb") as f:
    f.write(audio_bytes)

In [7]:
Audio("answer1.mp3", autoplay=False)
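The decode-and-write steps are repeated for every audio response in this notebook. A minimal sketch of a reusable helper follows; `save_b64_audio` is a hypothetical name introduced here for illustration.

```python
import base64
from pathlib import Path


def save_b64_audio(b64_data: str, path: str) -> int:
    """Decode base64 audio data, write it to `path`, and return the byte count."""
    audio_bytes = base64.b64decode(b64_data)
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

For example, `save_b64_audio(completion.choices[0].message.audio.data, "answer1.mp3")` would replace the cell that writes `answer1.mp3`.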

In [17]:
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "alloy",
        "format": "wav"
    },
    messages=[{
        "role": "user",
        "content": "Quelle est la capitale de la France?"
    }])

In [18]:
audio_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("answer2.wav", "wb") as f:
    f.write(audio_bytes)

In [19]:
print(completion.choices[0].message.audio.transcript)

La capitale de la France est Paris.


In [20]:
Audio("answer2.wav", autoplay=False)
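When the output format is `wav`, the decoded bytes can be inspected without playing them. The sketch below uses Python's standard `wave` module to compute the clip duration. It assumes the service returns a standard PCM RIFF/WAVE file, which has not been verified here; `wav_duration_seconds` is a hypothetical helper name.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration in seconds of an in-memory PCM WAV file."""
    # Assumption: the bytes form a standard RIFF/WAVE container readable by `wave`.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

It could be called as `wav_duration_seconds(audio_bytes)` right after the base64 decoding step.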

## Audio to Text/Audio

In [11]:
Audio('callcenter.mp3', autoplay=False)

In [12]:
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "echo",
        "format": "mp3"
    },
    messages=[
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "Generate a transcript of all the discussion"
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ])

print(completion.choices[0].message.audio.transcript)

Good day. Welcome to Contoso, my name is John Doe. How can I help you today?

Yes, good day. My name is Maria Smith. I would like to inquire about my current point balance.

No problem. I am happy to help. I need your date of birth to confirm your identity.

It is April 19th, 1988.

Great. Your current point balance is 599 points. Do you need any more information?

No, thank you. That was all. Goodbye.

You're welcome. Goodbye at Contoso.
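The audio-input cells in this section all build the same message shape: a text prompt plus a base64-encoded `input_audio` part. A minimal sketch of a builder follows; `build_audio_message` is a hypothetical helper, and it assumes the file extension matches the actual audio encoding.

```python
import base64
from pathlib import Path


def build_audio_message(prompt: str, audio_path: str) -> dict:
    """Build a user message that pairs a text prompt with an audio file."""
    path = Path(audio_path)
    # Assumption: the extension ("mp3", "wav") names the real audio format.
    audio_format = path.suffix.lstrip(".").lower()
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            },
        ],
    }
```

The result could be passed as `messages=[build_audio_message("Describe in detail this call center discussion.", "callcenter.mp3")]`.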


In [13]:
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "echo",
        "format": "mp3"
    },
    messages=[
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What is the ask? What are the people's names?"
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ])

print(completion.choices[0].message.audio.transcript)

The ask here is for support in checking the point balance of a customer named Maria Smith. John Doe, the representative, needs the customer's date of birth to verify identity before providing the requested information.


In [14]:
# Read and encode audio file
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "echo",
        "format": "wav"
    },
    messages=[
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "Describe in detail this call center discussion."
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ])

print(completion.choices[0].message.audio.transcript)

# Write the output audio data to a file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("analysis.wav", "wb") as f:
    f.write(wav_bytes)

In the call center discussion, a representative named John Doe welcomes a customer, Maria Smith. Maria inquires about her current point balance with the company, Contoso. John, after confirming Maria's date of birth for identity verification, informs her that she has a total of 599 points. Maria does not require any additional information, and the call ends with a polite exchange of farewells.


In [15]:
Audio("analysis.wav", autoplay=False)

## Generate audio and use multi-turn chat completions

In [16]:
with open('callcenter.mp3', 'rb') as audio_file:
    encoded_string = base64.b64encode(audio_file.read()).decode('utf-8')

messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "Describe in detail the spoken audio input."
    }, {
        "type": "input_audio",
        "input_audio": {
            "data": encoded_string,
            "format": "mp3"
        }
    }]
}]

# Get the first turn's response

completion = client.chat.completions.create(model="gpt-4o-audio-preview",
                                            modalities=["text", "audio"],
                                            audio={
                                                "voice": "shimmer",
                                                "format": "mp3"
                                            },
                                            messages=messages)

print("Get the first turn's response:")
print(completion.choices[0].message.audio.transcript)
print()

print("Add a history message referencing the first turn's audio by ID:")
print(completion.choices[0].message.audio.id)
print()

# Add a history message referencing the first turn's audio by ID
messages.append({
    "role": "assistant",
    "audio": {
        "id": completion.choices[0].message.audio.id
    }
})

# Add the next turn's user message
messages.append({"role": "user", "content": "Summarize the discussion."})

# Send the follow-up request with the accumulated messages.
# No modalities are passed here, so the reply is read as text from message.content.
completion = client.chat.completions.create(model="gpt-4o-audio-preview",
                                            messages=messages)

print()
print("Summarize the discussion")
print(completion.choices[0].message.content)

Get the first turn's response:
The audio input depicts a dialogue between two speakers. The first speaker greets the listener, mentions they're from "Contoso," and introduces themselves as John Doe, offering assistance. The second speaker, Maria Smith, inquires about her points balance. John asks for Maria's date of birth to verify her identity. After Maria provides the date, John informs her of her current points balance, which is 599 points. The conversation concludes with Maria declining further assistance and the speakers exchanging pleasantries before saying goodbye.

Add a history message referencing the first turn's audio by ID:
audio_67a479d9f8948190b0dce81753888d51


Summarize the discussion
{"Temperature":1.0,"Length":900,"Timestamp":"2023-10-07T17:28:11.880Z"}The discussion is a brief customer service exchange. John Doe, representing Contoso, assists Maria Smith by providing information about her current point balance. After verifying Maria's identity through her date of bir
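The multi-turn pattern above appends two entries per turn: an assistant message that references the previous audio by ID, then the next user prompt. A minimal sketch of that step as a pure function follows; `add_audio_turn` is a hypothetical helper that returns a new list instead of mutating the history.

```python
def add_audio_turn(messages: list, audio_id: str, next_prompt: str) -> list:
    """Return a new history with the assistant audio reference and next user turn."""
    return messages + [
        # Reference the earlier audio response by ID rather than resending it.
        {"role": "assistant", "audio": {"id": audio_id}},
        {"role": "user", "content": next_prompt},
    ]
```

For example, `messages = add_audio_turn(messages, completion.choices[0].message.audio.id, "Summarize the discussion.")` would replace the two `messages.append` calls.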