# Audio Capabilities with Azure OpenAI's gpt-4o-audio-preview Model

The gpt-4o-audio-preview model adds the audio modality to the existing /chat/completions API. This lets AI applications combine text and voice interaction and analyze audio content. The model supports three modality combinations: text, audio, and text + audio.

> https://learn.microsoft.com/en-us/azure/ai-services/openai/audio-completions-quickstart?tabs=keyless%2Cwindows%2Ctypescript-keyless&pivots=programming-language-python

In [1]:
import base64
import os

from dotenv import load_dotenv
from IPython.display import Audio
from openai import AzureOpenAI

## Settings

In [2]:
load_dotenv("azure.env")

api_key = os.getenv("api_key")
endpoint = os.getenv("endpoint")
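If `azure.env` is missing or a variable name is misspelled, `os.getenv` silently returns `None` and the failure only shows up later, when the client is used. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of the original notebook or of any SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust the file name to your own setup.
        raise RuntimeError(f"Environment variable {name!r} is not set; check azure.env")
    return value
```

With this helper, the two lookups above could be written as `api_key = require_env("api_key")` and `endpoint = require_env("endpoint")`.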

In [3]:
client = AzureOpenAI(api_version="2025-01-01-preview",
                     api_key=api_key,
                     azure_endpoint=endpoint)

## Text to Text/Audio

In [4]:
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "alloy",
        "format": "mp3"
    },
    messages=[{
        "role": "user",
        "content": "Where is Microsoft HQ located?"
    }],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

In [5]:
print(completion.choices[0].message.audio.transcript)

Microsoft's headquarters is located in Redmond, Washington, USA.


In [6]:
audio_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("answer1.mp3", "wb") as f:
    f.write(audio_bytes)

In [7]:
!ls answer1.mp3 -lh

-rwxrwxrwx 1 root root 46K Feb  6 16:01 answer1.mp3


In [8]:
Audio("answer1.mp3", autoplay=False)
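The decode-and-write step above is repeated for every response in this notebook, so it could be wrapped in a small function. The sketch below uses only the standard library; `save_audio` is a hypothetical name introduced here for illustration.

```python
import base64
from pathlib import Path


def save_audio(b64_data: str, path: str) -> int:
    """Decode base64-encoded audio, write it to `path`, and return the byte count."""
    audio_bytes = base64.b64decode(b64_data)
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

For the response above, the call would be `save_audio(completion.choices[0].message.audio.data, "answer1.mp3")`.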

In [9]:
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "alloy",
        "format": "mp3"
    },
    messages=[{
        "role": "user",
        "content": "Quelle est la capitale de la France?"
    }],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

In [10]:
print(completion.choices[0].message.audio.transcript)

La capitale de la France est Paris.


In [11]:
audio_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("answer2.mp3", "wb") as f:
    f.write(audio_bytes)

In [12]:
!ls answer2.mp3 -lh

-rwxrwxrwx 1 root root 29K Feb  6 16:01 answer2.mp3


In [13]:
Audio("answer2.mp3", autoplay=False)

## Audio to Text/Audio

In [14]:
!ls callcenter.mp3 -lh

-rwxrwxrwx 1 root root 503K Feb  6 08:11 callcenter.mp3


In [15]:
Audio('callcenter.mp3', autoplay=False)

In [16]:
with open('callcenter.mp3', 'rb') as audio_reader:
    encoded_string = base64.b64encode(audio_reader.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "echo",
        "format": "mp3"
    },
    messages=[
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "Generate a transcript of the entire discussion"
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

Good day. Welcome to Contoso. My name is John Doe. How can I help you today?

Yes, good day. My name is Maria Smith. I would like to inquire about my current point balance.

No problem. I am happy to help. I need your date of birth to confirm your identity.

It is April 19th, 1988.

Great. Your current point balance is 599 points. Do you need any more information?

No, thank you. That was all. Goodbye.

You're welcome. Goodbye at Contoso.


In [17]:
with open('callcenter.mp3', 'rb') as audio_reader:
    encoded_string = base64.b64encode(audio_reader.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "echo",
        "format": "mp3"
    },
    messages=[
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What is the ask? What are the people's names?"
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

The ask was about the current point balance in an account. Maria Smith inquired about her point balance, and after providing her date of birth for identity confirmation, she was informed that her balance was 599 points.


In [18]:
with open('callcenter.mp3', 'rb') as audio_reader:
    encoded_string = base64.b64encode(audio_reader.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "echo",
        "format": "mp3"
    },
    messages=[
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What is the sentiment analysis of this interaction?"
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

The sentiment of this interaction is polite and positive. Both parties are courteous and the conversation is handled efficiently and amicably.
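The three requests above differ only in the text prompt; the audio part of the user message is identical each time. A minimal sketch of a helper that builds that message from a prompt and raw audio bytes follows. `build_audio_message` is a hypothetical name, and the dictionary layout simply mirrors the `messages` entries used in the cells above.

```python
import base64


def build_audio_message(prompt: str, audio_bytes: bytes, audio_format: str = "mp3") -> dict:
    """Build a user message that pairs a text prompt with base64-encoded input audio."""
    encoded = base64.b64encode(audio_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            },
        ],
    }
```

The file then only needs to be read once, and each request can pass `messages=[build_audio_message("What is the sentiment analysis of this interaction?", audio_bytes)]`.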


### Saving the completion to an audio file

In [19]:
# Read and encode audio file
with open('callcenter.mp3', 'rb') as audio_reader:
    encoded_string = base64.b64encode(audio_reader.read()).decode('utf-8')

# Make the audio chat completions request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={
        "voice": "echo",
        "format": "wav"
    },
    messages=[
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "Describe this call center discussion in detail."
            }, {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }]
        },
    ],
    temperature=0.7,
    max_tokens=8000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(completion.choices[0].message.audio.transcript)

# Write the output audio data to a file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)

with open("analysis.wav", "wb") as f:
    f.write(wav_bytes)

In this call center discussion, the interaction begins with the representative, John Doe, greeting the customer and introducing himself, then asking how he can assist. The customer, Maria Smith, inquires about her current points balance. John asks for her date of birth to confirm her identity, to which she responds with her birth date, April 19th, 1988. After confirming her identity, John informs her that her current point balance is 599 points. Maria thanks him and indicates that she does not need further information, and they both end the call on a polite note.


In [20]:
!ls analysis.wav -lh

-rwxrwxrwx 1 root root 1.8M Feb  6 16:03 analysis.wav


In [21]:
Audio("analysis.wav", autoplay=False)
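To sanity-check a saved WAV file without playing it, its duration can be read from the header with the standard library `wave` module. This is a sketch under the assumption that the returned file is uncompressed PCM WAV, which is the only variant `wave` can read; `wav_duration_seconds` is a hypothetical helper name.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file, computed as frame count / frame rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

For the file written above, the call would be `wav_duration_seconds("analysis.wav")`.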

## Generate audio and use multi-turn chat completions

In [22]:
with open('callcenter.mp3', 'rb') as audio_reader:
    encoded_string = base64.b64encode(audio_reader.read()).decode('utf-8')

messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "Describe in detail the spoken audio input."
    }, {
        "type": "input_audio",
        "input_audio": {
            "data": encoded_string,
            "format": "mp3"
        }
    }]
}]

# Get the first turn's response

completion = client.chat.completions.create(model="gpt-4o-audio-preview",
                                            modalities=["text", "audio"],
                                            audio={
                                                "voice": "shimmer",
                                                "format": "mp3"
                                            },
                                            messages=messages)

print("Get the first turn's response:")
print(completion.choices[0].message.audio.transcript)
print()

print("Add a history message referencing the first turn's audio by ID:")
print(completion.choices[0].message.audio.id)
print()

# Add a history message referencing the first turn's audio by ID
messages.append({
    "role": "assistant",
    "audio": {
        "id": completion.choices[0].message.audio.id
    }
})

# Add the next turn's user message
messages.append({"role": "user", "content": "Summarize the discussion."})

# Send the follow-up request with the accumulated messages
completion = client.chat.completions.create(model="gpt-4o-audio-preview",
                                            messages=messages)

print()
print("Summarize the discussion")
print(completion.choices[0].message.content)

Get the first turn's response:
The audio input presents a conversation between two speakers. The first speaker, identified as John Doe, welcomes the caller to Contoso and offers assistance. The second speaker, Maria Smith, shares her name and inquires about her point balance. John Doe asks for Maria's date of birth for identity confirmation, and upon validation, he informs her that her current point balance is 599. Maria acknowledges the information and ends the call politely.

Add a history message referencing the first turn's audio by ID:
audio_67a4dd984d6481909465b8c4e7262517


Summarize the discussion
{"Maria Smith"} called Contoso to inquire about her point balance. After confirming her identity with her date of birth, John Doe informed her that she has 599 points. The conversation ended with a polite goodbye.
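The history-building step in the cell above (an assistant message that references the earlier audio by ID, followed by the next user message) can be factored into a function. The sketch below returns a new list rather than mutating the input, which is a design choice made here and not something the original code does; `append_audio_turn` is a hypothetical name.

```python
def append_audio_turn(messages: list, audio_id: str, next_user_text: str) -> list:
    """Return a copy of `messages` extended with an assistant audio reference and a new user turn."""
    history = list(messages)  # shallow copy so the caller's list is left untouched
    # Reference the previous audio response by ID instead of resending its content.
    history.append({"role": "assistant", "audio": {"id": audio_id}})
    history.append({"role": "user", "content": next_user_text})
    return history
```

In the cell above, the follow-up request could then use `messages=append_audio_turn(messages, completion.choices[0].message.audio.id, "Summarize the discussion.")`.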
