Notebook Created August 2025, Python 3.13.5

In [11]:
import json

# Read json within keys directory to pick up the keys
with open('/Users/vilourenco/Documents/GitHub/azure-ai-engineer/keys/keys.json') as f:
    keys = json.load(f)
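
The cell above hard-codes an absolute path, so it only runs on one machine. A more portable loader might look like the sketch below. It assumes the same JSON layout the later cells rely on (an `azure_foundry` object with `api_key` and `endpoint` fields). The function name `load_keys` and the environment variable name `SPEECH_KEYS_FILE` are hypothetical and exist only for this illustration.

```python
import json
import os
from pathlib import Path


def load_keys(path=None, env_var="SPEECH_KEYS_FILE"):
    """Load the keys file and check the fields this notebook reads.

    `env_var` is a hypothetical variable name used only in this sketch.
    """
    resolved = path or os.environ.get(env_var)
    if not resolved:
        raise ValueError(f"Pass a path or set the {env_var} environment variable.")
    with Path(resolved).expanduser().open(encoding="utf-8") as f:
        keys = json.load(f)
    # Later cells read keys["azure_foundry"]["api_key"] and ["endpoint"].
    section = keys.get("azure_foundry", {})
    missing = [name for name in ("api_key", "endpoint") if name not in section]
    if missing:
        raise KeyError(f"keys file is missing azure_foundry fields: {missing}")
    return keys
```

Failing early with a clear message is usually easier to debug than a `KeyError` raised later, in the middle of an SDK call.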

In [12]:
import azure.cognitiveservices.speech as speech_sdk

# Initialize the SpeechConfig with the API key and region
speech_config = speech_sdk.SpeechConfig(subscription=keys["azure_foundry"]["api_key"], region='eastus')

(Text From: https://learn.microsoft.com/en-us/training/modules/create-speech-enabled-apps/)
# Speech to Text API

The Azure AI Speech service supports speech recognition through the following features:
- **Real-time transcription**: Instant transcription with intermediate results for live audio inputs.
- **Fast transcription**: Fastest synchronous output for situations with predictable latency.
- **Batch transcription**: Efficient processing for large volumes of prerecorded audio.
- **Custom speech**: Models with enhanced accuracy for specific domains and conditions.
### Using the Azure AI Speech SDK
While the specific details vary depending on the SDK being used (Python, C#, and so on), there's a consistent pattern for using the Speech to text API:

![image.png](https://learn.microsoft.com/en-us/training/wwl-data-ai/create-speech-enabled-apps/media/speech-to-text.png)

 A diagram showing how a SpeechRecognizer object is created from a **SpeechConfig** and **AudioConfig**, and its RecognizeOnceAsync method is used to call the Speech API.
1. Use a **SpeechConfig** object to encapsulate the information required to connect to your Azure AI Speech resource. Specifically, its location and key.
2. Optionally, use an **AudioConfig** to define the input source for the audio to be transcribed. By default, this is the default system microphone, but you can also specify an audio file.
3. Use the **SpeechConfig** and **AudioConfig** to create a **SpeechRecognizer** object. This object is a proxy client for the Speech to text API.
4. Use the methods of the **SpeechRecognizer** object to call the underlying API functions. For example, the RecognizeOnceAsync() method uses the Azure AI Speech service to asynchronously transcribe a single spoken utterance.
5. Process the response from the Azure AI Speech service. In the case of the **RecognizeOnceAsync()** method, the result is a **SpeechRecognitionResult** object that includes the following properties:
- Duration
- OffsetInTicks
- Properties
- Reason
- ResultId
- Text

If the operation was successful, the **Reason** property has the enumerated value **RecognizedSpeech**, and the **Text** property contains the transcription. Other possible values for **Reason** include **NoMatch** (indicating that the audio was successfully parsed but no speech was recognized) or **Canceled**, indicating that an error occurred (in which case, you can check the **Properties** collection for the **CancellationReason** property to determine what went wrong).
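
The timing properties listed above are expressed in ticks, where one tick is 100 nanoseconds. The sketch below shows one way to turn a result into a readable summary. It assumes the Python SDK's attribute names (`reason`, `text`, `offset`, `duration`) and that `reason` is an `Enum` member whose name matches the values described above. The helper names `ticks_to_seconds` and `summarize_result` are made up for this example.

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def ticks_to_seconds(ticks):
    """Convert a tick count (100 ns units) to seconds."""
    return ticks / TICKS_PER_SECOND


def summarize_result(result):
    """Return a one-line description of a recognition result.

    Works on any object exposing `reason` (an Enum member), `text`,
    `offset` and `duration`; these names are assumed from the Python SDK.
    """
    reason = result.reason.name
    if reason == "RecognizedSpeech":
        start = ticks_to_seconds(result.offset)
        length = ticks_to_seconds(result.duration)
        return f"{result.text!r} at {start:.2f}s for {length:.2f}s"
    if reason == "NoMatch":
        return "no speech recognized"
    return f"not recognized ({reason})"
```

Comparing on the enum member's name keeps the helper free of an SDK import. In notebook code, comparing against `speech_sdk.ResultReason` members directly is the more conventional choice.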

In [13]:
# Speech to Text API

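# Configure
# Note: endpoint_id is the SDK property for a Custom Speech deployment ID;
# this notebook assigns the "endpoint" value from keys.json to it.
speech_config.endpoint_id = keys["azure_foundry"]["endpoint"]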

# Initialize the AudioConfig to use the default microphone
audio_config = speech_sdk.AudioConfig(use_default_microphone=True)

# Create a SpeechRecognizer object
speech_recognizer = speech_sdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Start speech recognition
print("Speak into your microphone.")
result = speech_recognizer.recognize_once_async().get()

# Check the result
if result.reason == speech_sdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speech_sdk.ResultReason.NoMatch:
    print("No speech could be recognized.")
elif result.reason == speech_sdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print(f"Speech Recognition canceled: {cancellation_details.reason}")
    if cancellation_details.reason == speech_sdk.CancellationReason.Error:
        print(f"Error details: {cancellation_details.error_details}")
else:
    print(f"Unexpected result: {result.reason}")
# End of code to run speech recognition

Speak into your microphone.
Recognized: Close control.
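
The cell above listens on the default microphone. Step 2 of the pattern also allows an audio file as the input source, which can be sketched as follows. The function name `build_recognizer` is hypothetical. The `sdk` argument is expected to be the imported `azure.cognitiveservices.speech` module, passed in as a parameter so the sketch can be read and exercised without the SDK installed.

```python
def build_recognizer(sdk, key, region, wav_path=None):
    """Create a SpeechRecognizer from a SpeechConfig and an AudioConfig.

    `sdk` is the imported azure.cognitiveservices.speech module.
    `wav_path` is an optional path to an audio file; when omitted,
    the default system microphone is used.
    """
    # Step 1: connection details for the Azure AI Speech resource.
    speech_config = sdk.SpeechConfig(subscription=key, region=region)
    # Step 2: audio input, either a file or the default microphone.
    if wav_path is not None:
        audio_config = sdk.audio.AudioConfig(filename=wav_path)
    else:
        audio_config = sdk.audio.AudioConfig(use_default_microphone=True)
    # Step 3: the recognizer is the client for the Speech to text API.
    return sdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
```

In this notebook it could be called as `build_recognizer(speech_sdk, keys["azure_foundry"]["api_key"], "eastus", wav_path="sample.wav")`, where `sample.wav` is a placeholder file name, followed by `recognize_once_async().get()` as in the cell above.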


# Text to Speech API

Similarly to its Speech to text APIs, the Azure AI Speech service offers other REST APIs for speech synthesis:

- The Text to speech API, which is the primary way to perform speech synthesis.
- The Batch synthesis API, which is designed to support batch operations that convert large volumes of text to audio, for example to generate an audiobook from the source text.

You can learn more about the REST APIs in the <a href="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/batch-synthesis">Text to speech REST API documentation</a>.

### Using the Azure AI Speech SDK
As with speech recognition, in practice most interactive speech-enabled applications are built using the Azure AI Speech SDK.
The pattern for implementing speech synthesis is similar to that of speech recognition:
![image.png](https://learn.microsoft.com/en-us/training/wwl-data-ai/create-speech-enabled-apps/media/text-to-speech.png)

1. Use a SpeechConfig object to encapsulate the information required to connect to your Azure AI Speech resource. Specifically, its location and key.
2. Optionally, use an AudioConfig to define the output device for the speech to be synthesized. By default, this is the default system speaker, but you can also specify an audio file. Alternatively, by explicitly setting this value to null, you can process the returned audio stream object directly.
3. Use the SpeechConfig and AudioConfig to create a SpeechSynthesizer object. This object is a proxy client for the Text to speech API.
4. Use the methods of the SpeechSynthesizer object to call the underlying API functions. For example, the SpeakTextAsync() method uses the Azure AI Speech service to convert text to spoken audio.
5. Process the response from the Azure AI Speech service. In the case of the SpeakTextAsync method, the result is a SpeechSynthesisResult object that contains the following properties:
- AudioData
- Properties
- Reason
- ResultId

When speech has been successfully synthesized, the Reason property is set to the SynthesizingAudioCompleted enumeration and the AudioData property contains the audio stream (which, depending on the AudioConfig, may have been automatically sent to a speaker or file).
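
Step 2 above notes that setting the AudioConfig to a null value lets you process the returned audio yourself. A minimal sketch of that path follows, assuming the Python SDK exposes the audio as bytes on `result.audio_data` and that `result.reason` is an `Enum` member. The helper name `save_synthesized_audio` is invented for this example.

```python
def save_synthesized_audio(result, path):
    """Write result.audio_data to `path` if synthesis completed.

    Returns the number of bytes written, or 0 if synthesis did not complete.
    `result` is any object exposing `reason` (an Enum member) and
    `audio_data` (bytes); these names are assumed from the Python SDK.
    """
    if result.reason.name != "SynthesizingAudioCompleted":
        return 0
    with open(path, "wb") as f:
        return f.write(result.audio_data)
```

With the SDK, the result would come from a synthesizer created with `audio_config=None`, for example `speech_sdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)`, so that nothing is played on the speaker. The file extension you choose should match the output format configured on the SpeechConfig.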

In [16]:
# Text to Speech API
# Initialize the SpeechConfig with the API key and region
speech_config = speech_sdk.SpeechConfig(subscription=keys["azure_foundry"]["api_key"], region='eastus')

# Initialize the AudioOutputConfig to play through the default speaker
audio_config = speech_sdk.audio.AudioOutputConfig(use_default_speaker=True)

# Create a SpeechSynthesizer object
speech_synthesizer = speech_sdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

# Text to synthesize
text_to_speak = "Hello, this is a test of the Azure AI Speech service."

# Start speech synthesis
result = speech_synthesizer.speak_text_async(text_to_speak).get()

# Check the result
if result.reason == speech_sdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesis completed successfully.")
elif result.reason == speech_sdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print(f"Speech synthesis canceled: {cancellation_details.reason}")
    if cancellation_details.reason == speech_sdk.CancellationReason.Error:
        print(f"Error details: {cancellation_details.error_details}")
else:
    print(f"Unexpected result: {result.reason}")
# End of code to run speech synthesis

Speech synthesis completed successfully.
