

# Introduction

In this post, I’ll show you how to harness the power of **Azure Speech to Text** in Python to transcribe audio files and byte streams—quickly, accurately, and with advanced features like word-level timestamps, speaker diarization, and automatic language detection.

Whether you’re building live meeting captioning, interactive voice agents, or large-scale audio analytics, you’ll gain practical skills and insights to make your solutions smarter and more responsive. Let’s dive in!

# Why Azure Speech to Text?

The Azure Speech to Text service provides the following core features:

- Real-time transcription: Instant transcription with intermediate results for live audio inputs.
- Fast transcription: Synchronous output optimized for predictable latency scenarios.
- Batch transcription: Efficient processing for large volumes of prerecorded audio.
- Custom speech: Models with enhanced accuracy for specific domains and conditions.

This article focuses on real-time transcription.

# Real-Time Transcription

Real-time speech-to-text technology enables the immediate conversion of spoken audio—captured from microphones or digital files—into structured text output as the audio is processed. This capability is essential for applications requiring low-latency transcription, such as live meeting captioning, interactive voice agents, and automated documentation systems.

By leveraging advanced voice activity detection (VAD) and streaming recognition algorithms, real-time transcription systems can deliver intermediate results, support speaker diarization, and integrate seamlessly with engineering and AI workflows. These solutions are optimized for scenarios where rapid feedback, continuous monitoring, and integration with downstream analytics or automation pipelines are required, ensuring both accuracy and responsiveness in dynamic environments.

# Setting Up Azure Speech SDK in Python

First, install the Azure Speech SDK:

```shell
pip install azure-cognitiveservices-speech
```
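
The snippets in the rest of this post assume a `speech_config` object already exists. Here is a minimal sketch of gathering the credentials for it, assuming they live in the `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` environment variables (the names used in the complete listing at the end of this post). The helper name `load_speech_settings` is hypothetical, not part of the SDK.

```python
import os


def load_speech_settings(env=None):
    """Read the Speech resource key and region, failing early if either is missing."""
    env = os.environ if env is None else env
    settings = {
        "subscription": env.get("AZURE_SPEECH_KEY"),
        "region": env.get("AZURE_SPEECH_REGION"),
    }
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"Missing Speech settings: {', '.join(missing)}")
    return settings


# With the SDK installed, the settings map directly onto SpeechConfig:
#   import azure.cognitiveservices.speech as speechsdk
#   speech_config = speechsdk.SpeechConfig(**load_speech_settings())
```

Failing early here is a design choice: a missing key otherwise only surfaces later as a canceled recognition result.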

The Azure AI Speech SDK provides two primary classes for speech transcription:

- `speechsdk.transcription.ConversationTranscriber`: Supports advanced features such as speaker diarization, automatic language identification, and word-level offsets.
- `speechsdk.SpeechRecognizer`: Provides base speech transcription functionality.


# Choosing Between ConversationTranscriber and SpeechRecognizer

**SpeechRecognizer** is best for:
- Simple, single-speaker transcription scenarios (e.g., dictation, command recognition).
- Basic transcription tasks where advanced features are not required.
- Use cases where you only need the recognized text and do not need speaker identification or advanced metadata.

**ConversationTranscriber** is best for:
- Multi-speaker conversations (e.g., meetings, interviews) where speaker diarization is needed.
- Scenarios requiring word-level timestamps, automatic language identification, or more detailed recognition metadata.
- Applications that need to distinguish between speakers and process advanced conversational features.


# recognize_once_async

The `SpeechRecognizer.recognize_once_async` method performs speech recognition in a non-blocking (asynchronous) mode, suitable for quick, single-shot transcription of short audio files or streams. It processes a single utterance, where the end of the utterance is automatically detected by either:

- Listening for silence at the end of the speech.
- Reaching a maximum duration of approximately 30 seconds of audio.

This approach ensures fast transcription for short segments and is best suited for scenarios requiring capture of a single spoken phrase or command without waiting for longer audio input.

Here’s how to use it:

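```python
import azure.cognitiveservices.speech as speechsdk

# speech_config is a speechsdk.SpeechConfig created with your resource key and region
audio_input = speechsdk.AudioConfig(filename="output.wav")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
# result is a speechsdk.SpeechRecognitionResult
result = speech_recognizer.recognize_once_async().get()
```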

The recognized text is available in the `text` property of the returned `SpeechRecognitionResult` (`result.text` above). For robust error handling, check `result.reason` and handle each possible outcome:

- **Recognized Speech** (`speechsdk.ResultReason.RecognizedSpeech`):
  - Output the recognized text for further processing or display.
- **No Match** (`speechsdk.ResultReason.NoMatch`):
  - Notify the user or system that no speech could be recognized, enabling fallback or retry mechanisms.
- **Recognition Canceled** (`speechsdk.ResultReason.Canceled`):
  - Log the cancellation reason and error details for diagnostics.
  - Advise on configuration issues, such as missing resource keys or endpoint values, to facilitate troubleshooting.

Implementing structured error handling ensures application reliability, facilitates debugging, and provides meaningful feedback for both users and downstream systems.

# Use continuous recognition vs recognize_once_async

`SpeechRecognizer.recognize_once_async` is ideal for quick, single-shot transcription of short audio files or streams. For longer audio, use continuous recognition, which provides:

- **Continuous transcription**: Handles longer audio and ongoing conversations, not limited to a single utterance.
- **Control over stopping**: Recognition runs until you stop it, so you decide when it ends, possibly using a separate voice activity detector.
- **Real-time feedback**: Provides intermediate results and updates as the conversation progresses.
- **Event-driven architecture**: Uses callback functions to handle events like transcribed text, cancellations, and session stops, allowing for more flexible and interactive workflows.

The examples in this post use the `speechsdk.transcription.ConversationTranscriber` class for continuous recognition. To stop recognition, call `stop_transcribing_async()` and wait on the returned future.

## Note on Callback Functions
Callback functions allow you to process transcribed text as it arrives, handle errors, and manage session lifecycle events.

The following callbacks are available:

- **transcribing**: Triggered when intermediate transcription results are available. Useful for real-time feedback as speech is being processed.
- **transcribed**: Triggered when a final transcription result is available. Use this to handle the completed transcription of an utterance.
- **canceled**: Triggered when recognition is canceled due to errors or interruptions. Provides details for diagnostics and error handling.
- **session_started**: Triggered when a recognition session starts. Useful for initializing resources or logging session activity.
- **session_stopped**: Triggered when a recognition session stops. Use this to clean up resources or finalize session logs.
- **speech_start_detected**: Triggered when the start of speech is detected in the audio stream. Can be used to mark the beginning of an utterance.
- **speech_end_detected**: Triggered when the end of speech is detected. Useful for segmenting utterances and managing recognition boundaries.

```python
conversation_transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_input
)

# ... callback definitions elided ...

# Connect callbacks to ConversationTranscriber events
conversation_transcriber.transcribing.connect(_on_recognizing)
conversation_transcriber.transcribed.connect(_on_recognized)
conversation_transcriber.session_started.connect(_session_started)
conversation_transcriber.canceled.connect(canceled_callback)
conversation_transcriber.session_stopped.connect(session_stopped)

# Start continuous transcription and wait for the start request to complete
logger.info("starting transcription")
conversation_transcriber.start_transcribing_async().get()

# Poll until the session_stopped or canceled callback sets transcribing_stop
while not transcribing_stop:
    time.sleep(0.5)

logger.info("completed transcribing")
conversation_transcriber.stop_transcribing_async().get()
logger.info("transcription stopped")
```
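
The polling loop above works, but a `threading.Event` lets the main thread block until a callback signals completion, and it adds a timeout at no extra cost. The sketch below runs without the SDK: a `threading.Timer` stands in for the SDK thread that would invoke `session_stopped` or `canceled`, and the names `done` and `on_session_stopped` are illustrative.

```python
import threading

done = threading.Event()


def on_session_stopped(evt):
    """Callback body: record that the session ended and wake the waiting thread."""
    done.set()


# Stand-in for the SDK invoking the callback from its own thread.
# With the real SDK you would instead connect the callback:
#   conversation_transcriber.session_stopped.connect(on_session_stopped)
#   conversation_transcriber.canceled.connect(on_session_stopped)
timer = threading.Timer(0.1, on_session_stopped, args=(None,))
timer.start()

# Block until the callback fires, or give up after the timeout.
finished = done.wait(timeout=5.0)
timer.join()
print("session finished" if finished else "timed out waiting for session_stopped")
```

The timeout guards against a session that never reports a stop, which a bare `while not flag` loop would wait on forever.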

# Alternate Hypothesis 

Alternate hypotheses in speech recognition refer to multiple possible transcriptions for a given audio segment, each with an associated confidence score. Instead of returning only the most likely transcript, Azure Speech to Text provides a ranked list of alternatives, allowing applications to access other plausible interpretations of the spoken input.

Azure Speech to Text typically includes up to 5 alternate hypotheses for each utterance in the recognition result. These are found in the `NBest` array of the result's JSON property. Each hypothesis contains fields such as `Display`, `Lexical`, `Text`, `ITN`, and `Confidence`, enabling detailed analysis and selection based on application needs.

The top hypothesis (highest confidence) is returned in `result.text`, while all alternatives can be accessed by parsing the `result.json` property.

## Explanation of Fields in NBest

Each element in the `NBest` array contains several fields:

- **Display**: The formatted transcript as it would appear to a user, with punctuation and capitalization.
- **Lexical**: The raw transcript with minimal formatting, typically all lowercase and without punctuation.
- **Text**: Usually similar to `Display`, but may differ depending on the service version.
- **ITN (Inverse Text Normalization)**: The transcript converted to a normalized form suitable for further processing (e.g., numbers as digits).
- **Confidence**: A score (0.0 to 1.0) indicating the system's confidence in the accuracy of the hypothesis.

To enable alternate hypotheses, set the output format to `Detailed`:

```python
speech_config.output_format = speechsdk.OutputFormat.Detailed
```

The value returned in `result.text` is the transcript of the highest-confidence alternative, which corresponds to the `Display` field of the first element in the `NBest` array (`NBest[0]['Display']`). To access other alternatives, parse the `result.json` property.

```python
import json

result_json = json.loads(result.json)
# Print all alternate hypotheses in NBest
for idx, alt in enumerate(result_json.get('NBest', [])):
    display_text = alt.get('Display', alt.get('Lexical', alt.get('Text', '')))
    confidence = alt.get('Confidence')
    print(f"Alternative {idx + 1}: {display_text} | Confidence: {confidence}")
```
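
In practice you often want only the alternatives that are plausible enough to act on. The following sketch filters `NBest` by a confidence threshold. The sample payload is hypothetical (made-up text and scores in the shape described above), and the helper name `plausible_alternatives` and the 0.5 threshold are assumptions for illustration.

```python
import json

# Hypothetical payload shaped like the Detailed-format JSON described above.
sample_json = json.dumps({
    "DisplayText": "What's the weather like today?",
    "NBest": [
        {"Confidence": 0.93, "Lexical": "what's the weather like today",
         "Display": "What's the weather like today?"},
        {"Confidence": 0.61, "Lexical": "what's the whether like today",
         "Display": "What's the whether like today?"},
        {"Confidence": 0.22, "Lexical": "watts the weather like today",
         "Display": "Watts the weather like today?"},
    ],
})


def plausible_alternatives(result_json, min_confidence=0.5):
    """Return (display, confidence) pairs at or above the threshold, best first."""
    payload = json.loads(result_json)
    alternatives = [
        (alt.get("Display", ""), alt.get("Confidence", 0.0))
        for alt in payload.get("NBest", [])
        if alt.get("Confidence", 0.0) >= min_confidence
    ]
    return sorted(alternatives, key=lambda pair: pair[1], reverse=True)


for text, confidence in plausible_alternatives(sample_json):
    print(f"{confidence:.2f}  {text}")
```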



# Word-Level Timestamps

Word-level timestamps provide precise timing information for each word in the transcribed text, including start and end times. This is useful for applications requiring detailed synchronization, such as:

- Video captioning and subtitle alignment
- Audio analysis and phoneme segmentation
- Synchronizing transcripts with other media (e.g., video editing, search)
- Building interactive transcripts and word-level navigation


## Enable Word-Level Timestamps

To enable word-level timestamps in Azure Speech to Text, set the `SpeechServiceResponse_RequestWordLevelTimestamps` property to `"true"` on the `SpeechConfig`:

```python
speech_config.set_property(
    speechsdk.PropertyId.SpeechServiceResponse_RequestWordLevelTimestamps, "true"
)
```

## How Word-Level Offsets Work

When you enable word-level timestamps in Azure Speech to Text, the recognition result includes a `json` property. This property contains a detailed JSON structure with timing data for each word. The relevant information is found in the `NBest` array:


- **Words**: Each word object includes:
    - `Word`: The recognized word
    - `Offset`: The start time of the word (in 100-nanosecond ticks)
    - `Duration`: The duration of the word (in ticks)
    - `Confidence`: Confidence score for the word

- **NBest**: This array contains alternative recognition hypotheses for the utterance. Each element includes a `Words` array, which holds word-level details. 


## Example: Extracting Word-Level Offsets

```python
import json
result_json = json.loads(result.json)
# Select only the highest-confidence alternative (NBest[0])
if 'NBest' in result_json and result_json['NBest']:
    best_alternative = result_json['NBest'][0]
    for word in best_alternative.get('Words', []):
        print(f"Word: {word['Word']}, Start: {word['Offset']}, Duration: {word['Duration']}, Confidence: {word['Confidence']}")
```
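
Because `Offset` and `Duration` are reported in 100-nanosecond ticks, most consumers (captions, players, editors) need them converted to seconds first. Here is a small sketch of that conversion; the `sample_words` data is hypothetical and only mirrors the field names described above, and `word_timings` is an illustrative helper name.

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def word_timings(words):
    """Convert Offset/Duration ticks into (word, start_seconds, end_seconds) tuples."""
    timings = []
    for word in words:
        start = word["Offset"] / TICKS_PER_SECOND
        end = (word["Offset"] + word["Duration"]) / TICKS_PER_SECOND
        timings.append((word["Word"], start, end))
    return timings


# Hypothetical Words array in the shape described above.
sample_words = [
    {"Word": "what's", "Offset": 5_000_000, "Duration": 3_000_000},
    {"Word": "the", "Offset": 8_000_000, "Duration": 1_500_000},
    {"Word": "weather", "Offset": 9_500_000, "Duration": 4_000_000},
]

for word, start, end in word_timings(sample_words):
    print(f"{start:6.2f}s - {end:6.2f}s  {word}")
```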




# Speaker Diarization

Diarization distinguishes between different speakers in a conversation. The speaker ID is a generic identifier assigned to each participant by the service during recognition as speakers are identified from the audio content.

## Enabling Speaker Diarization

```python
speech_config.set_property(
    speechsdk.PropertyId.SpeechServiceResponse_DiarizeIntermediateResults, "true"
)
```
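The property above requests speaker labels on intermediate (`transcribing`) results. Final (`transcribed`) results from `ConversationTranscriber` carry a speaker ID either way: in the sample output at the end of this post, produced with the property left off, intermediate results show `Unknown` and final results show `Guest-1`.

The speaker ID for each recognized segment is available in the callback event as `evt.result.speaker_id`. Read it inside your `transcribing` or `transcribed` callback to label each transcript segment by speaker in multi-speaker scenarios.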
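
Once each segment has a speaker ID, a common next step is to merge consecutive segments from the same speaker into readable turns. The sketch below assumes you have collected `(speaker_id, text)` tuples in a `transcribed` callback. The segments shown are hypothetical, and `merge_speaker_turns` is an illustrative helper, not an SDK function.

```python
from itertools import groupby


def merge_speaker_turns(segments):
    """Merge consecutive (speaker_id, text) segments from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(segments, key=lambda seg: seg[0]):
        turns.append((speaker, " ".join(text for _, text in group)))
    return turns


# Hypothetical segments, e.g. appended as (evt.result.speaker_id, evt.result.text)
# inside a `transcribed` callback.
segments = [
    ("Guest-1", "What's the weather like today?"),
    ("Guest-1", "How do I go to that bus stop?"),
    ("Guest-2", "Take the second left."),
    ("Guest-1", "Thanks."),
]

for speaker, text in merge_speaker_turns(segments):
    print(f"{speaker}: {text}")
```

`groupby` only merges adjacent items, which is the behavior wanted here: a speaker who talks again later starts a new turn.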

# Enable Language Detection

Automatic language detection allows Azure Speech to Text to identify the spoken language in an audio stream without prior specification. This is useful for applications where the language may vary or is unknown at runtime.

## How to Enable Automatic Language Detection

To enable automatic language detection, use the `AutoDetectSourceLanguageConfig` class and pass it to the recognizer or transcriber. You can specify a list of possible languages to detect.

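```python
speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode, value='Continuous')

auto_detect_source_language_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "de-DE", "zh-CN"])
# Pass the config to the transcriber (or recognizer) when constructing it
conversation_transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_input,
    auto_detect_source_language_config=auto_detect_source_language_config)
```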

## How to find the detected language from the result

To parse the detected language from the results, you need to examine the JSON output for each turn of the conversation. In continuous recognition, the final results are typically an array of JSON objects, each representing a segment (turn) with its detected language and transcription details.

Refer to the official documentation for more details: [Azure Speech Service Language Identification](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-identification?pivots=programming-language-python&tabs=continuous)

**Example: Parsing detected language from final results**

```python
# Assume final_results is a list of JSON objects, one per turn
for turn in final_results:
    # Each turn is a dict parsed from JSON
    # The detected language is usually under 'PrimaryLanguage' or similar key
    detected_lang = turn.get('PrimaryLanguage', {}).get('Language', None)
    transcript = turn.get('DisplayText', '')
    print(f"Detected language: {detected_lang}, Transcript: {transcript}")
```

Each JSON object may look like:

```json
{
  "PrimaryLanguage": {
    "Language": "en-US",
    "Confidence": 0.98
  },
  "DisplayText": "Hello, how are you?",
  ...
}
```

This allows you to extract the detected language and transcript for every segment in a multi-turn conversation.
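
For mixed-language audio it can help to group the transcripts by detected language before further processing. This is a minimal sketch under the assumption that each turn follows the JSON shape shown above. The `sample_turns` list is hypothetical, and `transcripts_by_language` is an illustrative helper name.

```python
from collections import defaultdict


def transcripts_by_language(turns):
    """Group each turn's DisplayText under its detected language ('unknown' if absent)."""
    grouped = defaultdict(list)
    for turn in turns:
        language = turn.get("PrimaryLanguage", {}).get("Language", "unknown")
        grouped[language].append(turn.get("DisplayText", ""))
    return dict(grouped)


# Hypothetical parsed turns in the shape shown above.
sample_turns = [
    {"PrimaryLanguage": {"Language": "en-US"}, "DisplayText": "What's the weather like today?"},
    {"PrimaryLanguage": {"Language": "zh-CN"}, "DisplayText": "今天天气怎么样？"},
    {"PrimaryLanguage": {"Language": "en-US"}, "DisplayText": "How do I go to that bus stop?"},
]

for language, texts in transcripts_by_language(sample_turns).items():
    print(f"{language}: {len(texts)} turn(s)")
```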

## Bilingual Detection

Azure Speech to Text supports automatic detection among multiple languages, but it does not perform true bilingual (simultaneous multi-language) transcription within a single utterance. The service will select the most likely language from the provided list for each recognition session or utterance.

If your use case involves code-switching (speakers switching between languages mid-sentence), the service will typically recognize only one language per utterance. For best results, provide a list of expected languages and segment audio where possible.

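- Specify up to 4 candidate languages in the `AutoDetectSourceLanguageConfig` for at-start detection; the language identification documentation linked above lists a higher limit (10) for continuous mode.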
- The detected language is returned in the recognition result properties.
- For continuous recognition, language detection is performed per utterance.

The link to complete source code can be found at 

# Conclusion

Azure Speech to Text makes it easy to transcribe audio files and byte streams in Python, with powerful options for fast transcription and voice activity detection. Whether you’re building AI-powered apps, automating meeting notes, or processing audio data at scale, Azure’s SDK and APIs offer flexibility and accuracy.

Sign up for my newsletter for the latest in AI, Python, and cloud engineering. Share your experiences or questions in the comments.



```python
import json
import logging
import os
import time
import traceback

import azure.cognitiveservices.speech as speechsdk
from dotenv import load_dotenv

# Set up a logger with timestamps
logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

load_dotenv()
speech_key = os.getenv("AZURE_SPEECH_KEY")
service_region = os.getenv("AZURE_SPEECH_REGION")
endpoint_id = os.getenv("AZURE_SPEECH_ENDPOINT_ID")


def print_alternatives(payload):
    """Print every NBest alternative found in a parsed result payload."""
    for idx, alt in enumerate(payload.get('NBest', [])):
        display_text = alt.get('Display', alt.get('Lexical', alt.get('Text', '')))
        print(f"Alternative {idx + 1}: {display_text} | Confidence: {alt.get('Confidence')}")


def recognize_speech(mode="recognize_speech_once", source="sample1.wav",
                     alternative_hypotheses=False, word_timestamps=False,
                     diarize_intermediate=False, language_id_mode=None):
    """Transcribe a WAV file once or continuously.

    language_id_mode is a list of candidate locales for language detection, or None.
    """
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.speech_recognition_language = "en-US"
    audio_input = speechsdk.AudioConfig(filename=source)

    if word_timestamps:
        speech_config.set_property(
            speechsdk.PropertyId.SpeechServiceResponse_RequestWordLevelTimestamps, "true"
        )
    if diarize_intermediate:
        speech_config.set_property(
            speechsdk.PropertyId.SpeechServiceResponse_DiarizeIntermediateResults, "true"
        )
    auto_detect_config = None
    if language_id_mode is not None:
        speech_config.set_property(
            property_id=speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode,
            value='Continuous',
        )
        auto_detect_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
            languages=language_id_mode
        )
    if alternative_hypotheses:
        speech_config.output_format = speechsdk.OutputFormat.Detailed

    if mode == "recognize_speech_once":
        # Detailed output is required to read the NBest alternatives below
        speech_config.output_format = speechsdk.OutputFormat.Detailed
        speech_recognizer = speechsdk.SpeechRecognizer(
            speech_config=speech_config, audio_config=audio_input
        )

        try:
            result = speech_recognizer.recognize_once_async().get()
        except Exception as e:
            logger.error(f"Recognition failed: {e}")
            return

        raw_json = getattr(result, "json", "")
        logger.info(f"Raw JSON: {raw_json}")

        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print("Recognized: {}".format(result.text))
        elif result.reason == speechsdk.ResultReason.NoMatch:
            print("No speech could be recognized.")
        elif result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = result.cancellation_details
            print("Speech Recognition canceled: {}".format(cancellation_details.reason))
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print("Error details: {}".format(cancellation_details.error_details))
                print("Did you set the speech resource key and endpoint values?")

        # The JSON can be empty when recognition was canceled
        if raw_json:
            print_alternatives(json.loads(raw_json))

    else:
        conversation_transcriber = speechsdk.transcription.ConversationTranscriber(
            speech_config=speech_config,
            audio_config=audio_input,
            auto_detect_source_language_config=auto_detect_config,
        )

        results = []        # intermediate payloads
        final_results = []  # final payloads, one per utterance
        transcribing_stop = False

        def _result_payload(res):
            """Return the service JSON for a result, or a minimal fallback dict."""
            raw_json = getattr(res, "json", "")
            if raw_json:
                return json.loads(raw_json)
            # offset is reported in 100-nanosecond ticks; convert to seconds
            return {"text": res.text,
                    "start_ts": float(getattr(res, "offset", 0)) / 10_000_000.0,
                    "speaker_id": res.speaker_id}

        # Callback for intermediate results
        def _on_recognizing(evt):
            try:
                res = evt.result
                logger.info(f"Recognizing: {res.text} Speaker ID: {res.speaker_id}")
                results.append(_result_payload(res))
            except Exception:
                traceback.print_exc()

        # Callback for final results
        def _on_recognized(evt):
            res = evt.result
            if res.reason == speechsdk.ResultReason.RecognizedSpeech:
                payload = _result_payload(res)
                # The payload may be a single object or a JSON array
                if isinstance(payload, list):
                    final_results.extend(payload)
                else:
                    final_results.append(payload)
                logger.info(f"Recognized: {res.text} Speaker ID: {res.speaker_id}")
            elif res.reason == speechsdk.ResultReason.NoMatch:
                print('\tNOMATCH: Speech could not be TRANSCRIBED: {}'.format(res.no_match_details))

        # Callback for session started
        def _session_started(evt: speechsdk.SessionEventArgs):
            logger.info("Session started.")

        # Callback for session stopped
        def session_stopped(evt: speechsdk.SessionEventArgs):
            nonlocal transcribing_stop
            logger.info("Session stopped.")
            transcribing_stop = True

        # Callback for cancellation
        def canceled_callback(evt):
            nonlocal transcribing_stop
            logger.info(f"Canceled: {evt}")
            transcribing_stop = True

        # Connect callbacks to ConversationTranscriber events
        conversation_transcriber.transcribing.connect(_on_recognizing)
        conversation_transcriber.transcribed.connect(_on_recognized)
        conversation_transcriber.session_started.connect(_session_started)
        conversation_transcriber.canceled.connect(canceled_callback)
        conversation_transcriber.session_stopped.connect(session_stopped)

        # Start continuous transcription
        logger.info("starting")
        conversation_transcriber.start_transcribing_async().get()

        # Wait until a session_stopped or canceled callback sets the flag
        while not transcribing_stop:
            time.sleep(0.5)

        logger.info("completed transcribing")
        conversation_transcriber.stop_transcribing_async().get()
        logger.info("stopped transcribing")

        logger.info(final_results)
        texts = [turn.get("DisplayText", turn.get("text", "")) for turn in final_results]

        # Show alternatives and word timings for the first utterance, if any
        if final_results:
            first = final_results[0]
            print_alternatives(first)
            if first.get('NBest'):
                for word in first['NBest'][0].get('Words', []):
                    print(f"Word: {word['Word']}, Start: {word['Offset']}, "
                          f"Duration: {word['Duration']}, Confidence: {word.get('Confidence')}")

        logger.info("output text: %s", texts)
        logger.info("completed script")
        return final_results


try:
    final_results = recognize_speech(source="sample3.wav", mode="continuous", word_timestamps=False,
                                     language_id_mode=["en-US", "zh-CN"], alternative_hypotheses=False)
except Exception:
    traceback.print_exc()
```





```
2025-10-06 03:45:27,098 INFO: starting 
2025-10-06 03:45:27,098 INFO: Session started.
2025-10-06 03:45:30,242 INFO: Recognizing: what's the weather like Speaker ID: Unknown
2025-10-06 03:45:30,277 INFO: Recognizing: what's the weather like today Speaker ID: Unknown
2025-10-06 03:45:30,279 INFO: Recognized: What's the weather like today? Speaker ID: Guest-1
2025-10-06 03:45:30,706 INFO: Recognizing: 今天 Speaker ID: Unknown
2025-10-06 03:45:30,899 INFO: Recognizing: 今天天气怎 Speaker ID: Unknown
2025-10-06 03:45:30,920 INFO: Recognized: 今天天气怎么样？ Speaker ID: Guest-1
2025-10-06 03:45:31,162 INFO: Recognizing: how do i go to that Speaker ID: Unknown
2025-10-06 03:45:31,267 INFO: Recognizing: how do i go to that bus stop Speaker ID: Unknown
2025-10-06 03:45:31,310 INFO: Recognized: How do I go to that bus stop? Speaker ID: Guest-1
2025-10-06 03:45:31,437 INFO: Recognizing: 请问 Speaker ID: Unknown
2025-10-06 03:45:31,636 INFO: Recognizing: 请问那个车站 Speaker ID: Unknown
2025-10-06 03:45:31,730 INFO: R

2025-10-06 03:46:16,734 INFO: Recognizing: what's the weather like Speaker ID: Unknown
2025-10-06 03:46:16,767 INFO: Recognizing: what's the weather like today Speaker ID: Unknown
2025-10-06 03:46:16,768 INFO: Recognized: What's the weather like today? Speaker ID: Guest-1
2025-10-06 03:47:07,903 INFO: Recognizing: jing jing Speaker ID: Unknown
2025-10-06 03:47:23,706 INFO: Recognizing: 今天 Speaker ID: Unknown
2025-10-06 03:47:27,103 INFO: Recognizing: 今天天气怎 Speaker ID: Unknown
2025-10-06 03:47:33,492 INFO: Recognized: 今天天气怎么样？ Speaker ID: Guest-1
2025-10-06 03:48:39,617 INFO: Recognizing: how Speaker ID: Unknown
2025-10-06 03:48:43,414 INFO: Recognizing: how do Speaker ID: Unknown
2025-10-06 03:48:48,014 INFO: Recognizing: how do i Speaker ID: Unknown
2025-10-06 03:48:50,414 INFO: Recognizing: how do i go Speaker ID: Unknown
2025-10-06 03:48:50,522 INFO: Recognizing: how do i go to Speaker ID: Unknown
2025-10-06 03:48:56,124 INFO: Recognizing: how do i go to that Speaker ID: Unknown
202
```