# ASR using Whisper in Docling

## Overview

This notebook demonstrates how to build an **Automatic Speech Recognition (ASR)** pipeline using **Whisper** models with **Docling**.

### What is ASR?

**Automatic Speech Recognition (ASR)** is the technology that converts spoken language into written text. It's the foundation of voice assistants, transcription services, and accessibility tools.

### What is Whisper?

**Whisper** is a state-of-the-art ASR model developed by OpenAI. It's trained on 680,000 hours of multilingual data and can:
- Transcribe speech in multiple languages
- Handle noisy audio
- Provide segment-level timestamps
- Work across different accents and speaking styles

### What is Docling?

**Docling** is a powerful document processing library that supports multiple input formats, including audio files. It provides a unified pipeline for converting various document types into structured formats.

---

## What This Notebook Does

This notebook will:
1. Configure an ASR pipeline with the Whisper Turbo model
2. Convert an audio file to text
3. Export the transcription to Markdown with timestamps
4. Demonstrate automatic model selection based on your hardware

---


## Prerequisites

Before running this notebook, you need to:

1. **Install Docling with ASR extras:**
   ```bash
   pip install "docling[asr]"
   ```

2. **Install ffmpeg** (required for audio processing):
   - **macOS:** `brew install ffmpeg`
   - **Ubuntu/Debian:** `sudo apt-get install ffmpeg`
   - **Windows:** Download from [ffmpeg.org](https://ffmpeg.org/download.html)

3. **Optional - For Apple Silicon (M1/M2/M3):**
   ```bash
   pip install mlx-whisper
   ```
   This will enable faster inference using MLX optimization.
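
To confirm these prerequisites from inside Python, a small check along the following lines may help. This is only a sketch and not part of Docling: `check_asr_prerequisites` is a hypothetical helper, it assumes the `mlx-whisper` package is importable as `mlx_whisper`, and it guesses Apple Silicon from the values reported by `platform`.

```python
import importlib.util
import platform
import shutil


def check_asr_prerequisites() -> dict[str, bool]:
    """Report which prerequisites appear to be available (hypothetical helper)."""
    return {
        # Docling itself must be importable
        "docling": importlib.util.find_spec("docling") is not None,
        # ffmpeg must be discoverable on PATH for audio decoding
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # Assumption: mlx-whisper installs a top-level module named `mlx_whisper`
        "mlx_whisper": importlib.util.find_spec("mlx_whisper") is not None,
        # Heuristic: Apple Silicon machines report Darwin + arm64
        "apple_silicon": platform.system() == "Darwin"
        and platform.machine() == "arm64",
    }


for name, available in check_asr_prerequisites().items():
    print(f"{name}: {'found' if available else 'missing'}")
```

A `missing` entry for `mlx_whisper` or `apple_silicon` is expected on non-Apple hardware and does not prevent the rest of the notebook from running.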

---


## Step 1: Import Required Libraries

Let's import all the necessary modules for our ASR pipeline:

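- **`pathlib.Path`**: For handling file paths in a cross-platform way
- **`docling_core.types.doc.DoclingDocument`**: The document object that stores transcriptions
- **`docling.datamodel.asr_model_specs`**: Pre-configured ASR model specifications
- **`docling.datamodel.base_models`**: `ConversionStatus` and `InputFormat`, used to check the result and select the audio input format
- **`docling.datamodel.document.ConversionResult`**: The result object returned by a conversion
- **`docling.datamodel.pipeline_options.AsrPipelineOptions`**: Configuration object for the ASR pipeline
- **`docling.document_converter`**: `DocumentConverter`, the main converter class, and `AudioFormatOption`
- **`docling.pipeline.asr_pipeline.AsrPipeline`**: The ASR-specific pipeline implementation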


In [None]:
from pathlib import Path

from docling_core.types.doc import DoclingDocument
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline


## Step 2: Create the ASR Converter

This function creates and configures a `DocumentConverter` specifically for ASR tasks.

### Key Components:

1. **`AsrPipelineOptions()`**: Configuration object for the ASR pipeline
2. **`asr_model_specs.WHISPER_TURBO`**: Automatic model selection:
   - Uses **MLX Whisper Turbo** on Apple Silicon (faster, optimized for M-series chips)
   - Falls back to **Native Whisper Turbo** on other hardware
3. **`AudioFormatOption`**: Specifies:
   - `pipeline_cls`: Which pipeline to use (AsrPipeline)
   - `pipeline_options`: Configuration for the pipeline

### Available Models:

You can swap `WHISPER_TURBO` with other models from `asr_model_specs`:
- `WHISPER_TINY`: Smallest, fastest, less accurate
- `WHISPER_BASE`: Balanced for speed and accuracy
- `WHISPER_SMALL`: Good accuracy, moderate speed
- `WHISPER_MEDIUM`: High accuracy, slower
- `WHISPER_LARGE`: Highest accuracy, slowest
- `WHISPER_TURBO`: Optimized version, good balance (recommended)
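
If you would rather choose the model from a short name (for example a notebook parameter), the lookup can be sketched as below. `resolve_model_spec` and `KNOWN_SIZES` are hypothetical names introduced for illustration; the function only assumes that the module passed in exposes the `WHISPER_*` attributes listed above. A stand-in namespace is used here so the sketch runs without Docling installed.

```python
from types import SimpleNamespace

# Sizes taken from the list above
KNOWN_SIZES = ("TINY", "BASE", "SMALL", "MEDIUM", "LARGE", "TURBO")


def resolve_model_spec(specs_module, size: str):
    """Return `specs_module.WHISPER_<SIZE>` for a size name such as "tiny"."""
    key = size.strip().upper()
    if key not in KNOWN_SIZES:
        raise ValueError(f"Unknown model size {size!r}; expected one of {KNOWN_SIZES}")
    return getattr(specs_module, f"WHISPER_{key}")


# Stand-in for `docling.datamodel.asr_model_specs` (illustration only)
fake_specs = SimpleNamespace(WHISPER_TINY="tiny-spec", WHISPER_TURBO="turbo-spec")
print(resolve_model_spec(fake_specs, "turbo"))  # prints: turbo-spec
```

In the notebook you would pass the real module instead, for example `resolve_model_spec(asr_model_specs, "small")`.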


In [None]:
def get_asr_converter():
    """Create a DocumentConverter configured for ASR with automatic model selection.

    Uses `asr_model_specs.WHISPER_TURBO` which automatically selects the best
    implementation for your hardware:
    - MLX Whisper Turbo for Apple Silicon (M1/M2/M3) with mlx-whisper installed
    - Native Whisper Turbo as fallback

    You can swap in another model spec from `docling.datamodel.asr_model_specs`
    to experiment with different model sizes.
    """
    # Create pipeline options
    pipeline_options = AsrPipelineOptions()
    
    # Set the ASR model (Whisper Turbo with automatic hardware selection)
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

    # Create the document converter with audio format support
    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    
    return converter


## Step 3: Define the ASR Pipeline Conversion Function

This function performs the actual transcription:

### Process Flow:

1. **Validate**: Check if the audio file exists
2. **Initialize**: Get the configured converter
3. **Convert**: Process the audio file through the ASR pipeline
4. **Verify**: Ensure the conversion was successful
5. **Return**: Return the `DoclingDocument` containing the transcription

### Return Value:

The function returns a `DoclingDocument` object that contains:
- Transcribed text segments
- Timestamps for each segment
- Metadata about the conversion

This document can be exported to various formats (Markdown, JSON, etc.)


In [None]:
def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:
    """Run the ASR pipeline and return a `DoclingDocument` transcript."""
    
    # Check if the audio file exists
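    # (an explicit exception still fires when Python runs with -O, unlike assert)
    if not audio_path.exists():
        raise FileNotFoundError(f"Audio file not found: {audio_path}")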
    
    # Get the configured ASR converter
    converter = get_asr_converter()

    # Convert the audio file to text
    print(f"Converting audio file: {audio_path}")
    result: ConversionResult = converter.convert(audio_path)

    # Verify conversion was successful
    if result.status != ConversionStatus.SUCCESS:
        raise RuntimeError(f"Conversion failed with status: {result.status}")
    
    print("Conversion successful!")
    return result.document


## Step 4: Run the ASR Pipeline

Now let's transcribe an audio file!

### Input Audio:

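The default example uses `tests/data/audio/sample_10s.mp3` from Docling's test suite. The path is relative to the current working directory, so either run the notebook from the root of a Docling repository checkout or point it at your own file.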

**To use your own audio file**, replace the path below with the path to your file. Supported formats include MP3, WAV, FLAC, M4A, OGG, and others that ffmpeg can decode.

Example:
```python
audio_path = Path("path/to/your/audio.mp3")
```
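
If your recordings live in one folder, collecting them by extension could look like the sketch below. `find_audio_files` and `AUDIO_EXTENSIONS` are hypothetical names, and the extension set is an assumption based only on the formats listed above; extend it to match what your ffmpeg build can decode.

```python
import tempfile
from pathlib import Path

# Assumption: extensions derived from the formats listed above
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}


def find_audio_files(folder: Path) -> list[Path]:
    """Return audio files directly inside `folder`, sorted by path."""
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


# Demonstrate on a temporary folder with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.WAV", "a.mp3", "notes.txt"):
        (Path(tmp) / name).touch()
    print([path.name for path in find_audio_files(Path(tmp))])  # prints: ['a.mp3', 'b.WAV']
```

The extension check is case-insensitive, and files with other extensions are skipped rather than passed to the converter.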


In [None]:
# Define the path to your audio file
# Default: uses Docling's test audio file
audio_path = Path("tests/data/audio/sample_10s.mp3")

# Uncomment and modify this line to use your own audio file:
# audio_path = Path("your_audio_file.mp3")

# Run the ASR pipeline
doc = asr_pipeline_conversion(audio_path=audio_path)
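
When comparing model sizes, it helps to measure how long a conversion takes. The wrapper below is a generic sketch using only the standard library; `timed` is a hypothetical helper, and the stand-in workload exists only so the block runs on its own.

```python
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call `fn` and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in this notebook you would pass `asr_pipeline_conversion`
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={seconds:.3f}s")
```

In this notebook the call would be `doc, seconds = timed(asr_pipeline_conversion, audio_path)`. The first run may also include time spent fetching and loading model weights, so a second run is likely a fairer basis for comparison.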


## Step 5: View the Transcription

Let's export and display the transcription in Markdown format.

### Output Format:

The output includes:
- **Timestamps**: `[time: start-end]` showing when each segment was spoken
- **Text**: The transcribed speech

### Expected Output:

For the sample audio file, you should see something like:

```
[time: 0.0-4.0]  Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96]  This is a LibriVox recording. All LibriVox recordings are in the public domain.
```
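
If you need the timestamps as numbers rather than text, the exported lines can be parsed with a regular expression. The sketch below assumes every segment follows the `[time: start-end]` line format shown above; `Segment` and `parse_segments` are hypothetical names, not part of Docling.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float  # seconds
    text: str


# Matches lines such as "[time: 5.28-9.96]  Some text"
_SEGMENT_RE = re.compile(r"^\[time: (\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s*(.*)$")


def parse_segments(markdown: str) -> list[Segment]:
    """Extract (start, end, text) segments; lines without a time prefix are skipped."""
    segments = []
    for line in markdown.splitlines():
        match = _SEGMENT_RE.match(line.strip())
        if match:
            start, end, text = match.groups()
            segments.append(Segment(float(start), float(end), text))
    return segments


sample = """[time: 0.0-4.0]  Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96]  This is a LibriVox recording. All LibriVox recordings are in the public domain."""

for seg in parse_segments(sample):
    print(f"{seg.start:>5.2f}s to {seg.end:>5.2f}s: {seg.text}")
```

Once the next cell has produced `transcription`, the same function can be applied to it with `parse_segments(transcription)`.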


In [None]:
# Export the document to Markdown format
transcription = doc.export_to_markdown()

# Display the transcription
print("\n=== Transcription ===")
print(transcription)
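
To keep the transcript, you can write the Markdown string to disk. The helper below is a standard-library sketch; `save_transcript` and its file-naming scheme (audio file stem plus `.md`) are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def save_transcript(markdown: str, audio_path: Path, out_dir: Path) -> Path:
    """Write `markdown` to `<out_dir>/<audio stem>.md` and return the new path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{audio_path.stem}.md"
    out_path.write_text(markdown, encoding="utf-8")
    return out_path


# Demonstrate with a temporary output folder and a placeholder transcript
with tempfile.TemporaryDirectory() as tmp:
    saved = save_transcript(
        "[time: 0.0-4.0]  Hello", Path("sample_10s.mp3"), Path(tmp) / "transcripts"
    )
    print(saved.name)  # prints: sample_10s.md
```

In this notebook the call would be `save_transcript(transcription, audio_path, Path("transcripts"))`.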


## Step 6: Explore the Document Object

The `DoclingDocument` object contains rich information about the transcription.

Let's explore its structure:



In [None]:
# Display document type
print(f"Document Type: {type(doc)}")

# Display document name (if available)
if hasattr(doc, 'name'):
    print(f"Document Name: {doc.name}")

# You can also serialize the document to JSON for a more detailed view of its structure
import json

print("\n=== Document as JSON (first 500 chars) ===")
json_output = json.dumps(doc.export_to_dict(), indent=2)
print(json_output[:500] + "...")


## Customization Options

### 1. Using Different Whisper Models

You can experiment with different model sizes to balance speed and accuracy:


In [None]:
def get_custom_asr_converter(model_spec):
    """Create a converter with a custom model specification."""
    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = model_spec

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    return converter

# Example usage (commented out to avoid running):
# converter_tiny = get_custom_asr_converter(asr_model_specs.WHISPER_TINY)
# converter_large = get_custom_asr_converter(asr_model_specs.WHISPER_LARGE)


### 2. Processing Multiple Audio Files

You can process multiple audio files in a batch:


In [None]:
def batch_transcribe(audio_files: list[Path]) -> dict[str, str]:
    """Transcribe multiple audio files and return a dictionary of results."""
    converter = get_asr_converter()
    results = {}

    for audio_path in audio_files:
        if not audio_path.exists():
            print(f"Warning: File not found - {audio_path}")
            continue

        try:
            result = converter.convert(audio_path)
            if result.status == ConversionStatus.SUCCESS:
                results[str(audio_path)] = result.document.export_to_markdown()
            else:
                print(f"Failed to convert: {audio_path}")
        except Exception as e:
            print(f"Error processing {audio_path}: {e}")

    return results

# Example usage (commented out):
# audio_files = [
#     Path("audio1.mp3"),
#     Path("audio2.mp3"),
#     Path("audio3.mp3"),
# ]
# transcriptions = batch_transcribe(audio_files)


## Key Takeaways

1. **Simple Setup**: Docling provides a straightforward API for ASR tasks
2. **Hardware Optimization**: Automatic model selection based on your hardware (MLX for Apple Silicon)
3. **Multiple Formats**: Supports various audio formats through ffmpeg
4. **Timestamped Output**: Includes precise timing information for each speech segment
5. **Flexible Export**: Can export to Markdown, JSON, or other formats
6. **Model Options**: Choose from multiple Whisper model sizes based on your needs

---

## Next Steps

- Try transcribing your own audio files
- Experiment with different Whisper models
- Process multiple files in batch
- Integrate with other Docling features for document processing
- Explore multilingual transcription capabilities

---

## Resources

- [Docling Documentation](https://github.com/DS4SD/docling)
- [Whisper by OpenAI](https://github.com/openai/whisper)
- [MLX Whisper for Apple Silicon](https://github.com/ml-explore/mlx-examples/tree/main/whisper)
- [FFmpeg Download](https://ffmpeg.org/download.html)

