# Audio Transcription with Vertector

Demonstrates:
- Whisper model configuration
- MLX vs Standard backend
- Timestamped transcriptions
- Multi-language support
- Batch processing
- SRT generation

## Setup

In [1]:
from pathlib import Path
from vertector_data_ingestion import (
    create_audio_transcriber,
    AudioConfig,
    WhisperModelSize,
    AudioBackend,
    HardwareDetector,
    setup_logging,
)

setup_logging(log_level="INFO")

2025-12-31 04:57:13 | INFO     | vertector_data_ingestion.monitoring.logger:setup_logging:52 - Logging initialized at INFO level


## Hardware Detection

In [2]:
hw_info = HardwareDetector.get_device_info()

print("Hardware:")
print(f"  Device: {hw_info.get('device_type')}")
print(f"  Chip: {hw_info.get('chip', 'Unknown')}")
print(f"  Use MLX: {hw_info.get('use_mlx', False)}")
print(f"  Batch Size: {hw_info.get('batch_size', 1)}")

if hw_info.get('device_type') == 'mps':
    print("\n✓ Recommend: MLX backend (10-20x faster on Apple Silicon)")
elif hw_info.get('device_type') == 'cuda':
    print("\n✓ Recommend: Standard with CUDA")
else:
    print("\n✓ Recommend: Standard (CPU)")

2025-12-31 04:57:46 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support


Hardware:
  Device: mps
  Chip: M1
  Use MLX: True
  Batch Size: 8

✓ Recommend: MLX backend (10-20x faster on Apple Silicon)
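
The branching above can be pulled into a small function so the recommendation is reusable and testable. This is a minimal sketch: `recommend_backend` and the returned labels are hypothetical names, not part of the library, and the only assumption about `hw_info` is the `device_type` key printed above.

```python
def recommend_backend(hw_info: dict) -> str:
    """Return a backend label for a device-info dict.

    The labels are illustrative strings, not library enum values.
    """
    device_type = hw_info.get("device_type")
    if device_type == "mps":
        return "mlx"            # Apple Silicon
    if device_type == "cuda":
        return "standard-cuda"  # NVIDIA GPU
    return "standard-cpu"       # fallback when no accelerator is reported


print(recommend_backend({"device_type": "mps", "chip": "M1"}))
```

Returning a value instead of printing keeps the decision separate from how it is displayed.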


## Basic Transcription

In [3]:
config = AudioConfig(
    model_size=WhisperModelSize.BASE,
    backend=AudioBackend.AUTO,
    language="en",
    word_timestamps=True,
)

transcriber = create_audio_transcriber(config)
audio_path = Path("../test_documents/harvard.wav")

if audio_path.exists():
    result = transcriber.transcribe(audio_path)
    
    print("Result:")
    print(f"  Text: {result.text}")
    print(f"  Language: {result.language}")
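    # Note: in this run, duration matched the transcription time reported in
    # the log (7.06s), not the length of the audio (segments run past 14s).
    print(f"  Duration: {result.duration:.2f}s")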
    print(f"  Segments: {len(result.segments)}")
else:
    print(f"File not found: {audio_path}")

2025-12-31 04:58:36 | INFO     | vertector_data_ingestion.audio.audio_factory:create_audio_transcriber:23 - Creating audio transcriber: model=base, backend=auto
2025-12-31 04:58:36 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 04:58:36 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:__init__:55 - Initializing WhisperTranscriber with model=base, device=mlx, backend=auto
2025-12-31 04:58:36 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:89 - Loading Whisper model: base
2025-12-31 04:58:36 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:97 - Loaded MLX Whisper model: base

2025-12-31 04:58:43 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:transcribe:191 - Transcription complete in 7.06s: 216 chars, 6 segments


Result:
  Text: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun.
  Language: en
  Duration: 7.06s
  Segments: 6


## Timestamped Segments

In [4]:
if audio_path.exists():
    for i, segment in enumerate(result.segments[:5], 1):
        print(f"\nSegment {i}:")
        print(f"  Time: [{segment.start:.1f}s - {segment.end:.1f}s]")
        print(f"  Text: {segment.text}")


Segment 1:
  Time: [0.9s - 3.6s]
  Text: The stale smell of old beer lingers.

Segment 2:
  Time: [4.2s - 6.2s]
  Text: It takes heat to bring out the odor.

Segment 3:
  Time: [7.0s - 9.2s]
  Text: A cold dip restores health and zest.

Segment 4:
  Time: [10.0s - 12.0s]
  Text: A salt pickle tastes fine with ham.

Segment 5:
  Time: [12.7s - 14.3s]
  Text: Tacos al pastor are my favorite.
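
Segment timestamps also make it easy to measure the silence between utterances. The sketch below uses a stand-in `Segment` dataclass (an assumption; the library's own segment type is only known here to expose `start`, `end`, and `text`) filled with the first three segments printed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Stand-in for a transcription segment (start/end in seconds)."""
    start: float
    end: float
    text: str


def pauses_between(segments: List[Segment]) -> List[float]:
    """Return the gap in seconds between each segment and the next one."""
    return [
        round(nxt.start - cur.end, 3)
        for cur, nxt in zip(segments, segments[1:])
    ]


segments = [
    Segment(0.9, 3.6, "The stale smell of old beer lingers."),
    Segment(4.2, 6.2, "It takes heat to bring out the odor."),
    Segment(7.0, 9.2, "A cold dip restores health and zest."),
]
print(pauses_between(segments))  # [0.6, 0.8]
```

The same function should work on `result.segments` directly, provided those objects carry numeric `start` and `end` attributes as the cell above suggests.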


## Model Size Comparison

In [6]:
import time

if audio_path.exists():
    models = [WhisperModelSize.TINY, WhisperModelSize.BASE, WhisperModelSize.SMALL]
    
    print("Model Comparison:")
    for model_size in models:
        config = AudioConfig(model_size=model_size, backend=AudioBackend.AUTO)
        transcriber = create_audio_transcriber(config)
        
        # perf_counter is monotonic and better suited to timing than time.time()
        start = time.perf_counter()
        result = transcriber.transcribe(audio_path)
        elapsed = time.perf_counter() - start
        
        print(f"\n{model_size.value.upper()}: {elapsed:.2f}s")
        print(f"  Text: {result.text[:100]}...")

2025-12-31 05:06:51 | INFO     | vertector_data_ingestion.audio.audio_factory:create_audio_transcriber:23 - Creating audio transcriber: model=tiny, backend=auto
2025-12-31 05:06:51 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 05:06:51 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:__init__:55 - Initializing WhisperTranscriber with model=tiny, device=mlx, backend=auto
2025-12-31 05:06:51 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:89 - Loading Whisper model: tiny
2025-12-31 05:06:51 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:97 - Loaded MLX Whisper model: tiny

Model Comparison:


2025-12-31 05:06:52 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:transcribe:191 - Transcription complete in 0.83s: 219 chars, 6 segments
2025-12-31 05:06:52 | INFO     | vertector_data_ingestion.audio.audio_factory:create_audio_transcriber:23 - Creating audio transcriber: model=base, backend=auto
2025-12-31 05:06:52 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 05:06:52 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:__init__:55 - Initializing WhisperTranscriber with model=base, device=mlx, backend=auto
2025-12-31 05:06:52 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:89 - Loading Whisper model


TINY: 0.84s
  Text: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health...


2025-12-31 05:06:53 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:transcribe:191 - Transcription complete in 1.64s: 216 chars, 6 segments
2025-12-31 05:06:53 | INFO     | vertector_data_ingestion.audio.audio_factory:create_audio_transcriber:23 - Creating audio transcriber: model=small, backend=auto
2025-12-31 05:06:53 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 05:06:53 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:__init__:55 - Initializing WhisperTranscriber with model=small, device=mlx, backend=auto
2025-12-31 05:06:53 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:89 - Loading Whisper mod


BASE: 1.64s
  Text: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health...


2025-12-31 05:06:57 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:transcribe:191 - Transcription complete in 3.72s: 216 chars, 6 segments



SMALL: 3.72s
  Text: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health...
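
A single timed call per model can mislead. The first BASE run in this notebook took 7.06s, while the same model took 1.64s here, possibly because of first-call overhead such as fetching or warming up the model. The sketch below shows one way to get steadier numbers by discarding warm-up calls and keeping the best of several runs. `best_time` is a hypothetical helper, not part of the library, and it works with any zero-argument callable.

```python
import time
from typing import Any, Callable


def best_time(fn: Callable[[], Any], repeats: int = 3, warmup: int = 1) -> float:
    """Return the fastest wall-clock time (seconds) over `repeats` calls.

    `warmup` calls run first and are not timed.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in the notebook this could be
# lambda: transcriber.transcribe(audio_path)
elapsed = best_time(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.4f}s")
```

Taking the minimum rather than the mean reduces the influence of unrelated background activity on the measurement.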


## Multi-Language

In [7]:
if audio_path.exists():
    # Auto-detect
    auto_config = AudioConfig(model_size=WhisperModelSize.BASE, language=None)
    transcriber = create_audio_transcriber(auto_config)
    result = transcriber.transcribe(audio_path)
    
    print(f"Detected language: {result.language}")
    print(f"Text: {result.text[:200]}...")

2025-12-31 05:08:42 | INFO     | vertector_data_ingestion.audio.audio_factory:create_audio_transcriber:23 - Creating audio transcriber: model=base, backend=auto
2025-12-31 05:08:42 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 05:08:42 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:__init__:55 - Initializing WhisperTranscriber with model=base, device=mlx, backend=auto
2025-12-31 05:08:42 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:89 - Loading Whisper model: base
2025-12-31 05:08:42 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:97 - Loaded MLX Whisper model: base

2025-12-31 05:08:43 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:transcribe:191 - Transcription complete in 1.66s: 216 chars, 6 segments


Detected language: en
Text: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is th...


## Batch Processing

In [8]:
audio_dir = Path("../test_documents/")

if audio_dir.exists():
    # glob order is not guaranteed, so sort for a deterministic processing order
    audio_files = sorted([*audio_dir.glob("*.wav"), *audio_dir.glob("*.mp3")])
    
    if audio_files:
        config = AudioConfig(model_size=WhisperModelSize.BASE)
        transcriber = create_audio_transcriber(config)
        
        for audio_file in audio_files[:5]:
            result = transcriber.transcribe(audio_file)
            print(f"\n{audio_file.name}: {result.duration:.1f}s")
            print(f"  {result.text[:100]}...")
else:
    print("Create '../test_documents/' directory with audio files")

2025-12-31 05:09:32 | INFO     | vertector_data_ingestion.audio.audio_factory:create_audio_transcriber:23 - Creating audio transcriber: model=base, backend=auto
2025-12-31 05:09:32 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 05:09:32 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:__init__:55 - Initializing WhisperTranscriber with model=base, device=mlx, backend=auto
2025-12-31 05:09:32 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:89 - Loading Whisper model: base
2025-12-31 05:09:32 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:97 - Loaded MLX Whisper model: base


harvard.wav: 1.1s
  The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health...
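
File discovery for a batch run can be factored into a helper that matches extensions case-insensitively and returns a stable order. This is a sketch using only `pathlib`; `find_audio_files` and `AUDIO_SUFFIXES` are hypothetical names, and the suffix set mirrors the two patterns used in the cell above.

```python
from pathlib import Path
from typing import List, Optional

# Extensions treated as audio; mirrors the *.wav / *.mp3 globs above.
AUDIO_SUFFIXES = {".wav", ".mp3"}


def find_audio_files(directory: Path, limit: Optional[int] = None) -> List[Path]:
    """Return audio files directly inside `directory`, sorted by path.

    Matching is case-insensitive, so "clip.WAV" is included.
    Returns an empty list if the directory does not exist.
    """
    if not directory.is_dir():
        return []
    files = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )
    return files if limit is None else files[:limit]


for path in find_audio_files(Path("../test_documents"), limit=5):
    print(path.name)
```

Each returned path can then be passed to `transcriber.transcribe(...)` as in the loop above, reusing one transcriber for the whole batch.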


## Generate SRT Subtitles

In [9]:
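def format_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Round to whole milliseconds first; truncating (seconds % 1) * 1000
    # can lose a millisecond to floating-point error (e.g. 1.001 -> ,000).
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"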

if audio_path.exists():
    config = AudioConfig(model_size=WhisperModelSize.BASE, word_timestamps=True)
    transcriber = create_audio_transcriber(config)
    result = transcriber.transcribe(audio_path)
    
    # Generate SRT
    srt_output = []
    for i, segment in enumerate(result.segments, 1):
        start = format_srt_timestamp(segment.start)
        end = format_srt_timestamp(segment.end)
        srt_output.append(f"{i}\n{start} --> {end}\n{segment.text.strip()}\n")
    
    srt_content = "\n".join(srt_output)
    
    # Write the SRT file into the converter's configured output directory
    from vertector_data_ingestion import UniversalConverter
    converter = UniversalConverter()
    srt_path = converter.config.output_dir / "transcript.srt"
    srt_path.parent.mkdir(parents=True, exist_ok=True)
    srt_path.write_text(srt_content, encoding="utf-8")
    
    print(f"Saved to: {srt_path}")

2025-12-31 05:12:46 | INFO     | vertector_data_ingestion.audio.audio_factory:create_audio_transcriber:23 - Creating audio transcriber: model=base, backend=auto
2025-12-31 05:12:46 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 05:12:46 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:__init__:55 - Initializing WhisperTranscriber with model=base, device=mlx, backend=auto
2025-12-31 05:12:46 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:89 - Loading Whisper model: base
2025-12-31 05:12:46 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:97 - Loaded MLX Whisper model: base

Saved to: output/transcript.srt
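
A quick sanity check on the generated subtitles is to parse the timestamps back and confirm the cues are well ordered. The sketch below assumes the layout produced above (index line, `start --> end` line, text, blank line between cues). `parse_srt_timestamp` and `check_srt` are hypothetical helpers, and the sample text is built from the first two segments shown earlier.

```python
import re

_TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def parse_srt_timestamp(stamp: str) -> float:
    """Convert an SRT timestamp (HH:MM:SS,mmm) to seconds."""
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"Not an SRT timestamp: {stamp!r}")
    hours, minutes, secs, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + secs + millis / 1000


def check_srt(srt_text: str) -> int:
    """Return the number of cues; raise ValueError if timing is inconsistent."""
    previous_start = 0.0
    count = 0
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        start_text, end_text = lines[1].split(" --> ")
        start = parse_srt_timestamp(start_text)
        end = parse_srt_timestamp(end_text)
        if end < start:
            raise ValueError(f"Cue {lines[0]} ends before it starts")
        if start < previous_start:
            raise ValueError(f"Cue {lines[0]} starts before the previous cue")
        previous_start = start
        count += 1
    return count


sample = (
    "1\n00:00:00,900 --> 00:00:03,600\nThe stale smell of old beer lingers.\n"
    "\n"
    "2\n00:00:04,200 --> 00:00:06,200\nIt takes heat to bring out the odor.\n"
)
print(check_srt(sample))  # 2
```

Running `check_srt(srt_path.read_text(encoding="utf-8"))` on the saved file would apply the same check to the real output.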


## Summary

Demonstrated:
- Hardware detection
- Basic transcription
- Timestamped segments
- Model comparison
- Multi-language support
- Batch processing
- SRT generation

Next: `03_rag_pipeline.ipynb`