# 🎤 Whisper Speech-to-Text Transcriber

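This notebook uses OpenAI's open-source Whisper model to transcribe audio files into text. It runs in Colab's free tier and needs no OpenAI API key. The optional speaker-detection step relies on a gated pyannote.audio pipeline, which does require a Hugging Face access token.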

## Model Information
Whisper is a state-of-the-art speech recognition model that offers:

- **Multilingual Support**: Transcribe audio in 99 languages
- **Multiple Model Sizes**:
  - `tiny` (39M params): Ultra-fast, good for quick tests
  - `base` (74M params): Good balance of speed and accuracy
  - `small` (244M params): Better accuracy, still fast
  - `medium` (769M params): High accuracy
  - `large-v3` (1.5B params): Best quality, latest version
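  - `turbo` (809M params): Optimized variant of `large-v3`; multilingual and much faster with slightly lower accuracy, but not trained for the `translate` task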

## Features
- Automatic language detection
- Support for multiple audio formats (mp3, wav, m4a, etc.)
- Fast transcription using GPU acceleration
- Optional timestamp generation
- Advanced features:
  - Word-level timestamps
  - Speaker diarization
  - Confidence scores
  - Custom vocabulary
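
Confidence scores are not a single field in Whisper's output. Each entry of `result["segments"]` carries an `avg_logprob` value, and one rough, uncalibrated proxy is to exponentiate it. The sketch below assumes that approach; the helper names and the 0.5 threshold are illustrative, and the segments are hand-written stand-ins rather than real model output. For custom vocabulary, a common approach is to pass domain terms through the `initial_prompt` argument of `model.transcribe`, which biases decoding but does not guarantee the spelling.

```python
import math

def segment_confidence(segment: dict) -> float:
    """Map a segment's average token log-probability to a value in (0, 1]."""
    return math.exp(segment["avg_logprob"])

def flag_low_confidence(segments, threshold=0.5):
    """Return the text of segments whose rough confidence is below `threshold`."""
    return [seg["text"].strip() for seg in segments
            if segment_confidence(seg) < threshold]

# Hand-written stand-ins shaped like entries of result["segments"]
fake_segments = [
    {"text": " Hello there.", "avg_logprob": -0.1},
    {"text": " (mumbling)", "avg_logprob": -1.2},
]
print(flag_low_confidence(fake_segments))
```

Segments flagged this way are candidates for manual review; the value should not be read as a true probability of correctness.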

## Setup
First, let's install the required packages:

In [None]:
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio pyannote.audio
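
Whisper decodes audio by calling the `ffmpeg` command-line tool, which `pip` does not install. Hosted runtimes usually ship it, but a quick check avoids a confusing error later. This is a minimal sketch; the helper name `ffmpeg_available` is illustrative.

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    # Assumes a Debian/Ubuntu-based runtime; other systems use a different package manager
    print("ffmpeg not found; try: apt-get install -y ffmpeg")
```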

## Import Dependencies

In [None]:
import whisper
import gradio as gr
import torch
import os
from pyannote.audio import Pipeline

## Load Model

In [None]:
# Available models and their descriptions
MODELS = {
    "base": "Good balance of speed and accuracy",
    "tiny": "Ultra-fast, good for quick tests",
    "small": "Better accuracy, still fast",
    "medium": "High accuracy",
    "large-v3": "Best quality, latest version",
    "turbo": "Optimized large-v3: faster, slightly less accurate, no translation"
}

def load_model(model_name="base"):
    model = whisper.load_model(model_name)
    print(f"Using {model_name} model on: {'GPU' if torch.cuda.is_available() else 'CPU'}")
    return model

# Initialize with default model
model = load_model()
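
If users switch between sizes often, reloading a model on every switch gets slow. One option is to keep loaded models in a small cache keyed by name. The sketch below is an assumption about how that could look, not part of the notebook: `ModelCache` is a hypothetical helper, and it takes the loader as a parameter (in the notebook that would be `whisper.load_model`) so the demonstration can run with a stand-in.

```python
from typing import Any, Callable, Dict

class ModelCache:
    """Keep already-loaded models so switching sizes does not reload them."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader              # e.g. whisper.load_model
        self._models: Dict[str, Any] = {}  # model name -> loaded model

    def get(self, name: str) -> Any:
        # Load on first request, then reuse the stored object
        if name not in self._models:
            self._models[name] = self._loader(name)
        return self._models[name]

# Demonstration with a stand-in loader that records each call
calls = []

def fake_loader(name):
    calls.append(name)
    return f"<model {name}>"

cache = ModelCache(fake_loader)
cache.get("base")
cache.get("small")
cache.get("base")   # served from the cache, no second load
print(calls)        # ['base', 'small']
```

The trade-off is memory: every cached model stays resident, so on a small GPU it may be safer to cache only one or two sizes.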

## Create Transcription Function

In [None]:
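# Name of the model currently held in `model` (the notebook loads "base" at startup)
current_model_name = "base"

def transcribe_audio(audio_path, model_size="base", add_timestamps=False,
                    word_timestamps=False, detect_speakers=False,
                    language=None, task="transcribe"):
    try:
        # Reload only when a different size is requested; Whisper model objects
        # have no `model_size` attribute, so the loaded name is tracked separately
        global model, current_model_name
        if model_size != current_model_name:
            model = load_model(model_size)
            current_model_name = model_size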
        
        # Transcription options
        options = {
            "task": task,  # transcribe or translate
            "language": language,  # None for auto-detection
            "word_timestamps": word_timestamps
        }
        
        # Transcribe audio
        result = model.transcribe(audio_path, **options)
        
        # Initialize output text
        output = ""
        
        # Add detected language if auto-detected
        if not language:
            output += f"Detected Language: {result['language']}\n\n"
        
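        # Process with speaker diarization if requested
        if detect_speakers:
            # This pipeline is gated on Hugging Face: accept its terms there and
            # authenticate with an access token first, or loading will fail
            pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
            diarization = pipeline(audio_path)
            turns = [(turn.start, turn.end, speaker)
                     for turn, _, speaker in diarization.itertracks(yield_label=True)]
            
            # Label each Whisper segment with the speaker whose turn overlaps it most,
            # so segments that straddle a turn boundary are not dropped
            for seg in result["segments"]:
                best_speaker, best_overlap = "UNKNOWN", 0.0
                for turn_start, turn_end, speaker in turns:
                    overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
                    if overlap > best_overlap:
                        best_speaker, best_overlap = speaker, overlap
                output += f"[{best_speaker}] {seg['text'].strip()}\n"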
        
        # Format with timestamps if requested
        elif add_timestamps:
            if word_timestamps:
                # Add word-level timestamps
                for segment in result["segments"]:
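                    for word in segment["words"]:
                        # Whisper stores each word under the "word" key; keep the
                        # float start time so the centisecond part is not always 00
                        centis = int(round(word["start"] * 100))
                        output += f"[{centis // 6000:02d}:{centis % 6000 // 100:02d}.{centis % 100:02d}] {word['word'].strip()} "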
                    output += "\n"
            else:
                # Add segment-level timestamps
                for segment in result["segments"]:
                    start = int(segment["start"])
                    output += f"[{start//60:02d}:{start%60:02d}] {segment['text']}\n"
        else:
            output += result["text"]
        
        return output
    
    except Exception as e:
        return f"Error during transcription: {str(e)}"
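
The two fiddly steps in the function above, timestamp formatting and matching text to speakers, can be factored into pure helpers that are testable without loading a model. This is a sketch under assumptions: `format_timestamp` and `assign_speaker` are hypothetical names, and speaker turns are represented as plain `(start, end, label)` tuples rather than pyannote objects.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.cc, rounding to the nearest centisecond."""
    total_cs = int(round(float(seconds) * 100))
    minutes, rest = divmod(total_cs, 6000)   # 6000 centiseconds per minute
    secs, cs = divmod(rest, 100)
    return f"{minutes:02d}:{secs:02d}.{cs:02d}"

def assign_speaker(seg_start, seg_end, turns, default="UNKNOWN"):
    """Pick the speaker whose turn overlaps [seg_start, seg_end] the most."""
    best_speaker, best_overlap = default, 0.0
    for turn_start, turn_end, speaker in turns:
        overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    return best_speaker

# Hand-written speaker turns for illustration
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]

print(format_timestamp(75.5))            # 01:15.50
print(assign_speaker(3.0, 6.0, turns))   # SPEAKER_01 (2.0 s overlap vs 1.0 s)
```

Working in whole centiseconds avoids outputs such as `00:60.00` when a value like 59.999 is rounded, and the overlap rule still assigns a speaker to segments that cross a turn boundary.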

## Create Gradio Interface

In [None]:
with gr.Blocks() as interface:
    gr.Markdown(
        """
        # 🎤 Whisper Speech-to-Text Transcriber
        Upload an audio file to transcribe it using OpenAI's Whisper model.
        
        ### Tips for best results:
        1. Use clear audio with minimal background noise
        2. Choose larger models for better accuracy
        3. Enable timestamps for long recordings
        4. Use speaker detection for multi-speaker audio
        """
    )
    
    with gr.Row():
        with gr.Column(scale=2):
            audio_input = gr.Audio(
                type="filepath",
                label="Upload Audio"
            )
            
            model_choice = gr.Dropdown(
                choices=list(MODELS.keys()),
                value="base",
                label="Model Size",
                info="Larger models are more accurate but slower"
            )
            
            with gr.Row():
                task = gr.Radio(
                    choices=["transcribe", "translate"],
                    value="transcribe",
                    label="Task",
                    info="Translate will convert to English"
                )
                language = gr.Dropdown(
                    choices=[None] + sorted(whisper.tokenizer.LANGUAGES.keys()),
                    value=None,
                    label="Language",
                    info="Auto-detect if not specified"
                )
            
            with gr.Row():
                timestamps = gr.Checkbox(
                    label="Add timestamps",
                    info="Add time markers to the transcript"
                )
                word_level = gr.Checkbox(
                    label="Word-level timestamps",
                    info="Add timestamps for each word"
                )
                speakers = gr.Checkbox(
                    label="Detect speakers",
                    info="Identify different speakers in the audio"
                )
        
        with gr.Column(scale=2):
            output_text = gr.Textbox(
                label="Transcription",
                lines=20
            )
    
    inputs = [audio_input, model_choice, timestamps, word_level, speakers, language, task]
    
    # Example rows follow the order of `inputs`; "sample.mp3" must exist in the working directory
    gr.Examples(
        examples=[
            ["sample.mp3", "base", True, False, False, None, "transcribe"],
            ["sample.mp3", "large-v3", True, True, False, "en", "transcribe"],
            ["sample.mp3", "medium", False, False, True, None, "translate"]
        ],
        inputs=inputs
    )
    
    # Inside gr.Blocks, wire the function to a button rather than nesting a gr.Interface
    transcribe_button = gr.Button("Transcribe")
    transcribe_button.click(fn=transcribe_audio, inputs=inputs, outputs=output_text)

interface.launch(share=True)