# 🎙️ Transcription Generator

This notebook generates **TTS-optimized transcriptions** from text files. The output includes prosodic markers (pauses, tone, emphasis) that help TTS models produce natural, human-like audio.

## Features
- 📤 **File Upload**: Upload the text file you want to transcribe
- 🤖 **AI Provider Selection**: Choose between **Ollama** and **HuggingFace**
- 🧠 **Model Selection**: Select and download/pull your preferred model
- 🎭 **Prosodic Markers**: Adds pause, tone, emphasis, and pacing markers
- 💾 **Download**: Download the generated transcription as a TXT file

---

## Step 1: Install Dependencies 📦

Run this cell to install all required packages.

In [None]:
# Install core dependencies
!pip install -q transformers torch accelerate sentencepiece protobuf

# Install ollama for Ollama provider support
!pip install -q ollama

# Install ipywidgets (notebook form controls; not specific to Colab)
!pip install -q ipywidgets

print("✅ All dependencies installed successfully!")

## Step 2: Upload Your Text File 📤

Upload the text file you want to transcribe for TTS.

In [None]:
from google.colab import files
import os

print("📤 Please upload your text file to transcribe:")
uploaded = files.upload()

if uploaded:
    INPUT_FILE = list(uploaded.keys())[0]
    print(f"\n✅ Uploaded: {INPUT_FILE}")
    
    # Show file preview
    with open(INPUT_FILE, 'r', encoding='utf-8') as f:
        content = f.read()
    word_count = len(content.split())
    print(f"📊 Word count: {word_count}")
    print("\n📖 Preview (first 500 chars):")
    print("-" * 50)
    print((content[:500] + "...") if len(content) > 500 else content)
else:
    print("❌ No file uploaded. Please run this cell again.")

## Step 3: Select AI Provider and Model 🤖

Choose your AI provider and specify the model to use.

### Recommended Models:

**Ollama** (requires Ollama server setup in Colab):
- `gemma2:9b` - Best for Hindi
- `aya:8b` - Multilingual specialist
- `qwen2.5:14b` - Excellent instruction following
- `llama3.1:8b` - Good for English

**HuggingFace** (recommended for Colab - works out of the box):
- `ai4bharat/Airavata` - Indian languages
- `sarvamai/sarvam-2b-v0.5` - Indian LLM
- `google/gemma-2-2b-it` - Lightweight, fast
- `Qwen/Qwen2.5-3B-Instruct` - Good balance
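
The configuration cell accepts free-form values, so a typo in the provider or an out-of-range chunk size only surfaces later, during model setup. The sketch below shows one way to check the four settings up front. The `RunConfig` dataclass and `validate_config` helper are hypothetical names introduced for illustration; the allowed values and the 3 to 12 chunk-size range are taken from the form fields in this notebook.

```python
from dataclasses import dataclass

# Allowed values mirror the dropdowns in the configuration cell.
ALLOWED_PROVIDERS = ("huggingface", "ollama")
ALLOWED_LANGUAGES = ("auto", "hindi", "english")


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the notebook's four settings."""
    provider: str
    model_name: str
    language: str = "auto"
    chunk_size: int = 6


def validate_config(cfg: RunConfig) -> list:
    """Return a list of problems; an empty list means the config is usable."""
    problems = []
    if cfg.provider not in ALLOWED_PROVIDERS:
        problems.append(f"unknown provider: {cfg.provider!r}")
    if not cfg.model_name.strip():
        problems.append("model name is empty")
    if cfg.language not in ALLOWED_LANGUAGES:
        problems.append(f"unknown language: {cfg.language!r}")
    if not 3 <= cfg.chunk_size <= 12:
        problems.append("chunk size must be between 3 and 12")
    return problems


if __name__ == "__main__":
    cfg = RunConfig("huggingface", "google/gemma-2-2b-it", "auto", 6)
    print(validate_config(cfg) or "configuration looks fine")
```

This only checks the shape of the settings; it does not verify that the model actually exists on the chosen provider.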

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

#@title Configuration
#@markdown ### Select AI Provider
AI_PROVIDER = "huggingface" #@param ["huggingface", "ollama"]

#@markdown ### Model Name
#@markdown For HuggingFace: `ai4bharat/Airavata`, `google/gemma-2-2b-it`, `Qwen/Qwen2.5-3B-Instruct`
#@markdown For Ollama: `gemma2:9b`, `aya:8b`, `qwen2.5:14b`, `llama3.1:8b`
MODEL_NAME = "google/gemma-2-2b-it" #@param {type:"string"}

#@markdown ### Language
LANGUAGE = "auto" #@param ["auto", "hindi", "english"]

#@markdown ### Chunk Size (sentences per chunk, smaller = better quality)
CHUNK_SIZE = 6 #@param {type:"slider", min:3, max:12, step:1}

print(f"\n📋 Configuration:")
print(f"   Provider: {AI_PROVIDER}")
print(f"   Model: {MODEL_NAME}")
print(f"   Language: {LANGUAGE}")
print(f"   Chunk Size: {CHUNK_SIZE}")

## Step 4: Setup Model (Download/Pull) 🧠

This cell will:
- **For HuggingFace**: Download the model from HuggingFace Hub
- **For Ollama**: Install Ollama server and pull the model

> **Note**: Ollama requires additional setup in Colab. HuggingFace is recommended for easier usage.
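
For the Ollama path, the server needs a few seconds before it accepts requests, so the setup polls `http://localhost:11434/api/tags` until it answers. A minimal sketch of that polling pattern using only the standard library is shown here; `http_probe` and `wait_until_ready` are hypothetical helper names, and the attempt count and delay are assumptions chosen to match the values used in the setup cell.

```python
import time
import urllib.error
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts: int = 10, delay: float = 2.0, sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_ready(lambda: http_probe(OLLAMA_TAGS_URL), attempts=3, delay=1.0)
    print("server ready" if ready else "server not reachable")
```

Passing the probe as a callable keeps the retry loop independent of the transport, so the same loop can be exercised without a running server.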

In [None]:
import subprocess
import time
import os

def setup_ollama():
    """Install Ollama in Colab and start its server. Returns True once the server responds."""
    import shutil
    
    print("🔧 Setting up Ollama in Colab...")
    print("   This may take a few minutes...\n")
    
    # Install Ollama and stop early if the binary is not available afterwards
    print("📥 Step 1/3: Installing Ollama...")
    install = subprocess.run(
        "curl -fsSL https://ollama.com/install.sh | sh",
        shell=True, capture_output=True, text=True
    )
    if install.returncode != 0 or shutil.which("ollama") is None:
        print(f"   ❌ Ollama installation failed:\n{install.stderr}")
        return False
    print("   ✅ Ollama installed")
    
    # Start Ollama server in background
    print("🚀 Step 2/3: Starting Ollama server...")
    subprocess.Popen(
        ["ollama", "serve"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL
    )
    
    # Wait for server to start
    print("   ⏳ Waiting for server to start...")
    time.sleep(5)
    
    # Poll the API until it answers (curl exits non-zero while the port is closed)
    for _ in range(10):
        result = subprocess.run(
            ["curl", "-s", "http://localhost:11434/api/tags"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            print("   ✅ Ollama server is running")
            return True
        time.sleep(2)
    
    print("   ⚠️ Ollama server did not respond; the model pull will likely fail.")
    return False

def pull_ollama_model(model_name):
    """Pull an Ollama model and stream the CLI progress output."""
    print(f"📥 Step 3/3: Pulling model '{model_name}'...")
    print("   This may take several minutes depending on model size...\n")
    
    # Pass arguments as a list so the model name is never interpreted by a shell
    process = subprocess.Popen(
        ["ollama", "pull", model_name],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True
    )
    
    for line in process.stdout:
        print(f"   {line.strip()}")
    
    process.wait()
    
    if process.returncode == 0:
        print(f"\n✅ Model '{model_name}' is ready!")
        return True
    else:
        print(f"\n❌ Failed to pull model '{model_name}'")
        return False

def setup_huggingface(model_name):
    """Download and verify HuggingFace model"""
    print(f"📥 Downloading HuggingFace model: {model_name}")
    print("   This may take several minutes depending on model size...\n")
    
    try:
        from transformers import AutoTokenizer, AutoModelForCausalLM
        import torch
        
        # Check GPU availability
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"   🖥️ Using device: {device}")
        if device == "cuda":
            print(f"   🎮 GPU: {torch.cuda.get_device_name(0)}")
            print(f"   💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
        
        print(f"\n   📦 Loading tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        print("   ✅ Tokenizer loaded")
        
        print(f"   📦 Loading model (this takes a while)...")
        device_map = "auto" if device == "cuda" else None
        torch_dtype = torch.float16 if device == "cuda" else torch.float32
        
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map=device_map,
            torch_dtype=torch_dtype,
            trust_remote_code=True,
            low_cpu_mem_usage=True
        )
        print("   ✅ Model loaded")
        
        # Store for later use
        globals()['HF_MODEL'] = model
        globals()['HF_TOKENIZER'] = tokenizer
        globals()['HF_DEVICE'] = device
        
        print(f"\n✅ Model '{model_name}' is ready!")
        return True
        
    except Exception as e:
        print(f"\n❌ Error loading model: {e}")
        return False

# Main setup logic
print("="*60)
print("🧠 MODEL SETUP")
print("="*60 + "\n")

if AI_PROVIDER == "ollama":
    setup_ollama()
    pull_ollama_model(MODEL_NAME)
elif AI_PROVIDER == "huggingface":
    setup_huggingface(MODEL_NAME)
else:
    print(f"❌ Unknown provider: {AI_PROVIDER}")

## Step 5: Define Transcription Classes 🎭

This cell contains all the core transcription logic with prosodic markers.
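
Two of the checks in this cell are simple enough to illustrate on their own: language detection by script ratio, and validation by marker presence. The following is a standalone sketch of both, written as plain functions rather than the class methods used in the cell. The 0.3 ratio and the 20-word cutoff are the values used in this notebook; the function names are chosen here for illustration.

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[a-zA-Z]")
PAUSE = re.compile(r"\[PAUSE-")
TONE = re.compile(r"\[TONE:")


def detect_language(text: str, threshold: float = 0.3) -> str:
    """Label text 'hindi' when more than `threshold` of its letters are Devanagari."""
    hindi = len(DEVANAGARI.findall(text))
    english = len(LATIN.findall(text))
    total = hindi + english
    if total == 0:
        return "english"  # no letters at all: fall back to English
    return "hindi" if hindi / total > threshold else "english"


def validation_issues(transcription: str, original: str) -> list:
    """List what is missing: pause markers (long inputs only) and tone markers."""
    issues = []
    if len(original.split()) > 20 and not PAUSE.search(transcription):
        issues.append("Missing pause markers for long text")
    if not TONE.search(transcription):
        issues.append("Missing tone markers")
    return issues


if __name__ == "__main__":
    print(detect_language("Hello world"))
    print(validation_issues("Hello world.", "Hello world."))
```

Note that the validation only checks that markers are present. It does not verify that the original words were preserved, so a transcription can pass while still paraphrasing the input.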

In [None]:
import os
import json
import time
import re
from pathlib import Path
from datetime import datetime
from collections import OrderedDict

# Try to import dependencies
try:
    import ollama
    OLLAMA_AVAILABLE = True
except ImportError:
    OLLAMA_AVAILABLE = False

try:
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch
    HF_AVAILABLE = True
except ImportError:
    HF_AVAILABLE = False


class TTSOptimizedPrompts:
    """Prompts specifically designed for TTS-optimized transcription generation."""
    
    SYSTEM_PROMPT_HINDI = """आप एक विशेषज्ञ TTS स्क्रिप्ट लेखक हैं। आपका काम टेक्स्ट को TTS-अनुकूल ट्रांसक्रिप्शन में बदलना है जो मानव-जैसी आवाज़ बनाएगा।

**आपका लक्ष्य**: एक ट्रांसक्रिप्शन बनाना जिसे TTS मॉडल पढ़ेगा और वह ऐसा लगेगा जैसे कोई असली इंसान भावनाओं, टोन और प्राकृतिक ठहराव के साथ पढ़ रहा है।

**PROSODIC MARKERS जोड़ें**:

1. **PAUSES** (ठहराव):
   - [PAUSE-SHORT] = 0.3s (वाक्यांशों के बीच)
   - [PAUSE-MEDIUM] = 0.6s (वाक्यों के बीच)
   - [PAUSE-LONG] = 1.0s (विचार बदलते समय)
   - [BREATH] = प्राकृतिक सांस

2. **TONE/EMOTION**:
   - [TONE: thoughtful], [TONE: curious], [TONE: serious]
   - [TONE: calm], [TONE: excited], [TONE: mysterious]
   - [TONE: warm], [TONE: dramatic]

3. **EMPHASIS**: [EMPHASIS: शब्द], [STRESS: शब्द]

4. **PACING**: [PACE: slow], [PACE: normal], [PACE: fast]

**GOLDEN RULES**:
1. ✅ मूल शब्दों को रखें - कुछ भी न बदलें
2. ✅ प्रासंगिक prosodic markers जोड़ें (3-5 प्रति वाक्य)
3. ❌ कोई व्याख्या, सारांश या अतिरिक्त विवरण नहीं"""

    SYSTEM_PROMPT_ENGLISH = """You are an expert TTS script writer. Your job is to transform text into TTS-optimized transcription that will produce human-like voice.

**YOUR GOAL**: Create a transcription that a TTS model will read and sound like a real human reading with emotion, tone, and natural pauses.

**ADD PROSODIC MARKERS**:

1. **PAUSES**:
   - [PAUSE-SHORT] = 0.3s (between phrases)
   - [PAUSE-MEDIUM] = 0.6s (between sentences)
   - [PAUSE-LONG] = 1.0s (changing thoughts)
   - [BREATH] = natural breath

2. **TONE/EMOTION**:
   - [TONE: thoughtful], [TONE: curious], [TONE: serious]
   - [TONE: calm], [TONE: excited], [TONE: mysterious]
   - [TONE: warm], [TONE: dramatic]

3. **EMPHASIS**: [EMPHASIS: word], [STRESS: word]

4. **PACING**: [PACE: slow], [PACE: normal], [PACE: fast]

**GOLDEN RULES**:
1. ✅ Keep original words - change NOTHING
2. ✅ Add appropriate prosodic markers (3-5 per sentence)
3. ❌ NO interpretation, summary, or extra details"""

    NARRATION_TEMPLATE_HINDI = """नीचे दिया गया टेक्स्ट को TTS-अनुकूल ट्रांसक्रिप्शन में बदलें।

**INPUT TEXT**:
\"\"\"
{text}
\"\"\"

**TTS-OPTIMIZED TRANSCRIPTION**:"""

    NARRATION_TEMPLATE_ENGLISH = """Transform the text below into TTS-optimized transcription.

**INPUT TEXT**:
\"\"\"
{text}
\"\"\"

**TTS-OPTIMIZED TRANSCRIPTION**:"""

    @staticmethod
    def detect_language(text):
        """Detect if text is primarily Hindi or English."""
        hindi_chars = len(re.findall(r'[\u0900-\u097F]', text))
        english_chars = len(re.findall(r'[a-zA-Z]', text))
        total_chars = hindi_chars + english_chars
        if total_chars == 0:
            return "english"
        hindi_ratio = hindi_chars / total_chars
        return "hindi" if hindi_ratio > 0.3 else "english"


class TranscriptionValidator:
    """Validate that transcription is TTS-optimized."""
    
    @staticmethod
    def validate(transcription, original_text):
        """Check if transcription is properly formatted for TTS."""
        issues = []
        
        has_pause = bool(re.search(r'\[PAUSE-', transcription))
        has_tone = bool(re.search(r'\[TONE:', transcription))
        
        if not has_pause and len(original_text.split()) > 20:
            issues.append("Missing pause markers for long text")
        
        if not has_tone:
            issues.append("Missing tone markers")
        
        is_valid = len(issues) == 0
        return is_valid, issues
    
    @staticmethod
    def count_markers(transcription):
        """Count prosodic markers."""
        markers = {
            'pause': len(re.findall(r'\[PAUSE-', transcription)),
            'tone': len(re.findall(r'\[TONE:', transcription)),
            'emphasis': len(re.findall(r'\[EMPHASIS:', transcription)),
            'pace': len(re.findall(r'\[PACE:', transcription)),
            'breath': len(re.findall(r'\[BREATH\]', transcription))
        }
        return markers


class RepetitionRemover:
    """Remove repetitive content from narration."""
    
    @staticmethod
    def remove_repetitions(text):
        """Remove repeated sentences and phrases."""
        sentences = re.split(r'(?<=[.!?।])\s+', text)
        seen = OrderedDict()
        
        for sent in sentences:
            sent = sent.strip()
            if not sent:
                continue
            key = ' '.join(sent.split()[:10]).lower()
            if key not in seen:
                seen[key] = sent
        
        return ' '.join(seen.values())
    
    @staticmethod
    def remove_meta_commentary(text):
        """Remove sentences that discuss the text rather than narrate it."""
        meta_patterns = [
            r'यह.*?(दर्शाता|रेखांकित|स्थापित|विस्तारित).*?है',
            r'This.*?(shows|demonstrates|establishes|highlights)',
            r'The author.*?(suggests|implies|indicates)',
        ]
        
        sentences = re.split(r'(?<=[.!?।])\s+', text)
        filtered = []
        
        for sent in sentences:
            is_meta = False
            for pattern in meta_patterns:
                if re.search(pattern, sent, re.IGNORECASE):
                    is_meta = True
                    break
            if not is_meta:
                filtered.append(sent)
        
        return ' '.join(filtered)


class TTSOptimizedNarrator:
    """Generate TTS-optimized transcriptions."""
    
    def __init__(self, provider="huggingface", model_name=None, language="auto"):
        self.provider = provider
        self.model_name = model_name
        self.language = language
        self.model = None
        self.tokenizer = None
        self.device = "cpu"
        self.prompts = TTSOptimizedPrompts()
        self.validator = TranscriptionValidator()
        self.repetition_remover = RepetitionRemover()
        
        print(f"🎭 Initializing TTS-Optimized Narrator...")
        print(f"   Provider: {self.provider}")
        print(f"   Model: {self.model_name}")
        print(f"   Language: {language}")
        
        self._load_model()
    
    def _load_model(self):
        """Load the LLM model."""
        if self.provider == "ollama":
            if not OLLAMA_AVAILABLE:
                raise ImportError("Ollama not installed. Install: pip install ollama")
            try:
                ollama.list()
                print("✅ Ollama connection successful")
            except Exception as e:
                raise RuntimeError(f"Cannot connect to Ollama: {e}")
        
        elif self.provider == "huggingface":
            if not HF_AVAILABLE:
                raise ImportError("Transformers not installed.")
            
            # Use pre-loaded model from Step 4 if available
            if 'HF_MODEL' in globals() and 'HF_TOKENIZER' in globals():
                self.model = globals()['HF_MODEL']
                self.tokenizer = globals()['HF_TOKENIZER']
                self.device = globals().get('HF_DEVICE', 'cpu')
                print("✅ Using pre-loaded HuggingFace model")
            else:
                # Load fresh
                import torch
                self.device = "cuda" if torch.cuda.is_available() else "cpu"
                print(f"Loading HuggingFace model: {self.model_name}")
                
                self.tokenizer = AutoTokenizer.from_pretrained(
                    self.model_name, trust_remote_code=True
                )
                
                device_map = "auto" if self.device == "cuda" else None
                torch_dtype = torch.float16 if self.device == "cuda" else torch.float32
                
                self.model = AutoModelForCausalLM.from_pretrained(
                    self.model_name,
                    device_map=device_map,
                    torch_dtype=torch_dtype,
                    trust_remote_code=True,
                    low_cpu_mem_usage=True
                )
                print("✅ HuggingFace model loaded")
    
    def narrate_text(self, text, max_retries=2):
        """Generate TTS-optimized transcription."""
        detected_lang = self.prompts.detect_language(text)
        lang = self.language if self.language != "auto" else detected_lang
        
        if lang == "hindi":
            system_prompt = self.prompts.SYSTEM_PROMPT_HINDI
            user_prompt = self.prompts.NARRATION_TEMPLATE_HINDI.format(text=text)
        else:
            system_prompt = self.prompts.SYSTEM_PROMPT_ENGLISH
            user_prompt = self.prompts.NARRATION_TEMPLATE_ENGLISH.format(text=text)
        
        for attempt in range(max_retries + 1):
            try:
                if self.provider == "ollama":
                    response = ollama.generate(
                        model=self.model_name,
                        prompt=f"{system_prompt}\n\n{user_prompt}",
                        options={
                            "temperature": 0.3,
                            "top_p": 0.9,
                            "num_predict": 2048,
                        }
                    )
                    narration = response['response'].strip()
                
                elif self.provider == "huggingface":
                    # Try chat template first
                    try:
                        messages = [
                            {"role": "system", "content": system_prompt},
                            {"role": "user", "content": user_prompt}
                        ]
                        input_text = self.tokenizer.apply_chat_template(
                            messages,
                            tokenize=False,
                            add_generation_prompt=True
                        )
                    except Exception:
                        # Fallback to a plain prompt (no chat template, or no system-role support)
                        input_text = f"{system_prompt}\n\n{user_prompt}"
                    
                    inputs = self.tokenizer(input_text, return_tensors="pt")
                    if self.device == "cuda":
                        inputs = {k: v.to("cuda") for k, v in inputs.items()}
                    
                    import torch
                    with torch.no_grad():
                        outputs = self.model.generate(
                            **inputs,
                            max_new_tokens=2048,
                            temperature=0.3,
                            top_p=0.9,
                            do_sample=True
                        )
                    
                    # Decode only the newly generated tokens, not the echoed prompt
                    prompt_len = inputs["input_ids"].shape[1]
                    narration = self.tokenizer.decode(
                        outputs[0][prompt_len:], skip_special_tokens=True
                    ).strip()
                    # Drop the template header if the model repeats it
                    if "TTS-OPTIMIZED TRANSCRIPTION" in narration:
                        narration = narration.split("TTS-OPTIMIZED TRANSCRIPTION")[-1]
                        narration = narration.lstrip("*: \n").strip()
                
                # Clean up
                narration = self.repetition_remover.remove_repetitions(narration)
                narration = self.repetition_remover.remove_meta_commentary(narration)
                
                # Validate
                is_valid, issues = self.validator.validate(narration, text)
                
                if is_valid or attempt == max_retries:
                    markers = self.validator.count_markers(narration)
                    return narration, is_valid, lang, markers
                
            except Exception as e:
                if attempt == max_retries:
                    print(f"\n⚠️ Error: {e}")
                    return text, False, lang, {}
        
        return text, False, lang, {}


class TextPreprocessor:
    """Preprocess text for TTS generation."""
    
    def split_into_chapters(self, text):
        """Split text into chapters."""
        chapter_pattern = r'(?:^|\n)(?:Chapter|CHAPTER|अध्याय)\s+(\d+|[IVX]+)(?:\s*[-:.]\s*(.+?))?(?=\n|$)'
        
        matches = list(re.finditer(chapter_pattern, text, re.MULTILINE | re.IGNORECASE))
        
        if not matches:
            return [{
                'number': 1,
                'title': 'Full Text',
                'content': text.strip()
            }]
        
        chapters = []
        
        for i, match in enumerate(matches):
            chapter_num = match.group(1)
            chapter_title = match.group(2) or ""
            
            start_pos = match.end()
            end_pos = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            
            content = text[start_pos:end_pos].strip()
            
            chapters.append({
                'number': chapter_num,
                'title': chapter_title.strip() or f"Chapter {chapter_num}",
                'content': content
            })
        
        return chapters
    
    def split_into_sentences(self, text):
        """Split into sentences (Hindi + English)."""
        sentences = re.split(r'(?<=[.!?।])\s+(?=[A-ZА-Я"\u0900-\u097F])', text)
        return [s.strip() for s in sentences if s.strip()]
    
    def create_chunks(self, sentences, chunk_size=6, overlap=1):
        """Create smaller overlapping chunks for better TTS quality."""
        chunks = []
        i = 0
        
        while i < len(sentences):
            chunk_sentences = sentences[i:i + chunk_size]
            chunk_text = ' '.join(chunk_sentences)
            
            chunks.append({
                'text': chunk_text,
                'start_idx': i,
                'end_idx': i + len(chunk_sentences)
            })
            
            i += max(1, chunk_size - overlap)
        
        return chunks

print("✅ All transcription classes defined successfully!")

## Step 6: Generate Transcription 🎙️

This cell processes your text file and generates TTS-optimized transcription with prosodic markers.
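
Each chapter is split into sentences and grouped into chunks before being sent to the model. The sketch below reproduces the index arithmetic of `create_chunks` on sentence counts alone, to show which sentences land in which chunk. `chunk_ranges` is a hypothetical helper written for this illustration, not part of the notebook's classes.

```python
def chunk_ranges(n_sentences: int, chunk_size: int = 6, overlap: int = 1) -> list:
    """Return (start, end) sentence index pairs, end exclusive, one pair per chunk."""
    step = max(1, chunk_size - overlap)  # always advance by at least one sentence
    ranges = []
    start = 0
    while start < n_sentences:
        end = min(start + chunk_size, n_sentences)
        ranges.append((start, end))
        start += step
    return ranges


if __name__ == "__main__":
    # With overlap=1, sentence 5 appears in both of the first two chunks.
    print(chunk_ranges(12, chunk_size=6, overlap=1))
    # With overlap=0, every sentence appears exactly once.
    print(chunk_ranges(12, chunk_size=6, overlap=0))
```

With `overlap=1`, each chunk starts on the last sentence of the previous one, so that sentence is narrated twice when the chunk outputs are concatenated verbatim. With `overlap=0`, the chunks partition the text and no sentence repeats.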

In [None]:
from pathlib import Path
from datetime import datetime
import time

print("="*60)
print("🎙️ TTS-OPTIMIZED TRANSCRIPTION GENERATOR")
print("="*60)

# Check if file is uploaded
if 'INPUT_FILE' not in globals() or not os.path.exists(INPUT_FILE):
    print("❌ Error: No file uploaded. Please run Step 2 first.")
else:
    print(f"\n📖 Reading: {INPUT_FILE}")
    with open(INPUT_FILE, 'r', encoding='utf-8') as f:
        text = f.read().strip()
    
    primary_lang = TTSOptimizedPrompts.detect_language(text)
    print(f"🌍 Detected language: {primary_lang.upper()}")
    
    # Initialize narrator
    narrator = TTSOptimizedNarrator(
        provider=AI_PROVIDER,
        model_name=MODEL_NAME,
        language=LANGUAGE
    )
    
    preprocessor = TextPreprocessor()
    validator = TranscriptionValidator()
    
    # Process chapters
    chapters = preprocessor.split_into_chapters(text)
    print(f"✅ Found {len(chapters)} chapters")
    
    # Store results
    transcription_data = {
        "metadata": {
            "source_file": INPUT_FILE,
            "generated_at": datetime.now().isoformat(),
            "primary_language": primary_lang,
            "total_chapters": len(chapters),
            "narrator_model": MODEL_NAME,
            "chunk_size": CHUNK_SIZE
        },
        "chapters": []
    }
    
    total_start = time.time()
    successful = 0
    total_chunks = 0
    total_markers = {'pause': 0, 'tone': 0, 'emphasis': 0, 'pace': 0, 'breath': 0}
    
    for ch_idx, chapter in enumerate(chapters, 1):
        print(f"\n{'='*60}")
        print(f"📖 Chapter {ch_idx}/{len(chapters)}: {chapter['title']}")
        print(f"{'='*60}")
        
        sentences = preprocessor.split_into_sentences(chapter['content'])
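        # overlap=0: chunk outputs are concatenated verbatim below, so an overlapping
        # sentence would otherwise appear twice in the final transcription
        chunks = preprocessor.create_chunks(sentences, chunk_size=CHUNK_SIZE, overlap=0)
        
        print(f"📦 Processing {len(chunks)} chunks...")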
        total_chunks += len(chunks)
        
        narrated_chunks = []
        
        for c_idx, chunk in enumerate(chunks, 1):
            print(f"   🎙️ Chunk {c_idx}/{len(chunks)}... ", end="", flush=True)
            
            start_time = time.time()
            narration, is_valid, lang, markers = narrator.narrate_text(chunk['text'])
            elapsed = time.time() - start_time
            
            # Update marker counts
            for key in total_markers:
                total_markers[key] += markers.get(key, 0)
            
            if is_valid:
                successful += 1
                marker_str = f"P:{markers.get('pause',0)} T:{markers.get('tone',0)} E:{markers.get('emphasis',0)}"
                print(f"✅ [{lang}] {marker_str} ({elapsed:.1f}s)")
            else:
                print(f"⚠️ Fallback [{lang}] ({elapsed:.1f}s)")
            
            narrated_chunks.append({
                "chunk_number": c_idx,
                "original_text": chunk['text'],
                "tts_transcription": narration,
                "language": lang,
                "is_valid": is_valid,
                "markers": markers
            })
        
        transcription_data["chapters"].append({
            "chapter_number": ch_idx,
            "title": chapter['title'],
            "chunks": narrated_chunks
        })
    
    total_time = time.time() - total_start
    
    # Save files
    OUTPUT_DIR = Path("tts_transcriptions")
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    base_name = Path(INPUT_FILE).stem
    
    json_file = OUTPUT_DIR / f"{base_name}_tts_{timestamp}.json"
    txt_file = OUTPUT_DIR / f"{base_name}_tts_{timestamp}.txt"
    
    # Save detailed JSON
    with open(json_file, 'w', encoding='utf-8') as f:
        json.dump(transcription_data, f, ensure_ascii=False, indent=2)
    
    # Save clean TTS-ready text
    with open(txt_file, 'w', encoding='utf-8') as f:
        f.write("# TTS-OPTIMIZED TRANSCRIPTION\n")
        f.write(f"# Generated: {datetime.now().isoformat()}\n")
        f.write(f"# Language: {primary_lang}\n")
        f.write(f"# Total markers: {sum(total_markers.values())}\n")
        f.write("#" + "="*58 + "\n\n")
        
        for chapter in transcription_data["chapters"]:
            f.write(f"\n{'='*60}\n")
            f.write(f"CHAPTER {chapter['chapter_number']}: {chapter['title']}\n")
            f.write(f"{'='*60}\n\n")
            
            for chunk in chapter['chunks']:
                f.write(f"{chunk['tts_transcription']}\n\n")
    
    # Store file paths for download
    globals()['OUTPUT_TXT_FILE'] = str(txt_file)
    globals()['OUTPUT_JSON_FILE'] = str(json_file)
    
    # Print summary
    print(f"\n{'='*60}")
    print(f"🎉 TTS TRANSCRIPTION COMPLETE!")
    print(f"{'='*60}")
    # Guard against an empty input file, which yields zero chunks
    success_pct = 100 * successful / total_chunks if total_chunks else 0.0
    print(f"⏱️ Total time: {total_time/60:.2f} minutes")
    print(f"🌍 Language: {primary_lang.upper()}")
    print(f"📚 Chapters: {len(chapters)}")
    print(f"📦 Total chunks: {total_chunks}")
    print(f"✅ Successful: {successful}/{total_chunks} ({success_pct:.1f}%)")
    print(f"\n🎭 Prosodic Markers Added:")
    print(f"   Pauses: {total_markers['pause']}")
    print(f"   Tones: {total_markers['tone']}")
    print(f"   Emphasis: {total_markers['emphasis']}")
    print(f"   Pace: {total_markers['pace']}")
    print(f"   Breaths: {total_markers['breath']}")
    print(f"   Total: {sum(total_markers.values())}")
    print(f"\n💾 Files saved:")
    print(f"   📄 TXT: {txt_file}")
    print(f"   📊 JSON: {json_file}")

## Step 7: Download Generated Transcription 💾

Download your TTS-optimized transcription files.

In [None]:
from google.colab import files
import os

print("="*60)
print("💾 DOWNLOAD TRANSCRIPTION FILES")
print("="*60 + "\n")

if 'OUTPUT_TXT_FILE' not in globals():
    print("❌ No transcription generated yet. Please run Step 6 first.")
else:
    txt_file = globals()['OUTPUT_TXT_FILE']
    json_file = globals()['OUTPUT_JSON_FILE']
    
    if os.path.exists(txt_file):
        print(f"📄 TXT file ready: {txt_file}")
        
        # Show preview
        with open(txt_file, 'r', encoding='utf-8') as f:
            content = f.read()
        print(f"\n📖 Preview (first 1000 chars):")
        print("-"*50)
        print(content[:1000])
        if len(content) > 1000:
            print("...")
        print("-"*50)
        
        print("\n⬇️ Downloading TXT file...")
        files.download(txt_file)
    else:
        print(f"❌ TXT file not found: {txt_file}")
    
    print("\n" + "="*60)

In [None]:
# Optional: Download JSON file (contains detailed metadata)
from google.colab import files

if 'OUTPUT_JSON_FILE' in globals() and os.path.exists(globals()['OUTPUT_JSON_FILE']):
    print("⬇️ Downloading JSON file (detailed metadata)...")
    files.download(globals()['OUTPUT_JSON_FILE'])
else:
    print("❌ JSON file not available.")

---

## 📋 Notes

### Prosodic Markers Explained

| Marker | Description | Duration |
|--------|-------------|----------|
| `[PAUSE-SHORT]` | Brief pause between phrases | 0.3s |
| `[PAUSE-MEDIUM]` | Pause between sentences | 0.6s |
| `[PAUSE-LONG]` | Longer pause when the thought changes | 1.0s |
| `[BREATH]` | Natural breath sound | - |
| `[TONE: X]` | Emotional tone (calm, excited, etc.) | - |
| `[EMPHASIS: word]` | Stress on specific word | - |
| `[PACE: X]` | Reading speed (slow, normal, fast) | - |
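
Because the markers follow a fixed bracket syntax, downstream code can read them with a single regular expression. The sketch below, which assumes the marker names and durations from the table above, totals the pause time a transcription implies and strips the markers to recover plain text. `total_pause_seconds` and `strip_markers` are hypothetical helpers, and how a particular TTS engine consumes the markers is outside the scope of this notebook.

```python
import re

# Durations taken from the marker table; markers without a duration count as 0.
PAUSE_SECONDS = {"PAUSE-SHORT": 0.3, "PAUSE-MEDIUM": 0.6, "PAUSE-LONG": 1.0}

# Matches [NAME] and [NAME: value]; group 1 is the name, group 2 the optional value.
MARKER = re.compile(r"\[([A-Z-]+)(?::\s*([^\]]*))?\]")

# Markers whose value is a word from the original text and must be kept.
WORD_MARKERS = ("EMPHASIS", "STRESS")


def total_pause_seconds(transcription: str) -> float:
    """Sum the pause durations implied by all pause markers."""
    return sum(PAUSE_SECONDS.get(m.group(1), 0.0) for m in MARKER.finditer(transcription))


def strip_markers(transcription: str) -> str:
    """Remove all markers, keeping the word inside EMPHASIS and STRESS markers."""
    def keep_word(match):
        if match.group(1) in WORD_MARKERS and match.group(2):
            return match.group(2)
        return ""
    plain = MARKER.sub(keep_word, transcription)
    return re.sub(r"\s+", " ", plain).strip()


if __name__ == "__main__":
    sample = "[TONE: calm] Hello [PAUSE-SHORT] there. [PAUSE-LONG] [EMPHASIS: Goodbye] now."
    print(total_pause_seconds(sample))
    print(strip_markers(sample))
```

Stripping the markers and comparing the result with the input chunk is one way to check that the model kept the original words.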

### Provider Comparison

| Feature | HuggingFace | Ollama |
|---------|-------------|--------|
| **Setup** | Easy (recommended) | Requires server setup |
| **Speed on Colab** | Fast with GPU | Moderate |
| **Best Models** | gemma-2-2b-it, Airavata | gemma2:9b, aya:8b |
| **Memory** | GPU memory (CPU/RAM without a GPU) | GPU memory if a GPU is detected, otherwise CPU/RAM |

---

✨ **Tip**: Feed the generated transcription to your TTS model for natural, human-like audio!