# 🎙️ Human-Like TTS Generator for Google Colab

**Enhanced Multilingual TTS with Emotional Narration Support (Hindi + English)**

This notebook allows you to:
1. Upload your transcription file
2. Select AI provider and model
3. Generate TTS audio and download it

**Supported Models:**
- 🌳 `suno/bark` - Best for emotions (recommended)
- 📢 `facebook/mms-tts-hin` - Fast Hindi TTS
- 🎤 `microsoft/speecht5_tts` - English TTS
- 🇮🇳 AI4Bharat models (require authentication)

## 📦 Step 1: Install Dependencies
Run this cell to install all required packages.
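
pydub relies on an external `ffmpeg` (or `avconv`) binary for MP3 encoding, so the export in Step 5 fails if the `apt-get` step did not succeed. A minimal sketch for checking this after installation, using only the standard library; the helper name `has_ffmpeg` is illustrative and not part of the notebook:

```python
import shutil

def has_ffmpeg() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

if has_ffmpeg():
    print("ffmpeg found")
else:
    print("ffmpeg missing: re-run the install cell before exporting MP3")
```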

In [None]:
# Install required packages
!pip install -q torch transformers soundfile pydub numpy huggingface-hub accelerate scipy
!pip install -q datasets  # For speaker embeddings
!apt-get -qq install -y ffmpeg  # For audio processing
print("✅ All dependencies installed!")

## 📤 Step 2: Upload Your Transcription File
Upload your text/JSON file containing the transcription.
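
For JSON input, the engine in Step 4 reads a top-level `chapters` list, where each chapter has an optional `title` and a `chunks` list whose items carry a `narration` string. The sketch below builds a sample in that shape and flattens it the same way the engine does; the sample text and the name `flatten` are made up for illustration.

```python
import json

# Hypothetical sample in the shape read by _extract_text_from_json in Step 4.
sample = {
    "chapters": [
        {
            "title": "Chapter 1",
            "chunks": [
                {"narration": "[TONE: calm] Once upon a time, in a quiet village."},
                {"narration": "[TONE: excited] Then everything changed!"},
            ],
        }
    ]
}

def flatten(data: dict) -> str:
    """Join titles and narration chunks, adding pause markers between them."""
    parts = []
    for chapter in data.get("chapters", []):
        if chapter.get("title"):
            parts.append(f"[PAUSE-SHORT] {chapter['title']} [PAUSE-MEDIUM]")
        for chunk in chapter.get("chunks", []):
            if chunk.get("narration"):
                parts.append(chunk["narration"])
                parts.append("[PAUSE-SHORT]")
    return "\n\n".join(parts)

# This is the JSON you would save to a .json file and upload.
print(json.dumps(sample, ensure_ascii=False, indent=2))
# This is the marked-up text the engine would then parse.
print(flatten(sample))
```

Plain `.txt` files skip this step and are read as-is, markers included.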

In [None]:
from google.colab import files
import os

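print("📤 Please upload your transcription file (.txt or .json):")
uploaded = files.upload()

# Stop early if the upload dialog was cancelled
if not uploaded:
    raise ValueError("No file uploaded. Please run this cell again.")

# Get the uploaded file name (only the first file is used)
UPLOADED_FILE = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {UPLOADED_FILE}")
print(f"📄 File size: {len(uploaded[UPLOADED_FILE])} bytes")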

# Display preview
with open(UPLOADED_FILE, 'r', encoding='utf-8') as f:
    content = f.read()
print(f"\n📝 Preview (first 500 chars):\n{content[:500]}...")

## ⚙️ Step 3: Select AI Provider and Model
Choose your preferred TTS model and settings.
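
The engine in Step 4 picks a loader from the model name by substring matching. This matters when you enter a custom model: a name that matches none of the known patterns is treated as VITS. A standalone sketch of that rule, with an illustrative function name:

```python
def detect_model_type(model_name: str) -> str:
    """Mirror the notebook's name-based loader selection."""
    name = model_name.lower()
    if "bark" in name:
        return "bark"
    if "ai4bharat" in name or "indic" in name:
        return "ai4bharat"
    if "speecht5" in name:
        return "speecht5"
    # 'mms-tts' and 'vits' map to VITS, and so does anything unrecognized.
    return "vits"

for name in ["suno/bark-small", "facebook/mms-tts-hin",
             "microsoft/speecht5_tts", "ai4bharat/indic-parler-tts"]:
    print(f"{name} -> {detect_model_type(name)}")
```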

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML

# Model options with descriptions
MODEL_OPTIONS = {
    "suno/bark (Best for emotions, supports Hindi)": "suno/bark",
    "facebook/mms-tts-hin (Fast Hindi TTS)": "facebook/mms-tts-hin",
    "microsoft/speecht5_tts (English TTS)": "microsoft/speecht5_tts",
    "ai4bharat/indic-parler-tts (Hindi - requires auth)": "ai4bharat/indic-parler-tts",
    "Custom Model (enter below)": "custom"
}

# AI Provider dropdown
model_dropdown = widgets.Dropdown(
    options=list(MODEL_OPTIONS.keys()),
    value="suno/bark (Best for emotions, supports Hindi)",
    description='Model:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

# Custom model input
custom_model_input = widgets.Text(
    value='',
    placeholder='Enter HuggingFace model name (e.g., suno/bark-small)',
    description='Custom Model:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

# HuggingFace token (for gated models)
hf_token_input = widgets.Password(
    value='',
    placeholder='Optional: Enter HF token for gated models',
    description='HF Token:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

# Language selection
language_dropdown = widgets.Dropdown(
    options=['auto (detect)', 'hi (Hindi)', 'en (English)'],
    value='auto (detect)',
    description='Language:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='300px')
)

display(HTML("<h3>🎛️ Select Your TTS Configuration</h3>"))
display(model_dropdown)
display(custom_model_input)
display(hf_token_input)
display(language_dropdown)

print("\n💡 Tip: suno/bark is recommended for emotional Hindi narration!")

In [None]:
# Store the selected configuration
selected_model_key = model_dropdown.value
SELECTED_MODEL = MODEL_OPTIONS[selected_model_key]

if SELECTED_MODEL == "custom":
    # Ignore stray whitespace around a pasted model name
    SELECTED_MODEL = custom_model_input.value.strip()
    if not SELECTED_MODEL:
        raise ValueError("Please enter a custom model name!")

HF_TOKEN = hf_token_input.value if hf_token_input.value else None

LANGUAGE = language_dropdown.value.split()[0]  # Extract 'auto', 'hi', or 'en'

print(f"\n✅ Configuration saved:")
print(f"   📦 Model: {SELECTED_MODEL}")
print(f"   🌐 Language: {LANGUAGE}")
print(f"   🔑 HF Token: {'Provided' if HF_TOKEN else 'Not provided'}")

## 🚀 Step 4: TTS Engine Setup
This cell contains the complete TTS engine code.
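
When the language is set to `auto`, the engine decides between Hindi and English with a character-count heuristic: text counts as Hindi when Devanagari characters make up more than 30% of the Devanagari plus Latin characters. The sketch below reproduces that rule on its own; exposing the cutoff as a `threshold` parameter is an addition for illustration, since the notebook hard-codes 0.3.

```python
import re

def detect_language(text: str, threshold: float = 0.3) -> str:
    """Return 'hi' if Devanagari characters exceed `threshold` of all counted letters."""
    hindi = len(re.findall(r"[\u0900-\u097F]", text))   # Devanagari block
    english = len(re.findall(r"[a-zA-Z]", text))        # ASCII letters
    total = hindi + english
    if total == 0:
        return "en"  # only digits or punctuation: fall back to English
    return "hi" if hindi / total > threshold else "en"

for sample in ["Hello world", "नमस्ते दुनिया", "AI मॉडल", "12345"]:
    print(repr(sample), "->", detect_language(sample))
```

Because the Devanagari range includes vowel signs and other combining marks, the ratio is only approximate for mixed Hindi and English text.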

In [None]:
#!/usr/bin/env python3
"""
Enhanced Multilingual TTS Engine for Google Colab
Supports: Bark, VITS, SpeechT5, AI4Bharat models
"""

import os
import sys
import json
import time
import warnings
import re
from pathlib import Path
from datetime import datetime
import logging

# Suppress warnings
os.environ['TRANSFORMERS_VERBOSITY'] = 'error'
warnings.filterwarnings("ignore")
logging.getLogger("transformers").setLevel(logging.ERROR)

import torch
import numpy as np
import soundfile as sf
from transformers import (
    AutoProcessor, AutoModel, AutoTokenizer,
    VitsModel, VitsTokenizer,
    BarkModel, BarkProcessor,
    SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
)
from huggingface_hub import login
from pydub import AudioSegment
from pydub.effects import normalize, compress_dynamic_range

# Check GPU availability
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️ Using device: {DEVICE}")
if DEVICE == "cuda":
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")


class TranscriptionParser:
    """Parse human-like transcription with language-agnostic markers."""
    
    def __init__(self):
        self.tone_markers = re.compile(r'\[TONE:\s*(\w+)\]')
        self.pause_markers = re.compile(r'\[PAUSE-(SHORT|MEDIUM|LONG)\]')
        self.pronounce_markers = re.compile(r'(\S+)\s*\[PRONOUNCE:\s*([^\]]+)\]')
        self.emphasis_markers = re.compile(r'\[EMPHASIS:\s*([^\]]+)\]')
    
    def parse(self, text):
        """Parse transcription and extract emotional/contextual information."""
        segments = []
        current_pos = 0
        current_tone = "neutral"
        
        markers = []
        
        for match in self.tone_markers.finditer(text):
            markers.append(('tone', match.start(), match.end(), match.group(1)))
        
        for match in self.pause_markers.finditer(text):
            markers.append(('pause', match.start(), match.end(), match.group(1)))
        
        markers.sort(key=lambda x: x[1])
        
        for marker_type, start, end, value in markers:
            if start > current_pos:
                segment_text = text[current_pos:start].strip()
                if segment_text:
                    segment_text = self._clean_markers(segment_text)
                    segments.append({
                        'text': segment_text,
                        'tone': current_tone,
                        'type': 'speech'
                    })
            
            if marker_type == 'tone':
                current_tone = value.lower()  # voice presets are keyed in lowercase
            elif marker_type == 'pause':
                pause_duration = {
                    'SHORT': 0.3,
                    'MEDIUM': 0.6,
                    'LONG': 1.0
                }.get(value, 0.5)
                
                segments.append({
                    'text': '',
                    'duration': pause_duration,
                    'type': 'pause'
                })
            
            current_pos = end
        
        if current_pos < len(text):
            segment_text = text[current_pos:].strip()
            if segment_text:
                segment_text = self._clean_markers(segment_text)
                segments.append({
                    'text': segment_text,
                    'tone': current_tone,
                    'type': 'speech'
                })
        
        return segments
    
    def _clean_markers(self, text):
        """Remove all markers from text."""
        text = self.tone_markers.sub('', text)
        text = self.pause_markers.sub('', text)
        text = self.pronounce_markers.sub(r'\1', text)
        text = self.emphasis_markers.sub(r'\1', text)
        return text.strip()


class MultilingualTTSEngine:
    """TTS engine with multilingual support."""
    
    def __init__(self, model_name, model_type="auto", device="cuda", language="auto", hf_token=None):
        self.model_name = model_name
        self.model_type = model_type
        self.device = device
        self.language = language
        self.hf_token = hf_token
        self.model = None
        self.processor = None
        self.vocoder = None
        self.tokenizer = None
        
        if self.hf_token:
            os.environ['HF_TOKEN'] = self.hf_token
        
        if self.model_type == "auto":
            self.model_type = self._detect_model_type(model_name)
            print(f"🔍 Auto-detected model type: {self.model_type}")
        
        self.bark_voice_presets = {
            'neutral': 'v2/en_speaker_6',
            'happy': 'v2/en_speaker_9',
            'sad': 'v2/en_speaker_3',
            'excited': 'v2/en_speaker_9',
            'serious': 'v2/en_speaker_1',
            'thoughtful': 'v2/en_speaker_6',
            'angry': 'v2/en_speaker_1',
            'calm': 'v2/en_speaker_6',
            'worried': 'v2/en_speaker_3',
            'determined': 'v2/en_speaker_1',
            'curious': 'v2/en_speaker_9',
        }
        
        self.load_model()
    
    def _detect_model_type(self, model_name):
        """Auto-detect model type from model name."""
        model_lower = model_name.lower()
        
        if 'bark' in model_lower:
            return 'bark'
        elif 'ai4bharat' in model_lower or 'indic' in model_lower:
            return 'ai4bharat'
        elif 'speecht5' in model_lower:
            return 'speecht5'
        elif 'mms-tts' in model_lower or 'vits' in model_lower:
            return 'vits'
        else:
            print(f"⚠️ Could not auto-detect model type, defaulting to 'vits'")
            return 'vits'
    
    def load_model(self):
        """Load TTS model."""
        print(f"📥 Loading {self.model_type} model: {self.model_name}")
        
        if self.model_type == "bark":
            self._load_bark()
        elif self.model_type == "vits":
            self._load_vits()
        elif self.model_type == "speecht5":
            self._load_speecht5()
        elif self.model_type == "ai4bharat":
            self._load_ai4bharat()
        else:
            raise ValueError(f"Unknown model type: {self.model_type}")
        
        print("✅ Model loaded successfully!")
    
    def _load_bark(self):
        """Load Bark model."""
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            
            self.processor = BarkProcessor.from_pretrained(
                self.model_name,
                token=self.hf_token
            )
            self.model = BarkModel.from_pretrained(
                self.model_name,
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
                token=self.hf_token
            ).to(self.device)
            
            if hasattr(self.model, 'generation_config'):
                if self.model.generation_config.pad_token_id is None:
                    self.model.generation_config.pad_token_id = self.model.generation_config.eos_token_id
            
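            if self.device == "cuda":
                try:
                    # BetterTransformer needs the optional `optimum` package
                    self.model = self.model.to_bettertransformer()
                    print("   ✅ Optimized for GPU with BetterTransformer")
                except Exception:
                    # Optimization is optional; continue with the plain model
                    pass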
    
    def _load_vits(self):
        """Load VITS model."""
        print(f"   Language: {self.language}")
        try:
            self.tokenizer = VitsTokenizer.from_pretrained(
                self.model_name,
                token=self.hf_token
            )
            self.model = VitsModel.from_pretrained(
                self.model_name,
                token=self.hf_token
            ).to(self.device)
        except Exception as e:
            print(f"   VitsModel load failed ({e}); falling back to AutoModel...")
            self.processor = AutoProcessor.from_pretrained(
                self.model_name,
                token=self.hf_token
            )
            self.model = AutoModel.from_pretrained(
                self.model_name,
                token=self.hf_token
            ).to(self.device)
    
    def _load_speecht5(self):
        """Load SpeechT5 model."""
        self.processor = SpeechT5Processor.from_pretrained(
            self.model_name,
            token=self.hf_token
        )
        self.model = SpeechT5ForTextToSpeech.from_pretrained(
            self.model_name,
            token=self.hf_token
        ).to(self.device)
        self.vocoder = SpeechT5HifiGan.from_pretrained(
            "microsoft/speecht5_hifigan"
        ).to(self.device)
    
    def _load_ai4bharat(self):
        """Load AI4Bharat models."""
        print(f"   Loading AI4Bharat model...")
        
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_name,
                token=self.hf_token
            )
            self.processor = AutoProcessor.from_pretrained(
                self.model_name,
                token=self.hf_token
            )
            self.model = AutoModel.from_pretrained(
                self.model_name,
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
                token=self.hf_token
            ).to(self.device)
            
            if self.tokenizer.pad_token_id is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
                self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
            
        except Exception as e:
            print(f"   AutoModel load failed ({e}); trying as VITS model...")
            self.tokenizer = VitsTokenizer.from_pretrained(
                self.model_name,
                token=self.hf_token
            )
            self.model = VitsModel.from_pretrained(
                self.model_name,
                token=self.hf_token
            ).to(self.device)
    
    def detect_language(self, text):
        """Detect text language."""
        hindi_chars = len(re.findall(r'[\u0900-\u097F]', text))
        english_chars = len(re.findall(r'[a-zA-Z]', text))
        
        total_chars = hindi_chars + english_chars
        if total_chars == 0:
            return "en"
        
        hindi_ratio = hindi_chars / total_chars
        return "hi" if hindi_ratio > 0.3 else "en"
    
    def generate_with_emotion(self, text, tone="neutral", sample_rate=24000):
        """Generate audio with emotional context."""
        if self.language == "auto":
            detected_lang = self.detect_language(text)
        else:
            detected_lang = self.language
        
        if self.model_type == "bark":
            return self._generate_bark(text, tone, detected_lang)
        elif self.model_type == "vits":
            return self._generate_vits(text, detected_lang)
        elif self.model_type == "speecht5":
            return self._generate_speecht5(text)
        elif self.model_type == "ai4bharat":
            return self._generate_ai4bharat(text, tone, detected_lang)
        else:
            raise ValueError(f"Unknown model type: {self.model_type}")
    
    def _generate_bark(self, text, tone, language):
        """Generate audio using Bark."""
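        voice_preset = self.bark_voice_presets.get(tone, 'v2/en_speaker_6')
        if language == "hi":
            # Use Bark's Hindi speaker prompts for Hindi text
            voice_preset = voice_preset.replace('/en_speaker_', '/hi_speaker_')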
        
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            
            inputs = self.processor(
                text,
                voice_preset=voice_preset,
                return_tensors="pt"
            )
            
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            
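            # The processor pads its output and returns a matching attention
            # mask; only build one if it is missing, so padding stays masked.
            if 'input_ids' in inputs and 'attention_mask' not in inputs:
                inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])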
            
            with torch.no_grad():
                speech_output = self.model.generate(
                    **inputs,
                    do_sample=True,
                    pad_token_id=self.model.generation_config.pad_token_id
                )
            
            # Upcast first: fp16 output would lose precision when scaled to int16
            audio_array = speech_output.float().cpu().numpy().squeeze()
        
        return audio_array, self.model.generation_config.sample_rate
    
    def _generate_vits(self, text, language):
        """Generate audio using VITS."""
        try:
            inputs = self.tokenizer(text, return_tensors="pt", padding=True)
            input_ids = inputs['input_ids'].to(self.device)
            
            with torch.no_grad():
                output = self.model(input_ids)
            
            audio_array = output.waveform.cpu().numpy().squeeze()
            
        except Exception as e:
            inputs = self.processor(text, return_tensors="pt", padding=True)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            
            with torch.no_grad():
                output = self.model.generate(**inputs)
            
            audio_array = output.cpu().numpy().squeeze()
        
        sample_rate = getattr(self.model.config, 'sampling_rate', 22050)
        return audio_array, sample_rate
    
    def _generate_speecht5(self, text):
        """Generate audio using SpeechT5."""
        inputs = self.processor(text=text, return_tensors="pt", padding=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
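        # Load the speaker x-vector once and reuse it across segments
        if getattr(self, '_speaker_embeddings', None) is None:
            from datasets import load_dataset
            embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
            self._speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0).to(self.device)
        speaker_embeddings = self._speaker_embeddings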
        
        with torch.no_grad():
            speech = self.model.generate_speech(
                inputs["input_ids"],
                speaker_embeddings,
                vocoder=self.vocoder
            )
        
        audio_array = speech.cpu().numpy()
        return audio_array, 16000
    
    def _generate_ai4bharat(self, text, tone, language):
        """Generate audio using AI4Bharat models."""
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            
            inputs = self.processor(
                text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=512
            )
            
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            
            if 'attention_mask' not in inputs and 'input_ids' in inputs:
                inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
            
            with torch.no_grad():
                if hasattr(self.model, 'generate'):
                    output = self.model.generate(**inputs, max_length=2048)
                    audio_array = output.cpu().numpy().squeeze()
                else:
                    output = self.model(**inputs)
                    if hasattr(output, 'waveform'):
                        audio_array = output.waveform.cpu().numpy().squeeze()
                    elif hasattr(output, 'audio'):
                        audio_array = output.audio.cpu().numpy().squeeze()
                    else:
                        audio_array = output[0].cpu().numpy().squeeze()
            
            sample_rate = getattr(self.model.config, 'sampling_rate', 24000)
            return audio_array, sample_rate


class HumanLikeTTSGenerator:
    """Main generator class."""
    
    def __init__(self, model_name, model_type="auto", device="cuda", output_dir=".", language="auto", hf_token=None):
        self.model_name = model_name
        self.model_type = model_type
        self.device = device
        self.output_dir = Path(output_dir)
        self.language = language
        
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        self.parser = TranscriptionParser()
        self.engine = MultilingualTTSEngine(model_name, model_type, device, language, hf_token)
    
    def generate_from_transcription(self, transcription_file):
        """Generate audio from transcription file."""
        print(f"\n{'=' * 70}")
        print(f"🎬 HUMAN-LIKE TTS GENERATION")
        print(f"{'=' * 70}")
        print(f"📄 Input: {transcription_file}")
        # Report the engine's resolved type rather than the literal "auto"
        print(f"🤖 Model: {self.model_name} ({self.engine.model_type})")
        print(f"🖥️ Device: {self.device}")
        print(f"🌐 Language: {self.language}")
        print(f"{'=' * 70}\n")
        
        transcription_path = Path(transcription_file)
        
        if transcription_path.suffix.lower() == '.json':
            with open(transcription_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
                text = self._extract_text_from_json(data)
        else:
            with open(transcription_path, 'r', encoding='utf-8') as f:
                text = f.read()
        
        primary_lang = self.engine.detect_language(text)
        print(f"🔍 Detected primary language: {'Hindi' if primary_lang == 'hi' else 'English'}")
        
        print(f"\n📝 Parsing transcription...")
        segments = self.parser.parse(text)
        print(f"   Found {len(segments)} segments")
        
        print(f"\n🎵 Generating audio segments...")
        audio_segments = []
        
        start_time = time.time()
        
        for i, segment in enumerate(segments, 1):
            if segment['type'] == 'pause':
                duration_ms = int(segment['duration'] * 1000)
                silence = AudioSegment.silent(duration=duration_ms)
                audio_segments.append(silence)
                print(f"   [{i}/{len(segments)}] 🔇 Pause ({segment['duration']}s)")
            
            else:
                text = segment['text']
                tone = segment.get('tone', 'neutral')
                
                if not text.strip():
                    continue
                
                seg_lang = self.engine.detect_language(text)
                lang_label = "HI" if seg_lang == "hi" else "EN"
                
                display_text = text[:50] + "..." if len(text) > 50 else text
                print(f"   [{i}/{len(segments)}] 🎙️ [{lang_label}] ({tone}): {display_text}")
                
                try:
                    seg_start = time.time()
                    audio_array, sample_rate = self.engine.generate_with_emotion(text, tone)
                    seg_time = time.time() - seg_start
                    
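                    # Clamp to [-1, 1] first so out-of-range samples do not wrap around in int16
                    audio_array = np.clip(np.asarray(audio_array, dtype=np.float32), -1.0, 1.0)
                    audio_array = (audio_array * 32767).astype(np.int16)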
                    audio_seg = AudioSegment(
                        audio_array.tobytes(),
                        frame_rate=sample_rate,
                        sample_width=2,
                        channels=1
                    )
                    
                    audio_segments.append(audio_seg)
                    print(f"       ✅ Generated in {seg_time:.1f}s")
                
                except Exception as e:
                    print(f"       ⚠️ Failed: {e}")
                    audio_segments.append(AudioSegment.silent(duration=500))
                    continue
        
        print(f"\n🔗 Combining {len(audio_segments)} audio segments...")
        final_audio = AudioSegment.empty()
        for seg in audio_segments:
            final_audio += seg
        
        print(f"🎛️ Post-processing audio...")
        final_audio = self._post_process(final_audio)
        
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_file = self.output_dir / f"narration_{primary_lang}_{timestamp}.mp3"
        
        print(f"💾 Exporting to: {output_file}")
        final_audio.export(
            str(output_file),
            format="mp3",
            bitrate="192k",
            parameters=["-ar", "44100"]
        )
        
        total_time = time.time() - start_time
        duration_sec = len(final_audio) / 1000
        
        print(f"\n{'=' * 70}")
        print(f"🎉 AUDIO GENERATION COMPLETE!")
        print(f"{'=' * 70}")
        print(f"🌐 Language: {primary_lang.upper()}")
        print(f"⏱️ Generation time: {total_time/60:.2f} minutes")
        print(f"🎵 Audio duration: {duration_sec/60:.2f} minutes")
        print(f"⚡ Speed: {duration_sec/total_time:.2f}x realtime")
        print(f"📊 File size: {output_file.stat().st_size / 1e6:.2f} MB")
        print(f"💾 Output: {output_file}")
        print(f"{'=' * 70}")
        
        return str(output_file)
    
    def _extract_text_from_json(self, data):
        """Extract narration text from JSON transcription."""
        text_parts = []
        
        for chapter in data.get('chapters', []):
            if chapter.get('title'):
                text_parts.append(f"[PAUSE-SHORT] {chapter['title']} [PAUSE-MEDIUM]")
            
            for chunk in chapter.get('chunks', []):
                if chunk.get('narration'):
                    text_parts.append(chunk['narration'])
                    text_parts.append('[PAUSE-SHORT]')
        
        return '\n\n'.join(text_parts)
    
    def _post_process(self, audio):
        """Post-process audio for quality."""
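        target_dBFS = -20.0
        # A silent or empty track reports -inf dBFS; skip processing in that case
        if len(audio) == 0 or audio.dBFS == float("-inf"):
            return audio
        change_in_dBFS = target_dBFS - audio.dBFS
        audio = audio.apply_gain(change_in_dBFS)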
        
        audio = compress_dynamic_range(
            audio,
            threshold=-20.0,
            ratio=4.0,
            attack=5.0,
            release=50.0
        )
        
        audio = normalize(audio)
        return audio


print("✅ TTS Engine loaded and ready!")

## 🎵 Step 5: Generate TTS Audio
Run this cell to generate the audio from your transcription.
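
During generation, each segment's float waveform is scaled to 16-bit PCM before the segments are joined. Samples outside [-1, 1] wrap around when cast to a 16-bit integer, so clamping before scaling is the safer order. A dependency-free sketch of that conversion; the notebook itself works on NumPy arrays, and `to_pcm16` is a hypothetical helper:

```python
from array import array

def to_pcm16(samples):
    """Clamp floats to [-1.0, 1.0] and scale to signed 16-bit integers."""
    pcm = array("h")  # 'h' = signed short, 16 bits on common platforms
    for x in samples:
        clamped = max(-1.0, min(1.0, x))
        pcm.append(int(clamped * 32767))
    return pcm

pcm = to_pcm16([0.0, 0.5, 1.0, -1.0, 1.7])  # 1.7 is out of range on purpose
print(list(pcm))
print(len(pcm.tobytes()), "bytes of raw audio")
```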

In [None]:
# Create output directory
OUTPUT_DIR = "./tts_output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize the generator
print("🚀 Initializing TTS Generator...")
generator = HumanLikeTTSGenerator(
    model_name=SELECTED_MODEL,
    model_type="auto",
    device=DEVICE,
    output_dir=OUTPUT_DIR,
    language=LANGUAGE,
    hf_token=HF_TOKEN
)

# Generate audio
print(f"\n🎙️ Starting TTS generation...")
OUTPUT_FILE = generator.generate_from_transcription(UPLOADED_FILE)

print(f"\n✅ Audio file generated: {OUTPUT_FILE}")

## 🎧 Step 6: Preview & Download Audio
Listen to your generated audio and download it.

In [None]:
from IPython.display import Audio, display, HTML
import os

if os.path.exists(OUTPUT_FILE):
    # Display audio player
    print("🎧 Preview your generated audio:")
    display(Audio(OUTPUT_FILE))
    
    # Get file info
    file_size = os.path.getsize(OUTPUT_FILE) / (1024 * 1024)
    print(f"\n📊 File size: {file_size:.2f} MB")
else:
    print("❌ Output file not found. Please run the generation step again.")

In [None]:
# Download the generated audio file
from google.colab import files

print("📥 Downloading your TTS audio file...")
files.download(OUTPUT_FILE)
print("✅ Download started! Check your browser's download folder.")

## 💾 (Optional) Save to Google Drive
Run this cell if you want to save the audio to your Google Drive instead of downloading it.

In [None]:
# Mount Google Drive
from google.colab import drive
import shutil

print("📂 Mounting Google Drive...")
drive.mount('/content/drive')

# Create TTS output folder in Drive
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/TTS_Output"
os.makedirs(DRIVE_OUTPUT_DIR, exist_ok=True)

# Copy file to Drive
drive_output_path = os.path.join(DRIVE_OUTPUT_DIR, os.path.basename(OUTPUT_FILE))
shutil.copy(OUTPUT_FILE, drive_output_path)

print(f"\n✅ Audio saved to Google Drive:")
print(f"   📁 {drive_output_path}")

---

## 📚 Quick Reference

### Supported Markers in Transcription Files:
- `[TONE: happy/sad/excited/serious/thoughtful/angry/calm/worried/determined/curious]` - Set emotional tone for the text that follows (only Bark acts on it, by selecting a voice preset)
- `[PAUSE-SHORT]` - 0.3 second pause
- `[PAUSE-MEDIUM]` - 0.6 second pause  
- `[PAUSE-LONG]` - 1.0 second pause
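- `word [PRONOUNCE: pronunciation]` - Pronunciation guide (the marker is stripped before synthesis; the hint itself is not passed to the model)
- `[EMPHASIS: word]` - Marks a word for emphasis (the brackets are stripped and the word is spoken as plain text)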
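
To make the tone and pause markers concrete, here is a simplified sketch of how a marked-up line is split into speech and pause segments. It follows the same idea as the `TranscriptionParser` class in Step 4 but is not a copy of it: it handles only `[TONE: ...]` and `[PAUSE-...]`, and the name `split_segments` is illustrative.

```python
import re

MARKER = re.compile(r"\[TONE:\s*(\w+)\]|\[PAUSE-(SHORT|MEDIUM|LONG)\]")
PAUSE_SECONDS = {"SHORT": 0.3, "MEDIUM": 0.6, "LONG": 1.0}

def split_segments(text):
    """Split marked-up text into speech and pause segments."""
    segments, pos, tone = [], 0, "neutral"
    for m in MARKER.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append({"type": "speech", "tone": tone, "text": chunk})
        if m.group(1):   # [TONE: ...] changes the tone of what follows
            tone = m.group(1)
        else:            # [PAUSE-...] becomes a silent segment
            segments.append({"type": "pause", "duration": PAUSE_SECONDS[m.group(2)]})
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append({"type": "speech", "tone": tone, "text": tail})
    return segments

sample = "[TONE: happy] Welcome back! [PAUSE-SHORT] [TONE: serious] Let us begin."
for seg in split_segments(sample):
    print(seg)
```

A tone stays in effect until the next `[TONE: ...]` marker, so it only needs to be written where the mood changes.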

### Recommended Models:
| Model | Best For | Speed | Quality |
|-------|----------|-------|--------|
| `suno/bark` | Emotional Hindi/English | Slow | ⭐⭐⭐⭐⭐ |
| `facebook/mms-tts-hin` | Fast Hindi | Fast | ⭐⭐⭐⭐ |
| `microsoft/speecht5_tts` | English | Medium | ⭐⭐⭐⭐ |

### Tips:
- Use Colab GPU runtime for faster generation (Runtime → Change runtime type → GPU)
- For long texts, generation may take 10-30 minutes
- Bark produces the most natural-sounding speech but is slower
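
The notebook sends everything between two markers to the model in one call. Bark is reported to work best on short passages (roughly 13 seconds of audio per call), so a long unbroken paragraph may come out truncated. One possible workaround, sketched below under that assumption, is to pre-split long passages at sentence boundaries before adding markers. The helper `split_sentences` and the 200-character limit are illustrative choices, not part of the notebook.

```python
import re

def split_sentences(text, max_chars=200):
    """Split after . ! ? or the Devanagari danda, then pack sentences into chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?।])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

long_text = "यह पहला वाक्य है। यह दूसरा है। And this one is in English."
for chunk in split_sentences(long_text, max_chars=30):
    print(chunk)
```

Joining the resulting chunks with `[PAUSE-SHORT]` turns each one into its own segment. A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.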