# 🌐 Hindi Literary Translation Generator for Google Colab

**Enhanced Multilingual Translation with Per-Chunk Progress Reporting**

This notebook allows you to:
1. Upload your text file to translate
2. Select AI provider (Ollama/HuggingFace) and model
3. Choose target language and translation quality tier
4. Generate translation and download the result

**Supported Providers:**
- 🤗 HuggingFace - Various translation models
- 🦙 Ollama - Local models (if running locally)

**Recommended Models for Hindi:**
- `facebook/nllb-200-distilled-600M` - Fast, multilingual
- `ai4bharat/indictrans2-en-indic-1B` - Best for English→Hindi
- `google/madlad400-3b-mt` - High quality, slower

## 📦 Step 1: Install Dependencies
Run this cell to install all required packages.
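
After installing, it can help to confirm which versions actually ended up in the runtime. The sketch below is an optional check, not part of the original notebook; the helper name `installed_versions` is hypothetical, and it relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages taken from the pip install lines in this step
report = installed_versions(["torch", "transformers", "accelerate", "sentencepiece"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry usually means the install cell needs to be re-run, or the runtime was restarted since.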

In [None]:
# Install required packages
!pip install -q torch transformers accelerate sentencepiece
!pip install -q colorama huggingface-hub
print("✅ All dependencies installed!")

## 📤 Step 2: Upload Your Text File
Upload the text file you want to translate.
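
The upload cell reads the file as UTF-8, which raises an error for files saved in another encoding. As a hedged sketch (the helper name `decode_with_fallback` and the choice of fallback encodings are assumptions, not part of the notebook), the raw bytes can be decoded by trying a short list of encodings in order:

```python
def decode_with_fallback(raw, encodings=("utf-8-sig", "cp1252")):
    """Decode bytes with the first encoding that works; return (text, encoding)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Example with bytes that are valid cp1252 but not valid UTF-8
text, used = decode_with_fallback("naïve café".encode("cp1252"))
print(used, text)
```

In this notebook the raw bytes would come from `uploaded[UPLOADED_FILE]`. `utf-8-sig` is tried first because it reads plain UTF-8 and also drops a leading byte-order mark.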

In [None]:
from google.colab import files
import os

print("📤 Please upload your text file to translate:")
uploaded = files.upload()

# Guard against an empty upload (e.g. the dialog was cancelled)
if not uploaded:
    raise ValueError("No file was uploaded. Please run this cell again.")

# Get the uploaded file name
UPLOADED_FILE = next(iter(uploaded))
print(f"\n✅ Uploaded: {UPLOADED_FILE}")
print(f"📄 File size: {len(uploaded[UPLOADED_FILE]):,} bytes")

# Read the file and compute basic stats
with open(UPLOADED_FILE, 'r', encoding='utf-8') as f:
    content = f.read()

word_count = len(content.split())
char_count = len(content)

print(f"\n📊 Content stats:")
print(f"   Words: {word_count:,}")
print(f"   Characters: {char_count:,}")
# Only show the ellipsis when the preview is actually truncated
print(f"\n📝 Preview (first 500 chars):\n{content[:500]}{'...' if char_count > 500 else ''}")

## ⚙️ Step 3: Select AI Provider, Model & Language
Choose your preferred translation model and settings.
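
When entering a custom language code, a quick shape check can catch typos before a model is loaded. The sketch below assumes NLLB codes follow the pattern used by every entry in this notebook's language list: three lowercase letters, an underscore, then a four-letter script name with an initial capital (e.g. `hin_Deva`). The helper name `looks_like_nllb_code` is hypothetical, and the check does not confirm that a given model actually supports the code.

```python
import re

# Shape only: "hin_Deva", "zho_Hans", ... (language code + script code)
NLLB_CODE_PATTERN = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_nllb_code(code):
    """Return True if the string has the shape of an NLLB language code."""
    return NLLB_CODE_PATTERN.fullmatch(code.strip()) is not None


for candidate in ["hin_Deva", "hi", "hin_deva", " tam_Taml "]:
    print(repr(candidate), looks_like_nllb_code(candidate))
```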

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML

# Provider options
PROVIDER_OPTIONS = {
    "HuggingFace (Recommended for Colab)": "huggingface",
    "Ollama (Local only)": "ollama"
}

# Model options by provider
HF_MODEL_OPTIONS = {
    "facebook/nllb-200-distilled-600M (Fast, Multilingual)": "facebook/nllb-200-distilled-600M",
    "facebook/nllb-200-1.3B (Better Quality)": "facebook/nllb-200-1.3B",
    "ai4bharat/indictrans2-en-indic-1B (Best English→Hindi)": "ai4bharat/indictrans2-en-indic-1B",
    "google/madlad400-3b-mt (High Quality, Slow)": "google/madlad400-3b-mt",
    "Helsinki-NLP/opus-mt-en-hi (Simple EN→HI)": "Helsinki-NLP/opus-mt-en-hi",
    "Custom Model (enter below)": "custom"
}

OLLAMA_MODEL_OPTIONS = {
    "qwen2.5:3b (Fast)": "qwen2.5:3b",
    "qwen2.5:7b (Balanced)": "qwen2.5:7b",
    "deepseek-r1:7b (Reasoning)": "deepseek-r1:7b",
    "llama3.2:3b (Fast)": "llama3.2:3b",
    "Custom Model (enter below)": "custom"
}

# Language options (NLLB language codes)
LANGUAGE_OPTIONS = {
    "Hindi (हिन्दी)": "hin_Deva",
    "Bengali (বাংলা)": "ben_Beng",
    "Tamil (தமிழ்)": "tam_Taml",
    "Telugu (తెలుగు)": "tel_Telu",
    "Marathi (मराठी)": "mar_Deva",
    "Gujarati (ગુજરાતી)": "guj_Gujr",
    "Kannada (ಕನ್ನಡ)": "kan_Knda",
    "Malayalam (മലയാളം)": "mal_Mlym",
    "Punjabi (ਪੰਜਾਬੀ)": "pan_Guru",
    "Odia (ଓଡ଼ିଆ)": "ory_Orya",
    "Urdu (اردو)": "urd_Arab",
    "Spanish (Español)": "spa_Latn",
    "French (Français)": "fra_Latn",
    "German (Deutsch)": "deu_Latn",
    "Chinese (中文)": "zho_Hans",
    "Japanese (日本語)": "jpn_Jpan",
    "Custom (enter code below)": "custom"
}

# Translation quality tiers
TIER_OPTIONS = {
    "BASIC - Fast, good quality": "BASIC",
    "INTERMEDIATE - Balanced (recommended)": "INTERMEDIATE",
    "ADVANCED - Best quality, slower": "ADVANCED"
}

# Provider dropdown
provider_dropdown = widgets.Dropdown(
    options=list(PROVIDER_OPTIONS.keys()),
    value="HuggingFace (Recommended for Colab)",
    description='Provider:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='400px')
)

# Model dropdown (HuggingFace by default)
model_dropdown = widgets.Dropdown(
    options=list(HF_MODEL_OPTIONS.keys()),
    value="facebook/nllb-200-distilled-600M (Fast, Multilingual)",
    description='Model:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

# Custom model input
custom_model_input = widgets.Text(
    value='',
    placeholder='Enter HuggingFace model name',
    description='Custom Model:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

# Target language dropdown
language_dropdown = widgets.Dropdown(
    options=list(LANGUAGE_OPTIONS.keys()),
    value="Hindi (हिन्दी)",
    description='Target Language:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='400px')
)

# Custom language code input
custom_language_input = widgets.Text(
    value='',
    placeholder='Enter NLLB language code (e.g., hin_Deva)',
    description='Custom Lang:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='400px')
)

# Translation tier dropdown
tier_dropdown = widgets.Dropdown(
    options=list(TIER_OPTIONS.keys()),
    value="INTERMEDIATE - Balanced (recommended)",
    description='Quality:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='400px')
)

# HuggingFace token (optional)
hf_token_input = widgets.Password(
    value='',
    placeholder='Optional: HF token for gated models',
    description='HF Token:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='400px')
)

# Chunk size slider
chunk_size_slider = widgets.IntSlider(
    value=350,
    min=100,
    max=1000,
    step=50,
    description='Chunk Size:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='400px')
)

display(HTML("<h3>🎛️ Configure Translation Settings</h3>"))
display(provider_dropdown)
display(model_dropdown)
display(custom_model_input)
display(HTML("<br>"))
display(language_dropdown)
display(custom_language_input)
display(HTML("<br>"))
display(tier_dropdown)
display(chunk_size_slider)
display(hf_token_input)

print("\n💡 Tip: facebook/nllb-200-distilled-600M is recommended for fast Hindi translation!")

In [None]:
# Store the selected configuration
SELECTED_PROVIDER = PROVIDER_OPTIONS[provider_dropdown.value]

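# Get model based on provider
if SELECTED_PROVIDER == "huggingface":
    selected_model_key = model_dropdown.value
    SELECTED_MODEL = HF_MODEL_OPTIONS.get(selected_model_key, "custom")
else:
    # The model dropdown above only lists HuggingFace models, so this lookup
    # falls back to "custom" and the model name is read from the Custom Model
    # field. Note that the engine in Step 4 loads models through transformers
    # and does not connect to an Ollama server.
    SELECTED_MODEL = OLLAMA_MODEL_OPTIONS.get(model_dropdown.value, "custom")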

if SELECTED_MODEL == "custom":
    SELECTED_MODEL = custom_model_input.value
    if not SELECTED_MODEL:
        raise ValueError("Please enter a custom model name!")

# Get target language
TARGET_LANGUAGE = LANGUAGE_OPTIONS[language_dropdown.value]
if TARGET_LANGUAGE == "custom":
    TARGET_LANGUAGE = custom_language_input.value
    if not TARGET_LANGUAGE:
        raise ValueError("Please enter a custom language code!")

TRANSLATION_TIER = TIER_OPTIONS[tier_dropdown.value]
CHUNK_SIZE = chunk_size_slider.value
HF_TOKEN = hf_token_input.value if hf_token_input.value else None

print(f"\n✅ Configuration saved:")
print(f"   🤖 Provider: {SELECTED_PROVIDER}")
print(f"   📦 Model: {SELECTED_MODEL}")
print(f"   🌐 Target Language: {TARGET_LANGUAGE}")
print(f"   🎯 Quality Tier: {TRANSLATION_TIER}")
print(f"   📦 Chunk Size: {CHUNK_SIZE} words")
print(f"   🔑 HF Token: {'Provided' if HF_TOKEN else 'Not provided'}")

## 🚀 Step 4: Translation Engine Setup
This cell contains the complete translation engine code.

In [None]:
#!/usr/bin/env python3
"""
Enhanced Translation Engine for Google Colab
Supports HuggingFace models with multiple language targets
"""

import os
import sys
import json
import time
import warnings
import re
from pathlib import Path
from datetime import datetime

warnings.filterwarnings("ignore")

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, pipeline

# Check GPU availability
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️ Using device: {DEVICE}")
if DEVICE == "cuda":
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")


# Translation Prompts for LLM-based models
TRANSLATION_PROMPTS = {
    "BASIC": {
        "system": """You are a professional translator. Translate the text accurately.""",
        "user": """Translate the following text to {target_lang}:\n\n{chunk}\n\nTranslation:"""
    },
    "INTERMEDIATE": {
        "system": """You are an expert literary translator. Create translations that feel natural in the target language while preserving all meaning and nuance.""",
        "user": """Translate the following text to {target_lang}. Maintain all details, dialogue, and descriptions:\n\n{chunk}\n\nComplete Translation:"""
    },
    "ADVANCED": {
        "system": """You are a master literary translator. Your translations should feel like they were originally written in the target language by a native speaker. Preserve every sentence, every detail, every nuance.""",
        "user": """Translate the COMPLETE passage below to {target_lang}.\n\nRequirements:\n- Translate EVERY sentence\n- Maintain ALL dialogue\n- Preserve ALL descriptions\n- Keep similar length\n\nText:\n{chunk}\n\nComplete Translation:"""
    }
}


def chunk_text(text, chunk_words=350):
    """Split text into chunks of at most chunk_words words, breaking at paragraph boundaries."""
    # One or more blank lines separate paragraphs. \s also matches \r, so
    # Windows line endings are covered by the same pattern.
    paragraphs = re.split(r'\n\s*\n', text)
    paragraphs = [para.strip() for para in paragraphs if para.strip()]

    chunks = []
    current_chunk = []
    current_count = 0

    for para in paragraphs:
        para_words = para.split()
        para_count = len(para_words)

        if para_count > chunk_words:
            # Flush what has accumulated, then hard-split the oversized paragraph
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                current_chunk = []
                current_count = 0

            for i in range(0, para_count, chunk_words):
                chunks.append(' '.join(para_words[i:i + chunk_words]))
        else:
            if current_count + para_count > chunk_words and current_chunk:
                # Adding this paragraph would exceed the limit: start a new chunk
                chunks.append('\n\n'.join(current_chunk))
                current_chunk = [para]
                current_count = para_count
            else:
                current_chunk.append(para)
                current_count += para_count

    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks


def clean_translation(text):
    """Clean up translation artifacts."""
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    text = re.sub(r'```\w*\n?', '', text)
    text = re.sub(r'(Translation:|Hindi Translation:|Here\'s the translation:)', '', text, flags=re.IGNORECASE)
    lines = [line.strip() for line in text.split('\n')]
    text = '\n\n'.join(line for line in lines if line)
    return text.strip()


class TranslationEngine:
    """Translation engine with HuggingFace support."""
    
    def __init__(self, model_name, target_lang, device="cuda", hf_token=None):
        self.model_name = model_name
        self.target_lang = target_lang
        self.device = device
        self.hf_token = hf_token
        self.model = None
        self.tokenizer = None
        self.translator = None
        self.model_type = self._detect_model_type(model_name)
        
        if self.hf_token:
            os.environ['HF_TOKEN'] = self.hf_token
        
        self.load_model()
    
    def _detect_model_type(self, model_name):
        """Detect model type from name."""
        model_lower = model_name.lower()
        if 'nllb' in model_lower:
            return 'nllb'
        elif 'indictrans' in model_lower:
            return 'indictrans'
        elif 'opus-mt' in model_lower or 'helsinki' in model_lower:
            return 'opus'
        elif 'madlad' in model_lower:
            return 'madlad'
        elif 'mbart' in model_lower:
            return 'mbart'
        else:
            return 'causal'  # LLM-based translation
    
    def load_model(self):
        """Load translation model."""
        print(f"📥 Loading model: {self.model_name}")
        print(f"   Model type: {self.model_type}")
        
        try:
            if self.model_type in ['nllb', 'opus', 'mbart']:
                self._load_seq2seq_model()
            elif self.model_type == 'indictrans':
                self._load_indictrans_model()
            elif self.model_type == 'madlad':
                self._load_madlad_model()
            else:
                self._load_causal_model()
            
            print("✅ Model loaded successfully!")
        except Exception as e:
            print(f"❌ Failed to load model: {e}")
            raise
    
    def _load_seq2seq_model(self):
        """Load Seq2Seq translation model (NLLB, OPUS, mBART)."""
        is_nllb = self.model_type == 'nllb'

        # Only NLLB tokenizers take the NLLB source-language code
        tokenizer_kwargs = {"src_lang": "eng_Latn"} if is_nllb else {}
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            token=self.hf_token,
            **tokenizer_kwargs
        )
        # Place the model explicitly instead of using device_map="auto":
        # a pipeline that receives both an accelerate-dispatched model and a
        # `device` argument can refuse to start.
        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            self.model_name,
            token=self.hf_token,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32
        ).to(self.device)

        # OPUS models are fixed to one language pair, so only NLLB receives
        # language codes. mBART uses its own codes (e.g. en_XX, hi_IN), which
        # this notebook does not map.
        lang_kwargs = {"src_lang": "eng_Latn", "tgt_lang": self.target_lang} if is_nllb else {}
        self.translator = pipeline(
            "translation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_length=1024,
            device=0 if self.device == "cuda" else -1,
            **lang_kwargs
        )
    
    def _load_indictrans_model(self):
        """Load IndicTrans2 model."""
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
        
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            token=self.hf_token,
            trust_remote_code=True
        )
        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            self.model_name,
            token=self.hf_token,
            trust_remote_code=True,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32
        )
        
        if self.device == "cuda":
            self.model = self.model.cuda()
    
    def _load_madlad_model(self):
        """Load MADLAD-400 model."""
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            token=self.hf_token
        )
        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            self.model_name,
            token=self.hf_token,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            device_map="auto" if self.device == "cuda" else None
        )
    
    def _load_causal_model(self):
        """Load causal LM for translation."""
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            token=self.hf_token
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            token=self.hf_token,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            device_map="auto" if self.device == "cuda" else None
        )
        
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
    
    def translate(self, text, tier="INTERMEDIATE"):
        """Translate text based on model type."""
        if self.model_type in ['nllb', 'opus', 'mbart']:
            return self._translate_seq2seq(text)
        elif self.model_type == 'indictrans':
            return self._translate_indictrans(text)
        elif self.model_type == 'madlad':
            return self._translate_madlad(text)
        else:
            return self._translate_causal(text, tier)
    
    def _translate_seq2seq(self, text):
        """Translate using Seq2Seq model."""
        result = self.translator(text, max_length=1024)
        return result[0]['translation_text']
    
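    def _translate_indictrans(self, text):
        """Translate using IndicTrans2."""
        # Assumption: the IndicTrans2 tokenizer expects each input to start with
        # source and target language tags. The model card recommends
        # IndicProcessor (IndicTransToolkit) for full pre- and post-processing;
        # this minimal path only prepends the tags.
        tagged_text = f"eng_Latn {self.target_lang} {text}"
        inputs = self.tokenizer(tagged_text, return_tensors="pt", padding=True, truncation=True, max_length=512)

        if self.device == "cuda":
            inputs = {k: v.cuda() for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=512,
                num_beams=5,
                num_return_sequences=1
            )

        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)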
    
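    def _translate_madlad(self, text):
        """Translate using MADLAD-400."""
        # MADLAD uses language tags like <2hi> for Hindi. Truncating the NLLB
        # code to two letters gives wrong tags for several languages
        # (e.g. "ben" -> "be" instead of "bn"), so map the known codes explicitly.
        madlad_codes = {
            'hin_Deva': 'hi', 'ben_Beng': 'bn', 'tam_Taml': 'ta', 'tel_Telu': 'te',
            'mar_Deva': 'mr', 'guj_Gujr': 'gu', 'kan_Knda': 'kn', 'mal_Mlym': 'ml',
            'pan_Guru': 'pa', 'ory_Orya': 'or', 'urd_Arab': 'ur', 'spa_Latn': 'es',
            'fra_Latn': 'fr', 'deu_Latn': 'de', 'zho_Hans': 'zh', 'jpn_Jpan': 'ja',
        }
        # Custom codes fall back to the original two-letter truncation
        lang_code = madlad_codes.get(self.target_lang, self.target_lang.split('_')[0][:2])
        tagged_text = f"<2{lang_code}> {text}"

        inputs = self.tokenizer(tagged_text, return_tensors="pt", padding=True, truncation=True, max_length=512)

        # Follow the model's own device, which also works with device_map="auto"
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=512,
                num_beams=4
            )

        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)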
    
    def _translate_causal(self, text, tier):
        """Translate using causal LM with prompts."""
        prompts = TRANSLATION_PROMPTS[tier]

        # Get language name from code; unmapped codes are passed through as-is
        lang_names = {
            'hin_Deva': 'Hindi', 'ben_Beng': 'Bengali', 'tam_Taml': 'Tamil',
            'tel_Telu': 'Telugu', 'mar_Deva': 'Marathi', 'guj_Gujr': 'Gujarati',
            'kan_Knda': 'Kannada', 'mal_Mlym': 'Malayalam', 'pan_Guru': 'Punjabi',
            'ory_Orya': 'Odia', 'urd_Arab': 'Urdu', 'spa_Latn': 'Spanish',
            'fra_Latn': 'French', 'deu_Latn': 'German', 'zho_Hans': 'Chinese',
            'jpn_Jpan': 'Japanese',
        }
        lang_name = lang_names.get(self.target_lang, self.target_lang)

        full_prompt = f"{prompts['system']}\n\n{prompts['user'].format(target_lang=lang_name, chunk=text)}"

        inputs = self.tokenizer(full_prompt, return_tensors="pt", truncation=True, max_length=2048)
        # Follow the model's own device, which also works with device_map="auto"
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        prompt_len = inputs["input_ids"].shape[1]

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=1024,
                temperature=0.4,
                do_sample=True,
                pad_token_id=self.tokenizer.pad_token_id
            )

        # Decode only the newly generated tokens. Slicing the decoded string by
        # len(full_prompt) is unreliable because decoding does not always
        # reproduce the prompt text exactly.
        generated = self.tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
        return generated.strip()


class TranslationGenerator:
    """Main translation generator class."""
    
    def __init__(self, model_name, target_lang, device="cuda", output_dir=".", tier="INTERMEDIATE", chunk_size=350, hf_token=None):
        self.model_name = model_name
        self.target_lang = target_lang
        self.device = device
        self.output_dir = Path(output_dir)
        self.tier = tier
        self.chunk_size = chunk_size
        
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.engine = TranslationEngine(model_name, target_lang, device, hf_token)
    
    def translate_file(self, input_file):
        """Translate entire file."""
        print(f"\n{'=' * 70}")
        print(f"🌐 TRANSLATION GENERATOR")
        print(f"{'=' * 70}")
        print(f"📄 Input: {input_file}")
        print(f"🤖 Model: {self.model_name}")
        print(f"🌐 Target: {self.target_lang}")
        print(f"🎯 Quality: {self.tier}")
        print(f"🖥️ Device: {self.device}")
        print(f"{'=' * 70}\n")
        
        # Read input
        with open(input_file, 'r', encoding='utf-8') as f:
            text = f.read()
        
        # Clean markers
        lines = text.split('\n')
        cleaned = [l for l in lines if not (l.strip().startswith('===') and l.strip().endswith('==='))]
        text = '\n'.join(cleaned).strip()
        
        orig_words = len(text.split())
        orig_chars = len(text)
        print(f"📊 Input: {orig_chars:,} chars, {orig_words:,} words")

        # Chunk text
        print(f"\n📦 Creating chunks ({self.chunk_size} words each)...")
        chunks = chunk_text(text, self.chunk_size)
        print(f"✅ Created {len(chunks)} chunks")

        # Translate chunks
        print(f"\n🎯 STARTING TRANSLATION\n")
        
        translations = []
        start_time = time.time()
        
        for i, chunk in enumerate(chunks, 1):
            chunk_start = time.time()
            
            print(f"\n{'=' * 50}")
            print(f"📄 Chunk {i}/{len(chunks)}")
            print(f"   Input: {len(chunk.split())} words, {len(chunk)} chars")

            try:
                translated = self.engine.translate(chunk, self.tier)
                translated = clean_translation(translated)
                translations.append(translated)

                chunk_time = time.time() - chunk_start
                print(f"   Output: {len(translated)} chars")
                print(f"   ✅ Completed in {chunk_time:.1f}s")

                # Progress
                elapsed = time.time() - start_time
                avg = elapsed / i
                remaining = len(chunks) - i
                eta = remaining * avg
                print(f"   📈 Progress: {i/len(chunks)*100:.1f}% | ETA: {eta/60:.1f}m")

            except Exception as e:
                print(f"   ❌ Error: {e}")
                translations.append(f"[TRANSLATION ERROR: {e}]")
        
        # Combine translations
        final_translation = "\n\n".join(translations)
        
        # Save output
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        lang_code = self.target_lang.split('_')[0]
        output_file = self.output_dir / f"translation_{lang_code}_{timestamp}.txt"
        
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(final_translation)
        
        # Summary
        total_time = time.time() - start_time
        trans_chars = len(final_translation)
        
        print(f"\n{'=' * 70}")
        print(f"🎉 TRANSLATION COMPLETE!")
        print(f"{'=' * 70}")
        print(f"⏱️ Time: {total_time/60:.1f} minutes")
        print(f"📦 Chunks: {len(chunks)}")
        # max(..., 1) avoids a ZeroDivisionError when the input file is empty
        print(f"⚡ Avg/chunk: {total_time/max(len(chunks), 1):.1f}s")
        print(f"📝 Input: {orig_chars:,} chars")
        print(f"📝 Output: {trans_chars:,} chars")
        print(f"📊 Ratio: {trans_chars/max(orig_chars, 1):.2f}x")
        print(f"💾 Output: {output_file}")
        print(f"{'=' * 70}")
        
        return str(output_file)


print("✅ Translation Engine loaded and ready!")

## 🌐 Step 5: Generate Translation
Run this cell to translate your uploaded file.
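
While the translation runs, the engine prints an ETA after each chunk. The arithmetic behind it is a simple running average; the sketch below restates it as standalone functions so the numbers are easier to interpret. The function names are hypothetical, and the estimate assumes the remaining chunks take about as long as the ones already finished.

```python
def estimate_eta_seconds(elapsed_seconds, chunks_done, chunks_total):
    """Remaining time = average time per finished chunk x chunks left."""
    if chunks_done <= 0:
        return None  # nothing finished yet, so there is no average to use
    average = elapsed_seconds / chunks_done
    return average * (chunks_total - chunks_done)


def format_eta(seconds):
    """Render a duration in seconds as e.g. '1m 10s'."""
    if seconds is None:
        return "unknown"
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m {secs:02d}s"


# 3 of 10 chunks finished after 30 seconds: 10 s per chunk, 7 chunks left
print(format_eta(estimate_eta_seconds(30.0, 3, 10)))
```

The first few chunks often include one-time warm-up cost, so early estimates tend to run high.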

In [None]:
# Create output directory
OUTPUT_DIR = "./translation_output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize the generator
print("🚀 Initializing Translation Generator...")
generator = TranslationGenerator(
    model_name=SELECTED_MODEL,
    target_lang=TARGET_LANGUAGE,
    device=DEVICE,
    output_dir=OUTPUT_DIR,
    tier=TRANSLATION_TIER,
    chunk_size=CHUNK_SIZE,
    hf_token=HF_TOKEN
)

# Translate
print(f"\n🌐 Starting translation...")
OUTPUT_FILE = generator.translate_file(UPLOADED_FILE)

print(f"\n✅ Translation file generated: {OUTPUT_FILE}")

## 📖 Step 6: Preview & Download Translation
View your translation and download it.
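
When a chunk fails, the engine writes a `[TRANSLATION ERROR: ...]` placeholder into the output instead of stopping. Before downloading, it may be worth scanning for those placeholders. This is a minimal sketch with a hypothetical helper name; it assumes the error message itself contains no `]`, otherwise the captured message is cut short.

```python
import re

# Matches the placeholder the translation loop writes for a failed chunk
ERROR_MARKER = re.compile(r"\[TRANSLATION ERROR: (.*?)\]", re.DOTALL)


def find_failed_chunks(translation_text):
    """Return the error messages of all failed chunks found in the text."""
    return ERROR_MARKER.findall(translation_text)


sample = "First chunk.\n\n[TRANSLATION ERROR: CUDA out of memory]\n\nThird chunk."
failures = find_failed_chunks(sample)
print(f"{len(failures)} failed chunk(s): {failures}")
```

In the notebook, the same call could be made on the `translation` string read in the preview cell.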

In [None]:
from IPython.display import display, HTML
import os

if os.path.exists(OUTPUT_FILE):
    # Read and display translation
    with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
        translation = f.read()
    
    file_size = os.path.getsize(OUTPUT_FILE) / 1024  # KB
    word_count = len(translation.split())
    
    print(f"📊 Translation stats:")
    print(f"   Words: {word_count:,}")
    print(f"   Characters: {len(translation):,}")
    print(f"   File size: {file_size:.2f} KB")

    print(f"\n📖 Preview (first 1000 chars):")
    print(f"{'=' * 50}")
    print(translation[:1000])
    print(f"{'=' * 50}")
    if len(translation) > 1000:
        print(f"... [truncated, {len(translation) - 1000:,} more chars]")
else:
    print("❌ Output file not found. Please run the translation step again.")

In [None]:
# Download the translated file
from google.colab import files

print("📥 Downloading your translated file...")
files.download(OUTPUT_FILE)
print("✅ Download started! Check your browser's download folder.")

## 💾 (Optional) Save to Google Drive
Run this cell if you want to save the translation to your Google Drive.

In [None]:
# Mount Google Drive
from google.colab import drive
import shutil

print("📂 Mounting Google Drive...")
drive.mount('/content/drive')

# Create output folder in Drive
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/Translation_Output"
os.makedirs(DRIVE_OUTPUT_DIR, exist_ok=True)

# Copy file to Drive
drive_output_path = os.path.join(DRIVE_OUTPUT_DIR, os.path.basename(OUTPUT_FILE))
shutil.copy(OUTPUT_FILE, drive_output_path)

print(f"\n✅ Translation saved to Google Drive:")
print(f"   📁 {drive_output_path}")

---

## 📚 Quick Reference

### Supported Languages (NLLB Codes):
| Language | Code |
|----------|------|
| Hindi | `hin_Deva` |
| Bengali | `ben_Beng` |
| Tamil | `tam_Taml` |
| Telugu | `tel_Telu` |
| Marathi | `mar_Deva` |
| Gujarati | `guj_Gujr` |
| Spanish | `spa_Latn` |
| French | `fra_Latn` |
| German | `deu_Latn` |

### Recommended Models:
| Model | Best For | Speed |
|-------|----------|-------|
| `facebook/nllb-200-distilled-600M` | Fast multilingual | ⚡ Fast |
| `facebook/nllb-200-1.3B` | Better quality | 🔄 Medium |
| `ai4bharat/indictrans2-en-indic-1B` | Best EN→Hindi | 🔄 Medium |
| `google/madlad400-3b-mt` | Highest quality | 🐢 Slow |
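
Model size is a large part of the speed difference, and it also decides whether a model fits on the Colab GPU. As a rough back-of-the-envelope sketch: the engine loads weights in float16 on GPU, which takes 2 bytes per parameter. The parameter counts below are nominal values read off the model names, not measured figures, and the estimate covers weights only (no activations or beam-search buffers).

```python
BYTES_PER_PARAM_FP16 = 2

# Nominal parameter counts inferred from the model names (assumption)
NOMINAL_PARAMS = {
    "facebook/nllb-200-distilled-600M": 0.6e9,
    "facebook/nllb-200-1.3B": 1.3e9,
    "ai4bharat/indictrans2-en-indic-1B": 1.0e9,
    "google/madlad400-3b-mt": 3.0e9,
}


def fp16_weight_gb(num_params):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9


for name, params in NOMINAL_PARAMS.items():
    print(f"{name}: ~{fp16_weight_gb(params):.1f} GB of weights in float16")
```

On CPU the engine falls back to float32, which doubles these figures.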

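### Quality Tiers:
The tier selects the prompt used by prompt-based (causal LM) models. Dedicated translation models (NLLB, OPUS, IndicTrans2, MADLAD) ignore this setting.
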
- **BASIC**: Fast, good for simple texts
- **INTERMEDIATE**: Balanced quality and speed (recommended)
- **ADVANCED**: Best quality, preserves all nuances
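
To make the tiers concrete, the sketch below shows how a tier name can select a prompt template that is then filled with the target language and the text chunk. The two templates are copied from the engine's BASIC and INTERMEDIATE user prompts; the helper `build_prompt` and its fallback to INTERMEDIATE for unknown tiers are illustrative assumptions.

```python
PROMPTS = {
    "BASIC": "Translate the following text to {target_lang}:\n\n{chunk}\n\nTranslation:",
    "INTERMEDIATE": (
        "Translate the following text to {target_lang}. "
        "Maintain all details, dialogue, and descriptions:\n\n{chunk}\n\n"
        "Complete Translation:"
    ),
}


def build_prompt(tier, target_lang, chunk):
    """Fill the template for the given tier; unknown tiers use INTERMEDIATE."""
    template = PROMPTS.get(tier, PROMPTS["INTERMEDIATE"])
    return template.format(target_lang=target_lang, chunk=chunk)


print(build_prompt("BASIC", "Hindi", "The rain had stopped by morning."))
```

Longer prompts leave less of the model's context window for the chunk itself, which is one reason to pair the ADVANCED tier with a smaller chunk size.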

### Tips:
- Use Colab GPU for faster translation
- For long texts, use smaller chunk sizes (200-300 words); the IndicTrans2 and MADLAD paths truncate each chunk at 512 tokens
- NLLB models are best for multilingual translation
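
To see how the chunk size setting changes the number of chunks, here is a simplified standalone sketch of paragraph-aware chunking. It follows the same idea as the notebook's `chunk_text` (pack whole paragraphs up to a word limit, hard-split any paragraph that is too long) but is written independently; the function name `chunk_by_paragraph` is hypothetical.

```python
import re


def chunk_by_paragraph(text, max_words=350):
    """Pack whole paragraphs into chunks of at most max_words words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0

    for para in paragraphs:
        words = para.split()
        if len(words) > max_words:
            # Flush pending paragraphs, then hard-split the oversized one
            if current:
                chunks.append("\n\n".join(current))
                current, count = [], 0
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
        elif current and count + len(words) > max_words:
            # This paragraph would overflow the chunk: start a new one
            chunks.append("\n\n".join(current))
            current, count = [para], len(words)
        else:
            current.append(para)
            count += len(words)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i"
for limit in (5, 3):
    print(limit, chunk_by_paragraph(sample, max_words=limit))
```

Smaller limits give more chunks and more model calls, but each call is less likely to hit a tokenizer length limit.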