# TTS with XTTS v2 - Voice Cloning

**What this does:**
- Clone a voice from a short reference recording (10-30 seconds of audio)
- Generate speech in Hindi and English
- No training required (uses pre-trained XTTS v2)

**GPU Required:** Settings → Accelerator → GPU P100 or T4

## Step 1: Check GPU

In [None]:
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("❌ Enable GPU in Settings!")

## Step 2: Install Packages

In [None]:
!pip install -q TTS
!pip install -q torchaudio
print("✅ Packages installed!")
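
To confirm that the install cell took effect, the installed package versions can be read back from the environment. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of the notebook.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages installed above
for name in ("TTS", "torchaudio"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```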

## Step 3: Load XTTS v2 Model

In [None]:
from TTS.api import TTS
import torch

# Load XTTS v2 (multilingual, voice cloning)
print("Loading XTTS v2 model (this may take a few minutes)...")
device = "cuda" if torch.cuda.is_available() else "cpu"

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

print(f"✅ Model loaded on {device}!")
print(f"\nSupported languages: {tts.languages}")
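
On first download, the Coqui model manager asks you to accept the XTTS license terms on the console. In a non-interactive run that prompt cannot be answered. The sketch below assumes the `COQUI_TOS_AGREED` environment variable that the Coqui TTS model manager checks. Run it before the model-loading cell above, and only if you have read and accepted the license.

```python
import os

# Assumption: the Coqui TTS model manager skips its interactive license prompt
# when this variable is set to "1". Set it only after reading the model license.
os.environ["COQUI_TOS_AGREED"] = "1"
```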

## Step 4: Upload Your Reference Audio

**Requirements for reference audio:**
- Duration: 10-30 seconds (longer is better)
- Quality: Clear, no background noise
- Format: WAV or MP3
- Content: Natural speech (reading a paragraph works well)

In [None]:
# Provide your reference audio file
# Option 1 (Kaggle): add it as a dataset via "+ Add data", then set the path below
# Option 2 (Colab): upload from your local machine using the commented block below

import os

REFERENCE_AUDIO = "/kaggle/input/your-dataset/reference_voice.wav"  # UPDATE THIS PATH

# Colab only: google.colab is a Colab-specific module and importing it elsewhere
# may fail, so keep these lines commented out on Kaggle.
# from google.colab import files
# uploaded = files.upload()
# REFERENCE_AUDIO = list(uploaded.keys())[0]

if os.path.exists(REFERENCE_AUDIO):
    print(f"✅ Reference audio found: {REFERENCE_AUDIO}")
else:
    print("❌ Reference audio not found!")
    print("Please upload your reference audio file.")

In [None]:
# Check reference audio quality
import torchaudio
import IPython.display as ipd

if os.path.exists(REFERENCE_AUDIO):
    waveform, sample_rate = torchaudio.load(REFERENCE_AUDIO)
    duration = waveform.shape[1] / sample_rate
    
    print(f"Duration: {duration:.1f} seconds")
    print(f"Sample rate: {sample_rate} Hz")
    print(f"Channels: {waveform.shape[0]}")
    
    if duration < 6:
        print("⚠️ Warning: Audio is under 6 seconds. 10-30 seconds recommended.")
    elif duration > 30:
        print("⚠️ Warning: Audio is long. Will use first 30 seconds.")
    else:
        print("✅ Audio length is good!")
    
    # Play the audio
    print("\n🔊 Playing reference audio:")
    ipd.display(ipd.Audio(REFERENCE_AUDIO))
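
If the reference clip is longer than 30 seconds, it can be shortened up front so that the part used for cloning is chosen deliberately. This is a minimal sketch using only the standard-library `wave` module, assuming an uncompressed PCM WAV input (an MP3 would need converting first). `trim_wav`, `wav_duration` and the output path in the usage note are hypothetical names, not part of the notebook.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=30.0):
    """Copy at most the first max_seconds of a PCM WAV file to dst_path."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        max_frames = int(max_seconds * src.getframerate())
        frames = src.readframes(min(src.getnframes(), max_frames))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width and sample rate
        dst.writeframes(frames)  # the header frame count is patched to match
    return dst_path


def wav_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

In the notebook this could be used as `REFERENCE_AUDIO = trim_wav(REFERENCE_AUDIO, "/kaggle/working/reference_30s.wav")` before generating speech.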

## Step 5: Generate Speech (Voice Cloning)

In [None]:
# Test with English
english_text = "Hello! This is a test of the text to speech system. The voice should sound similar to the reference audio."

print("Generating English speech...")
tts.tts_to_file(
    text=english_text,
    file_path="/kaggle/working/output_english.wav",
    speaker_wav=REFERENCE_AUDIO,
    language="en"
)

print("✅ Generated: output_english.wav")
ipd.display(ipd.Audio("/kaggle/working/output_english.wav"))

In [None]:
# Test with Hindi
# English meaning: "Hello! This is a test of the text-to-speech system. The voice should be similar to the reference audio."
hindi_text = "नमस्ते! यह टेक्स्ट टू स्पीच सिस्टम का परीक्षण है। आवाज़ संदर्भ ऑडियो के समान होनी चाहिए।"

print("Generating Hindi speech...")
tts.tts_to_file(
    text=hindi_text,
    file_path="/kaggle/working/output_hindi.wav",
    speaker_wav=REFERENCE_AUDIO,
    language="hi"
)

print("✅ Generated: output_hindi.wav")
ipd.display(ipd.Audio("/kaggle/working/output_hindi.wav"))

## Step 6: Generate Multiple Samples

In [None]:
# Generate multiple test samples
test_texts = {
    "en": [
        "Welcome to our text to speech demonstration.",
        "This technology can convert any text into natural sounding speech.",
        "The voice quality depends on the reference audio provided.",
    ],
    "hi": [
        "आज का मौसम बहुत अच्छा है।",  # "The weather is very nice today."
        "कृपया अपना नाम और पता बताएं।",  # "Please tell me your name and address."
        "धन्यवाद, आपका दिन शुभ हो।",  # "Thank you, have a good day."
    ]
}

import os
os.makedirs("/kaggle/working/samples", exist_ok=True)

for lang, texts in test_texts.items():
    print(f"\n=== Generating {lang.upper()} samples ===")
    for i, text in enumerate(texts):
        output_file = f"/kaggle/working/samples/{lang}_sample_{i+1}.wav"
        print(f"\nText: {text}")
        
        tts.tts_to_file(
            text=text,
            file_path=output_file,
            speaker_wav=REFERENCE_AUDIO,
            language=lang
        )
        
        print(f"✅ Saved: {output_file}")
        ipd.display(ipd.Audio(output_file))

## Step 7: Create TTS Function for Easy Use

In [None]:
def speak(text, language="en", output_path=None):
    """
    Convert text to speech using cloned voice.
    
    Args:
        text: Text to convert to speech
        language: 'en' for English, 'hi' for Hindi
        output_path: Path to save audio (optional)
    
    Returns:
        Path to generated audio file
    """
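    if output_path is None:
        # hash() is salted per Python process, so use a stable digest for the file name
        import hashlib
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()[:8]
        output_path = f"/kaggle/working/speech_{digest}.wav"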
    
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=REFERENCE_AUDIO,
        language=language
    )
    
    return output_path

# Test the function
audio_file = speak("This is a quick test of our speak function.", language="en")
ipd.display(ipd.Audio(audio_file))
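
Because synthesis is comparatively slow, it can help to skip texts that were already generated. The sketch below derives a stable file name from the language and text, and only calls the synthesizer when that file is missing. `cached_speech_path`, `speak_cached` and the `synthesize` callback are hypothetical names; in the notebook the callback would wrap `tts.tts_to_file`.

```python
import hashlib
import os


def cached_speech_path(text, language, out_dir):
    """Build a stable output path from the language and text."""
    key = hashlib.sha256(f"{language}:{text}".encode("utf-8")).hexdigest()[:16]
    return os.path.join(out_dir, f"speech_{language}_{key}.wav")


def speak_cached(text, language, out_dir, synthesize):
    """Call synthesize(text, language, path) only if the audio is not cached yet."""
    path = cached_speech_path(text, language, out_dir)
    if not os.path.exists(path):
        os.makedirs(out_dir, exist_ok=True)
        synthesize(text, language, path)
    return path
```

One possible callback is `lambda text, language, path: tts.tts_to_file(text=text, file_path=path, speaker_wav=REFERENCE_AUDIO, language=language)`. Keeping the synthesizer as a parameter also lets the caching logic be tested without loading the model.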

## Step 8: Fine-tune XTTS (Optional - For Better Quality)

If you want even better voice matching, you can fine-tune XTTS on your voice data.

In [None]:
# Fine-tuning requires more data (at least 2-3 minutes of audio with transcriptions)
# Skip this if voice cloning quality is already good enough

FINETUNE = False  # Set to True to enable fine-tuning

if FINETUNE:
    print("Fine-tuning requires:")
    print("1. Multiple audio samples (2-5 minutes total)")
    print("2. Text transcriptions for each audio")
    print("3. A manifest.jsonl file with audio paths and text")
    print("\nFormat of manifest.jsonl:")
    print('{"audio_filepath": "audio1.wav", "text": "Hello world", "language": "en"}')
else:
    print("Fine-tuning skipped. Voice cloning is usually good enough!")
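
The manifest described in the cell above holds one JSON object per line. This is a minimal sketch of writing it with the standard library, assuming the three field names printed there. The helper name `write_manifest` and the sample entries in the usage note are illustrative.

```python
import json


def write_manifest(entries, manifest_path):
    """Write (audio_filepath, text, language) tuples as JSON Lines."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, text, language in entries:
            record = {
                "audio_filepath": audio_filepath,
                "text": text,
                "language": language,
            }
            # ensure_ascii=False keeps Hindi text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return manifest_path
```

For example, `write_manifest([("audio1.wav", "Hello world", "en")], "manifest.jsonl")` produces a one-line file in the format shown above.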

In [None]:
# If you have training data, you can fine-tune like this:
if FINETUNE:
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts
    
    # This is a simplified example
    # Full fine-tuning code is in ml-service/training/tts/train_xtts.py
    
    config = XttsConfig()
    # ... configure training
    print("See train_xtts.py for full fine-tuning code")

## Step 9: Save for Later Use

In [None]:
# Save reference audio and config for deployment
import shutil
import json
import os

OUTPUT_DIR = "/kaggle/working/tts_model"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Copy reference audio
shutil.copy(REFERENCE_AUDIO, f"{OUTPUT_DIR}/reference_voice.wav")

# Save config
config = {
    "model": "tts_models/multilingual/multi-dataset/xtts_v2",
    "reference_audio": "reference_voice.wav",
    "languages": ["en", "hi"],
    "sample_rate": 24000,  # XTTS v2 writes 24 kHz output audio
}

with open(f"{OUTPUT_DIR}/config.json", "w") as f:
    json.dump(config, f, indent=2)

# Create inference script
inference_code = '''
from TTS.api import TTS
import torch

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Reference audio path
REFERENCE = "reference_voice.wav"

def speak(text, language="en", output_path="output.wav"):
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=REFERENCE,
        language=language
    )
    return output_path

# Example usage:
# speak("Hello world!", language="en", output_path="hello.wav")
# speak("नमस्ते दुनिया!", language="hi", output_path="namaste.wav")
'''

with open(f"{OUTPUT_DIR}/inference.py", "w") as f:
    f.write(inference_code)

print(f"✅ Saved to: {OUTPUT_DIR}")
print("\nFiles:")
for f in os.listdir(OUTPUT_DIR):
    print(f"  - {f}")
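
On the deployment side, the saved `config.json` can be read back so that the model name and reference clip are not hard-coded. This is a minimal sketch, assuming the keys written by the cell above; `load_deployment_config` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_deployment_config(model_dir):
    """Load config.json and resolve reference_audio relative to model_dir."""
    model_dir = Path(model_dir)
    with open(model_dir / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    # Make the reference clip path usable from any working directory
    config["reference_audio"] = str(model_dir / config["reference_audio"])
    return config
```

A deployment script could then call `cfg = load_deployment_config("tts_model")`, and pass `cfg["model"]` to `TTS(...)` and `cfg["reference_audio"]` as `speaker_wav`.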

## Step 10: Download

In [None]:
# Create zip for download
import shutil

# Zip all samples
shutil.make_archive("/kaggle/working/tts_samples", 'zip', "/kaggle/working/samples")

# Zip model config
shutil.make_archive("/kaggle/working/tts_deployment", 'zip', OUTPUT_DIR)

print("✅ Created:")
print("  - tts_samples.zip (generated audio samples)")
print("  - tts_deployment.zip (reference audio + inference script)")
print("\n📥 Download from Output panel on the right!")
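
Before downloading, it can be worth confirming that the archives contain what you expect. This is a small sketch using the standard-library `zipfile` module; `list_archive` is an illustrative helper name.

```python
import zipfile


def list_archive(zip_path):
    """Return the sorted member names of a zip file after an integrity check."""
    with zipfile.ZipFile(zip_path) as archive:
        bad_member = archive.testzip()  # None when every member passes its CRC check
        if bad_member is not None:
            raise ValueError(f"Corrupt member in {zip_path}: {bad_member}")
        return sorted(archive.namelist())
```

For example, `print(list_archive("/kaggle/working/tts_deployment.zip"))` should list the reference audio, `config.json` and `inference.py`.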

---

## How to Use Later (Local/Server)

```python
from TTS.api import TTS
import torch

# Load model (first run downloads ~2GB)
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Generate speech with your voice
tts.tts_to_file(
    text="Hello, this is my cloned voice!",
    file_path="output.wav",
    speaker_wav="reference_voice.wav",  # Your 10-30 second reference recording
    language="en"
)
```

## Supported Languages

XTTS v2 supports: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi
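
A small guard can catch an unsupported or mistyped language code before synthesis starts. This is a minimal sketch with the set copied from the list above. `normalize_language` is a hypothetical helper, and in the notebook `tts.languages` could serve as the source of truth instead.

```python
# Language codes copied from the list above
XTTS_V2_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}


def normalize_language(code):
    """Return a lower-cased, validated language code or raise ValueError."""
    normalized = code.strip().lower()
    if normalized not in XTTS_V2_LANGUAGES:
        supported = ", ".join(sorted(XTTS_V2_LANGUAGES))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return normalized
```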