# 🎙️ Chatterbox TTS - Google Colab Demo

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/viveksurmay/chatterbox-colab/blob/master/chatterbox_colab.ipynb)

Welcome to **Chatterbox TTS** by [Resemble AI](https://resemble.ai) - a state-of-the-art open-source text-to-speech model!

## 🌟 Key Features
- **Zero-shot TTS**: Generate speech in any voice with just a short audio sample
- **Emotion Control**: Unique exaggeration/intensity control for expressive speech
- **High Quality**: Resemble AI reports that listeners preferred it over closed-source systems such as ElevenLabs in side-by-side evaluations
- **Voice Conversion**: Convert existing speech to different voices
- **Watermarked**: Built-in Perth watermarking for responsible AI

---

## 📦 Installation

First, let's install Chatterbox TTS and its dependencies:

In [None]:
# Install Chatterbox TTS
!pip install chatterbox-tts

# Install additional dependencies for Colab
!pip install ipywidgets

print("✅ Installation complete!")

## 🚀 Setup and Model Loading

In [None]:
import torch
import torchaudio as ta
import numpy as np
import random
from pathlib import Path
import IPython.display as ipd
from google.colab import files
import ipywidgets as widgets
from IPython.display import display, HTML, Audio

from chatterbox.tts import ChatterboxTTS
from chatterbox.vc import ChatterboxVC

print("📚 Libraries imported successfully!")

In [None]:
# Automatically detect the best available device
if torch.cuda.is_available():
    device = "cuda"
    print(f"🚀 Using GPU: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = "mps"
    print("🍎 Using Apple Silicon MPS")
else:
    device = "cpu"
    print("💻 Using CPU (this will be slower)")

print(f"Device: {device}")

In [None]:
# Load the Chatterbox TTS model
print("🔄 Loading Chatterbox TTS model... (this may take a few minutes)")
model = ChatterboxTTS.from_pretrained(device=device)
print("✅ Model loaded successfully!")
print(f"Sample rate: {model.sr} Hz")

## 🎯 Basic Text-to-Speech Demo

Let's start with a simple example using the default voice:

In [None]:
# Example text
text = "Hello! Welcome to Chatterbox TTS. I'm excited to demonstrate high-quality text-to-speech synthesis with emotion control."

print(f"🎤 Generating speech for: '{text}'")

# Generate speech with default settings
wav = model.generate(text)

# Save and play the audio
output_path = "basic_demo.wav"
ta.save(output_path, wav, model.sr)

print("✅ Audio generated!")
print("🔊 Click play to listen:")
ipd.display(ipd.Audio(output_path))

## 🎭 Voice Cloning Demo

Upload your own audio file to clone a voice! The audio should be:
- Clear speech (no background noise)
- Roughly 3-10 seconds long
- WAV, MP3, or other common audio format
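
As a rough pre-check, the length guideline above can be tested before a clip is passed to the model. The sketch below uses only the standard-library `wave` module, so it handles uncompressed WAV files only; the helper names and the three labels are hypothetical and not part of Chatterbox.

```python
import wave

MIN_SECONDS = 3.0   # lower end of the suggested range
MAX_SECONDS = 10.0  # upper end of the suggested range


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def describe_reference_clip(path):
    """Label a clip as 'too short', 'ok', or 'longer than needed'."""
    duration = wav_duration_seconds(path)
    if duration < MIN_SECONDS:
        return "too short"
    if duration > MAX_SECONDS:
        return "longer than needed"
    return "ok"
```

In the notebook, this could be called as `describe_reference_clip(audio_file)` once a WAV file has been uploaded; other formats such as MP3 would need a different loader.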

In [None]:
# Upload audio file for voice cloning
print("📁 Please upload an audio file for voice cloning:")
uploaded = files.upload()

if uploaded:
    # Get the uploaded file name
    audio_file = list(uploaded.keys())[0]
    print(f"✅ Uploaded: {audio_file}")
    
    # Play the uploaded audio
    print("🔊 Your uploaded audio:")
    ipd.display(ipd.Audio(audio_file))
else:
    print("❌ No file uploaded. Using default voice.")
    audio_file = None

In [None]:
# Text for voice cloning demo
clone_text = "This is a demonstration of voice cloning using Chatterbox TTS. The model can replicate the voice characteristics from the uploaded audio sample."

if audio_file:
    print(f"🎭 Cloning voice from '{audio_file}'...")
    print(f"📝 Text: '{clone_text}'")
    
    # Generate speech with the uploaded voice
    cloned_wav = model.generate(
        clone_text,
        audio_prompt_path=audio_file,
        exaggeration=0.5,
        cfg_weight=0.5
    )
    
    # Save and play the cloned audio
    clone_output_path = "voice_cloned.wav"
    ta.save(clone_output_path, cloned_wav, model.sr)
    
    print("✅ Voice cloning complete!")
    print("🔊 Cloned voice result:")
    ipd.display(ipd.Audio(clone_output_path))
else:
    print("⚠️ No audio file uploaded. Skipping voice cloning demo.")

## 🎛️ Interactive Demo with Advanced Controls

Experiment with different parameters to control the speech generation:

In [None]:
# Create interactive widgets
text_widget = widgets.Textarea(
    value="The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet!",
    description="Text:",
    layout=widgets.Layout(width='100%', height='80px')
)

exaggeration_widget = widgets.FloatSlider(
    value=0.5,
    min=0.25,
    max=2.0,
    step=0.05,
    description="Exaggeration:",
    tooltip="Controls emotion intensity (0.5 = neutral, higher = more expressive)"
)

cfg_weight_widget = widgets.FloatSlider(
    value=0.5,
    min=0.0,
    max=1.0,
    step=0.05,
    description="CFG Weight:",
    tooltip="Controls pacing (lower = slower, more deliberate)"
)

temperature_widget = widgets.FloatSlider(
    value=0.8,
    min=0.1,
    max=2.0,
    step=0.1,
    description="Temperature:",
    tooltip="Controls randomness (lower = more consistent)"
)

generate_button = widgets.Button(
    description="🎤 Generate Speech",
    button_style='primary',
    layout=widgets.Layout(width='200px', height='40px')
)

output_widget = widgets.Output()

# Display widgets
display(text_widget)
display(widgets.HBox([exaggeration_widget, cfg_weight_widget]))
display(temperature_widget)
display(generate_button)
display(output_widget)

In [None]:
# Interactive generation function
def generate_interactive(button):
    with output_widget:
        output_widget.clear_output()
        
        text = text_widget.value.strip()
        if not text:
            print("❌ Please enter some text!")
            return
        
        print("🎤 Generating speech...")
        print(f"📝 Text: '{text[:50]}{'...' if len(text) > 50 else ''}'")
        print(f"🎭 Exaggeration: {exaggeration_widget.value}")
        print(f"⚡ CFG Weight: {cfg_weight_widget.value}")
        print(f"🌡️ Temperature: {temperature_widget.value}")
        
        try:
            # Generate speech with custom parameters
            wav = model.generate(
                text,
                audio_prompt_path=audio_file if 'audio_file' in globals() and audio_file else None,
                exaggeration=exaggeration_widget.value,
                cfg_weight=cfg_weight_widget.value,
                temperature=temperature_widget.value
            )
            
            # Save and play the audio
            interactive_output_path = "interactive_demo.wav"
            ta.save(interactive_output_path, wav, model.sr)
            
            print("✅ Generation complete!")
            print("🔊 Click play to listen:")
            ipd.display(ipd.Audio(interactive_output_path))
            
        except Exception as e:
            print(f"❌ Error: {str(e)}")

# Connect button to function
generate_button.on_click(generate_interactive)

print("🎛️ Interactive demo ready! Adjust the parameters above and click 'Generate Speech'")

## 💡 Tips for Best Results

### General Use (TTS and Voice Agents):
- Default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts
- If the reference speaker has a fast speaking style, lower `cfg_weight` to around `0.3`

### Expressive or Dramatic Speech:
- Try lower `cfg_weight` values (e.g. `~0.3`) and increase `exaggeration` to around `0.7` or higher
- Higher `exaggeration` tends to speed up speech; reducing `cfg_weight` helps compensate

### Voice Cloning Tips:
- Use clear, high-quality audio samples (3-10 seconds)
- Avoid background noise or music
- Single speaker recordings work best
- The model is optimized for English; results with other accents and languages may vary
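
The settings suggested above can be collected into named presets so they are easy to reuse. This is a minimal sketch: the preset names and the `generation_kwargs` helper are made up for illustration, and only the `exaggeration` and `cfg_weight` values come from the tips.

```python
# Values taken from the tips above; the preset names are hypothetical.
PRESETS = {
    "general": {"exaggeration": 0.5, "cfg_weight": 0.5},
    "fast_reference_speaker": {"exaggeration": 0.5, "cfg_weight": 0.3},
    "expressive": {"exaggeration": 0.7, "cfg_weight": 0.3},
}


def generation_kwargs(style="general", **overrides):
    """Return a copy of the preset for `style`, with any overrides applied."""
    if style not in PRESETS:
        raise ValueError(f"Unknown style {style!r}; choose from {sorted(PRESETS)}")
    return {**PRESETS[style], **overrides}
```

With the model loaded earlier in the notebook, a call might look like `model.generate(text, **generation_kwargs("expressive"))`, and a single value can be changed with `generation_kwargs("expressive", exaggeration=0.9)`.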

## 🔄 Voice Conversion Demo

Convert existing speech to a different voice using Chatterbox VC:

In [None]:
# Load the Voice Conversion model
print("🔄 Loading Chatterbox VC model...")
vc_model = ChatterboxVC.from_pretrained(device=device)
print("✅ Voice Conversion model loaded!")

In [None]:
# Voice conversion demo
print("📁 Upload source audio (speech to convert):")
source_uploaded = files.upload()

if source_uploaded:
    source_audio = list(source_uploaded.keys())[0]
    print(f"✅ Source audio: {source_audio}")
    
    print("🔊 Original audio:")
    ipd.display(ipd.Audio(source_audio))
    
    # Use the previously uploaded target voice or upload a new one
    target_voice = audio_file if 'audio_file' in globals() and audio_file else None
    
    if not target_voice:
        print("📁 Upload target voice (voice to convert to):")
        target_uploaded = files.upload()
        if target_uploaded:
            target_voice = list(target_uploaded.keys())[0]
    
    if target_voice:
        print(f"🎭 Converting '{source_audio}' to sound like '{target_voice}'...")
        
        # Perform voice conversion
        converted_wav = vc_model.generate(
            source_audio,
            target_voice_path=target_voice
        )
        
        # Save and play the converted audio
        vc_output_path = "voice_converted.wav"
        ta.save(vc_output_path, converted_wav, vc_model.sr)
        
        print("✅ Voice conversion complete!")
        print("🔊 Converted audio:")
        ipd.display(ipd.Audio(vc_output_path))
    else:
        print("❌ No target voice provided. Skipping voice conversion.")
else:
    print("❌ No source audio uploaded. Skipping voice conversion demo.")

## 📝 Example Texts to Try

Here are some interesting texts to experiment with different voices and emotions:

In [None]:
# Example texts for different scenarios
example_texts = {
    "Neutral Narration": "In a world where artificial intelligence meets human creativity, new possibilities emerge every day. Technology continues to reshape how we communicate and express ourselves.",
    
    "Excited Announcement": "Ladies and gentlemen, welcome to the most incredible show on Earth! Tonight, you'll witness amazing feats that will leave you speechless and wanting more!",
    
    "Dramatic Storytelling": "The storm raged through the night, lightning illuminating the dark castle on the hill. Inside, a mysterious figure waited by the window, watching for signs of dawn.",
    
    "Educational Content": "Today we'll learn about the fascinating world of machine learning. Neural networks process information much like the human brain, using interconnected nodes to recognize patterns.",
    
    "Gaming Commentary": "That was an absolutely incredible play! The team coordination was perfect, and that final move secured their victory in the championship match!",
    
    "Meditation Guide": "Take a deep breath and let your mind settle. Feel the tension leaving your body as you focus on the present moment. Peace and calm surround you."
}

print("📝 Example texts loaded! Copy any of these into the interactive demo above:")
print()

for category, text in example_texts.items():
    print(f"**{category}:**")
    print(f'"{text}"')
    print()

## 💾 Download Generated Audio

Download your generated audio files:

In [None]:
# Download generated audio files
import os

audio_files = [
    "basic_demo.wav",
    "voice_cloned.wav", 
    "interactive_demo.wav",
    "voice_converted.wav"
]

print("💾 Available audio files for download:")
downloaded = 0  # count only the files that actually exist
for file in audio_files:
    if os.path.exists(file):
        print(f"✅ {file}")
        files.download(file)
        downloaded += 1
    else:
        print(f"❌ {file} (not generated yet)")

print(f"\n📁 {downloaded} file(s) sent to your browser for download.")
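
If several files were generated, one archive can be more convenient than a separate download per file. The following is a sketch using the standard-library `zipfile` module; the function name and the default archive name `chatterbox_outputs.zip` are assumptions chosen for this example.

```python
import os
import zipfile


def bundle_existing_files(paths, archive_path="chatterbox_outputs.zip"):
    """Zip the files in `paths` that exist and return the paths that were added."""
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            if os.path.exists(path):
                # Store each file under its base name so the archive is flat.
                archive.write(path, arcname=os.path.basename(path))
                added.append(path)
    return added
```

In Colab, `bundle_existing_files(audio_files)` followed by `files.download("chatterbox_outputs.zip")` would fetch everything in one step. Note that an empty archive is still written when none of the files exist, so it is worth checking the returned list first.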

## 🔍 Watermark Detection

Audio generated by Chatterbox carries an imperceptible Perth watermark for responsible AI. The cell below checks the files generated above for it:

In [None]:
# Install Perth watermarker if not already installed
!pip install resemble-perth

import os  # used by check_watermark; repeated in case the download cell was skipped
import perth
import librosa

def check_watermark(audio_path):
    """Check if an audio file contains a Perth watermark"""
    if not os.path.exists(audio_path):
        return f"❌ File {audio_path} not found"
    
    try:
        # Load the audio
        watermarked_audio, sr = librosa.load(audio_path, sr=None)
        
        # Initialize watermarker
        watermarker = perth.PerthImplicitWatermarker()
        
        # Extract watermark
        watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
        
        if watermark > 0.5:
            return f"✅ {audio_path}: Watermarked (confidence: {watermark:.3f})"
        else:
            return f"❌ {audio_path}: No watermark detected (confidence: {watermark:.3f})"
    except Exception as e:
        return f"❌ Error checking {audio_path}: {str(e)}"

print("🔍 Checking watermarks in generated audio files:")
print()

for file in audio_files:
    result = check_watermark(file)
    print(result)

## 🎉 Conclusion

Congratulations! You've successfully used Chatterbox TTS in Google Colab. Here's what you've learned:

✅ **Basic TTS**: Generate speech from text with the default voice  
✅ **Voice Cloning**: Clone any voice from a short audio sample  
✅ **Parameter Control**: Fine-tune exaggeration, pacing, and temperature  
✅ **Voice Conversion**: Convert existing speech to different voices  
✅ **Watermark Detection**: Verify responsible AI watermarking  

### 🚀 Next Steps:
- Try different voices and experiment with parameters
- Use Chatterbox in your own projects via the Python API
- Check out the [official repository](https://github.com/resemble-ai/chatterbox) for more examples
- Join the [Discord community](https://discord.gg/rJq9cRJBJ6) for support and discussions

### 📚 Resources:
- [Chatterbox GitHub](https://github.com/resemble-ai/chatterbox)
- [Resemble AI](https://resemble.ai)
- [Demo Samples](https://resemble-ai.github.io/chatterbox_demopage/)
- [Hugging Face Space](https://huggingface.co/spaces/ResembleAI/Chatterbox)

---

**Made with ♥️ by [Resemble AI](https://resemble.ai)**

*Remember to use this technology responsibly and ethically!*