# 🎙️ Qwen3-TTS: Advanced Text-to-Speech AI

<div align="center">
  
[![Support](https://img.shields.io/badge/Buy%20Me%20a%20Coffee-Support-FFDD00?style=for-the-badge&logo=buymeacoffee&logoColor=black)](https://buymeacoffee.com/lynettethecat)
[![License](https://img.shields.io/badge/License-MIT-blue?style=for-the-badge)](https://opensource.org/licenses/MIT)

**Created by [Lynette](https://www.youtube.com/@LynetteTheCatOfficial)**

</div>

---

## 📌 What is This Notebook?

This notebook runs Qwen3-TTS in your browser on a Colab GPU runtime, with no local setup required.

✅ **Voice Cloning** - Clone a voice from as little as 3 seconds of reference audio  
✅ **Custom Voices** - 9 preset character voices in multiple languages  
✅ **Voice Design** - Create unique voices from text descriptions  

---

## 🚀 Features

| Feature | Description | Speed (T4) |
|---------|-------------|------------|
| 🎤 **Voice Cloning** | Clone a voice from reference audio | RTF 3.5-5x |
| 🎭 **Custom Voice** | 9 preset voices (English, Chinese, Japanese, Korean, German, French, Spanish) | RTF 3.5-5x |
| 🎨 **Voice Design** | Design voices from text descriptions | RTF 3.5-5x |
| 🌍 **Multilingual** | Support for 9 languages | - |
| ⚡ **Optimized** | FP16 + SDPA + TF32 enabled | Max Speed |

---

## 📖 How to Use

1. **Run all cells** (Runtime → Run all) or click ▶️ on each cell
2. **Wait for the Gradio interface** to launch (~2-3 minutes for the first model load)
3. **Click the public link** that appears
4. **Choose a tab** and start generating!

---

## 💡 Tips for Best Results

### Voice Cloning
- Use **clear reference audio** (3+ seconds)
- Provide **transcript** for better quality
- Enable **Fast Mode** for quicker results
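
As a quick pre-check before uploading, the sketch below measures a WAV reference clip with Python's standard `wave` module and flags clips shorter than the 3-second guideline above. The names `wav_duration_seconds`, `check_reference_clip`, and `MIN_REFERENCE_SECONDS` are illustrative and not part of this notebook or of `qwen_tts`, and the helper only handles uncompressed PCM WAV files.

```python
import wave

# Guideline taken from the tips above; it is not a limit enforced by the model.
MIN_REFERENCE_SECONDS = 3.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
    return frames / float(rate)


def check_reference_clip(path, min_seconds=MIN_REFERENCE_SECONDS):
    """Return (is_long_enough, duration_seconds) for a reference clip."""
    duration = wav_duration_seconds(path)
    return duration >= min_seconds, duration
```

Calling `check_reference_clip("my_voice.wav")` on a hypothetical file returns a flag and the measured duration, so a too-short clip can be caught before the model is loaded.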

### Custom Voice
- Try different voices to find your favorite
- Use **style instructions** like "speak slowly" or "cheerful tone"

### Voice Design
- Be **specific** in descriptions: age, gender, emotion, accent
- Example: *"A young female, cheerful, speaking clearly with British accent"*
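
To keep descriptions consistently specific, a small helper can assemble them from the attributes listed above. This is only a sketch: `build_voice_description` is a hypothetical convenience function, not part of `qwen_tts`, and the model accepts any free-form text, so the exact phrasing it produces is just one reasonable choice.

```python
def build_voice_description(age, gender, emotion, accent=None, style=None):
    """Compose a voice description from age, gender, emotion, style, and accent."""
    # Pick "An" for ages such as "elderly", "A" otherwise.
    article = "An" if age[:1].lower() in ("a", "e", "i", "o", "u") else "A"
    parts = [f"{article} {age} {gender}", emotion]

    # Style and accent both describe how the voice speaks, so join them.
    speaking = []
    if style:
        speaking.append(style)
    if accent:
        speaking.append(f"with a {accent} accent")
    if speaking:
        parts.append("speaking " + " ".join(speaking))

    return ", ".join(parts)


print(build_voice_description("young", "female", "cheerful",
                              accent="British", style="clearly"))
# prints: A young female, cheerful, speaking clearly with a British accent
```

The resulting string can be pasted into the Voice Description box of the Voice Design tab.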

---

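## ⚙️ Technical Details

- **Model**: Qwen3-TTS 1.7B (Base, CustomVoice, VoiceDesign)
- **Hardware**: Google Colab T4 GPU
- **Optimizations**: FP16 precision, SDPA attention, TF32 (TF32 only takes effect on Ampere or newer GPUs, so it has no effect on a T4)
- **Expected Speed**: RTF 3.5-5x, where RTF is generation time divided by audio duration (at RTF 5x, 20 seconds of audio takes about 100 seconds to generate)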
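
The speed figures can be turned into a rough wait-time estimate. The sketch below uses the same ratio the notebook logs after each generation (generation time over audio duration); the function names are illustrative only, and actual timings will vary with text length and GPU load.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF as logged by the notebook: generation time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds, rtf):
    """Estimate how long generating a clip takes at a given RTF."""
    return audio_seconds * rtf


print(real_time_factor(100, 20))             # prints 5.0
print(estimated_generation_seconds(10, 3.5))  # prints 35.0
```

An RTF above 1 means generation is slower than real time, which is expected on a T4.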

---

## 🙏 Credits

- **Qwen Team** for the amazing TTS models
- **Hugging Face** for model hosting
- **AIQuest Academy** for this notebook

---

## 📜 License

This notebook is free to use and modify. Qwen3-TTS models are licensed under their respective terms.

---

**⬇️ Run the cells below to get started! ⬇️**


In [None]:
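# Install dependencies. Pin a matching Torch / CUDA 12.1 build first so that
# flash-attn is built against it.
# Note: the model cell below loads with attn_implementation="sdpa", so
# flash-attn is not required by that call. FlashAttention-2 targets Ampere or
# newer GPUs, and building it on a T4 runtime can take a long time.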
!pip uninstall -y torch torchvision torchaudio
!pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121
!pip install ninja
!pip install flash-attn --no-build-isolation
!pip install -U qwen-tts gradio huggingface_hub

In [None]:
import gradio as gr
from qwen_tts import Qwen3TTSModel
import torch
import soundfile as sf
import tempfile
import gc
import time

# Global variables
current_model = None
current_model_type = None

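# Enable PyTorch optimizations
torch.backends.cudnn.benchmark = True
# Allow TF32 math using the flags that exist in the pinned torch 2.4.1.
# The fp32_precision setting is only available in newer PyTorch releases.
# TF32 only takes effect on Ampere or newer GPUs.
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = True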

print(f"GPU: {torch.cuda.get_device_name(0)}")

def load_model(model_type):
    """Load model with SDPA optimization"""
    global current_model, current_model_type

    if current_model_type == model_type:
        print(f"✅ Using cached {model_type} model")
        return current_model

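    if current_model is not None:
        print(f"Unloading {current_model_type} model...")
        # Rebind instead of `del`: deleting the global would leave the name
        # undefined (NameError on later calls) if the next load fails.
        current_model = None
        current_model_type = None
        gc.collect()
        torch.cuda.empty_cache()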

    print(f"Loading {model_type} model (1.7B)...")
    start = time.time()

    try:
        if model_type == "base":
            model_name = "Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        elif model_type == "custom":
            model_name = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
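        elif model_type == "design":
            model_name = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        else:
            # Fail with a clear message instead of an unbound model_name.
            raise ValueError(f"Unknown model type: {model_type!r}")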

        current_model = Qwen3TTSModel.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="cuda:0",
            attn_implementation="sdpa"
        )

        current_model_type = model_type
        load_time = time.time() - start

        allocated = torch.cuda.memory_allocated(0) / 1024**3
        print(f"✅ Loaded in {load_time:.1f}s | GPU: {allocated:.2f}GB")

        return current_model

    except Exception as e:
        print(f"❌ Error: {str(e)}")
        import traceback
        traceback.print_exc()
        return None

def voice_clone(text, reference_audio, ref_transcript, use_fast_mode):
    """Generate speech by cloning a reference voice"""
    if not text or not reference_audio:
        return None

    try:
        total_start = time.time()
        model = load_model("base")
        if model is None:
            return None

        print("⏱️ Creating prompt...")
        prompt_start = time.time()

        if use_fast_mode or not ref_transcript:
            prompt_items = model.create_voice_clone_prompt(
                ref_audio=reference_audio,
                x_vector_only_mode=True
            )
        else:
            prompt_items = model.create_voice_clone_prompt(
                ref_audio=reference_audio,
                ref_text=ref_transcript,
                x_vector_only_mode=False
            )

        prompt_time = time.time() - prompt_start
        print(f"   Prompt: {prompt_time:.1f}s")

        print("⏱️ Generating audio...")
        gen_start = time.time()

        with torch.inference_mode():
            wavs, sr = model.generate_voice_clone(
                text=text,
                voice_clone_prompt=prompt_items
            )

        gen_time = time.time() - gen_start

        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        temp_file.close()  # release the open handle before soundfile writes to the path
        sf.write(temp_file.name, wavs[0], sr)

        total_time = time.time() - total_start
        audio_duration = len(wavs[0]) / sr
        rtf = gen_time / audio_duration

        print(f"✅ Done! Total: {total_time:.1f}s | Gen: {gen_time:.1f}s | Audio: {audio_duration:.1f}s | RTF: {rtf:.2f}x")

        torch.cuda.empty_cache()
        gc.collect()

        return temp_file.name

    except Exception as e:
        print(f"❌ Error in voice_clone: {str(e)}")
        import traceback
        traceback.print_exc()
        return None

def custom_voice(text, voice_name, instruction):
    """Generate speech using preset voices"""
    if not text:
        return None

    try:
        total_start = time.time()
        model = load_model("custom")
        if model is None:
            return None

        print(f"⏱️ Generating with voice: {voice_name}...")
        if instruction and instruction.strip():
            print(f"   Style instruction: '{instruction}'")

        gen_start = time.time()

        with torch.inference_mode():
            if instruction and instruction.strip():
                wavs, sr = model.generate_custom_voice(
                    text=text,
                    speaker=voice_name,
                    instruct=instruction
                )
            else:
                wavs, sr = model.generate_custom_voice(
                    text=text,
                    speaker=voice_name
                )

        gen_time = time.time() - gen_start

        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        temp_file.close()  # release the open handle before soundfile writes to the path
        sf.write(temp_file.name, wavs[0], sr)

        total_time = time.time() - total_start
        audio_duration = len(wavs[0]) / sr
        rtf = gen_time / audio_duration

        print(f"✅ Done! Total: {total_time:.1f}s | Gen: {gen_time:.1f}s | Audio: {audio_duration:.1f}s | RTF: {rtf:.2f}x")

        torch.cuda.empty_cache()
        gc.collect()

        return temp_file.name

    except Exception as e:
        print(f"❌ Error in custom_voice: {str(e)}")
        import traceback
        traceback.print_exc()
        return None

def voice_design(text, voice_description):
    """Generate speech from text description"""
    if not text or not voice_description:
        return None

    try:
        total_start = time.time()
        model = load_model("design")
        if model is None:
            return None

        print("⏱️ Generating...")
        gen_start = time.time()

        with torch.inference_mode():
            wavs, sr = model.generate_voice_design(
                text=text,
                instruct=voice_description
            )

        gen_time = time.time() - gen_start

        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        temp_file.close()  # release the open handle before soundfile writes to the path
        sf.write(temp_file.name, wavs[0], sr)

        total_time = time.time() - total_start
        audio_duration = len(wavs[0]) / sr
        rtf = gen_time / audio_duration

        print(f"✅ Done! Total: {total_time:.1f}s | Gen: {gen_time:.1f}s | Audio: {audio_duration:.1f}s | RTF: {rtf:.2f}x")

        torch.cuda.empty_cache()
        gc.collect()

        return temp_file.name

    except Exception as e:
        print(f"❌ Error in voice_design: {str(e)}")
        import traceback
        traceback.print_exc()
        return None

# Custom CSS for clean branding
custom_css = """
/* Creator Badge */
.creator-badge {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    padding: 12px 20px;
    border-radius: 8px;
    text-align: center;
    margin-bottom: 20px;
    box-shadow: 0 4px 6px rgba(0,0,0,0.1);
}

.creator-badge p {
    color: white;
    margin: 0;
    font-size: 1em;
    font-weight: 500;
}

.creator-badge strong {
    font-weight: 700;
    font-size: 1.1em;
}

/* Social Buttons */
.social-buttons {
    display: flex;
    gap: 12px;
    justify-content: center;
    margin: 15px 0 25px 0;
    flex-wrap: wrap;
}

.social-btn {
    display: inline-flex;
    align-items: center;
    gap: 8px;
    padding: 10px 20px;
    border-radius: 8px;
    text-decoration: none;
    font-weight: 600;
    font-size: 14px;
    transition: all 0.2s ease;
    box-shadow: 0 2px 8px rgba(0,0,0,0.15);
}

.social-btn:hover {
    transform: translateY(-2px);
    box-shadow: 0 4px 12px rgba(0,0,0,0.25);
}

.youtube-btn {
    background: #FF0000;
    color: white !important;
}

.twitter-btn {
    background: #000000;
    color: white !important;
}

/* Footer */
.aiquest-footer {
    text-align: center;
    padding: 15px;
    background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
    border-radius: 8px;
    margin-top: 20px;
    font-size: 0.9em;
    color: #555;
}

.aiquest-footer strong {
    color: #667eea;
}

/* Button styling */
.gr-button-primary {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%) !important;
    border: none !important;
}
"""

# Gradio Interface
with gr.Blocks(title="Qwen3-TTS - By AIQuest Academy", css=custom_css) as demo:

    # Creator Badge
    gr.HTML("""
        <div class="creator-badge">
            <p>📺 Notebook created by <strong>Lynette</strong></p>
        </div>
    """)

    # Main Title
    gr.Markdown("# 🎙️ Qwen3-TTS: Voice Clone, Custom Voice & Voice Design")
    gr.Markdown("### Advanced Text-to-Speech AI | Using 1.7B models with SDPA optimization")

    with gr.Tab("🎤 Voice Cloning"):
        gr.Markdown("### Clone any voice with 3+ seconds of audio")

        with gr.Row():
            with gr.Column():
                clone_text = gr.Textbox(
                    label="Text to Synthesize",
                    placeholder="Enter text (shorter = faster)...",
                    lines=4
                )
                clone_audio = gr.Audio(
                    label="Reference Audio (3+ seconds)",
                    type="filepath"
                )
                clone_transcript = gr.Textbox(
                    label="Transcript (Optional - improves quality)",
                    placeholder="What's said in the audio...",
                    lines=3
                )
                clone_fast_mode = gr.Checkbox(
                    label="Fast Mode (skip transcript)",
                    value=True
                )
                clone_btn = gr.Button("🎵 Generate Speech", variant="primary", size="lg")

            with gr.Column():
                clone_output = gr.Audio(label="Generated Speech")
                gr.Markdown("""
                **Expected Timing (T4 GPU):**
                - First use: ~2-3 minutes (model load)
                - RTF: 3.5-5x
                - 10s audio ≈ 35-50s
                - 20s audio ≈ 70-100s
                """)

        clone_btn.click(
            voice_clone,
            inputs=[clone_text, clone_audio, clone_transcript, clone_fast_mode],
            outputs=clone_output
        )

    with gr.Tab("🎭 Custom Voice"):
        gr.Markdown("### Use 9 preset character voices with style control")

        with gr.Row():
            with gr.Column():
                custom_text = gr.Textbox(
                    label="Text to Synthesize",
                    placeholder="Enter text...",
                    lines=4
                )
                custom_voice_name = gr.Dropdown(
                    choices=[
                        "serena",     # Female voice
                        "vivian",     # Female voice
                        "ono_anna",   # Female voice (Japanese-style)
                        "sohee",      # Female voice (Korean-style)
                        "aiden",      # Male voice
                        "dylan",      # Male voice
                        "eric",       # Male voice
                        "ryan",       # Male voice
                        "uncle_fu"    # Male voice (Chinese-style)
                    ],
                    label="Voice Character",
                    value="serena"
                )
                custom_instruction = gr.Textbox(
                    label="Style Instruction (Optional)",
                    placeholder="e.g., 'speak slowly and cheerfully'",
                    lines=2
                )

                gr.Markdown("""
                **Voice Guide:**
                - **Female**: serena, vivian, ono_anna, sohee
                - **Male**: aiden, dylan, eric, ryan, uncle_fu

                **Style Instructions Examples:**
                - "speak slowly and clearly"
                - "cheerful and energetic"
                - "whisper softly"
                - "authoritative tone"
                """)

                custom_btn = gr.Button("🎵 Generate Speech", variant="primary", size="lg")

            with gr.Column():
                custom_output = gr.Audio(label="Generated Speech")
                gr.Markdown("**Timing**: RTF 3.5-5x")

        custom_btn.click(
            custom_voice,
            inputs=[custom_text, custom_voice_name, custom_instruction],
            outputs=custom_output
        )

    with gr.Tab("🎨 Voice Design"):
        gr.Markdown("### Design a unique voice from text description")

        with gr.Row():
            with gr.Column():
                design_text = gr.Textbox(
                    label="Text to Synthesize",
                    placeholder="Enter text...",
                    lines=4
                )
                design_description = gr.Textbox(
                    label="Voice Description",
                    placeholder="A young female, cheerful, speaking clearly",
                    lines=4
                )

                gr.Markdown("""
                **Description Tips:**
                - Age: young / middle-aged / elderly
                - Gender: male / female
                - Emotion: cheerful / serious / calm / excited
                - Style: clear / soft / authoritative / energetic

                **Examples:**
                - "A middle-aged male, deep and authoritative, speaking slowly"
                - "A young female, cheerful and bubbly, speaking energetically"
                - "An elderly man, warm and gentle, speaking softly"
                """)

                design_btn = gr.Button("🎵 Generate Speech", variant="primary", size="lg")

            with gr.Column():
                design_output = gr.Audio(label="Generated Speech")
                gr.Markdown("**Timing**: RTF 3.5-5x")

        design_btn.click(
            voice_design,
            inputs=[design_text, design_description],
            outputs=design_output
        )

    # Performance Info
    gr.Markdown("---")
    with gr.Accordion("⚡ Performance & Technical Info", open=False):
        gr.Markdown("""
        ### Performance Metrics
        - **RTF 5x (100s for 20s audio) is normal for T4 GPU**
        - HuggingFace Spaces use A100 GPUs (5-10x faster hardware)
        - Optimizations active: FP16, SDPA, TF32

        ### Speed Tips
        - Use shorter text for faster results
        - Enable Fast Mode in Voice Cloning
        - First generation includes model loading time

        ### Features
        - ✅ Voice Cloning with optional transcript
        - ✅ 9 Custom preset voices with style instructions
        - ✅ Voice Design from text descriptions
        - ✅ Multilingual support (9 languages)

        ### Models Used
        - Qwen3-TTS-12Hz-1.7B-Base (Voice Cloning)
        - Qwen3-TTS-12Hz-1.7B-CustomVoice (Preset Voices)
        - Qwen3-TTS-12Hz-1.7B-VoiceDesign (AI Voice Creation)
        """)

print("="*70)
print("🎙️ Qwen3-TTS Notebook")
print("📺 Created by: Lynette")
print("="*70)
print("\nStarting Qwen3-TTS with optimizations...")
demo.launch(share=True, debug=True, theme=gr.themes.Soft())