# 🎙️ MOSS-TTS 1.7B (Zero-Shot Voice Cloning)
### Free Colab Notebook (T4 GPU) | Made with ❤️ by **AIQUEST**

---

<div align="center">

  <img src="https://img.shields.io/badge/AIQUESTAcademy-blueviolet?style=for-the-badge&logo=youtube&logoColor=white" />
  <img src="https://img.shields.io/badge/Colab-Free%20Tier-orange?style=for-the-badge&logo=googlecolab&logoColor=white" />
  <img src="https://img.shields.io/badge/Model-1.7B%20Params-green?style=for-the-badge" />

  <br><br>

  <a href="https://www.youtube.com/@aiquestacademy?sub_confirmation=1">
    <img src="https://img.shields.io/badge/▶%20Subscribe%20on%20YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white" />
  </a>
  &nbsp;
  <a href="https://x.com/aiquestacademy">
    <img src="https://img.shields.io/badge/Follow%20on%20𝕏-000000?style=for-the-badge&logo=x&logoColor=white" />
  </a>

</div>

---

### 📖 What is this Notebook?

This notebook lets you run **MOSS-TTS 1.7B**, a state-of-the-art **zero-shot text-to-speech** and **voice cloning** model, entirely **free** on Google Colab's T4 GPU.

**Key Features:**
- 🗣️ **Zero-Shot Voice Cloning:** Clone any voice from a short audio reference
- 🎵 **High-Quality TTS:** 1.7 billion parameter model for natural speech
- ⚡ **T4 Optimized:** Runs on Colab free tier (no Pro needed!)
- 🎛️ **Quality Presets:** From fast drafts (8 RVQ) to maximum quality (32 RVQ)
- 🎚️ **Full Control:** Temperature, top-p, top-k, repetition penalty, speed
- 🌐 **Gradio UI:** Beautiful web interface with shareable public link

### 🚀 How to Use
1. Make sure **Runtime → Change runtime type → T4 GPU** is selected
2. Run all cells in order (Runtime → Run all)
3. Wait for the Gradio link to appear (~5-8 min on first run)
4. Open the public link and start generating speech!

### 📌 Credits
- **Model:** [MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) by OpenMOSS Team
- **Notebook:** Optimized & packaged by **AIQUEST** ([@aiquestacademy](https://youtube.com/@aiquestacademy))

---


### 🔍 Step 1: Check GPU
Let's verify that a **T4 GPU** is available. If you see "Tesla T4", you're good to go!
If not, go to **Runtime → Change runtime type → T4 GPU**.


In [None]:
!nvidia-smi
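
The same check can be done from Python, which is handy if you want the notebook to warn you before the long install and download steps. This is a minimal sketch using only the standard library; the helper name `detect_gpu_name` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess
from typing import Optional


def detect_gpu_name() -> Optional[str]:
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    names = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return names[0] if names else None


gpu = detect_gpu_name()
if gpu is None:
    print("No NVIDIA GPU detected; select a T4 runtime before continuing.")
elif "T4" not in gpu:
    print(f"Found {gpu}; this notebook was tuned for a T4, so limits may differ.")
else:
    print(f"Found {gpu}")
```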


### 📦 Step 2: Install Dependencies
This cell installs PyTorch (CUDA 11.8 build), Transformers, Gradio, and the other required packages.
It takes ~2-3 minutes.


In [None]:
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -q transformers accelerate librosa soundfile gradio
!pip install -q einops omegaconf pyyaml scipy datasets sentencepiece protobuf
print("✅ Done")
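
Because `pip install -q` hides most output, a quick check that the key packages are actually present can save a confusing import error later. The sketch below uses the standard library's `importlib.metadata`; the helper name `missing_packages` and the `REQUIRED` list are illustrative choices, not part of the notebook.

```python
from importlib import metadata

# Distribution names taken from the pip commands in this step (subset shown).
REQUIRED = [
    "torch", "torchaudio", "transformers", "accelerate",
    "librosa", "soundfile", "gradio", "einops",
]


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


not_installed = missing_packages(REQUIRED)
if not_installed:
    print("Missing packages:", ", ".join(not_installed))
else:
    print("All required packages are installed.")
```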

### 📥 Step 3: Clone MOSS-TTS Repository
Cloning the official MOSS-TTS repo from GitHub.


In [None]:
!git clone https://github.com/OpenMOSS/MOSS-TTS.git
%cd MOSS-TTS
print("✅ MOSS-TTS repository cloned")
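
Re-running this cell after a successful clone makes `git clone` fail because the target directory already exists. One way around that is to guard the clone, as in this sketch. The helper `clone_if_missing` and its `runner` parameter are hypothetical names introduced here for illustration, and the shallow `--depth 1` clone is an assumption that you only need the latest commit.

```python
import os
import subprocess


def clone_if_missing(url, dest, runner=subprocess.run):
    """Clone `url` into `dest` unless a git checkout is already there.

    Returns True if a clone was started, False if it was skipped.
    """
    if os.path.isdir(os.path.join(dest, ".git")):
        return False
    runner(["git", "clone", "--depth", "1", url, dest], check=True)
    return True


# Example call (commented out so the snippet has no side effects):
# clone_if_missing("https://github.com/OpenMOSS/MOSS-TTS.git", "MOSS-TTS")
```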


### ⚙️ Step 4: Load Model & Launch Gradio Interface
This cell loads the **MOSS-TTS 1.7B** model and launches a **Gradio** web UI.

**First run downloads ~13GB** of model weights, which takes ~5-8 minutes.
After that, you'll get a **public Gradio link** you can share with anyone!

> 💡 **Tip:** Use the **Fast (8 RVQ)** preset for the longest audio on free tier.
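
To pick a sensible **Max Tokens** value, it helps to convert between tokens and seconds of audio. The cell below estimates duration as `max_new_tokens / 12.5`, so the sketch here reuses that figure; the function names are hypothetical. Note that the out-of-memory message in the same cell quotes limits that work out to roughly 10 tokens per second (for example, 7200 tokens for 12 minutes), so treat either rate as approximate.

```python
import math

# Rate used by the notebook's own duration estimate; not a measured value.
TOKENS_PER_SECOND = 12.5


def estimate_audio_seconds(max_new_tokens: int) -> float:
    """Approximate audio length produced by a given token budget."""
    return max_new_tokens / TOKENS_PER_SECOND


def tokens_for_seconds(seconds: float) -> int:
    """Approximate token budget needed for a target audio length."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


print(estimate_audio_seconds(2500))  # default slider value
print(tokens_for_seconds(60))        # budget for one minute of speech
```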


In [None]:
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor, GenerationConfig
import gradio as gr
import os
from datetime import datetime
import importlib.util
import traceback
import gc
import time
import atexit
import warnings

# Suppress warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

# Memory optimization settings for T4
torch.backends.cuda.enable_cudnn_sdp(True)
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

class DelayGenerationConfig(GenerationConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.layers = kwargs.get("layers", [{} for _ in range(32)])
        self.do_samples = kwargs.get("do_samples", None)
        self.n_vq_for_inference = 32

# Global variables
model = None
processor = None
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

def cleanup_model():
    """Unload the model and processor, then release cached GPU memory."""
    global model, processor
    if model is not None:
        print("🧹 Cleaning up model from GPU...")
        model = None
    if processor is not None:
        if hasattr(processor, 'audio_tokenizer'):
            del processor.audio_tokenizer
        processor = None
    if device == "cuda":
        # Collect Python garbage first so freed tensors can be returned by empty_cache()
        gc.collect()
        torch.cuda.empty_cache()
        print("✅ GPU memory cleared!")

atexit.register(cleanup_model)

def resolve_attn_implementation() -> str:
    if device == "cuda":
        return "sdpa"
    return "eager"

def load_model():
    """Load model with optimized settings"""
    global model, processor

    if model is None:
        print("🔄 Loading MOSS-TTS (this takes ~5 min on first run)...")

        attn_implementation = resolve_attn_implementation()
        print(f"Using attention: {attn_implementation}")

        processor = AutoProcessor.from_pretrained(
            "OpenMOSS-Team/MOSS-TTS-Local-Transformer",
            trust_remote_code=True,
        )

        processor.audio_tokenizer = processor.audio_tokenizer.to(device)

        if device == "cuda":
            torch.cuda.empty_cache()
            gc.collect()

        model = AutoModel.from_pretrained(
            "OpenMOSS-Team/MOSS-TTS-Local-Transformer",
            trust_remote_code=True,
            attn_implementation=attn_implementation,
            torch_dtype=dtype,
            low_cpu_mem_usage=True,
        ).to(device)

        model.eval()

        if device == "cuda":
            torch.cuda.empty_cache()
            gc.collect()
            vram = torch.cuda.memory_allocated() / 1024**3
            print(f"✅ Model loaded! VRAM: {vram:.2f}GB")

    return model, processor

PRESETS = {
    "Fast (8 RVQ)": {
        "n_vq": 8,
        "text_temp": 1.5,
        "audio_temp": 0.95,
        "text_top_p": 1.0,
        "audio_top_p": 0.95,
        "text_top_k": 50,
        "audio_top_k": 50,
        "audio_rep_pen": 1.1
    },
    "Balanced (16 RVQ)": {
        "n_vq": 16,
        "text_temp": 1.5,
        "audio_temp": 0.95,
        "text_top_p": 1.0,
        "audio_top_p": 0.95,
        "text_top_k": 50,
        "audio_top_k": 50,
        "audio_rep_pen": 1.1
    },
    "High Quality (24 RVQ)": {
        "n_vq": 24,
        "text_temp": 1.5,
        "audio_temp": 0.95,
        "text_top_p": 1.0,
        "audio_top_p": 0.95,
        "text_top_k": 50,
        "audio_top_k": 50,
        "audio_rep_pen": 1.1
    },
    "Maximum (32 RVQ)": {
        "n_vq": 32,
        "text_temp": 1.5,
        "audio_temp": 0.95,
        "text_top_p": 1.0,
        "audio_top_p": 0.95,
        "text_top_k": 50,
        "audio_top_k": 50,
        "audio_rep_pen": 1.1
    }
}

def apply_preset(preset_name):
    """Return preset values"""
    preset = PRESETS[preset_name]
    return (
        preset["n_vq"],
        preset["text_temp"],
        preset["text_top_p"],
        preset["text_top_k"],
        preset["audio_temp"],
        preset["audio_top_p"],
        preset["audio_top_k"],
        preset["audio_rep_pen"]
    )

def generate_speech(
    text,
    reference_audio,
    max_new_tokens,
    speed,
    text_temp,
    text_top_p,
    text_top_k,
    audio_temp,
    audio_top_p,
    audio_top_k,
    audio_repetition_penalty,
    n_vq,
    progress=gr.Progress()
):
    """Generate TTS with memory-efficient long-form generation"""

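    if not text or len(text.strip()) == 0:
        # This function is a generator, so the message must be yielded;
        # a returned value would never reach the Gradio outputs.
        yield None, "⚠️ Please enter text!"
        return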

    try:
        os.makedirs("outputs", exist_ok=True)

        progress(0, desc="Loading model...")
        model, processor = load_model()

        text_length = len(text)
        estimated_duration = max_new_tokens / 12.5

        status = f"📝 Text: {text_length:,} chars\n"
        status += f"🎯 Target: {max_new_tokens} tokens (~{estimated_duration/60:.1f} min)\n\n"

        yield None, status

        # Build conversation
        progress(0.1, desc="Processing...")
        if reference_audio is not None:
            status += f"🎙️ Voice cloning: {os.path.basename(reference_audio)}\n"
            conversations = [[
                processor.build_user_message(text=text, reference=[reference_audio])
            ]]
        else:
            status += "🎙️ Default voice\n"
            conversations = [[
                processor.build_user_message(text=text)
            ]]

        yield None, status

        # Process input
        batch = processor(conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # Workaround: avoid a temperature of exactly 1.0 by nudging it slightly
        if text_temp == 1.0:
            text_temp = 1.001
        if audio_temp == 1.0:
            audio_temp = 1.001

        # Generation config
        generation_config = DelayGenerationConfig()
        generation_config.pad_token_id = processor.tokenizer.pad_token_id
        generation_config.eos_token_id = 151653
        generation_config.max_new_tokens = max_new_tokens
        generation_config.use_cache = True
        generation_config.do_sample = True
        generation_config.num_beams = 1

        generation_config.n_vq_for_inference = n_vq
        generation_config.do_samples = [True] * (n_vq + 1)
        generation_config.layers = [
            {
                "repetition_penalty": 1.0,
                "temperature": text_temp,
                "top_p": text_top_p,
                "top_k": text_top_k
            }
        ] + [
            {
                "repetition_penalty": audio_repetition_penalty,
                "temperature": audio_temp,
                "top_p": audio_top_p,
                "top_k": audio_top_k
            }
        ] * n_vq

        # Clear cache
        progress(0.2, desc="Clearing cache...")
        if device == "cuda":
            torch.cuda.empty_cache()
            gc.collect()

        status += "\n🎵 Generating...\n"
        yield None, status

        # Generate
        start_time = time.time()
        progress(0.3, desc="Generating...")

        with torch.no_grad():
            outputs = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                generation_config=generation_config
            )

        gen_time = time.time() - start_time

        progress(0.85, desc="Decoding...")
        status += f"✅ Generated in {gen_time:.1f}s\n"
        status += "🔊 Decoding...\n"
        yield None, status

        # Decode
        decoded_messages = processor.decode(outputs)
        audio = decoded_messages[0].audio_codes_list[0]

        # Clear memory
        if device == "cuda":
            del outputs, input_ids, attention_mask, batch, decoded_messages
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
            gc.collect()

        # Speed: resample by 1/speed and keep the original sample rate on save.
        # Resampling there and back again would leave the duration unchanged.
        # Note that this simple approach shifts pitch along with tempo.
        progress(0.94, desc="Speed adjust...")
        if speed != 1.0:
            sample_rate = processor.model_config.sampling_rate
            # Fewer samples at the same playback rate means faster speech
            new_sample_rate = int(sample_rate / speed)
            resampler = torchaudio.transforms.Resample(
                orig_freq=sample_rate,
                new_freq=new_sample_rate
            ).to(audio.device)
            audio = resampler(audio.unsqueeze(0)).squeeze(0)

        progress(0.97, desc="Saving...")

        # Save
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_path = f"outputs/moss_tts_{timestamp}.wav"
        torchaudio.save(
            output_path,
            audio.unsqueeze(0),
            processor.model_config.sampling_rate
        )

        duration = len(audio) / processor.model_config.sampling_rate
        vram = torch.cuda.memory_allocated() / 1024**3 if device == "cuda" else 0
        rtf = gen_time / duration if duration > 0 else 0

        progress(1.0, desc="Done!")

        status += "\n🎉 SUCCESS!\n"
        status += f"📏 Audio: {duration:.1f}s ({duration/60:.2f} min)\n"
        status += f"⏱️ Generation: {gen_time:.1f}s ({gen_time/60:.1f} min)\n"
        status += f"🚀 RTF: {rtf:.2f}x\n"
        status += f"🎚️ Speed: {speed}x\n"
        status += f"📊 VRAM: {vram:.2f}GB\n"
        status += f"🎛️ RVQ: {n_vq}/32\n"
        status += f"💾 {output_path}"

        yield output_path, status

    except torch.cuda.OutOfMemoryError:
        error_msg = "❌ OUT OF MEMORY!\n\n"
        error_msg += f"Tried: {max_new_tokens} tokens with {n_vq} RVQ\n\n"
        error_msg += "Solutions:\n"
        error_msg += "1. Reduce Max Tokens\n"
        error_msg += "2. Use Fast (8 RVQ) preset\n"
        error_msg += "3. Click 'Clear GPU' and retry\n\n"
        error_msg += "T4 Limits:\n"
        error_msg += "• 8 RVQ: ~7200 tokens (12 min)\n"
        error_msg += "• 16 RVQ: ~4800 tokens (8 min)\n"
        error_msg += "• 24 RVQ: ~3000 tokens (5 min)\n"
        error_msg += "• 32 RVQ: ~2400 tokens (4 min)"
        yield None, error_msg
    except Exception as e:
        error_msg = f"❌ Error: {str(e)}\n\n{traceback.format_exc()}"
        yield None, error_msg

# =============================================
# GRADIO INTERFACE WITH AIQUEST BRANDING
# =============================================

custom_css = """
.aiquest-header {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    padding: 20px;
    border-radius: 12px;
    margin-bottom: 16px;
    text-align: center;
    color: white;
}
.aiquest-header h1 {
    margin: 0 0 8px 0;
    font-size: 1.8em;
    color: white !important;
}
.aiquest-header p {
    margin: 4px 0;
    opacity: 0.95;
    color: white !important;
}
.social-buttons {
    display: flex;
    justify-content: center;
    gap: 12px;
    margin-top: 12px;
    flex-wrap: wrap;
}
.social-buttons a {
    display: inline-flex;
    align-items: center;
    gap: 6px;
    padding: 8px 18px;
    border-radius: 8px;
    text-decoration: none;
    font-weight: 600;
    font-size: 0.95em;
    transition: transform 0.2s, box-shadow 0.2s;
}
.social-buttons a:hover {
    transform: translateY(-2px);
    box-shadow: 0 4px 12px rgba(0,0,0,0.3);
}
.yt-btn {
    background: #FF0000;
    color: white !important;
}
.x-btn {
    background: #000000;
    color: white !important;
}
.aiquest-footer {
    text-align: center;
    padding: 14px;
    margin-top: 16px;
    border-radius: 10px;
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    font-size: 0.9em;
}
.aiquest-footer a {
    color: #ffd700 !important;
    text-decoration: none;
    font-weight: 600;
}
"""

with gr.Blocks(
    title="MOSS-TTS by AIQUEST",
    theme=gr.themes.Soft(),
    css=custom_css
) as demo:

    # ---- HEADER WITH BRANDING ----
    gr.HTML("""
    <div class="aiquest-header">
        <h1>🎙️ MOSS-TTS 1.7B Zero-Shot Voice Cloning</h1>
        <p>1.7B parameter TTS & voice cloning • Optimized for Colab Free Tier (T4 GPU)</p>
        <p style="font-size:0.85em; opacity:0.8;">Model by OpenMOSS • Notebook by <b>AIQUEST</b></p>
        <div class="social-buttons">
            <a href="https://www.youtube.com/@aiquestacademy?sub_confirmation=1" target="_blank" class="yt-btn">
                ▶ Subscribe on YouTube
            </a>
            <a href="https://x.com/aiquestacademy" target="_blank" class="x-btn">
                𝕏 Follow on X
            </a>
        </div>
    </div>
    """)

    with gr.Row():
        with gr.Column(scale=1):
            text_input = gr.Textbox(
                label="📝 Text",
                placeholder="Paste your script here...",
                lines=10,
                value="Hello! This is MOSS text-to-speech, running on Google Colab free tier. Notebook by AIQUEST."
            )

            reference_audio = gr.Audio(
                label="🎤 Reference Voice (Optional: upload for voice cloning)",
                type="filepath",
                sources=["upload"]
            )

            preset_dropdown = gr.Dropdown(
                choices=list(PRESETS.keys()),
                value="Balanced (16 RVQ)",
                label="Preset"
            )

            with gr.Row():
                max_tokens = gr.Slider(
                    50, 5000, 2500, step=100,
                    label="Max Tokens"
                )
                speed = gr.Slider(
                    0.5, 2.0, 1.0, step=0.1,
                    label="Speed"
                )

            with gr.Accordion("⚙️ Advanced Settings", open=False):
                # Default matches the initially selected "Balanced (16 RVQ)" preset
                n_vq = gr.Slider(8, 32, 16, step=1, label="RVQ Layers")
                with gr.Row():
                    text_temp = gr.Slider(0.1, 2.0, 1.5, step=0.1, label="Text Temp")
                    text_top_p = gr.Slider(0.1, 1.0, 1.0, step=0.05, label="Text Top-P")
                    text_top_k = gr.Slider(1, 100, 50, step=1, label="Text Top-K")
                with gr.Row():
                    audio_temp = gr.Slider(0.1, 2.0, 0.95, step=0.05, label="Audio Temp")
                    audio_top_p = gr.Slider(0.1, 1.0, 0.95, step=0.05, label="Audio Top-P")
                with gr.Row():
                    audio_top_k = gr.Slider(1, 100, 50, step=1, label="Audio Top-K")
                    audio_rep_pen = gr.Slider(1.0, 1.5, 1.1, step=0.05, label="Rep Penalty")

            with gr.Row():
                generate_btn = gr.Button("🎵 Generate Speech", variant="primary", size="lg", scale=3)
                clear_btn = gr.Button("🧹 Clear GPU", variant="secondary", size="lg", scale=1)

        with gr.Column(scale=1):
            audio_output = gr.Audio(label="🔊 Generated Audio", type="filepath")
            status_output = gr.Textbox(label="📊 Status", lines=16, interactive=False)

    # ---- FOOTER WITH BRANDING ----
    gr.HTML("""
    <div class="aiquest-footer">
        Made with ❤️ by <b>AIQUEST</b> &nbsp;|&nbsp;
        <a href="https://www.youtube.com/@aiquestacademy?sub_confirmation=1" target="_blank">▶ YouTube</a> &nbsp;|&nbsp;
        <a href="https://x.com/aiquestacademy" target="_blank">𝕏 / Twitter</a>
        <br><span style="opacity:0.7;">If you found this useful, please subscribe & share! 🙏</span>
    </div>
    """)

    preset_dropdown.change(
        fn=apply_preset,
        inputs=[preset_dropdown],
        outputs=[n_vq, text_temp, text_top_p, text_top_k,
                audio_temp, audio_top_p, audio_top_k, audio_rep_pen]
    )

    generate_btn.click(
        fn=generate_speech,
        inputs=[text_input, reference_audio, max_tokens, speed,
                text_temp, text_top_p, text_top_k,
                audio_temp, audio_top_p, audio_top_k,
                audio_rep_pen, n_vq],
        outputs=[audio_output, status_output]
    )

    def clear_memory():
        cleanup_model()
        return "✅ GPU cleared! Ready for next generation."

    clear_btn.click(fn=clear_memory, inputs=[], outputs=[status_output])

print("✅ MOSS-TTS ready! Launching Gradio...")
demo.launch(share=True, debug=True)
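
The per-layer sampling settings built inside `generate_speech` are easier to follow in isolation: the list holds one entry for the text layer followed by one entry per RVQ audio layer, giving `n_vq + 1` entries in total. The sketch below mirrors that layout in plain Python. The helper name `build_layer_configs` is hypothetical, and the upper bound of 32 is taken from the notebook's "Maximum (32 RVQ)" preset. Unlike the `[...] * n_vq` form used in the cell, it gives each audio layer its own dict so that editing one layer does not silently change the others.

```python
def build_layer_configs(n_vq, text_cfg, audio_cfg):
    """Return [text layer config] + n_vq independent audio layer configs."""
    if not 1 <= n_vq <= 32:
        raise ValueError("n_vq must be between 1 and 32")
    return [dict(text_cfg)] + [dict(audio_cfg) for _ in range(n_vq)]


# Values copied from the notebook's "Fast (8 RVQ)" preset.
text_cfg = {"repetition_penalty": 1.0, "temperature": 1.5, "top_p": 1.0, "top_k": 50}
audio_cfg = {"repetition_penalty": 1.1, "temperature": 0.95, "top_p": 0.95, "top_k": 50}

layers = build_layer_configs(8, text_cfg, audio_cfg)
print(len(layers))  # one text layer plus eight audio layers
```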


