<a href="https://colab.research.google.com/github/supertone-inc/supertonic-py/blob/main/notebook/supertonic_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Supertonic: Lightning Fast, On-Device TTS

This demo introduces the basic usage of the official Python package for Supertonic.

## Key Features

- **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro)
- **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance
- **📱 On-Device Capable**: **Complete privacy** and **zero latency**
- **🎨 Natural Text Handling**: Seamlessly processes complex expressions without a G2P (grapheme-to-phoneme) module
- **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters
- **🧩 Flexible Deployment**: Deploy across servers, browsers, and edge devices


## Step 1: Install Python Package

Supertonic has minimal dependencies, just 4 core libraries:
- **onnxruntime**: Fast ONNX model inference
- **numpy**: Numerical operations
- **soundfile**: Audio file I/O
- **huggingface-hub**: Model downloads

In [None]:
# Install Supertonic
%pip install -q supertonic

## Step 2: Load Supertonic

### The model will auto-download when you create a TTS instance

- The model will be automatically downloaded from Hugging Face the first time you run this cell
- Please ensure you have at least 260MB of storage available for the model files
- [View models on Hugging Face 🤗](https://huggingface.co/Supertone/supertonic)

In [None]:
from IPython.utils import io
from IPython.display import Audio, display
import time

from supertonic import TTS

# Suppress download progress output for cleaner display
with io.capture_output() as captured:
    tts = TTS(auto_download=True)

print("✅ Download complete!")

## Step 3: Basic Usage

You can use `get_voice_style` to get a voice style by name and `synthesize` to generate speech.

### NOTE: All examples are running on Colab CPU (not on GPU)

In [None]:
# Generate speech with a simple text
text = "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
style = tts.get_voice_style(voice_name="M1")
wav, duration = tts.synthesize(text, voice_style=style, total_steps=5)

display(Audio(wav.squeeze(), rate=tts.sample_rate))
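
The "faster than real-time" figure quoted under Key Features is the ratio of audio duration to synthesis time. The sketch below shows one way to measure that ratio yourself. The helper names `timed_call` and `speedup_over_realtime` are hypothetical and not part of the Supertonic package, and the commented usage assumes the last axis of `wav` indexes audio samples.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def speedup_over_realtime(audio_seconds, elapsed_seconds):
    """Seconds of audio produced per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


# Hypothetical usage with the objects from the cell above. It assumes the
# last axis of `wav` indexes audio samples:
#   (wav, duration), elapsed = timed_call(
#       tts.synthesize, text, voice_style=style, total_steps=5)
#   audio_seconds = wav.shape[-1] / tts.sample_rate
#   print(f"{speedup_over_realtime(audio_seconds, elapsed):.1f}x real-time")
```

Results on a Colab CPU will differ from the M4 Pro figure, and the first call may include one-time warm-up cost, so timing a second call is usually more representative.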

## Step 4: Try Different Voice Styles

Supertonic provides 10 voice styles: **M1, M2, ..., M5** (male) and **F1, F2, ..., F5** (female).


In [None]:
text = "Each voice style brings unique tonal qualities and expressiveness to your content."

voice_list = tts.voice_style_names

# Show first 4 styles for brevity
for voice_name in voice_list[:4]:
    style = tts.get_voice_style(voice_name)
    wav, duration = tts.synthesize(text, voice_style=style, total_steps=5)
    print(f"\n Voice: {voice_name}")
    display(Audio(wav.squeeze(), rate=tts.sample_rate))

## Step 5: Natural Text Handling

Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.

| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|:--------:|:--------------:|:----------:|:----------:|:------:|:------:|:---------:|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |

For more details and to listen to audio samples, visit [our demo page](https://huggingface.co/spaces/Supertone/supertonic#text-handling).

In [None]:
style = tts.get_voice_style("M1")

# Complex expressions that Supertonic handles naturally
test_cases = [
    "The startup secured $5.2M in venture capital.",
    "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024.",
    "You can reach us at (212) 555-0142 ext. 402.",
    "Our battery lasts 2.3h when flying at 30kph."
]

for text in test_cases:
    wav, duration = tts.synthesize(text, voice_style=style, total_steps=5)
    print(f"\n Text: {text}")
    display(Audio(wav.squeeze(), rate=tts.sample_rate))

## Step 6: Long Text Auto-Chunking

For longer texts, Supertonic automatically splits them into manageable chunks:

- Respects sentence boundaries and common abbreviations (Mr., Mrs., Dr., etc.)
- Adds silence between chunks for natural flow
- Adjustable chunk size and silence duration

In [None]:
style = tts.get_voice_style("F1")

# Long text that will be automatically chunked
long_text = """
This is a very long text that will be automatically chunked into smaller parts.
The chunking algorithm splits text by paragraphs and sentences intelligently.
It respects sentence boundaries and common abbreviations like Mr., Mrs., Dr., etc.

Each chunk will be processed separately and then combined with silence in between.
This makes it possible to generate speech for arbitrarily long texts without running into memory issues.
The default chunk size is 300 characters, but you can adjust it based on your needs.

Here's another paragraph to make the text even longer. This demonstrates how the chunking
works across multiple paragraphs. The algorithm preserves the natural flow of speech by
adding appropriate silence between chunks.
"""

# Auto-chunking with the default settings, passed explicitly
wav, duration = tts.synthesize(
    long_text,
    voice_style=style,
    total_steps=5,
    max_chunk_length=300,  # 300 chars per chunk
    silence_duration=0.3,  # 0.3s silence between chunks
)

display(Audio(wav.squeeze(), rate=tts.sample_rate))
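
To make the chunking behavior described in this step more concrete, here is a minimal, self-contained sketch of sentence-aware chunking. It is an illustration only and not Supertonic's actual implementation: the functions `split_sentences` and `chunk_text`, the abbreviation list, and the greedy packing strategy are all assumptions.

```python
import re

# Illustrative abbreviation list (an assumption, not Supertonic's own list).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}


def split_sentences(paragraph):
    """Split after ., !, or ? followed by whitespace, skipping abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", paragraph):
        candidate = paragraph[start:match.start() + 1]
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # e.g. "Dr. Smith" is not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def chunk_text(text, max_chunk_length=300):
    """Greedily pack whole sentences into chunks of at most max_chunk_length
    characters. Chunks never span paragraphs, and a single sentence longer
    than the limit is kept whole rather than cut mid-sentence."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # collapse line wraps
        current = ""
        for sentence in split_sentences(paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chunk_length:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Calling `chunk_text(long_text, max_chunk_length=300)` on the text above gives a rough idea of where chunk boundaries could fall. The library's real boundaries may differ.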

## Step 7: Speed Control

You can adjust the speech speed. For best results, we recommend values between **0.7×** (slow) and **2.0×** (ultra fast).

Let's test with **1.5×** speed on a longer narrative:


In [None]:
style = tts.get_voice_style("F1")

# A longer narrative text that works well with 1.5x speed
speed_text = """
In the heart of Silicon Valley, a small startup was working on revolutionary technology. 
Their team had spent countless hours developing an AI system that could understand human 
emotions through voice patterns. The breakthrough came on a rainy Tuesday morning when 
Sarah, the lead engineer, discovered a novel approach to processing audio signals. 
This innovation would eventually transform how millions of people interact with technology, 
making voice interfaces more natural and responsive than ever before imagined.
"""

# Generate speech at 1.5x speed
wav, duration = tts.synthesize(
    speed_text,
    voice_style=style,
    total_steps=5,
    speed=1.5
)
print("⚡ Speed: 1.5×")
display(Audio(wav.squeeze(), rate=tts.sample_rate))
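
If the speed value comes from user input, it can help to keep it inside the recommended 0.7× to 2.0× range before calling `synthesize`. The sketch below is one way to do that. `clamp_speed` and `nominal_duration` are hypothetical helpers, not part of the Supertonic package, and the duration estimate assumes speech time scales exactly as `1 / speed`, which the model may only approximate.

```python
# Range recommended in this step of the notebook.
RECOMMENDED_SPEED_RANGE = (0.7, 2.0)


def clamp_speed(speed, bounds=RECOMMENDED_SPEED_RANGE):
    """Return speed limited to the recommended range."""
    low, high = bounds
    return min(max(speed, low), high)


def nominal_duration(base_seconds, speed):
    """Idealized duration, assuming speech time scales exactly as 1 / speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return base_seconds / speed


# Hypothetical usage with the objects from the cell above:
#   wav, duration = tts.synthesize(
#       speed_text, voice_style=style, total_steps=5, speed=clamp_speed(3.0))
```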

## Related Projects

- **🏠 Main Repository**: [github.com/supertone-inc/supertonic](https://github.com/supertone-inc/supertonic)
- **🎧 Demo (WebGPU)**: [Hugging Face Spaces](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo)
- **🤗 Model Repository**: [Hugging Face Models](https://huggingface.co/Supertone/supertonic)