<a href="https://colab.research.google.com/github/shubhammore1310/Wt-Cp/blob/master/kani_tts_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 😻 Kani TTS - Fast and Expressive Speech Generation Model

### Welcome to Kani TTS, a neural text-to-speech model built for fast, natural-sounding speech generation.

<img src="https://www.nineninesix.ai/kitty.png" width="300">

In [None]:
!pip install torch
!pip install librosa
!pip install --upgrade numpy==1.24.3
!pip install "nemo_toolkit[tts]"

In [None]:
!pip install -U "git+https://github.com/huggingface/transformers.git"

In [None]:
#@title Importing modules

import librosa
import torch
from nemo.collections.tts.models import AudioCodecModel
from IPython.display import Audio as aplay
from dataclasses import dataclass
from transformers import AutoTokenizer, AutoModelForCausalLM
import os
import numpy as np

from nemo.utils.nemo_logging import Logger

nemo_logger = Logger()
nemo_logger.remove_stream_handlers()

In [None]:
#@title config class
@dataclass
class Config:
    model_name:str = 'nineninesix/kani-tts-370m'
    device_map:str = "auto"
    tokeniser_length:int = 64400
    start_of_text:int = 1
    end_of_text:int = 2
    max_new_tokens:int = 1200
    temperature:float = 1.4
    top_p:float = .95
    repetition_penalty:float = 1.1
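
Because `Config` is a plain dataclass, sampling defaults can be overridden per run without editing the class. The sketch below is illustrative only: `SamplingConfig` is a hypothetical stand-in that mirrors the sampling fields of `Config`, and the override values are arbitrary, not recommended settings.

```python
from dataclasses import dataclass, replace


@dataclass
class SamplingConfig:
    # Hypothetical stand-in mirroring the sampling fields of Config above.
    max_new_tokens: int = 1200
    temperature: float = 1.4
    top_p: float = 0.95
    repetition_penalty: float = 1.1


default_cfg = SamplingConfig()

# replace() returns a new instance and leaves default_cfg untouched.
# The values below are arbitrary examples, not tuned recommendations.
calmer_cfg = replace(default_cfg, temperature=0.8, max_new_tokens=600)

print(default_cfg)
print(calmer_cfg)
```

The same pattern should work with the notebook's own class, for example `replace(Config(), temperature=0.8)`.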

## README

If you’re using our model `nineninesix/kani-tts-370m`, please review the supported speakers:

* `david` — David, English (British)
* `puck` — Puck, English (Gemini)
* `kore` — Kore, English (Gemini)
* `andrew` — Andrew, English
* `jenny` — Jenny, English (Irish)
* `simon` — Simon, English
* `katie` — Katie, English
* `seulgi` — Seulgi, Korean
* `bert` — Bert, German
* `thorsten` — Thorsten, German (Hessisch)
* `maria` — Maria, Spanish
* `mei` — Mei, Chinese (Cantonese)
* `ming` — Ming, Chinese (Shanghai OpenAI)
* `karim` — Karim, Arabic
* `nur` — Nur, Arabic
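
Since an unsupported speaker name is simply passed through to the model as text, it can help to validate the ID before building the prompt. This is a minimal sketch: `SUPPORTED_SPEAKERS` and `build_prompt` are hypothetical helpers, not part of the notebook, and the prefix format copies what `KaniModel.get_input_ids` does further down.

```python
# Speaker IDs copied from the list above.
SUPPORTED_SPEAKERS = {
    "david", "puck", "kore", "andrew", "jenny", "simon", "katie", "seulgi",
    "bert", "thorsten", "maria", "mei", "ming", "karim", "nur",
}


def build_prompt(text, speaker_id=None):
    """Return the text with a lowercase 'speaker: ' prefix, or unchanged if no speaker."""
    if speaker_id is None:
        return text
    key = speaker_id.lower()
    if key not in SUPPORTED_SPEAKERS:
        raise ValueError(f"Unknown speaker '{speaker_id}'")
    return f"{key}: {text}"


print(build_prompt("Hello there.", "Andrew"))
print(build_prompt("Hello there."))
```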



In [None]:
#@title player

class NemoAudioPlayer:

    """
    Audio processing and playback handler for Kani TTS model using NVIDIA NeMo Codec.

    This class manages the conversion between model-generated token sequences and playable audio
    waveforms using NVIDIA's NeMo Nano Codec. It handles specialized token vocabularies that
    include both text tokens and audio codec tokens, enabling seamless audio generation from
    language model outputs.

    The class implements a multi-modal tokenization scheme where:
    - Standard text tokens: 0 to tokeniser_length-1
    - Special control tokens: tokeniser_length+1 to tokeniser_length+7 (+8 and +9 are unused here)
    - Audio codec tokens: tokeniser_length+10 onwards (4032 codes per codebook, 4 codebooks)

    Attributes:
        conf (Config): Configuration object containing model parameters
        nemo_codec_model: Pre-trained NVIDIA NeMo Nano Codec model for audio encoding/decoding
        device (str): Computing device ('cuda' or 'cpu')
        tokenizer: Optional HuggingFace tokenizer for text processing
            NOTE: This tokenizer is a development artifact and not required for production use.
            It was used during development for debugging and validating model outputs, but is
            no longer necessary for the core functionality of audio generation.

        Token IDs:
            start_of_text (int): Marks beginning of text sequence
            end_of_text (int): Marks end of text sequence
            start_of_speech (int): Marks beginning of audio token sequence
            end_of_speech (int): Marks end of audio token sequence
            start_of_human/end_of_human (int): Human speaker markers
            start_of_ai/end_of_ai (int): AI speaker markers
            pad_token (int): Padding token for sequence alignment
            audio_tokens_start (int): Starting index for audio codec tokens
            codebook_size (int): Number of codes per audio codebook (4032)

    Methods:
        output_validation: Validates presence of required speech control tokens
        get_nano_codes: Extracts and processes audio codec tokens from model output
        get_text: Decodes text portion from tokenized sequence
        get_waveform: Converts model output tokens to playable audio waveform

    Example:
        >>> config = Config()
        >>> player = NemoAudioPlayer(config)  # tokenizer not needed for production
        >>> audio, text = player.get_waveform(model_output_tokens)
        >>> # audio is numpy array ready for playback at 22kHz

        # Legacy usage with tokenizer (for debugging only):
        >>> player = NemoAudioPlayer(config, text_tokenizer_name="microsoft/DialoGPT-medium")

    Note:
        The audio codec uses 4 codebooks with 4032 codes each, operating at 22kHz sample rate
        with 0.6kbps bitrate and 12.5fps frame rate. Audio tokens must come in groups of 4
        (one per codebook) and are offset-encoded in the token sequence.
    """

    def __init__(self, config, text_tokenizer_name: str = None) -> None:
        self.conf = config
        self.nemo_codec_model = AudioCodecModel\
                .from_pretrained("nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps").eval()
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.nemo_codec_model.to(self.device)
        self.text_tokenizer_name = text_tokenizer_name
        if self.text_tokenizer_name:
            self.tokenizer = AutoTokenizer.from_pretrained(self.text_tokenizer_name)

        self.tokeniser_length = self.conf.tokeniser_length
        self.start_of_text = self.conf.start_of_text
        self.end_of_text = self.conf.end_of_text
        self.start_of_speech = self.tokeniser_length + 1
        self.end_of_speech = self.tokeniser_length + 2
        self.start_of_human = self.tokeniser_length + 3
        self.end_of_human = self.tokeniser_length + 4
        self.start_of_ai = self.tokeniser_length + 5
        self.end_of_ai = self.tokeniser_length + 6
        self.pad_token = self.tokeniser_length + 7
        self.audio_tokens_start = self.tokeniser_length + 10
        self.codebook_size = 4032

    def output_validation(self, out_ids):
        start_of_speech_flag = self.start_of_speech in out_ids
        end_of_speech_flag = self.end_of_speech in out_ids
        if not (start_of_speech_flag and end_of_speech_flag):
            raise ValueError('Speech start/end tokens are missing from the model output!')

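    def get_nano_codes(self, out_ids):
        # Locate the speech span; .item() assumes each marker occurs exactly once.
        start_a_idx = (out_ids == self.start_of_speech).nonzero(as_tuple=True)[0].item()
        end_a_idx   = (out_ids == self.end_of_speech).nonzero(as_tuple=True)[0].item()
        if start_a_idx >= end_a_idx:
            raise ValueError('Invalid audio codes sequence!')

        audio_codes = out_ids[start_a_idx+1 : end_a_idx]
        if len(audio_codes) % 4:
            raise ValueError('The length of the sequence must be a multiple of 4!')
        # One row per frame, one column per codebook.
        audio_codes = audio_codes.reshape(-1, 4)
        # Remove the per-codebook offset (0, 4032, 8064, 12096), then the global audio offset.
        # The offsets are created on the same device as the codes to avoid a device mismatch.
        codebook_offsets = torch.tensor([self.codebook_size * i for i in range(4)],
                                        device=audio_codes.device)
        audio_codes = audio_codes - codebook_offsets
        audio_codes = audio_codes - self.audio_tokens_start
        if (audio_codes < 0).sum().item() > 0:
            raise ValueError('Invalid audio tokens!')

        # Shape becomes (1, 4, frames); len_ holds the number of frames.
        audio_codes = audio_codes.T.unsqueeze(0)
        len_ = torch.tensor([audio_codes.shape[-1]])
        return audio_codes, len_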

    def get_text(self, out_ids):
        start_t_idx = (out_ids == self.start_of_text).nonzero(as_tuple=True)[0].item()
        end_t_idx   = (out_ids == self.end_of_text).nonzero(as_tuple=True)[0].item()
        txt_tokens = out_ids[start_t_idx : end_t_idx+1]
        text = self.tokenizer.decode(txt_tokens, skip_special_tokens=True)
        return text

    def get_waveform(self, out_ids):
        out_ids = out_ids.flatten()
        self.output_validation(out_ids)
        audio_codes, len_ = self.get_nano_codes(out_ids)
        audio_codes, len_ = audio_codes.to(self.device), len_.to(self.device)
        with torch.inference_mode():
            reconstructed_audio, _ = self.nemo_codec_model.decode(tokens=audio_codes, tokens_len=len_)
            output_audio = reconstructed_audio.cpu().detach().numpy().squeeze()

        if self.text_tokenizer_name:
            text = self.get_text(out_ids)
            return output_audio, text
        else:
            return output_audio, None
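
The offset arithmetic in `get_nano_codes` can be checked by hand without loading any model. The sketch below re-implements it on plain Python lists using the constants from `Config` and `NemoAudioPlayer.__init__`. `encode_frame` and `decode_tokens` are hypothetical helper names, and the upper-bound check on each code is an extra safeguard that the class above does not perform.

```python
# Constants taken from Config and NemoAudioPlayer.__init__ above.
TOKENISER_LENGTH = 64400
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10
CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4


def encode_frame(codes):
    """Map one frame of 4 codebook indices to language-model token IDs."""
    return [AUDIO_TOKENS_START + CODEBOOK_SIZE * i + c for i, c in enumerate(codes)]


def decode_tokens(token_ids):
    """Turn a flat list of audio token IDs into one list of codes per codebook."""
    if len(token_ids) % NUM_CODEBOOKS:
        raise ValueError("The length of the sequence must be a multiple of 4!")
    codebooks = [[] for _ in range(NUM_CODEBOOKS)]
    for pos, tok in enumerate(token_ids):
        i = pos % NUM_CODEBOOKS  # position inside the frame selects the codebook
        code = tok - AUDIO_TOKENS_START - CODEBOOK_SIZE * i
        # Upper-bound check is an addition for illustration.
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError("Invalid audio tokens!")
        codebooks[i].append(code)
    return codebooks


# Two made-up frames, round-tripped through the offset scheme.
tokens = encode_frame([5, 10, 0, 4031]) + encode_frame([7, 8, 9, 10])
print(tokens)
print(decode_tokens(tokens))
```

The nested list returned by `decode_tokens` has one row per codebook, which corresponds to the `(4, frames)` layout the class builds with `reshape(-1, 4).T`.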

In [None]:
#@title KaniModel

class KaniModel:

    """
    Main inference engine for Kani TTS - converts text prompts to natural speech audio.

    This class orchestrates the complete text-to-speech pipeline by managing the large language
    model that generates multi-modal token sequences (text + audio tokens) and coordinating
    with NemoAudioPlayer for audio synthesis. It implements a conversational approach where
    text input is wrapped in special speaker tokens and the model generates corresponding
    speech responses with optional speaker identity control.

    The model architecture is based on a causal language model (about 370M parameters for the
    default `nineninesix/kani-tts-370m` checkpoint) fine-tuned to understand both text semantics
    and audio codec representations, enabling direct generation of speech without intermediate
    representations like mel-spectrograms.

    Attributes:
        conf (Config): Configuration object containing model hyperparameters
        player (NemoAudioPlayer): Audio processing handler for token-to-waveform conversion
        device (str): Computing device ('cuda' or 'cpu')
        model: Pre-trained causal language model (AutoModelForCausalLM)
        tokenizer: HuggingFace tokenizer matching the model's vocabulary

    Methods:
        get_input_ids: Formats text input with conversation markers, special tokens, and optional speaker ID
        model_request: Performs inference with sampling parameters optimized for speech
        run_model: Complete pipeline from text input to audio output with optional speaker control

    Token Flow:
        Input:  [START_OF_HUMAN] + [speaker_id:] + text_tokens + [END_OF_TEXT, END_OF_HUMAN]
        Output: [START_OF_AI] + text_tokens + [END_OF_TEXT] + [START_OF_SPEECH] + audio_tokens + [END_OF_SPEECH]

    Generation Parameters (all read from Config):
        - Temperature: Config.temperature (default 1.4)
        - Top-p: Config.top_p (default 0.95, nucleus sampling for natural variation)
        - Repetition penalty: Config.repetition_penalty (default 1.1, reduces repetitive patterns)
        - Max tokens: Config.max_new_tokens (default 1200)
        - EOS token: END_OF_SPEECH (stops generation after audio sequence)

    Example:
        >>> config = Config()
        >>> player = NemoAudioPlayer(config)
        >>> kani = KaniModel(config, player)
        >>> # Generate speech without speaker ID
        >>> audio, text = kani.run_model("Hello, how are you today?")
        >>> # Generate speech with specific speaker
        >>> audio, text = kani.run_model("Hello, how are you today?", speaker_id="andrew")
        >>> # audio contains synthesized speech as numpy array
        >>> # text contains the original input prompt

    Speaker Control:
        The speaker_id parameter selects a voice when the model has been trained with
        speaker-conditional data. The ID is lowercased and prepended to the input text
        (e.g., "andrew: Hello") to condition the model's output.

    Performance:
        - Model size: about 370M parameters for the default checkpoint
        - Precision: bfloat16 for memory efficiency
        - Device mapping: Automatic GPU/CPU allocation
        - Inference mode: torch.no_grad() disables gradient tracking during generation

    Note:
        The model generates both text and audio tokens in a single autoregressive
        generate() call, so no separate text analysis, acoustic modeling, or vocoding
        stages are needed before the codec decodes the audio tokens.
    """


    def __init__(self, config, player:NemoAudioPlayer)->None:
        self.conf = config
        self.player = player
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = AutoModelForCausalLM.from_pretrained(
                                self.conf.model_name,
                                torch_dtype=torch.bfloat16,
                                device_map=self.conf.device_map,
                            )

        self.tokenizer = AutoTokenizer.from_pretrained(self.conf.model_name)

    def get_input_ids(self, text_prompt:str, speaker_id:str = None)->tuple[torch.Tensor, torch.Tensor]:
        START_OF_HUMAN = self.player.start_of_human
        END_OF_TEXT = self.player.end_of_text
        END_OF_HUMAN = self.player.end_of_human

        # Prepend the lowercase speaker name, e.g. "andrew: Hello".
        if speaker_id is not None:
            text_prompt = f"{speaker_id.lower()}: {text_prompt}"

        # Wrap the tokenized text: [START_OF_HUMAN] + text + [END_OF_TEXT, END_OF_HUMAN].
        input_ids = self.tokenizer(text_prompt, return_tensors="pt").input_ids
        start_token = torch.tensor([[START_OF_HUMAN]], dtype=torch.int64)
        end_tokens = torch.tensor([[END_OF_TEXT, END_OF_HUMAN]], dtype=torch.int64)
        modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)

        # Single unpadded sequence, so every position is attended to.
        attention_mask = torch.ones(1, modified_input_ids.shape[1], dtype=torch.int64)
        return modified_input_ids, attention_mask


    def model_request(self, input_ids:torch.Tensor,
                          attention_mask:torch.Tensor)->torch.Tensor:

        input_ids = input_ids.to(self.device)
        attention_mask = attention_mask.to(self.device)

        with torch.no_grad():
            generated_ids = self.model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_new_tokens=self.conf.max_new_tokens,
                do_sample=True,
                temperature= self.conf.temperature,
                top_p= self.conf.top_p,
                repetition_penalty= self.conf.repetition_penalty,
                num_return_sequences=1,
                eos_token_id=self.player.end_of_speech,
            )
        return generated_ids.to('cpu')

    def run_model(self, text:str, speaker_id:str = None):
        input_ids, attention_mask = self.get_input_ids(text, speaker_id)
        model_output = self.model_request(input_ids, attention_mask)
        audio, _ = self.player.get_waveform(model_output)
        return audio, text
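
The prompt framing done by `get_input_ids` can be illustrated on plain lists, which makes the token order easy to inspect. This is a sketch under stated assumptions: `frame_prompt` is a hypothetical helper, the text token IDs are made up, and the special-token values are derived from the `Config` defaults and the offsets in `NemoAudioPlayer.__init__`.

```python
# Derived from Config defaults and NemoAudioPlayer.__init__ above.
TOKENISER_LENGTH = 64400
END_OF_TEXT = 2
START_OF_HUMAN = TOKENISER_LENGTH + 3
END_OF_HUMAN = TOKENISER_LENGTH + 4


def frame_prompt(text_token_ids):
    """Wrap text token IDs as [START_OF_HUMAN] + text + [END_OF_TEXT, END_OF_HUMAN]."""
    ids = [START_OF_HUMAN] + list(text_token_ids) + [END_OF_TEXT, END_OF_HUMAN]
    attention_mask = [1] * len(ids)  # single sequence, no padding
    return ids, attention_mask


# 101, 102, 103 are placeholder IDs, not real tokenizer output.
ids, mask = frame_prompt([101, 102, 103])
print(ids)
print(mask)
```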

## Model initialization

In [None]:
conf = Config()
player = NemoAudioPlayer(conf)
kani_model = KaniModel(conf, player)

## Inference

In [None]:
prompt =  "Hi there, I'm Kani TTS and I'm a speech generation model that can talk like a human."

speaker_id = 'andrew' # or None if your model does not support multi-speaker mode

In [None]:
#@title Model request
audio, text = kani_model.run_model(prompt, speaker_id)

print(f'TEXT: {text}')
aplay(audio, rate=22050)

TEXT: Hi there, I'm Kani TTS and I'm a speech generation model that can talk like a human.
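
The generation budget also caps clip length. With 4 audio tokens per frame and 12.5 frames per second (figures from the `NemoAudioPlayer` docstring), `max_new_tokens` gives a rough upper bound on audio duration. The helpers below are hypothetical, and `overhead_tokens` is an assumed allowance for any non-audio tokens the model emits before the speech span.

```python
TOKENS_PER_FRAME = 4       # one token per codebook
FRAMES_PER_SECOND = 12.5   # codec frame rate
SAMPLE_RATE = 22050        # playback rate used with aplay above


def max_audio_seconds(max_new_tokens, overhead_tokens=0):
    """Upper bound on clip length for a given generation budget."""
    audio_tokens = max(max_new_tokens - overhead_tokens, 0)
    frames = audio_tokens // TOKENS_PER_FRAME
    return frames / FRAMES_PER_SECOND


def clip_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Actual duration of a decoded waveform with num_samples samples."""
    return num_samples / sample_rate


print(max_audio_seconds(1200))
print(max_audio_seconds(1200, overhead_tokens=40))
print(clip_seconds(44100))
```

If generated speech cuts off mid-sentence, the prompt may be too long for the current budget; raising `Config.max_new_tokens` or splitting the text into shorter prompts are the obvious things to try.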
