# **Multilingual Customer Support Bot**

This notebook demonstrates how to build a full voice loop for Indian-language customer
support using Sarvam AI's speech and language models.

### **Use Case**
Automate multilingual customer support for telecom, fintech, and e-commerce companies
serving India's diverse linguistic population. The pipeline runs in three steps:

1. **Transcribe:** Use **Sarvam STT (Saarika v2.5)** to convert a customer voice query to text.
2. **Respond:** Use **Sarvam-M** to generate a helpful support reply in the customer's language.
3. **Speak:** Use **Bulbul v3 TTS** to synthesize the reply as a WAV audio file.

### **Supported Languages**

| Language | Code | STT | TTS |
| :--- | :--- | :--- | :--- |
| Hindi | hi-IN | Yes | Yes |
| Tamil | ta-IN | Yes | Yes |
| Telugu | te-IN | Yes | Yes |
| Kannada | kn-IN | Yes | Yes |
| Malayalam | ml-IN | Yes | Yes |
| Gujarati | gu-IN | Yes | Yes |
| Marathi | mr-IN | Yes | Yes |
| Bengali | bn-IN | Yes | Yes |
| English (India) | en-IN | Yes | Yes |

In [None]:
# Minimum versions; the specifiers are quoted so the shell does not treat `>` as a redirect
!pip install -Uqq "sarvamai>=0.1.24" "python-dotenv>=1.0.0" "scipy>=1.10.0" "numpy>=1.24.0"

### **1. Setup & API Key**

Obtain your API key from the [Sarvam AI Dashboard](https://dashboard.sarvam.ai).
Create a `.env` file in this directory with `SARVAM_API_KEY=your_key_here`, or set the
environment variable directly.

In [None]:
from __future__ import annotations

import base64
import mimetypes
import os
import traceback
from pathlib import Path

from dotenv import load_dotenv
from sarvamai import SarvamAI

load_dotenv()

SARVAM_API_KEY = os.environ.get("SARVAM_API_KEY", "")
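# Reject a missing key as well as the placeholder values used in templates and the .env example
if not SARVAM_API_KEY or SARVAM_API_KEY in {"YOUR_SARVAM_API_KEY", "your_key_here"}: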
    raise RuntimeError(
        "SARVAM_API_KEY is not set. Add it to your .env file or set the environment variable."
    )

client = SarvamAI(api_subscription_key=SARVAM_API_KEY)

print("Client initialised.")

### **2. Step 1 — TRANSCRIBE: Speech-to-Text**

`transcribe_query` sends the audio file to **Sarvam STT (Saarika v2.5)** and returns the
transcribed text along with the detected BCP-47 language code.

Supported input: WAV (16 kHz mono recommended) or MP3.
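
Before uploading a recording, it can help to confirm that it matches the recommended format. The following is a minimal sketch using only the standard-library `wave` module, which reads uncompressed PCM WAV files (not MP3). The helper names `describe_wav` and `matches_recommended_format` are hypothetical and are not part of the Sarvam SDK.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic format facts for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "sample_rate": sample_rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / sample_rate,
        }


def matches_recommended_format(info: dict) -> bool:
    """True when the file is 16 kHz mono, the format recommended above."""
    return info["sample_rate"] == 16000 and info["channels"] == 1
```

A file that fails the check may still be accepted by the API; the check only flags recordings that differ from the recommendation.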

In [None]:
_LANGUAGE_LABELS = {
    "hi-IN": "Hindi",
    "ta-IN": "Tamil",
    "te-IN": "Telugu",
    "kn-IN": "Kannada",
    "ml-IN": "Malayalam",
    "gu-IN": "Gujarati",
    "mr-IN": "Marathi",
    "bn-IN": "Bengali",
    "en-IN": "English (India)",
}


def transcribe_query(file_path: str) -> tuple[str, str]:
    """Transcribe a customer voice query using Sarvam STT (Saarika v2.5).

    Args:
        file_path: Path to a WAV or MP3 audio file.

    Returns:
        Tuple of (transcript, language_code) where language_code is a BCP-47 code
        such as 'hi-IN', 'ta-IN', or 'en-IN'. Defaults to 'hi-IN' if not detected.
    """
    path = Path(file_path)
    content_type = mimetypes.guess_type(str(path))[0] or 'audio/wav'
    with open(path, 'rb') as audio_file:
        response = client.speech_to_text.transcribe(
            file=(path.name, audio_file, content_type),
            model="saarika:v2.5",
        )

    transcript    = (getattr(response, 'transcript', '') or '').strip()
    language_code = getattr(response, 'language_code', 'hi-IN') or 'hi-IN'

    if not transcript:
        print("WARNING: STT returned an empty transcript. Audio quality may be insufficient.")

    label = _LANGUAGE_LABELS.get(language_code, language_code)
    print(f"Transcript ({label}): {transcript!r}")
    return transcript, language_code


print("transcribe_query defined.")

### **3. Step 2 — RESPOND: Support Agent**

`generate_response` sends the transcribed query to **Sarvam-M** with a system prompt
that instructs the model to reply as a helpful Indian customer support agent **in the
same language** as the customer.

If the transcript is empty (e.g. from low-quality audio), the function sends a short
Hindi help request in its place so the pipeline still produces a valid audio reply.
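
The empty-transcript fallback can be looked at in isolation. This is a small sketch, not the notebook's implementation: `choose_user_message` is a hypothetical helper, and `FALLBACK_QUERY` mirrors the Hindi string used in the cell below (roughly "Hello, I need help.").

```python
from typing import Optional

# Same Hindi text as the notebook's fallback query: "Hello, I need help."
FALLBACK_QUERY = (
    "\u0928\u092e\u0938\u094d\u0924\u0947, \u092e\u0941\u091d\u0947 "
    "\u0938\u0939\u093e\u092f\u0924\u093e \u091a\u093e\u0939\u093f\u090f\u0964"
)


def choose_user_message(transcript: Optional[str], fallback: str = FALLBACK_QUERY) -> str:
    """Return the cleaned transcript, or the fallback when nothing usable was transcribed."""
    cleaned = (transcript or "").strip()
    return cleaned if cleaned else fallback
```

Whitespace-only transcripts are treated the same as empty ones, so a recording that produces only padding still gets the fallback message.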

In [None]:
SUPPORT_SYSTEM_PROMPT = (
    "You are a helpful and polite Indian customer support agent for a telecom company. "
    "Your role is to assist customers with queries about mobile plans, billing, data packs, "
    "recharges, network issues, and general account questions.\n\n"
    "Instructions:\n"
    "- Always respond in the SAME language as the customer's query.\n"
    "- If the customer queries in Hindi, respond in Hindi. If in Tamil, respond in Tamil. "
    "Apply this rule for all Indian languages.\n"
    "- Keep responses concise (2-3 sentences) and friendly.\n"
    "- If the query is unclear or empty, greet the customer warmly in Hindi and ask how "
    "you can help.\n"
    "- Do not switch to English unless the customer queries in English."
)

_FALLBACK_QUERY = "\u0928\u092e\u0938\u094d\u0924\u0947, \u092e\u0941\u091d\u0947 \u0938\u0939\u093e\u092f\u0924\u093e \u091a\u093e\u0939\u093f\u090f\u0964"


def generate_response(transcript: str, language_code: str) -> str:
    """Generate a customer support response using Sarvam-M.

    The model replies in the same language as the customer query.
    An empty transcript falls back to a Hindi greeting so the pipeline
    always produces a valid response.
    """
    user_message = transcript.strip() if transcript.strip() else _FALLBACK_QUERY

    response = client.chat.completions(
        messages=[
            {"role": "system", "content": SUPPORT_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ]
    )

    if not response or not response.choices:
        raise ValueError("Sarvam-M returned no response. Check your API quota.")

    content = response.choices[0].message.content
    if content is None:
        raise ValueError("Sarvam-M returned an empty message content.")

    reply = content.strip()
    label = _LANGUAGE_LABELS.get(language_code, language_code)
    print(f"Response ({label}): {reply}")
    return reply


print("generate_response defined.")

### **4. Step 3 — SPEAK: Text-to-Speech**

`speak_response` converts the support reply to audio using **Bulbul v3** and saves the
WAV file to the `outputs/` folder.

Each language is paired with a natural-sounding speaker voice. The function falls back
to `shubh` (Hindi) for any unrecognised language code.
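
The function below decodes the TTS payload from base64 before writing it to disk. As a hedged sketch of that step, the following decodes a base64 string and checks that the result looks like a RIFF/WAVE container. The header check is an added safeguard that the notebook does not perform, and both helper names are hypothetical. The stand-in input is a silent WAV built in memory, so no API call is needed.

```python
import base64
import io
import wave


def decode_wav_payload(b64_audio: str) -> bytes:
    """Decode a base64 string and confirm it looks like a RIFF/WAVE container."""
    audio_bytes = base64.b64decode(b64_audio)
    if audio_bytes[:4] != b"RIFF" or audio_bytes[8:12] != b"WAVE":
        raise ValueError("Decoded payload does not start with a RIFF/WAVE header.")
    return audio_bytes


def make_silent_wav_b64(sample_rate: int = 24000, seconds: float = 0.1) -> str:
    """Build a short silent WAV in memory and base64-encode it as stand-in input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return base64.b64encode(buffer.getvalue()).decode("ascii")


audio = decode_wav_payload(make_silent_wav_b64())
print(f"Decoded {len(audio)} bytes of WAV data")
```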

In [None]:
_SPEAKER_MAP = {
    "hi-IN": "shubh",
    "ta-IN": "kavya",
    "te-IN": "priya",
    "kn-IN": "arvind",
    "ml-IN": "anu",
    "gu-IN": "priya",
    "mr-IN": "shubh",
    "bn-IN": "priya",
    "en-IN": "shubh",
}


def speak_response(
    text: str,
    language_code: str,
    output_dir: str = "outputs",
) -> str:
    """Convert a support response to audio using Bulbul v3 TTS.

    Args:
        text:          The response text to synthesize.
        language_code: BCP-47 language code (e.g. 'hi-IN').
        output_dir:    Directory where the WAV file is saved.

    Returns:
        Path to the saved WAV file.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    speaker = _SPEAKER_MAP.get(language_code, 'shubh')

    tts_response = client.text_to_speech.convert(
        text=text,
        target_language_code=language_code,
        model="bulbul:v3",
        speaker=speaker,
        speech_sample_rate=24000,
    )

    if not tts_response.audios:
        raise RuntimeError(
            f"Bulbul TTS returned no audio for language {language_code}. "
            "Check that the language code and speaker are supported."
        )

    audio_bytes = base64.b64decode(tts_response.audios[0])
    output_path = str(Path(output_dir) / f"response_{language_code}.wav")
    with open(output_path, 'wb') as f:
        f.write(audio_bytes)

    print(f"Audio response saved to: {output_path}")
    return output_path


print("speak_response defined.")

### **5. End-to-End Pipeline**

`handle_customer_query` ties all three steps together.
Pass any WAV or MP3 file path and receive a dict with the transcript,
detected language, response text, and path to the synthesized audio reply,
or `None` if any step fails.
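
The shape of the returned dict can be written down as a `TypedDict`. This is an optional sketch for readers who want type checking. The names `PipelineResult` and `summarise` are hypothetical, and the notebook itself returns a plain `dict`.

```python
from typing import Optional, TypedDict


class PipelineResult(TypedDict):
    """Keys returned by the pipeline on success."""

    transcript: str
    language_code: str
    response_text: str
    audio_path: str


def summarise(result: Optional[PipelineResult]) -> str:
    """One-line summary that also covers the failure case (None)."""
    if result is None:
        return "pipeline failed"
    transcript = result["transcript"] or "(empty)"
    return f"[{result['language_code']}] {transcript} -> {result['audio_path']}"
```

Callers need to handle `None` before indexing into the result, because the pipeline catches exceptions and returns `None` instead of raising.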

In [None]:
def handle_customer_query(
    audio_path: str,
    output_dir: str = "outputs",
) -> dict | None:
    """Full voice pipeline: transcribe -> respond -> speak.

    Args:
        audio_path: Path to a WAV or MP3 audio file of the customer query.
        output_dir: Directory where the TTS audio reply is saved.

    Returns:
        Dict with keys 'transcript', 'language_code', 'response_text', 'audio_path',
        or None if the pipeline fails.
    """
    print(f"Processing query: {audio_path}")
    try:
        print("  Step 1/3 — Transcribing customer query with Saarika STT...")
        transcript, language_code = transcribe_query(audio_path)

        print("  Step 2/3 — Generating support response with Sarvam-M...")
        response_text = generate_response(transcript, language_code)

        print("  Step 3/3 — Synthesizing audio reply with Bulbul TTS...")
        audio_out = speak_response(response_text, language_code, output_dir)

        return {
            "transcript":    transcript,
            "language_code": language_code,
            "response_text": response_text,
            "audio_path":    audio_out,
        }

    except Exception as e:
        traceback.print_exc()
        print(f"ERROR: Failed to process query: {e}")
        return None


print("handle_customer_query defined.")

### **6. Demo — Run the Pipeline**

The cell below generates a synthetic WAV file using `scipy`, so no microphone or real
recording is required, then runs the full pipeline on it.

> **Note:** A programmatically generated tone mix contains no speech, so STT will usually
> return an empty or near-empty transcript. The pipeline handles this case: an empty
> transcript is replaced with a short Hindi help request so that Sarvam-M and Bulbul TTS
> still produce a valid audio reply.
> In production, replace the synthetic WAV with a real customer recording.

In [None]:
import numpy as np
from scipy.io import wavfile


def _create_sample_query(
    output_path: str = "sample_data/sample_query_hi.wav",
    sample_rate: int = 16000,
    duration: float = 2.5,
) -> str:
    """Write a synthetic speech-like WAV file for demo purposes.

    The signal mixes several frequencies to loosely approximate the spectral
    envelope of a spoken utterance. Fade-in and fade-out are applied to
    avoid audible clicks.

    In production, supply a real WAV recording of the customer query instead.
    """
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)

    # Mix frequencies that span the typical human speech range (100 Hz – 3 kHz)
    components = [
        (120,  1.0),
        (250,  0.8),
        (500,  0.6),
        (800,  0.5),
        (1600, 0.3),
        (3000, 0.15),
    ]
    audio = sum(w * np.sin(2 * np.pi * f * t) for f, w in components)

    # Fade in / out to eliminate clicks
    fade = int(0.05 * sample_rate)
    audio[:fade]  *= np.linspace(0, 1, fade)
    audio[-fade:] *= np.linspace(1, 0, fade)

    # Normalise to 70 % of full scale and convert to 16-bit PCM
    # Use the absolute peak so that negative excursions are also kept within range
    audio = (audio / np.max(np.abs(audio)) * 0.70 * 32767).astype(np.int16)

    wavfile.write(output_path, sample_rate, audio)
    print(
        f"Sample query WAV created: {output_path} "
        f"({duration:.1f}s, {sample_rate} Hz, 16-bit PCM)"
    )
    return output_path


# --- Generate synthetic input and run the full pipeline ---
wav_path = _create_sample_query()
result   = handle_customer_query(wav_path)


### **7. Results**

Inspect the transcription and response text, then listen to or download the
synthesized audio reply.

In [None]:
from IPython.display import Audio, FileLink, display

if result:
    lang_label = _LANGUAGE_LABELS.get(result['language_code'], result['language_code'])

    print("=== Pipeline Result ===")
    print(f"Detected language : {lang_label} ({result['language_code']})")
    print(f"Transcript        : {result['transcript'] or '(empty — synthetic audio)'}")
    print(f"Response          : {result['response_text']}")
    print()
    print("Audio reply:")
    display(Audio(filename=result['audio_path']))
    print()
    print("Download:")
    display(FileLink(result['audio_path'], result_html_prefix="Click to download: "))
else:
    print("Processing failed. Check the error messages above.")

### **8. Error Reference**

| Error | HTTP Status | Cause | Solution |
| :--- | :--- | :--- | :--- |
| `invalid_api_key_error` | 403 | Invalid API key | Verify at [dashboard.sarvam.ai](https://dashboard.sarvam.ai). |
| `insufficient_quota_error` | 429 | Quota exceeded | Check your usage limits. |
| `internal_server_error` | 500 | Server-side issue | Wait and retry the request. |
| `WARNING: STT returned an empty transcript` | n/a | Silent or synthetic audio | Use a real customer recording. |
| `RuntimeError: Bulbul TTS returned no audio` | n/a | Unsupported language/speaker | Check that the language code is in `_SPEAKER_MAP` and is supported by Bulbul. |
| `RuntimeError: SARVAM_API_KEY is not set` | n/a | Missing API key | Add key to `.env` file. |
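
For the quota and server-side rows above, the suggested fix is to wait and retry. The following is a minimal sketch of retrying with exponential backoff. It assumes the caller maps retryable failures to `TransientAPIError`, a hypothetical exception defined here for illustration; it is not a class from the Sarvam SDK. `call_with_retry` is likewise a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical stand-in for a retryable failure such as HTTP 429 or 500."""


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientAPIError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting. Authentication errors are not retried in this sketch, since retrying does not fix an invalid key.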

### **9. Using Real Customer Audio**

Replace the synthetic demo with any real WAV or MP3 recording:

```python
result = handle_customer_query("path/to/real_query.wav")
```

The pipeline auto-detects the spoken language and produces a reply in the same language.
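
To process several recordings at once, one option is to loop over a folder. This is a sketch under assumptions: `find_audio_files` and `process_folder` are hypothetical helpers, and the handler is passed in as an argument so the example runs without the API. In the notebook, `handle_customer_query` would be the handler.

```python
from pathlib import Path
from typing import Callable, Optional

AUDIO_SUFFIXES = {".wav", ".mp3"}


def find_audio_files(folder: str) -> list[Path]:
    """Return the WAV and MP3 files directly inside folder, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def process_folder(
    folder: str,
    handler: Callable[[str], Optional[dict]],
) -> dict[str, Optional[dict]]:
    """Run handler on every audio file and map each file name to its result."""
    return {p.name: handler(str(p)) for p in find_audio_files(folder)}
```

Files whose handler returns `None` stay in the mapping, so failed recordings can be listed and retried later.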

### **10. Conclusion & Resources**

This recipe chains **Saarika STT**, **Sarvam-M**, and **Bulbul TTS** into a voice support
loop that covers the nine languages listed in the Supported Languages table.

* [Sarvam AI Docs](https://docs.sarvam.ai)
* [Saarika STT API](https://docs.sarvam.ai/api-reference-docs/speech-to-text)
* [Sarvam-M Chat API](https://docs.sarvam.ai/api-reference-docs/chat)
* [Bulbul TTS API](https://docs.sarvam.ai/api-reference-docs/text-to-speech)
* [Indic Language Support](https://docs.sarvam.ai/language-support)

**Keep Building!**