
# Speech-to-Text & Text-to-Speech (Hugging Face) — Colab-ready Notebook

This notebook is a teaching resource for a class on **Speech-to-Text (ASR)** and **Text-to-Speech (TTS)**. It uses **Hugging Face models** and other open-source tools, with examples for **English** and **Bengali (Bangla)**.

Features included:
- Install & setup (Colab-friendly)
- ASR: Whisper (multilingual) + Bengali Wav2Vec2 models
- TTS: Facebook MMS (English & Bengali) + community Bengali TTS
- Voice cloning / zero-shot cloning using Coqui XTTS (local in Colab)
- Saving and playing audio in the notebook

Notes:
- The first run downloads model weights and needs internet access. Later runs in the same environment reuse the local cache.
- A GPU runtime (for example, a Colab GPU) speeds up inference, but not every example requires one.
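
Since the GPU is optional, a cell can choose the device at run time. The following is a minimal sketch, assuming PyTorch may or may not be installed yet; `pick_device` is a hypothetical helper name used for illustration, not part of any library.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so this cell also runs before the install step
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

Recent `transformers` releases accept this string through the `device` argument of `pipeline(...)`, so later cells could pass `device=device` if desired.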

---

## 0) Install required libraries (run in Colab)

In [None]:
# Run in Colab. Lines that are commented out are optional pinned installs; uncomment them if needed.
# !pip install -q  transformers==4.39.3 datasets==2.18.0 soundfile==0.12.1 librosa==0.10.2
!pip install -q "huggingface_hub>=0.14.1"
# !pip install -q transformers[speech]
!pip install -q git+https://github.com/coqui-ai/TTS.git@main  # for XTTS voice-cloning
!pip install -q soundfile
!pip install -q ipywidgets
# !pip install -q torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121

Task overview: text -> speech (TTS) | speech -> text (STT) | chatbot: text -> text

OpenAI Realtime API: speech -> speech

Note: If you run into CUDA or torch wheel issues on Colab, follow Colab's recommended torch install.

## 1) Utilities: load audio, play, save

In [None]:
import torch
import torchaudio
import soundfile as sf
from IPython.display import Audio, display
import requests
from io import BytesIO
import os


In [None]:
# Helper: download audio from a URL; returns (mono float32 waveform, sample rate)
def load_audio_from_url(url, target_sr=16000):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # stop early on HTTP errors
    audio_bytes = BytesIO(resp.content)
    data, sr = sf.read(audio_bytes, dtype="float32")
    # convert stereo to mono if needed
    if data.ndim > 1:
        data = data.mean(axis=1)
    # resample if needed
    if sr != target_sr:
        wav = torch.from_numpy(data).unsqueeze(0)
        data = torchaudio.functional.resample(wav, sr, target_sr).squeeze(0).numpy()
        sr = target_sr
    return data, sr

# Helper: save numpy audio to file
def save_audio_np(waveform, sr, outpath):
    sf.write(outpath, waveform, sr)

# Play audio in notebook
def play_audio_file(path, autoplay=False):
    display(Audio(path, autoplay=autoplay))

In [None]:
print('utils loaded')
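
If no sample recording is at hand, a short test tone is enough to try the playback helper. This sketch uses only the Python standard library; `write_test_tone` and the file name `test_tone.wav` are hypothetical names chosen for illustration.

```python
import math
import struct
import wave


def write_test_tone(path, freq_hz=440.0, seconds=1.0, sr=16000):
    """Write a mono 16-bit PCM sine wave to `path` and return the frame count."""
    n_frames = int(seconds * sr)
    frames = bytearray()
    for i in range(n_frames):
        sample = 0.3 * math.sin(2 * math.pi * freq_hz * i / sr)
        frames += struct.pack("<h", int(sample * 32767))  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(sr)
        wf.writeframes(bytes(frames))
    return n_frames


n = write_test_tone("test_tone.wav")
print("Wrote", n, "frames to test_tone.wav")
```

The resulting file can then be passed to `play_audio_file('test_tone.wav')`.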


Hosted STT option: Deepgram (https://deepgram.com/)

In [None]:
# GPU -> model hosting -> latency | performance

## 2) ASR: English — Whisper (Hugging Face `transformers` pipeline)


We'll use `pipeline('automatic-speech-recognition')` with an OpenAI Whisper model. Whisper is robust and multilingual.


In [None]:
from transformers import pipeline

print('Loading Whisper ASR pipeline (this will download model weights)')
asr_whisper = pipeline('automatic-speech-recognition', model='openai/whisper-small')


print('Transcribing...')
res = asr_whisper('/content/LJ001-0002.wav')
print('Transcription:', res['text'])
# Upload LJ001-0002.wav to /content before running this cell; it is played back in a later cell.

In [None]:
# Call recording -> split audio -> model -> transcription

# STT model -> open source

# Fine-tuning
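
For the "split audio, then transcribe" flow on long recordings, the `transformers` ASR pipeline can do the splitting itself through `chunk_length_s`, and `return_timestamps=True` adds a `chunks` list to the result. The sketch below assumes that output shape; `transcribe_long` and `format_chunks` are hypothetical helper names, and the demo data is hand-written rather than real model output.

```python
def transcribe_long(path, model_id="openai/whisper-small"):
    """Transcribe a long recording; the pipeline processes it in 30 s windows."""
    from transformers import pipeline  # lazy import: only needed when called

    asr = pipeline("automatic-speech-recognition", model=model_id, chunk_length_s=30)
    return asr(path, return_timestamps=True)


def _fmt(t):
    # A timestamp can be None, e.g. when the last segment has no detected end
    return "?" if t is None else f"{t:.1f}"


def format_chunks(chunks):
    """Turn pipeline chunks into '[start-end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"[{_fmt(start)}-{_fmt(end)}] {chunk['text'].strip()}")
    return lines


# Offline demonstration with hand-written chunks in the assumed output shape
demo = [
    {"timestamp": (0.0, 4.2), "text": " Hello and welcome."},
    {"timestamp": (4.2, None), "text": " Thanks for calling."},
]
print("\n".join(format_chunks(demo)))
```

In Colab, `format_chunks(transcribe_long('/content/your_call.wav')['chunks'])` would apply the same formatting to a real recording; the file name here is a placeholder.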

In [None]:
play_audio_file('LJ001-0002.wav')

In [None]:
print('Transcribing...')
res = asr_whisper('/content/LJ001-0006.wav')
print('Transcription:', res['text'])
# Upload LJ001-0006.wav to /content before running this cell; it is played back in the next cell.

In [None]:
play_audio_file('LJ001-0006.wav')

## 3) ASR: Bengali — Wav2Vec2 models


Many community models fine-tuned for Bengali exist (wav2vec2 / XLS-R). Two options are shown here:

- `arijitx/wav2vec2-xls-r-300m-bengali` (fine-tuned XLS-R)
- `ai4bharat/indicwav2vec_v1_bengali` (another community model)

We'll use the pipeline API for simplicity.


In [None]:
print('Loading Bengali ASR (wav2vec2) — may take a moment')
asr_bengali = pipeline('automatic-speech-recognition', model='arijitx/wav2vec2-xls-r-300m-bengali')

# Replace the path below with your own Bengali audio file (upload it to Colab first).
# This example assumes 'train_barishal (1).wav' has been uploaded to /content.
res_bn = asr_bengali('/content/train_barishal (1).wav')
print('Bengali transcription:', res_bn['text'])

In [None]:
print('Loading Bengali ASR (wav2vec2) — may take a moment')
asr_bengali = pipeline('automatic-speech-recognition', model='ai4bharat/indicwav2vec_v1_bengali')

# Same audio file as in the previous cell, so the outputs of the two models can be compared.
# Replace the path with your own Bengali audio file if needed.
res_bn = asr_bengali('/content/train_barishal (1).wav')
print('Bengali transcription:', res_bn['text'])
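
To compare the two checkpoints with a number rather than by eye, word error rate (WER) against a hand-made reference transcript is a common choice. The sketch below is a plain-Python edit-distance version, assuming whitespace tokenisation is adequate for the text; `word_error_rate` is a hypothetical helper, and the example strings are illustrative, not outputs of either model.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


reference = "আমি বাংলায় গান গাই"
hypothesis = "আমি বাংলা গান গাই"  # one word differs from the reference
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")  # prints "WER: 0.25"
```

Running each model's `res_bn['text']` through the same function against one reference gives a like-for-like comparison.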

## 4) TTS: English & Bengali


In [None]:
# Install the latest version of transformers and accelerate
# !pip install -q --upgrade transformers accelerate scipy soundfile
import torch

from transformers import VitsModel, AutoTokenizer, set_seed
import scipy.io.wavfile as wavfile
from IPython.display import Audio

print("Loading MMS-TTS English model...")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
model = VitsModel.from_pretrained("facebook/mms-tts-eng")

text = "Hello, students! This is a demo of English TTS using Facebook MMS."
inputs = tokenizer(text, return_tensors="pt")

# Set seed for deterministic output
set_seed(42)

with torch.no_grad():
    outputs = model(**inputs).waveform

waveform = outputs[0].cpu().numpy()
sr = model.config.sampling_rate

wavfile.write("mms_tts_eng.wav", rate=sr, data=waveform)
print("Saved as mms_tts_eng.wav — sample below:")

Audio("mms_tts_eng.wav", rate=sr)
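
The same steps should carry over to Bengali by swapping the checkpoint. The sketch below assumes the `facebook/mms-tts-ben` checkpoint is available on the Hub and that its tokenizer accepts Bengali script directly; `synthesize_mms`, `MMS_MODELS`, and the `RUN_DEMO` flag are hypothetical names introduced for illustration.

```python
MMS_MODELS = {"eng": "facebook/mms-tts-eng", "ben": "facebook/mms-tts-ben"}


def synthesize_mms(text, lang="ben", seed=42):
    """Return (waveform, sampling_rate) for `text` using an MMS-TTS checkpoint."""
    import torch
    from transformers import AutoTokenizer, VitsModel, set_seed

    model_id = MMS_MODELS[lang]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    set_seed(seed)  # VITS sampling is stochastic; a fixed seed gives repeatable audio
    with torch.no_grad():
        waveform = model(**inputs).waveform[0].cpu().numpy()
    return waveform, model.config.sampling_rate


RUN_DEMO = False  # set to True in Colab; the checkpoint is downloaded on first use
if RUN_DEMO:
    import scipy.io.wavfile as wavfile

    wav, sr = synthesize_mms("আমি বাংলায় গান গাই", lang="ben")
    wavfile.write("mms_tts_ben.wav", rate=sr, data=wav)
    print("Saved mms_tts_ben.wav")
```

Keeping the model id in a small dictionary makes it easy to add further MMS languages later without touching the synthesis function.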


In [None]:
# YouTube video -> link -> audio -> transcription -> model(transcription) -> Refine .... -> summary
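
One way to read the "refine" step in this flow is to feed the transcript to a model chunk by chunk while carrying a running summary forward. The sketch below shows only that control flow, under the assumption that the transcript is already available as text; `chunk_words`, `refine_summary`, and `toy_refine` are hypothetical names, and `toy_refine` is a stand-in for a real model call.

```python
def chunk_words(text, max_words=200):
    """Split a transcript into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def refine_summary(transcript, refine_fn, max_words=200):
    """Walk through the chunks, letting `refine_fn` update a running summary."""
    summary = ""
    for chunk in chunk_words(transcript, max_words):
        summary = refine_fn(summary, chunk)
    return summary


def toy_refine(summary, chunk):
    # Stand-in for a model call: keep the first three words of each chunk
    head = " ".join(chunk.split()[:3])
    return head if not summary else summary + " | " + head


transcript = "one two three four five six seven"
print(refine_summary(transcript, toy_refine, max_words=4))
```

In a real pipeline, `refine_fn` would wrap a chat or summarisation model that receives the previous summary plus the new chunk.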

In [None]:
# Assignment: voice-to-voice chatbot (STT, TTS)
# Exam week (Virtual Try on)

In [None]:
# Speech-to-speech chatbot
# speech -> STT -> text -> chatbot -> RAG (agent) -> text -> TTS -> speech
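
The chain above can be sketched as a single function that takes the three stages as plain callables, so each one can be swapped independently (for example Whisper for STT and MMS for TTS). Every name below is hypothetical, and the stub stages only imitate the data flow; they do not call any model.

```python
def voice_chat_turn(audio_in, stt_fn, chat_fn, tts_fn):
    """Run one speech -> text -> reply -> speech turn and keep the intermediate text."""
    user_text = stt_fn(audio_in)       # speech -> text
    reply_text = chat_fn(user_text)    # text -> text (chatbot, optionally with RAG)
    audio_out = tts_fn(reply_text)     # text -> speech
    return {"user_text": user_text, "reply_text": reply_text, "audio_out": audio_out}


# Stub stages so the flow can be tested without downloading any model
def fake_stt(audio):
    return "hello"


def fake_chat(text):
    return f"You said: {text}"


def fake_tts(text):
    return text.encode("utf-8")  # pretend these bytes are audio


turn = voice_chat_turn(b"\x00\x01", fake_stt, fake_chat, fake_tts)
print(turn["reply_text"])  # prints "You said: hello"
```

Returning the intermediate text alongside the audio makes it easier to debug which stage produced a bad result.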

## 7) Tips & Colab pointers for the class

- **Use GPU runtime** in Colab for faster inference and generation (Runtime > Change runtime type > GPU).
- **Model caching**: HF models are cached in `~/.cache/huggingface` on first use.
- **Rate limits**: In a large class, students should run the notebook at different times or use small models to avoid HF rate limits.
- **Audio formats**: Many ASR models expect 16 kHz mono WAV input. Use `torchaudio` or `librosa` for resampling.
- **Safety & Licenses**: Check each model's license (e.g., Coqui models may be non-commercial).
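
To check the audio-format tip before sending a file to a model, the standard-library `wave` module can report the sample rate and channel count of a PCM WAV file. This is a minimal sketch; `describe_wav`, `is_asr_ready`, and the demo file name are hypothetical, and compressed formats such as MP3 would need a different reader.

```python
import wave


def describe_wav(path):
    """Return sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        sr = wf.getframerate()
        return {
            "sample_rate": sr,
            "channels": wf.getnchannels(),
            "seconds": wf.getnframes() / sr,
        }


def is_asr_ready(info, target_sr=16000):
    """True when the file is already mono at the target sample rate."""
    return info["sample_rate"] == target_sr and info["channels"] == 1


# Self-contained demo: write 0.5 s of stereo silence at 44.1 kHz, then inspect it
with wave.open("stereo_demo.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(44100)
    wf.writeframes(b"\x00" * (2 * 2 * 22050))  # channels * bytes per sample * frames

info = describe_wav("stereo_demo.wav")
print(info, "ASR-ready:", is_asr_ready(info))
```

A file that fails the check would need downmixing and resampling before being passed to the ASR pipelines above.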

---

# Exercises for students
# 1. Compare Whisper and wav2vec2 on a short Bengali-English mixed audio file.
# 2. Fine-tune a small wav2vec2 model on a tiny Bengali dataset (use the `datasets` library).
# 3. Try zero-shot voice cloning with Coqui XTTS: record a 6-second clip and synthesize a paragraph.

# End of notebook