# **2.0** ‎ Transcribing Service

This notebook serves as a hands-on trial and evaluation environment for various **Speech-to-Text (STT)** services and models. \
The goal is to identify a suitable transcription solution for use within our municipal chatbot pipeline. \
**NOTE**: Due to time constraints, we only explore Whisper for now, a state-of-the-art model for automatic speech recognition (ASR).

In brief, this notebook aims to:
- Compare multiple STT services or models in terms of **accuracy**, **latency**, **speaker handling**, and **language support**

- Identify the trade-offs between **open-source** vs **API-based** STT systems

- Pre-process audio if required (e.g., format conversion, downsampling)

- Optionally, demonstrate a **Text-to-Speech (TTS)** service for response generation

### **2.0.1** ‎ ‎ Install Required Libraries

Many transcription services use ffmpeg under the hood, so ffmpeg must be installed on your system.

In [None]:
!pip install ffmpeg-python --quiet

The `ffmpeg-python` package above is only a Python wrapper; the ffmpeg binary itself must be installed separately. Depending on your OS, run one of the following commands in your terminal:
- **macOS**: `brew install ffmpeg`

- **Ubuntu (Linux)**: `sudo apt install ffmpeg`

- **Windows**: `choco install ffmpeg`, or download it manually from https://www.ffmpeg.org/download.html

You can check whether ffmpeg has been successfully installed by running the cell below:

In [None]:
!ffmpeg -version

ffmpeg version 7.1.1-essentials_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 14.2.0 (Rev1, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --ena
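
The same check can also be done from Python, which is convenient before calling libraries that shell out to ffmpeg. The following is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is our own and not part of any package.

```python
import shutil
import subprocess

def ffmpeg_available(binary="ffmpeg"):
    """Return True if `binary` is on PATH and `binary -version` exits cleanly."""
    path = shutil.which(binary)
    if path is None:
        return False
    try:
        completed = subprocess.run([path, "-version"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0

print("ffmpeg found:", ffmpeg_available())
```

If this prints `False`, install ffmpeg with one of the commands listed earlier and restart the notebook kernel so the updated PATH is picked up.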

### **2.0.2** ‎ ‎ Load or Record Audio

You can choose between two input methods in this notebook:
1. Upload or reference an audio file (WAV/MP3)

2. Record directly from microphone (ipynb or CLI)

If you would like to record audio from your microphone, run the cell below; it saves the recording as a `.wav` file.

In [None]:
# Record from microphone (may not work in hosted notebooks)
import sounddevice as sd
from scipy.io.wavfile import write

def record_audio(filename="../data/assets/input/mic_input_01.wav", duration=5, samplerate=16000):
    """Record mono audio from the default microphone and save it as a 16-bit PCM WAV file."""
    print("Recording...")
    # dtype="int16" makes scipy write 16-bit PCM instead of 32-bit float samples
    recording = sd.rec(int(duration * samplerate), samplerate=samplerate, channels=1, dtype="int16")
    sd.wait()  # Block until the recording has finished
    write(filename, samplerate, recording)
    print("Recording saved as", filename)

# Record for 5 seconds
record_audio(duration=5)

Recording...
Recording saved as ../data/assets/input/mic_input_01.wav


### **2.0.3** ‎ ‎ Audio Preprocessing

Before passing audio files into an STT model, it is important to ensure that they meet the model's expected input format. \
Preprocessing helps improve transcription accuracy and prevents inference errors. \
The following table lists the common preprocessing steps for STT:

| Step         | Purpose                                                     |
|------------------|-----------------------------------------------------------------|
| **Resampling**     | Ensures consistent sample rate (e.g. Whisper expects 16 kHz audio)                      |
| **Mono Conversion**      | STT models expect a single audio channel                                 |
| **Volume Normalisation**   | Prevents clipping and ensures even loudness                     |
| **Silence Trimming** | Reduces irrelevant audio and improves accuracy            |
| **File Format Conversion** | Converts to `.wav` (PCM) or other supported formats     |
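
Whether a given file needs any of these steps can be checked up front. The sketch below uses the standard-library `wave` module to compare a WAV file against the 16 kHz, mono, 16-bit target; the function name `needs_preprocessing` is illustrative, and the check assumes a plain PCM WAV file (the `wave` module raises `wave.Error` for compressed or floating-point WAVs).

```python
import wave

def needs_preprocessing(wav_path, target_rate=16000):
    """Return a list of reasons the PCM WAV file deviates from 16 kHz mono 16-bit."""
    reasons = []
    with wave.open(wav_path, "rb") as wav:
        if wav.getframerate() != target_rate:
            reasons.append(f"sample rate is {wav.getframerate()} Hz, expected {target_rate} Hz")
        if wav.getnchannels() != 1:
            reasons.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            reasons.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return reasons
```

An empty list means the file already matches the target format; otherwise each entry names one property that differs.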

However, since all audio input is currently generated on our side, this step is unlikely to be needed for now. \
If it is needed later, first install these dependencies:

In [None]:
!pip install pydub torchaudio librosa --quiet

Then run the function below on your input file; it applies mono conversion, resampling, volume normalisation, and optional silence trimming, and saves the result as a WAV file.

In [None]:
from pydub import AudioSegment
import torch
import torchaudio

def preprocess_audio(input_path, output_path="../data/assets/input/mic_input_preprocessed.wav", target_sample_rate=16000, trim_silence=True):
    """
    Preprocess an audio file:
    - Converts to mono
    - Resamples to `target_sample_rate` (16 kHz by default)
    - Normalises peak volume
    - Trims silence (optional)
    - Saves as 16-bit PCM WAV
    """
    # Load audio (pydub relies on ffmpeg for formats it cannot read natively)
    audio = AudioSegment.from_file(input_path)

    # Convert to mono, resample, and force 16-bit samples
    audio = audio.set_channels(1)
    audio = audio.set_frame_rate(target_sample_rate)
    audio = audio.set_sample_width(2)

    # Normalise so the loudest peak sits at 0 dBFS
    # (skipped for all-zero audio, where max_dBFS is -inf)
    if audio.max > 0:
        audio = audio.apply_gain(-audio.max_dBFS)

    if trim_silence:
        # Scale 16-bit integer samples to floats in [-1, 1], shaped (channels, frames)
        samples = audio.get_array_of_samples().tolist()
        waveform = torch.tensor(samples, dtype=torch.float32).unsqueeze(0) / 32768.0

        # Vad returns a single tensor and only trims the front of the signal,
        # so apply it, reverse the audio, apply it again, and reverse back
        vad = torchaudio.transforms.Vad(sample_rate=target_sample_rate)
        trimmed = vad(waveform)
        trimmed = torch.flip(vad(torch.flip(trimmed, dims=[1])), dims=[1])
        torchaudio.save(output_path, trimmed, target_sample_rate, encoding="PCM_S", bits_per_sample=16)
    else:
        # Export directly as WAV (already mono, resampled, and 16-bit)
        audio.export(output_path, format="wav")

    print(f"Preprocessed audio saved to: {output_path}")
    return output_path

preprocessed_path = preprocess_audio("../data/assets/input/mic_input_01.wav", output_path="../data/assets/input/mic_input_preprocessed_01.wav")

# **2.1** ‎ Implementing Speech-to-Text (STT)

### Why STT Matters for a Municipal Chatbot in Singapore

For a conversational assistant targeting municipal services in Singapore, STT plays a **critical role in accessibility and multi-modal interaction**:

- **Voice Input Support**: Residents can speak instead of type, which is faster and more natural for many users.

- **Multilingual Use Cases**: STT enables understanding across languages commonly spoken in Singapore (e.g., English, Chinese, Malay, Tamil).

- **On-the-Go Reporting**: Users can report potholes or noise complaints hands-free while on the move.

- **Inclusive Design**: Makes the system more usable for elderly or less tech-savvy citizens.

A robust STT module allows the chatbot to transcribe and understand citizen-reported issues with high accuracy, even in noisy outdoor environments. \
It also lays the groundwork for future **voice-to-voice** interaction, where the system can both listen and respond with speech. \
We will evaluate both open-source and cloud-based STT solutions to find a balance between performance, privacy, and deployment feasibility in a Singaporean context.

### Evaluation Criteria
| Criteria         | Description                                                     |
|------------------|-----------------------------------------------------------------|
| **Accuracy**     | Word Error Rate (WER) or perceived quality                      |
| **Latency**      | Time taken to transcribe input                                  |
| **Robustness**   | Handles different accents, noise, or speeds                     |
| **Local vs Cloud** | Trade-offs in privacy, cost, setup, and ease of use            |
| **Multi-language** | Whether models can handle code-switching or regional terms     |
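
As an illustration of the accuracy criterion, WER can be computed as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version under simplifying assumptions: the function name `word_error_rate` is our own, and text is only lowercased and split on whitespace, so punctuation differences (e.g. `for?` vs `for`) count as errors unless stripped beforehand.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One substitution ("nea" -> "any") and one insertion ("a") over 5 reference words
print(word_error_rate("what is NEA responsible for", "what is any a responsible for"))  # 0.4
```

A WER of 0 means a perfect match; values above 1 are possible when the hypothesis contains many extra words.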



### **2.1.1** ‎ ‎ OpenAI Whisper

**Whisper** is an open-source ASR system developed by OpenAI, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. \
It is designed to be **robust, general-purpose, and highly accurate** across a wide range of accents, background noise levels, and speech patterns.

Below are some reasons why we believe Whisper can be an excellent choice for our solution:
- **Multilingual**: Supports over 90 languages, including English, Chinese, Malay, and Tamil, making it well suited to multilingual environments like Singapore.

- **Punctuation and Formatting**: Automatically adds punctuation and casing for readable output for the chatbot.

- **Robust to Noise**: Performs well in real-world, noisy conditions (e.g., urban street recordings).

- **Language Identification**: Automatically detects the spoken language.

- **Open Source**: Fully available via GitHub (https://github.com/openai/whisper), with multiple model sizes (from `tiny` to `large`).

The model has to balance speed against its ability to handle a wide range of scenarios, so choosing the right size is important.  
Below is a rough comparison of the model variants Whisper provides:

| Variant  | Parameters | Speed  | Accuracy  |
|----------|------------|--------|-----------|
| `tiny`   | ~39 M      | Fast   | Basic     |
| `base`   | ~74 M      | Fast   | Fair      |
| `small`  | ~244 M     | Medium | Good      |
| `medium` | ~769 M     | Slower | Very Good |
| `large`  | ~1550 M    | Slow   | Best      |

We use the `openai-whisper` Python package to load a pre-trained Whisper model.

In [None]:
!pip install openai-whisper --quiet

**Load Whisper Model**

In [None]:
import whisper

# Load tiny model
model = whisper.load_model("tiny")

100%|███████████████████████████████████████| 139M/139M [00:09<00:00, 15.7MiB/s]


**Transcribing with Whisper**

In [None]:
def transcribe_audio(audio_path):
    result = model.transcribe(audio_path)
    return result["text"]

# Example usage
text = transcribe_audio("../data/assets/input/mic_input_preprocessed_01.wav")
print("Transcript:", text)

Transcript:  What is NEA responsible for?
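
To put a number on the latency criterion, the transcription call can be wrapped in a small timer. The sketch below is an illustration rather than part of the pipeline: `timed` and `fake_transcribe` are hypothetical names, and the stand-in transcriber exists only so the cell runs without loading a model.

```python
import time

def timed(transcribe_fn, audio_path):
    """Call transcribe_fn(audio_path) and return (transcript, elapsed_seconds)."""
    start = time.perf_counter()
    transcript = transcribe_fn(audio_path)
    elapsed = time.perf_counter() - start
    return transcript, elapsed

# Stand-in transcriber (assumption for illustration only);
# in this notebook, pass transcribe_audio instead
def fake_transcribe(audio_path):
    return f"(transcript of {audio_path})"

transcript, seconds = timed(fake_transcribe, "example.wav")
print(f"{seconds:.3f} s -> {transcript}")
```

Timing the same clip across model variants gives a like-for-like latency comparison; the first call after loading a model may be slower than later ones, so it is worth timing several runs.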
