# LahStats - MERaLiON Transcription Server

Runs MERaLiON as a persistent Flask API server, exposed through an ngrok tunnel. Your backend calls its `/transcribe` endpoint.

**Setup:**
1. Run all cells
2. When prompted, paste your ngrok authtoken (it won't be saved)
3. Copy the ngrok URL into your backend's `TRANSCRIPTION_API_URL`
4. Keep this notebook running during the demo
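
For reference, a backend call to this server might look like the sketch below. It uses only the standard library and assumes the JSON request shape accepted by the `/transcribe` cell in this notebook (`{"audio": <base64 string>}`). The helper names `build_transcribe_request` and `call_transcribe`, the timeout, and the example URL are illustrative assumptions, not part of the backend.

```python
import base64
import json
import urllib.request


def build_transcribe_request(base_url: str, audio_bytes: bytes) -> urllib.request.Request:
    """Build a POST request carrying base64-encoded audio as JSON."""
    payload = json.dumps({"audio": base64.b64encode(audio_bytes).decode("ascii")})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/transcribe",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_transcribe(base_url: str, audio_bytes: bytes, timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_transcribe_request(base_url, audio_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage (URL and file name are placeholders):
# with open("clip.wav", "rb") as f:
#     result = call_transcribe("https://example.ngrok-free.app", f.read())
# print(result["corrected"], result["word_counts"])
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` on a 4xx or 5xx response, so a real backend would likely wrap the call in a `try`/`except`.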

In [1]:
# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
Memory: 42.5 GB


In [2]:
# Install dependencies
!pip install -q transformers accelerate librosa soundfile flask pyngrok

In [3]:
# Prompt for the ngrok authtoken so it is never stored in the notebook
import os
from getpass import getpass

if not os.environ.get("NGROK_AUTHTOKEN"):
    os.environ["NGROK_AUTHTOKEN"] = getpass("Paste your ngrok authtoken: ").strip()
print("Token set!" if os.environ.get("NGROK_AUTHTOKEN") else "No token entered - re-run this cell!")

Token set!


In [4]:
# Load the MERaLiON-2-3B model (smaller than the 10B variant, fits on most GPUs)
# Use 3B instead of 10B to avoid OOM errors - matches the backend service
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import gc

# Clear CUDA cache before loading
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    gc.collect()

MODEL_NAME = "MERaLiON/MERaLiON-2-3B"  # Changed from 10B to 3B to avoid OOM

print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)

print("Loading model (this may take a few minutes)...")
print(f"Using model: {MODEL_NAME}")

# Check GPU memory
if torch.cuda.is_available():
    total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU Memory: {total_mem:.1f} GB")
    
    # Load with float16 for GPU
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16,
        trust_remote_code=True,
        attn_implementation="eager",
        low_cpu_mem_usage=True,
        device_map="auto",  # Automatic device placement
    )
    print(f"Model loaded on GPU (float16)")
else:
    # CPU fallback
    print("No GPU available, loading for CPU...")
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float32,
        trust_remote_code=True,
        attn_implementation="eager",
    )
    print("Model loaded on CPU")

model.eval()  # Set to evaluation mode

print(f"Model loaded on {next(model.parameters()).device}!")

Loading processor...


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


A new version of the following files was downloaded from https://huggingface.co/MERaLiON/MERaLiON-2-3B:
- processing_meralion2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Loading model (this may take a few minutes)...
Using model: MERaLiON/MERaLiON-2-3B
GPU Memory: 42.5 GB


A new version of the following files was downloaded from https://huggingface.co/MERaLiON/MERaLiON-2-3B:
- configuration_meralion2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`torch_dtype` is deprecated! Use `dtype` instead!


A new version of the following files was downloaded from https://huggingface.co/MERaLiON/MERaLiON-2-3B:
- modeling_meralion2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Model loaded on GPU (float16)
Model loaded on cpu!
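
The last two log lines disagree: the "on GPU (float16)" message is printed unconditionally inside the GPU branch, while the final line reports only the device of the first parameter. A minimal sketch of a more informative check is shown below. `summarize_devices` is a hypothetical helper, not part of this notebook or of `transformers`; it relies only on `.parameters()`, `.device`, and `.numel()`, which PyTorch modules and parameters provide.

```python
from collections import Counter


def summarize_devices(model) -> dict:
    """Count parameter elements per device for any object exposing .parameters()."""
    counts = Counter()
    for param in model.parameters():
        counts[str(param.device)] += param.numel()
    return dict(counts)


# Hypothetical usage after the model-loading cell:
# print(summarize_devices(model))
```

If every element is reported on `cpu`, the model is not on the GPU regardless of what the earlier message says.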


In [5]:
# Transcription function
import numpy as np

def transcribe(audio_data, sample_rate=16000):
    """Transcribe audio using MERaLiON."""
    # Ensure float32 numpy array
    if not isinstance(audio_data, np.ndarray):
        audio_data = np.array(audio_data)
    audio_data = audio_data.astype(np.float32)
    
    # Chat-style prompt for MERaLiON
    prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
    transcribe_prompt = """Transcribe this Singlish speech using romanized text only. 
Do NOT use Chinese characters. 
Write Singlish words in romanized form: walao, shiok, lah, leh, lor, sia, paiseh, sian, etc.
Output format: Speaker labels with romanized transcription."""
    
    conversation = [[{"role": "user", "content": prompt_template.format(query=transcribe_prompt)}]]
    chat_prompt = processor.tokenizer.apply_chat_template(
        conversation=conversation,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Process inputs
    inputs = processor(text=chat_prompt, audios=[audio_data])
    
    # Move to device
    device = next(model.parameters()).device
    dtype = next(model.parameters()).dtype
    
    def move_to_device(v):
        if not hasattr(v, 'to'):
            return v
        v = v.to(device)
        if v.is_floating_point():
            v = v.to(dtype)
        return v
    
    inputs = {k: move_to_device(v) for k, v in inputs.items()}
    
    # Generate
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=256)
    
    # Decode
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    # Clean up: extract only the model's response (after "model\n")
    if "model\n" in transcription:
        transcription = transcription.split("model\n", 1)[-1]
    
    return transcription.strip()

print("Transcription function ready!")

Transcription function ready!
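
The cleanup step at the end of `transcribe` can be exercised in isolation. This sketch assumes the decoded text contains the `model\n` marker that the cell above splits on; `extract_response` is a hypothetical helper name used only for illustration.

```python
def extract_response(decoded: str, marker: str = "model\n") -> str:
    """Return the text after the first marker, or the whole text if it is absent."""
    if marker in decoded:
        decoded = decoded.split(marker, 1)[-1]
    return decoded.strip()


print(extract_response("user\nTranscribe this\nmodel\nWalao, so shiok lah \n"))
print(extract_response("no marker here "))
```

Splitting with `maxsplit=1` keeps everything after the first marker, so a transcription that itself contains the marker text is not truncated.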


In [6]:
# Post-processing functions
import re
from typing import Dict

CORRECTIONS = {
    'while up': 'walao', 'wah lao': 'walao', 'wa lao': 'walao',
    'wah low': 'walao', 'wa low': 'walao', 'while ah': 'walao',
    'wah lau': 'walao', 'wa lau': 'walao',
    'pie say': 'paiseh', 'pai seh': 'paiseh', 'pie seh': 'paiseh',
    'shook': 'shiok', 'she ok': 'shiok',
    'see ya': 'sia', 'see ah': 'sia',
    'see an': 'sian', 'si an': 'sian',
}

# 'leh' is deliberately not remapped: it is a distinct particle counted in TARGET_WORDS
WORD_CORRECTIONS = {
    'la': 'lah', 'low': 'lor', 'loh': 'lor', 'seh': 'sia',
}

TARGET_WORDS = [
    'walao',  # Keep walao (mild)
    'lah', 'lor', 'sia', 'meh', 'leh', 'hor', 'ah',
    'can', 'paiseh', 'shiok', 'sian', 'alamak', 'aiyo', 'bodoh', 'kiasu', 'kiasi', 'bojio',
]

def apply_corrections(text: str) -> str:
    """Normalize common mis-transcriptions of Singlish words."""
    if not text:
        return text
    result = text
    # Phrase corrections, longest first; \b keeps them from firing inside longer words
    for wrong, correct in sorted(CORRECTIONS.items(), key=lambda x: len(x[0]), reverse=True):
        pattern = re.compile(r'\b' + re.escape(wrong) + r'\b', re.IGNORECASE)
        result = pattern.sub(correct, result)
    # Single-word corrections, whole words only
    for wrong, correct in WORD_CORRECTIONS.items():
        pattern = re.compile(r'\b' + re.escape(wrong) + r'\b', re.IGNORECASE)
        result = pattern.sub(correct, result)
    return result

def count_target_words(text: str) -> Dict[str, int]:
    if not text:
        return {}
    normalized = text.lower()
    counts = {}
    for word in TARGET_WORDS:
        pattern = re.compile(r'(?<![a-zA-Z])' + re.escape(word) + r'(?![a-zA-Z])', re.IGNORECASE)
        matches = pattern.findall(normalized)
        if matches:
            counts[word] = len(matches)
    return counts

print("Post-processing functions ready!")

Post-processing functions ready!
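
Phrase-level corrections such as `'see an'` → `'sian'` behave differently depending on whether the pattern is anchored at word boundaries. The sketch below compares plain substring matching with `\b`-anchored matching on a made-up sentence. `replace_phrase` is a hypothetical helper written for this comparison only.

```python
import re


def replace_phrase(text: str, wrong: str, correct: str, whole_words: bool) -> str:
    """Replace `wrong` with `correct`, optionally only at word boundaries."""
    body = re.escape(wrong)
    pattern = rf"\b{body}\b" if whole_words else body
    return re.sub(pattern, correct, text, flags=re.IGNORECASE)


sample = "I see another one, see an lah"
loose = replace_phrase(sample, "see an", "sian", whole_words=False)
strict = replace_phrase(sample, "see an", "sian", whole_words=True)
print(loose)   # I sianother one, sian lah
print(strict)  # I see another one, sian lah
```

Without the boundary anchors, the correction also rewrites the start of "see another", which would inflate the `sian` count.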


In [None]:
# 🔥 MANUALLY KILL NGROK SESSIONS (Run this if ngrok errors persist)
# This cell helps you kill ngrok sessions that pyngrok.kill() can't reach

import subprocess
import os

print("=" * 60)
print("KILLING ALL NGROK SESSIONS")
print("=" * 60)

# Method 1: Kill via pyngrok
try:
    from pyngrok import ngrok
    ngrok.kill()
    print("✅ Killed via pyngrok")
except Exception as e:
    print(f"❌ pyngrok.kill() failed: {e}")

# Method 2: Kill processes (Linux/Mac/Colab)
try:
    result = subprocess.run(['pkill', '-9', '-f', 'ngrok'], 
                           capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print("✅ Killed ngrok processes via pkill")
    else:
        print("ℹ️  No ngrok processes found")
except Exception as e:
    print(f"ℹ️  pkill failed (expected on Windows, where pkill is unavailable): {e}")

# Method 3: Try psutil if available
try:
    import psutil
    killed = 0
    for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
        try:
            cmdline = ' '.join(proc.info['cmdline'] or [])
            if 'ngrok' in cmdline.lower():
                proc.kill()
                killed += 1
                print(f"✅ Killed process PID {proc.info['pid']}")
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass
    if killed == 0:
        print("ℹ️  No ngrok processes found via psutil")
except ImportError:
    print("ℹ️  psutil not installed (optional)")
except Exception as e:
    print(f"ℹ️  psutil failed: {e}")

print("\n" + "=" * 60)
print("⚠️  IF ERRORS STILL PERSIST:")
print("=" * 60)
print("1. Visit: https://dashboard.ngrok.com/agents")
print("2. Click 'Stop' on ALL active sessions")
print("3. Wait 10 seconds")
print("4. Then re-run the ngrok cell below")
print("=" * 60)

In [7]:
# Start Flask API server
from flask import Flask, request, jsonify
import librosa
import io
import base64
import threading

app = Flask(__name__)

@app.route('/health', methods=['GET'])
def health():
    return jsonify({"status": "ok", "model": "MERaLiON-2-3B"})

@app.route('/transcribe', methods=['POST'])
def transcribe_endpoint():
    try:
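        # Accept audio as base64 JSON ({"audio": "..."}) or as a multipart file upload
        if request.is_json:
            data = request.get_json()
            audio_b64 = data.get('audio')
            if not audio_b64:
                return jsonify({"error": "missing 'audio' field"}), 400
            audio_bytes = base64.b64decode(audio_b64)
        else:
            audio_file = request.files.get('audio')
            if audio_file is None:
                return jsonify({"error": "missing 'audio' file"}), 400
            audio_bytes = audio_file.read()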
        
        # Load audio
        audio_data, sr = librosa.load(io.BytesIO(audio_bytes), sr=16000)
        
        # Transcribe
        raw_text = transcribe(audio_data)
        
        # Post-process
        corrected = apply_corrections(raw_text)
        counts = count_target_words(corrected)
        
        return jsonify({
            "raw_transcription": raw_text,
            "corrected": corrected,
            "word_counts": counts,
            "total_singlish_words": sum(counts.values())
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500

# Run Flask in background thread
# Try different ports if 5000 is in use
import socket

def find_free_port(start_port=5000):
    for port in range(start_port, start_port + 10):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            if s.connect_ex(('localhost', port)) != 0:
                return port
    return None

PORT = find_free_port(5000)
if PORT is None:
    raise RuntimeError("Could not find a free port")

print(f"Starting Flask server on port {PORT}...")
threading.Thread(target=lambda: app.run(host='0.0.0.0', port=PORT, use_reloader=False)).start()
print(f"Flask server started on port {PORT}!")

Starting Flask server on port 5000...
Flask server started!


 * Serving Flask app '__main__'
 * Debug mode: off
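
Because `app.run` starts in a background thread, the "started" message can print before the server is actually accepting connections. A minimal sketch of a readiness check is shown below. `wait_for_port` is a hypothetical helper, and the host, timeout, and polling interval are assumptions.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 10.0) -> bool:
    """Poll until something accepts TCP connections on host:port, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.5)
            # connect_ex returns 0 once a listener accepts the connection
            if s.connect_ex((host, port)) == 0:
                return True
        time.sleep(0.2)
    return False


# Hypothetical usage after starting the Flask thread:
# if not wait_for_port(PORT):
#     raise RuntimeError(f"Flask did not start listening on port {PORT}")
```

Running a check like this before opening the ngrok tunnel avoids exposing a port that nothing is serving yet.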


In [10]:
# Expose server via ngrok
import os
from pyngrok import ngrok

# Kill any existing ngrok tunnels first (free tier allows only 1 session)
import subprocess
import time

# Method 1: Use pyngrok kill
try:
    ngrok.kill()
    print("Killed existing ngrok sessions (pyngrok)")
    time.sleep(1)  # Wait a bit
except Exception as e:
    print(f"pyngrok.kill() failed: {e}")

# Method 2: Kill ngrok processes directly
try:
    # Find and kill all ngrok processes
    result = subprocess.run(['pkill', '-f', 'ngrok'], capture_output=True, text=True)
    if result.returncode == 0:
        print("Killed ngrok processes via pkill")
    time.sleep(1)
except Exception as e:
    print(f"pkill failed (expected on Windows, where pkill is unavailable): {e}")
    # Try Windows/alternative method
    try:
        import psutil
        for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
            try:
                if 'ngrok' in ' '.join(proc.info['cmdline'] or []).lower():
                    proc.kill()
                    print(f"Killed ngrok process PID {proc.info['pid']}")
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
        time.sleep(1)
    except ImportError:
        print("psutil not available, skipping process kill")
    except Exception as e:
        print(f"Process kill failed: {e}")

print("\n⚠️  If errors persist, manually kill ngrok:")
print("  1. Visit: https://dashboard.ngrok.com/agents")
print("  2. Stop all active sessions")
print("  3. Or run: ngrok kill (if ngrok CLI installed)")
print("  4. Or run: pkill -f ngrok (Linux/Mac)")
print("  5. Or run: taskkill /F /IM ngrok.exe (Windows)")

# Set authtoken from environment or paste directly
NGROK_TOKEN = os.environ.get("NGROK_AUTHTOKEN", "YOUR_TOKEN_HERE")
ngrok.set_auth_token(NGROK_TOKEN)

# Start tunnel (use PORT variable from previous cell, or default to 5000)
# Check if PORT is defined, otherwise use default or try to detect Flask port
import socket

if 'PORT' not in globals():
    # Try to find which port Flask is running on
    flask_port = None
    for port in range(5000, 5010):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            if s.connect_ex(('localhost', port)) == 0:
                flask_port = port
                break
    
    if flask_port:
        PORT = flask_port
        print(f"Detected Flask running on port {PORT}")
    else:
        PORT = 5000
        print(f"PORT not defined, using default port {PORT}")
        print("⚠️  Make sure Flask server cell was run first!")
else:
    print(f"Using PORT={PORT} from Flask server cell")

try:
    public_url = ngrok.connect(PORT).public_url
except Exception as e:
    error_msg = str(e)
    print(f"\n{'='*60}")
    print(f"❌ NGROK CONNECTION FAILED")
    print(f"{'='*60}")
    print(f"Error: {error_msg}")
    print(f"Tried to connect to port {PORT}")
    print(f"\n{'='*60}")
    print("🔧 TO FIX THIS:")
    print(f"{'='*60}")
    print("1. Run the cell ABOVE this one (the kill helper cell)")
    print("2. OR manually visit: https://dashboard.ngrok.com/agents")
    print("3. Stop ALL active ngrok sessions")
    print("4. Wait 10 seconds")
    print("5. Re-run THIS cell")
    print(f"{'='*60}")
    print("\n💡 The free ngrok tier only allows 1 session at a time.")
    print("   You must kill the existing session before starting a new one.")
    print(f"{'='*60}\n")
    
    # Don't raise - let user fix it and retry
    public_url = None

if public_url:
    print(f"\n{'='*60}")
    print(f"✅ TRANSCRIPTION API READY!")
    print(f"{'='*60}")
    print(f"\nPublic URL: {public_url}")
    print(f"\nSet in your backend .env:")
    print(f"TRANSCRIPTION_API_URL={public_url}")
    print(f"\nEndpoints:")
    print(f"  GET  {public_url}/health")
    print(f"  POST {public_url}/transcribe")
else:
    print("\n❌ ngrok tunnel not created. Fix the error above and re-run this cell.")
print(f"\nKeep this notebook running!")
print(f"{'='*60}")



Killed existing ngrok sessions
Detected Flask running on port 5000


ERROR:pyngrok.process.ngrok:t=2026-01-17T18:44:51+0000 lvl=eror msg="failed to reconnect session" obj=tunnels.session err="authentication failed: Your account is limited to 1 simultaneous ngrok agent sessions.\nRun multiple endpoints at the same time from a single agent by defining them in your agent configuration file and running `ngrok start --all`.\nRead more about the agent configuration file: https://ngrok.com/docs/agent/config/ \nYou can view your current agent sessions in the dashboard: https://dashboard.ngrok.com/agents. Upgrade to a paid plan to remove this limit:\nhttps://dashboard.ngrok.com/billing/choose-a-plan\r\n\r\nERR_NGROK_108\r\n"
ERROR:pyngrok.process.ngrok:t=2026-01-17T18:44:51+0000 lvl=eror msg="session closing" obj=tunnels.session err="authentication failed: Your account is limited to 1 simultaneous ngrok agent sessions.\nRun multiple endpoints at the same time from a single agent by defining them in your agent configuration file and running `ngrok start --all`.\n

Error connecting ngrok: The ngrok process errored on start: authentication failed: Your account is limited to 1 simultaneous ngrok agent sessions.\nRun multiple endpoints at the same time from a single agent by defining them in your agent configuration file and running `ngrok start --all`.\nRead more about the agent configuration file: https://ngrok.com/docs/agent/config/ \nYou can view your current agent sessions in the dashboard: https://dashboard.ngrok.com/agents. Upgrade to a paid plan to remove this limit:\nhttps://dashboard.ngrok.com/billing/choose-a-plan\r\n\r\nERR_NGROK_108\r\n.
Tried to connect to port 5000
Make sure you've killed any existing ngrok sessions
Check: https://dashboard.ngrok.com/agents



In [None]:
  !nvidia-smi