# **3.0** ‎ Multilingual Preprocessing 

This notebook designs and evaluates a **lightweight multilingual processing pipeline**. \
The goal is to let the municipal chatbot handle input in the languages commonly used in Singapore (English, Chinese, Malay, and Tamil). \
We explore two approaches in this notebook:

1. **Language Detection + Translation to English**

2. **Multilingual LLMs that natively understand input in different languages**

Based on our tests and findings, we aim to identify a solution that offers:
- The ability to understand inputs in English, Mandarin, Malay, and Tamil and respond appropriately regardless of input language

- The ability to handle code-switched or colloquial speech (e.g., Singlish)

- A balance between **accuracy**, **speed**, and **resource efficiency** that makes it suitable for real-time use

# **3.1** ‎ Language Detection + Translation

This approach involves two stages:
1. **Detect the language** of the incoming text using a lightweight model.

2. **Translate** the input into English using an automatic translation model.

Once translated, the query can be routed to an English-only intent classifier or chatbot logic.
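The two stages and the routing step can be sketched independently of any particular library. The snippet below is a minimal illustration, not the notebook's final implementation: `route_query`, `RoutedQuery`, `fake_detect`, and `fake_translate` are hypothetical names, and the detector and translator are passed in as plain callables so that real models can be swapped in later.

```python
from typing import Callable, NamedTuple

class RoutedQuery(NamedTuple):
    english_text: str     # text handed to the English-only downstream logic
    source_lang: str      # language code reported by the detector
    was_translated: bool  # True only if the translator was actually called

def route_query(
    text: str,
    detect: Callable[[str], str],
    translate: Callable[[str, str], str],
    supported=frozenset({"en", "zh", "ms", "ta"}),
) -> RoutedQuery:
    """Stage 1: detect the language. Stage 2: translate to English if needed."""
    lang = detect(text)
    if lang == "en" or lang not in supported:
        # English needs no translation; unsupported languages pass through unchanged
        return RoutedQuery(text, lang, False)
    return RoutedQuery(translate(text, lang), lang, True)

# Stand-in callables (assumptions for illustration only)
def fake_detect(text: str) -> str:
    return "ms" if "saya" in text else "en"

def fake_translate(text: str, lang: str) -> str:
    return f"[{lang}->en] {text}"

print(route_query("Di mana saya boleh laporkan?", fake_detect, fake_translate))
```

Keeping the stages as injected callables makes it easy to compare different detectors or translators without touching the routing logic.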

### **3.1a.1** ‎ *Detection* – langid.py

[`langid`](https://github.com/saffsd/langid.py) is a lightweight, offline language identification library that comes pre-trained on 97 languages. \
It uses a Naive Bayes classifier over character n-grams, making it compact, fast, and suitable for embedded or low-resource environments.
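To make the idea of Naive Bayes over character n-grams concrete, here is a toy sketch using only the standard library. It is an illustration of the general technique under simplifying assumptions (character bigrams, add-one smoothing, uniform priors, three tiny training sentences) and is not `langid`'s actual implementation or feature set. `TinyNgramNB` and `toy_samples` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class TinyNgramNB:
    """Toy multinomial Naive Bayes over character n-grams (illustration only)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, samples):
        for lang, text in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.vocab.update(grams)
        return self

    def classify(self, text):
        grams = char_ngrams(text, self.n)
        scores = {}
        for lang, counter in self.counts.items():
            # Add-one smoothing so unseen n-grams do not zero out a language
            denom = sum(counter.values()) + len(self.vocab)
            scores[lang] = sum(math.log((counter[g] + 1) / denom) for g in grams)
        best = max(scores, key=scores.get)
        return best, scores[best]  # label and its log-likelihood score

toy_samples = [
    ("en", "where do i report illegal dumping"),
    ("ms", "di mana saya boleh laporkan kebersihan kawasan awam"),
    ("zh", "这个垃圾桶已经满了"),
]
toy_model = TinyNgramNB().fit(toy_samples)
print(toy_model.classify("report dumping")[0])
print(toy_model.classify("垃圾桶满了")[0])
```

A real identifier is trained on far more data, but the scoring step has the same shape: sum the log-likelihoods of the input's n-grams under each language and pick the highest.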

Why `langid` Might Be a Good Choice:
- **Offline and Self-Contained**: Requires no internet connection or external models.
- **Fast Inference**: Performs language detection in milliseconds, ideal for real-time applications.
- **Small Footprint**: No large dependencies; easy to integrate into lightweight deployments.
- **Good Accuracy on Short Texts**: Designed to handle short snippets like search queries and tweets—similar in length to typical chatbot inputs.
- **Stable and Mature**: Despite its simplicity, `langid` has been widely used in production systems.
- **Language Support**: Covers all four of Singapore's main languages (`en`, `zh`, `ms`, `ta`).

Noticeable Limitations of `langid`:
- Returns a single label per input, so **code-switched** text (e.g., mixed English-Chinese or Singlish) is forced into one language.
- Limited support for **regional variations or dialects** (e.g., Tamil variants used in Singapore).
- Trained on formal text corpora—accuracy may drop for informal or slang-heavy messages.
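One inexpensive way to flag some code-switched inputs before trusting a single-label detector is to check how many writing systems appear in the text. The sketch below uses only the standard library. `script_counts`, `looks_code_switched`, and the `min_share` threshold of 0.1 are assumptions for illustration, not tuned values. It cannot catch mixing between languages that share a script (for example Malay and English, or Singlish), so it complements rather than replaces a language detector.

```python
import unicodedata
from collections import Counter

# Unicode character-name prefixes mapped to the scripts relevant to this notebook
SCRIPT_PREFIXES = {"CJK": "Han", "TAMIL": "Tamil", "LATIN": "Latin"}

def script_counts(text):
    """Count characters per script, ignoring spaces, digits, and punctuation."""
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[script] += 1
                break
    return counts

def looks_code_switched(text, min_share=0.1):
    """Return True if at least two scripts each make up min_share of the letters."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False
    significant = [s for s, c in counts.items() if c / total >= min_share]
    return len(significant) >= 2

print(looks_code_switched("这个 rubbish bin 满了 already"))
print(looks_code_switched("Where do I report illegal dumping?"))
```

Inputs flagged this way could be logged for review or sent down a different path than the detect-then-translate route.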

In [None]:
!pip install langid --quiet

In [None]:
import langid
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Load NLLB distilled model
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

translator_pipeline = pipeline("translation", model=model, tokenizer=tokenizer)

Device set to use cpu


### **3.1b.1** ‎ *Translation* – Meta AI's (Facebook) NLLB

**NLLB** (No Language Left Behind) supports high-quality machine translation across 200+ languages, including many underrepresented and low-resource languages. \
The NLLB-200 Distilled 600M model is a compressed, faster version optimised for inference speed while retaining strong translation quality.

**Why NLLB-200 Distilled 600M Is a Strong Candidate**

- **Supports 200+ Languages**: Includes Singapore's major languages—English, Chinese (`zho_Hans`), Malay (`zsm_Latn`), and Tamil (`tam_Taml`).
- **High Translation Quality**: Benchmarked against FLORES-200, with competitive performance in many low-resource languages.
- **Distilled for Speed**: The 600M version balances translation quality with faster inference and lower memory usage.
- **Openly Available**: Published on Hugging Face (`facebook/nllb-200-distilled-600M`) under the CC-BY-NC 4.0 license, which restricts use to non-commercial purposes.

However, there are some considerations and drawbacks to take note of:
- Requires the correct source and target language codes to be specified for every request
- May struggle with informal phrases or code-switching
- Depends on a separate language detection step to determine the source language before it can translate


**Install Dependencies**

In [None]:
!pip install transformers --quiet

**Define Supported Language Mapping**

As noted above, NLLB needs explicit language codes, so we manually map the ISO 639-1 codes returned by `langid` to NLLB's codes. For now, we support:

- English ➠ `eng_Latn` (no translation needed)
- Chinese ➠ `zho_Hans`
- Malay ➠ `zsm_Latn`
- Tamil ➠ `tam_Taml`

Any other languages will be marked as unsupported for now.

In [13]:
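# Supported languages: langid's ISO 639-1 codes mapped to NLLB language codes
lang_map = {
    "en": "eng_Latn",
    "zh": "zho_Hans",
    "ms": "zsm_Latn",  # Standard Malay; NLLB has no "msa_Latn" code
    "ta": "tam_Taml",
}

def detect_language(text):
    """Return (ISO 639-1 code, score) for the most likely language."""
    # By default langid returns an unnormalised log-probability score,
    # not a probability between 0 and 1
    lang, score = langid.classify(text)
    return lang, score

def translate_to_english(text):
    """Return (English text, language code, was_translated)."""
    lang, _score = detect_language(text)

    if lang == "en":
        return text, "eng_Latn", False  # No translation needed

    if lang not in lang_map:
        return text, lang, False  # Unsupported language: pass through unchanged

    src_lang = lang_map[lang]
    translation = translator_pipeline(text, src_lang=src_lang, tgt_lang="eng_Latn")
    return translation[0]["translation_text"], src_lang, True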

**Test the Translation Pipeline**

In [None]:
test_queries = [
    "这个垃圾桶已经满了",  # Chinese
    "Di mana saya boleh laporkan kebersihan kawasan awam?",  # Malay
    "நாங்கள் எங்கு புகார் அளிக்கலாம்?",  # Tamil
    "Where do I report illegal dumping?"  # English
]

for q in test_queries:
    translated, src_lang, was_translated = translate_to_english(q)
    print("Original:", q)
    print("Detected Language:", src_lang)
    print("Was Translated:", was_translated)
    print("Final Query (English):", translated)
    print("-" * 60)
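Since speed is one of the stated objectives, it helps to time each query as well as inspect its output. The helper below is a minimal sketch using only the standard library. `time_calls` and `fake_translate` are hypothetical names, and the stub stands in for `translate_to_english` so the snippet runs without loading the model.

```python
import statistics
import time

def time_calls(fn, inputs, repeats=3):
    """Return the median latency in milliseconds of fn(item) for each input."""
    results = {}
    for item in inputs:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(item)
            samples.append((time.perf_counter() - start) * 1000)
        results[item] = statistics.median(samples)
    return results

# Stand-in for the real pipeline (assumption for illustration only)
def fake_translate(text):
    return text.upper()

for query, ms in time_calls(fake_translate, ["Di mana?", "Where?"]).items():
    print(f"{query!r}: {ms:.3f} ms")
```

In the notebook, calling `time_calls(translate_to_english, test_queries)` would give a rough per-query latency for the detect-then-translate path. The first call may be slower than later ones because of model warm-up, which is why the median is reported.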

# **3.2** ‎ Multilingual LLMs

This approach uses LLMs trained to understand and reason across multiple languages natively. \
These models can process inputs in Mandarin, Malay, Tamil, and more directly, without a translation step.
This section helps us evaluate:
- Whether LLMs already understand these languages well enough
- Whether translation is necessary at all

### **3.2.1** ‎ DeepSeek-R1 (7B, via Ollama)

In [None]:
from langchain_ollama.llms import OllamaLLM

llm = OllamaLLM(model="deepseek-r1:7b")

def query_llm_directly(prompt):
    return llm.invoke(prompt).strip()

for q in test_queries:
    prompt = f"You are a helpful municipal assistant. The user said:\n\n{q}\n\nRespond helpfully."
    print(f"Input: {q}")
    print("LLM Response:")
    print(query_llm_directly(prompt))
    print("=" * 80)
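Reasoning models such as DeepSeek-R1 may include their intermediate reasoning in the returned text, wrapped in `<think>...</think>` tags. Whether this appears depends on the model and the Ollama version and settings, so treat it as an assumption to verify against the actual output. If it does appear, a small post-processing step keeps only the final answer. `strip_reasoning` is a hypothetical helper name.

```python
import re

# Matches one reasoning block, including newlines inside it
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_reasoning(raw):
    """Remove any <think>...</think> blocks and surrounding whitespace."""
    return _THINK_BLOCK.sub("", raw).strip()

sample = "<think>\nThe user is asking in Malay.\n</think>\n\nAnda boleh membuat laporan."
print(strip_reasoning(sample))
```

Text without such tags passes through unchanged apart from whitespace trimming, so the helper is safe to apply to every response.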

### **3.2.2** ‎ OpenAI GPT-4o

In [None]:
!pip install openai --quiet

In [None]:
import os
from openai import OpenAI

# Read the key from the environment instead of hard-coding it in the notebook
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def gpt4_response(query):
    """Send the query to GPT-4o and return the reply text."""
    response = client.responses.create(
        model="gpt-4o",
        instructions="You are a helpful municipal assistant that answers users' queries clearly and concisely.",
        input=query,
    )
    return response.output_text.strip()

for q in test_queries:
    print(f"Input: {q}")
    print("LLM Response:")
    print(gpt4_response(q))
    print("=" * 80)