# Agentic LLM-as-a-Judge for English-to-Filipino Translations
Enrique Lejano & Monica Manlises | CSC420M G01

## Install Dependencies

In [4]:
%pip install -U langchain-google-genai python-dotenv langchain-core "langchain-chroma>=0.1.2" google-genai chromadb langchain-huggingface scikit-learn --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Import Libraries and Setup

In [5]:
from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, FewShotChatMessagePromptTemplate, AIMessagePromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.rate_limiters import InMemoryRateLimiter    
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from google import genai
import chromadb
import pandas as pd
import numpy as np
import re


load_dotenv()

True

### Rate Limiter
Throttle requests to stay within the Gemini free-tier rate limits.

In [2]:
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.25,  # 1 request every 4 seconds
    check_every_n_seconds=0.1,
    max_bucket_size=1.0        # No piling up of surplus tokens
)

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter
)
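
As a rough check of what this limiter implies for the experiments below, the following sketch estimates the minimum wall-clock time for a batch of calls. It assumes a simplified model in which each call consumes one token and calls are spaced evenly; `estimate_min_seconds` is a hypothetical helper for illustration, not part of LangChain.

```python
def estimate_min_seconds(n_requests: int, requests_per_second: float) -> float:
    """Lower bound on wall-clock time when the limiter spaces requests evenly."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return n_requests / requests_per_second


# The test set loaded later in this notebook has 57 rows.
print(estimate_min_seconds(57, 0.25))  # 228.0 seconds, just under four minutes
```

Model latency and retries add to this figure, so it is best read as a floor rather than a prediction.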

## Zero-shot Prompting

In [None]:
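# Keep the rate limiter attached so this cell does not override the throttled client above.
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter,
)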

In [None]:
template = """
You are a professional translation evaluator. Your job is to judge the quality of a Filipino translation of an English sentence.

Assess the translation using the following criteria:
- **Adequacy (0–100)**: How well does it preserve the original meaning?
- **Fluency (0–100)**: How natural and grammatically correct is the Filipino?
- **Lexical Choice (0–100)**: Are the word choices appropriate and accurate?

Then:
- Provide a final verdict: Good or Needs Improvement
- Give a short explanation of why.

English Sentence:
{english}

Filipino Translation:
{filipino}

Please format your response like this:

Verdict: ...
Adequacy: ...
Fluency: ...
Lexical Choice: ...
Explanation:
...
"""

prompt = PromptTemplate(
    input_variables=["english", "filipino"],
    template=template
)

In [7]:
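def evaluate_translation(english: str, filipino: str) -> str:
    """Fill the zero-shot template and return the model's raw text verdict."""
    formatted_prompt = prompt.format(english=english, filipino=filipino)
    # Use .invoke(); calling the model object directly is deprecated in LangChain.
    response = llm.invoke([HumanMessage(content=formatted_prompt)])
    return response.content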

In [None]:
eng = "Please ensure the system is compliant with the new policies."
flawed_tl = "Siguraduhin ang sistema ay sumunod sa bagong patakaran."

result = evaluate_translation(eng, flawed_tl)
print(result)



Verdict: Needs Improvement
Adequacy: 85
Fluency: 75
Lexical Choice: 80

Explanation:
While the translation captures the core meaning, it could be improved in terms of fluency and lexical choice.

*   **Adequacy:** The translation conveys the essential meaning of ensuring compliance.
*   **Fluency:** The sentence structure is slightly awkward. A more natural phrasing might be "Pakisiguro na ang sistema ay sumusunod sa mga bagong patakaran." The use of "ay" after "sistema" is grammatically correct but can sound less fluent in modern Filipino.
*   **Lexical Choice:** "Patakaran" is generally correct for "policy," but "mga bagong patakaran" (plural) would be more accurate since "policies" is plural in the original English sentence. Also, "sumunod" is a good choice for "compliant," but "tumalima" could also be considered for a more formal tone.
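
Because the prompt asks for a fixed response layout, the scores can be pulled out of the raw text for later analysis. The sketch below is one possible parser, assuming the model follows the requested `Field: value` layout at the start of a line; `parse_zero_shot` and `FIELD_PATTERN` are hypothetical names introduced here, not part of the original notebook.

```python
import re

# Match the four header fields only when they start a line,
# so the bulleted "**Adequacy:**" lines in the explanation are ignored.
FIELD_PATTERN = re.compile(
    r"^(Verdict|Adequacy|Fluency|Lexical Choice):[ \t]*(.+)$", re.MULTILINE
)


def parse_zero_shot(text: str) -> dict:
    """Return the verdict and numeric scores found in a zero-shot response."""
    parsed = {}
    for name, value in FIELD_PATTERN.findall(text):
        value = value.strip()
        # Keep the first occurrence of each field; convert plain digits to int.
        parsed.setdefault(name, int(value) if value.isdigit() else value)
    return parsed


sample = """Verdict: Needs Improvement
Adequacy: 85
Fluency: 75
Lexical Choice: 80

Explanation:
*   **Adequacy:** The translation conveys the essential meaning."""

print(parse_zero_shot(sample))
```

Fields the model omits are simply absent from the returned dictionary, so callers should use `.get()` rather than direct indexing.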


## Load Dataset and Data Preprocessing

In [6]:
training_set = "../datasets/training.csv"
test_set = "../datasets/test.csv"

### Training Set

In [7]:
training_df = pd.read_csv(training_set).dropna(how='all')
training_df.drop(columns=["Contributor"], inplace=True)
training_df.head()

Unnamed: 0,English,Filipino-Correct,Filipino-Flawed,Remarks
0,The Philippines is an archipelago made up of o...,"Ang Pilipinas ay isang kapulaang binubuo ng 7,...",Ang Pilipinas ay isang puno na binubuo ng mahi...,
1,Philippines is the world's second-largest arch...,Ang Pilipinas ang pangalawa sa pinakamalaking ...,Ang Pilipinas ay ang pangalawang malaking isla...,
2,Filipino and English are the two official lang...,Filipino at Ingles ang dalawang opisyal na lin...,Tagalog at Ingles ang dalawa opisyal lingwahe ...,
3,Tagalog is the most widely spoken native langu...,Tagalog ang pinakamalawak at ginagamit na katu...,Tagalog ay ang pinaka malawak sinasabi katutub...,
4,The Philippines was a Spanish colony for over ...,Ang Pilipinas ay naging isang kolonya ng Espan...,Pilipinas naging Espanya Kolonya sa higit 300 ...,


In [8]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 561 entries, 0 to 562
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   English           561 non-null    object
 1   Filipino-Correct  561 non-null    object
 2   Filipino-Flawed   561 non-null    object
 3   Remarks           332 non-null    object
dtypes: object(4)
memory usage: 21.9+ KB


### Test Set

In [9]:
test_df = pd.read_csv(test_set).dropna(how='all')
test_df.drop(columns=["Contributor"], inplace=True)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57 entries, 0 to 63
Data columns (total 5 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Source Text (English)                                           57 non-null     object 
 1   Target Text (Filipino)                                          57 non-null     object 
 2   Final Score                          (1 - lowest, 5 - highest)  57 non-null     float64
 3   Rater 1 Explanation                                             57 non-null     object 
 4   Rater 2 Explanation                                             54 non-null     object 
dtypes: float64(1), object(4)
memory usage: 2.7+ KB


In [7]:
test_df.rename(columns={
    "Source Text (English)": "English",
    "Target Text (Filipino)": "Filipino",
    "Final Score                          (1 - lowest, 5 - highest)": "Rating",
    "Rater 1 Explanation": "Remarks 1",
    "Rater 2 Explanation": "Remarks 2"
}, inplace=True)

test_df.head()

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",


## Few-shot Prompting

In [10]:
examples = training_df.sample(3, random_state=38).to_dict(orient="records")

# Manually assign a rating to each sampled example so it can serve as a few-shot demonstration.
examples[0]['Rating'] = 4
examples[1]['Rating'] = 5
examples[2]['Rating'] = 4

examples

[{'English': 'Dinagyang culminates in a competition of tribes dancing with drums.',
  'Filipino-Correct': 'Nagtatapos ang Dinagyang sa paligsahan ng mga tribo na sumasayaw kasabay ng mga tambol.',
  'Filipino-Flawed': '… sa isang patimpalak ng mga tribo na kumakanta ng kundiman.',
  'Remarks': 'Replaces drum dance with singing—a meaning swap.',
  'Rating': 4},
 {'English': 'James is eating strawberries in Baguio.',
  'Filipino-Correct': 'Kumakain si James ng strawberries sa Baguio.',
  'Filipino-Flawed': 'Si James ay kumakain ng strawberries sa lugar ng Baguio.',
  'Remarks': 'ay inversion, unnecessary addition of "sa lugar ng"',
  'Rating': 5},
 {'English': "Big O notation describes the upper bound of an algorithm's growth.",
  'Filipino-Correct': 'Inilalarawan ng Big O notation ang pinakamataas na paglaki ng isang algoritmo.',
  'Filipino-Flawed': 'Big O ay laki ng algo.',
  'Remarks': nan,
  'Rating': 4}]
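
The third sampled example has `Remarks: nan`, which would be rendered into the prompt as the literal text "nan". A minimal sketch of one way to guard against this is shown below; `clean_remarks` and its placeholder string are assumptions made for illustration and are not part of the original pipeline.

```python
import math


def clean_remarks(remarks, placeholder: str = "No remarks provided.") -> str:
    """Return a prompt-safe remarks string, replacing missing values."""
    if remarks is None:
        return placeholder
    # pandas represents missing text cells as a float NaN.
    if isinstance(remarks, float) and math.isnan(remarks):
        return placeholder
    return str(remarks).strip() or placeholder


print(clean_remarks(float("nan")))
print(clean_remarks('ay inversion, unnecessary addition of "sa lugar ng"'))
```

The helper could be applied to `ex["Remarks"]` when the few-shot example dictionaries are built.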

In [17]:
system_message = """
You are a professional translation evaluator. Your job is to judge the quality of a Filipino translation of an English sentence.
Assess the translation with the following criteria in mind:
- Adequacy: How well does it preserve the original meaning?
- Fluency: How natural and grammatically correct is the Filipino?
- Lexical Choice: Are the word choices appropriate and accurate?

Then: 
- Rate the translation on a scale of 1 to 5, combining all three criteria, with 1 being the worst and 5 being the best.
- Provide a short explanation of why you gave that rating.

Please format your response like this:
English Sentence: ...
Filipino Translation: ...
Rating: ...
Remarks: ...
"""

system_prompt = SystemMessagePromptTemplate.from_template(system_message)

human_prompt = HumanMessagePromptTemplate.from_template(
    "English Sentence: {english}\nFilipino Translation: {filipino}"
)

ai_prompt = AIMessagePromptTemplate.from_template(
    "Rating: {rating}\nRemarks: {remarks}\n"
)

In [18]:
example_prompt = ChatPromptTemplate.from_messages([
    human_prompt,
    ai_prompt,
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=[{
        "english": ex["English"],
        "filipino": ex["Filipino-Correct"],
        "rating": ex["Rating"],
        "remarks": ex["Remarks"],
    } for ex in examples],
    example_prompt=example_prompt,
)

final_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    few_shot_prompt,
    human_prompt
])

### Single Input-Output Prompt

In [54]:
messages = final_prompt.format_messages(
    english="With what would you use a \"wah-wah pedal?\"",
    filipino="Ano ang gagamitin mo ng \"wah-wah pedal?\"",
)

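# Reuse the rate limiter defined earlier so batch calls stay within the free-tier quota.
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter,
)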

response = llm.invoke(messages)
print(response.content)

Rating: 3
Remarks: The translation is grammatically awkward. A more natural phrasing would be "Para saan mo gagamitin ang 'wah-wah pedal'?" or "Sa ano mo gagamitin ang 'wah-wah pedal'?"


Apply the same static few-shot examples to every row of the test set.

In [20]:
def evaluate_rating(llm_func, prompt_obj, english: str, filipino: str) -> str:
    messages = prompt_obj.format_messages(english=english, filipino=filipino)
    resp = llm_func.invoke(messages)
    return resp.content

test_df['pred_raw'] = test_df.apply(
    lambda r: evaluate_rating(llm, final_prompt, r['English'], r['Filipino']),
    axis=1,
)

Extract the numeric rating from each LLM response.

In [24]:
def parse_rating(text):
    m = re.search(r"Rating:\s*([1-5])", text)
    return int(m.group(1)) if m else None

test_df['pred_rating'] = test_df['pred_raw'].apply(parse_rating)

test_df.to_csv("../results/few_shot_ratings.csv", index=False)
test_df.head()

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2,pred_raw,pred_rating
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,Rating: 5\nRemarks: The translation is accurat...,5
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,Rating: 5\nRemarks: The translation is accurat...,5
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,Rating: 5\nRemarks: The translation is accurat...,5
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,Rating: 5\nRemarks: The translation is accurat...,5
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,Rating: 5\nRemarks: The translation accurately...,5


Compare the human ratings with the LLM ratings.

In [22]:
y_true = test_df['Rating'].tolist()
y_pred = test_df['pred_rating'].tolist()

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0
)

print(f"Accuracy: {acc:.3f}")
print(f"Macro Precision: {prec:.3f}")
print(f"Macro Recall: {rec:.3f}")
print(f"Macro F1-score: {f1:.3f}")

print("\nDetailed per-rating breakdown:\n")
print(classification_report(y_true, y_pred))

Accuracy: 0.263
Macro Precision: 0.255
Macro Recall: 0.250
Macro F1-score: 0.166

Detailed per-rating breakdown:

              precision    recall  f1-score   support

         1.0       1.00      0.25      0.40         4
         2.0       0.00      0.00      0.00        10
         3.0       0.00      0.00      0.00        15
         4.0       0.00      0.00      0.00        14
         5.0       0.27      1.00      0.43        14

    accuracy                           0.26        57
   macro avg       0.25      0.25      0.17        57
weighted avg       0.14      0.26      0.13        57
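
Exact-match accuracy counts a prediction of 5 against a human rating of 4 as just as wrong as a 5 against a 1. Since the ratings are ordinal, a distance-based view can complement the report above. The sketch below computes mean absolute error and the share of predictions within one point; `ordinal_agreement` is a hypothetical helper and was not part of the original evaluation.

```python
def ordinal_agreement(y_true, y_pred):
    """Summarize how far predicted ratings fall from human ratings."""
    # Skip rows where the rating could not be parsed (None).
    pairs = [(float(t), float(p)) for t, p in zip(y_true, y_pred) if p is not None]
    if not pairs:
        raise ValueError("no parsed predictions to score")
    errors = [abs(t - p) for t, p in pairs]
    return {
        "n_scored": len(pairs),
        "mae": sum(errors) / len(errors),
        "within_one": sum(e <= 1 for e in errors) / len(errors),
    }


# First five test rows shown earlier: human ratings vs. parsed LLM ratings.
print(ordinal_agreement([4.0, 4.0, 3.0, 5.0, 4.0], [5, 5, 5, 5, 5]))
# {'n_scored': 5, 'mae': 1.0, 'within_one': 0.8}
```

On the full test set the same function could be called with `y_true` and `y_pred` as defined above.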



## Semantic Similarity Few Shot

Set up documents / vector store.

In [11]:
# Build one Document per training pair; the text is embedded, the metadata is kept for lookup.
documents = [Document(
    page_content=(
        f"English: {row['English']}\n"
        f"Filipino-Flawed: {row['Filipino-Flawed']}\n"
        f"Filipino-Correct: {row['Filipino-Correct']}\n"
        f"Remarks: {row['Remarks']}"
    ),
    metadata={
        "english": row["English"],
        "flawed": row["Filipino-Flawed"],
        "correct": row["Filipino-Correct"],
        "remarks": row["Remarks"]
        # Would be good to add rating, type of flaw, etc. here as metadata
    }
) for _, row in training_df.iterrows()]

doc_ids = [f"train_{i}" for i in range(len(documents))]

texts = []
text_ids = []
for i, doc in enumerate(documents):
    # Suppose metadata holds each segment
    texts.append(doc.metadata["english"])
    text_ids.append(f"train_{i}_en")
    texts.append(doc.metadata["flawed"])
    text_ids.append(f"train_{i}_flawed")
    texts.append(doc.metadata["correct"])
    text_ids.append(f"train_{i}_correct")


TODO: Drop the LangChain Chroma wrapper and use the ChromaDB client directly.

In [19]:
# from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from chromadb import PersistentClient

# ef = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
client = PersistentClient(path="../vector_stores/chroma_db/agent_translation_eval_documents")
col = client.get_or_create_collection(name="agent_eval")

col.add(
    documents=[doc.page_content for doc in documents],
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[doc.metadata for doc in documents],
)

/Users/riki/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:46<00:00, 1.78MiB/s]


In [20]:
# Collect the IDs of the first three documents
sample_ids = [f"doc_{i}" for i in range(3)]

# Fetch their embeddings, documents, and metadata from Chroma
records = col.get(
    ids=sample_ids,
    include=["embeddings", "documents", "metadatas"],
)

for doc_id, doc_txt, meta, embed in zip(
    records["ids"],
    records["documents"],
    records["metadatas"],
    records["embeddings"],
):
    print("ID:", doc_id)
    print("Doc:", doc_txt[:60], "…")
    print("Meta:", meta)
    print("Embedding size:", len(embed))
    print("First 8 floats:", embed[:8])
    print("—" * 40)


ID: doc_0
Doc: English: The Philippines is an archipelago made up of over 7 …
Meta: {'correct': 'Ang Pilipinas ay isang kapulaang binubuo ng 7,640 na isla, ngunit 2,000 lamang ang tinitirahan', 'flawed': 'Ang Pilipinas ay isang puno na binubuo ng mahigit 7,640 manok, bagaman halos 2,000 lamang ang tumira.', 'english': 'The Philippines is an archipelago made up of over 7,640 islands, though only about 2,000 are inhabited.'}
Embedding size: 384
First 8 floats: [ 0.0868037  -0.0139881  -0.02492837  0.03443569 -0.06847151 -0.05465192
 -0.00600597 -0.00713197]
————————————————————————————————————————
ID: doc_1
Doc: English: Philippines is the world's second-largest archipela …
Meta: {'flawed': 'Ang Pilipinas ay ang pangalawang malaking isla sa asya', 'english': "Philippines is the world's second-largest archipelago.", 'correct': 'Ang Pilipinas ang pangalawa sa pinakamalaking kapuluan sa mundo'}
Embedding size: 384
First 8 floats: [ 0.09797665 -0.02525598 -0.00836756  0.04072418 -0.03064715 

TODO: Use the embeddings to conduct semantic similarity search.
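
A minimal sketch of what that search involves is given below, using cosine similarity over toy vectors in place of the 384-dimensional embeddings printed above. The helper names `cosine_similarity` and `top_k_similar` are hypothetical, and cosine similarity is an assumed choice of metric rather than something the notebook has settled on.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, stored, k=3):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors stand in for real embeddings.
toy_store = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.0, 1.0, 0.0],
    "doc_2": [0.7, 0.7, 0.0],
}
print(top_k_similar([1.0, 0.1, 0.0], toy_store, k=2))
```

In practice the collection's own `col.query(query_texts=[...], n_results=k)` call handles both the embedding and the nearest-neighbor lookup; the distance metric it uses depends on how the collection was configured.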

In [None]:
system_message = """
You are a professional translation evaluator. Your job is to judge the quality of a Filipino translation of an English sentence.
Assess the translation with the following criteria in mind:
- Adequacy: How well does it preserve the original meaning?
- Fluency: How natural and grammatically correct is the Filipino?
- Lexical Choice: Are the word choices appropriate and accurate?

Then: 
- Rate the translation on a scale of 1 to 5, combining all three criteria, with 1 being the worst and 5 being the best.
- Provide a short explanation of why you gave that rating.

BEFORE rating, list the most similar example(s) from training and how they influenced your decision, e.g.:

"Closest training sentence: '...'. Gave a score of 2 because the sentence was 'unnatural for native speakers'."

Please format your response like this:
English Sentence: ...
Filipino Translation: ...
Rating: ...
Remarks: ...
"""

system_prompt = SystemMessagePromptTemplate.from_template(system_message)

human_prompt = HumanMessagePromptTemplate.from_template(
    "English Sentence: {english}\nFilipino Translation: {filipino}"
)

ai_prompt = AIMessagePromptTemplate.from_template(
    "Rating: {rating}\nRemarks: {remarks}\n"
)

example_prompt = ChatPromptTemplate.from_messages([
    human_prompt,
    ai_prompt,
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=[{
        "english": ex["English"],
        "filipino": ex["Filipino-Correct"],
        "rating": ex["Rating"],
        "remarks": ex["Remarks"],
    } for ex in examples],
    example_prompt=example_prompt,
)

final_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    few_shot_prompt,
    human_prompt
])