# Agentic LLM-as-a-Judge for English-to-Filipino Translations
Enrique Lejano & Monica Manlises | CSC420M G01

## Install Dependencies

In [2]:
%pip install -U langchain-google-genai python-dotenv langchain-core --quiet

Note: you may need to restart the kernel to use updated packages.


## Import Libraries and Setup

In [None]:
from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import (
    AIMessagePromptTemplate,
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
    HumanMessagePromptTemplate,
    PromptTemplate,  # used by the zero-shot prompt below
    SystemMessagePromptTemplate,
)
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
import pandas as pd

load_dotenv()

True

## Zero-shot Prompting

In [None]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

In [None]:
template = """
You are a professional translation evaluator. Your job is to judge the quality of a Filipino translation of an English sentence.

Assess the translation using the following criteria:
- **Adequacy (0–100)**: How well does it preserve the original meaning?
- **Fluency (0–100)**: How natural and grammatically correct is the Filipino?
- **Lexical Choice (0–100)**: Are the word choices appropriate and accurate?

Then:
- Provide a final verdict: Good or Needs Improvement
- Give a short explanation of why.

English Sentence:
{english}

Filipino Translation:
{filipino}

Please format your response like this:

Verdict: ...
Adequacy: ...
Fluency: ...
Lexical Choice: ...
Explanation:
...
"""

prompt = PromptTemplate(
    input_variables=["english", "filipino"],
    template=template
)

In [7]:
def evaluate_translation(english: str, filipino: str):
    formatted_prompt = prompt.format(english=english, filipino=filipino)
    # Use .invoke(); calling the model object directly is deprecated.
    response = llm.invoke([HumanMessage(content=formatted_prompt)])
    return response.content

In [None]:
eng = "Please ensure the system is compliant with the new policies."
flawed_tl = "Siguraduhin ang sistema ay sumunod sa bagong patakaran."

result = evaluate_translation(eng, flawed_tl)
print(result)


Verdict: Needs Improvement
Adequacy: 85
Fluency: 75
Lexical Choice: 80

Explanation:
While the translation captures the core meaning, it could be improved in terms of fluency and lexical choice.

*   **Adequacy:** The translation conveys the essential meaning of ensuring compliance.
*   **Fluency:** The sentence structure is slightly awkward. A more natural phrasing might be "Pakisiguro na ang sistema ay sumusunod sa mga bagong patakaran." The use of "ay" after "sistema" is grammatically correct but can sound less fluent in modern Filipino.
*   **Lexical Choice:** "Patakaran" is generally correct for "policy," but "mga bagong patakaran" (plural) would be more accurate since "policies" is plural in the original English sentence. Also, "sumunod" is a good choice for "compliant," but "tumalima" could also be considered for a more formal tone.
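
The response layout requested in the prompt is regular enough to be split into fields for later analysis. The following is a minimal sketch and not part of the original notebook: `parse_zero_shot_response` is a hypothetical helper, and it assumes the model follows the requested layout (each label at the start of a line, integer scores). Fields that cannot be found are returned as `None`.

```python
import re


def parse_zero_shot_response(text: str) -> dict:
    """Split a 'Verdict / Adequacy / Fluency / Lexical Choice / Explanation' reply into fields."""
    parsed = {}

    # The verdict is the rest of the line after the label.
    verdict = re.search(r"^Verdict:\s*(.+)$", text, flags=re.MULTILINE)
    parsed["verdict"] = verdict.group(1).strip() if verdict else None

    # Each score label must start a line, so bullet lines such as
    # '*   **Adequacy:** ...' in the explanation are not matched.
    for label, key in [
        ("Adequacy", "adequacy"),
        ("Fluency", "fluency"),
        ("Lexical Choice", "lexical_choice"),
    ]:
        score = re.search(rf"^{re.escape(label)}:\s*(\d+)", text, flags=re.MULTILINE)
        parsed[key] = int(score.group(1)) if score else None

    # The explanation is everything after its label, across multiple lines.
    explanation = re.search(
        r"^Explanation:\s*(.*)", text, flags=re.MULTILINE | re.DOTALL
    )
    parsed["explanation"] = explanation.group(1).strip() if explanation else None
    return parsed


# Sample text copied from the output above (explanation shortened).
sample = """Verdict: Needs Improvement
Adequacy: 85
Fluency: 75
Lexical Choice: 80

Explanation:
While the translation captures the core meaning, it could be improved in terms of fluency and lexical choice.
"""

parsed = parse_zero_shot_response(sample)
print(parsed["verdict"], parsed["adequacy"], parsed["fluency"], parsed["lexical_choice"])
```

Returning `None` for missing fields, rather than raising, lets a caller decide how to handle replies that drift from the requested format.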


## Load Dataset and Data Preprocessing

In [56]:
training_set = "../datasets/training.csv"
test_set = "../datasets/test.csv"

### Training Set

In [61]:
training_df = pd.read_csv(training_set).dropna(how='all')
training_df.drop(columns=["Contributor"], inplace=True)
training_df.head()

|   | English | Filipino-Correct | Filipino-Flawed | Remarks |
|---|---|---|---|---|
| 0 | The Philippines is an archipelago made up of o... | Ang Pilipinas ay isang kapulaang binubuo ng 7,... | Ang Pilipinas ay isang puno na binubuo ng mahi... |  |
| 1 | Philippines is the world's second-largest arch... | Ang Pilipinas ang pangalawa sa pinakamalaking ... | Ang Pilipinas ay ang pangalawang malaking isla... |  |
| 2 | Filipino and English are the two official lang... | Filipino at Ingles ang dalawang opisyal na lin... | Tagalog at Ingles ang dalawa opisyal lingwahe ... |  |
| 3 | Tagalog is the most widely spoken native langu... | Tagalog ang pinakamalawak at ginagamit na katu... | Tagalog ay ang pinaka malawak sinasabi katutub... |  |
| 4 | The Philippines was a Spanish colony for over ... | Ang Pilipinas ay naging isang kolonya ng Espan... | Pilipinas naging Espanya Kolonya sa higit 300 ... |  |


In [62]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 561 entries, 0 to 562
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   English           561 non-null    object
 1   Filipino-Correct  561 non-null    object
 2   Filipino-Flawed   561 non-null    object
 3   Remarks           332 non-null    object
dtypes: object(4)
memory usage: 21.9+ KB


### Test Set

In [67]:
test_df = pd.read_csv(test_set).dropna(how='all')
test_df.drop(columns=["Contributor"], inplace=True)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57 entries, 0 to 63
Data columns (total 5 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Source Text (English)                                           57 non-null     object 
 1   Target Text (Filipino)                                          57 non-null     object 
 2   Final Score                          (1 - lowest, 5 - highest)  57 non-null     float64
 3   Rater 1 Explanation                                             57 non-null     object 
 4   Rater 2 Explanation                                             54 non-null     object 
dtypes: float64(1), object(4)
memory usage: 2.7+ KB


In [70]:
test_df.rename(columns={
    "Source Text (English)": "English",
    "Target Text (Filipino)": "Filipino",
    "Final Score                          (1 - lowest, 5 - highest)": "Rating",
    "Rater 1 Explanation": "Remarks 1",
    "Rater 2 Explanation": "Remarks 2"
}, inplace=True)

test_df.head()

|   | English | Filipino | Rating | Remarks 1 | Remarks 2 |
|---|---|---|---|---|---|
| 0 | The children laughed and played under the afte... | Ang mga bata ay nagtawanan at naglaro sa ilali... | 4.0 | Accurate, fluent, and natural translation. Cap... | Just slight error due to the literal translati... |
| 1 | She took a break to gather her thoughts. | Nagpahinga siya para mag-isip-isip. | 4.0 | The translation is accurate. It was able to ca... | The translation would have been better if the ... |
| 2 | The algorithm efficiently identifies patterns ... | Mabisang kinikilala ng algoritmo ang mga patte... | 3.0 | The translation of "identifies" as "kinikilala... | The translation would have been better if the ... |
| 3 | Data normalization helps improve model perform... | Tumutulong sa pagpabuti ng model ang normalisa... | 5.0 | The translated text is natural and captures th... | The translation didn't literally translated th... |
| 4 | alam mo ma'am masaya naman topics natin sa phi... | You know, ma'am, we have a lot of fun philosop... | 4.0 | flawed translation is close, but failed to tra... |  |
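
The rename step above only works if each key matches the raw header character for character, including the long run of spaces inside the score column's name. A minimal sketch of a more forgiving approach follows, under the assumption that headers differ only in whitespace: `normalize_header`, `rename_headers`, and the `RENAMES` mapping are hypothetical names introduced here, not part of the original notebook.

```python
def normalize_header(name: str) -> str:
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(name.split())


# Hypothetical mapping keyed by the *normalized* header text.
RENAMES = {
    "Source Text (English)": "English",
    "Target Text (Filipino)": "Filipino",
    "Final Score (1 - lowest, 5 - highest)": "Rating",
    "Rater 1 Explanation": "Remarks 1",
    "Rater 2 Explanation": "Remarks 2",
}


def rename_headers(headers):
    """Map each header to its short name; unknown headers pass through unchanged."""
    return [RENAMES.get(normalize_header(h), h) for h in headers]


# Raw headers as they appear in the test set's info() output.
raw_headers = [
    "Source Text (English)",
    "Target Text (Filipino)",
    "Final Score                          (1 - lowest, 5 - highest)",
    "Rater 1 Explanation",
    "Rater 2 Explanation",
]

new_headers = rename_headers(raw_headers)
print(new_headers)
```

With a DataFrame, the result could be assigned back with `test_df.columns = rename_headers(test_df.columns)`.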


## Few-shot Prompting

In [44]:
examples = training_df.sample(3, random_state=38).to_dict(orient="records")

# Adding ratings to few-shot examples for the sake of prompting.
examples[0]['Rating'] = 4
examples[1]['Rating'] = 5
examples[2]['Rating'] = 4

examples

[{'English': 'The patient’s treatment plan includes specific, measurable goals collaboratively developed.',
  'Filipino-Correct': 'May malinaw at masukat na goals ang plano ng pasyente na pinlano nila nang magkasama.',
  'Filipino-Flawed': 'Gagawa sila ng plano ng pasyente para gumaling.',
  'Remarks': 'too casual and didnt mention collaboration',
  'Rating': 4},
 {'English': 'Hah! Get owned!',
  'Filipino-Correct': 'Hah! Mukha mo!',
  'Filipino-Flawed': 'Hah! Pag-aari!',
  'Remarks': 'Mukha mo is an appropriate substitution because it has the same essence  as "Get owned!". Both are very childish casual insults typically used in game-settings with the goal of teasing the opponent after beating them.',
  'Rating': 5},
 {'English': "You got spirit, Red. But this is the real world! The real world is cold! The real world doesn't care about spirit! You wanna be a hero!? Then play the part and die like every other Huntsman in history! As for me, I'll do what I do best: lie, steal, cheat, and

In [45]:
system_message = """
You are a professional translation evaluator. Your job is to judge the quality of a Filipino translation of an English sentence.
Assess the translation with the following in mind:
- Adequacy: How well does it preserve the original meaning?
- Fluency: How natural and grammatically correct is the Filipino?
- Lexical Choice: Are the word choices appropriate and accurate?

Then:
- Rate the translation on a scale of 1 to 5 that combines all three criteria, with 1 being the worst and 5 being the best.
- Provide a short explanation of why you gave that rating.

Please format your response like this:
English Sentence: ...
Filipino Translation: ...
Rating: ...
Remarks: ...
"""

system_prompt = SystemMessagePromptTemplate.from_template(system_message)

human_prompt = HumanMessagePromptTemplate.from_template(
    "English Sentence: {english}\nFilipino Translation: {filipino}"
)

ai_prompt = AIMessagePromptTemplate.from_template(
    "Rating: {rating}\nRemarks: {remarks}\n"
)

In [52]:
example_prompt = ChatPromptTemplate.from_messages([
    human_prompt,
    ai_prompt,
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=[{
        "english": ex["English"],
        "filipino": ex["Filipino-Correct"],
        "rating": ex["Rating"],
        "remarks": ex["Remarks"],
    } for ex in examples],
    example_prompt=example_prompt,
)

final_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    few_shot_prompt,
    human_prompt
])

### Single Input-Output Prompt

In [54]:
messages = final_prompt.format_messages(
    english="With what would you use a \"wah-wah pedal?\"",
    filipino="Ano ang gagamitin mo ng \"wah-wah pedal?\"",
)

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

response = llm.invoke(messages)
print(response.content)

Rating: 3
Remarks: The translation is grammatically awkward. A more natural phrasing would be "Para saan mo gagamitin ang 'wah-wah pedal'?" or "Sa ano mo gagamitin ang 'wah-wah pedal'?"
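
To compare model ratings against the human `Rating` column, the text reply first has to be turned into a number. The sketch below is one way to do that and is not part of the original notebook: `Judgement` and `parse_rating_response` are hypothetical names, and the parser assumes the reply keeps the `Rating:` and `Remarks:` labels at the start of a line. Ratings outside the 1 to 5 scale are treated as missing.

```python
import re
from typing import NamedTuple, Optional


class Judgement(NamedTuple):
    rating: Optional[float]
    remarks: Optional[str]


def parse_rating_response(text: str) -> Judgement:
    """Extract the numeric rating and free-text remarks from a model reply."""
    rating_match = re.search(
        r"^Rating:\s*(\d+(?:\.\d+)?)", text, flags=re.MULTILINE
    )
    remarks_match = re.search(
        r"^Remarks:\s*(.*)", text, flags=re.MULTILINE | re.DOTALL
    )

    rating = float(rating_match.group(1)) if rating_match else None
    # Discard values that fall outside the 1-5 scale used in the prompt.
    if rating is not None and not 1 <= rating <= 5:
        rating = None

    remarks = remarks_match.group(1).strip() if remarks_match else None
    return Judgement(rating, remarks)


# Sample shortened from the output above.
judgement = parse_rating_response(
    "Rating: 3\nRemarks: The translation is grammatically awkward."
)
print(judgement.rating, judgement.remarks)
```

The rating is parsed as a float because the test set stores human ratings as floats, which keeps the two directly comparable.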


### Static few-shot learning examples for entire test set.