# LLM-as-a-Judge for English-to-Filipino Translations: Agentic LLM Judge
Enrique Lejano & Monica Manlises | CSC420M G01 

## I. Setup

Install Dependencies

In [81]:
%pip install -U langchain langchain-core langchain-community langchain-google-genai "langchain-chroma>=0.1.2" google-genai chromadb scikit-learn libretranslatepy python-dotenv --quiet
%load_ext autoreload

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyterlab 4.2.5 requires ipykernel>=6.5.0, but you have ipykernel 5.5.5 which is incompatible.
datasets 3.6.0 requires requests>=2.32.2, but you have requests 2.31.0 which is incompatible.
sagemaker 2.247.1 requires packaging<25,>=23.0, but you have packaging 25.0 which is incompatible.
sagemaker 2.247.1 requires protobuf<6.0,>=3.12, but you have protobuf 6.31.1 which is incompatible.
tensorflow 2.19.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3, but you have protobuf 6.31.1 which is incompatible.
Note: you may need to restart the kernel to use updated packages.
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Import libraries and environment variables

In [62]:
%load_ext autoreload
%autoreload 2

from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate, AIMessagePromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate, MessagesPlaceholder
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.rate_limiters import InMemoryRateLimiter  
from langchain.agents import create_tool_calling_agent, AgentExecutor
from chromadb import PersistentClient
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, cohen_kappa_score, mean_absolute_error
from libretranslatepy import LibreTranslateAPI
import pandas as pd
import numpy as np
import re
import json

from agentic_judge_main import format_training_batches, evaluate_training_batches, print_collection_records, parse_rating, parse_overall_rating, libretranslate_en_fil, get_similar_examples, prepare_few_shot_agent_with_tools, evaluate_with_agent_executor, structured_evaluate_with_agent_executor

load_dotenv(override=True)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


True

### Load Dataset and Data Preprocessing

In [168]:
training_set = "../datasets/training.csv"
test_set = "../datasets/test.csv"

### Load training set

In [169]:
training_df = pd.read_csv(training_set).dropna(how='all')
training_df.drop(columns=["Contributor"], inplace=True)
training_df.head()

Unnamed: 0,English,Filipino-Correct,Filipino-Flawed,Remarks
0,The Philippines is an archipelago made up of o...,"Ang Pilipinas ay isang kapulaang binubuo ng 7,...",Ang Pilipinas ay isang puno na binubuo ng mahi...,
1,Philippines is the world's second-largest arch...,Ang Pilipinas ang pangalawa sa pinakamalaking ...,Ang Pilipinas ay ang pangalawang malaking isla...,
2,Filipino and English are the two official lang...,Filipino at Ingles ang dalawang opisyal na lin...,Tagalog at Ingles ang dalawa opisyal lingwahe ...,
3,Tagalog is the most widely spoken native langu...,Tagalog ang pinakamalawak at ginagamit na katu...,Tagalog ay ang pinaka malawak sinasabi katutub...,
4,The Philippines was a Spanish colony for over ...,Ang Pilipinas ay naging isang kolonya ng Espan...,Pilipinas naging Espanya Kolonya sa higit 300 ...,


Add "None" to all rows with missing remarks.

In [170]:
# Add "None" to all rows with missing remarks.
training_df['Remarks'] = training_df['Remarks'].fillna("None")
training_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 561 entries, 0 to 562
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   English           561 non-null    object
 1   Filipino-Correct  561 non-null    object
 2   Filipino-Flawed   561 non-null    object
 3   Remarks           561 non-null    object
dtypes: object(4)
memory usage: 21.9+ KB


### Load test set

In [171]:
test_df = pd.read_csv(test_set).dropna(how='all')
test_df.drop(columns=["Contributor"], inplace=True)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57 entries, 0 to 63
Data columns (total 5 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Source Text (English)                                           57 non-null     object 
 1   Target Text (Filipino)                                          57 non-null     object 
 2   Final Score                          (1 - lowest, 5 - highest)  57 non-null     float64
 3   Rater 1 Explanation                                             57 non-null     object 
 4   Rater 2 Explanation                                             54 non-null     object 
dtypes: float64(1), object(4)
memory usage: 2.7+ KB


In [172]:
test_df.rename(columns={
    "Source Text (English)": "English",
    "Target Text (Filipino)": "Filipino",
    "Final Score                          (1 - lowest, 5 - highest)": "Rating",
    "Rater 1 Explanation": "Remarks 1",
    "Rater 2 Explanation": "Remarks 2"
}, inplace=True)

test_df['Remarks 2'] = test_df['Remarks 2'].fillna("None")

test_df.head()

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",


Before making any LLM calls, set up a rate limiter so that requests stay within the limits of the Gemini free plan.

In [21]:
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.25,  # 1 request every 4 seconds
    check_every_n_seconds=0.1,
    max_bucket_size=1.0        # No piling up of surplus tokens
)

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter
)
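
To get a feel for what `requests_per_second=0.25` implies for run time, the token-bucket behaviour can be simulated offline. This is a minimal sketch, not LangChain's implementation: `simulate_token_bucket` is a hypothetical helper, and it assumes the bucket starts empty, requests arrive back-to-back, and model latency is ignored.

```python
def simulate_token_bucket(n_requests, requests_per_second, max_bucket_size=1.0):
    """Return the simulated time (in seconds) at which each request is admitted.

    Assumption: the bucket starts empty and every request is issued as soon as
    the previous one has been admitted.
    """
    tokens = 0.0
    now = 0.0
    admitted_at = []
    for _ in range(n_requests):
        if tokens < 1.0:
            # Wait just long enough for the bucket to refill to one token.
            wait = (1.0 - tokens) / requests_per_second
            now += wait
            tokens = min(max_bucket_size, tokens + wait * requests_per_second)
        tokens -= 1.0  # spend one token on this request
        admitted_at.append(now)
    return admitted_at


# 57 rows in the test set, one LLM call per row.
schedule = simulate_token_bucket(n_requests=57, requests_per_second=0.25)
print(f"Last request admitted after {schedule[-1]:.0f} s")
```

Under these assumptions, a pass over the 57-row test set cannot finish in under about four minutes, regardless of how fast the model responds.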

## II. Few-shot Prompting

In [60]:
examples = training_df.sample(3, random_state=38).to_dict(orient="records")

# Adding ratings to few-shot examples for the sake of prompting.
examples[0]['Rating'] = 4
examples[1]['Rating'] = 5
examples[2]['Rating'] = 4

examples

[{'English': 'Dinagyang culminates in a competition of tribes dancing with drums.', 'Filipino-Correct': 'Nagtatapos ang Dinagyang sa paligsahan ng mga tribo na sumasayaw kasabay ng mga tambol.', 'Filipino-Flawed': '… sa isang patimpalak ng mga tribo na kumakanta ng kundiman.', 'Remarks': 'Replaces drum dance with singing—a meaning swap.', 'Rating': 4}, {'English': 'James is eating strawberries in Baguio.', 'Filipino-Correct': 'Kumakain si James ng strawberries sa Baguio.', 'Filipino-Flawed': 'Si James ay kumakain ng strawberries sa lugar ng Baguio.', 'Remarks': 'ay inversion, unnecessary addition of "sa lugar ng"', 'Rating': 5}, {'English': "Big O notation describes the upper bound of an algorithm's growth.", 'Filipino-Correct': 'Inilalarawan ng Big O notation ang pinakamataas na paglaki ng isang algoritmo.', 'Filipino-Flawed': 'Big O ay laki ng algo.', 'Remarks': 'None', 'Rating': 4}]

In [None]:
system_message = """
You are a professional translation evaluator. You must assess a Filipino translation based on:
- Adequacy: Does the Filipino translation preserve the meaning of the original sentence?
- Fluency: Is it natural, smooth, and grammatically correct to be easily understood by a native speaker?
- Lexical Choice: Are the words contextually accurate and culturally appropriate?

For each input: 
- Adequacy rating (1-5) + detailed reasoning for your score (cite words or phrases from the translation),
- Fluency rating (1-5) + reasoning,
- Lexical Choice rating (1-5) + reasoning.
- Overall rating (1-5).

All the reasonings should be detailed.
Output Format:
English Sentence: ...
Filipino Translation: ...
Adequacy: (1-5) + [reasoning citing specific words/phrases]
Fluency: (1-5) + [reasoning citing specific words/phrases]
Lexical Choice: (1-5) + [reasoning citing specific words/phrases]
Overall Rating: ...

Do not preamble.
"""

system_prompt = SystemMessagePromptTemplate.from_template(system_message)

human_prompt = HumanMessagePromptTemplate.from_template(
    "English Sentence: {english}\nFilipino Translation: {filipino}"
)

ai_prompt = AIMessagePromptTemplate.from_template(
    "Rating: {rating}\nRemarks: {remarks}\n"
)

In [18]:
example_prompt = ChatPromptTemplate.from_messages([
    human_prompt,
    ai_prompt,
])

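few_shot_prompt = FewShotChatMessagePromptTemplate(
    # Each example shows the English source with the "Filipino-Correct" text.
    # In the sampled rows printed earlier, "Remarks" describe the "Filipino-Flawed" variant.
    examples=[{
        "english": ex["English"],
        "filipino": ex["Filipino-Correct"],
        "rating": ex["Rating"],
        "remarks": ex["Remarks"],
    } for ex in examples],
    example_prompt=example_prompt,
)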

final_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    few_shot_prompt,
    human_prompt
])

A simple prompt-engineered example:

In [None]:
messages = final_prompt.format_messages(
    english="With what would you use a \"wah-wah pedal?\"",
    filipino="Ano ang gagamitin mo ng \"wah-wah pedal?\"",
)

response = llm.invoke(messages)
print(response.content)

Rating: 3
Remarks: The translation is grammatically awkward. A more natural phrasing would be "Para saan mo gagamitin ang 'wah-wah pedal'?" or "Sa ano mo gagamitin ang 'wah-wah pedal'?"


Evaluate the entire test set using the same static few-shot examples for every row.

In [None]:
def evaluate_rating_static(llm_func, prompt_obj, english: str, filipino: str) -> str:
    """Format the static few-shot prompt for one sentence pair and return the raw LLM reply."""
    messages = prompt_obj.format_messages(english=english, filipino=filipino)
    resp = llm_func.invoke(messages)
    return resp.content

test_df_few_shot = test_df.copy()

test_df_few_shot['pred_raw'] = test_df_few_shot.apply(
    lambda r: evaluate_rating_static(llm, final_prompt, r['English'], r['Filipino']),
    axis=1,
)

Extract the numeric Rating from each LLM response

In [None]:
test_df_few_shot['pred_rating'] = test_df_few_shot['pred_raw'].apply(parse_rating)

test_df_few_shot.to_csv("../results/few_shot_ratings.csv", index=False)
test_df_few_shot.head()

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2,pred_raw,pred_rating
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,Rating: 5\nRemarks: The translation is accurat...,5
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,Rating: 5\nRemarks: The translation is accurat...,5
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,Rating: 5\nRemarks: The translation is accurat...,5
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,Rating: 5\nRemarks: The translation is accurat...,5
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,Rating: 5\nRemarks: The translation accurately...,5
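
`parse_rating` (and `parse_overall_rating`, imported alongside it) come from `agentic_judge_main` and are not shown in this notebook. As a rough illustration of what such a helper might do, the sketch below pulls the 1-5 score that follows a label such as `Rating:`. The function name `parse_labeled_score` and its behaviour on missing scores are assumptions, not the project's actual code.

```python
import re
from typing import Optional


def parse_labeled_score(text: str, label: str = "Rating") -> Optional[int]:
    """Return the 1-5 integer that follows `label:` in `text`, or None if absent."""
    match = re.search(rf"{re.escape(label)}:\s*([1-5])\b", text)
    return int(match.group(1)) if match else None


print(parse_labeled_score("Rating: 5\nRemarks: The translation is accurate."))
print(parse_labeled_score("Overall Rating: 3\nAdequacy: 3 - ...", label="Overall Rating"))
print(parse_labeled_score("The model returned no score."))
```

Returning `None` instead of raising keeps a single malformed reply from aborting a whole `DataFrame.apply` pass; such rows would then need to be handled before computing metrics.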


Uncomment the line below to load `test_df_few_shot` from disk on a fresh kernel and avoid unnecessary LLM calls.

In [None]:
# test_df_few_shot = pd.read_csv('../results/few_shot_ratings.csv')

Measure the performance of the few-shot LLM judge against the human ratings using standard classification metrics (accuracy, precision, recall, F1-score).

In [25]:
y_true = test_df_few_shot['Rating'].tolist()
y_pred = test_df_few_shot['pred_rating'].tolist()

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0
)

print(f"Accuracy: {acc:.3f}")
print(f"Macro Precision: {prec:.3f}")
print(f"Macro Recall: {rec:.3f}")
print(f"Macro F1-score: {f1:.3f}")

print("\nDetailed per-rating breakdown:\n")
print(classification_report(y_true, y_pred))

Accuracy: 0.263
Macro Precision: 0.255
Macro Recall: 0.250
Macro F1-score: 0.166

Detailed per-rating breakdown:

              precision    recall  f1-score   support

         1.0       1.00      0.25      0.40         4
         2.0       0.00      0.00      0.00        10
         3.0       0.00      0.00      0.00        15
         4.0       0.00      0.00      0.00        14
         5.0       0.27      1.00      0.43        14

    accuracy                           0.26        57
   macro avg       0.25      0.25      0.17        57
weighted avg       0.14      0.26      0.13        57
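
The report above shows a pattern typical of a judge that predicts the top rating almost everywhere: recall of 1.00 on class 5, low precision on that class, and a macro average dragged down by classes with no correct predictions. The sketch below recomputes macro-averaged precision, recall, and F1 by hand on a small made-up sample (not the notebook's data), treating undefined ratios as 0 in the same spirit as `zero_division=0`. `macro_prf` is a hypothetical helper written for illustration.

```python
def macro_prf(y_true, y_pred, labels):
    """Macro-averaged precision, recall and F1; undefined ratios count as 0."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        n_true = sum(1 for t in y_true if t == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: a judge that always answers 5.
toy_true = [1, 2, 3, 4, 5, 5]
toy_pred = [5, 5, 5, 5, 5, 5]
precision, recall, f1 = macro_prf(toy_true, toy_pred, labels=[1, 2, 3, 4, 5])
print(f"Macro precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

Because macro averaging weights every class equally, a single well-predicted class cannot lift the average much when the other four classes score 0.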



Evaluate LLM performance on other metrics:
- Cohen's Kappa: Chance-corrected level of agreement between two annotators (here, the LLM and the human raters) on a classification problem.
- Mean Absolute Error: Average absolute difference between the model's ratings and the ground-truth ratings, indicating how far predictions are from the ground truth on the 1-5 scale.

In [26]:
# Unweighted Cohen's kappa: no `weights` argument is passed.
kappa_score = cohen_kappa_score(y_true, y_pred)
mae_score = mean_absolute_error(y_true, y_pred)
print(f"Cohen Kappa Score: {kappa_score}")
print(f"Mean Absolute Error: {mae_score}")

Cohen Kappa Score: 0.027619821283509483
Mean Absolute Error: 1.368421052631579
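
A kappa near zero alongside an MAE above 1 is what one would expect from a judge that mostly outputs a single rating. The dependency-free sketch below makes this concrete using the unweighted definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The helper names and the toy ratings are illustrative assumptions. scikit-learn's `cohen_kappa_score` also accepts `weights="linear"` or `weights="quadratic"`, which may suit ordinal ratings better than the unweighted form.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa; undefined (division by zero) when p_e == 1."""
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    # Chance agreement: product of the two raters' marginal frequencies per label.
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def mean_abs_error(y_true, y_pred):
    """Average absolute gap between paired ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: a judge that always answers 5.
toy_true = [1, 2, 3, 4, 5, 5]
toy_pred = [5, 5, 5, 5, 5, 5]
print(f"kappa={cohen_kappa(toy_true, toy_pred):.3f}, MAE={mean_abs_error(toy_true, toy_pred):.3f}")
```

In this toy case the observed agreement equals the chance agreement, so kappa is 0 even though a third of the predictions are correct.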


## III. Memory Management and RAG

Split the training set into batches for batch evaluations via the same criteria that will be used on the test set.

In [16]:
batches = format_training_batches(training_df, batch_size=50)
print(f"Batch Prompt Example: {batches[0]}")

Batch Prompt Example: ('You are evaluating English-to-Filipino translations based on three criteria:\n1. Adequacy: Does the Filipino translation preserve the meaning of the original sentence?\n2. Fluency: Is it natural, smooth, and grammatically correct to be easily understood by a native speaker?\n3. Lexical Choice: Are the words contextually accurate and culturally appropriate?\n\nPlease provide scores from 1-5, with 1 being the lowest and 5 being the highest, along with detailed reasoning for each that cites words or phrases that may make the translation flawed.\nRespond ONLY with a valid JSON array. Do not use markdown code blocks (```). Do not include any explanations, formatting, or text outside of the JSON. Start your response directly with [ and end with ].Your response must look exactly like this:\n[\n{\n\'item_num\': 0,\n\'adequacy_score\': 4,\'adequacy_reasoning\': \'The correct translation preserves the meaning well.\'\n\'fluency_score\': 5\n\'fluency_reasoning\': \'The sen
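
`format_training_batches` is defined in `agentic_judge_main` and not shown here. The sketch below only illustrates the row-chunking arithmetic, assuming rows are split consecutively; `chunk_ranges` is a hypothetical helper. With the 561 training rows and `batch_size=50`, this yields 12 batches, the last holding 11 rows.

```python
def chunk_ranges(n_rows, batch_size):
    """Split row positions 0..n_rows-1 into consecutive ranges of at most batch_size."""
    return [
        range(start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


ranges = chunk_ranges(n_rows=561, batch_size=50)
print(f"{len(ranges)} batches; last batch has {len(ranges[-1])} rows")
```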

Conduct batch evaluations of training set translations with the following criteria:
- Adequacy: Does the Filipino translation preserve the meaning of the original sentence?
- Fluency: Is it natural, smooth, and grammatically correct to be easily understood by a native speaker?
- Lexical Choice: Are the words contextually accurate and culturally appropriate?

These evaluations will be stored for later use.

In [None]:
test_batch_results = evaluate_training_batches(batches, llm)

batch_eval_results = pd.DataFrame(test_batch_results)
batch_eval_results.to_csv("../results/batch_eval_results.csv", index=False)

Response Metadata: {'input_tokens': 4813, 'output_tokens': 5163, 'total_tokens': 9976, 'input_token_details': {'cache_read': 0}}
Batch 1: 50 items processed
Response Metadata: {'input_tokens': 5167, 'output_tokens': 5395, 'total_tokens': 10562, 'input_token_details': {'cache_read': 0}}
Batch 2: 50 items processed
Response Metadata: {'input_tokens': 5406, 'output_tokens': 5936, 'total_tokens': 11342, 'input_token_details': {'cache_read': 0}}
Batch 3: 50 items processed
Response Metadata: {'input_tokens': 2779, 'output_tokens': 5445, 'total_tokens': 8224, 'input_token_details': {'cache_read': 0}}
Batch 4: 50 items processed
Response Metadata: {'input_tokens': 4144, 'output_tokens': 6149, 'total_tokens': 10293, 'input_token_details': {'cache_read': 0}}
Batch 5: 50 items processed
Response Metadata: {'input_tokens': 2953, 'output_tokens': 5999, 'total_tokens': 8952, 'input_token_details': {'cache_read': 0}}
Batch 6: 50 items processed
Response Metadata: {'input_tokens': 5003, 'output_tokens': 5655, 'total_tokens': 10658, 'input_token_details': {'cache_read': 0}}
Batch 7: 50 items processed
Response Metadata: {'input_tokens': 4757, 'output_tokens': 6049, 'total_tokens': 10806, 'input_token_details': {'cache_read': 0}}
Batch 8: 50 items processed
Response Metadata: {'input_tokens': 3096, 'output_tokens': 5752, 'total_tokens': 8848, 'input_token_details': {'cache_read': 0}}
Batch 9: 50 items processed
Response Metadata: {'input_tokens': 8251, 'output_tokens': 5094, 'total_tokens': 13345, 'input_token_details': {'cache_read': 0}}
Batch 10: 50 items processed
Response Metadata: {'input_tokens': 3980, 'output_tokens': 5595, 'total_tokens': 9575, 'input_token_details': {'cache_read': 0}}
Batch 11: 50 items processed
Response Metadata: {'input_tokens': 996, 'output_tokens': 1308, 'total_tokens': 2304, 'input_token_details': {'cache_read': 0}}
Batch 12: 11 items processed
Processing Batches: 100%|██████████| 12/12 [06:27<00:00, 32.26s/batch]





In [38]:
# Fetch batch_eval_results
batch_eval_results = pd.read_csv('../results/batch_eval_results.csv')
batch_eval_results

Unnamed: 0,orig_sentence,flawed_translation,target_translation,item_num,adequacy_score,adequacy_reasoning,fluency_score,fluency_reasoning,lexical_choice_score,lexical_choice_reasoning
0,The Philippines is an archipelago made up of o...,Ang Pilipinas ay isang puno na binubuo ng mahi...,"Ang Pilipinas ay isang kapulaang binubuo ng 7,...",1,1,The flawed translation uses incorrect words ('...,1,The flawed translation is grammatically incorr...,1,The flawed translation uses completely wrong w...
1,Philippines is the world's second-largest arch...,Ang Pilipinas ay ang pangalawang malaking isla...,Ang Pilipinas ang pangalawa sa pinakamalaking ...,2,2,The flawed translation changes the location to...,3,The flawed translation is grammatically incorr...,3,The flawed translation uses 'isla' instead of ...
2,Filipino and English are the two official lang...,Tagalog at Ingles ang dalawa opisyal lingwahe ...,Filipino at Ingles ang dalawang opisyal na lin...,3,3,The flawed translation is understandable but m...,3,The flawed translation is grammatically incorr...,4,The flawed translation uses mostly correct words.
3,Tagalog is the most widely spoken native langu...,Tagalog ay ang pinaka malawak sinasabi katutub...,Tagalog ang pinakamalawak at ginagamit na katu...,4,2,The flawed translation is missing key words an...,1,The flawed translation is grammatically incorr...,2,The flawed translation uses incorrect words.
4,The Philippines was a Spanish colony for over ...,Pilipinas naging Espanya Kolonya sa higit 300 ...,Ang Pilipinas ay naging isang kolonya ng Espan...,5,3,The flawed translation is understandable but m...,3,The flawed translation is grammatically incorr...,4,The flawed translation uses mostly correct words.
...,...,...,...,...,...,...,...,...,...,...
556,Simbang Gabi is a nine-day dawn mass leading u...,Ang Simbang Gabi ay misa sa gabi tuwing Disyem...,Ang Simbang Gabi ay siyam na araw ng misang ma...,7,5,The correct translation accurately conveys the...,5,The Filipino sentence is grammatically correct...,5,The word choices are appropriate and natural i...
557,Flores de Mayo culminates in the grand Santacr...,… sa malaking fireworks display.,Humahantong ang Flores de Mayo sa engrandeng p...,8,5,The correct translation accurately conveys the...,5,The Filipino sentence is grammatically correct...,5,The word choices are appropriate and natural i...
558,The practice of “pagmamano” is often followed ...,Madalas sinusundan ang pagbabasbas pagmamano ng.,Madalas na sinusundan ng pagbabasbas ang pagma...,9,5,The correct translation accurately conveys the...,5,The Filipino sentence is grammatically correct...,5,The word choices are appropriate and natural i...
559,"Filipino wakes, or lamay, can last several nig...",… upang parangal ang zombie.,Ang lamay ay maaaring tumagal ng ilang gabi up...,10,5,The correct translation accurately conveys the...,5,The Filipino sentence is grammatically correct...,5,The word choices are appropriate and natural i...
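
The batch prompt asks the model to reply with a bare JSON array, but chat models sometimes wrap such output in extra text. How `evaluate_training_batches` handles this is not shown in this notebook. The sketch below is one tolerant approach under that assumption: it keeps only the span from the first `[` to the last `]` before decoding. `extract_json_array` is a hypothetical helper.

```python
import json


def extract_json_array(raw: str) -> list:
    """Decode the outermost JSON array found in a model reply.

    Raises ValueError if no bracketed span is present, if the span is not
    valid JSON, or if the decoded value is not a list.
    """
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    parsed = json.loads(raw[start:end + 1])  # JSONDecodeError subclasses ValueError
    if not isinstance(parsed, list):
        raise ValueError("decoded value is not a list")
    return parsed


reply = 'Here are the scores:\n[{"item_num": 0, "adequacy_score": 4}]\nEnd of batch.'
items = extract_json_array(reply)
print(items[0]["adequacy_score"])
```

Note that strict JSON requires double-quoted keys and strings, so a reply that copies single-quoted keys would still fail to decode and should be retried or logged.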


Store these LLM-based evaluations in a vector store as well, so that their metadata (scores and reasoning) can be retrieved later to augment LLM evaluations.

In [44]:
batch_eval_documents = []

for idx, row in batch_eval_results.iterrows():
    # Only the English sentence is embedded, since retrieval is by semantic similarity to the English source.
    embedding_input = row["orig_sentence"]

    metadata = {
        "english": row['orig_sentence'],
        "flawed_translation": row['flawed_translation'],
        "correct_translation": row['target_translation'],
        "adequacy_score": row['adequacy_score'],
        "fluency_score": row['fluency_score'],
        "lexical_score": row['lexical_choice_score'],
        "adequacy_reasoning": row['adequacy_reasoning'],
        "fluency_reasoning": row['fluency_reasoning'],
        "lexical_reasoning": row['lexical_choice_reasoning'],
    }

    batch_eval_documents.append(Document(page_content=embedding_input, metadata=metadata))

batch_eval_documents[0]

Document(metadata={'english': 'The Philippines is an archipelago made up of over 7,640 islands, though only about 2,000 are inhabited.', 'flawed_translation': 'Ang Pilipinas ay isang puno na binubuo ng mahigit 7,640 manok, bagaman halos 2,000 lamang ang tumira.', 'correct_translation': 'Ang Pilipinas ay isang kapulaang binubuo ng 7,640 na isla, ngunit 2,000 lamang ang tinitirahan', 'adequacy_score': 1, 'fluency_score': 1, 'lexical_score': 1, 'adequacy_reasoning': "The flawed translation uses incorrect words ('puno' for archipelago and 'manok' for islands).", 'fluency_reasoning': 'The flawed translation is grammatically incorrect and nonsensical.', 'lexical_reasoning': 'The flawed translation uses completely wrong words.'}, page_content='The Philippines is an archipelago made up of over 7,640 islands, though only about 2,000 are inhabited.')

### Using ChromaDB as vector store for embeddings.

In [44]:
batch_eval_client = PersistentClient(path="../vector_stores/chroma_db/batch_train_eval_documents_v2")
collection_name = "batch_train_eval_v2"

# # ONLY RUN if need to refresh vectors
# existing_collections = [col.name for col in batch_eval_client.list_collections()]
# if collection_name in existing_collections:
#     batch_eval_client.delete_collection(collection_name)

collection = batch_eval_client.get_or_create_collection(name=collection_name)

In [47]:
collection.add(
    documents=[doc.page_content for doc in batch_eval_documents],
    ids=[f"doc_{i}" for i in range(len(batch_eval_documents))],
    metadatas=[doc.metadata for doc in batch_eval_documents]
)

Check the embeddings created from the English sentences in the training set, along with their corresponding metadata.

In [48]:
print_collection_records(collection, 3)

ID: doc_0
Doc: The Philippines is an archipelago made up of over 7,640 isla …
Meta: {'fluency_reasoning': 'The flawed translation is grammatically incorrect and nonsensical.', 'adequacy_reasoning': "The flawed translation uses incorrect words ('puno' for archipelago and 'manok' for islands).", 'flawed_translation': 'Ang Pilipinas ay isang puno na binubuo ng mahigit 7,640 manok, bagaman halos 2,000 lamang ang tumira.', 'adequacy_score': 1, 'english': 'The Philippines is an archipelago made up of over 7,640 islands, though only about 2,000 are inhabited.', 'lexical_reasoning': 'The flawed translation uses completely wrong words.', 'correct_translation': 'Ang Pilipinas ay isang kapulaang binubuo ng 7,640 na isla, ngunit 2,000 lamang ang tinitirahan', 'fluency_score': 1, 'lexical_score': 1}
Embedding size: 384
First 8 floats: [ 0.08887584 -0.00797997 -0.02554781  0.03541112 -0.03908981 -0.07329558
 -0.09429879 -0.0135244 ]
————————————————————————————————————————
ID: doc_1
Doc: Philippines

Load vector store into LangChain, then use similarity search to find the `k` nearest sentences for any document.

In [45]:
records = collection.get(
    include=["embeddings", "documents", "metadatas"]
)

batch_eval_vector_store = Chroma(
    client=batch_eval_client,
    collection_name=collection_name,
    embedding_function=None
)

nearest = batch_eval_vector_store.similarity_search_by_vector(
    records['embeddings'][50],
    k=10
)

nearest

[Document(id='doc_50', metadata={'fluency_reasoning': 'The Filipino sentence is natural and grammatically correct.', 'adequacy_reasoning': 'The correct translation accurately conveys the information in the English sentence.', 'fluency_score': 5, 'english': 'Three people were confirmed dead in a fire in Tondo, Manila, which broke out this Wednesday morning.', 'lexical_score': 5, 'lexical_reasoning': 'The word choices are appropriate for the context.', 'correct_translation': 'Tatlo ang kumpirmadong patay sa sunog sa Tondo , Maynila , na sumiklab nitong Miyerkules ng umaga.', 'flawed_translation': 'Tatlong tao ay kumpirmadong patay sa sunog sa Tondo, Manila na nagsimula itong umagang Miyerkules.', 'adequacy_score': 5}, page_content='Three people were confirmed dead in a fire in Tondo, Manila, which broke out this Wednesday morning.'),
 Document(id='doc_46', metadata={'fluency_score': 3, 'flawed_translation': 'Tatlo ay napatay sa isang apoy sa Novaliches, Quezon City ngayong umaga', 'engli
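
Conceptually, the nearest-neighbour lookup above ranks stored vectors by how close they are to a query vector. The dependency-free sketch below shows that ranking on toy 2-dimensional vectors using cosine similarity. This is an illustration only: the helper names and vectors are made up, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k):
    """Return the ids of the k stored vectors most similar to `query`."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


toy_store = {"doc_a": [1.0, 0.0], "doc_b": [0.9, 0.1], "doc_c": [0.0, 1.0]}
print(top_k([1.0, 0.0], toy_store, k=2))
```

As in the notebook output, querying with a vector that is itself in the store returns that document first, since it has the highest possible similarity to itself.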

### Using Semantic Similarity Search
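For every input English sentence, the most semantically similar sentences within the training set are retrieved and used as few-shot examples. No `k` is passed to the selector below, so it uses its default of `k=4`; pass `k=3` to retrieve exactly 3 examples.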

In [61]:
selector = SemanticSimilarityExampleSelector(   
    vectorstore=batch_eval_vector_store,
    input_keys=['english'],
    example_keys=['english', 'flawed_translation', 'correct_translation', 
                  'adequacy_score', 'adequacy_reasoning', 'fluency_score',
                  'fluency_reasoning', 'lexical_score', 'lexical_reasoning']
)

In [None]:
system_message = """
You are a professional translation evaluator. You must assess a Filipino translation based on:
- Adequacy: Does the Filipino translation preserve the meaning of the original sentence?
- Fluency: Is it natural, smooth, and grammatically correct to be easily understood by a native speaker?
- Lexical Choice: Are the words contextually accurate and culturally appropriate?

For each new test example:
- First, state closest similar training examples and explain how they influence your judgment.
- Then give:
  - Overall rating (1-5),
  - Adequacy rating (1-5) + detailed reasoning for your score (cite words or phrases from the translation),
  - Fluency rating (1-5) + reasoning,
  - Lexical Choice rating (1-5) + reasoning.

All the reasonings should be detailed and should especially cite words or phrases from the translation that make the translation flawed, if applicable.
  
Output Format:
English Sentence: ...
Filipino Translation: ...
Closest example(s): [cite the exact sentences that influenced your decision]
Overall Rating: ...
Adequacy: ... - ...
Fluency: ... - ...
Lexical Choice: ... - ...

Do not preamble.
"""

system_prompt = SystemMessagePromptTemplate.from_template(system_message)

human_prompt = HumanMessagePromptTemplate.from_template(
    "English Sentence: {english}\nFilipino Translation: {correct_translation}"
)

ai_prompt = AIMessagePromptTemplate.from_template(
    "\nAdequacy: {adequacy_score} - {adequacy_reasoning}\nFluency: {fluency_score} - {fluency_reasoning}\nLexical Choice: {lexical_score} - {lexical_reasoning}"
)

example_prompt = ChatPromptTemplate.from_messages([
    human_prompt,
    ai_prompt,
])

dynamic_few_shot_prompt  = FewShotChatMessagePromptTemplate(
    input_variables=['english'],
    example_selector=selector,
    example_prompt=example_prompt
)

final_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    dynamic_few_shot_prompt,
    HumanMessagePromptTemplate.from_template("English Sentence: {english}\nFilipino translation: {filipino}")
])
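
To make the shape of the final prompt concrete, the sketch below assembles the same message sequence by hand: one system message, then a human/AI pair per retrieved example, then the new sentence pair to judge. It mirrors the templates above but is only an illustration; `assemble_messages` and the sample example dict are hypothetical, and the real formatting is done by LangChain's prompt classes.

```python
def assemble_messages(system_text, retrieved_examples, english, filipino):
    """Build (role, text) tuples in the order the few-shot chat prompt produces them."""
    messages = [("system", system_text)]
    for ex in retrieved_examples:
        messages.append((
            "human",
            f"English Sentence: {ex['english']}\n"
            f"Filipino Translation: {ex['correct_translation']}",
        ))
        messages.append((
            "ai",
            f"\nAdequacy: {ex['adequacy_score']} - {ex['adequacy_reasoning']}"
            f"\nFluency: {ex['fluency_score']} - {ex['fluency_reasoning']}"
            f"\nLexical Choice: {ex['lexical_score']} - {ex['lexical_reasoning']}",
        ))
    messages.append((
        "human",
        f"English Sentence: {english}\nFilipino translation: {filipino}",
    ))
    return messages


# Made-up retrieved example with the same metadata keys stored in the vector store.
sample = {
    "english": "Good morning.",
    "correct_translation": "Magandang umaga.",
    "adequacy_score": 5, "adequacy_reasoning": "Meaning preserved.",
    "fluency_score": 5, "fluency_reasoning": "Natural phrasing.",
    "lexical_score": 5, "lexical_reasoning": "Appropriate words.",
}
for role, text in assemble_messages("You are a translation evaluator.", [sample], "Thank you.", "Salamat."):
    print(f"[{role}] {text!r}")
```

With `k` retrieved examples the prompt therefore contains `2k + 2` messages, which is worth keeping in mind when estimating input tokens per call.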

In [None]:
import pprint
english_test = "Stealth is not your thing. But it's adorable you're willing to try."
filipino_test = "Ang pagnanakaw ay hindi bagay sa iyo. Ngunit ito ay kaibig-ibig na handa kang subukan."

messages = final_prompt.format_messages(
    english=english_test,
    filipino=filipino_test
)

resp = llm.invoke(messages)
pprint.pprint(resp.content)

('Closest example(s):\n'
 '- "Adequacy: 5 - The correct translation accurately reflects the original '
 'meaning.\n'
 'Fluency: 5 - The correct translation is more fluent and natural-sounding.\n'
 'Lexical Choice: 5 - The correct translation uses more natural Filipino '
 'phrasing." (This highlights the importance of natural phrasing and accurate '
 'meaning.)\n'
 '\n'
 'Overall Rating: 3\n'
 'Adequacy: 3 - "Pagnanakaw" translates to "theft," which is too literal and '
 'doesn\'t capture the intended meaning of "stealth" in the context of a game '
 'or challenge. "Bagay sa iyo" is acceptable but could be more natural.\n'
 'Fluency: 4 - The sentence is grammatically correct, but the word choice '
 'makes it sound slightly unnatural.\n'
 'Lexical Choice: 3 - "Pagnanakaw" is an incorrect lexical choice. While '
 'technically a translation of "stealth" in some contexts, it doesn\'t fit the '
 'intended meaning here. "Kaibig-ibig" is a valid translation of "adorable," '
 'but it sounds a bi

In [195]:
def evaluate_batch_fewshot(llm_func, prompt_obj, english, filipino):
    messages = prompt_obj.format_messages(
        english=english,
        filipino=filipino
    )
    resp = llm_func.invoke(messages)
    return resp.content

In [None]:
batch_eval_test_df = test_df.copy()
batch_eval_test_df['pred_raw'] = batch_eval_test_df.apply(
    lambda r: evaluate_batch_fewshot(llm, final_prompt, r['English'], r['Filipino']),
    axis=1
)

In [None]:
batch_eval_test_df['pred_rating'] = batch_eval_test_df['pred_raw'].apply(parse_overall_rating)

batch_eval_test_df.to_csv("../results/batch_eval_test_df.csv", index=False)
batch_eval_test_df

On a fresh kernel, load the saved results into `batch_eval_df` with the cell below to avoid unnecessary LLM calls.

In [32]:
batch_eval_df = pd.read_csv('../results/batch_eval_test_df.csv')

Measure the few-shot, prompt-engineered LLM judge against the human ratings using standard classification metrics (accuracy, macro precision, macro recall, macro F1-score).
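As a rough illustration of what the macro-averaged scores mean, the sketch below recomputes them in plain Python on a made-up toy sample. The function name `macro_precision_recall_f1` and the toy ratings are assumptions for illustration only; the notebook itself relies on scikit-learn. Each rating is treated as its own class and the per-class scores are averaged with equal weight, so a rating the judge never predicts lowers the macro average no matter how few rows carry it.

```python
def macro_precision_recall_f1(y_true, y_pred, labels):
    """Average per-class precision, recall and F1 with equal weight per class."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # An undefined ratio is scored as 0, mirroring zero_division=0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy sample (hypothetical): a judge that answers 5 for every row
human = [1, 2, 3, 4, 5, 5]
judge = [5, 5, 5, 5, 5, 5]
prec, rec, f1 = macro_precision_recall_f1(human, judge, labels=[1, 2, 3, 4, 5])
print(f"Macro Precision: {prec:.3f}")
print(f"Macro Recall: {rec:.3f}")
print(f"Macro F1-score: {f1:.3f}")
```

In this toy case the judge gets perfect recall on rating 5 but scores zero on every other rating, which is the same pattern visible in the per-rating breakdown of the notebook output.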

In [33]:
y_true = batch_eval_df['Rating'].tolist()
y_pred = batch_eval_df['pred_rating'].tolist()

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0
)

print(f"Accuracy: {acc:.3f}")
print(f"Macro Precision: {prec:.3f}")
print(f"Macro Recall: {rec:.3f}")
print(f"Macro F1-score: {f1:.3f}")

print("\nDetailed per-rating breakdown:\n")
print(classification_report(y_true, y_pred, zero_division=0))

Accuracy: 0.298
Macro Precision: 0.266
Macro Recall: 0.278
Macro F1-score: 0.202

Detailed per-rating breakdown:

              precision    recall  f1-score   support

           1       0.50      0.25      0.33         4
           2       0.00      0.00      0.00        10
           3       0.33      0.07      0.11        15
           4       0.20      0.07      0.11        14
           5       0.30      1.00      0.46        14

    accuracy                           0.30        57
   macro avg       0.27      0.28      0.20        57
weighted avg       0.25      0.30      0.19        57





Evaluate LLM performance on two further metrics:
- Cohen's Kappa: Level of agreement between two annotators (here, the human rater and the LLM judge) on a classification problem, corrected for the agreement expected by chance.
- Mean Absolute Error: Average absolute difference between the model's ratings and the ground-truth ratings, showing how far predictions fall from the ground truth on the 1-to-5 scale.
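To make the two metrics concrete, here is a minimal plain-Python sketch of both on a made-up toy sample. The helper names `cohen_kappa` and `mean_abs_error`, the `quadratic` flag, and the toy ratings are assumptions for illustration; the notebook itself calls scikit-learn. The optional quadratic weighting is shown only as a variant that penalises large rating gaps more than small ones.

```python
def cohen_kappa(y_true, y_pred, labels, quadratic=False):
    """Cohen's kappa: 1 - (observed disagreement / chance-expected disagreement)."""
    n = len(y_true)
    k = len(labels)
    index = {label: i for i, label in enumerate(labels)}

    # Confusion counts: rows are human ratings, columns are judge ratings
    counts = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Plain kappa: every disagreement costs 1.
        # Quadratic variant: cost grows with the squared rating gap.
        if quadratic:
            return (i - j) ** 2
        return 0 if i == j else 1

    observed = sum(weight(i, j) * counts[i][j]
                   for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row_totals[i] * col_totals[j] / n
                   for i in range(k) for j in range(k))
    return 1.0 - observed / expected


def mean_abs_error(y_true, y_pred):
    """Average absolute gap between paired ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy sample (hypothetical ratings on the 1-to-5 scale)
labels = [1, 2, 3, 4, 5]
human = [5, 4, 3, 2, 1, 5]
judge = [5, 5, 5, 3, 1, 4]
print("Cohen Kappa Score:", cohen_kappa(human, judge, labels))
print("Quadratic-weighted variant:", cohen_kappa(human, judge, labels, quadratic=True))
print("Mean Absolute Error:", mean_abs_error(human, judge))
```

A kappa near 0 means the judge agrees with the human rater about as often as chance would predict, while 1 means perfect agreement.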

In [34]:
# Unweighted Cohen's kappa (no `weights` argument), so every disagreement counts equally
kappa_score = cohen_kappa_score(y_true, y_pred)
mae_score = mean_absolute_error(y_true, y_pred)
print(f"Cohen Kappa Score: {kappa_score}")
print(f"Mean Absolute Error: {mae_score}")

Cohen Kappa Score: 0.07617504051863855
Mean Absolute Error: 1.3508771929824561


## IV. Tool Integration

#### LibreTranslate API
To run the LibreTranslate API, install it on the local machine with `pip install libretranslate`, then run `libretranslate` in a terminal. The locally hosted API is then available at `http://127.0.0.1:5000`.

Keep the server running while using the cells below. If the locally hosted server is closed, the API calls will fail.
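Before running the translation cells, it can help to confirm that the local server is actually reachable. The sketch below is one way to do that with the standard library only; the helper name `libretranslate_is_up` is hypothetical, and it assumes the server exposes LibreTranslate's `/languages` endpoint at the address given above.

```python
import urllib.request


def libretranslate_is_up(base_url="http://127.0.0.1:5000", timeout=2.0):
    """Return True if GET <base_url>/languages answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/languages", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeouts and HTTP errors all derive from OSError
        return False


if __name__ == "__main__":
    if libretranslate_is_up():
        print("LibreTranslate server is reachable.")
    else:
        print("LibreTranslate server is not reachable; start it with `libretranslate`.")
```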

In [52]:
lt = LibreTranslateAPI(url='http://127.0.0.1:5000')

In [53]:
lt_translations = {'English Sentence': [], 'Filipino Translation': []}

for sentence in training_df['English']:
    fil_translation = lt.translate(sentence, source="en", target="tl")
    lt_translations['English Sentence'].append(sentence)
    lt_translations['Filipino Translation'].append(fil_translation)

lt_translations = pd.DataFrame(lt_translations)
lt_translations

Unnamed: 0,English Sentence,Filipino Translation
0,The Philippines is an archipelago made up of o...,Ang Pilipinas ay isang kapuluan na binubuo ng ...
1,Philippines is the world's second-largest arch...,Ang Pilipinas ang pangalawang-pinakamalaking k...
2,Filipino and English are the two official lang...,Filipino at Ingles ang dalawang opisyal na wik...
3,Tagalog is the most widely spoken native langu...,Ang Tagalog ang pinakamalawak na sinasalitang ...
4,The Philippines was a Spanish colony for over ...,Ang Pilipinas ay isang kolonya ng mga Kastila ...
...,...,...
556,Simbang Gabi is a nine-day dawn mass leading u...,Ang Simbang Gabi ay isang siyam-araw na tala n...
557,Flores de Mayo culminates in the grand Santacr...,Si Flores de Mayo ay humantong sa dakilang pru...
558,The practice of “pagmamano” is often followed ...,Ang pagsasagawa ng “pagmamano is ay kadalasang...
559,"Filipino wakes, or lamay, can last several nig...",Maaaring tumagal ng ilang gabi ang mga himayin...


The cell below uses LangChain's `AgentExecutor` to give the LLM access to the translation tool, so that the judge can use a LibreTranslate translation as part of its evaluation.

In [85]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.agents import create_tool_calling_agent, AgentExecutor
from agentic_judge_main import libretranslate_en_fil  # Your tool

# Define the system prompt: tell the agent its task—and that it may call the translate tool
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert translation evaluator. When given an English sentence and a Filipino translation of that sentence, you will rate the Filipino translation from 1 to 5, with 1 being the lowest and 5 being the highest in terms of adequacy, fluency, and lexical choice. Please translate the English sentence using LibreTranslate to serve as partial basis for your rating."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

tools = [libretranslate_en_fil]

# Bind LLM and tools into an agent
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

english_test = "Stealth is not your thing. But it's adorable you're willing to try."
filipino_test = "Ang pagnanakaw ay hindi bagay sa iyo. Ngunit ito ay kaibig-ibig na handa kang subukan."

# Run on a test sentence; agent will automatically translate then evaluate
result = agent_executor.invoke({"input": f"English Sentence: {english_test}\nFilipino Translation: {filipino_test}"})
result

> Entering new AgentExecutor chain...
Invoking: `libretranslate_en_fil` with `{'text': "Stealth is not your thing. But it's adorable you're willing to try."}`
responded: Here's my evaluation of the Filipino translation:

*   **Adequacy:** The translation captures the general meaning of the English sentence.
*   **Fluency:** The translation sounds natural and reads well in Filipino.
*   **Lexical Choice:** "Pagnanakaw" (stealing) is not the best fit for "stealth." A better word might be "pagtatago" or "pagkikimkim" depending on the context. "Kaibig-ibig" is a good translation of "adorable."

I will use LibreTranslate to provide another translation for comparison.

Ang pagnanakaw ay hindi mo bagay. Pero mas gusto mong subukan.

Adequacy: 4 - The translation captures the general meaning of the English sentence.
Fluency: 4 - The translation is generally fluent and natural-sounding in Filipino.
Lexical Choice: 3 - "Pagnanakaw" transl

{'input': "English Sentence: Stealth is not your thing. But it's adorable you're willing to try.\nFilipino Translation: Ang pagnanakaw ay hindi bagay sa iyo. Ngunit ito ay kaibig-ibig na handa kang subukan.", 'output': 'Adequacy: 4 - The translation captures the general meaning of the English sentence.\nFluency: 4 - The translation is generally fluent and natural-sounding in Filipino.\nLexical Choice: 3 - "Pagnanakaw" translates to theft which doesn\'t really capture the meaning of stealth. LibreTranslate used "Ang pagnanakaw ay hindi mo bagay. Pero mas gusto mong subukan." which is more accurate.\n\nOverall Rating: 3\n'}

Testing helper function for grabbing semantically similar sentences.

In [120]:
english_test = "Three people were confirmed dead in a fire in Tondo, Manila, which broke out this Wednesday morning."
nearest_examples = get_similar_examples(english_test, batch_eval_vector_store)
nearest_examples[1]

Closest training examples (based on English input):
doc: Three people were confirmed dead in a fire in Tondo, Manila, which broke out this Wednesday morning. metadata: {'adequacy_reasoning': 'The correct translation accurately conveys the information in the English sentence.', 'adequacy_score': 5, 'lexical_score': 5, 'lexical_reasoning': 'The word choices are appropriate for the context.', 'fluency_reasoning': 'The Filipino sentence is natural and grammatically correct.', 'flawed_translation': 'Tatlong tao ay kumpirmadong patay sa sunog sa Tondo, Manila na nagsimula itong umagang Miyerkules.', 'fluency_score': 5, 'english': 'Three people were confirmed dead in a fire in Tondo, Manila, which broke out this Wednesday morning.', 'correct_translation': 'Tatlo ang kumpirmadong patay sa sunog sa Tondo , Maynila , na sumiklab nitong Miyerkules ng umaga.'}
- Three people were confirmed dead in a fire in Tondo, Manila, which broke out this Wednesday morning.
doc: Three were killed in a fire in 

'English Sentence: Three were killed in a fire in Novaliches, Quezon City this morning.\n\n                Flawed Filipino Translation: Tatlo ay napatay sa isang apoy sa Novaliches, Quezon City ngayong umaga\n\n                Correct Filipino Translation: Tatlo ang nasawi sa isang sunog sa Novaliches, Quezon City kaninang umaga .\n\n                Adequacy [4] - The flawed translation is understandable but missing a key word.\n\n                Fluency [3] - The flawed translation is grammatically incorrect.\n\n                Lexical Choice [5] - The flawed translation uses correct words.\n            '

### Build agentic LLM executor
Integrates retrieval (RAG) from ChromaDB, an external tool via the LibreTranslate API, and few-shot learning via semantic similarity search.
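The few-shot step can be pictured as "rank the stored examples by similarity to the input sentence, keep the top few, and paste them into the prompt". The sketch below shows that idea in plain Python. It is a simplified stand-in, not the notebook's implementation: it swaps the embedding search in ChromaDB for a bag-of-words cosine similarity, and the names `select_examples` and `format_examples` and the field layout of `EXAMPLES` are assumptions. The sentences are taken from the outputs shown in this notebook.

```python
import math
from collections import Counter

# Tiny stand-in for the vector store
EXAMPLES = [
    {"english": "Three people were confirmed dead in a fire in Tondo, Manila, "
                "which broke out this Wednesday morning.",
     "filipino": "Tatlo ang kumpirmadong patay sa sunog sa Tondo , Maynila , "
                 "na sumiklab nitong Miyerkules ng umaga."},
    {"english": "Three were killed in a fire in Novaliches, Quezon City this morning.",
     "filipino": "Tatlo ang nasawi sa isang sunog sa Novaliches, Quezon City kaninang umaga ."},
    {"english": "Thank you for coming to the event.",
     "filipino": "Salamat sa pagpunta sa kaganapan."},
]


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_examples(query, examples, k=2):
    """Return the k examples whose English side is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(examples,
                    key=lambda ex: cosine(q, bag_of_words(ex["english"])),
                    reverse=True)
    return ranked[:k]


def format_examples(examples):
    """Render the selected examples as text that can be placed in a prompt."""
    return "\n\n".join(
        f"English Sentence: {ex['english']}\nFilipino Translation: {ex['filipino']}"
        for ex in examples
    )


query = ("Three people were confirmed dead in a fire in Tondo, Manila, "
         "which broke out this Wednesday morning.")
print(format_examples(select_examples(query, EXAMPLES, k=2)))
```

With real embeddings the ranking reflects meaning rather than shared tokens, but the selection and formatting steps follow the same shape.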

In [None]:
nearest_examples = get_similar_examples(english_test, batch_eval_vector_store)
tools = [libretranslate_en_fil]
agent_executor = prepare_few_shot_agent_with_tools(nearest_examples, llm, tools)

In [159]:
english_test = "Stealth is not your thing. But it's adorable you're willing to try."
filipino_test = "Ang pagnanakaw ay hindi bagay sa iyo. Ngunit ito ay kaibig-ibig na handa kang subukan."

result = agent_executor.invoke({"input": f"English Sentence: {english_test}\nFilipino Translation: {filipino_test}"}, return_only_outputs=True)
result

> Entering new AgentExecutor chain...
Invoking: `libretranslate_en_fil` with `{'text': "Stealth is not your thing. But it's adorable you're willing to try."}`
responded: Similar Examples Influence: The provided examples highlight the importance of accurately conveying the original meaning, ensuring fluency, and selecting appropriate vocabulary in Filipino translations. They emphasize that a good translation should not only be grammatically correct but also natural and culturally relevant.

Ang pagnanakaw ay hindi mo bagay. Pero mas gusto mong subukan.

Similar Examples Influence: The examples highlighted the importance of accurately conveying the original meaning, ensuring fluency, and selecting appropriate vocabulary for the context. The flawed translations in the examples demonstrated how misinterpretations or unnatural phrasing can negatively impact the overall quality of the translation.

Adequacy: 2 - "Ang pagnanakaw ay hin

{'output': 'Similar Examples Influence: The examples highlighted the importance of accurately conveying the original meaning, ensuring fluency, and selecting appropriate vocabulary for the context. The flawed translations in the examples demonstrated how misinterpretations or unnatural phrasing can negatively impact the overall quality of the translation.\n\nAdequacy: 2 - "Ang pagnanakaw ay hindi bagay sa iyo" translates to "Stealing is not your thing," which is a significant deviation from the original meaning of "Stealth is not your thing." While the second part, "Ngunit ito ay kaibig-ibig na handa kang subukan," is closer, the initial error heavily impacts the overall adequacy.\n\nFluency: 4 - The sentence structure is grammatically correct, but the word choice makes it sound somewhat unnatural. "Ngunit ito ay kaibig-ibig na handa kang subukan" is understandable but not how a native speaker would typically phrase it.\n\nLexical Choice: 2 - "Pagnanakaw" means "theft," which is incorr

In [161]:
result['output']

'Similar Examples Influence: The examples highlighted the importance of accurately conveying the original meaning, ensuring fluency, and selecting appropriate vocabulary for the context. The flawed translations in the examples demonstrated how misinterpretations or unnatural phrasing can negatively impact the overall quality of the translation.\n\nAdequacy: 2 - "Ang pagnanakaw ay hindi bagay sa iyo" translates to "Stealing is not your thing," which is a significant deviation from the original meaning of "Stealth is not your thing." While the second part, "Ngunit ito ay kaibig-ibig na handa kang subukan," is closer, the initial error heavily impacts the overall adequacy.\n\nFluency: 4 - The sentence structure is grammatically correct, but the word choice makes it sound somewhat unnatural. "Ngunit ito ay kaibig-ibig na handa kang subukan" is understandable but not how a native speaker would typically phrase it.\n\nLexical Choice: 2 - "Pagnanakaw" means "theft," which is incorrect in this

Run agentic executor on every row of test set.
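Every row triggers at least one model request, and the run output in this section shows the Gemini free tier rejecting requests with a 429 once its per-minute quota (a `quota_value` of 15 in the log) is exceeded. One way to stay under such a quota is to space the calls out. The sketch below is a minimal, hypothetical limiter in plain Python; the class name `MinIntervalLimiter` and its parameters are assumptions, and it is not the notebook's mechanism. LangChain's `InMemoryRateLimiter`, which the notebook imports, may serve the same purpose.

```python
import time


class MinIntervalLimiter:
    """Block so that successive calls are at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous permitted call

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical usage: call limiter.wait() before each model request.
limiter = MinIntervalLimiter(max_per_minute=15)
print(f"Minimum spacing between calls: {limiter.min_interval} seconds")
```

Because a single agent run can issue more than one model request (one that calls the tool and one that produces the final answer), a limit below the nominal quota is the safer setting when waiting once per row.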

In [177]:
def evaluate_with_agent_executor(llm, english, filipino, agent_executor, vector_store, tools):
    # NOTE: the `agent_executor` argument is ignored. A new executor is built on every
    # call because the few-shot examples depend on the English sentence being rated.
    nearest_examples = get_similar_examples(english, vector_store)
    agent_executor = prepare_few_shot_agent_with_tools(nearest_examples, llm, tools)
    response = agent_executor.invoke({"input": f"English Sentence: {english}\nFilipino Translation: {filipino}"}, return_only_outputs=True)
    return response['output']

In [None]:
test_df_agent_tools = test_df.copy()
test_df_agent_tools['pred_raw'] = test_df_agent_tools.apply(
    lambda r: evaluate_with_agent_executor(llm, r['English'], r['Filipino'], agent_executor, batch_eval_vector_store, tools),
    axis=1
)

> Entering new AgentExecutor chain...
Invoking: `libretranslate_en_fil` with `{'text': 'The children laughed and played under the afternoon sun.'}`
responded: Similar Examples Influence: The provided examples highlight the importance of natural word choices and grammatical correctness in Filipino. They guide me to evaluate whether the translation uses common Filipino expressions and adheres to grammatical norms.

Ang mga bata ay nagtawanan at naglaro sa ilalim ng sikat ng araw sa hapon.

English Sentence: The children laughed and played under the afternoon sun.
Filipino Translation: Ang mga bata ay nagtawanan at naglaro sa ilalim ng hapon na araw.

Similar Examples Influence: The examples provided guidance on evaluating adequacy, fluency, and lexical choice, particularly in identifying direct translations that may not be the most natural or accurate in Filipino. They highlight the importance of idiomatic expressions and cultural

Retrying langchain_google_genai.chat_models._chat_with_retry.<locals>._chat_with_retry in 2.0 seconds as it raised ResourceExhausted: 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. [violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.0-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 15
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 37
}
].
Retrying langchain_google_genai.chat_models._chat_with_retry.<locals>._chat_with_retry in 4.0 seconds as it raised ResourceExhausted: 429 You exceeded your current quota, please check your plan and billing

Invoking: `libretranslate_en_fil` with `{'text': 'Sea of Thieves is a smash-hit pirate adventure game, offering the quintessential pirate experience of plundering lost treasures, intense battles, vanquishing sea monsters and more. Dive in with this revised edition of the game, which includes access to digital bonus media.'}`
responded: Similar Examples Influence: The provided examples highlight the importance of preserving meaning, ensuring fluency, and choosing appropriate words in Filipino translations. They guide me to focus on grammatical correctness, natural phrasing, and accurate word choices that resonate with Filipino speakers.

Ang Sea of Ebanghelista ay isang larong pakikipagsapalaran na brush-hit, na nag-aalok ng quintessential na karanasan ng pandarambong sa mga nawalang kayamanan, matinding labanan, paglupig sa mga halimaw ng dagat at higit pa. Dave sa nirebisang edisyong ito ng laro, kasama na rito ang digital bonus media.

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2,pred_raw
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,English Sentence: The children laughed and pla...
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,English Sentence: She took a break to gather h...
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,English Sentence: The algorithm efficiently id...
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,English Sentence: Data normalization helps imp...
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,English Sentence: alam mo ma'am masaya naman t...
5,It's raining cats and dogs.,Umuulan ng pusa at aso.,1.0,"Literal translation of an idiom, doesn't make ...",Fails to capture the idiomatic meaning; sounds...,English Sentence: It's raining cats and dogs.\...
6,The party of the first part shall not be held ...,Ang partido ng unang bahagi ay hindi mananagot...,4.0,"Accurate and understandable, good for formal/l...","Effectively conveys legal terminology, correct...",English Sentence: The party of the first part ...
7,Thank you for coming to the event.,Salamat sa pagpunta sa kaganapan.,5.0,"Perfect, natural, and accurate.","Excellent, natural, and idiomatic translation.",English Sentence: Thank you for coming to the ...
8,"Despite her exhaustion, she finished the repor...","Sa kabila ng kanyang pagod, natapos niya ang u...",5.0,"Excellent, accurate, and flows well.","Fluent, accurate, and retains original meaning.","English Sentence: Despite her exhaustion, she ..."
9,That designer bag costs an arm and a leg.,Napakamahal ng designer bag na 'yan.,5.0,"Accurate, Fluent, Coherent, Complete, and capt...","Complete, and sounds natural. Captures the mea...",English Sentence: That designer bag costs an a...


In [182]:
test_df_agent_tools['pred_rating'] = test_df_agent_tools['pred_raw'].apply(parse_rating)

test_df_agent_tools.to_csv('../results/test_df_agent_tools.csv', index=False)
test_df_agent_tools

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2,pred_raw,pred_rating
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,English Sentence: The children laughed and pla...,5
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,English Sentence: She took a break to gather h...,5
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,English Sentence: The algorithm efficiently id...,5
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,English Sentence: Data normalization helps imp...,4
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,English Sentence: alam mo ma'am masaya naman t...,3
5,It's raining cats and dogs.,Umuulan ng pusa at aso.,1.0,"Literal translation of an idiom, doesn't make ...",Fails to capture the idiomatic meaning; sounds...,English Sentence: It's raining cats and dogs.\...,1
6,The party of the first part shall not be held ...,Ang partido ng unang bahagi ay hindi mananagot...,4.0,"Accurate and understandable, good for formal/l...","Effectively conveys legal terminology, correct...",English Sentence: The party of the first part ...,5
7,Thank you for coming to the event.,Salamat sa pagpunta sa kaganapan.,5.0,"Perfect, natural, and accurate.","Excellent, natural, and idiomatic translation.",English Sentence: Thank you for coming to the ...,5
8,"Despite her exhaustion, she finished the repor...","Sa kabila ng kanyang pagod, natapos niya ang u...",5.0,"Excellent, accurate, and flows well.","Fluent, accurate, and retains original meaning.","English Sentence: Despite her exhaustion, she ...",5
9,That designer bag costs an arm and a leg.,Napakamahal ng designer bag na 'yan.,5.0,"Accurate, Fluent, Coherent, Complete, and capt...","Complete, and sounds natural. Captures the mea...",English Sentence: That designer bag costs an a...,4


On a fresh kernel, load the saved results into `test_df_agent_tools` with the cell below to avoid unnecessary LLM calls.

In [35]:
test_df_agent_tools = pd.read_csv('../results/test_df_agent_tools.csv')

Measure the agentic LLM judge against the human ratings using standard classification metrics (accuracy, macro precision, macro recall, macro F1-score).

In [36]:
test_df_agent_tools = pd.read_csv('../results/test_df_agent_tools.csv')

y_true = test_df_agent_tools['Rating'].tolist()
y_pred = test_df_agent_tools['pred_rating'].tolist()

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0
)

print(f"Accuracy: {acc:.3f}")
print(f"Macro Precision: {prec:.3f}")
print(f"Macro Recall: {rec:.3f}")
print(f"Macro F1-score: {f1:.3f}")

print("\nDetailed per-rating breakdown:\n")
print(classification_report(y_true, y_pred, zero_division=0))

Accuracy: 0.316
Macro Precision: 0.407
Macro Recall: 0.291
Macro F1-score: 0.260

Detailed per-rating breakdown:

              precision    recall  f1-score   support

         1.0       1.00      0.25      0.40         4
         2.0       0.00      0.00      0.00        10
         3.0       0.25      0.13      0.17        15
         4.0       0.50      0.21      0.30        14
         5.0       0.29      0.86      0.43        14

    accuracy                           0.32        57
   macro avg       0.41      0.29      0.26        57
weighted avg       0.33      0.32      0.25        57





Evaluate LLM performance on two further metrics:
- Cohen's Kappa: Level of agreement between two annotators (here, the human rater and the LLM judge) on a classification problem, corrected for the agreement expected by chance.
- Mean Absolute Error: Average absolute difference between the model's ratings and the ground-truth ratings, showing how far predictions fall from the ground truth on the 1-to-5 scale.

In [37]:
# Unweighted Cohen's kappa (no `weights` argument), so every disagreement counts equally
kappa_score = cohen_kappa_score(y_true, y_pred)
mae_score = mean_absolute_error(y_true, y_pred)
print(f"Cohen Kappa Score: {kappa_score}")
print(f"Mean Absolute Error: {mae_score}")

Cohen Kappa Score: 0.0937627395026498
Mean Absolute Error: 1.2280701754385965


### Testing structured agent evaluator.

In [60]:
english_test = "Stealth is not your thing. But it's adorable you're willing to try."
filipino_test = "Ang pagnanakaw ay hindi bagay sa iyo. Ngunit ito ay kaibig-ibig na handa kang subukan."

result = structured_evaluate_with_agent_executor(
    english=english_test,
    filipino=filipino_test,
    llm=llm,
    vector_store=batch_eval_vector_store,
    tools=[libretranslate_en_fil]
)

print(json.dumps(result, indent=2, ensure_ascii=False))

> Entering new AgentExecutor chain...
Invoking: `libretranslate_en_fil` with `{'text': "Stealth is not your thing. But it's adorable you're willing to try."}`
responded: Similar Examples Influence: The examples highlight the importance of accurately conveying the original meaning and ensuring the translation is natural and grammatically correct. They also emphasize the need for appropriate lexical choices.

Ang pagnanakaw ay hindi mo bagay. Pero mas gusto mong subukan.

Similar Examples Influence: The examples helped me understand how to evaluate adequacy, fluency, and lexical choice by looking at specific words and phrases in the translation and comparing them to the original English sentence. They also showed me how to use LibreTranslate as a baseline.

Adequacy: 4 - The translation captures the general meaning. "Pagnanakaw" (stealing) is not the best translation of "stealth," but in some contexts, it can work. The second sent

In [59]:
english_test = "I see you."
filipino_test = "Kita kita."

result = structured_evaluate_with_agent_executor(
    english=english_test,
    filipino=filipino_test,
    llm=llm,
    vector_store=batch_eval_vector_store,
    tools=[libretranslate_en_fil]
)

print(json.dumps(result, indent=2, ensure_ascii=False))

> Entering new AgentExecutor chain...
Invoking: `libretranslate_en_fil` with `{'text': 'I see you.'}`
responded: Similar Examples Influence: The examples highlight the importance of adequacy, fluency, and lexical choice in evaluating translations. They guide me to focus on whether the translation accurately conveys the meaning, reads naturally, and uses appropriate vocabulary.

Nakikita kita.

English Sentence: I see you.
Filipino Translation: Kita kita.

Similar Examples Influence: The examples highlighted the importance of accuracy in conveying the original meaning and naturalness in the translated sentence. They also emphasized the need for appropriate word choices in the Filipino translation.

Adequacy: 2 - The translation "Kita kita" literally translates to "I see I see" or "We see each other." It doesn't accurately convey the meaning of "I see you." The correct translation should be "Nakikita kita."
Fluency: 5 - "Kita kita