# LLM-as-a-Judge for English-to-Filipino Translations
Enrique Lejano & Monica Manlises | CSC420M G01

## I. Setup

Install Dependencies

In [None]:
%pip install -U langchain langchain-google-genai python-dotenv langchain-core "langchain-chroma>=0.1.2" google-genai chromadb scikit-learn --quiet
%load_ext autoreload

Note: you may need to restart the kernel to use updated packages.


Import libraries and environment variables

In [50]:
%autoreload 2

from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, FewShotChatMessagePromptTemplate, AIMessagePromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.rate_limiters import InMemoryRateLimiter  
from langchain.evaluation.criteria import CriteriaEvalChain, LabeledCriteriaEvalChain
from chromadb import PersistentClient
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import pandas as pd
import numpy as np
import re
import pprint
from tqdm import tqdm

from agentic_judge_main import format_training_batches, evaluate_training_batches


load_dotenv()

True

In [5]:
training_set = "../datasets/training.csv"
test_set = "../datasets/test.csv"

### Load Dataset and Data Preprocessing

### Load training set

In [7]:
training_df = pd.read_csv(training_set).dropna(how='all')
training_df.drop(columns=["Contributor"], inplace=True)
training_df.head()

Unnamed: 0,English,Filipino-Correct,Filipino-Flawed,Remarks
0,The Philippines is an archipelago made up of o...,"Ang Pilipinas ay isang kapulaang binubuo ng 7,...",Ang Pilipinas ay isang puno na binubuo ng mahi...,
1,Philippines is the world's second-largest arch...,Ang Pilipinas ang pangalawa sa pinakamalaking ...,Ang Pilipinas ay ang pangalawang malaking isla...,
2,Filipino and English are the two official lang...,Filipino at Ingles ang dalawang opisyal na lin...,Tagalog at Ingles ang dalawa opisyal lingwahe ...,
3,Tagalog is the most widely spoken native langu...,Tagalog ang pinakamalawak at ginagamit na katu...,Tagalog ay ang pinaka malawak sinasabi katutub...,
4,The Philippines was a Spanish colony for over ...,Ang Pilipinas ay naging isang kolonya ng Espan...,Pilipinas naging Espanya Kolonya sa higit 300 ...,


Add "None" to all rows with missing remarks.

In [8]:
# Add "None" to all rows with missing remarks.
training_df['Remarks'] = training_df['Remarks'].fillna("None")
training_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 561 entries, 0 to 562
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   English           561 non-null    object
 1   Filipino-Correct  561 non-null    object
 2   Filipino-Flawed   561 non-null    object
 3   Remarks           561 non-null    object
dtypes: object(4)
memory usage: 21.9+ KB


### Load test set

In [9]:
test_df = pd.read_csv(test_set).dropna(how='all')
test_df.drop(columns=["Contributor"], inplace=True)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57 entries, 0 to 63
Data columns (total 5 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Source Text (English)                                           57 non-null     object 
 1   Target Text (Filipino)                                          57 non-null     object 
 2   Final Score                          (1 - lowest, 5 - highest)  57 non-null     float64
 3   Rater 1 Explanation                                             57 non-null     object 
 4   Rater 2 Explanation                                             54 non-null     object 
dtypes: float64(1), object(4)
memory usage: 2.7+ KB


In [10]:
test_df.rename(columns={
    "Source Text (English)": "English",
    "Target Text (Filipino)": "Filipino",
    "Final Score                          (1 - lowest, 5 - highest)": "Rating",
    "Rater 1 Explanation": "Remarks 1",
    "Rater 2 Explanation": "Remarks 2"
}, inplace=True)

test_df['Remarks 2'] = test_df['Remarks 2'].fillna("None")

test_df.head()

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",


## II. Few-shot Prompting

Before making any LLM calls, set up a rate limiter so that requests stay within the quota of the Gemini free plan.

In [None]:
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.25,  # 1 request every 4 seconds
    check_every_n_seconds=0.1,
    max_bucket_size=1.0        # No piling up of surplus tokens
)
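As a rough sanity check on what this limiter implies for run time, the sketch below estimates a lower bound on wall-clock time for a full pass over the test set. `estimate_min_runtime_seconds` is a hypothetical helper (not part of LangChain), and the estimate ignores model latency and retries.

```python
def estimate_min_runtime_seconds(n_requests: int, requests_per_second: float) -> float:
    """Lower bound on run time when every request must wait for one token."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return n_requests / requests_per_second


# The test set in this notebook has 57 rows, with one request per row.
test_set_seconds = estimate_min_runtime_seconds(57, 0.25)
print(f"{test_set_seconds:.0f} s (~{test_set_seconds / 60:.1f} min)")
```

Halving the interval to about 2 seconds per request would roughly halve this bound, provided the quota allows it.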

In [12]:
examples = training_df.sample(3, random_state=38).to_dict(orient="records")

# The training set has no ratings, so ratings are assigned by hand to the few-shot examples for prompting.
examples[0]['Rating'] = 4
examples[1]['Rating'] = 5
examples[2]['Rating'] = 4

examples

[{'English': 'Dinagyang culminates in a competition of tribes dancing with drums.',
  'Filipino-Correct': 'Nagtatapos ang Dinagyang sa paligsahan ng mga tribo na sumasayaw kasabay ng mga tambol.',
  'Filipino-Flawed': '… sa isang patimpalak ng mga tribo na kumakanta ng kundiman.',
  'Remarks': 'Replaces drum dance with singing—a meaning swap.',
  'Rating': 4},
 {'English': 'James is eating strawberries in Baguio.',
  'Filipino-Correct': 'Kumakain si James ng strawberries sa Baguio.',
  'Filipino-Flawed': 'Si James ay kumakain ng strawberries sa lugar ng Baguio.',
  'Remarks': 'ay inversion, unnecessary addition of "sa lugar ng"',
  'Rating': 5},
 {'English': "Big O notation describes the upper bound of an algorithm's growth.",
  'Filipino-Correct': 'Inilalarawan ng Big O notation ang pinakamataas na paglaki ng isang algoritmo.',
  'Filipino-Flawed': 'Big O ay laki ng algo.',
  'Remarks': 'None',
  'Rating': 4}]

In [None]:

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter
)

In [17]:
system_message = """
You are a professional translation evaluator. Your job is to judge the quality of a Filipino translation of an English sentence.
Assess the translation with the following in mind:
- Adequacy: How well does it preserve the original meaning?
- Fluency: How natural and grammatically correct is the Filipino?
- Lexical Choice: Are the word choices appropriate and accurate?

Then: 
- Rate the translation on a scale of 1 to 5, serving as a combination of all three criteria with 1 being the worst and 5 being the best.
- Provide a short explanation or remarks of why you gave that rating.

Please format your response like this:
English Sentence: ...
Filipino Translation: ...
Rating: ...
Remarks: ...
"""

system_prompt = SystemMessagePromptTemplate.from_template(system_message)

human_prompt = HumanMessagePromptTemplate.from_template(
    "English Sentence: {english}\nFilipino Translation: {filipino}"
)

ai_prompt = AIMessagePromptTemplate.from_template(
    "Rating: {rating}\nRemarks: {remarks}\n"
)

In [18]:
example_prompt = ChatPromptTemplate.from_messages([
    human_prompt,
    ai_prompt,
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=[{
        "english": ex["English"],
        "filipino": ex["Filipino-Correct"],
        "rating": ex["Rating"],
        "remarks": ex["Remarks"],
    } for ex in examples],
    example_prompt=example_prompt,
)

final_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    few_shot_prompt,
    human_prompt
])
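To make the resulting prompt layout concrete, the sketch below mimics the message order in plain Python without calling LangChain: one system message, a human/AI pair per example, then the sentence to be judged. `build_message_sequence` and the `(role, content)` tuples are illustrative assumptions, not LangChain API. The single example reuses one record from the sampled examples above.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (role, content)


def build_message_sequence(
    system: str,
    examples: List[Dict[str, object]],
    english: str,
    filipino: str,
) -> List[Message]:
    """Assemble messages in the order: system, example pairs, final human turn."""
    messages: List[Message] = [("system", system)]
    for ex in examples:
        messages.append((
            "human",
            f"English Sentence: {ex['english']}\nFilipino Translation: {ex['filipino']}",
        ))
        messages.append(("ai", f"Rating: {ex['rating']}\nRemarks: {ex['remarks']}\n"))
    messages.append((
        "human",
        f"English Sentence: {english}\nFilipino Translation: {filipino}",
    ))
    return messages


toy_examples = [{
    "english": "James is eating strawberries in Baguio.",
    "filipino": "Kumakain si James ng strawberries sa Baguio.",
    "rating": 5,
    "remarks": 'ay inversion, unnecessary addition of "sa lugar ng"',
}]

sequence = build_message_sequence(
    "You are a professional translation evaluator.",
    toy_examples,
    english="She took a break to gather her thoughts.",
    filipino="Nagpahinga siya para mag-isip-isip.",
)
for role, content in sequence:
    print(f"[{role}] {content!r}")
```

Printing the sequence this way makes it easy to see that the example shown to the model pairs the correct translation with remarks that were written about the flawed one.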

#### Testing Single Input-Output Prompt

In [54]:
messages = final_prompt.format_messages(
    english="With what would you use a \"wah-wah pedal?\"",
    filipino="Ano ang gagamitin mo ng \"wah-wah pedal?\"",
)

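# Re-create the model with the rate limiter attached, since this `llm` is reused for the full test-set run.
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter,
)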

response = llm.invoke(messages)
print(response.content)

Rating: 3
Remarks: The translation is grammatically awkward. A more natural phrasing would be "Para saan mo gagamitin ang 'wah-wah pedal'?" or "Sa ano mo gagamitin ang 'wah-wah pedal'?"


Run the static few-shot prompt (the same three examples for every input) over the entire test set.

In [None]:
def evaluate_rating_static(llm_func, prompt_obj, english: str, filipino: str) -> str:
    messages = prompt_obj.format_messages(english=english, filipino=filipino)
    resp = llm_func.invoke(messages)
    return resp.content

test_df['pred_raw'] = test_df.apply(
    lambda r: evaluate_rating_static(llm, final_prompt, r['English'], r['Filipino']),
    axis=1,
)

Extract the numeric Rating from each LLM response

In [24]:
def parse_rating(text):
    m = re.search(r"Rating:\s*([1-5])", text)
    return int(m.group(1)) if m else None

test_df['pred_rating'] = test_df['pred_raw'].apply(parse_rating)

test_df.to_csv("../results/few_shot_ratings.csv", index=False)
test_df.head()

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2,pred_raw,pred_rating
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,Rating: 5\nRemarks: The translation is accurat...,5
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,Rating: 5\nRemarks: The translation is accurat...,5
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,Rating: 5\nRemarks: The translation is accurat...,5
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,Rating: 5\nRemarks: The translation is accurat...,5
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,Rating: 5\nRemarks: The translation accurately...,5
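Because `parse_rating` returns `None` when the `Rating:` pattern is missing, it may help to count unparsed replies before computing metrics. The sketch below is a minimal illustration on made-up reply strings; `summarize_parsing` is a hypothetical helper and not part of the notebook.

```python
import re
from typing import Dict, List, Optional

RATING_PATTERN = re.compile(r"Rating:\s*([1-5])")


def parse_rating(text: str) -> Optional[int]:
    """Return the first 1-5 rating found after 'Rating:', or None."""
    match = RATING_PATTERN.search(text)
    return int(match.group(1)) if match else None


def summarize_parsing(replies: List[str]) -> Dict[str, list]:
    """Parse every reply and record which positions failed to parse."""
    ratings = [parse_rating(reply) for reply in replies]
    unparsed = [i for i, rating in enumerate(ratings) if rating is None]
    return {"ratings": ratings, "unparsed_indices": unparsed}


# Made-up replies covering the short format, the long format, and a refusal.
sample_replies = [
    "Rating: 5\nRemarks: The translation is accurate.",
    "English Sentence: ...\nFilipino Translation: ...\nRating: 4\nRemarks: Minor issue.",
    "I cannot rate this translation.",
]
summary = summarize_parsing(sample_replies)
print(summary)
```

Rows listed under `unparsed_indices` could then be dropped or re-queried before they reach `accuracy_score`.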


#### Comparing Few-shot Prompt-Engineered Gemini Evaluation versus Human Evaluation.

In [22]:
y_true = test_df['Rating'].tolist()
y_pred = test_df['pred_rating'].tolist()

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0
)

print(f"Accuracy: {acc:.3f}")
print(f"Macro Precision: {prec:.3f}")
print(f"Macro Recall: {rec:.3f}")
print(f"Macro F1-score: {f1:.3f}")

print("\nDetailed per-rating breakdown:\n")
print(classification_report(y_true, y_pred))

Accuracy: 0.263
Macro Precision: 0.255
Macro Recall: 0.250
Macro F1-score: 0.166

Detailed per-rating breakdown:

              precision    recall  f1-score   support

         1.0       1.00      0.25      0.40         4
         2.0       0.00      0.00      0.00        10
         3.0       0.00      0.00      0.00        15
         4.0       0.00      0.00      0.00        14
         5.0       0.27      1.00      0.43        14

    accuracy                           0.26        57
   macro avg       0.25      0.25      0.17        57
weighted avg       0.14      0.26      0.13        57
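As a worked check on how the macro scores relate to the per-rating rows, the sketch below averages the rounded per-class values copied from the report above. Since the inputs are already rounded to two decimals, the macro precision comes out at 0.254 rather than the 0.255 that scikit-learn computes from unrounded values.

```python
from statistics import mean

# Per-class values for ratings 1 to 5, copied from the classification report.
per_class_precision = [1.00, 0.00, 0.00, 0.00, 0.27]
per_class_recall = [0.25, 0.00, 0.00, 0.00, 1.00]
per_class_f1 = [0.40, 0.00, 0.00, 0.00, 0.43]

# Macro averaging gives each rating equal weight, regardless of its support.
macro_precision = mean(per_class_precision)
macro_recall = mean(per_class_recall)
macro_f1 = mean(per_class_f1)

print(f"Macro precision: {macro_precision:.3f}")
print(f"Macro recall:    {macro_recall:.3f}")
print(f"Macro F1:        {macro_f1:.3f}")
```

The three ratings with zero recall pull every macro score down, even though rating 5 is recalled perfectly.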



## III. Semantic Similarity Few Shot

Set up documents / vector store.

In [10]:
documents = [Document(
    page_content=
        f"English: {row['English']}\n"
        f"Filipino-Flawed: {row['Filipino-Flawed']}\n"
        f"Filipino-Correct: {row['Filipino-Correct']}\n"
        f"Remarks: {row['Remarks']}",
    metadata={
        "english": row["English"],
        "flawed": row["Filipino-Flawed"],
        "correct": row["Filipino-Correct"],
        "remarks": row["Remarks"],
        # Would be good to add rating, type of flaw, etc. here as metadata
    }
) for _, row in training_df.iterrows()]

doc_ids = [f"train_{i}" for i in range(len(documents))]

texts = []
text_ids = []
for i, doc in enumerate(documents):
    # Collect each segment (English, flawed, correct) from the metadata under its own ID
    texts.append(doc.metadata["english"])
    text_ids.append(f"train_{i}_en")
    texts.append(doc.metadata["flawed"])
    text_ids.append(f"train_{i}_flawed")
    texts.append(doc.metadata["correct"])
    text_ids.append(f"train_{i}_correct")


Use Chroma DB default embedding function to create a vector store

In [None]:
# ef = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
client = PersistentClient(path="../vector_stores/chroma_db/agent_translation_eval_documents")

# Check existing collections
existing_collections = [col.name for col in client.list_collections()]

# Conditionally delete if "agent_eval" exists
if "agent_eval" in existing_collections:
    client.delete_collection("agent_eval")

# Recreate collection cleanly
collection = client.get_or_create_collection(name="agent_eval")

collection.add(
    documents=[doc.page_content for doc in documents],
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[doc.metadata for doc in documents],
)

Retrieve embeddings of all the records in the vector store collection (just to check)

In [13]:
# Query Chroma for all embeddings, documents, and metadata
records = collection.get(
    include=["embeddings", "documents", "metadatas"],
)

for doc_id, doc_txt, meta, embed in zip(
    records["ids"],
    records["documents"],
    records["metadatas"],
    records["embeddings"],
):
    print("ID:", doc_id)
    print("Doc:", doc_txt[:60], "…")
    print("Meta:", meta)
    print("Embedding size:", len(embed))
    print("First 8 floats:", embed[:8])
    print("—" * 40)


ID: doc_0
Doc: English: The Philippines is an archipelago made up of over 7 …
Meta: {'flawed': 'Ang Pilipinas ay isang puno na binubuo ng mahigit 7,640 manok, bagaman halos 2,000 lamang ang tumira.', 'english': 'The Philippines is an archipelago made up of over 7,640 islands, though only about 2,000 are inhabited.', 'remarks': 'None', 'correct': 'Ang Pilipinas ay isang kapulaang binubuo ng 7,640 na isla, ngunit 2,000 lamang ang tinitirahan'}
Embedding size: 384
First 8 floats: [ 0.08819384 -0.01419071 -0.02710477  0.0365556  -0.07022581 -0.05232793
 -0.00767512 -0.00500217]
————————————————————————————————————————
ID: doc_1
Doc: English: Philippines is the world's second-largest archipela …
Meta: {'flawed': 'Ang Pilipinas ay ang pangalawang malaking isla sa asya', 'remarks': 'None', 'english': "Philippines is the world's second-largest archipelago.", 'correct': 'Ang Pilipinas ang pangalawa sa pinakamalaking kapuluan sa mundo'}
Embedding size: 384
First 8 floats: [ 0.10007331 -0.0256438

Load vector store into LangChain, then use similarity search to find the `k` nearest sentences for any document.

In [14]:
langchain_vector_store = Chroma(
    client=client,
    collection_name="agent_eval",
    embedding_function=None  # Use Chroma's default embedding function
)

# For example, find the 10 nearest documents to the stored embedding at index 50 (doc_50)
nearest = langchain_vector_store.similarity_search_by_vector(
    records['embeddings'][50],  # Embedding of the record at index 50
    k=10
)

nearest

[Document(id='doc_50', metadata={'correct': 'Tatlo ang kumpirmadong patay sa sunog sa Tondo , Maynila , na sumiklab nitong Miyerkules ng umaga.', 'flawed': 'Tatlong tao ay kumpirmadong patay sa sunog sa Tondo, Manila na nagsimula itong umagang Miyerkules.', 'english': 'Three people were confirmed dead in a fire in Tondo, Manila, which broke out this Wednesday morning.', 'remarks': 'None'}, page_content='English: Three people were confirmed dead in a fire in Tondo, Manila, which broke out this Wednesday morning.\nFilipino-Flawed: Tatlong tao ay kumpirmadong patay sa sunog sa Tondo, Manila na nagsimula itong umagang Miyerkules.\nFilipino-Correct: Tatlo ang kumpirmadong patay sa sunog sa Tondo , Maynila , na sumiklab nitong Miyerkules ng umaga.\nRemarks: None'),
 Document(id='doc_46', metadata={'english': 'Three were killed in a fire in Novaliches, Quezon City this morning.', 'correct': 'Tatlo ang nasawi sa isang sunog sa Novaliches, Quezon City kaninang umaga .', 'remarks': 'None', 'flaw
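To illustrate what a nearest-neighbour lookup like this does, here is a toy sketch in plain Python. It ranks a few hand-made 3-dimensional vectors by squared Euclidean distance; that metric is an assumption about the collection's distance setting, and `nearest_ids` is a hypothetical helper, not Chroma or LangChain API. The real collection uses 384-dimensional embeddings.

```python
from typing import Dict, List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_ids(
    query: List[float], store: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy vectors standing in for stored embeddings.
toy_store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

# Querying with a stored vector returns that document itself first.
top2 = nearest_ids(toy_store["doc_a"], toy_store, k=2)
print(top2)
```

This mirrors the output above, where the query embedding taken from index 50 returns `doc_50` as its own nearest neighbour.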

Trying LangChain SemanticSimilarityExampleSelector

In [15]:
selector = SemanticSimilarityExampleSelector(
    vectorstore=langchain_vector_store,
    input_keys=['english'],
    example_keys=['english', 'correct', 'remarks']
)

In [20]:
system_message = """
You are a professional translation evaluator. Your job is to judge the quality of a Filipino translation of an English sentence.
Assess the translation with the following in mind:
- Adequacy: How well does it preserve the original meaning?
- Fluency: How natural and grammatically correct is the Filipino?
- Lexical Choice: Are the word choices appropriate and accurate?

Then: 
- Rate the translation on a scale of 1 to 5, serving as a combination of all three criteria with 1 being the worst and 5 being the best.
- Provide a short explanation or remarks of why you gave that rating.

BEFORE rating, list any most similar example(s) from training and how they influenced your decision, e.g.:

"Closest training sentence: '...'. Gave a score of 2 because the sentence was 'unnatural for native speakers'."

Please format your response like this:
English Sentence: ...
Filipino Translation: ...
Rating: ...
Remarks: ...
"""

system_prompt = SystemMessagePromptTemplate.from_template(system_message)

human_prompt = HumanMessagePromptTemplate.from_template(
    "English Sentence: {english}\nFilipino Translation: {correct}"
)

# Changed because training set has no rating in it.
ai_prompt = AIMessagePromptTemplate.from_template(
    "Remarks: {remarks}"
)

example_prompt = ChatPromptTemplate.from_messages([
    human_prompt,
    ai_prompt,
])

dynamic_few_shot_prompt = FewShotChatMessagePromptTemplate(
    input_variables=['english', 'correct'],
    example_selector=selector,
    example_prompt=example_prompt,
)

final_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    dynamic_few_shot_prompt,
    human_prompt
])

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter
)

Test Dynamic Few Shot Prompting on one input

In [18]:
english_test = "Which of the following dental topics is developmentally appropriate for toddlers?"
filipino_test = "Alin sa mga sumusunod na paksa sa ngipin ang angkop sa pag-unlad para sa mga paslit?"
remarks = "No direct translation for dental in tagalog, so translation tries to find a related word in \"ngipin\" which means teeth. A code-switched version like \"paksang dental\" would be better."

messages = final_prompt.format_messages(
    english=english_test,
    correct=filipino_test,
    remarks=remarks
)

resp = llm.invoke(messages)
resp.content

'Rating: 4\nRemarks: The translation is generally good and conveys the meaning accurately. However, the phrase "angkop sa pag-unlad para sa mga paslit" could be slightly improved for a more natural flow in Filipino. A more natural phrasing might be "angkop sa antas ng pag-unlad ng mga paslit" or something similar.'

Dynamic few shot prompting on entire test set

In [None]:
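def evaluate_rating_dynamic_fewshot(llm_func, prompt_obj, english, filipino, remarks):
    # NOTE: `remarks` is passed through to format_messages, but the final human
    # prompt only references {english} and {correct}.
    messages = prompt_obj.format_messages(
        english=english,
        correct=filipino,
        remarks=remarks
    )
    resp = llm_func.invoke(messages)
    return resp.content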

In [None]:
test_df_dynamic_few_shot = test_df.copy()
test_df_dynamic_few_shot['pred_raw'] = test_df_dynamic_few_shot.apply(
    lambda r: evaluate_rating_dynamic_fewshot(llm, final_prompt, r['English'], r['Filipino'], r['Remarks 1']),
    axis=1
)

In [23]:
def parse_rating(text):
    m = re.search(r"Rating:\s*([1-5])", text)
    return int(m.group(1)) if m else None

test_df_dynamic_few_shot['pred_rating'] = test_df_dynamic_few_shot['pred_raw'].apply(parse_rating)

test_df_dynamic_few_shot.to_csv("../results/dynamic_fewshot_ratings.csv", index=False)
test_df_dynamic_few_shot.head()

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2,pred_raw,pred_rating
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,Rating: 5\nRemarks: The translation is accurat...,5
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,Rating: 5\nRemarks: The translation is accurat...,5
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,Rating: 5\nRemarks: The translation is accurat...,5
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,Rating: 4\nRemarks: The translation is accurat...,4
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,English Sentence: alam mo ma'am masaya naman t...,4


In [24]:
y_true = test_df_dynamic_few_shot['Rating'].tolist()
y_pred = test_df_dynamic_few_shot['pred_rating'].tolist()

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0
)

print(f"Accuracy: {acc:.3f}")
print(f"Macro Precision: {prec:.3f}")
print(f"Macro Recall: {rec:.3f}")
print(f"Macro F1-score: {f1:.3f}")

print("\nDetailed per-rating breakdown:\n")
print(classification_report(y_true, y_pred))

Accuracy: 0.228
Macro Precision: 0.084
Macro Recall: 0.186
Macro F1-score: 0.105

Detailed per-rating breakdown:

              precision    recall  f1-score   support

         1.0       0.00      0.00      0.00         4
         2.0       0.00      0.00      0.00        10
         3.0       0.00      0.00      0.00        15
         4.0       0.12      0.07      0.09        14
         5.0       0.29      0.86      0.44        14

    accuracy                           0.23        57
   macro avg       0.08      0.19      0.11        57
weighted avg       0.10      0.23      0.13        57
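For context, a constant-prediction baseline can be derived from the support column alone. The sketch below is only arithmetic on the counts in the report above and calls no model.

```python
# Number of test items per human rating, copied from the support column.
support = {1: 4, 2: 10, 3: 15, 4: 14, 5: 14}
total = sum(support.values())

# Accuracy of always predicting the most common human rating.
majority_rating = max(support, key=support.get)
majority_accuracy = support[majority_rating] / total

# Accuracy of always predicting a rating of 5.
always_five_accuracy = support[5] / total

print(f"Total items: {total}")
print(f"Always predict {majority_rating}: {majority_accuracy:.3f}")
print(f"Always predict 5: {always_five_accuracy:.3f}")
```

Under these counts, always predicting the most common rating would score about 0.263. That matches the static few-shot accuracy and exceeds the dynamic few-shot accuracy of 0.228, so neither prompt clearly beats a trivial baseline on exact-match accuracy.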





## IV. Memory Management

1. Store criteria for evaluating translations.
The purpose of this cell is to help the LLM *reason about what makes a translation good by showing it contrasting flawed and correct translations and having it score each one with an explanation.*

In [58]:
batches = format_training_batches(training_df)
print(f"Batch Prompt Example: {batches[0]}")

Batch Prompt Example: ("You are evaluating English-to-Filipino translations based on three criteria:\n1. Adequacy: Does the Filipino translation preserve the meaning of the original sentence?\n2. Fluency: Is it natural, smooth, and grammatically correct to be easily understood by a native speaker?\n3. Lexical Choice: Are the words contextually accurate and culturally approrpiate?\n\nPlease provide scores from 1-5, with 1 being the lowest and 5 being the highest, along with reasoning for each.\nRespond ONLY with a valid JSON array. Do not use markdown code blocks (```). Do not include any explanations, formatting, or text outside of the JSON. Start your response directly with [ and end with ].Your response must look exactly like this:\n[\n{\n'item_num': 0,\n'adequacy_score': 4,'adequacy_reasoning': 'The correct translation preserves the meaning well.'\n'fluency_score': 5\n'fluency_reasoning': 'The sentence reads smoothly in Filipino.'\n'lexical_choice_score': 4\n'lexical_choice_reasonin
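The internals of `format_training_batches` live in `agentic_judge_main` and are not shown here. As a sketch of the batching idea only, the snippet below splits row indices into groups of 10, assuming that batch size from the test run in this section. `chunk` is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk(items: Sequence[T], size: int) -> List[List[T]]:
    """Split items into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


# The training set has 561 rows; indices stand in for the rows here.
row_ids = list(range(561))
batches_of_ids = chunk(row_ids, 10)

print(f"Number of batches: {len(batches_of_ids)}")
print(f"Size of last batch: {len(batches_of_ids[-1])}")
```

Under this assumption, one request per batch replaces hundreds of per-sentence requests, which matters under the rate limit configured earlier.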

Test run with one batch of 10:

In [59]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter
)

test_batch_results = evaluate_training_batches(batches[:1], llm)


In [60]:
test_batch_results

[{'orig_sentence': 'The Philippines is an archipelago made up of over 7,640 islands, though only about 2,000 are inhabited.',
  'flawed_translation': 'Ang Pilipinas ay isang puno na binubuo ng mahigit 7,640 manok, bagaman halos 2,000 lamang ang tumira.',
  'target_translation': 'Ang Pilipinas ay isang kapulaang binubuo ng 7,640 na isla, ngunit 2,000 lamang ang tinitirahan',
  'item_num': 1,
  'adequacy_score': 1,
  'adequacy_reasoning': "The flawed translation uses incorrect words ('puno' for archipelago, 'manok' for islands, 'tumira' for inhabited), completely distorting the meaning.",
  'fluency_score': 2,
  'fluency_reasoning': 'While grammatically structured, the incorrect word choices make the sentence nonsensical and unnatural.',
  'lexical_choice_score': 1,
  'lexical_choice_reasoning': 'The lexical choices are entirely inappropriate and nonsensical in the context.'},
 {'orig_sentence': "Philippines is the world's second-largest archipelago.",
  'flawed_translation': 'Ang Pilipi
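The batch prompt asks for a bare JSON array, but a model may still wrap its reply in a markdown fence. The sketch below shows one defensive way to parse such a reply. `parse_json_array` is a hypothetical helper and is not necessarily how `evaluate_training_batches` handles replies.

```python
import json
import re
from typing import Any, List

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this block fence-safe


def parse_json_array(reply: str) -> List[Any]:
    """Strip an optional markdown fence, then parse the reply as a JSON array."""
    text = reply.strip()
    text = re.sub(r"^`{3}[a-zA-Z]*\s*", "", text)  # Leading fence, optional language tag
    text = re.sub(r"\s*`{3}$", "", text)           # Trailing fence
    parsed = json.loads(text)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array")
    return parsed


# Made-up reply that ignores the "no code blocks" instruction.
fenced_reply = FENCE + 'json\n[{"item_num": 1, "adequacy_score": 1}]\n' + FENCE
items = parse_json_array(fenced_reply)
print(items)
```

A reply that is not valid JSON still raises `json.JSONDecodeError`, so callers can log the raw text and retry that batch.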

In [None]:
# llm = ChatGoogleGenerativeAI(
#     model="gemini-2.0-flash",
#     temperature=0,
#     max_tokens=None,
#     timeout=None,
#     max_retries=2,
#     rate_limiter=rate_limiter
# )

# criteria = {
#     "adequacy": "Does the Filipino sentence translation preserve the original meaning of the English sentence with accuracy?",
#     "fluency": "Is the sentence translation natural, smooth, and grammatically correct in Filipino, to the point that it can be easily understood by native speakers?",
#     "lexical_choice": "Does the Filipino sentence translation use appropriate and contextually accurate words?",
# }

# evaluator = LabeledCriteriaEvalChain.from_llm(
#     llm=llm,
#     criteria=criteria
# )

# # Loop over training set rows for evaluation:
# evaluations = []
# for _, row in tqdm(training_df.iterrows(), total=len(training_df)):
#     try:
#         input_text = f"Please translate the following English sentence into Filipino: {row['English']}"
#         prediction = row['Filipino-Correct']
#         reference = row['Filipino-Flawed']

#         result = evaluator.evaluate_strings(
#             prediction=prediction,
#             reference=reference,
#             input=input_text
#         )
#         evaluations.append({
#             "orig_sentence": row['English'],
#             "flawed_translation": row['Filipino-Flawed'],
#             "target_translation": row['Filipino-Correct'],
#             "eval_reasoning": result.get("reasoning", ""),
#             "raw_eval": result,
#         })
#     except Exception as e:
#             evaluations.append({
#             "adequacy_reasoning": f"Error: {str(e)}",
#             "adequacy_score": None,
#             "raw_eval": None
#         })

# training_eval_df = pd.DataFrame(evaluations)
# training_eval_df.to_csv('../results/criteria_eval_training.csv', index=False)
# training_eval_df

Batching requests

TODO: Can this qualitative assessment be stored for every sentence and then used as an additional basis for judging?

**TODO**: Compare the remarks made by the LLM with the remarks made by the human evaluators. Ask the LLM to reflect on the differences between the two evaluations, then make a judgment and assign a score.

TODO: Try lowering rate limiter to ~2 seconds per request.