# LLM-as-a-Judge for English-to-Filipino Translations: Prompt-Engineered LLM Judge
Enrique Lejano & Monica Manlises | CSC420M G01 

## I. Setup

Uncomment the cell below to install the required packages.

In [None]:
# %pip install -U langchain langchain-google-genai python-dotenv langchain-core "langchain-chroma>=0.1.2" google-genai chromadb scikit-learn libretranslatepy langchain-community --quiet
# %load_ext autoreload

In [35]:
%load_ext autoreload
%autoreload 2

from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.rate_limiters import InMemoryRateLimiter
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from libretranslatepy import LibreTranslateAPI
import pandas as pd
import numpy as np
import re

from prompt_judge_main import load_training_set, load_test_set, prepare_zeroshot_prompt, evaluate_with_prompt_engineering, parse_rating, measure_standard_metrics

load_dotenv(override=True)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


True
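
The `True` above only says that `load_dotenv` set at least one variable; it does not confirm that the Gemini key itself is present. A minimal sketch of an explicit check follows, assuming the key is stored under `GOOGLE_API_KEY` (the variable `langchain-google-genai` reads by default). The helper name `has_google_api_key` is hypothetical and not part of the notebook.

```python
import os
from typing import Mapping, Optional


def has_google_api_key(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if GOOGLE_API_KEY is set to a non-empty value.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real credentials.
    """
    env = os.environ if env is None else env
    return bool(env.get("GOOGLE_API_KEY", "").strip())


# Warn early instead of failing on the first Gemini request.
if not has_google_api_key():
    print("GOOGLE_API_KEY is not set; check the .env file before calling Gemini.")
```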

### Load Dataset and Preprocessing

In [4]:
training_set_path = "../datasets/training.csv"
test_set_path = "../datasets/test.csv"

Load the training set (561 rows).

In [6]:
training_df = load_training_set(training_set_path)
training_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 561 entries, 0 to 562
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   English           561 non-null    object
 1   Filipino-Correct  561 non-null    object
 2   Filipino-Flawed   561 non-null    object
 3   Remarks           561 non-null    object
dtypes: object(4)
memory usage: 21.9+ KB


Load the test set (57 rows).

In [9]:
test_df = load_test_set(test_set_path)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57 entries, 0 to 63
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   English    57 non-null     object 
 1   Filipino   57 non-null     object 
 2   Rating     57 non-null     float64
 3   Remarks 1  57 non-null     object 
 4   Remarks 2  57 non-null     object 
dtypes: float64(1), object(4)
memory usage: 2.7+ KB


Set up a shared rate limiter for Gemini API requests so the notebook stays within the API's rate limits.

In [26]:
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.25,  # 1 request every 4 seconds
    check_every_n_seconds=0.1,
    max_bucket_size=1.0        # Cap the bucket at one token so idle time cannot build up a burst
)
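
To make the three parameters concrete, here is a toy token bucket driven by an explicit clock. It is a sketch of the general technique, not LangChain's implementation; the class name `TokenBucketSketch` and its `try_acquire` method are hypothetical. It assumes the bucket starts empty and refills continuously at `requests_per_second`, capped at `max_bucket_size`.

```python
class TokenBucketSketch:
    """Toy token bucket; `now` is passed in so the behaviour is deterministic."""

    def __init__(self, requests_per_second: float, max_bucket_size: float):
        self.rate = requests_per_second
        self.capacity = max_bucket_size
        self.tokens = 0.0   # assumption: the bucket starts empty
        self.last = None    # time of the previous refill, in seconds

    def try_acquire(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        if self.last is not None:
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Poll once per second for 20 seconds with the notebook's settings.
bucket = TokenBucketSketch(requests_per_second=0.25, max_bucket_size=1.0)
granted = [t for t in range(0, 21) if bucket.try_acquire(float(t))]
print(granted)

# Lower bound on wall-clock time for the 57-row test set at this rate.
n_rows = 57
min_seconds = n_rows / 0.25
print(min_seconds)
```

Under these assumptions a request is granted every fourth second (t = 4, 8, 12, 16, 20), and a full pass over the 57 test rows takes at least 228 seconds before any model latency is counted. Because the bucket holds at most one token, a long idle period still allows only a single immediate request afterwards.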

## II. Zero-Shot Prompting

Prepare the LLM and the zero-shot prompt.

In [27]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    rate_limiter=rate_limiter
)

final_prompt = prepare_zeroshot_prompt(llm)
final_prompt

ChatPromptTemplate(input_variables=['english', 'filipino'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], input_types={}, partial_variables={}, template='\n    You are a professional translation evaluator. You must assess a Filipino translation based on:\n    - Adequacy: Does the Filipino translation preserve the meaning of the original sentence?.\n    - Fluency: Is it natural, smooth, and grammatically correct to be easily understood by a native speaker?.\n    - Lexical Choice: Are the words contextually accurate and culturally appropriate?.\n\n    For each input: \n    - Adequacy rating (1-5) + detailed reasoning for your score (cite words or phrases from the translation),\n    - Fluency rating (1-5) + reasoning,\n    - Lexical Choice rating (1-5) + reasoning.\n    - Overall rating (1-5).\n\n    All the reasonings should be detailed.\n    Output Format:\n    English Sentence: ...\n    Filipino Translation: ...\n 

Format the messages and test the prompt on a single input.

In [28]:
messages = final_prompt.format_messages(
    english="With what would you use a \"wah-wah pedal?\"",
    filipino="Ano ang gagamitin mo ng \"wah-wah pedal?\"",
)

response = llm.invoke(messages)
print(response.content)

English Sentence: With what would you use a "wah-wah pedal?"
Filipino Translation: Ano ang gagamitin mo ng "wah-wah pedal?"
Adequacy: 3 - The translation captures the basic question of what one would use with a wah-wah pedal. However, the preposition "with" is not accurately translated, leading to a slightly awkward phrasing. The Filipino sentence directly translates to "What will you use of 'wah-wah pedal'?" which isn't entirely correct.
Fluency: 3 - The sentence is grammatically correct but sounds slightly unnatural. A more fluent phrasing might reorder the words or use a different construction.
Lexical Choice: 5 - The term "wah-wah pedal" is correctly retained, and the other words are common and appropriate.
Overall Rating: 3


Run the LLM prompt on the entire test set.

In [29]:
test_df_prompt_engineered = test_df.copy()

test_df_prompt_engineered['pred_raw'] = test_df_prompt_engineered.apply(
    lambda r: evaluate_with_prompt_engineering(llm, final_prompt, r['English'], r['Filipino']),
    axis=1,
)

Save the results (full response and extracted rating) to a `.csv` file.

In [33]:
test_df_prompt_engineered['pred_rating'] = test_df_prompt_engineered['pred_raw'].apply(parse_rating)

test_df_prompt_engineered.to_csv('../results/prompt_engineered_ratings.csv', index=False)
test_df_prompt_engineered

Unnamed: 0,English,Filipino,Rating,Remarks 1,Remarks 2,pred_raw,pred_rating
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,English Sentence: The children laughed and pla...,5
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,English Sentence: She took a break to gather h...,5
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,English Sentence: The algorithm efficiently id...,5
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,English Sentence: Data normalization helps imp...,4
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,English Sentence: alam mo ma'am masaya naman t...,1
5,It's raining cats and dogs.,Umuulan ng pusa at aso.,1.0,"Literal translation of an idiom, doesn't make ...",Fails to capture the idiomatic meaning; sounds...,English Sentence: It's raining cats and dogs.\...,5
6,The party of the first part shall not be held ...,Ang partido ng unang bahagi ay hindi mananagot...,4.0,"Accurate and understandable, good for formal/l...","Effectively conveys legal terminology, correct...",English Sentence: The party of the first part ...,5
7,Thank you for coming to the event.,Salamat sa pagpunta sa kaganapan.,5.0,"Perfect, natural, and accurate.","Excellent, natural, and idiomatic translation.",English Sentence: Thank you for coming to the ...,5
8,"Despite her exhaustion, she finished the repor...","Sa kabila ng kanyang pagod, natapos niya ang u...",5.0,"Excellent, accurate, and flows well.","Fluent, accurate, and retains original meaning.","English Sentence: Despite her exhaustion, she ...",5
9,That designer bag costs an arm and a leg.,Napakamahal ng designer bag na 'yan.,5.0,"Accurate, Fluent, Coherent, Complete, and capt...","Complete, and sounds natural. Captures the mea...",English Sentence: That designer bag costs an a...,4
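
The `pred_rating` column depends on pulling a single integer out of the free-text response. The notebook's `parse_rating` lives in `prompt_judge_main` and is not shown here, so the following is only a sketch of one way such a parser could work, assuming the response ends with a line of the form `Overall Rating: 3` as in the single-input test above. The name `parse_rating_sketch` is hypothetical.

```python
import re
from typing import Optional

# Matches the overall-rating line of the judge's output, e.g. "Overall Rating: 3".
_OVERALL_RE = re.compile(r"Overall\s+Rating\s*:\s*([1-5])\b", re.IGNORECASE)


def parse_rating_sketch(response_text: Optional[str]) -> Optional[int]:
    """Return the overall rating (1-5) from a judge response, or None.

    The last match wins, in case the model repeats the phrase in its reasoning.
    """
    matches = _OVERALL_RE.findall(response_text or "")
    return int(matches[-1]) if matches else None


sample = (
    "Adequacy: 3 - ...\n"
    "Fluency: 3 - ...\n"
    "Lexical Choice: 5 - ...\n"
    "Overall Rating: 3\n"
)
print(parse_rating_sketch(sample))
```

Returning `None` for unparseable responses, rather than a default score, keeps parsing failures visible so they can be counted or dropped before metrics are computed.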


Measure the performance of the prompt-engineered LLM judge with standard classification metrics (accuracy, precision, recall, F1-score), treating each rating from 1 to 5 as a separate class.

In [36]:
y_true = test_df_prompt_engineered['Rating'].tolist()
y_pred = test_df_prompt_engineered['pred_rating'].tolist()

measure_standard_metrics(y_true, y_pred)

Accuracy: 0.263
Macro Precision: 0.210
Macro Recall: 0.220
Macro F1-score: 0.153

Detailed per-rating breakdown:

              precision    recall  f1-score   support

         1.0       0.00      0.00      0.00         4
         2.0       0.50      0.10      0.17        10
         3.0       0.00      0.00      0.00        15
         4.0       0.29      0.14      0.19        14
         5.0       0.27      0.86      0.41        14

    accuracy                           0.26        57
   macro avg       0.21      0.22      0.15        57
weighted avg       0.22      0.26      0.18        57
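
The macro figures above are unweighted means of the per-rating rows, which is why the macro F1 (0.153) sits below both the macro precision and the macro recall: it averages per-class F1 values rather than combining the two macro scores. The sketch below spells out that calculation in plain Python. It is not the notebook's `measure_standard_metrics` (which is defined in `prompt_judge_main` and not shown), the function names are hypothetical, and the six ratings are made-up toy data, not rows from the test set.

```python
from typing import Dict, Sequence, Tuple


def per_class_prf(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Dict[float, Tuple[float, float, float]]:
    """Precision, recall, and F1 for each label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios are scored as 0, as in the report above.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores: Dict[float, Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Unweighted mean of precision, recall, and F1 across labels."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy data: a judge that answers 5 almost every time.
toy_true = [1, 2, 3, 4, 5, 5]
toy_pred = [5, 5, 5, 4, 5, 5]
toy_scores = per_class_prf(toy_true, toy_pred)
macro_p, macro_r, macro_f1 = macro_average(toy_scores)
print(f"macro P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
```

The toy judge mirrors the pattern in the report: a high recall on rating 5 combined with zero scores on classes it never predicts pulls every macro average down.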

