# Introduction

**You have an AI model. It seems to work. But how do you actually know?**

### Common Pain Points:
- **Retrieval fails silently**: The retriever returns irrelevant chunks and nobody notices
- **Context gets lost**: Important information split across chunks disappears
- **Hallucination persists**: The LLM makes up facts even with good sources
- **Quality varies wildly**: The same question gets answers of different quality each time
- **Manual checking doesn't scale**: Nobody can manually verify thousands of responses

### The $10M Question:
*"How do you evaluate AI systems that generate nuanced, contextual responses at scale?"*

In [None]:
# Why Evaluations Are Critical (Real-World Impact)

print("🚨 HIGH-STAKES AI DEPLOYMENT REALITY")
print("=" * 45)

deployment_stats = {
    "Customer Service Bots": "Handle millions of conversations daily",
    "Content Moderation": "Process billions of social media posts", 
    "Medical AI": "Assist in patient diagnosis and treatment",
    "Legal AI": "Evaluate document relevance in court cases",
    "Financial AI": "Determine loan approvals and credit decisions",
    "Educational AI": "Grade student work and provide feedback"
}

print("Current AI Scale:")
for system, impact in deployment_stats.items():
    print(f"• {system}: {impact}")

print("\n💰 COST OF POOR EVALUATION:")
print("-" * 30)

failure_costs = {
    "Customer Churn": "23% abandon AI tools after bad experience",
    "Support Costs": "Poor AI increases human tickets by 40%", 
    "Brand Damage": "AI failures become viral social content",
    "Legal Liability": "Biased systems face discrimination lawsuits",
    "Regulatory Risk": "Can't prove compliance without measurement"
}

for cost_type, impact in failure_costs.items():
    print(f"• {cost_type}: {impact}")

print("\n🎯 THE BOTTOM LINE:")
print("Without proper evaluation, AI systems fail silently at scale.")
print("LLM judges provide the solution - but only if built correctly!")

## 📊 Traditional Evaluation Methods

### Human Evaluation Methods:
- **Expert assessment**: Manual rating by experts, at $5-50 per evaluation
- **Weeks to scale**: Gold-standard quality, but on a timeline too slow for rapid iteration
- **Subjective bias**: Different evaluators apply different standards
- **Can't handle volume**: Production systems generate thousands of outputs daily

### Reference-Based Automated Metrics:
- **Exact Match**: Perfect matches only, zero tolerance
- **F1 Score**: Token overlap, misses meaning
- **BLEU**: Translation metric, ignores factual accuracy
- **ROUGE**: Content recall, can't detect hallucinations

### Critical Limitations:
- **Rigid scoring**: Correct rephrases score poorly
- **Missing hallucination detection**: Can't spot made-up facts
- **Context blind**: Ignores document grounding
- **Too slow**: Human review can't monitor production systems in real time
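
To make the first two limitations concrete, here is a minimal sketch of a unigram-overlap score. It is a simplified stand-in for ROUGE-1 recall, not the official implementation; the function name `unigram_recall` and the example sentences are assumptions made for illustration.

```python
from collections import Counter

def unigram_recall(reference: str, candidate: str) -> float:
    """Fraction of reference tokens that also appear in the candidate (simplified ROUGE-1 recall)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Multiset intersection: a token counts at most as often as it occurs in both texts
    overlap = Counter(ref_tokens) & Counter(cand_tokens)
    return sum(overlap.values()) / len(ref_tokens) if ref_tokens else 0.0

reference = "paris is the capital of france"
wrong_fact = "lyon is the capital of france"          # factually wrong, one token changed
paraphrase = "france has paris as its capital city"   # correct, but reworded

print(f"Wrong fact: {unigram_recall(reference, wrong_fact):.2f}")
print(f"Paraphrase: {unigram_recall(reference, paraphrase):.2f}")
```

In this sketch the factually wrong sentence shares five of six reference tokens, while the correct paraphrase shares only three, so the overlap score ranks the wrong answer higher.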

### Exact Match (EM)

Definition: Exact Match is a binary metric that determines if a generated text is perfectly identical to a reference text. It is a very strict measure, returning 1 (true) only if every character matches, including case, punctuation, and spacing; otherwise, it returns 0 (false). It has "zero tolerance" for any deviation.


Formula:
$$ EM(R, C) = \begin{cases} 1 & \text{if } R = C \\ 0 & \text{if } R \neq C \end{cases} $$
Where:

- $R$ is the Reference text.
- $C$ is the Candidate (generated) text.

Exact Match is straightforward to implement manually or can be found in some NLP toolkits.

In [1]:
def exact_match(reference: str, candidate: str) -> int:
    """
    Calculates the Exact Match score between a reference and a candidate string.
    Returns 1 if they are identical, 0 otherwise.
    """
    return 1 if reference == candidate else 0

# Working Example
reference_em = "The capital of France is Paris."

candidate_em_1 = "The capital of France is Paris."
candidate_em_2 = "The capital of France is paris."
candidate_em_3 = "Paris is the capital of France."

print(f"Reference: '{reference_em}'")
print(f"Candidate 1: '{candidate_em_1}' -> EM Score: {exact_match(reference_em, candidate_em_1)}")
print(f"Candidate 2: '{candidate_em_2}' -> EM Score: {exact_match(reference_em, candidate_em_2)}")
print(f"Candidate 3: '{candidate_em_3}' -> EM Score: {exact_match(reference_em, candidate_em_3)}")

Reference: 'The capital of France is Paris.'
Candidate 1: 'The capital of France is Paris.' -> EM Score: 1
Candidate 2: 'The capital of France is paris.' -> EM Score: 0
Candidate 3: 'Paris is the capital of France.' -> EM Score: 0
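
A common way to soften the zero-tolerance behaviour is to normalize both strings before comparing them. The sketch below is one possible normalization (lowercasing, removing punctuation, collapsing whitespace); the helper names `normalize` and `normalized_exact_match` are assumptions, not part of any particular toolkit.

```python
import string

def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def normalized_exact_match(reference: str, candidate: str) -> int:
    """Return 1 if the normalized strings are identical, 0 otherwise."""
    return 1 if normalize(reference) == normalize(candidate) else 0

reference = "The capital of France is Paris."
print(normalized_exact_match(reference, "The capital of France is paris."))  # case-only difference
print(normalized_exact_match(reference, "Paris is the capital of France."))  # same words, reordered
```

Normalization rescues the case-only mismatch (Candidate 2), but the reordered sentence (Candidate 3) still scores 0, so the metric remains blind to meaning.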


### F1 Score

Definition: The F1 Score is the harmonic mean of Precision and Recall. In NLP text generation evaluation (especially for tasks like question answering, where token overlap matters), it measures the overlap between the tokens in the generated text and the tokens in the reference text.

- **Precision**: Measures how many of the tokens in the generated text are also present in the reference text. It answers: "Of all the tokens I generated, how many were correct?"
- **Recall**: Measures how many of the tokens in the reference text were captured by the generated text. It answers: "Of all the correct tokens, how many did I generate?"

Formulas:
Let:

- $TP$ (True Positives) = Number of tokens common to both the candidate and reference texts.
- $FP$ (False Positives) = Number of tokens in the candidate text but not in the reference text.
- $FN$ (False Negatives) = Number of tokens in the reference text but not in the candidate text.

$$ Precision = \frac{TP}{TP + FP} = \frac{\text{Number of matching tokens}}{\text{Total tokens in candidate}} $$
$$ Recall = \frac{TP}{TP + FN} = \frac{\text{Number of matching tokens}}{\text{Total tokens in reference}} $$
$$ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} $$

Token-level F1 is usually computed directly from token counts, as in the function below. `sklearn.metrics.f1_score` expects classification labels, so it is less direct for this use case.

In [2]:
from collections import Counter

def calculate_f1_score_tokens(reference_tokens: list, candidate_tokens: list) -> float:
    """
    Calculates the token-level F1 score between a reference and a candidate list of tokens.
    """
    # Multiset intersection: each token counts at most as often as it appears in both lists
    common = Counter(reference_tokens) & Counter(candidate_tokens)
    num_common = sum(common.values())

    # No overlap (this also covers an empty candidate or reference)
    if num_common == 0:
        return 0.0

    precision = num_common / len(candidate_tokens)
    recall = num_common / len(reference_tokens)

    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Working Example
reference_f1 = "The quick brown fox jumps over the lazy dog."
candidate_f1 = "A quick fox jumps over a dog."

# Tokenize the sentences (simple split for demonstration)
reference_tokens_f1 = reference_f1.lower().split()
candidate_tokens_f1 = candidate_f1.lower().split()

print(f"\nReference tokens: {reference_tokens_f1}")
print(f"Candidate tokens: {candidate_tokens_f1}")
print(f"F1 Score (token-level): {calculate_f1_score_tokens(reference_tokens_f1, candidate_tokens_f1):.3f}")

# Using sklearn for comparison (requires converting to binary labels, which is less direct for this specific use case)
# For direct token overlap, the custom function above is more illustrative.
# If using sklearn, it's typically for classification where each token is a class.


Reference tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
Candidate tokens: ['a', 'quick', 'fox', 'jumps', 'over', 'a', 'dog.']
F1 Score (token-level): 0.625
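
The 0.625 can be checked by hand. Five tokens overlap (`quick`, `fox`, `jumps`, `over`, `dog.`), the candidate has 7 tokens, and the reference has 9:

$$ Precision = \frac{5}{7} \approx 0.714, \qquad Recall = \frac{5}{9} \approx 0.556 $$
$$ F1 = 2 \times \frac{\frac{5}{7} \times \frac{5}{9}}{\frac{5}{7} + \frac{5}{9}} = \frac{50/63}{80/63} = \frac{50}{80} = 0.625 $$

Note that `dog.` only matches because both sentences end with the same punctuation: the simple `split()` tokenizer keeps the period attached to the word.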


## What is LLM as a Judge?

LLM as a Judge is an evaluation approach in which a Large Language Model (LLM) is prompted to evaluate, score, and assess content, conversations, or decisions, using its reasoning capabilities in place of a fixed metric.

### Key Characteristics:
- **Automated Evaluation**: Replace human evaluators in specific contexts
- **Consistent Scoring**: Provide standardized assessment criteria
- **Scalable Assessment**: Handle large volumes of evaluation tasks
- **Multi-dimensional Analysis**: Evaluate multiple criteria simultaneously

### Why LLM Judges Changed Everything:
- **Semantic Understanding**: Recognizes paraphrasing and meaning beyond keywords
- **Scalable Human-like Judgment**: Thousands of evaluations in minutes vs weeks
- **Reference-free Evaluation**: Can assess faithfulness without ground truth
- **Contextual Assessment**: Considers domain expertise and user intent

In [3]:
# Setup and imports
import os
import json
import pandas as pd
from typing import Dict, List, Any, Optional
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

# Initialize LLM
try:
    llm = ChatOllama(model="llama3.1:8b", temperature=0)
    llm.invoke("Hello World!")
    print("✅ ChatOllama initialized with llama3.1:8b model")
except Exception as e:
    print(f"❌ Failed to initialize ChatOllama: {e}")
    print("Please make sure Ollama is installed and running with llama3.1 model")

✅ ChatOllama initialized with llama3.1:8b model


In [4]:
# Simple example of LLM evaluation concept
sample_text = """
The quick brown fox jumps over the lazy dog. This sentence contains all letters of the alphabet.
It's commonly used for testing fonts and keyboards.
"""

evaluation_criteria = {
    "clarity": "How clear and understandable is the text?",
    "informativeness": "How much useful information does it provide?",
    "engagement": "How engaging is the content for readers?"
}

print("Sample Text:", sample_text)
print("\nEvaluation Criteria:")
for criterion, description in evaluation_criteria.items():
    print(f"- {criterion.title()}: {description}")

Sample Text: 
The quick brown fox jumps over the lazy dog. This sentence contains all letters of the alphabet.
It's commonly used for testing fonts and keyboards.


Evaluation Criteria:
- Clarity: How clear and understandable is the text?
- Informativeness: How much useful information does it provide?
- Engagement: How engaging is the content for readers?


In [6]:
print("🤖 LLM EVALUATION RESULTS")
# Now let's use the LLM to evaluate the text against each criterion
for criterion, description in evaluation_criteria.items():
    print(f"\n🎯 Evaluating: {criterion.title()}")
    print("-" * 40)
    
    # Create evaluation prompt
    evaluation_prompt = f"""
Please evaluate the following text based on this criterion: {description}

Text to evaluate: {sample_text.strip()}

Provide a score from 1-10 and a brief explanation of your reasoning.
Format your response as:
Score: X/10
Reasoning: [Your explanation]
"""
    
    # Get LLM evaluation
    try:
        response = llm.invoke(evaluation_prompt)
        print(f"LLM Response:\n{response.content}")
    except Exception as e:
        print(f"❌ Error getting evaluation: {e}")

🤖 LLM EVALUATION RESULTS

🎯 Evaluating: Clarity
----------------------------------------
LLM Response:
Score: 9/10
Reasoning: The text is clear and easy to understand, but it assumes some prior knowledge about the purpose of the sentence. A reader who has never heard of this sentence before might not fully grasp its significance or why it's used for testing fonts and keyboards. However, the language itself is simple and straightforward, making it accessible to a wide range of readers.

🎯 Evaluating: Informativeness
----------------------------------------
LLM Response:
Score: 6/10
Reasoning: The text provides some useful information about the sentence, specifically its use for testing fonts and keyboards. However, it doesn't provide much depth or context beyond that. It also assumes prior knowledge of why this particular sentence is significant (i.e., containing all letters of the alphabet), which limits its usefulness to readers who are already familiar with this fact.

🎯 Evaluating: 
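
Free-text judge responses like the ones above have to be turned into numbers before they can be aggregated. The sketch below shows one way to extract the score with a regular expression. It assumes the judge writes either `X/10` (the format requested in the prompt) or `X out of 10`; the name `parse_score` is hypothetical.

```python
import re
from typing import Optional

# Matches "9/10", "9 / 10", "8.5/10", or "2 out of 10"
SCORE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:/|out of)\s*10", re.IGNORECASE)

def parse_score(response_text: str) -> Optional[float]:
    """Return the first score found in the response, or None if absent or outside 1-10."""
    match = SCORE_PATTERN.search(response_text)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 1 <= score <= 10 else None

print(parse_score("Score: 9/10\nReasoning: The text is clear and easy to understand."))
print(parse_score("I'd rate this abstract a 2 out of 10 in terms of quality."))
print(parse_score("The text is fine."))
```

Returning `None` instead of guessing makes unparseable responses visible, so they can be counted or retried rather than silently skewing an average.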

## Applications Across Domains

### Legal and Judicial Applications
- **Document Relevance Scoring**: Assess relevance of legal documents to cases
- **Case Law Analysis**: Evaluate similarity between legal precedents
- **Judicial Decision Support**: Assist in evidence evaluation and consistency checking

### Content Quality Evaluation
- **Academic Paper Review**: Automated initial screening of research papers
- **Content Moderation**: Scale content review for platforms
- **Customer Service Quality**: Evaluate support interactions

### Conversation Assessment
- **Chatbot Performance**: Evaluate AI assistant responses
- **Human-likeness Detection**: Assess naturalness of generated conversations
- **Training Data Quality**: Validate synthetic conversation datasets

In [7]:
# Domain Examples for LLM as Judge

print("🏛️ LEGAL DOMAIN EXAMPLE: Document Relevance Scoring")
print("=" * 60)

# Case scenario
legal_case = "Personal injury lawsuit: slip and fall at grocery store"
sample_document = "Store surveillance footage showing wet floor conditions on day of incident"

print(f"Case: {legal_case}")
print(f"Document: {sample_document}")

# LLM evaluation
legal_prompt = f"""
Rate this document's relevance to the legal case (1-10 scale):

Case: {legal_case}
Document: {sample_document}

Provide: Score (1-10) and brief reasoning.
"""

try:
    legal_result = llm.invoke(legal_prompt)
    print(f"\n🤖 LLM Evaluation:\n{legal_result.content}")
except Exception as e:
    print(f"Error: {e}")


🏛️ LEGAL DOMAIN EXAMPLE: Document Relevance Scoring
Case: Personal injury lawsuit: slip and fall at grocery store
Document: Store surveillance footage showing wet floor conditions on day of incident

🤖 LLM Evaluation:
I would rate the document's relevance to the legal case as a 9 out of 10.

The document is directly related to the incident in question, providing visual evidence of the store's condition at the time of the slip and fall. The footage can be used to:

* Support or refute claims made by the plaintiff about the cause of the accident
* Show that the store was aware of the wet floor conditions and failed to take adequate measures to address them
* Demonstrate the extent of the hazard posed by the wet floor

The only reason I wouldn't give it a perfect 10 is that, without more context or analysis, we can't be certain what specific details the footage shows. However, in general, store surveillance footage is highly relevant and probative evidence in slip and fall cases like this

In [8]:
print("\n📝 CONTENT QUALITY EXAMPLE: Academic Paper Review")
print("=" * 60)

# Sample abstract with obvious flaws
paper_abstract = """
We surveyed 10 students about social media and mood. Students using social media 
more than 3 hours daily sometimes felt sad. Therefore, social media is bad for 
all teenagers and should be banned.
"""

print(f"Abstract to review:\n{paper_abstract}")

# LLM evaluation
academic_prompt = f"""
Review this academic abstract for quality issues:

Abstract: {paper_abstract}

Rate (1-10) and identify main problems with methodology, sample size, or conclusions.
"""

try:
    academic_result = llm.invoke(academic_prompt)
    print(f"\n🤖 LLM Review:\n{academic_result.content}")
except Exception as e:
    print(f"Error: {e}")



📝 CONTENT QUALITY EXAMPLE: Academic Paper Review
Abstract to review:

We surveyed 10 students about social media and mood. Students using social media 
more than 3 hours daily sometimes felt sad. Therefore, social media is bad for 
all teenagers and should be banned.


🤖 LLM Review:
I'd rate this abstract a 2 out of 10 in terms of quality.

Here are the main problems I've identified:

1. **Sample size**: The sample size is extremely small, consisting of only 10 students. This is not sufficient to draw any meaningful conclusions about social media use and mood among teenagers.
2. **Lack of control group**: There is no comparison group or control condition in this study. How do we know that the students who used social media more than 3 hours a day would have felt sad if they hadn't used social media? A control group would help to establish causality.
3. **Correlation vs. causation**: The abstract implies that using social media causes sadness, but it's possible that there are other fac

In [9]:
print("\n💬 CONVERSATION EXAMPLE: Chatbot Response Quality")
print("=" * 60)

# Customer service scenario
customer_query = "I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?"
chatbot_response = "Orders usually ship within 5-7 business days. Please wait longer."

print(f"Customer: {customer_query}")
print(f"Chatbot: {chatbot_response}")

# LLM evaluation
chatbot_prompt = f"""
Evaluate this chatbot response for customer service quality:

Customer Query: {customer_query}
Chatbot Response: {chatbot_response}

Rate helpfulness (1-10) and suggest improvements.
"""

try:
    chatbot_result = llm.invoke(chatbot_prompt)
    print(f"\n🤖 LLM Evaluation:\n{chatbot_result.content}")
except Exception as e:
    print(f"Error: {e}")

print("\n" + "=" * 60)
print("💡 Key Takeaways:")
print("- Legal: Helps prioritize case materials")
print("- Academic: Catches obvious methodology flaws") 
print("- Chatbot: Improves customer service quality")
print("- All domains need clear evaluation criteria!")


💬 CONVERSATION EXAMPLE: Chatbot Response Quality
Customer: I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?
Chatbot: Orders usually ship within 5-7 business days. Please wait longer.

🤖 LLM Evaluation:
I would rate the helpfulness of this chatbot response as a 2 out of 10.

The response is unhelpful for several reasons:

* It doesn't acknowledge the customer's concern or frustration about not receiving shipping confirmation.
* The answer is too vague, stating only that orders "usually" ship within 5-7 business days. This doesn't provide any specific information about the status of this particular order.
* The response essentially tells the customer to wait longer without offering any additional assistance or next steps.

To improve this response, I would suggest the following:

1. Acknowledge the customer's concern: "Sorry to hear that you haven't received shipping confirmation yet."
2. Provide a more specific answer: "I've checked on your order 

## The Journey Most People Take (And Why It's Problematic)

Most people start with LLM judging the same way:
1. **"Just ask if it's correct"** - Seems obvious, what could go wrong?
2. **"Ask for true/false"** - More structured, feels better
3. **"Give it a score"** - Numbers feel objective and scientific
4. **"Compare two options"** - Let the LLM pick the better one

**Spoiler**: Each approach has serious hidden flaws that most people never discover.

Let's experience this journey together, starting with the most naive approach...

### Naive Approach 1: "Just Tell Me If This Answer Is Correct"

This is how everyone starts. It seems so simple and obvious:
- Give the LLM a question and an answer
- Ask "Is this answer correct?"
- Trust the yes/no response

### What Could Possibly Go Wrong?
Let's find out using carefully chosen examples...

In [16]:
print("\n### Naive Approach 1: 'Just Tell Me If This Answer Is Correct'")
print("This method fails spectacularly when the LLM is presented with information that is both plausible and incorrect. The LLM may lack the internal process to critically verify a statement that it is presented with as fact, especially if the answer is short and lacks context.")

# User's example details
user_question_1 = "What programming language should beginners learn first?"
model_answer_1 = "Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. Most computer science courses and coding bootcamps start with Python."
llm_judge_prompt_1 = f"""
Is the given answer correct? Only answer with Yes or No.
Question: '{user_question_1}'
Answer: '{model_answer_1}'
"""

print(f"\nUser Question: {user_question_1}")
print(f"Model Answer: {model_answer_1}")
print(f"LLM Judge Prompt:\n{llm_judge_prompt_1}")

# Ask the judge. llm.invoke() returns a message object whose .content holds the reply text.
# Expected failure: the LLM likely says Yes because the answer sounds authoritative and
# mentions 'most courses', mistaking common practice for universal truth.
response_1 = llm.invoke(llm_judge_prompt_1)
print(f"LLM Response: {response_1.content}")

print("\n**What goes wrong:** The LLM will often agree even if the answer is subjective and other equally valid answers exist. It can confidently state 'yes' to an opinion presented as fact, making the output seem reliable when it is not universally true.")
print("**Hidden Flaw:** The LLM's confidence is not a reliable indicator of correctness when the question itself is subjective. Confident, research-backed language can trick the LLM into thinking advice is factual rather than contextual.")


### Naive Approach 1: 'Just Tell Me If This Answer Is Correct'
This method fails spectacularly when the LLM is presented with information that is both plausible and incorrect. The LLM may lack the internal process to critically verify a statement that it is presented with as fact, especially if the answer is short and lacks context.

User Question: What programming language should beginners learn first?
Model Answer: Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. Most computer science courses and coding bootcamps start with Python.
LLM Judge Prompt:

Is the given answer correct? Only answer with Yes or No.
Question: 'What programming language should beginners learn first?'
Answer: 'Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. Most computer science courses and coding bootcamps start with Python.'

LLM Response: Yes
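
One way to surface this agreeableness is to feed the judge answers that are known to be wrong and measure how often it still says Yes. The sketch below is illustrative only: `false_acceptance_rate`, the stub judge, and the test cases are all assumptions, and the stub stands in for a real `llm.invoke` call so the example runs on its own.

```python
from typing import Callable, List, Tuple

def false_acceptance_rate(judge: Callable[[str, str], str],
                          wrong_cases: List[Tuple[str, str]]) -> float:
    """Share of known-incorrect (question, answer) pairs that the judge still accepts."""
    accepted = sum(
        1 for question, answer in wrong_cases
        if judge(question, answer).strip().lower().startswith("yes")
    )
    return accepted / len(wrong_cases)

# Hypothetical stand-in for an LLM judge: it says "Yes" to anything that sounds confident.
def agreeable_stub_judge(question: str, answer: str) -> str:
    return "Yes" if "definitely" in answer.lower() else "No"

# Every answer here is deliberately wrong.
wrong_cases = [
    ("What is the capital of Australia?", "It is definitely Sydney."),
    ("How many legs does a spider have?", "Spiders definitely have six legs."),
    ("What is the boiling point of water at sea level?", "About 50 degrees Celsius."),
]

rate = false_acceptance_rate(agreeable_stub_judge, wrong_cases)
print(f"False acceptance rate: {rate:.2f}")
```

A rate well above zero on such a set suggests the judge is reacting to tone rather than checking the content.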

### Naive Approach 2: True/False Classification

After discovering issues with simple correctness, people often move to true/false evaluation:
- Seems more structured and binary
- Feels more "scientific" than yes/no
- But loses important nuance...

In [None]:
print("\n### Naive Approach 2: 'True/False with Nuanced Claims'")
print("This method fails when the LLM is asked to evaluate a nuanced claim as a simple true or false statement, even with context. The LLM may lack the ability to acknowledge the complexities, exceptions, or varying degrees of truth within a statement, leading to an oversimplified 'True' or 'False' response that misses critical subtleties.")

# User's example details
user_question_2 = "Is the following statement true or false given the context?"
model_answer_2 = "Exercise is good for mental health" # This is the statement being evaluated
context_2 = "Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety."
llm_judge_prompt_2 = f"""
Is the following statement true or false given the context? Return only True or False.
Statement: '{model_answer_2}'
Context: '{context_2}'
"""

print(f"\nUser Question: {user_question_2}")
print(f"Statement being evaluated (Model Answer): {model_answer_2}")
print(f"Context provided: {context_2}")
print(f"LLM Judge Prompt:\n{llm_judge_prompt_2}")

# Ask the judge. llm.invoke() returns a message object whose .content holds the reply text.
# Expected failure: the statement is generally true, but it ignores individual variation,
# severity of conditions, and that exercise alone isn't sufficient for serious mental health issues.
# An LLM would likely respond 'True' because the statement is generally accepted, overlooking the nuances.
response_2 = llm.invoke(llm_judge_prompt_2)
print(f"LLM Response: {response_2.content}")

print("\n**What goes wrong:** The LLM will likely respond 'True' because the statement is broadly accepted, despite the significant nuances and exceptions. It oversimplifies a complex topic into a binary answer.")
print("**Hidden Flaw:** The LLM's binary 'True/False' judgment fails to capture the conditional nature or limitations of the claim. It struggles with statements that are 'mostly true' but not universally or unconditionally true, especially when the context provided supports the general truth without elaborating on exceptions.")


### Naive Approach 2: 'True/False with Nuanced Claims'
This method fails when the LLM is asked to evaluate a nuanced claim as a simple true or false statement, even with context. The LLM may lack the ability to acknowledge the complexities, exceptions, or varying degrees of truth within a statement, leading to an oversimplified 'True' or 'False' response that misses critical subtleties.

User Question: Is the following statement true or false given the context?
Statement being evaluated (Model Answer): Exercise is good for mental health
Context provided: Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety.
LLM Judge Prompt:

Is the following statement true or false given the context? Return only True or False.
Statement: 'Exercise is good for mental health'
Context: 'Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety.'

LLM Response: True.
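
Note that the judge replied `True.` with a trailing period even though the prompt asked for only `True` or `False`, so a naive `== "True"` comparison would miss it. A minimal sketch of a more tolerant parser follows; the name `parse_verdict` is hypothetical, and treating every other reply as unusable is an assumption about how strict the pipeline should be.

```python
from typing import Optional

def parse_verdict(raw: str) -> Optional[bool]:
    """Map a judge reply to True/False, tolerating case, whitespace, and trailing punctuation."""
    cleaned = raw.strip().strip(".!\"'").lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None  # anything else is treated as an unusable verdict

print(parse_verdict("True."))
print(parse_verdict(" false "))
print(parse_verdict("Mostly true, with exceptions"))
```

The third call returns `None`, which also shows the limit of the binary format: a nuanced reply has nowhere to go.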

### Naive Approach 3: Direct Scoring

When true/false feels too limiting, people turn to scoring:
- "Numbers are objective!"
- "1-10 scale feels scientific"
- But without clear criteria, scores become arbitrary...

In [18]:
print("\n### Naive Approach 3: Direct Scoring")
print("A score can be completely arbitrary without a clear definition of what each number represents. A partially correct answer or even a hallucinated answer might receive a surprisingly high score if the LLM is designed to be helpful rather than strictly accurate.")
# User's example details
user_question_3 = "Explain how photosynthesis works"
model_answer_3 = "Plants use sunlight to make food. Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen."
llm_judge_prompt_3 = f"""
Rate this answer from 1-10 for quality. Only provide the number. Explain your reasoning.
Question: '{user_question_3}'
Answer: '{model_answer_3}'
"""

print(f"\nUser Question: {user_question_3}")
print(f"Model Answer: {model_answer_3}")
print(f"LLM Judge Prompt:\n{llm_judge_prompt_3}")

# Ask the judge once. llm.invoke() returns a message object whose .content holds the reply text.
# A single run cannot show inconsistency; repeating the call and comparing the scores would.
response_3 = llm.invoke(llm_judge_prompt_3)
print(f"LLM Response (Example Run 1): {response_3.content}")

# Illustrative only (not actual output): repeated runs could return scores such as 8, 7, 9.
# print(f"LLM Response (Example Run 2): 7")
# print(f"LLM Response (Example Run 3): 9")

print("\n**Hidden Flaw:** The LLM lacks a transparent and consistently applied internal rubric for 'quality'. Without explicit criteria provided in the prompt, its scoring becomes arbitrary, reflecting internal stochasticity rather than a stable evaluation of the answer's merit. In short: no clear criteria means arbitrary scoring.")




### Naive Approach 3: Direct Scoring
A score can be completely arbitrary without a clear definition of what each number represents. A partially correct answer or even a hallucinated answer might receive a surprisingly high score if the LLM is designed to be helpful rather than strictly accurate.

User Question: Explain how photosynthesis works
Model Answer: Plants use sunlight to make food. Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen.
LLM Judge Prompt:

Rate this answer from 1-10 for quality. Only provide the number. Explain your reasoning.
Question: 'Explain how photosynthesis works'
Answer: 'Plants use sunlight to make food. Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen.'

LLM Response (Example Run 1): 8

This answer provides a clear and concise explanation of the basic process of photosynthesis, including the key components involved (chlorophyll, sunlight, CO2, H2O, 
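
Score inconsistency is easier to reason about once it is measured. The sketch below repeats a judge call and summarizes the spread. The stub judge simply replays the illustrative scores 8, 7, 9 mentioned in the code comments rather than calling a model, and the names `score_spread` and `stub_judge` are assumptions.

```python
import statistics
from typing import Callable, Dict, List

def score_spread(judge: Callable[[str], int], prompt: str, runs: int = 3) -> Dict[str, float]:
    """Call the judge `runs` times on the same prompt and summarize how much the scores vary."""
    scores: List[int] = [judge(prompt) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if runs > 1 else 0.0,
        "range": max(scores) - min(scores),
    }

# Hypothetical stub: replays the illustrative scores 8, 7, 9 instead of calling a model.
_replayed_scores = iter([8, 7, 9])

def stub_judge(prompt: str) -> int:
    return next(_replayed_scores)

spread = score_spread(stub_judge, "Rate this answer from 1-10 for quality.")
print(spread)
```

With `temperature=0`, as in the setup cell, repeated runs may agree much more closely, so the spread is worth measuring rather than assuming.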

### Naive Approach 4: Compare two options

When direct scoring proves too arbitrary and inconsistent, people often pivot to comparing options:
- "Let the LLM pick the better one!"
- "It's how humans often evaluate choices, so it must be good."
- But this method is still susceptible to subtle biases that can skew the results...

In [15]:
print("\n### Naive Approach 4: Compare two options")
print("This method, popular in preference fine-tuning, is still susceptible to several biases, including position bias and verbosity bias. An LLM might prefer a response based on its length or its position in the prompt rather than its actual quality.")

# User's example details
user_question_4 = "Describe the city of New York."
model_answer_A_4 = "New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit."
model_answer_B_4 = "New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be."

# Prompt for Verbosity Bias (A is more verbose)
llm_judge_prompt_4_verbosity = f"""Which of the two following answers is better?
Answer A: '{model_answer_A_4}'
Answer B: '{model_answer_B_4}'"""

print(f"\nUser Question: {user_question_4}")
print(f"Model Answer A: {model_answer_A_4}")
print(f"Model Answer B: {model_answer_B_4}")
print(f"LLM Judge Prompt (A first):\n{llm_judge_prompt_4_verbosity}")

response_4_verbosity = llm.invoke(llm_judge_prompt_4_verbosity)
print(f"LLM Response (A first): {response_4_verbosity.content}")

print("\n**Hidden flaw (Verbosity Bias):** The judge is likely to favor the longer, more verbose Answer A, associating greater length with higher quality, even though Answer B is not necessarily wrong or inadequate for certain contexts. The LLM's response often reflects this preference by citing more detail or sophisticated language.")

# Prompt for Position Bias (swapping A and B)
llm_judge_prompt_4_position = f"""Which of the two following answers is better?
Answer A: '{model_answer_B_4}'
Answer B: '{model_answer_A_4}'""" # Swapped order

print(f"\nLLM Judge Prompt (B first):\n{llm_judge_prompt_4_position}")

response_4_position = llm.invoke(llm_judge_prompt_4_position)
print(f"LLM Response (B first): {response_4_position.content}")

print("\n**Hidden flaw (Position Bias):** When the order of the answers is swapped (as in the second prompt), there is a chance the LLM will favor the new 'Answer A' (now the less verbose one), revealing a preference for the first item presented. This bias is particularly problematic when the answers are of similar quality, because the LLM's preference can be swayed by the arbitrary ordering.")




### Naive Approach 4: Compare two options
This method, popular in preference fine-tuning, is still susceptible to several biases, including position bias and verbosity bias. An LLM might prefer a response based on its length or its position in the prompt rather than its actual quality.

User Question: Describe the city of New York.
Model Answer A: New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.
Model Answer B: New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.
LLM Judge Prompt (A first):
Which of the two following answers is better?
Answer A: 'New York City is a global hub for finance, culture, and media. It is the most populous cit
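
The order swap shown in the cell above can be turned into an automatic check: ask twice with the answers reversed and only accept a winner that both orderings agree on. This is a minimal sketch under stated assumptions. The judge is assumed to return the label `"A"` or `"B"`, and both stub judges are hypothetical stand-ins for an LLM call.

```python
from typing import Callable

def order_consistent_winner(judge: Callable[[str, str], str], first: str, second: str) -> str:
    """Ask twice with the answers swapped; only accept a winner both orderings agree on."""
    pick_original = judge(first, second)   # "A" means `first` wins here
    pick_swapped = judge(second, first)    # "A" means `second` wins here
    if pick_original == "A" and pick_swapped == "B":
        return "first"
    if pick_original == "B" and pick_swapped == "A":
        return "second"
    return "inconsistent"  # the verdict followed the position, not the content

# Hypothetical stubs standing in for an LLM judge.
def always_first_judge(answer_a: str, answer_b: str) -> str:
    return "A"  # pure position bias

def longer_wins_judge(answer_a: str, answer_b: str) -> str:
    return "A" if len(answer_a) >= len(answer_b) else "B"  # pure verbosity bias

detailed = "New York City is a global hub for finance, culture, and media."
brief = "New York City is a busy place."

print(order_consistent_winner(always_first_judge, detailed, brief))
print(order_consistent_winner(longer_wins_judge, detailed, brief))
```

The position-biased stub is flagged as `inconsistent`, but the verbosity-biased stub passes the check and picks the longer answer both times. Swapping the order detects position bias only; verbosity bias needs a separate test.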

## Summary: The Path Forward

### What We've Learned:

**The Problem:**
1. **Traditional evaluation methods** don't work for modern AI systems
2. **AI models fail silently** without proper evaluation
3. **Naive LLM judging approaches** have hidden flaws

**The Solution:**
1. **LLM as Judge** provides scalable, semantic evaluation
2. **Proper implementation** requires systematic approaches
3. **Success measurement** needs concrete metrics and bias detection

### Next Steps:
In the following sessions, we'll build sophisticated solutions:
- **Session 2**: Progressive improvements and structured approaches
- **Session 3**: Production-ready systems with bias detection
- **Final Challenge**: Building comprehensive evaluation pipelines

---

**You now understand both the promise and perils of LLM as Judge systems. Ready to build better solutions?**