# Module 1: Metrics for Evaluation 

Outline
- Intro to LLM-as-judge
	- why use them
		- LLMs are being evaluated on increasingly complicated tasks
		- faster and cheaper than human evaluators
	- how we score
	- how we check if metrics are correct
- metrics Ragas has
	- faithfulness
	- answer_correctness
	- context_recall and context_entity_recall
	- context_precision
	- noise_sensitivity
	- rubric based method
- In Action
	- using metrics as a guiding light, not as an optimisation target
	- how to choose the Judge LLM
		- summarise the work we did to choose the Judge LLM for the assignment evaluation
	- the alignment problem
		- why is it hard
		- how can we do better

Slides are [here](./Evaluation%20for%20Search%20for%20RAG.pdf).

In [1]:
%reload_ext autoreload
%autoreload 2

import nest_asyncio
nest_asyncio.apply()

In [2]:
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())

True

In [3]:
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# start the phoenix app
session = px.launch_app()
# Initialize Langchain auto-instrumentation
tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
OpenTelemetry Tracing Details
|  Phoenix Project: default
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



In [4]:
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

llm = ChatOpenAI(model="gpt-4o", temperature=0)
judge_llm = LangchainLLMWrapper(llm)

In [5]:
def process_row(row, correct=False, column="response"):
    if correct:
        row[column] = row["correct"]
    else:
        row[column] = row["incorrect"]
    return row

### Faithfulness

In [19]:
row = {
    "user_input": "Where and when was Einstein born?",
    "retrieved_contexts": ["Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time"],
    "correct": "Einstein was born in Germany on 14th March 1879.",
    "incorrect": "Einstein was born in Germany.",
}

from ragas.metrics import faithfulness
faithfulness.llm = judge_llm

faithfulness.score(process_row(row, correct=False))



1.0
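
The "incorrect" response still scores 1.0 because faithfulness only checks whether each claim in the response is supported by the retrieved context, not whether the response is complete. A minimal sketch of that ratio, assuming the documented definition (supported claims divided by total claims). The function `faithfulness_ratio` and the hand-labelled verdicts are hypothetical; in Ragas the judge LLM extracts the claims and produces the verdicts.

```python
def faithfulness_ratio(claim_verdicts):
    """Fraction of response claims judged as supported by the context.

    claim_verdicts: list of booleans, one per claim extracted from the response.
    Hypothetical helper for illustration, not part of the Ragas API.
    """
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)


# Hand-labelled verdicts (assumed, standing in for the judge LLM's output).
incomplete_answer = [True]          # "born in Germany": supported by the context
complete_answer = [True, True]      # "born in Germany", "on 14 March 1879"
hallucinated_answer = [True, False]  # one supported claim, one unsupported claim

for name, verdicts in [
    ("incomplete", incomplete_answer),
    ("complete", complete_answer),
    ("hallucinated", hallucinated_answer),
]:
    print(name, faithfulness_ratio(verdicts))
```

An incomplete answer and a complete answer both reach the maximum score here, which is why faithfulness is usually read together with a correctness or recall metric.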

### Answer Correctness

In [7]:
row = {
    "user_input": "Where and when was Einstein born?",
    "reference": "Einstein was born in 1879 in Germany.",
    "correct": "In 1879, Einstein was born in Germany.",
    "incorrect": "Einstein was born in Spain in 1879.",
}

from ragas.metrics import answer_correctness, answer_similarity
from ragas.embeddings import embedding_factory

answer_correctness.llm = judge_llm
answer_similarity.embeddings = embedding_factory("text-embedding-3-small")
answer_correctness.answer_similarity = answer_similarity

In [8]:
answer_correctness.score(process_row(row, correct=True))



0.9837369223673363
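
The score above is not exactly 1.0 because answer correctness blends a factual-overlap score with an embedding similarity. A rough sketch of that combination, assuming an F1-style factual score over true positive, false positive, and false negative statements, and assuming weights of 0.75 and 0.25 (believed to be the Ragas defaults, but treat them as an assumption). The function names and the example counts are hypothetical.

```python
def factual_f1(tp, fp, fn):
    """F1-style score over statement counts (TP, FP, FN) labelled by a judge."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness_sketch(tp, fp, fn, similarity, weights=(0.75, 0.25)):
    """Weighted average of factual F1 and semantic similarity.

    weights: (factual weight, similarity weight); assumed values.
    similarity: embedding similarity between response and reference, in [0, 1].
    """
    w_fact, w_sim = weights
    total = w_fact + w_sim
    return (w_fact * factual_f1(tp, fp, fn) + w_sim * similarity) / total


# Hypothetical counts: every statement matches, similarity slightly below 1.
print(answer_correctness_sketch(tp=2, fp=0, fn=0, similarity=0.93))

# Hypothetical counts: one matching, one wrong, one missing statement.
print(answer_correctness_sketch(tp=1, fp=1, fn=1, similarity=0.90))
```

With a perfect factual score, any gap from 1.0 comes from the similarity term alone, which matches the pattern seen for the reworded "correct" response.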

### Context Recall

In [9]:
row = {
    "user_input": "Where and when was Einstein born?",
    "reference": "Einstein was born in 1879 in Germany.",
    "correct": ["Albert Einstein was born on March 14, 1879 in Ulm, Württemberg, Germany"],
    "incorrect": ["Einstein was born in Ulm, but his family moved to Munich when he was just six weeks old", 
                  "Einstein's birth was registered at the registry office in Ulm on March 15, 1879, the day after he was born",
                  "At the time of Einstein's birth, Ulm was a growing town of about 33,000 inhabitants"],
}

from ragas.metrics import context_recall
context_recall.llm = judge_llm

context_recall._required_columns

{<MetricType.SINGLE_TURN: 'single_turn'>: {'reference',
  'retrieved_contexts',
  'user_input'}}

In [10]:
context_recall.score(process_row(row, correct=False, column="retrieved_contexts"))



1.0

In [11]:

context_recall.score(process_row(row, correct=True, column="retrieved_contexts"))



1.0

### Context Entity Recall

In [12]:
from ragas.metrics import context_entity_recall
context_entity_recall.llm = judge_llm
context_entity_recall._required_columns
context_entity_recall.score(process_row(row, correct=False, column="retrieved_contexts"))



0.3333333322222222
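
Context entity recall compares the entities found in the reference with the entities found in the retrieved contexts. A minimal sketch of that set-based ratio follows. The entity sets are hypothetical guesses at what an extractor might return for this row, chosen to illustrate how a score near 1/3 can arise; the actual entities extracted by the judge LLM are not shown in the output above.

```python
def entity_recall(reference_entities, context_entities):
    """Share of reference entities that also appear in the retrieved contexts."""
    reference = set(reference_entities)
    if not reference:
        return 0.0
    return len(reference & set(context_entities)) / len(reference)


# Hypothetical entity sets (assumed, not taken from the Ragas run).
reference_entities = {"Einstein", "1879", "Germany"}
context_entities = {"Einstein", "Ulm", "Munich", "March 15, 1879"}

score = entity_recall(reference_entities, context_entities)
print(round(score, 4))
```

Because the match is on whole entities, a context that contains "March 15, 1879" does not count as covering the reference entity "1879" in this sketch, so the result depends heavily on how entities are extracted.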

### Context Precision

In [13]:
row = {
    "user_input": "Where and when was Einstein born?",
    "reference": "Einstein was born in 1879 in Germany.",
    "correct": ["Albert Einstein was born on March 14, 1879 in Ulm, Württemberg, Germany"],
    "incorrect": ["Einstein was born in Ulm, but his family moved to Munich when he was just six weeks old", 
                  "Einstein's birth was registered at the registry office in Ulm on March 15, 1879, the day after he was born"],
}

from ragas.metrics import context_precision
context_precision.llm = judge_llm
context_precision._required_columns

{<MetricType.SINGLE_TURN: 'single_turn'>: {'reference',
  'retrieved_contexts',
  'user_input'}}

In [14]:
context_precision.score(process_row(row, correct=True, column="retrieved_contexts"))



0.9999999999
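
Context precision rewards rankings that place relevant contexts first. A minimal sketch, assuming the usual formulation as the mean of precision@k taken over the positions that hold a relevant context. The relevance verdicts are hypothetical; in Ragas the judge LLM decides whether each retrieved context is useful for the reference. The value 0.9999999999 above is presumably 1.0 with a small constant added to the denominator to avoid division by zero, which this sketch does not reproduce.

```python
def context_precision_sketch(relevance):
    """Mean precision@k over the ranks that hold a relevant context.

    relevance: list of booleans in retrieval order (rank 1 first).
    Hypothetical helper for illustration, not the Ragas implementation.
    """
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant rank
    return precision_sum / hits if hits else 0.0


print(context_precision_sketch([True]))          # single relevant context
print(context_precision_sketch([False, True]))   # relevant context ranked second
print(context_precision_sketch([True, False, True]))
```

The same set of contexts scores lower when the relevant one is ranked later, so the metric is sensitive to ordering and not only to what was retrieved.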

### Rubric Based Metrics

In [15]:
from ragas import evaluate
from datasets import Dataset, DatasetDict

from ragas.metrics import labelled_rubrics_score, reference_free_rubrics_score


responses = [
    "The Longest river is Ganga",
    "The Longest river is Nile",
    "The longest river in the world is the Nile, stretching approximately 6,650 kilometers (4,130 miles) through northeastern Africa, flowing through countries such as Uganda, Sudan, and Egypt before emptying into the Mediterranean Sea. There is some debate about this title, as recent studies suggest the Amazon River could be longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers (4,350 miles)."
]
rows = {
    "user_input": [
        "What's the longest river in the world?",
    ],
    "reference": [
        "The Nile is a major north-flowing river in northeastern Africa.",
    ],
    "response": [
        responses[2],
    ],
    "retrieved_contexts": [
        [
            "Scientists debate whether the Amazon or the Nile is the longest river in the world. Traditionally, the Nile is considered longer, but recent information suggests that the Amazon may be longer.",
            "The Nile River was central to the Ancient Egyptians' rise to wealth and power. Since rainfall is almost non-existent in Egypt, the Nile River and its yearly floodwaters offered the people a fertile oasis for rich agriculture.",
            "The world's longest rivers are defined as the longest natural streams whose water flows within a channel, or streambed, with defined banks.",
            "The Amazon River could be considered longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers."
        ],
    ]
}



dataset = Dataset.from_dict(rows)

result = evaluate(
    dataset,
    metrics=[
        labelled_rubrics_score,
        reference_free_rubrics_score
    ],
)

result.to_pandas()

ImportError: cannot import name 'labelled_rubrics_score' from 'ragas.metrics' (D:\Projects\rag-to-riches\.venv\Lib\site-packages\ragas\metrics\__init__.py)