## Evaluating an Azure OpenAI RAG System

Retriever: e.g., Azure Cognitive Search or a vector DB.

Generator: e.g., Azure OpenAI gpt-4o-mini.

For each query:

- Run the retriever → get top-k docs (contexts)

- Run the generator → get the LLM answer (answer)

- Collect both together with the gold answer(s) for that query
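The three steps above can be sketched as a small collection loop. This is a minimal sketch, not a definitive implementation: `build_eval_records`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the two stand-in functions only mimic a real retriever and generator so the loop can run on its own. The record keys mirror the labels used in the list (contexts, answer); rename them to whatever schema your evaluation library expects.

```python
from typing import Callable


def build_eval_records(
    test_set: list[dict],
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str, list[str]], str],
    k: int = 3,
) -> list[dict]:
    """Run the retriever and generator for each query and collect the results."""
    records = []
    for item in test_set:
        question = item["question"]
        contexts = retrieve(question, k)       # step 1: top-k docs
        answer = generate(question, contexts)  # step 2: LLM answer
        records.append({                       # step 3: keep with the gold answer(s)
            "question": question,
            "answer": answer,
            "contexts": contexts,
            "ground_truths": item["ground_truths"],
        })
    return records


# Hypothetical stand-ins; replace with calls to your search index and Azure OpenAI deployment.
def fake_retrieve(question: str, k: int) -> list[str]:
    corpus = [
        "Tesla, Inc. is led by CEO Elon Musk...",
        "Tesla builds electric cars.",
        "SpaceX launches rockets.",
    ]
    return corpus[:k]


def fake_generate(question: str, contexts: list[str]) -> str:
    return "Elon Musk is the CEO of Tesla."


test_set = [{"question": "Who is the CEO of Tesla?", "ground_truths": ["Elon Musk"]}]
records = build_eval_records(test_set, fake_retrieve, fake_generate, k=2)
print(records[0]["answer"])
```

Passing the retriever and generator in as callables keeps the loop independent of any particular SDK, so the same records can be rebuilt after swapping either component.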

Then run RAGAS evaluation:

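```python
from ragas import evaluate, EvaluationDataset
from ragas.metrics import Faithfulness, ResponseRelevancy

# ragas 0.2+ field names: user_input, response, retrieved_contexts, reference.
eval_dataset = EvaluationDataset.from_list([
    {
        "user_input": "Who is the CEO of Tesla?",
        "response": "Elon Musk is the CEO of Tesla.",
        "retrieved_contexts": ["Tesla, Inc. is led by CEO Elon Musk..."],
        "reference": "Elon Musk",
    }
])

# evaluator_llm is assumed to be defined earlier, e.g. a ragas LLM wrapper
# around the Azure OpenAI chat deployment used as the judge model.
results = evaluate(
    dataset=eval_dataset,
    metrics=[Faithfulness(), ResponseRelevancy()],
    llm=evaluator_llm,
)
print(results)
```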


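Example output (`hit_rate` and `mrr` are not built-in RAGAS metrics; they are assumed here to come from a separate retrieval evaluation and to be reported alongside the RAGAS scores):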

```json
{
  "hit_rate": 1.0,
  "mrr": 0.9,
  "answer_relevancy": 0.98,
  "faithfulness": 0.95
}
```

| Metric               | What it measures                         | Example insight                      |
| :------------------- | :--------------------------------------- | :----------------------------------- |
| **Hit Rate**         | Retriever found at least one correct doc | If low → retriever misses key info   |
| **MRR**              | How early the correct doc appears        | If low → retriever ranks docs poorly |
| **Answer Relevancy** | Whether the answer answers the question  | If low → LLM not following query     |
| **Faithfulness**     | Whether answer is supported by context   | If low → hallucination issue         |
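To make the two retriever metrics in the table concrete, here is a minimal sketch of how they could be computed. It assumes each query has a ranked list of retrieved document IDs and a set of gold document IDs; the function names and the sample IDs are illustrative, not part of any library.

```python
def hit_rate(retrieved: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one gold doc among the top-k results."""
    if not retrieved:
        return 0.0
    hits = sum(
        1 for docs, relevant in zip(retrieved, gold)
        if any(doc in relevant for doc in docs[:k])
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], gold: list[set[str]]) -> float:
    """Average of 1/rank of the first gold doc per query (0 if none is retrieved)."""
    if not retrieved:
        return 0.0
    total = 0.0
    for docs, relevant in zip(retrieved, gold):
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Illustrative data: three queries, top-3 results each.
retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
gold = [{"d1"}, {"d5"}, {"d0"}]

print(hit_rate(retrieved, gold, k=3))         # 2 of 3 queries have a hit
print(mean_reciprocal_rank(retrieved, gold))  # (1/1 + 1/2 + 0) / 3 = 0.5
```

Hit Rate only asks whether a correct document shows up at all, while MRR rewards placing it near the top, which is why the two can diverge for the same retriever.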


### 🔹 Key Idea

The benchmark you compare against is the set of gold answers (and sometimes gold docs) in your test dataset.

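Retriever metrics (Hit Rate, MRR) compare the retrieved docs against the gold docs or, when only gold answers are available, check whether a retrieved doc contains the gold answer.

Generator metrics compare the generated answer against the question (Answer Relevancy) and against the retrieved contexts (Faithfulness); metrics that score correctness additionally use the gold answers.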
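When the test set has gold answers but no labeled gold docs, one common workaround is to treat a retrieved context as relevant if it contains a gold answer string. The sketch below assumes that simple case-insensitive substring rule; `first_relevant_rank` is a hypothetical helper, and stricter matching (normalization, fuzzy match, or an LLM judge) may be needed in practice.

```python
from typing import Optional


def first_relevant_rank(contexts: list[str], gold_answers: list[str]) -> Optional[int]:
    """Return the 1-based rank of the first context containing any gold answer."""
    golds = [g.lower() for g in gold_answers]
    for rank, context in enumerate(contexts, start=1):
        text = context.lower()
        if any(g in text for g in golds):
            return rank
    return None  # no retrieved context mentions a gold answer


contexts = [
    "Tesla builds electric cars.",
    "Tesla, Inc. is led by CEO Elon Musk...",
]
rank = first_relevant_rank(contexts, ["Elon Musk"])
print(rank)  # 2, so this query counts as a hit with reciprocal rank 1/2
```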

### ✅ In short:
When you evaluate your Azure OpenAI RAG system:

- Hit Rate & MRR → How well your retriever finds and ranks relevant docs.

- Answer Relevancy → Whether your generated answer actually answers the question.

- Faithfulness → Whether your answer is supported by retrieved context (not hallucinated).

Together, these metrics let you pinpoint weaknesses and decide whether to improve retrieval quality, LLM grounding, or both.
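As a rough illustration of that triage, the sketch below flags every metric that falls under a cutoff and attaches the matching insight. The `diagnose` helper and the 0.7 threshold are assumptions chosen for illustration, not recommended values, and the scores in the example are made up.

```python
def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """List the likely weak spots for metrics scoring below the threshold."""
    hints = {
        "hit_rate": "retriever misses key info",
        "mrr": "retriever ranks docs poorly",
        "answer_relevancy": "LLM not following query",
        "faithfulness": "hallucination issue",
    }
    # Metrics absent from `scores` are skipped rather than flagged.
    return [
        f"{name}: {hint}"
        for name, hint in hints.items()
        if scores.get(name, 1.0) < threshold
    ]


# Illustrative scores: weak retrieval, healthy generation.
example_scores = {"hit_rate": 0.55, "mrr": 0.4, "answer_relevancy": 0.98, "faithfulness": 0.95}
for finding in diagnose(example_scores):
    print(finding)
```

In this example only the two retriever metrics are flagged, which points at retrieval quality rather than LLM grounding.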