
# LLM RAG Evaluation with MLflow Example Notebook

This notebook demonstrates how to evaluate a **Retrieval‑Augmented Generation (RAG)** system built with [LangChain](https://python.langchain.com/) and [Chroma](https://www.trychroma.com/) using **MLflow's GenAI evaluation** capabilities. It follows the [MLflow GenAI documentation example](https://mlflow.org/docs/3.1.3/genai/eval-monitor/notebooks/rag-evaluation/) for version 3.1.3.

We will:

- Set up the environment and note the recommended package versions.
- Build a simple RAG system that answers questions about the MLflow documentation using LangChain and Chroma.
- Define custom evaluation metrics for *faithfulness* and *relevance* using `mlflow.metrics.genai`.
- Evaluate the RAG system on a small set of questions with `mlflow.evaluate()` and inspect the results.

> **Note:** To run this notebook successfully, you must have valid API credentials for OpenAI. Set your `OPENAI_API_KEY` (and optional Azure variables) as environment variables before running the cells that interact with the language model.



## Prerequisites

The following table lists the package versions recommended by the MLflow documentation; the example was written against them. Newer versions may also work, but if you encounter issues, revert to these:

| Package               | Version |
|----------------------|---------|
| `langchain`           | 0.1.16  |
| `langchain-community` | 0.0.33  |
| `langchain-openai`    | 0.0.8   |
| `openai`             | 1.12.0  |
| `mlflow`             | 2.12.1  |
| `chromadb`           | 0.4.24  |

Install the packages (if they are not already installed) via `pip`:

```bash
pip install langchain==0.1.16 langchain-community==0.0.33 langchain-openai==0.0.8 openai==1.12.0 mlflow==2.12.1 chromadb==0.4.24
```

### Setting API credentials

MLflow and LangChain rely on language model providers such as OpenAI.
To authenticate with OpenAI, set the following environment variable in your shell **before** starting your notebook kernel:

```bash
export OPENAI_API_KEY="<YOUR-OPENAI-API-KEY>"
```

If you are using Azure OpenAI, you may need additional variables:

```bash
export OPENAI_API_TYPE="azure"
export OPENAI_API_VERSION="<YYYY-MM-DD>"
export OPENAI_API_KEY="<YOUR-AZURE-OPENAI-KEY>"
export OPENAI_API_BASE="<YOUR-AZURE-OPENAI-ENDPOINT>"
export OPENAI_DEPLOYMENT_NAME="<DEPLOYMENT-NAME>"
```

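Variables exported in the shell are only visible to a kernel started afterwards, so restart your notebook kernel once they are set. Then run the cell below to confirm that the key is present in `os.environ`. Alternatively, set the key directly from the notebook by uncommenting the assignment in that cell.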


```python
import os

# If OPENAI_API_KEY is already set in your environment, this cell prints a confirmation.
# Otherwise, uncomment the next line to set it manually.
# os.environ["OPENAI_API_KEY"] = "<YOUR-OPENAI-API-KEY>"

print("OPENAI_API_KEY is set" if os.environ.get("OPENAI_API_KEY") else "OPENAI_API_KEY is not set")
```
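When several variables are required (for example the Azure set), a small helper can report everything that is missing in one pass. The sketch below is illustrative only: `missing_env_vars` is a hypothetical helper, not part of MLflow or LangChain, and the names you pass to it depend on your provider.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical usage: check the one key this notebook needs.
missing = missing_env_vars(["OPENAI_API_KEY"])
print("All set" if not missing else f"Missing: {', '.join(missing)}")
```

Passing an explicit mapping as `env` makes the helper easy to check without touching the real environment.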



## Create a RAG system

A **Retrieval‑Augmented Generation (RAG)** system combines a language model with a document retrieval component. In this example, we:

1. **Load the MLflow documentation** using LangChain's `WebBaseLoader`.
2. **Split the documents** into manageable chunks using `CharacterTextSplitter`.
3. **Embed the text** with OpenAI embeddings via `OpenAIEmbeddings`.
4. **Store and search** embeddings using a `Chroma` vector store.
5. **Create a `RetrievalQA` chain** that uses the vector store as a retriever and OpenAI as the language model.

This setup allows the model to search the documentation for relevant passages before answering questions, improving factual accuracy.


```python
import pandas as pd
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter
# The loader and vector store live in langchain-community; the old
# `langchain.document_loaders` / `langchain.vectorstores` paths are deprecated.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAI, OpenAIEmbeddings
import mlflow

# 1. Load the MLflow documentation
loader = WebBaseLoader("https://mlflow.org/docs/latest/index.html")
# This call fetches the webpage and parses its content
documents = loader.load()

# 2. Split documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# 3. Create the embedding function for the text
embeddings = OpenAIEmbeddings()

# 4. Embed the chunks and build a Chroma vector store (in memory by default)
docsearch = Chroma.from_documents(texts, embeddings)

# 5. Construct a RetrievalQA chain using an OpenAI language model
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
```
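To build intuition for steps 2 to 4 without calling any API, here is a toy sketch of chunking and similarity search that uses only the standard library. It is a stand-in under loud assumptions: word-count vectors replace OpenAI embeddings, a plain list replaces Chroma, and `split_text`, `embed`, `cosine`, and `top_k` are hypothetical helper names rather than LangChain functions.

```python
import math
import re
from collections import Counter
from typing import List


def split_text(text: str, chunk_size: int) -> List[str]:
    """Greedily pack whole sentences into chunks of roughly `chunk_size` characters.

    A single sentence longer than `chunk_size` is kept whole.
    """
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


doc = (
    "MLflow Tracking records parameters and metrics. "
    "MLflow Models packages models for deployment. "
    "The evaluate API scores model outputs."
)
chunks = split_text(doc, chunk_size=60)
print(top_k("How do I package models?", chunks, k=1))
```

With this sample text each sentence becomes its own chunk, and the query retrieves the "MLflow Models" sentence because it is the only chunk sharing a word ("models") with the question. Real embeddings capture meaning beyond exact word overlap, which is why the notebook uses `OpenAIEmbeddings` instead.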



## Wrap the RAG chain in a model function

To make our RAG chain compatible with MLflow's evaluation interface, we wrap it in a simple function that accepts a pandas DataFrame of inputs (with a `questions` column) and returns a list of results. Each result contains the model's answer and the source documents retrieved by the chain.


```python
def model(input_df: pd.DataFrame):
    '''Run the RAG pipeline on each row of the input DataFrame.

    Parameters
    ----------
    input_df : pd.DataFrame
        A DataFrame containing a 'questions' column.

    Returns
    -------
    list
        A list of results from the RetrievalQA chain, including answers and source docs.
    '''
    answers = []
    for _, row in input_df.iterrows():
        answers.append(qa(row["questions"]))
    return answers
```



## Create an evaluation dataset

We'll define a small evaluation dataset containing questions about MLflow. You can expand or modify this list to test your RAG system more thoroughly.


```python
eval_df = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "How to run mlflow.evaluate()?",
            "How to log_table()?",
            "How to load_table()?",
        ]
    }
)
# Display the evaluation DataFrame
eval_df
```



## Define a faithfulness metric

**Faithfulness** measures how consistent the model's output is with the provided context. We create a list of `EvaluationExample` objects, each containing an input question, a model output, the context the output is graded against (`grading_context`), a score (1–5), and a justification. These serve as few-shot examples for the judge model, showing it how to rate future outputs.

The scoring rubric for faithfulness is summarized below:

- **1:** None of the output is supported by the context.
- **2:** A minority of claims are supported; most are unsupported or incorrect.
- **3:** Roughly half of the output is supported.
- **4:** Most claims are supported, with little extraneous information.
- **5:** All claims are directly supported by the context.

You can inspect the full grading prompt by printing the metric or accessing its `metric_details` attribute.


```python
from mlflow.metrics.genai import EvaluationExample, faithfulness

# Create two reference examples to calibrate faithfulness
faithfulness_examples = [
    EvaluationExample(
        input="How do I disable MLflow autologging?",
        output=(
            "mlflow.autolog(disable=True) will disable autologging for all functions. "
            "In Databricks, autologging is enabled by default."
        ),
        score=2,
        justification=(
            "The output provides a working solution using the mlflow.autolog() function, "
            "but includes extra information not fully supported by the context."
        ),
        grading_context={
            "context": (
                "mlflow.autolog(log_input_examples: bool = False, "
                "log_model_signatures: bool = True, log_models: bool = True, "
                "log_datasets: bool = True, disable: bool = False, exclusive: bool = False, "
                "disable_for_unsupported_versions: bool = False, silent: bool = False, "
                "extra_tags: Optional[Dict[str, str]] = None) → None: "
                "Enables (or disables) and configures autologging for all supported integrations."
            )
        },
    ),
    EvaluationExample(
        input="How do I disable MLflow autologging?",
        output="mlflow.autolog(disable=True) will disable autologging for all functions.",
        score=5,
        justification=(
            "The output correctly identifies the function call needed to disable autologging "
            "without adding unsupported details."
        ),
        grading_context={
            "context": (
                "mlflow.autolog(log_input_examples: bool = False, "
                "log_model_signatures: bool = True, log_models: bool = True, "
                "log_datasets: bool = True, disable: bool = False, exclusive: bool = False, "
                "disable_for_unsupported_versions: bool = False, silent: bool = False, "
                "extra_tags: Optional[Dict[str, str]] = None) → None: "
                "Enables (or disables) and configures autologging for all supported integrations."
            )
        },
    ),
]

faithfulness_metric = faithfulness(model="openai:/gpt-4", examples=faithfulness_examples)
faithfulness_metric
```



## Define a relevance metric

**Relevance** assesses how well the model's answer addresses the user's question given the provided context. The metric uses a rubric similar to faithfulness, focusing on the alignment between the input question and the output answer.

For reference, a high relevance score requires that the answer fully and accurately addresses the question using the contextual information. A low score indicates irrelevance or off‑topic content.


```python
from mlflow.metrics.genai import relevance

# Create a relevance metric using the same judge model
relevance_metric = relevance(model="openai:/gpt-4")
relevance_metric
```



## Evaluate the RAG system with MLflow

We can now evaluate our RAG system using `mlflow.evaluate()`.

- `model`: our wrapper function that runs the retrieval and generation.
- `eval_df`: the dataset of questions.
- `model_type`: set to `"question-answering"` so MLflow uses the appropriate evaluator.
- `evaluators`: set to `"default"` to use MLflow's built‑in evaluator for question‑answering.
- `predictions`: the key in the model's output that holds the answer text; `RetrievalQA` returns the answer under `"result"`.
- `extra_metrics`: a list of additional metrics to compute; we include the `faithfulness_metric` and `relevance_metric` defined above, as well as the built-in `mlflow.metrics.latency()`.
- `evaluator_config`: its `col_mapping` tells the metrics where to find their arguments. Here `inputs` is read from the `questions` column of `eval_df`, and `context` from the `source_documents` returned by the chain.

This step may take several minutes and will make calls to the OpenAI API.
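As a rough mental model of how `predictions` and `col_mapping` fit together, the sketch below resolves the arguments a judged metric needs (`inputs`, `predictions`, `context`) from one evaluation row plus one chain output. It is a simplified illustration, not MLflow's implementation: `resolve_metric_args` is a hypothetical helper, and the sample answer and retrieved chunks are placeholders.

```python
from typing import Any, Dict


def resolve_metric_args(
    data_row: Dict[str, Any],
    model_output: Dict[str, Any],
    predictions: str,
    col_mapping: Dict[str, str],
) -> Dict[str, Any]:
    """Pick the prediction and map metric argument names to actual columns."""
    # Columns from the input data and from the model output are looked up together.
    available = {**data_row, **model_output}
    resolved = {"predictions": model_output[predictions]}
    for metric_arg, column in col_mapping.items():
        resolved[metric_arg] = available[column]
    return resolved


# One row of the evaluation dataset.
row = {"questions": "What is MLflow?"}

# Keys returned by RetrievalQA when return_source_documents=True;
# the values here are placeholders, not real model output.
output = {
    "query": "What is MLflow?",
    "result": "<answer text>",
    "source_documents": ["<retrieved chunk 1>", "<retrieved chunk 2>"],
}

args = resolve_metric_args(
    row,
    output,
    predictions="result",
    col_mapping={"inputs": "questions", "context": "source_documents"},
)
print(sorted(args))
```

The judge model then sees the question as `inputs`, the answer as `predictions`, and the retrieved passages as `context`, which is what the faithfulness and relevance prompts are graded against.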


```python
results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[faithfulness_metric, relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)

# Display aggregated metrics
results.metrics
```


```python
# Display the detailed per-example results table
results.tables["eval_results_table"]
```



## Interpret the results

The output of `mlflow.evaluate()` includes two components:

1. **Aggregated metrics** (`results.metrics`): statistical summaries such as mean, variance, and 90th percentile (p90) for each metric (toxicity, readability scores, faithfulness, relevance, etc.). These help you understand overall performance.

2. **Detailed table** (`results.tables["eval_results_table"]`): per‑question results that include the model's answer, the retrieved source documents, latency, token counts, and the scores/justifications for each custom metric. Reviewing this table helps diagnose specific strengths and weaknesses of your RAG system.
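To make the aggregated statistics concrete, the sketch below computes mean, variance, and p90 from a handful of per-question scores using only the standard library. The scores are made up for illustration, `percentile` and `summarize` are hypothetical helpers, and the conventions used (population variance, linear interpolation between ranks, matching NumPy's defaults) are assumptions about how such summaries are typically computed rather than a statement of MLflow's exact implementation.

```python
import math
import statistics
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def summarize(scores: List[float]) -> Dict[str, float]:
    """Aggregate per-question scores into mean, population variance, and p90."""
    return {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),
        "p90": percentile(scores, 90),
    }


# Hypothetical faithfulness scores for the four evaluation questions.
print(summarize([5, 4, 5, 2]))
```

For these four scores the mean is 4.0, the variance is 1.5, and the p90 is 5.0. A high p90 next to a noticeably lower mean is a hint that one or two questions score poorly, which is the cue to open the detailed table and read the justifications for those rows.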

`mlflow.evaluate()` logs these metrics and tables to the active MLflow run (starting one if none is active), so you can compare runs across different models or retrieval strategies in the MLflow UI.
