# Real-time Automated Feedback


### 1. Theory: From Offline Evaluation to Real-time Monitoring

While evaluating your LLM applications on a static dataset is crucial for development, monitoring their performance in a live production environment presents a different challenge. You want to understand how your application is performing on real-world, unpredictable inputs, often without having ground-truth labels.

This is where **real-time automated feedback** comes in. The core idea is to use another LLM as an impartial judge to score your application's outputs as they happen. This is achieved using **reference-free evaluators**—metrics like "helpfulness," "conciseness," or "lack of toxicity" that do not require a predefined "correct" answer. 

This tutorial demonstrates how to attach such an evaluator as a **callback** to your chain. This causes the evaluator to run automatically and asynchronously every time your chain is invoked, sending the resulting feedback scores to LangSmith. LangSmith can then aggregate this feedback, allowing you to create monitoring charts and track the quality of your deployment over time.

![model-based feedback monitoring charts](./img/feedback_charts.png)

If these automated metrics reveal a drop in quality, you can easily filter and isolate the problematic runs in LangSmith for debugging, analysis, or for creating new evaluation datasets.

The process involves two main steps:

1.  **Define Feedback Logic**: We'll create a custom `RunEvaluator` that encapsulates the logic for scoring a run (in this case, for "helpfulness").
2.  **Include in Callbacks**: We'll use the `EvaluatorCallbackHandler` to attach our evaluator to the chain, ensuring it runs automatically after each invocation.

We'll be using LangSmith, so make sure you have the necessary API keys configured.

### 2. Prerequisites and Setup

First, we'll install the necessary Python packages and configure our environment variables to connect to LangSmith and the OpenAI API.

In [1]:
# The '%pip install' magic installs Python packages from within the notebook.
# The -U flag upgrades each package to its latest version.
# --quiet suppresses the installation output for a cleaner interface.
# %pip install -U langchain langchain-openai langsmith python-dotenv --quiet

In [2]:
import os # Import the 'os' module to interact with the operating system.

# Set the environment variable to enable LangSmith tracing.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Update with your API URL if using a hosted instance of LangSmith.
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# Update with your API key.
# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
# Change to your project name to organize runs in LangSmith.
os.environ["LANGCHAIN_PROJECT"] = "realtime_monitoring"

In [3]:
import os  # Read and write environment variables.

from dotenv import load_dotenv  # Load environment variables from a .env file.

# Load variables from the .env file. `override=True` makes values from the
# .env file overwrite existing environment variables of the same name.
load_dotenv(dotenv_path=".env", override=True)

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# Copy each key into the variable name the libraries read. os.environ only
# accepts strings, so fail early with a clear message if a key is missing.
for target, source in [
    ("LANGCHAIN_API_KEY", "LANGSMITH_API_KEY"),
    ("OPENAI_API_KEY", "OPENAI_API_KEY"),
]:
    value = os.getenv(source)
    if value is None:
        raise RuntimeError(f"Missing {source}; add it to your .env file.")
    os.environ[target] = value

With the environment configured, it's time to define the feedback pipeline.

## Step 1: Define Feedback Logic

The first step is to define the logic that will generate our feedback scores. In LangSmith, all feedback must have a `key` (like "helpfulness") and a nullable numeric `score`. 

We will create a custom `RunEvaluator` class. This class will wrap one of LangChain's built-in `score_string` evaluators. This off-the-shelf evaluator uses an LLM to judge an input/prediction pair against a specified criterion. Our custom class will handle the logic of extracting the correct input and output from the run trace and passing them to the underlying scoring evaluator.
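
To make the `normalize_by=10` argument used in the next cell concrete: the `score_string` evaluator asks the judge for a rating from 1 to 10, and `normalize_by` divides that rating so the reported score falls between 0 and 1. The sketch below only illustrates that arithmetic. `normalize_score` is a hypothetical helper written for this example, not LangChain's internal code.

```python
from typing import Optional


def normalize_score(raw_rating: Optional[int], normalize_by: float = 10) -> Optional[float]:
    """Scale a 1-10 judge rating into the 0-1 range; pass None through unchanged."""
    if raw_rating is None:
        # No rating could be produced, so the feedback score stays null.
        return None
    return raw_rating / normalize_by


print(normalize_score(7))     # 0.7
print(normalize_score(None))  # None
```

A normalized score keeps "helpfulness" on the same 0-1 scale as most other feedback keys, which makes charts easier to compare.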

In [4]:
from typing import Optional # Import typing hints.
from langchain.evaluation import load_evaluator # Import a helper to load built-in evaluators.
from langsmith.evaluation import RunEvaluator, EvaluationResult # Import the base classes for custom evaluation.
from langsmith.schemas import Run, Example # Import the Run and Example schemas from LangSmith.


# Define our custom evaluator class, inheriting from the base RunEvaluator.
class HelpfulnessEvaluator(RunEvaluator):
    def __init__(self):
        # Initialize a built-in 'score_string' evaluator for the 'helpfulness' criterion.
        self.evaluator = load_evaluator(
            "score_string", criteria="helpfulness", normalize_by=10
        )

    # This is the core method that will be called for each run.
    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> EvaluationResult:
        # A safety check to ensure the run has the necessary inputs and outputs.
        if (
            not run.inputs
            or not run.inputs.get("input")
            or not run.outputs
            or not run.outputs.get("output")
        ):
            # Return a null score if the required fields are not present.
            return EvaluationResult(key="helpfulness", score=None)
        # Call the underlying evaluator with the run's input and output.
        result = self.evaluator.evaluate_strings(
            input=run.inputs["input"], prediction=run.outputs["output"]
        )
        # Return the final result, mapping the evaluator's output onto the
        # LangSmith feedback fields explicitly.
        return EvaluationResult(
            key="helpfulness",
            score=result.get("score"),
            comment=result.get("reasoning"),
        )

By defining this `RunEvaluator`, we have created a reusable component that can generate "helpfulness" feedback for any run, linking the feedback to the specific evaluation trace that produced it.
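
The safety check in `evaluate_run` matters because failed or still-pending runs can reach the evaluator with no outputs. The sketch below isolates that check so it can be tried without calling an LLM. `extract_io` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for LangSmith `Run` objects, assumed here to expose only `inputs` and `outputs`.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def extract_io(run) -> Optional[Tuple[str, str]]:
    """Return (input, output) when both are present and non-empty, else None."""
    inputs = getattr(run, "inputs", None) or {}
    outputs = getattr(run, "outputs", None) or {}
    question = inputs.get("input")
    answer = outputs.get("output")
    if not question or not answer:
        # Mirrors the tutorial's guard: nothing to score for this run.
        return None
    return question, answer


# Hypothetical stand-ins for LangSmith Run objects.
complete_run = SimpleNamespace(
    inputs={"input": "Why is the sky blue?"},
    outputs={"output": "Rayleigh scattering."},
)
failed_run = SimpleNamespace(inputs={"input": "Why is the sky blue?"}, outputs=None)

print(extract_io(complete_run))
print(extract_io(failed_run))  # None
```

When the helper returns `None`, the evaluator in this tutorial would return a null score instead of raising, so one malformed run does not interrupt monitoring.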

## Step 2: Include the Evaluator in Callbacks

Now we'll use the **`EvaluatorCallbackHandler`**. This is a special LangChain callback that takes our custom evaluator and runs it automatically in a separate thread whenever a run is completed. This is the key to achieving real-time feedback without blocking our main application.
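
To illustrate the "separate thread" idea, here is a standard-library sketch of the pattern, not the actual LangChain implementation. `BackgroundEvaluatorHandler`, `on_run_end`, and `slow_length_evaluator` are hypothetical names chosen for this example, and runs are modeled as plain dictionaries.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import Callable, Dict, List


class BackgroundEvaluatorHandler:
    """Toy stand-in: runs evaluators on a thread pool after each completed run."""

    def __init__(self, evaluators: List[Callable[[Dict], Dict]]):
        self.evaluators = evaluators
        self.executor = ThreadPoolExecutor(max_workers=2)
        self.futures: List[Future] = []

    def on_run_end(self, run: Dict) -> None:
        # Submit the work and return immediately so the caller is not blocked.
        for evaluator in self.evaluators:
            self.futures.append(self.executor.submit(evaluator, run))

    def wait_for_futures(self) -> List[Dict]:
        # Block until every submitted evaluation has finished.
        wait(self.futures)
        return [future.result() for future in self.futures]


def slow_length_evaluator(run: Dict) -> Dict:
    time.sleep(0.1)  # Simulate a slow LLM judge call.
    return {"key": "non_empty", "score": 1.0 if run["output"] else 0.0}


handler = BackgroundEvaluatorHandler([slow_length_evaluator])
handler.on_run_end({"input": "Where is Antioch?", "output": "In present-day Turkey."})
results = handler.wait_for_futures()
print(results)
```

In recent `langchain_core` versions the real `EvaluatorCallbackHandler` also exposes a `wait_for_futures()` method. Calling it at the end of a short script can help ensure pending evaluations finish before the process exits.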

First, let's define the simple chain we want to monitor.

In [5]:
from langchain_core.output_parsers import StrOutputParser # Import the string output parser.
from langchain_core.prompts import ChatPromptTemplate # Import the chat prompt template.
from langchain_openai import ChatOpenAI # Import the OpenAI chat model wrapper.

# Define a simple chain: it takes an input, passes it to a chat model, and parses the output as a string.
chain = (
    ChatPromptTemplate.from_messages([("user", "{input}")])
    | ChatOpenAI(model="gpt-3.5-turbo")  # Specify the model to use.
    | StrOutputParser()
)

Next, we create an instance of our evaluator and pass it to the `EvaluatorCallbackHandler`.

In [6]:
from langchain_core.tracers import EvaluatorCallbackHandler # Import the callback handler.

# Create an instance of our custom evaluator.
evaluator = HelpfulnessEvaluator()

# Create the callback handler, passing a list containing our evaluator instance.
feedback_callback = EvaluatorCallbackHandler(evaluators=[evaluator])

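Creating the evaluator may log the following warning. It comes from the `score_string` judge inside our evaluator, not from the chain being monitored:

> This chain was only tested with GPT-4. Performance may be significantly worse with other models.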


Finally, we invoke our chain for a series of queries. The crucial step is to pass our `feedback_callback` in the `config` dictionary of the `invoke` method. This attaches the handler to the run, ensuring our evaluator is triggered automatically upon completion.

In [7]:
# A list of example queries to run through our chain.
queries = [
    "Where is Antioch?",
    "What was the US's inflation rate in 2018?",
    "Who were the stars in the show Friends?",
    "How much wood could a woodchuck chuck if a woodchuck could chuck wood?",
    "Why is the sky blue?",
    "When is Rosh hashanah in 2023?",
]

# Loop through each query.
for query in queries:
    # Invoke the chain, passing the query and the callback handler in the config.
    chain.invoke({"input": query}, {"callbacks": [feedback_callback]})

Now, check your project in LangSmith. You will see the runs appear, and shortly after, the "helpfulness" feedback scores will be automatically attached to each run.

![feedback_result](./img/feedback_result.png)
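
Once scores are attached, a monitoring chart is essentially an aggregate of those scores over time. The sketch below shows one plausible aggregation, a daily mean that skips null scores, computed locally. The records are made-up illustrative values and `daily_mean` is a hypothetical helper; how LangSmith itself aggregates feedback is not specified here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Hypothetical (day, score) feedback records; None marks a run that could not be scored.
feedback: List[Tuple[str, Optional[float]]] = [
    ("2024-01-01", 0.9),
    ("2024-01-01", 0.7),
    ("2024-01-02", None),
    ("2024-01-02", 0.4),
]


def daily_mean(records: List[Tuple[str, Optional[float]]]) -> Dict[str, float]:
    """Average the non-null scores for each day, in chronological order."""
    by_day = defaultdict(list)
    for day, score in records:
        if score is not None:
            by_day[day].append(score)
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


daily = daily_mean(feedback)
print(daily)
```

A drop from one day to the next, as in this made-up data, is the kind of signal that would prompt filtering the low-scoring runs for debugging.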

## Conclusion

Congratulations! You have successfully configured a reference-free evaluator to run automatically every time your chain is called. This is a powerful technique for gathering real-time performance metrics on your live LLM applications.

By logging this automated feedback to LangSmith, you can create robust monitoring dashboards, track quality over time, and quickly identify and debug issues as they arise in production.