# Creating an Automated Feedback Pipeline with LangSmith


### 1. Theory: Automated Monitoring with Algorithmic Feedback

While manual review of your LLM application's traces is valuable for deep debugging, it is not a scalable solution for monitoring a production system. To understand performance at scale, we need automated metrics. **Algorithmic feedback** is the process of programmatically generating quality scores for your application's runs *after* they have been completed. 

Unlike real-time feedback, which runs as a callback while the application executes, algorithmic feedback is typically run as a scheduled job (e.g., every hour or once a day) on a batch of recent production traces. By enriching your traces with these automated scores, you can build monitoring dashboards in LangSmith that track metrics like relevance, verbosity, or correctness over time.
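
As a rough illustration of the scheduled-job idea, the sketch below computes the bounds of the most recently completed time window, which a periodic job could use to pick the batch of traces to score. The function name `previous_window` and the epoch-aligned windowing are assumptions made for this example; they are not part of LangSmith.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def previous_window(now: datetime, interval: timedelta) -> Tuple[datetime, datetime]:
    """Return the [start, end) bounds of the most recently completed interval.

    Windows are aligned to the Unix epoch, so hourly windows start on the hour
    and daily windows start at midnight UTC.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Number of whole intervals that have elapsed since the epoch.
    elapsed = (now - epoch) // interval
    end = epoch + elapsed * interval
    return end - interval, end


# Example: an hourly job triggered at 14:07 UTC would score the 13:00-14:00 window.
start, end = previous_window(
    datetime(2024, 1, 15, 14, 7, tzinfo=timezone.utc), timedelta(hours=1)
)
print(start.isoformat(), end.isoformat())
```

Aligning windows to fixed boundaries, rather than to "now minus one hour", means consecutive job executions neither overlap nor leave gaps, even if the scheduler fires a few minutes late.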

![model-based feedback monitoring charts](./img/feedback_charts.png)

If these metrics indicate a problem, you can easily filter for the low-scoring runs in LangSmith to identify problematic inputs, debug issues, or curate a dataset for fine-tuning.

This tutorial will walk you through the process:

1.  **Filter Runs**: Select a batch of completed runs from a LangSmith project that you want to score.
2.  **Define Feedback Logic**: Create functions or chains that calculate your desired feedback metrics. We will show examples of both simple statistical metrics and more powerful AI-assisted metrics.
3.  **Send Feedback to LangSmith**: Use the LangSmith client to attach the generated scores to the original runs.

We'll be using LangSmith and the LangChain Hub, so make sure you have the necessary API keys.
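
Before touching any real API, the three steps listed above can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding for illustration: `FakeRun`, `FakeClient`, `answer_length`, and `run_pipeline` are not LangSmith classes or functions, and the real client calls used later in this tutorial take more arguments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeRun:
    """Hypothetical stand-in for a traced run (not the real LangSmith schema)."""
    id: str
    inputs: Dict[str, str]
    outputs: Dict[str, str]


@dataclass
class FakeClient:
    """Hypothetical in-memory client that records feedback instead of sending it."""
    runs: List[FakeRun]
    feedback: List[dict] = field(default_factory=list)

    def list_runs(self) -> List[FakeRun]:
        return list(self.runs)

    def create_feedback(self, run_id: str, key: str, score: float) -> None:
        self.feedback.append({"run_id": run_id, "key": key, "score": score})


def answer_length(run: FakeRun) -> Dict[str, float]:
    """Toy metric: the number of words in the run's output."""
    return {"answer_words": float(len(run.outputs["output"].split()))}


def run_pipeline(
    client: FakeClient, scorers: List[Callable[[FakeRun], Dict[str, float]]]
) -> int:
    """Score every run with every scorer and return the number of runs processed."""
    runs = client.list_runs()  # Step 1: select the runs to score.
    for run in runs:
        for scorer in scorers:
            for key, score in scorer(run).items():  # Step 2: compute feedback.
                client.create_feedback(run.id, key=key, score=score)  # Step 3: send it.
    return len(runs)


demo_client = FakeClient(
    runs=[FakeRun("r1", {"input": "Hi"}, {"output": "Hello there friend"})]
)
processed = run_pipeline(demo_client, [answer_length])
print(processed, demo_client.feedback)
```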

### 2. Prerequisites and Setup

First, we configure our environment variables. This is a secure way to provide API keys to our application.

**Action Required**: You must replace the placeholder values with your actual keys and provide a project name to work with.
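
One way to catch a missing key early is to check the environment before any client is created. The sketch below is an optional helper, not something LangSmith provides; the name `require_env` and the demo variable `EXAMPLE_API_KEY` are made up for this example.

```python
import os
from typing import Dict, Iterable


def require_env(names: Iterable[str]) -> Dict[str, str]:
    """Return the named environment variables, raising if any is unset or empty."""
    names = list(names)
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Demo with a placeholder variable so the example runs anywhere.
os.environ["EXAMPLE_API_KEY"] = "placeholder"
print(require_env(["EXAMPLE_API_KEY"]))
```

In this tutorial you would call it with the variable names you actually rely on, for example `require_env(["LANGSMITH_API_KEY", "OPENAI_API_KEY"])` after loading your `.env` file.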

In [1]:
import os # Import the 'os' module to interact with the operating system.

# Uncomment and update with your API URL if using a hosted instance of LangSmith.
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# Uncomment and update with your API key.
# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
# Update with your Hub API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_HUB_API_URL"] = "https://api.hub.langchain.com"
# Reuse the LangSmith API key for the Hub, if it is already set.
# os.environ only accepts strings, so assigning a missing (None) key directly would raise a TypeError.
hub_api_key = os.getenv("LANGSMITH_API_KEY")
if hub_api_key:
    os.environ["LANGCHAIN_HUB_API_KEY"] = hub_api_key
# Change to the project name you want to add feedback to.
project_name = "12_algo_feedback"

In [2]:
from dotenv import load_dotenv # Import function to load environment variables
import os # Import the 'os' module to interact with the operating system.

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint as an environment variable.

# Fail early with a clear message if a key is missing; os.environ cannot store None.
langsmith_api_key = os.getenv("LANGSMITH_API_KEY")
if not langsmith_api_key or not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set LANGSMITH_API_KEY and OPENAI_API_KEY in your .env file.")
os.environ["LANGCHAIN_API_KEY"] = langsmith_api_key # Set your LangSmith API key as an environment variable.
os.environ["LANGCHAIN_HUB_API_KEY"] = langsmith_api_key # The Hub uses the same key.
# OPENAI_API_KEY is already in the environment after load_dotenv, so no copy is needed.

To ensure we have a common starting point, the following cell will create some example runs in the specified project. In a real-world scenario, you would be targeting runs generated by your live application.

In [3]:
from langsmith import Client # Import the LangSmith client.
from datetime import datetime # Import the datetime module.

client = Client() # Instantiate the client.
# Define a list of example input/output pairs.
example_data = [
    ("Who trained Llama-v2?", "I'm sorry, but I don't have that information."),
    (
        "When did langchain first announce the hub?",
        "LangChain first announced the LangChain Hub on September 5, 2023.",
    ),
    (
        "What's LangSmith?",
        "LangSmith is a platform developed by LangChain for building production-grade LLM (Language Model) applications. It allows you to debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework. LangSmith seamlessly integrates with LangChain's open-source framework called LangChain, which is widely used for building applications with LLMs.\n\nLangSmith provides full visibility into model inputs and outputs at every step in the chain of events, making it easier to debug and analyze the behavior of LLM applications. It has been tested with early design partners and on internal workflows, and it has been found to help teams in various ways.\n\nYou can find more information about LangSmith on the official LangSmith documentation [here](https://docs.smith.langchain.com/). Additionally, you can read about the announcement of LangSmith as a unified platform for debugging and testing LLM applications [here](https://blog.langchain.dev/announcing-langsmith/).",
    ),
    (
        "What is the langsmith cookbook?",
        "I'm sorry, but I couldn't find any information about the \"Langsmith Cookbook\". It's possible that it may not be a well-known cookbook or it may not exist. Could you provide more context or clarify the name?",
    ),
    (
        "What is LangChain?",
        "I'm sorry, but I couldn't find any information about \"LangChain\". Could you please provide more context or clarify your question?",
    ),
    ("When was Llama-v2 released?", "Llama-v2 was released on July 18, 2023."),
]

# Loop through the example data to create runs in your project.
for input_, output_ in example_data:
    client.create_run(
        name="ExampleRun", # The name of the run.
        run_type="chain", # The type of the run.
        inputs={"input": input_}, # The inputs to the run.
        outputs={"output": output_}, # The outputs of the run.
        project_name=project_name, # The project to associate the run with.
        end_time=datetime.utcnow(), # The end time of the run.
    )

## Step 1: Select Runs to Evaluate

The first step in our feedback pipeline is to select the runs we want to score. The LangSmith client's `list_runs` method provides a powerful way to filter runs. You can filter by project, time, presence of errors, metadata tags, and more. 

In this example, we'll filter for all successful runs in our project that occurred since midnight UTC.
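
The cell below uses `datetime.utcnow()`, which returns a naive datetime and is deprecated as of Python 3.12. A timezone-aware way to compute the same cutoff is sketched here; `utc_midnight` is a name chosen for this example, and whether your installed `langsmith` version accepts aware datetimes for `start_time` is an assumption worth verifying.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def utc_midnight(now: Optional[datetime] = None) -> datetime:
    """Return the most recent midnight in UTC as a timezone-aware datetime."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Convert to UTC first so inputs in other time zones are handled correctly.
    return now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )


# 01:00 at UTC+3 is still 22:00 on the previous day in UTC.
local_time = datetime(2024, 1, 15, 1, 0, tzinfo=timezone(timedelta(hours=3)))
print(utc_midnight(local_time).isoformat())
```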

In [4]:
# Get the current time and set it to midnight UTC.
midnight = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)

# Fetch the list of runs from the specified project.
runs = list(
    client.list_runs(
        project_name=project_name, # Filter by the project name.
        execution_order=1, # Only return root (top-level) runs, not nested child runs.
        start_time=midnight, # Filter for runs that started after midnight.
        error=False # Filter for runs that completed successfully.
    )
)

With our target runs selected, it's time to define the feedback logic.

## Step 2: Define Feedback Logic

Now we'll define the algorithms that generate our feedback scores. Any function or chain can be used. We will demonstrate three different approaches.
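
Because any callable can serve as feedback logic, the examples that follow all share one pattern: apply a function to every run concurrently and keep going when a single run fails. The sketch below imitates that pattern with the standard library only. It is a rough analogue of calling `.batch(..., return_exceptions=True)` with a `max_concurrency` setting, not LangChain's actual implementation, and `apply_concurrently` and `word_count` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List


def apply_concurrently(
    fn: Callable[[Any], Any], items: Iterable[Any], max_concurrency: int = 10
) -> List[Any]:
    """Apply fn to each item on a thread pool.

    Results come back in input order. If fn raises for an item, the exception
    object is returned in that position instead of aborting the whole batch.
    """

    def safe_call(item: Any) -> Any:
        try:
            return fn(item)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(safe_call, items))


def word_count(text: str) -> int:
    """Toy feedback function that fails on empty input."""
    if not text:
        raise ValueError("empty input")
    return len(text.split())


results = apply_concurrently(word_count, ["a b c", "", "hello"])
print(results)
```

Returning exceptions in place lets you inspect which inputs failed after the batch finishes, which is harder to do when errors are silently swallowed inside the feedback function.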

#### Example A: Simple Text Statistics

First, we'll show how to apply a simple, non-LLM algorithm. We will use the `textstat` library to compute various readability scores (like Flesch reading ease) for the *input* to each run. This can be useful for understanding the complexity of user queries your application is receiving.
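
To make one of these metrics concrete, here is a simplified pure-Python approximation of the Flesch reading ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The syllable counter is a crude vowel-group heuristic assumed for this sketch; `textstat` uses its own, more careful counting, so its scores will differ from these.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def approx_flesch_reading_ease(text: str) -> float:
    """Approximate Flesch reading ease; higher scores mean easier text."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    # Treat each run of terminal punctuation as one sentence boundary.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(word) for word in words)
    return (
        206.835
        - 1.015 * (len(words) / sentences)
        - 84.6 * (syllables / len(words))
    )


simple_query = "What is this?"
complex_query = "Explain asynchronous retrieval orchestration."
print(approx_flesch_reading_ease(simple_query) > approx_flesch_reading_ease(complex_query))
```

Short words and short sentences push the score up, so a terse question scores as easier to read than one packed with long technical terms.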

In [None]:
# # Install the textstat library.
# %pip install textstat --quiet

Note: you may need to restart the kernel to use updated packages.

In [7]:
import textstat # Import the textstat library.
from langsmith.schemas import Run, Example # Import the Run and Example schemas.
from langchain_core.runnables import RunnableLambda # Import RunnableLambda for batch processing.


# Define a function to compute and log text statistics for a single run.
def compute_stats(run: Run) -> None:
    # Check if the run has the 'input' key we want to measure.
    if "input" not in run.inputs:
        return
    # Check if this run has already been scored to avoid redundant work.
    if run.feedback_stats and "smog_index" in run.feedback_stats:
        return
    text = run.inputs["input"] # Get the input text.
    try:
        # A list of readability metric functions to compute.
        fns = [
            "flesch_reading_ease",
            "flesch_kincaid_grade",
            "smog_index",
            "coleman_liau_index",
            "automated_readability_index",
        ]
        # Compute each metric and store it in a dictionary.
        metrics = {fn: getattr(textstat, fn)(text) for fn in fns}
        # Loop through the computed metrics.
        for key, value in metrics.items():
            # Use the client to create feedback for the original run.
            client.create_feedback(
                run.id, # The ID of the run to attach feedback to.
                key=key, # The name of the metric (e.g., 'smog_index').
                score=value,  # The numeric score, used for monitoring charts.
                feedback_source_type="model", # Mark the feedback as model-generated (automated) rather than submitted via the API.
            )
    except Exception:
        # Pass silently if textstat fails on a given input.
        pass

In [8]:
# Wrap our function in a RunnableLambda and use .batch() to apply it concurrently to all runs.
_ = RunnableLambda(compute_stats).batch(
    runs,
    {"max_concurrency": 10}, # Control the level of concurrency.
    return_exceptions=True, # Prevent the whole batch from failing if one run errors.
)

#### Example B: AI-Assisted Feedback

While simple statistics are useful, **AI-assisted feedback** can capture qualities that surface-level statistics miss. Here, we'll use an LLM as a judge to score our runs on more subjective or complex criteria. This allows you to create metrics that are highly specific to your application's goals.

In this example, we will create an evaluator chain that scores each user query along several axes: `relevance` (to LangChain), `difficulty`, `verbosity`, and `specificity`. We will use a pre-built prompt from the LangChain Hub and OpenAI's function-calling feature to ensure the LLM returns a structured JSON output with these scores.
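
Structured output from an LLM is still worth validating before it is logged. The sketch below checks that all four scores are present and within the 0 to 5 range, then normalizes them to the 0 to 1 scale used for feedback in this tutorial. The helper name `normalize_scores` and the constants are assumptions for this example and do not come from LangChain or LangSmith.

```python
from typing import Any, Dict, Mapping

EXPECTED_KEYS = ("relevance", "difficulty", "verbosity", "specificity")
MAX_SCORE = 5


def normalize_scores(raw: Mapping[str, Any]) -> Dict[str, float]:
    """Validate the grader's 0-5 integer scores and rescale them to 0-1."""
    normalized: Dict[str, float] = {}
    for key in EXPECTED_KEYS:
        if key not in raw:
            raise ValueError(f"Missing score: {key}")
        value = int(raw[key])  # Accepts ints or numeric strings such as "2".
        if not 0 <= value <= MAX_SCORE:
            raise ValueError(f"{key}={value} is outside 0..{MAX_SCORE}")
        normalized[key] = value / MAX_SCORE
    return normalized


example_output = {"relevance": 5, "difficulty": 0, "verbosity": "2", "specificity": 4}
print(normalize_scores(example_output))
```

Rejecting out-of-range values keeps a single malformed model response from skewing the averages shown on a monitoring chart.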

In [9]:
from langchain import hub # Import the LangChain Hub client.

# Pull a pre-made prompt for this task from the Hub.
prompt = hub.pull(
    "wfh/automated-feedback-example", api_url="https://api.hub.langchain.com"
)

In [11]:
print(prompt)

input_variables=['prediction', 'question'] input_types={} partial_variables={} metadata={'lc_hub_owner': 'wfh', 'lc_hub_repo': 'automated-feedback-example', 'lc_hub_commit_hash': '36dd15a97ff473f5629e56cb327f5d20a4c3b01c4fda0bfce29d678843b46a08'} messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], input_types={}, partial_variables={}, template="You are grading user questions posed to LangChain's technical discussion board. LangChain is a software framework for building applications with large language models.\n You must rate the questions by relevance, difficulty, verbosity, and specificity."), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['prediction', 'question'], input_types={}, partial_variables={}, template='Grade the following question on a scale from 0 to 5:\n\n<question>\n{question}\n</question>\n\nThe bot responded:\n\n<response>\n{prediction}\n</response>'), additional_kwargs={}), SystemMessagePromptTempla

In [12]:
from langchain_core.output_parsers.openai_functions import JsonOutputFunctionsParser # Import the function output parser.
from langchain_core.tracers.context import collect_runs # Import a context manager to capture traces.
from langchain_openai import ChatOpenAI # Import the OpenAI chat model wrapper.

# Define the evaluator chain.
chain = (
    prompt
    # Bind a function-calling schema to the LLM to force structured output.
    | ChatOpenAI(model="gpt-3.5-turbo", temperature=1).bind(
        functions=[
            {
                "name": "submit_scores",
                "description": "Submit the graded scores for a user question and bot response.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "relevance": {"type": "integer", "minimum": 0, "maximum": 5, "description": "Score indicating the relevance of the question to LangChain/LangSmith."},
                        "difficulty": {"type": "integer", "minimum": 0, "maximum": 5, "description": "Score indicating the complexity or difficulty of the question."},
                        "verbosity": {"type": "integer", "minimum": 0, "maximum": 5, "description": "Score indicating how verbose the question is."},
                        "specificity": {"type": "integer", "minimum": 0, "maximum": 5, "description": "Score indicating how specific the question is."},
                    },
                    "required": ["relevance", "difficulty", "verbosity", "specificity"],
                },
            }
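        ],
        # Require a call to submit_scores so the parser always receives structured arguments.
        function_call={"name": "submit_scores"},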
    )
    | JsonOutputFunctionsParser() # Parse the LLM's function call into a JSON object.
)


# Define a function to evaluate a single run.
def evaluate_run(run: Run) -> None:
    try:
        if "input" not in run.inputs or not run.outputs or "output" not in run.outputs:
            return
        if run.feedback_stats and "specificity" in run.feedback_stats:
            return
        # Use collect_runs to capture the trace of the evaluator chain itself.
        with collect_runs() as cb:
            result = chain.invoke(
                {
                    "question": run.inputs["input"][:3000],  # Truncate to avoid context length issues.
                    "prediction": run.outputs["output"][:3000],
                },
            )
            # Loop through the scores returned by the evaluator chain.
            for feedback_key, value in result.items():
                score = int(value) / 5 # Normalize the score to be between 0 and 1.
                # Create the feedback for the original run.
                client.create_feedback(
                    run.id,
                    key=feedback_key,
                    score=score,
                    # Link the feedback to the evaluator's run trace for auditability.
                    source_run_id=cb.traced_runs[0].id,
                    feedback_source_type="model",
                )
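    except Exception as e:
        # Report the failure instead of silently discarding it, then move on to the next run.
        print(f"Failed to evaluate run {run.id}: {e}")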


wrapped_function = RunnableLambda(evaluate_run)

In [13]:
# Concurrently apply the AI-assisted feedback logic to all runs.
_ = wrapped_function.batch(runs, {"max_concurrency": 10}, return_exceptions=True)

After the feedback has been logged, you can read the project's aggregate feedback stats. It may take a few moments for the stats to update asynchronously.
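
To see what an aggregate of this kind summarizes, the sketch below groups hypothetical feedback records by key and reports a count and an average for each. The record layout and the `{"n": ..., "avg": ...}` shape are assumptions made for illustration; check the output of the next cell for the exact structure your LangSmith version returns.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, TypedDict


class FeedbackRecord(TypedDict):
    """Hypothetical minimal feedback record used only in this sketch."""
    key: str
    score: Optional[float]


def aggregate_feedback(records: List[FeedbackRecord]) -> Dict[str, Dict[str, float]]:
    """Group scores by feedback key and compute the count and mean of each group."""
    by_key: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        # Skip records without a numeric score (e.g. an evaluator that abstained).
        if record["score"] is not None:
            by_key[record["key"]].append(record["score"])
    return {
        key: {"n": len(scores), "avg": mean(scores)}
        for key, scores in by_key.items()
    }


sample_records: List[FeedbackRecord] = [
    {"key": "relevance", "score": 1.0},
    {"key": "relevance", "score": 0.5},
    {"key": "completeness", "score": None},
    {"key": "completeness", "score": 0.0},
]
print(aggregate_feedback(sample_records))
```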

In [14]:
# The project's feedback_stats are updated asynchronously.
client.read_project(project_name=project_name).feedback_stats

#### Example C: Using LangChain Evaluators

LangChain provides a number of pre-built, reference-free evaluators that you can use out-of-the-box. These can be easily integrated into a feedback pipeline. For more details on the available types, check out the [LangChain evaluation documentation](https://python.langchain.com/docs/guides/productionization/evaluation).

Below, we will demonstrate this by wrapping a `criteria` evaluator in a custom `RunEvaluator`. The criterion we'll use is "completeness".

In [15]:
from typing import Optional # Import typing hints.
from langchain import evaluation, callbacks # Import LangChain evaluation components.
from langsmith import evaluation as ls_evaluation # Import LangSmith evaluation components.


# Define our custom evaluator class, inheriting from the base RunEvaluator.
class CompletenessEvaluator(ls_evaluation.RunEvaluator):
    def __init__(self):
        # Define the criterion for the evaluator.
        criteria_description = (
            "Does the answer provide sufficient and complete information "
            "to fully address all aspects of the question (Y)?"
            " Or does it lack important details (N)?"
        )
        # Load the built-in 'criteria' evaluator with our custom criterion.
        self.evaluator = evaluation.load_evaluator(
            "criteria", criteria={"completeness": criteria_description}
        )

    # This is the core method that will be called for each run.
    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> ls_evaluation.EvaluationResult:
        # Safety check for required fields.
        if (
            not run.inputs
            or not run.inputs.get("input")
            or not run.outputs
            or not run.outputs.get("output")
        ):
            return ls_evaluation.EvaluationResult(key="completeness", score=None)
        question = run.inputs["input"]
        prediction = run.outputs["output"]
        # Use collect_runs to capture the trace of the evaluator itself.
        with callbacks.collect_runs() as cb:
            result = self.evaluator.evaluate_strings(
                input=question, prediction=prediction
            )
            run_id = cb.traced_runs[0].id
        # Return the result, linking the feedback to the evaluator's trace.
        return ls_evaluation.EvaluationResult(
            key="completeness", evaluator_info={"__run": {"run_id": run_id}}, **result
        )

By using `collect_runs` and passing the resulting run ID to the `evaluator_info` dictionary, we create a direct link in the LangSmith UI from the feedback score on the original run to the trace of the evaluator that produced that score. This is extremely useful for auditing and debugging your feedback logic.

In [16]:
evaluator = CompletenessEvaluator() # Instantiate our completeness evaluator.

# You could run this in a simple for loop:
# for run in runs:
#     client.evaluate_run(run, evaluator)

# Or, run it concurrently for better performance.
# The `client.evaluate_run` method handles both scoring and logging the feedback.
wrapped_function = RunnableLambda(lambda run: client.evaluate_run(run, evaluator))
_ = wrapped_function.batch(runs, {"max_concurrency": 10}, return_exceptions=True)

Check out your project in LangSmith again to see the new "completeness" feedback scores appear on your runs.

## Conclusion

Congratulations! You've successfully set up an algorithmic feedback pipeline to programmatically add quality scores to your traced runs. This is a powerful technique for enhancing your monitoring capabilities, helping you curate high-quality datasets for fine-tuning, and gaining deeper insights into your application's usage in production.