# LLM Pipeline

![Architecture](../images/architecture.png)

This example walks through a basic structure you could use to implement the following flow:
1. Internal and external users provide feedback in the app on every AI interaction, with optional text comments. We encourage internal users not to post this type of feedback directly in Slack.
2. This feedback is sent to a Slack channel that we monitor, along with appropriate repro and debugging information.
3. We can easily replay the LLM call. By default, the replay uses the same inputs, prompt, model provider, and model name, but each of these dimensions can be changed and experimented with.
4. We iterate on the LLM call until the LLM consistently responds with the expected result. We use LLM-as-a-judge to solidify this expectation and prevent regressions.
5. This LLM request replay and LLM-as-a-judge system can be added to our eval test suite, so that it runs alongside our other evals. Evals run post-merge, and results are reported to Slack and internal analytics.
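
The replay in step 3 amounts to copying the recorded request and overriding one dimension at a time. A minimal sketch, assuming a hypothetical `LLMRequest` record and `replay_with` helper (neither is part of LangSmith or this app, and the provider and model strings are placeholders):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LLMRequest:
    """Hypothetical record of one LLM call; not a LangSmith type."""
    inputs: dict
    prompt: str
    provider: str
    model: str


def replay_with(original: LLMRequest, **overrides) -> LLMRequest:
    """Return a copy of the original request with selected fields changed."""
    return replace(original, **overrides)


# Recorded request (placeholder values for illustration only)
original = LLMRequest(
    inputs={"question": "Why is LangChain important?"},
    prompt="Explain like I'm five: {question}",
    provider="openai",
    model="model-a",
)

# Same inputs and prompt, different model
variant = replay_with(original, model="model-b")
```

Because the record is frozen, the original request stays untouched, so every variant can be compared against the same baseline.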

## Setup

Let's start by loading our environment variables from our .env file. 

In [1]:
from dotenv import load_dotenv
import os
load_dotenv(dotenv_path="../.env", override=True)
# Loads the following env variables
# LANGSMITH_TRACING=true
# LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
# LANGSMITH_PROJECT="eli5"
# LANGSMITH_API_KEY="<redacted>"
# OPENAI_API_KEY="<redacted>"
# TAVILY_API_KEY="<redacted>"

# SLACK_TOKEN="<redacted>"
# SLACK_CHANNEL="<redacted>"


First, we'll set up our LangSmith client so we can use the SDK.

In [2]:
import os
from langsmith import Client

client = Client()

ls_project = os.getenv("LANGSMITH_PROJECT")

We'll also set up a Slack client to send messages to Slack programmatically.

In [3]:
import os
from slack_sdk import WebClient

slack_token = os.getenv("SLACK_TOKEN")
channel_id = os.getenv("SLACK_CHANNEL")

slack_client = WebClient(token=slack_token)

## Gathering Feedback

We'll set up a dummy function that represents the function or endpoint your app will call when a user provides feedback.

In [4]:
def record_user_feedback(trace_id: str, feedback: dict):
    # Look up the run so we only record feedback for traces that exist
    runs = list(client.list_runs(run_ids=[trace_id]))
    run = runs[0] if runs else None
    if run:
        comment = feedback.get("comment")  # text feedback is optional
        client.create_feedback(
            key="user_rating",           # a feedback key/tag name
            score=feedback["rating"],    # a numeric score (can be 1 if just a comment)
            trace_id=trace_id,           # the trace ID to attach feedback to
            comment=comment              # the user comment string, if any
        )
        run_url = client.get_run_url(run=run, project_name=ls_project)
        slack_client.chat_postMessage(
            channel=channel_id,
            text=f"Feedback received for trace {trace_id}: {comment or '(no comment)'}\nView trace here: {run_url}"
        )

Next, let's test the function and add some feedback. We'll create a helper printing function to print results.

In [5]:
def pretty_print(trace_id: str):
    # Fetch the run for the trace ID passed in by the caller
    runs = list(client.list_runs(run_ids=[trace_id]))
    run = runs[0]
    print("Run Input ------------------------------------")
    print(run.inputs["question"] + "\n")
    print("Run Output -----------------------------------")
    print(run.outputs["output"] + "\n")
    print("Run Feedback ---------------------------------")
    print("Score: " + str(run.feedback_stats["user_rating"]["avg"]))
    print("Comments: "+ str(run.feedback_stats["user_rating"]["comments"]))

Let's run the function we created.

In [6]:
trace_id = "e97f7a67-5e89-48b3-a54d-bac17411bd6e"
feedback = {"rating": 4, "comment": "Does not mention the value of LangChain being open source"}
record_user_feedback(trace_id, feedback)

And print the results!

In [7]:
trace_id = "e97f7a67-5e89-48b3-a54d-bac17411bd6e"
pretty_print(trace_id)

Run Input ------------------------------------
Why is LangChain important?

Run Output -----------------------------------
Okay! Imagine you have a really smart robot that can talk and help you with questions, just like a super nice friend. But sometimes, it can be a bit confusing to tell the robot exactly what you want. 

LangChain is like a special toy tool that helps people make this robot even better at understanding and answering your questions. It helps your robot learn how to listen better and respond in a way that makes sense. 

So, if someone wants to create a new game or a fun chat with this robot, LangChain makes it easier for them. It’s like giving them the perfect recipe to make tasty cookies instead of just telling them to bake. This way, they can make their cool robot friend work faster and smarter! 

That’s why LangChain is important. It helps everyone build better talking robots that can help us in lots of fun and useful ways!

Run Feedback ----------------------------

## Playground

Next, with our feedback recorded, we can find the trace in LangSmith and bring it into the Prompt Playground to reproduce and iterate on it.
A link to a sample trace is available here: https://smith.langchain.com/public/2c87b0b2-23f9-4b0f-819e-bdd8a3aec544/r


## Evaluating Application Performance with Experiments

### Creating a Golden Dataset

It's a good idea to create a golden dataset of prompts we want our application to perform well on. If our application is receiving poor feedback for a given input, it'd be helpful to add that input to our golden dataset. This will allow us to test future versions of our application to make sure it gets the right answer to these problematic inputs!

To do this, we need to record what a good (or "golden") response from our application would have looked like: what response would have received positive feedback? We need that golden response so that we have something to measure our application's response against.

In [8]:
def add_to_golden_dataset(trace_id: str, golden_response: str, dataset_name: str):    
    # Fetch the run (trace) by ID
    runs = list(client.list_runs(run_ids=[trace_id]))
    run = runs[0] if runs else None
    
    if run:
        if not client.has_dataset(dataset_name=dataset_name):
            dataset = client.create_dataset(dataset_name=dataset_name)
        else:
            dataset = client.read_dataset(dataset_name=dataset_name)

        example = {
            "inputs": run.inputs,
            "outputs": {"golden": golden_response},
            "metadata": {"source": "trace"}
        }
        # Add example to dataset
        client.create_examples(dataset_id=dataset.id, examples=[example])

We can either create this golden response manually, using Annotation Queues in LangSmith, or generate it synthetically. One potential synthetic approach is to give an LLM the original response and the user's feedback, and have it revise the response to incorporate that feedback. Note that synthetic data should still be reviewed by humans before being added to your golden dataset.
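
The synthetic approach could look roughly like this. It is a sketch, not a definitive implementation: `REVISION_PROMPT`, `draft_golden_response`, and the `generate` callable (any function that sends a prompt to an LLM and returns its text) are hypothetical names introduced for illustration.

```python
from typing import Callable

# Hypothetical prompt; tune the wording for your own application
REVISION_PROMPT = """Revise the response so that it addresses the user's feedback.
Keep the original tone and change as little as possible.

<question>
{question}
</question>

<original_response>
{original_response}
</original_response>

<feedback>
{feedback}
</feedback>
"""


def draft_golden_response(
    question: str,
    original_response: str,
    feedback: str,
    generate: Callable[[str], str],
) -> str:
    """Ask an LLM (via `generate`) for a revised draft; a human should review it."""
    prompt = REVISION_PROMPT.format(
        question=question,
        original_response=original_response,
        feedback=feedback,
    )
    return generate(prompt)
```

The returned draft would then go to a human reviewer before being passed to `add_to_golden_dataset`.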

Let's add a manually created "golden response" for our trace.

In [10]:
trace_id = "e97f7a67-5e89-48b3-a54d-bac17411bd6e"
dataset_name = "eli5-golden-tough"
golden_response = """Okay! Imagine you have a really smart robot that can talk and help you with questions, just like a super nice friend. But sometimes, it can be a bit confusing to tell the robot exactly what you want. 

LangChain is like a special toy tool that helps people make this robot even better at understanding and answering your questions. It helps your robot learn how to listen better and respond in a way that makes sense. 

So, if someone wants to create a new game or a fun chat with this robot, LangChain makes it easier for them. It’s like giving them the perfect recipe to make tasty cookies instead of just telling them to bake. This way, they can make their cool robot friend work faster and smarter! 

The best part is that anyone who finds a new recipe can add it to LangChain to make the tool even better! That’s why LangChain is important. It helps everyone build better talking robots that can help us in lots of fun and useful ways!"""

add_to_golden_dataset(trace_id, golden_response, dataset_name)

### Defining an Evaluator

We'll define an LLM that will compare our application's answer to the golden response in our dataset. We'll tell it to judge based on accuracy. A sample prompt is below.

In [11]:
llm_judge_prompt = """
    You are an expert data labeler evaluating model outputs for accuracy. Your task is to assign a 0 or 1 based on the following rubric:

    <Rubric>
        An accurate answer:
        - Provides accurate information
        - Contains no factual errors
        - Is complete and not missing any significant information
    </Rubric>

    <input>
        {}
    </input>

    <output>
        {}
    </output>

    Use the reference outputs below to help you evaluate the correctness of the response:
    <reference_outputs>
        {}
    </reference_outputs>
"""

Once we have a prompt, we can define an evaluator function that actually sends our application's response and the dataset example to an LLM. We've created an accuracy function that does this and returns the result to be recorded in LangSmith.

In [12]:
from pydantic import BaseModel, Field

# Define a scoring schema that our LLM must adhere to
class AccuracyScore(BaseModel):
    """Correctness score of the answer when compared to the reference answer."""
    score: int = Field(description="The score of the correctness of the answer, from 0 to 1")

In [13]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


def accuracy(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    prompt = llm_judge_prompt.format(inputs["question"], outputs["output"], reference_outputs["golden"])
    structured_llm = ChatOpenAI(model_name="gpt-4o", temperature=0).with_structured_output(AccuracyScore)
    generation = structured_llm.invoke([HumanMessage(content=prompt)])
    return generation.score == 1
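
Because `llm_judge_prompt` uses positional `{}` placeholders, the arguments to `.format(...)` must stay in the same order as the template sections. One possible alternative, sketched here with a shortened, hypothetical template (`JUDGE_TEMPLATE` and `build_judge_prompt` are illustrative names, not part of the original notebook), is to use named placeholders so each value is tied to its section:

```python
# Shortened, hypothetical version of the judge prompt with named placeholders
JUDGE_TEMPLATE = """<input>
{question}
</input>

<output>
{answer}
</output>

<reference_outputs>
{reference}
</reference_outputs>
"""


def build_judge_prompt(inputs: dict, outputs: dict, reference_outputs: dict) -> str:
    """Fill the template by name, using the same dict keys as the evaluator above."""
    return JUDGE_TEMPLATE.format(
        question=inputs["question"],
        answer=outputs["output"],
        reference=reference_outputs["golden"],
    )
```

With named fields, reordering the sections of the template does not silently swap the application's answer with the reference answer.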


### Running an Experiment

We'll create a function that runs our application on the input format provided by our LangSmith dataset.

In [14]:
from eli5 import eli5

def run(inputs: dict):
    return eli5(inputs["question"])


Then we're ready to run an experiment! We've included a wrapper function that sends the results to Slack when they're ready.

In [None]:
import time
from langsmith import evaluate


def eval_and_report(run_func, dataset_name, evaluators, experiment_prefix):
    results = evaluate(
        run_func,
        data=dataset_name,
        evaluators=evaluators,
        experiment_prefix=experiment_prefix
    )
    results.wait()
    time.sleep(1) # experiment results process asynchronously and return when outputs exist so feedback may still be getting uploaded
    df = client.get_test_results(project_name=results.experiment_name)
    avg_score = df['feedback.accuracy'].mean()
    slack_client.chat_postMessage(
        channel=channel_id,
        text=f"Experiment {results.experiment_name} results: {avg_score} accuracy."
    )
    
eval_and_report(run, "eli5-golden-tough", [accuracy], "eli5-gpt-3.5")
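
`avg_score` above is the mean of the per-example accuracy flags. To turn that number into a regression check, one option is to compare it against a minimum acceptable value. A minimal sketch, assuming a hypothetical `summarize_accuracy` helper and an arbitrary default threshold of 0.9 (neither comes from the original pipeline):

```python
from typing import Sequence


def summarize_accuracy(scores: Sequence[bool], threshold: float = 0.9) -> dict:
    """Average boolean evaluator results and flag whether they meet the threshold."""
    if not scores:
        raise ValueError("no evaluator scores to summarize")
    accuracy = sum(scores) / len(scores)  # True counts as 1, False as 0
    return {"accuracy": accuracy, "n": len(scores), "passed": accuracy >= threshold}


# Illustrative input: three of four examples judged accurate
summary = summarize_accuracy([True, True, False, True], threshold=0.7)
```

The `passed` flag could be included in the Slack message so that a drop below the threshold stands out.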


View the evaluation results for experiment: 'eli5-gpt-3.5-0beadf0f' at:
https://smith.langchain.com/o/bcad64b4-50f8-4e66-a0be-8dbaf6f6619c/datasets/32cce1c0-b6fe-4be8-9f72-7784fcbf14b5/compare?selectedSessions=994357a9-fc38-4ec1-a1b6-002d47d3edca



