# 📓 The GenAI Revolution Cookbook

**Title:** How to Use Langfuse Tracing for Prompts, Evals, and Cost Control

**Description:** Instrument calls with Langfuse tracing to see traces/spans, version prompts safely, A/B on datasets, and tie evaluations to p95 latency, token cost, and acceptance-rate improvements.

**📖 Read the full article:** [How to Use Langfuse Tracing for Prompts, Evals, and Cost Control](https://blog.thegenairevolution.com/article/how-to-use-langfuse-tracing-for-prompts-evals-and-cost-control)

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself!*

## Why Use Langfuse for This Problem
When you're building LLM-powered applications, you quickly realize you're flying blind without proper observability. You need to see what's happening with model calls, how your prompts are performing, and what users actually think of the responses. I've tried cobbling together solutions with OpenTelemetry and custom logging, and honestly, it's a mess. Langfuse provides a unified workflow that just makes sense for LLM apps:

- **Unified model of traces, generations, and prompts** – Unlike trying to stitch together OpenTelemetry with custom logging or using something like Helicone, Langfuse actually understands LLM concepts natively. Generations, prompt versions, scores - they're all first-class citizens in one platform. No more glue code.
- **Prompt version pinning and A/B testing** – Sure, LangSmith and Phoenix offer tracing, but Langfuse's Prompt Library lets you version, pin, and A/B test prompts directly through the SDK. This is huge - you can experiment without managing some external config system.
- **Dataset-driven evaluation workflows** – Here's where it gets really interesting. Langfuse supports running experiments on datasets and logging scores per trace. You can set up CI-gated quality checks and prevent regressions with minimal custom code.

This guide walks through instrumenting a question-answering (QA) endpoint end-to-end with Langfuse. We'll capture traces, spans, generations, scores, and prompt versions. By the end, you'll have:

- A fully traced QA flow (input parsing, retrieval, generation, feedback)
- Prompt versioning via Langfuse's Prompt Library
- A/B experiments on two prompt versions
- A dataset evaluation loop with automated scoring
- A CI gating snippet to prevent regressions

You'll be able to track p95 latency, tokens per request, and user acceptance. This gives you a baseline for quality, lets you A/B test changes confidently, and helps you roll out improvements without breaking things.

## Core Concepts for This Use Case
Langfuse organizes everything around five main primitives. Each one maps to a specific step in your QA workflow:

- **Trace** – This is a single user request (like "What is the warranty period?"). It contains metadata (user ID, environment) and aggregates all the nested activity.
- **Span** – A logical step within a trace (input parsing, retrieval, that sort of thing). Use spans to measure latency and figure out where your bottlenecks are.
- **Generation** – A model call (your OpenAI completion, for instance). Langfuse logs input, output, model name, token usage, and latency automatically.
- **Score** – A quality or performance metric attached to a trace. Could be user acceptance, a groundedness heuristic, or LLM-as-judge helpfulness. Scores enable dataset evaluation and A/B comparison.
- **Prompt (via Prompt Library)** – A versioned, reusable template you fetch by label and version. Pin a version in production or fetch the latest in dev to control rollout and enable A/B tests.

These primitives map directly to what we're building: you'll create a trace per request, add spans for parsing and retrieval, log a generation for the model call, attach scores for feedback and heuristics, and fetch prompts by version to run experiments.

## Setup
Run the following cell to install dependencies and configure environment variables. If you're in Colab without a `.env` file, just set the keys inline:

In [None]:
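# The code below uses the v2-style SDK calls (langfuse.trace, trace.span, trace.generation),
# so pin the SDK below 3.x, where the tracing API changed.
!pip install -q "langfuse<3" openai python-dotenv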

import os
from dotenv import load_dotenv

# Load from .env if present (local), otherwise set inline (Colab)
load_dotenv()

if "LANGFUSE_PUBLIC_KEY" not in os.environ:
    os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # Replace with your key
if "LANGFUSE_SECRET_KEY" not in os.environ:
    os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."  # Replace with your key
if "LANGFUSE_HOST" not in os.environ:
    os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # Or your self-hosted URL
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "sk-..."  # Replace with your OpenAI key

# Optional: pin a specific environment and model
os.environ["ENV"] = os.getenv("ENV", "dev")
os.environ["OPENAI_MODEL"] = os.getenv("OPENAI_MODEL", "gpt-4o-mini")

**Note:** You'll need a Langfuse account (free at <a href="https://langfuse.com">langfuse.com</a>) and an OpenAI API key. If `gpt-4o-mini` isn't available for you, just swap in `gpt-3.5-turbo` or whatever model you have access to.

Next, we need to ensure the `qa-answer` prompt exists in your Langfuse Prompt Library. Run this cell once to create and publish it:

In [None]:
from langfuse import Langfuse

langfuse = Langfuse()

try:
    # Check if prompt exists
    langfuse.get_prompt("qa-answer")
    print("Prompt 'qa-answer' already exists.")
except Exception:
    # Create and publish a minimal prompt
    prompt_text = """Answer the question using only the context below. Be concise.

Context:
{{context}}

Question: {{question}}

Answer:"""
    
    langfuse.create_prompt(
        name="qa-answer",
        prompt=prompt_text,
        is_active=True,  # Publish immediately
    )
    print("Created and published prompt 'qa-answer'.")

If the prompt already exists, this cell skips creation. You can now fetch it by name in the tutorial code.

## Using the Tool in Practice
### Step 1: Initialize Clients and Configure Environment

In [None]:
from langfuse import Langfuse
from openai import OpenAI
from time import perf_counter

langfuse = Langfuse()
openai_client = OpenAI()

ENV = os.getenv("ENV", "dev")
DEFAULT_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")

### Step 2: Define a Retrieval Function
Replace this stub with your actual retriever (vector DB, keyword search, whatever you're using). Keep it deterministic so your tests stay reproducible.

In [None]:
def retrieve_docs(query: str) -> list[str]:
    """Retrieve documents related to the query."""
    return [f"Doc about: {query}", "Another relevant doc."]

### Step 3: Start a Trace with Parsing and Retrieval Spans

In [None]:
def start_trace(user_id: str, question: str):
    """Create a trace and log input parsing + retrieval spans."""
    trace = langfuse.trace(
        name="qa_request",
        user_id=user_id,
        input={"question": question},
        metadata={"env": ENV, "component": "qa-service"},
    )
    
    # Span: parse input
    parse = trace.span(name="parse_input", input={"raw": question})
    normalized = question.strip()
    parse.end(output={"normalized": normalized})
    
    # Span: retrieve documents
    ret = trace.span(name="retrieval", input={"query": normalized})
    docs = retrieve_docs(normalized)
    ret.end(output={"docs_count": len(docs)})
    
    return trace, normalized, docs

### Step 4: Generate an Answer and Log the Generation

In [None]:
def generate_answer(trace, model: str, prompt_text: str):
    """Call the model and log generation details."""
    gen_span = trace.span(name="generation.prepare_prompt", input={"prompt_preview": prompt_text[:200]})
    
    t0 = perf_counter()
    response = openai_client.chat.completions.create(
        model=model,
        temperature=0.2,
        messages=[{"role": "user", "content": prompt_text}],
    )
    latency_ms = (perf_counter() - t0) * 1000
    gen_span.end(output={"latency_ms": latency_ms})
    
    content = response.choices[0].message.content
    usage = getattr(response, "usage", None)
    
    generation = trace.generation(
        name="answer_generation",
        model=response.model,
        input=prompt_text,
        output=content,
        metadata={"provider": "openai", "latency_ms": latency_ms},
        usage={
            "input_tokens": getattr(usage, "prompt_tokens", None),
            "output_tokens": getattr(usage, "completion_tokens", None),
            "total_tokens": getattr(usage, "total_tokens", None),
        },
    )
    generation.end()
    trace.update(output={"answer": content})
    return content

**Quick note on token costs:** If `usage` is missing (some providers don't return it), the token counts will be `None`. For cost estimation, you might want to use a fallback tokenizer like `tiktoken`, or just log a warning and track latency only.
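
Here's a minimal sketch of that fallback idea, written in plain Python so it runs without extra dependencies. The helper names are hypothetical, the prices are placeholders rather than real rates, and the "about 4 characters per token" rule is only a rough approximation; a real tokenizer such as `tiktoken` will be more accurate.

```python
from typing import Optional

# Hypothetical prices in USD per 1,000 tokens; replace with your provider's current rates.
INPUT_PRICE_PER_1K = 0.001
OUTPUT_PRICE_PER_1K = 0.002


def approx_token_count(text: str) -> int:
    """Crude fallback: assume roughly 4 characters per token (an approximation)."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def tokens_or_fallback(reported: Optional[int], text: str) -> int:
    """Prefer the provider-reported count; estimate from the text only when it is missing."""
    return reported if reported is not None else approx_token_count(text)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost = tokens / 1000 * price per 1K tokens, summed over input and output."""
    return (
        (input_tokens / 1000) * INPUT_PRICE_PER_1K
        + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    )


prompt_text = "Answer the question using only the context below."
answer_text = "The warranty period is one year."
input_tokens = tokens_or_fallback(None, prompt_text)  # usage missing: estimate from text
output_tokens = tokens_or_fallback(9, answer_text)    # usage present: keep the reported value
print(input_tokens, output_tokens, estimate_cost_usd(input_tokens, output_tokens))
```

If you log an estimated count, it's worth flagging it in the generation's metadata so estimated and reported numbers don't get mixed in your cost charts.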

### Step 5: Record Feedback and Heuristic Scores

In [None]:
def record_feedback(trace, accepted: bool, question: str, answer: str, docs: list[str]):
    """Attach user feedback and heuristic scores to the trace."""
    trace.score(name="acceptance", value=1.0 if accepted else 0.0, comment="user_feedback")
    
    # Heuristic: does answer cite a document term?
    hit = any(term.lower() in answer.lower() for term in set(" ".join(docs).split()) if len(term) > 5)
    trace.score(name="groundedness_heuristic", value=1.0 if hit else 0.0)
    
    # Heuristic: brevity penalty
    brevity = 1.0 if len(answer) < 800 else 0.0
    trace.score(name="brevity_ok", value=brevity)

### Step 6: Fetch and Compile Prompts from the Prompt Library

In [None]:
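def get_compiled_prompt(name: str, variables: dict, pinned_version: str | None = None) -> str:
    """Fetch a prompt by name (optionally a pinned version), then compile it with variables."""
    if pinned_version:
        # Prompt versions are integers; the pin arrives as a string (e.g. from an env var)
        prompt = langfuse.get_prompt(name, version=int(pinned_version))
    else:
        prompt = langfuse.get_prompt(name)  # Currently published version
    # compile() takes the template variables as keyword arguments
    return prompt.compile(**variables)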

def build_prompt(question: str, docs: list[str]) -> str:
    """Build the QA prompt, optionally pinning a version."""
    version_pin = os.getenv("PROMPT_VERSION_PIN") or None
    return get_compiled_prompt(
        "qa-answer",
        {"question": question, "context": "\n\n".join(docs)},
        pinned_version=version_pin,
    )
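
To make the `{{context}}` and `{{question}}` substitution concrete, here's a small pure-Python sketch of what compiling a template amounts to. `compile_template` is a hypothetical helper written for illustration only; it is not Langfuse's implementation, and the SDK may handle missing variables differently.

```python
import re

# Matches {{name}} placeholders, allowing optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value from `variables`.

    Hypothetical helper for illustration. Unknown placeholders are left
    untouched so a missing variable is easy to spot in the output.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)

    return _PLACEHOLDER.sub(substitute, template)


template = "Context:\n{{context}}\n\nQuestion: {{question}}\n\nAnswer:"
compiled = compile_template(template, {"context": "Doc A", "question": "What is SSO?"})
print(compiled)
```

Leaving unknown placeholders in place is a deliberate choice in this sketch: a stray `{{context}}` in a logged prompt is much easier to notice than a silently empty string.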

### Step 7: Add LLM-as-Judge Scoring

In [None]:
def judge_helpfulness(trace, question: str, answer: str):
    """Use an LLM to score answer helpfulness."""
    judge_prompt = f"Rate helpfulness 0-1 for the answer given the question.\nQuestion: {question}\nAnswer: {answer}\nRespond with a number."
    
    t0 = perf_counter()
    judge_resp = openai_client.chat.completions.create(
        model=DEFAULT_MODEL,
        temperature=0,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    latency_ms = (perf_counter() - t0) * 1000
    judge_text = judge_resp.choices[0].message.content.strip()
    
    try:
        score_val = float(judge_text.split()[0])
    except Exception:
        score_val = 0.0
    
    judge_gen = trace.generation(
        name="judge_helpfulness",
        model=judge_resp.model,
        input=judge_prompt,
        output=judge_text,
        metadata={"provider": "openai", "latency_ms": latency_ms},
        usage={
            "input_tokens": getattr(judge_resp.usage, "prompt_tokens", None),
            "output_tokens": getattr(judge_resp.usage, "completion_tokens", None),
        },
    )
    judge_gen.end()
    trace.score(name="helpfulness_llm", value=score_val)
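
Judge models don't always reply with a bare number; you'll see things like "Score: 0.8" or "8". One way to make the parsing more forgiving is to pull out the first number and clamp it to the 0-1 range. This is a sketch, and `parse_judge_score` is a hypothetical helper, not part of any SDK.

```python
import re
from typing import Optional

# First signed integer or decimal number in the text, e.g. "0.8", ".5", "7".
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_judge_score(text: Optional[str], default: float = 0.0) -> float:
    """Extract the first number from the judge's reply and clamp it to [0, 1].

    Returns `default` when the reply is empty or contains no number.
    """
    if not text:
        return default
    match = _NUMBER.search(text)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group(0))))


for reply in ["0.8", "Score: 0.75 out of 1", "I'd say 7", "n/a", None]:
    print(repr(reply), "->", parse_judge_score(reply))
```

One limitation to keep in mind: the first number wins, so a reply like "On a 0-1 scale: 0.6" parses as 0. Asking the judge to respond with only the number keeps this rare.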

### Step 8: Combine Steps into a Single Flow

In [None]:
def answer_and_score(user_id: str, question: str, accept: bool | None = None):
    """Generate an answer, record feedback, and judge helpfulness."""
    trace, normalized, docs = start_trace(user_id, question)
    prompt_text = build_prompt(normalized, docs)
    answer = generate_answer(trace, DEFAULT_MODEL, prompt_text)
    record_feedback(trace, accepted=bool(accept), question=normalized, answer=answer, docs=docs)
    judge_helpfulness(trace, question=normalized, answer=answer)
    return answer

## Run and Evaluate
### Run the QA Flow
Let's execute the flow a few times and see what we get:

In [None]:
print(answer_and_score("u1", "What is the warranty period for the Pro plan?", accept=True))
print(answer_and_score("u2", "How do I reset my password?", accept=True))

Check the Langfuse dashboard to see your traces, spans, generations, and scores. It's pretty satisfying to see everything laid out.

### A/B Test Two Prompt Versions

In [None]:
def run_ab(user_id: str, question: str, version_a: str, version_b: str):
    """Run an A/B test on two prompt versions."""
    results = []
    for variant, v in [("A", version_a), ("B", version_b)]:
        trace, normalized, docs = start_trace(user_id, question)
        trace.update(metadata={"experiment": "qa_prompt_ab:v1", "variant": variant})
        prompt_text = get_compiled_prompt("qa-answer", {"question": normalized, "context": "\n\n".join(docs)}, v)
        answer = generate_answer(trace, DEFAULT_MODEL, prompt_text)
        
        # Simple auto-score: does answer contain first question keyword?
        score = 1.0 if normalized.lower().split()[0] in answer.lower() else 0.0
        trace.score(name="keyword_hit", value=score)
        results.append((variant, answer, trace))
    return results

# Run A/B test. Both variants point at version "1" as a placeholder so the cell runs
# out of the box; replace them with two different version numbers from your Prompt Library.
ab = run_ab("u3", "Do you support SSO?", version_a="1", version_b="1")
for variant, ans, _ in ab:
    print(f"{variant}: {ans[:120]}")

**Note:** You'll want to create version 2 of the `qa-answer` prompt in the Langfuse UI. Just edit and publish a new version, then update `version_a` and `version_b` accordingly.
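
The `run_ab` loop above sends every question through both versions, which works well for offline comparison. For live traffic you'd more likely want each user to land on one variant consistently. Here's a minimal sketch of hash-based bucketing; `assign_variant` and the experiment name are hypothetical and not part of the Langfuse SDK.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically map a user to variant "A" or "B".

    The same (experiment, user_id) pair always gets the same variant, and
    `split` is the approximate share of users sent to "A".
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "A" if bucket < split else "B"


for uid in ["u1", "u2", "u3"]:
    print(uid, assign_variant(uid, "qa_prompt_ab:v1"))
```

Including the experiment name in the hash means a user's bucket in one experiment doesn't determine their bucket in the next one.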

### Evaluate on a Dataset
Here's where things get really powerful. You can run your entire dataset through different prompt versions and compare:

In [None]:
def exact_match(answer: str, ground_truth: str) -> float:
    """Check if answer exactly matches ground truth (case-insensitive)."""
    return 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0

def contains_any(answer: str, ground_truth: str) -> float:
    """Check if answer contains any keyword from ground truth."""
    keywords = ground_truth.lower().split()
    return 1.0 if any(kw in answer.lower() for kw in keywords) else 0.0

def evaluate_dataset(dataset: list[dict], version_a: str, version_b: str):
    """Evaluate a dataset using two prompt versions."""
    for item in dataset:
        for variant, v in [("A", version_a), ("B", version_b)]:
            trace, normalized, docs = start_trace(user_id=item["id"], question=item["question"])
            trace.update(metadata={"experiment": "qa_dataset_eval:v1", "variant": variant, "dataset_id": "qa_sanity"})
            prompt_text = get_compiled_prompt("qa-answer", {"question": normalized, "context": "\n\n".join(docs)}, v)
            answer = generate_answer(trace, DEFAULT_MODEL, prompt_text)
            trace.score(name="accuracy_exact", value=exact_match(answer, item["ground_truth"]))
            trace.score(name="relevancy_contains", value=contains_any(answer, item["ground_truth"]))

# Example dataset
dataset = [
    {"id": "q1", "question": "What is the warranty period?", "ground_truth": "one year"},
    {"id": "q2", "question": "How do I reset my password?", "ground_truth": "click forgot password"},
]

evaluate_dataset(dataset, version_a="1", version_b="1")
print("Dataset evaluation complete. Check Langfuse for scores.")

### Fetch and Aggregate Scores Locally (Optional)

In [None]:
# Flush pending traces to Langfuse
langfuse.flush()

# Note: Fetching traces via API requires additional setup (API client, filtering by metadata).
# For now, inspect aggregated scores in the Langfuse dashboard under Experiments or Datasets.

With all this trace data, you can baseline quality and cost, A/B test prompt versions on the same inputs, and roll out changes with confidence. The iteration loop becomes so much tighter: diagnose, tweak, re-run, compare. Tracing won't lower p95 latency or token spend by itself, but it shows you where both come from, so you can tell whether a change actually helped.
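
If you'd rather compare variants locally, here's a sketch of the aggregation step. It assumes you've collected one record per run in your own code (for example, by appending to a list inside the evaluation loop); the record shape and the helper names are hypothetical. The p95 uses the nearest-rank method.

```python
from collections import defaultdict
from statistics import mean


def p95(values: list) -> float:
    """Nearest-rank 95th percentile: the smallest value with at least 95% of the data at or below it."""
    if not values:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(records: list) -> dict:
    """Group records by variant and compute count, mean score, and p95 latency."""
    by_variant = defaultdict(list)
    for record in records:
        by_variant[record["variant"]].append(record)
    return {
        variant: {
            "n": len(rows),
            "mean_score": mean(r["score"] for r in rows),
            "p95_latency_ms": p95([r["latency_ms"] for r in rows]),
        }
        for variant, rows in by_variant.items()
    }


# Made-up sample records, just to show the shape.
records = [
    {"variant": "A", "score": 1.0, "latency_ms": 120},
    {"variant": "A", "score": 0.0, "latency_ms": 300},
    {"variant": "B", "score": 1.0, "latency_ms": 100},
    {"variant": "B", "score": 1.0, "latency_ms": 110},
]
print(summarize(records))
```

With only a handful of runs, p95 is effectively the maximum, so treat it as meaningful only once you have a decent sample per variant.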

If you want to go deeper on building Retrieval-Augmented Generation systems, you might find our guide on <a href="/article/44830763/5-essential-steps-to-building-agentic-rag-systems-with-langchain-and-chromadb">building agentic RAG systems with LangChain and ChromaDB</a> helpful.

## Production Considerations
### Overhead and Sampling Strategy
One thing I've learned is that Langfuse batches trace data and flushes asynchronously, which adds minimal latency (we're talking less than 10ms per trace in most cases). But for high-throughput production services, you need a strategy:

- **Dev/staging:** Sample 100% of traces (`sample_rate=1.0`) to catch everything.
- **Production:** Sample 5–10% (`sample_rate=0.05`) to balance cost and visibility. Adjust based on your traffic volume and budget.
- **Flush on shutdown:** Always call `langfuse.flush()` before process exit to ensure all traces are sent. I've lost data by forgetting this more times than I'd like to admit.
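
One way to keep these rates in one place is a small per-environment lookup, sketched below. The environment names and the `sample_rate_for` helper are assumptions for illustration; the returned value is what you'd pass as the `sample_rate` setting mentioned above.

```python
# Per-environment sampling rates, mirroring the guidance above.
# The environment names are assumptions; use whatever your ENV variable holds.
SAMPLE_RATES = {"dev": 1.0, "staging": 1.0, "prod": 0.05}


def sample_rate_for(env: str) -> float:
    """Return the sampling rate for an environment, defaulting to tracing everything."""
    return SAMPLE_RATES.get(env, 1.0)


def expected_traces_per_day(requests_per_day: int, sample_rate: float) -> int:
    """Rough volume estimate: how many traces a given rate produces per day."""
    return round(requests_per_day * sample_rate)


rate = sample_rate_for("prod")
print(rate, expected_traces_per_day(200_000, rate))
```

Defaulting unknown environments to 1.0 errs on the side of visibility; flip that default if an unexpected environment name is more likely to be a high-traffic one.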
### Error Handling and Retries
Wrap your model calls in try/except blocks and log errors as spans:

In [None]:
try:
    response = openai_client.chat.completions.create(...)
except Exception as e:
    error_span = trace.span(name="generation_error", metadata={"error": str(e)})
    error_span.end()
    raise

And for transient failures (rate limits, timeouts), implement exponential backoff and log retry attempts as events:

In [None]:
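import time

MAX_ATTEMPTS = 3

for attempt in range(MAX_ATTEMPTS):
    try:
        response = openai_client.chat.completions.create(...)
        break
    except Exception as e:
        trace.event(name="retry", metadata={"attempt": attempt, "error": str(e)})
        if attempt == MAX_ATTEMPTS - 1:
            raise  # Out of retries: surface the error instead of continuing without a response
        time.sleep(2 ** attempt)  # Back off 1s, then 2s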
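
If you retry in more than one place, it can help to pull the loop into a helper. The sketch below is one way to do it, with a little jitter added so parallel workers don't retry in lockstep. `retry_with_backoff` is a hypothetical helper, and the `sleep` and `on_retry` parameters exist mainly to make it testable and to let you plug in your own logging.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    on_retry: Optional[Callable[[int, Exception], None]] = None,
) -> T:
    """Call `fn`, retrying on any exception with exponential backoff plus jitter.

    Re-raises the last exception once `max_attempts` calls have failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # Out of retries: let the caller see the error
            if on_retry is not None:
                on_retry(attempt, exc)
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # Up to 10% jitter
    raise RuntimeError("unreachable")


# Demo with a function that fails twice, then succeeds (sleep is stubbed out).
state = {"calls": 0}


def flaky_call() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


retried = []
result = retry_with_backoff(
    flaky_call, sleep=lambda s: None, on_retry=lambda i, e: retried.append(i)
)
print(result, retried)
```

In the QA flow, `on_retry` is a natural place to log a retry event on the trace, and `fn` would be a small lambda wrapping the model call.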

### CI Gating with Score Thresholds
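This is where you can really prevent regressions: gate your deployments on dataset evaluation scores. Here's an illustrative shell snippet for a CI job. It assumes `evaluate.py` is your own script wrapping the dataset evaluation, and the query parameters and the `mean_acceptance` field are placeholders, so adapt the request and the `jq` filter to the response shape in the Langfuse API reference:

In [None]:
# Shell, meant for a CI step rather than a notebook cell.
# Run the evaluation, then fetch the mean acceptance score via the Langfuse API.
python evaluate.py --dataset qa_sanity --version "$NEW_VERSION"
# The Langfuse public API uses HTTP Basic auth (public key as username, secret key as password).
SCORE=$(curl -s -u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" \
  "https://cloud.langfuse.com/api/public/scores?dataset=qa_sanity&version=$NEW_VERSION" \
  | jq '.mean_acceptance')

if (( $(echo "$SCORE < 0.8" | bc -l) )); then
  echo "Score $SCORE below threshold 0.8. Failing build."
  exit 1
fi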

Adjust the thresholds based on your quality requirements. I've found 0.8 is a good starting point, but your mileage may vary.
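
If your evaluation script already has the scores in memory, you can run the same check in Python and skip the round trip. This is a sketch: `gate` is a hypothetical helper and the example scores are made up.

```python
from statistics import mean


def gate(scores: list, threshold: float = 0.8) -> int:
    """Return a process exit code: 0 if the mean score meets the threshold, 1 otherwise.

    An empty score list fails the gate, since "no data" should not pass CI.
    """
    if not scores:
        print("No scores found. Failing build.")
        return 1
    average = mean(scores)
    if average < threshold:
        print(f"Score {average:.2f} below threshold {threshold}. Failing build.")
        return 1
    print(f"Score {average:.2f} meets threshold {threshold}.")
    return 0


example_scores = [1.0, 1.0, 0.0, 1.0, 1.0]  # Made-up acceptance scores
exit_code = gate(example_scores)
# In a CI script you would finish with: sys.exit(exit_code)
```

Treating an empty result as a failure is a deliberate choice here; otherwise a broken evaluation run that produced no scores would sail through the gate.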

## Conclusion
So there you have it - a fully instrumented QA endpoint with Langfuse, capturing traces, spans, generations, and scores at every step. You can now:

- Track p95 latency, token usage, and user acceptance in real time
- Version and A/B test prompts via the Prompt Library
- Evaluate changes on datasets with automated scoring
- Gate CI/CD pipelines on quality thresholds to prevent regressions

This workflow scales from prototyping to production. It enables you to iterate faster, ship with confidence, and maintain quality as your LLM app grows. The visibility you get is honestly game-changing - no more guessing why users are unhappy or where your latency is coming from.

**Next steps:**

- Explore advanced prompt engineering techniques and version multiple prompts for different use cases
- Integrate Langfuse with LangChain or LlamaIndex for deeper callback instrumentation (we cover this in separate guides)
- Set up alerts in Langfuse to notify your team when latency or acceptance drops below thresholds