# W&B Weave: A Hands-On Tutorial

Welcome to this tutorial on W&B Weave! 🚀

**W&B Weave** is a powerful toolkit for developers building applications with Large Language Models (LLMs). It helps you track, visualize, debug, and evaluate your LLM-powered applications, making the development process more rigorous and efficient. Whether you're building a simple chatbot or a complex Retrieval-Augmented Generation (RAG) system, Weave provides the tools you need to understand and improve your application's performance.

In this notebook, we'll cover the most important concepts of Weave and walk you through a hands-on example of building and evaluating a simple RAG system. By the end of this tutorial, you'll be able to:

- **Trace your Python functions** with a single decorator to understand their execution flow.
- **Automatically capture LLM calls** to services like OpenAI.
- **Build and debug a RAG system** with full visibility into each step.
- **Evaluate your application's performance** using custom scorers.

Let's get started! 🎉

## 1. Setup and Installation

First, let's install the necessary libraries. We'll need `weave` for tracing and evaluation, `wandb` for logging, `openai` to interact with the GPT models, `python-dotenv` to manage our API keys, and `scikit-learn` for our RAG example.

In [None]:
#!pip install weave wandb openai scikit-learn python-dotenv -q

### 1.1. Environment Variables

It's a best practice to store your API keys and other sensitive information in a `.env` file. This keeps your secrets out of your code and makes it easy to manage different environments.

Create a file named `.env` in the same directory as this notebook and add your API keys like this:

In [None]:
WANDB_API_KEY="your_wandb_api_key"
OPENAI_API_KEY="your_openai_api_key"


You can get your W&B API key from the [W&B Authorize page](https://wandb.ai/authorize) and your OpenAI API key from the [OpenAI API keys page](https://platform.openai.com/account/api-keys).

In [1]:
import os
from dotenv import load_dotenv

# Load the variables defined in .env into the process environment
load_dotenv()

# Collect any keys that are still unset or empty after loading
missing = [key for key in ("WANDB_API_KEY", "OPENAI_API_KEY") if not os.getenv(key)]
if missing:
    print(f"API keys not found: {', '.join(missing)}. Make sure your .env file defines them.")

### 1.2. Initialize Weave

Now, let's initialize Weave. This will set up a new project in your W&B account where all your traces and evaluations will be logged. If the project doesn't exist, it will be created automatically.

In [2]:
import weave
weave_client = weave.init("ai-builders-tutorial")

weave: Logged in as Weights & Biases user: devonsun_ml.
weave: View Weave data at https://wandb.ai/datumverse/ai-builders-tutorial/weave


## 2. Core Concepts: Tracing with `@weave.op()`

The core of Weave's debugging capabilities is **tracing**. A trace is a record of the execution of a function, including its inputs, outputs, and any sub-operations it calls. The easiest way to create a trace is with the `@weave.op()` decorator.

Let's create a simple function and decorate it with `@weave.op()`:

In [3]:
@weave.op()
def add(a, b):
    """A simple function to add two numbers."""
    return a + b

# Call the decorated function
result = add(5, 10)
print(f"The result is: {result}")

The result is: 15

weave: 🍩 https://wandb.ai/datumverse/ai-builders-tutorial/r/call/0198ea3b-e1cc-7667-acfe-4a988a0d00bb
weave: 🍩 https://wandb.ai/datumverse/ai-builders-tutorial/r/call/0198ea3d-9099-7c43-a182-b3cd18aeda90
weave: 🍩 https://wandb.ai/datumverse/ai-builders-tutorial/r/call/0198ea3f-0efb-7d54-9e7d-58891b710682


When you run the cell above, Weave logs a trace of the `add` call and prints a link to it. Open the link to view the trace in your W&B project: it shows the inputs (`a=5`, `b=10`) and the output (`15`). The example is trivial, but the same record is captured for any decorated function, so you can always check exactly what a call received and returned.
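To build intuition for what a trace record contains, here is a minimal sketch of a tracing decorator written with the standard library only. It is not how Weave is implemented: `traced` and `CALL_LOG` are hypothetical names, and the log lives in memory rather than in a W&B project.

```python
import functools
import inspect

# Hypothetical in-memory log; Weave sends its call records to your W&B project instead.
CALL_LOG = []

def traced(fn):
    """Toy stand-in for a tracing decorator: records the inputs and output of each call."""
    signature = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments onto the function's parameter names
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        output = fn(*args, **kwargs)
        CALL_LOG.append(
            {"op": fn.__name__, "inputs": dict(bound.arguments), "output": output}
        )
        return output

    return wrapper

@traced
def add(a, b):
    """A simple function to add two numbers."""
    return a + b

add(5, 10)
print(CALL_LOG[0])
# {'op': 'add', 'inputs': {'a': 5, 'b': 10}, 'output': 15}
```

The record has the same three pieces of information the Weave UI shows for `add`: the operation name, the named inputs, and the output.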

## 3. Automatic LLM Tracing

Weave automatically traces calls to many popular LLM libraries, including OpenAI. This means you don't need to add any special decorators to your LLM calls to get full visibility into the prompts, responses, token usage, and more.

Let's make a simple call to the OpenAI API:

In [4]:
from openai import OpenAI

client = OpenAI()

@weave.op()
def generate_text(prompt):
    """Generates text using the OpenAI API."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

prompt = "What is the capital of France?"
response = generate_text(prompt)
print(f"Prompt: {prompt}")
print(f"Response: {response}")

weave: 🍩 https://wandb.ai/datumverse/ai-builders-tutorial/r/call/0198ea3c-bf9d-7926-a828-aa26c8e6b131


Prompt: What is the capital of France?
Response: The capital of France is Paris.


In your W&B project, you'll see a trace for the `generate_text` function. If you expand the trace, you'll see the underlying OpenAI API call with all the details, including the model used, the prompt, the completion, and even the token counts. This is incredibly useful for debugging your interactions with LLMs.

## 4. Hands-On Example: Building and Tracing a RAG System

Now, let's put it all together and build a simple Retrieval-Augmented Generation (RAG) system. A RAG system first retrieves relevant documents from a knowledge base and then uses those documents as context for an LLM to generate an answer.

We'll trace each step of our RAG system with `@weave.op()` to get a clear picture of how it works.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Our knowledge base (a simple list of documents)
documents = [
    "The Eiffel Tower is located in Paris, France.",
    "The Great Wall of China is one of the seven wonders of the world.",
    "The Statue of Liberty is in New York City, USA.",
    "The Colosseum is an ancient amphitheater in Rome, Italy."
]

# 2. Create a simple retriever
vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)

@weave.op()
def retrieve_documents(query, k=1):
    """Retrieves the top k most relevant documents for a given query."""
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(query_vector, doc_vectors).flatten()
    top_k_indices = similarities.argsort()[-k:][::-1]
    return [documents[i] for i in top_k_indices]

# 3. Create the RAG chain
@weave.op()
def rag_chain(query):
    """Our RAG system: retrieves documents and then generates an answer."""
    retrieved_docs = retrieve_documents(query)
    
    context = "\n".join(retrieved_docs)
    prompt = f"Based on the following context, answer the user's query.\n\nContext:\n{context}\n\nQuery: {query}"
    
    return generate_text(prompt)

# 4. Run the RAG system
query = "Where is the Eiffel Tower?"
answer = rag_chain(query)
print(f"Query: {query}")
print(f"Answer: {answer}")

Query: Where is the Eiffel Tower?
Answer: The Eiffel Tower is located in Paris, France.
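The ranking step inside `retrieve_documents` can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not the code used above: it scores documents with cosine similarity over raw word counts instead of TF-IDF weights, and the helper names (`bag_of_words`, `cosine`, `top_k`) are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, strip basic punctuation, and count word occurrences."""
    words = [w.strip(".,?!") for w in text.lower().split()]
    return Counter(w for w in words if w)

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query, docs, k=1):
    """Return the k documents most similar to the query, best match first."""
    q = bag_of_words(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

docs = [
    "The Eiffel Tower is located in Paris, France.",
    "The Great Wall of China is one of the seven wonders of the world.",
    "The Statue of Liberty is in New York City, USA.",
    "The Colosseum is an ancient amphitheater in Rome, Italy.",
]

print(top_k("Where is the Eiffel Tower?", docs))
# ['The Eiffel Tower is located in Paris, France.']
```

With raw counts, common words such as "the" and "is" contribute to every score. TF-IDF down-weights words that appear in many documents, which is one reason the tutorial's retriever uses `TfidfVectorizer`.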


When you view the trace for this `rag_chain` call in W&B, you'll see a nested structure. You can see the top-level `rag_chain` call, and inside it, you can see the `retrieve_documents` call and the `generate_text` call. This allows you to inspect the inputs and outputs of each component of your system, which is invaluable for debugging.

## 5. Evaluation with Weave

Now that we have a working RAG system, how do we know if it's any good? This is where **evaluation** comes in. Weave provides a framework for evaluating your LLM applications.

An **evaluation** consists of:

- A **model** (in our case, the `rag_chain` function).
- A **dataset** of examples to test the model on.
- A set of **scorers** that measure the quality of the model's outputs.

Let's create a simple evaluation for our RAG system.

In [14]:
import weave

dataset = [
    {"query": "Where is the Eiffel Tower?", "expected_answer": "Paris"},
    {"query": "What is in Rome?", "expected_answer": "Colosseum"},
]

class ContainsScorer(weave.Scorer):
    @weave.op()
    def score(self, output: str, expected_answer: str) -> dict:
        return {"contains_expected": expected_answer.lower() in (output or "").lower()}

evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[ContainsScorer()],
    evaluation_name="rag_contains_eval",
)

results = await evaluation.evaluate(rag_chain)
evaluation.get_scores()

weave: retry_attempt
weave: Evaluated 1 of 2 examples
weave: Evaluated 2 of 2 examples
weave: Evaluation summary {
weave:   "ContainsScorer": {
weave:     "contains_expected": {
weave:       "true_count": 2,
weave:       "true_fraction": 1.0
weave:     }
weave:   },
weave:   "model_latency": {
weave:     "mean": 0.7733820676803589
weave:   }
weave: }

{'0198ea45-2914-7e5f-a9ed-7debd467703d': {}}

When you run this evaluation, Weave calls `rag_chain` once per example in the dataset and then runs `ContainsScorer` on each output. Dataset keys are matched to argument names: `query` is passed to `rag_chain`, while `expected_answer` and the model's return value (as `output`) are passed to `score`. The results appear in a dashboard in your W&B project, where you can inspect the score for each example and see aggregate metrics such as `true_fraction` and mean latency.
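Stripped of tracing and concurrency, the evaluation loop amounts to the steps sketched below. This is a plain-Python approximation for intuition only: `run_eval` and `stub_model` are hypothetical, and the stub returns canned answers (one deliberately wrong) instead of calling `rag_chain`.

```python
def contains_scorer(output, expected_answer):
    """Same check as ContainsScorer: case-insensitive substring match."""
    return {"contains_expected": expected_answer.lower() in (output or "").lower()}

def run_eval(model, dataset, scorer):
    """Call the model on every row, score each output, and aggregate the boolean results."""
    rows = []
    for example in dataset:
        output = model(example["query"])
        rows.append(scorer(output, example["expected_answer"]))
    true_count = sum(row["contains_expected"] for row in rows)
    return {"true_count": true_count, "true_fraction": true_count / len(rows)}

def stub_model(query):
    # Hypothetical canned answers standing in for rag_chain; the second one misses on purpose
    answers = {
        "Where is the Eiffel Tower?": "The Eiffel Tower is located in Paris, France.",
        "What is in Rome?": "I am not sure.",
    }
    return answers.get(query)

dataset = [
    {"query": "Where is the Eiffel Tower?", "expected_answer": "Paris"},
    {"query": "What is in Rome?", "expected_answer": "Colosseum"},
]

summary = run_eval(stub_model, dataset, contains_scorer)
print(summary)
# {'true_count': 1, 'true_fraction': 0.5}
```

A substring check is a coarse metric: it passes any output that mentions the expected word, even inside a wrong sentence, so treat it as a smoke test rather than a full measure of answer quality.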

![WandB Weave](/public/products/wandb_weave.png)

## 6. Conclusion

Congratulations! 🎉 You've successfully built, traced, and evaluated an LLM-powered application with W&B Weave.

In this tutorial, we've covered the fundamental concepts of Weave:

- **Tracing** with `@weave.op()` to get visibility into your code's execution.
- **Automatic logging** of LLM calls.
- **Building and debugging** a RAG system.
- **Evaluating** your application's performance with custom scorers.

This is just the beginning of what you can do with Weave. To learn more, check out the official W&B Weave documentation. 

Happy building! 🚀