# Using Weights & Biases Weave with AWS Bedrock

In this notebook, you will learn to use Weave, the newly released Weights & Biases tool for LLM practitioners, with Claude models served through AWS Bedrock.

You can use [Weave](https://wandb.github.io/weave/) to:

- Log and debug language model inputs, outputs, and traces
- Build rigorous, apples-to-apples evaluations for language model use cases
- Organize all the information generated across the LLM workflow, from experimentation and evaluations to production

## Setup

In [1]:
# !pip install -U weave boto3

In [2]:
import json
import boto3
from pprint import pprint
from botocore.exceptions import ClientError

## Create a Weights & Biases `Weave` project to store your traces

In [3]:
import weave
weave.init('aws-genai')

Logged in as Weights & Biases user: capecape.
View Weave data at https://wandb.ai/capecape/aws-genai/weave


<weave.weave_client.WeaveClient at 0x13fdd7c50>

Decorate the function that calls the model with `@weave.op()` so that Weave traces its inputs and outputs. That's it!

In [4]:
bedrock_client = boto3.client(service_name='bedrock-runtime')

@weave.op() # <- just add this 😎
def call_model(
    model_id: str, 
    messages: list[dict],
    max_tokens: int=400,
    ) -> dict:

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": messages,
        "max_tokens": max_tokens})
        
    response = bedrock_client.invoke_model(body=body,modelId=model_id)

    response_body = json.loads(response.get('body').read())
    return response_body
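
To see what `call_model` sends and receives without AWS credentials, here is a minimal offline sketch. `FakeBedrockClient` and `call_model_with` are hypothetical names introduced for illustration only. The fake client just echoes the prompt, and it assumes the response `body` behaves like a file-like object with a `.read()` method, as the code above relies on.

```python
import io
import json


class FakeBedrockClient:
    """Hypothetical stand-in for the bedrock-runtime client (offline testing only)."""

    def invoke_model(self, body: str, modelId: str) -> dict:
        request = json.loads(body)
        prompt_text = request["messages"][0]["content"][0]["text"]
        reply = {
            "role": "assistant",
            "content": [{"type": "text", "text": f"echo: {prompt_text}"}],
        }
        # The real client returns a streaming body; BytesIO mimics its .read().
        return {"body": io.BytesIO(json.dumps(reply).encode("utf-8"))}


def call_model_with(client, model_id: str, messages: list, max_tokens: int = 400) -> dict:
    """Same request/response handling as call_model, with the client passed in."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": messages,
        "max_tokens": max_tokens,
    })
    response = client.invoke_model(body=body, modelId=model_id)
    return json.loads(response["body"].read())


demo_messages = [{"role": "user", "content": [{"type": "text", "text": "ping"}]}]
demo_output = call_model_with(FakeBedrockClient(), "hypothetical-model-id", demo_messages)
print(demo_output["content"][0]["text"])
```

Passing the client in as an argument is one possible design choice: it keeps the request-building logic testable without touching the network.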

Let's first try the amazing `Claude 3.5 Sonnet`.

In [5]:
model_id = 'anthropic.claude-3-5-sonnet-20240620-v1:0'

messages = [{"role": "user", 
             "content": [
                 {"type": "text", 
                  "text": (
                        "In Bash, how do I list all text files in the current directory "
                        "(excluding subdirectories) that have been modified in the last month?")
                  }
                 ]
            }
            ]

outputs = call_model(model_id, messages)

🍩 https://wandb.ai/capecape/aws-genai/r/call/46542f3a-c1be-4fc4-b7d5-32866c305e3f


In [6]:
pprint(outputs)

{'content': [{'text': 'To list all text files in the current directory '
                      '(excluding subdirectories) that have been modified in '
                      'the last month using Bash, you can use the `find` '
                      "command combined with some options. Here's how you can "
                      'do it:\n'
                      '\n'
                      '```bash\n'
                      'find . -maxdepth 1 -type f -mtime -30 -name "*.txt"\n'
                      '```\n'
                      '\n'
                      "Let's break down this command:\n"
                      '\n'
                      '1. `find .`: Start searching in the current directory '
                      '(`.`)\n'
                      '\n'
                      '2. `-maxdepth 1`: Limit the search to the current '
                      'directory only, excluding subdirectories\n'
                      '\n'
                      '3. `-type f`: Look for files only (not directories

Having seen the structure of the response, we can refactor the code into smaller, more concise functions.

In [7]:
@weave.op
def format_prompt(prompt: str) -> list[dict]:
    messages = [{"role": "user", 
                "content": [
                    {"type": "text", 
                    "text": prompt}]}]
    return messages


@weave.op
def claude(model_id: str, prompt: str, max_tokens: int=400) -> str:
    messages = format_prompt(prompt)
    response_body = call_model(model_id, messages, max_tokens)
    return response_body["content"][0]["text"]
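
`claude` returns only the first content block. As a hedged variation, the sketch below joins every `text` block and reads the `stop_reason` field, which the Anthropic Messages API sets to `"max_tokens"` when the reply was cut off by the token limit. `extract_text` and the sample dictionary are hypothetical and exist only for this illustration.

```python
def extract_text(response_body: dict) -> str:
    """Concatenate the text of every `text` block in a Messages-style response."""
    blocks = response_body.get("content", [])
    return "".join(b.get("text", "") for b in blocks if b.get("type") == "text")


# Hand-written sample that mimics the response shape; not real model output.
demo_body = {
    "content": [
        {"type": "text", "text": "Hello"},
        {"type": "text", "text": ", world"},
    ],
    "stop_reason": "max_tokens",
}

demo_text = extract_text(demo_body)
was_truncated = demo_body.get("stop_reason") == "max_tokens"
print(demo_text)
print("truncated by max_tokens:", was_truncated)
```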

In [8]:
prompt = ("Give me a super simple starting code in PyTorch for training of a diffusion model. " 
          "Use a minimal dataset like CIFAR10")
response = claude(model_id, prompt, max_tokens=2000)

🍩 https://wandb.ai/capecape/aws-genai/r/call/90857910-5aa4-4202-9c83-ce5d6dc190ad


In [9]:
print(response)

Here's a simple starting code for training a diffusion model using PyTorch and the CIFAR10 dataset. This example uses a basic U-Net architecture for the diffusion model and implements a simplified diffusion process:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import numpy as np

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyperparameters
batch_size = 64
num_epochs = 100
learning_rate = 1e-4
num_timesteps = 1000

# Load CIFAR10 dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Define U-Net model (simplified version)
class UNet(nn.Module):
    def __init__(self):
        super(UNe

## Using the `anthropic` Python SDK with Bedrock

There is a more convenient way of interacting with Claude: the `anthropic` Python SDK provides a Bedrock client.

In [10]:
# !pip install -U "anthropic[bedrock]" instructor

In [11]:
from anthropic import AnthropicBedrock

client = AnthropicBedrock(
    aws_region="us-east-1",
)

output_message = client.messages.create(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello, world"}]
)

🍩 https://wandb.ai/capecape/aws-genai/r/call/fd31331d-3b82-40ad-badd-0f4f388c929c


In [12]:
print(output_message.content[0].text)

Hello! How can I assist you today? Feel free to ask me any questions or let me know if you need help with anything.


We should still wrap this call in a higher-level function for calling Claude.

In [13]:
@weave.op
def call_claude_bis(prompt: str, model_id: str, max_tokens: int=400) -> str:
    "Call Bedrock Claude using the anthropic Python SDK"
    messages = format_prompt(prompt)
    response_body = client.messages.create(
        model=model_id,
        max_tokens=max_tokens,
        messages=messages)
    return response_body.content[0].text

In [14]:
response = call_claude_bis("How do I say: This is really handy, in french?", model_id)
print(response)

🍩 https://wandb.ai/capecape/aws-genai/r/call/43ba967d-905c-453f-ac1f-eb43f44e2726
In French, you can say:

"C'est vraiment pratique."

Pronunciation guide:
• "C'est" is pronounced like "say"
• "vraiment" is pronounced "vreh-mahn"
• "pratique" is pronounced "pra-teek"

This phrase conveys the same meaning as "This is really handy" in English. It expresses that something is very useful or convenient.

Alternatively, you could also say:
"C'est très utile." (This is very useful)
Pronounced: "say tray oo-teel"

Both phrases are commonly used in French to express that something is handy or practical.


## Evaluation driven development

When working with LLMs, it is important to evaluate the quality of the model's responses.

We can use Weave [`Evaluation`](https://wandb.github.io/weave/tutorial-eval) to build rigorous, apples-to-apples evaluations for language model use cases.

Let's evaluate the models on the challenging [Factual Inconsistency Benchmark](https://arxiv.org/abs/2211.08412v1) dataset, which checks how well a model detects hallucinations by identifying inconsistencies between a piece of text and a "summary".

In [15]:
import json
import random
import instructor
from pydantic import BaseModel
from pathlib import Path

DATA_PATH = Path("./data")
NUM_SAMPLES = 20

def read_jsonl(path):
    "returns a list of dictionaries"
    with open(path, 'r') as file:
        return [json.loads(line) for line in file]

fib_ds = random.sample(read_jsonl(DATA_PATH / "fib-val.jsonl"), NUM_SAMPLES)
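
`random.sample` draws a different subset on every run, so two evaluations may not see the same rows. A minimal sketch of one way to make the subset repeatable follows. The helper names, the seed value, and the demo rows are hypothetical, and the `premise` / `hypothesis` / `target` field names are assumed from how the dataset is used later in this notebook.

```python
import json
import random
import tempfile
from pathlib import Path


def read_jsonl_skipping_blanks(path):
    "returns a list of dictionaries, ignoring empty lines"
    with open(path, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


def sample_rows(rows, k, seed=42):
    "draws the same k rows every time for a given seed"
    rng = random.Random(seed)  # private generator, leaves global random state alone
    return rng.sample(rows, min(k, len(rows)))


# Build a tiny throwaway JSONL file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_rows = [
        {"premise": f"doc {i}", "hypothesis": f"summary {i}", "target": i % 2}
        for i in range(10)
    ]
    demo_path.write_text("\n".join(json.dumps(r) for r in demo_rows) + "\n", encoding="utf-8")
    loaded_rows = read_jsonl_skipping_blanks(demo_path)

first_draw = sample_rows(loaded_rows, 3)
second_draw = sample_rows(loaded_rows, 3)
print(first_draw == second_draw)
```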

In [16]:
fib_ds[3]

{'premise': 'The district council-owned building has been upgraded for the first time in more than three decades.\nIt includes a new 33m pool, a shallow training pool, sauna and steam room.\nThe sports hall has been revamped and a £250,000 climbing wall has been built at the Bridgefoot centre.\nThe double Olympic gold medallist said: "It\'s an amazing venue for these guys so that they can learn a life skill as well as having fun.\n"So many pools nowadays through this whole country have been knocked down and it\'s incredible that they\'ve invested the money back to make a safe place."\nIt is the second time Adlington has visited the centre after she opened the temporary pool provided as part of the British Gas \'Pools for Schools Programme\' earlier this year.\nOther athletes who joined the celebrations included water polo player, Rosie Morris, who competed as the GB goalkeeper at the London Olympics in 2012.\nZoe Reeve, former member of the GB Synchro Squad and triple Commonwealth Gol

In [17]:
fib_prompt = """You are an expert to detect factual inconsistencies and hallucinations. 
You will be given a document and a summary.
- Carefully read the full document and the provided summary.
- Identify Factual Inconsistencies: any statements in the summary that are not supported by or contradict the information in the document.
Factually Inconsistent: If any statement in the summary is not supported by or contradicts the document, label it as 0
Factually Consistent: If all statements in the summary are supported by the document, label it as 1

Highlight or list the specific statements in the summary that are inconsistent.
Provide a brief explanation of why each highlighted statement is inconsistent with the document.

Return in JSON format with `consistency` and a `reason` for the given choice. Encode special chars properly.

Document: 
{premise}
Summary: 
{hypothesis}
"""
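
The template is filled in with `str.format`, which only interprets braces in the template itself. Braces inside the inserted document pass through untouched, but a literal brace in the template would have to be doubled. The short template below is a hypothetical example used only to show both cases.

```python
# Hypothetical mini-template: {{ and }} produce literal braces after formatting.
demo_template = (
    'Reply as {{"consistency": 0 or 1}}.\n'
    "Document:\n{premise}\n"
    "Summary:\n{hypothesis}\n"
)

filled_prompt = demo_template.format(
    premise="Set {a, b} has two items.",  # braces in the value are left alone
    hypothesis="The set has three items.",
)
print(filled_prompt)
```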

We will use [`instructor`](https://github.com/jxnl/instructor) to get consistent structured output from Claude.

In [18]:
class ModelOutput(BaseModel):
    consistency: int
    reason: str

# we need to patch the client with instructor to get structured output
inst_client = instructor.from_anthropic(client)

class ClaudeJudge(weave.Model):
    model_id: str
    max_tokens: int=1000
    system_message: str = "You are a helpful assistant and expert on extracting information in JSON format. Encode special chars properly."
    prompt_template: str = fib_prompt

    @weave.op
    def apply_prompt_template(self, premise:str, hypothesis:str) -> str:
        return self.prompt_template.format(premise=premise, hypothesis=hypothesis)

    @weave.op
    def predict(self, premise:str, hypothesis:str, **kwargs) -> dict:
        prompt = self.apply_prompt_template(premise, hypothesis)
        messages = format_prompt(prompt)
        structured_output = inst_client.messages.create(
            model=self.model_id,
            max_tokens=self.max_tokens,
            system=self.system_message,
            messages=messages,
            response_model=ModelOutput)
        return structured_output.dict()
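
The `ModelOutput` schema constrains the judge's answer to an integer `consistency` and a string `reason`. To illustrate the kind of check this implies, here is a rough standard-library sketch. It is not how `instructor` or `pydantic` are implemented, `parse_judgement` is a hypothetical helper, and it is stricter than `ModelOutput` because it only accepts 0 or 1.

```python
import json


def parse_judgement(raw: str) -> dict:
    """Parse a JSON judgement and check the two expected fields."""
    data = json.loads(raw)
    consistency = data["consistency"]
    reason = data["reason"]
    # type(...) is int rejects booleans and floats that merely compare equal to 0 or 1.
    if type(consistency) is not int or consistency not in (0, 1):
        raise ValueError(f"consistency must be 0 or 1, got {consistency!r}")
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")
    return {"consistency": consistency, "reason": reason}


good_judgement = parse_judgement('{"consistency": 1, "reason": "All statements are supported."}')
print(good_judgement)

try:
    parse_judgement('{"consistency": "yes", "reason": "free text label"}')
    rejected_bad_label = False
except ValueError:
    rejected_bad_label = True
print("rejected invalid label:", rejected_bad_label)
```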

Let's try the model on a single sample

In [19]:
haiku = ClaudeJudge(model_id="anthropic.claude-3-haiku-20240307-v1:0")
response = haiku.predict(**fib_ds[3])
pprint(response)

🍩 https://wandb.ai/capecape/aws-genai/r/call/eb83c52b-b91a-40c0-9850-1197acdb9034
{'consistency': 1,
 'reason': 'The summary accurately captures the key details provided in the '
           'document. The document states that the district council-owned '
           'building has been upgraded for the first time in more than three '
           'decades, and includes details about the new facilities like the '
           '33m pool, shallow training pool, sauna, steam room, revamped '
           'sports hall, and new climbing wall. The summary correctly states '
           'that Olympic champion Adlington has unveiled this massive '
           'refurbishment of the district building.'}


### Running the Evaluation

To do an evaluation, we will need:
- A dataset
- A model to evaluate
- A scorer function

In [20]:
def accuracy(model_output, target):
    class_model_output = model_output.get('consistency') if model_output else None
    return {"accuracy": class_model_output == target}
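
The scorer returns one boolean per row, and a missing model output counts as wrong. As a minimal sketch of how such per-row scores roll up into a count and a fraction, the example below uses hand-written outputs and targets (hypothetical values, not benchmark data). `demo_accuracy` is a copy of the scorer under another name so it does not shadow the one above.

```python
def demo_accuracy(model_output, target):
    predicted = model_output.get("consistency") if model_output else None
    return {"accuracy": predicted == target}


# Hypothetical per-row outputs; None stands for a call that produced no output.
demo_outputs = [
    {"consistency": 1, "reason": "supported"},
    {"consistency": 0, "reason": "contradicted"},
    None,
    {"consistency": 1, "reason": "supported"},
]
demo_targets = [1, 1, 0, 1]

row_scores = [demo_accuracy(o, t)["accuracy"] for o, t in zip(demo_outputs, demo_targets)]
true_count = sum(row_scores)                # booleans sum as 0/1
true_fraction = true_count / len(row_scores)
print({"true_count": true_count, "true_fraction": true_fraction})
```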

In [21]:
evaluation = weave.Evaluation(dataset=fib_ds, scorers=[accuracy])

In [22]:
await evaluation.evaluate(haiku)

{'model_output': {'consistency': {'mean': 0.7}},
 'accuracy': {'accuracy': {'true_count': 15, 'true_fraction': 0.75}},
 'model_latency': {'mean': 3.827869200706482}}

Let's try its bigger brother, `Claude 3.5 Sonnet`.

In [23]:
sonnet = ClaudeJudge(model_id="anthropic.claude-3-5-sonnet-20240620-v1:0")
await evaluation.evaluate(sonnet)

{'model_output': {'consistency': {'mean': 0.45}},
 'accuracy': {'accuracy': {'true_count': 18, 'true_fraction': 0.9}},
 'model_latency': {'mean': 7.83025358915329}}
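
To put the two runs side by side, the sketch below copies the numbers from the outputs above into plain dictionaries (the dictionary layout is an assumption made for this illustration) and computes the accuracy gain and the latency ratio. With only 20 samples, each row is worth 5 percentage points, so the gap is best read as indicative rather than conclusive.

```python
NUM_EVAL_SAMPLES = 20  # matches NUM_SAMPLES used to build the dataset

# Values copied from the two evaluation summaries above.
haiku_run = {"true_count": 15, "true_fraction": 0.75, "latency": 3.827869200706482}
sonnet_run = {"true_count": 18, "true_fraction": 0.9, "latency": 7.83025358915329}

extra_correct = sonnet_run["true_count"] - haiku_run["true_count"]
accuracy_gain = sonnet_run["true_fraction"] - haiku_run["true_fraction"]
latency_ratio = sonnet_run["latency"] / haiku_run["latency"]
points_per_sample = 100 / NUM_EVAL_SAMPLES

print(f"extra correct rows: {extra_correct}")
print(f"accuracy gain: {accuracy_gain:.2f}")
print(f"latency ratio: {latency_ratio:.2f}x")
print(f"one row = {points_per_sample:.0f} percentage points")
```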