# Using Weights & Biases `Weave` with AWS `Bedrock`

In this notebook, you will learn to use Weave, our newly released tool for LLM practitioners, with models hosted on AWS Bedrock.

You can use [Weave](https://wandb.github.io/weave/) to:

- Log and debug language model inputs, outputs, and traces
- Build rigorous, apples-to-apples evaluations for language model use cases
- Organize all the information generated across the LLM workflow, from experimentation and evaluations to production

## Setup

In [1]:
# aws sso login

In [2]:
# !pip install -U "weave==0.50.15" boto3

In [3]:
import json
import boto3
from pprint import pprint
from botocore.exceptions import ClientError

from utils import mprint

## Create a Weights & Biases `Weave` project to store your traces

In [4]:
import weave
weave.init('aws-genai')

Logged in as Weights & Biases user: capecape.
View Weave data at https://wandb.ai/capecape/aws-genai/weave


<weave.weave_client.WeaveClient at 0x12ffb7620>

Decorate the function that calls the model with `@weave.op`, and that's it!

In [5]:
bedrock_client = boto3.client(service_name='bedrock-runtime')

@weave.op # <- just add this 😎
def call_model(
    model_id: str,
    messages: list[dict],
    system_message: str,
    max_tokens: int=400,
    ) -> dict:
    "Call a Bedrock model through the Converse API and return the raw response"
    response = bedrock_client.converse(
        modelId=model_id,
        system=[{"text": system_message}], # the Converse API expects a list of content blocks
        messages=messages,
        inferenceConfig={
            "maxTokens": max_tokens
        }
    )

    return response
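
To build intuition for what a tracing decorator such as `@weave.op` does, here is a minimal standard-library sketch. It is not Weave's implementation: `trace_op`, `TRACE_LOG`, and `fake_model` are hypothetical names, and the real decorator also sends the call data to your W&B project rather than keeping it in memory.

```python
import functools
import inspect
import time

TRACE_LOG = []  # hypothetical in-memory store; Weave logs to your W&B project instead


def trace_op(fn):
    """Record the inputs, output, and latency of every call to `fn`."""
    signature = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Bind positional and keyword arguments to their parameter names.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        TRACE_LOG.append({
            "op": fn.__name__,
            "inputs": dict(bound.arguments),
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output

    return wrapper


@trace_op
def fake_model(prompt: str, max_tokens: int = 400) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return prompt.upper()[:max_tokens]


fake_model("hello", max_tokens=3)
print(TRACE_LOG[0]["op"], TRACE_LOG[0]["inputs"], TRACE_LOG[0]["output"])
```

The wrapped function behaves exactly as before; the only change is that each call leaves a record behind, which is the property that makes a one-line decorator enough to get traces.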

Let's first try `Claude 3.5 Sonnet`:

In [6]:
model_id = 'anthropic.claude-3-5-sonnet-20240620-v1:0'

system_message = "You are an expert software engineer that knows a lot of programming. You prefer short answers."
messages = [{"role": "user", 
             "content": [
                 {"text": (
                         "In Bash, how do I list all text files in the current directory "
                         "(excluding subdirectories) that have been modified in the last month?")
                  }
                 ]
            }
            ]

outputs = call_model(model_id, messages, system_message)

🍩 https://wandb.ai/capecape/aws-genai/r/call/0192a04b-b303-7831-9543-aad716530296


In [7]:
pprint(outputs)

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '1580',
                                      'content-type': 'application/json',
                                      'date': 'Fri, 18 Oct 2024 15:42:58 GMT',
                                      'x-amzn-requestid': 'e3107ee0-c564-41fe-bf8f-7ce97c70d086'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'e3107ee0-c564-41fe-bf8f-7ce97c70d086',
                      'RetryAttempts': 0},
 'metrics': {'latencyMs': 9079},
 'output': {'message': {'content': [{'text': 'To list all text files in the '
                                             'current directory (excluding '
                                             'subdirectories) that have been '
                                             'modified in the last month using '
                                             'Bash, you can use the `find` '
                              

The text we care about is nested deep inside the response. Knowing this, we can refactor the code to be more concise:

In [8]:
@weave.op
def format_prompt(prompt: str) -> list[dict]:
    return [{"role": "user", "content": [{"text": prompt}]}]


@weave.op
def claude(model_id: str, prompt: str, max_tokens: int=400) -> str:
    messages = format_prompt(prompt)
    response = call_model(model_id, messages, system_message, max_tokens)
    return response["output"]["message"]["content"][0]["text"]
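
Indexing `["content"][0]["text"]` assumes the reply has exactly one text block. A slightly more defensive sketch is shown below, using a hand-written dictionary shaped like the response printed above. The helper names `extract_text` and `summarize_call` are hypothetical, the sample values are illustrative, and the `usage` key is an assumption based on the Converse API's documented response shape (it is not visible in the truncated output above).

```python
def extract_text(response: dict) -> str:
    """Join every text block in a Converse-style response message."""
    blocks = response["output"]["message"]["content"]
    return "".join(block["text"] for block in blocks if "text" in block)


def summarize_call(response: dict) -> dict:
    """Collect latency and token counts, tolerating missing fields."""
    usage = response.get("usage", {})
    return {
        "latency_ms": response.get("metrics", {}).get("latencyMs"),
        "input_tokens": usage.get("inputTokens"),
        "output_tokens": usage.get("outputTokens"),
    }


# Hand-written sample shaped like the printed response; values are illustrative.
sample_response = {
    "metrics": {"latencyMs": 9079},
    "output": {
        "message": {
            "role": "assistant",
            "content": [{"text": "Use `find`. "}, {"text": "See `man find`."}],
        }
    },
    "usage": {"inputTokens": 50, "outputTokens": 12, "totalTokens": 62},
}

text = extract_text(sample_response)
stats = summarize_call(sample_response)
print(text)
print(stats)
```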

In [9]:
prompt = ("Give me a super simple starting code in PyTorch for training of a diffusion model. " 
          "Use a minimal dataset like CIFAR10")
response = claude(model_id, prompt, max_tokens=2000)

🍩 https://wandb.ai/capecape/aws-genai/r/call/0192a04b-de93-7de1-a87b-9b43ffbb50bd


In [10]:
mprint(response)

Here's a super simple starting code for training a diffusion model using PyTorch and the CIFAR10 dataset. This example is minimal and doesn't include all the optimizations and complexities of a full-fledged diffusion model, but it should give you a good starting point:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Hyperparameters
batch_size = 64
num_epochs = 100
learning_rate = 1e-4
num_timesteps = 1000
beta_start = 0.0001
beta_end = 0.02

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load CIFAR10 dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Simple U-Net model
class SimpleUNet(nn.Module):
    def __init__(self):
        super(SimpleUNet, self).__init__()
        self.down1 = nn.Conv2d(3, 64, 3, padding=1)
        self.down2 = nn.Conv2d(64, 128, 3, padding=1)
        self.up1 = nn.ConvTranspose2d(128, 64, 3, padding=1)
        self.up2 = nn.ConvTranspose2d(64, 3, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x, t):
        t = t.unsqueeze(-1).unsqueeze(-1)
        x1 = self.act(self.down1(x + t))
        x2 = self.act(self.down2(x1 + t))
        x = self.act(self.up1(x2 + t))
        x = self.up2(x + x1 + t)
        return x

# Diffusion model
class DiffusionModel:
    def __init__(self):
        self.model = SimpleUNet().to(device)
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
        self.mse = nn.MSELoss()
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)

    def train_step(self, x):
        t = torch.randint(0, num_timesteps, (x.shape[0],)).to(device)
        noise = torch.randn_like(x)
        x_noisy = self.add_noise(x, t, noise)
        predicted_noise = self.model(x_noisy, t.float() / num_timesteps)
        loss = self.mse(noise, predicted_noise)
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        return loss.item()

    def add_noise(self, x, t, noise):
        return (
            self.alphas_cumprod[t, None, None, None].sqrt() * x +
            (1 - self.alphas_cumprod[t, None, None, None]).sqrt() * noise
        )

# Training loop
diffusion = DiffusionModel()

for epoch in range(num_epochs):
    total_loss = 0
    for batch in train_loader:
        x, _ = batch
        x = x.to(device)
        loss = diffusion.train_step(x)
        total_loss += loss
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

print("Training finished!")
```

This code does the following:

1. Sets up the necessary imports and hyperparameters.
2. Loads the CIFAR10 dataset.
3. Defines a simple U-Net model as the neural network for the diffusion process.
4. Creates a `DiffusionModel` class that handles the training process, including adding noise to images and predicting the noise.
5. Implements a training loop that runs for a specified number of epochs.

Note that this is a very basic implementation and lacks many features of state-of-the-art diffusion models, such as:

- More complex network architectures
- Attention mechanisms
- Improved sampling techniques
- Learning rate scheduling
- EMA (Exponential Moving Average) of model weights
- Advanced noise schedules

To create a more advanced diffusion model, you'd need to incorporate these features and possibly use a more sophisticated dataset. However, this simple example should give you a starting point to understand the basic concepts and structure of a diffusion model implementation.

## Using the `anthropic` Python SDK with Bedrock

The `anthropic` Python SDK offers a more convenient way of interacting with Claude on Bedrock.

In [11]:
# !pip install -U "anthropic[bedrock]" instructor

In [12]:
from anthropic import AnthropicBedrock

client = AnthropicBedrock(
    aws_region="us-east-1",
)

output_message = client.messages.create(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello, world"}]
)

🍩 https://wandb.ai/capecape/aws-genai/r/call/0192a04c-4e4b-79d3-99f7-77738e6a6442


In [13]:
mprint(output_message.content[0].text)

Hello! How can I assist you today? I'm here to help with any questions or topics you'd like to discuss.

We should still refactor this code into a higher-level function for calling Claude:

In [14]:
@weave.op
def format_prompt_anthropic(prompt: str) -> list[dict]:
    return [{"role": "user", "content": prompt}]

@weave.op
def call_claude_bis(prompt: str, model_id: str, max_tokens: int=400) -> str:
    "Call Bedrock Claude using the anthropic Python SDK"
    messages = format_prompt_anthropic(prompt)
    # use the arguments instead of hard-coded values
    response_body = client.messages.create(
        model=model_id,
        max_tokens=max_tokens,
        messages=messages)
    return response_body.content[0].text

In [15]:
response = call_claude_bis("How do I say: This is really handy, in french?", model_id)
mprint(response)

🍩 https://wandb.ai/capecape/aws-genai/r/call/0192a04c-5d4a-7eb1-aea8-25571a2daa0a


In French, you can say:

"C'est vraiment pratique."

Pronunciation guide:
- "C'est" is pronounced like "say"
- "vraiment" is pronounced "vray-mahn"
- "pratique" is pronounced "pra-teek"

This phrase translates directly to "This is really practical" or "This is really handy" in English. It's a common expression used to describe something that is very useful or convenient.

## Evaluation driven development

When working with LLMs, it is important to evaluate the quality of the model's responses.

We can use Weave [`Evaluation`](https://wandb.github.io/weave/tutorial-eval) to build rigorous, apples-to-apples evaluations for language model use cases.

Let's evaluate the models on the challenging [Factual Inconsistency Benchmark](https://arxiv.org/abs/2211.08412v1) (FIB) dataset. The task checks how well a model detects hallucinations by identifying inconsistencies between a piece of text and a "summary" of it.

In [16]:
import json
import random
import instructor
from pydantic import BaseModel
from pathlib import Path

DATA_PATH = Path("./data")
NUM_SAMPLES = 20

def read_jsonl(path):
    "returns a list of dictionaries"
    with open(path, 'r') as file:
        return [json.loads(line) for line in file]

fib_ds = random.sample(read_jsonl(DATA_PATH / "fib-val.jsonl"), NUM_SAMPLES)
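
Because `random.sample` is unseeded here, each run of the notebook evaluates a different subset. If you want reruns to score the same rows, one option is a dedicated seeded generator. The sketch below is an illustration under assumptions: `sample_rows` and the seed value are hypothetical, and the demo rows (including the `target` field name) are made up so the snippet runs without the real `fib-val.jsonl` file.

```python
import json
import random
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return a list of dictionaries, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


def sample_rows(rows, k, seed=42):
    """Draw `k` rows with a dedicated RNG so reruns pick the same rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Build a tiny throwaway JSONL file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonl"
    demo_rows = [
        {"premise": f"doc {i}", "hypothesis": f"summary {i}", "target": i % 2}
        for i in range(10)
    ]
    demo_path.write_text(
        "\n".join(json.dumps(row) for row in demo_rows) + "\n", encoding="utf-8"
    )
    rows = read_jsonl(demo_path)

first = sample_rows(rows, k=4)
second = sample_rows(rows, k=4)
print(first == second)
```

Using `random.Random(seed)` instead of `random.seed(...)` keeps the global random state untouched, so other cells are not affected.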

In [17]:
fib_ds[3]

{'premise': 'More places will also be made available at all of Scotland\'s teacher education universities.\nThe increase of 60 primary and 200 secondary student teacher places will bring the total intake next year to 3,490.\nThe government said it was the fifth consecutive annual increase.\nA campaign was launched in September to try to encourage more people to enter the teaching profession in Scotland.\nThe Scottish government\'s #inspiringteachers campaign is focusing on science, technology, engineering and maths.\nMinisters are also asking the new Strategic Board for Teacher Education to consider whether further actions are needed "to make sure we have the right numbers of teachers in our schools".\nIn September, the leaders of seven councils called for a national taskforce to be set up to help deal with teacher recruitment problems.\nThey made the call at a summit on tackling teacher shortages in northern and rural parts of Scotland.\nStudent teacher places next year:\n• 1,230 post

In [18]:
fib_prompt = """You are an expert to detect factual inconsistencies and hallucinations. 
You will be given a document and a summary.
- Carefully read the full document and the provided summary.
- Identify Factual Inconsistencies: any statements in the summary that are not supported by or contradict the information in the document.
Factually Inconsistent: If any statement in the summary is not supported by or contradicts the document, label it as 0
Factually Consistent: If all statements in the summary are supported by the document, label it as 1

Highlight or list the specific statements in the summary that are inconsistent.
Provide a brief explanation of why each highlighted statement is inconsistent with the document.

Return in JSON format with `consistency` and a `reason` for the given choice. Encode special chars properly.

Document: 
{premise}
Summary: 
{hypothesis}
"""

We will use [`instructor`](https://github.com/jxnl/instructor) to get consistent, structured output from Claude.

In [19]:
class ModelOutput(BaseModel):
    consistency: int
    reason: str

# we need to patch the client with instructor to get structured output
inst_client = instructor.from_anthropic(client)

premise = fib_ds[3]["premise"]
hypothesis = fib_ds[3]["hypothesis"]

prompt = fib_prompt.format(premise=premise, hypothesis=hypothesis)
messages = format_prompt_anthropic(prompt)

🍩 https://wandb.ai/capecape/aws-genai/r/call/0192a04c-6f52-7b31-8f08-967a39027b3a
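
The point of a schema such as `ModelOutput` is that a malformed reply is rejected instead of silently flowing into the evaluation. As a rough standard-library illustration of that kind of check (not how `instructor` or Pydantic work internally), the hypothetical `parse_model_output` below parses a raw JSON string and enforces the two fields and the 0/1 label. The example strings are made up.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedOutput:
    consistency: int
    reason: str


def parse_model_output(raw: str) -> ParsedOutput:
    """Parse a JSON reply and check it has a 0/1 label and a text reason."""
    data = json.loads(raw)
    consistency = data["consistency"]
    reason = data["reason"]
    # bool is a subclass of int in Python, so reject it explicitly.
    is_plain_int = isinstance(consistency, int) and not isinstance(consistency, bool)
    if not is_plain_int or consistency not in (0, 1):
        raise ValueError(f"consistency must be 0 or 1, got {consistency!r}")
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")
    return ParsedOutput(consistency=consistency, reason=reason)


good = parse_model_output('{"consistency": 1, "reason": "All claims are supported."}')
print(good)

try:
    parse_model_output('{"consistency": "yes", "reason": "unclear"}')
except ValueError as err:
    print("rejected:", err)
```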


In [20]:
pprint(messages)

[{'content': 'You are an expert to detect factual inconsistencies and '
             'hallucinations. \n'
             'You will be given a document and a summary.\n'
             '- Carefully read the full document and the provided summary.\n'
             '- Identify Factual Inconsistencies: any statements in the '
             'summary that are not supported by or contradict the information '
             'in the document.\n'
             'Factually Inconsistent: If any statement in the summary is not '
             'supported by or contradicts the document, label it as 0\n'
             'Factually Consistent: If all statements in the summary are '
             'supported by the document, label it as 1\n'
             '\n'
             'Highlight or list the specific statements in the summary that '
             'are inconsistent.\n'
             'Provide a brief explanation of why each highlighted statement is '
             'inconsistent with the document.\n'
             '\n'
   

In [29]:
class ClaudeJudge(weave.Model):
    model_id: str
    max_tokens: int=1000
    system_message: str = "You are a helpful assistant and expert on extracting information in JSON format. Encode special chars properly."
    prompt_template: str = fib_prompt

    @weave.op
    def apply_prompt_template(self, premise:str, hypothesis:str) -> str:
        return self.prompt_template.format(premise=premise, hypothesis=hypothesis)

    @weave.op
    def predict(self, premise:str, hypothesis:str) -> dict:
        prompt = self.apply_prompt_template(premise, hypothesis)
        messages = format_prompt_anthropic(prompt)
        structured_output = inst_client.messages.create(
            model=self.model_id,
            max_tokens=self.max_tokens,
            system=self.system_message,
            messages=messages,
            response_model=ModelOutput)
        return structured_output.model_dump()

Let's try the model on a single sample

In [30]:
haiku = ClaudeJudge(model_id="anthropic.claude-3-haiku-20240307-v1:0")
response = haiku.predict(premise=fib_ds[3]["premise"], hypothesis=fib_ds[3]["hypothesis"])
pprint(response)

🍩 https://wandb.ai/capecape/aws-genai/r/call/0192a04d-c9b8-7ce0-a577-d820aa9296ff
{'consistency': 1,
 'reason': 'The summary accurately reflects the key information provided in '
           'the document. It states that the Scottish government is making '
           'more places available to train an extra 260 teachers next year, '
           'which aligns with the details in the document about the increase '
           'in primary and secondary student teacher places.'}


### Running the Evaluation

To do an evaluation, we will need:
- A dataset
- A model to evaluate
- A scorer function

In [31]:
def accuracy(model_output, target):
    class_model_output = model_output.get('consistency') if model_output else None
    return {"accuracy": class_model_output == target}
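
It can help to call the scorer by hand before running a full evaluation. The sketch below repeats the scorer so it runs on its own and feeds it made-up predictions. It assumes each dataset row carries its label in a `target` field, matching the scorer's argument name.

```python
def accuracy(model_output, target):
    # Same scorer as above, repeated so this snippet runs on its own.
    class_model_output = model_output.get('consistency') if model_output else None
    return {"accuracy": class_model_output == target}


# Made-up predictions, for illustration only.
correct = accuracy({"consistency": 1, "reason": "supported"}, target=1)
wrong = accuracy({"consistency": 0, "reason": "contradiction"}, target=1)
failed = accuracy(None, target=1)  # e.g. the prediction raised an error

print(correct, wrong, failed)
```

Note that a missing prediction is scored as incorrect rather than skipped, so failed calls lower the reported accuracy.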

In [32]:
evaluation = weave.Evaluation(dataset=fib_ds, scorers=[accuracy])

In [33]:
await evaluation.evaluate(haiku)

{'model_output': {'consistency': {'mean': 0.6}},
 'accuracy': {'accuracy': {'true_count': 14, 'true_fraction': 0.7}},
 'model_latency': {'mean': 4.5104421973228455}}

Let's try its bigger brother, `Claude 3.5 Sonnet`:

In [34]:
sonnet = ClaudeJudge(model_id="anthropic.claude-3-5-sonnet-20240620-v1:0")
await evaluation.evaluate(sonnet)

{'model_output': {'consistency': {'mean': 0.4}},
 'accuracy': {'accuracy': {'true_count': 16, 'true_fraction': 0.8}},
 'model_latency': {'mean': 8.396758449077605}}
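
With only 20 samples, the gap between the two models is two examples, so it is worth attaching an uncertainty to each score. The sketch below uses the `true_count` values from the two summaries above and the standard binomial standard error; `accuracy_with_stderr` is a hypothetical helper, not part of Weave.

```python
import math


def accuracy_with_stderr(true_count: int, n: int) -> tuple[float, float]:
    """Return accuracy and its binomial standard error sqrt(p * (1 - p) / n)."""
    p = true_count / n
    return p, math.sqrt(p * (1 - p) / n)


# true_count values copied from the two evaluation summaries; n is NUM_SAMPLES.
results = {"haiku": 14, "sonnet": 16}
n = 20

for name, true_count in results.items():
    acc, stderr = accuracy_with_stderr(true_count, n)
    print(f"{name}: accuracy={acc:.2f} +/- {stderr:.2f}")
```

The two intervals overlap substantially, which suggests increasing `NUM_SAMPLES` before concluding that one model is the better judge.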