# Compare LLMs

Creating programs that rely on LLMs is an iterative process, so we need a workflow to compare each new pipeline, prompt, or technique systematically.

In this tutorial, we'll experiment with models and create a workflow using Weave to:
- Run the same evaluation set for every new program and store the responses, token counts, etc.
- Display a table with any two pipelines' responses side-by-side, with the ability to page through examples and group/sort/filter from the UI
- Display a bar chart to compare each metric, such as the total token count

# Evaluate different pipelines

So that we're comparing apples with apples, we'll create a single evaluation dataset for our pipelines. We'll save it to Weave so that any time we update it, it will be versioned and we can easily get the latest version by name.

In [None]:
import weave
import random

classes = ['positive', 'negative', 'neutral']
prompts = ["I absolutely love this product!", 
           "I'm really disappointed with this service.", 
           "The movie was just average."]
labels = ['positive', 'negative', 'neutral']
dataset = [{'prompt': prompt, 'label': label} for prompt, label in zip(prompts, labels)]
dataset_name = 'classification_test_set'
test_set = weave.save(dataset, name=dataset_name)

For this example, we'll just mock a few pipelines, which also saves us from using up our LLM budget while we build out the example.

In [None]:
def pipeline(prompt: str) -> tuple:
    # Mock pipeline: returns a random (response, latency, tokens) triple.
    latency = random.uniform(0, 10)
    tokens = random.choice(range(0, 10))
    response = random.choice(classes)
    return response, latency, tokens

pipeline_1 = pipeline_2 = pipeline
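
The mock above returns random values. For a real pipeline, the same `(response, latency, tokens)` shape could be produced by timing the model call. The following is a minimal sketch under stated assumptions: `fake_llm` and `timed_pipeline` are hypothetical names that are not part of Weave, and the whitespace word count is only a rough stand-in for a real token count.

```python
import time
from typing import Callable, Tuple


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always answers 'neutral'."""
    return "neutral"


def timed_pipeline(llm: Callable[[str], str], prompt: str) -> Tuple[str, float, int]:
    """Call `llm` and return (response, latency in seconds, rough token count)."""
    start = time.perf_counter()
    response = llm(prompt)
    latency = time.perf_counter() - start
    # Assumption: whitespace-separated words approximate tokens. A real
    # pipeline would use the count reported by its tokenizer or API.
    tokens = len(prompt.split()) + len(response.split())
    return response, latency, tokens


response, latency, tokens = timed_pipeline(fake_llm, "The movie was just average.")
```

Because the return shape matches the mock, a function like this could be passed wherever `pipeline` is used in this tutorial once it is bound to a specific model call.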

Now that we have our pipelines defined, we can run them through our data and capture the responses. Here is where we'll capture other metrics we care about like latency and token count so we can use them for comparison later. 

In [None]:
test_set = weave.get(f"local-artifact:///{dataset_name}:latest/obj")

In [None]:
def evaluate(pipeline, pipeline_name):
    # Run every example in the saved test set through the pipeline.
    outs = []
    dataset_name = 'classification_test_set'
    test_set = weave.get(f"local-artifact:///{dataset_name}:latest/obj")
    for example in test_set.val:
        prompt, label = example['prompt'], example['label']
        response, latency, tokens = pipeline(prompt)
        outs.append({'label': label,
                     'prompt': prompt,
                     'response': response,
                     'latency': latency,
                     'tokens': tokens})
    # Save the collected rows under the pipeline's name.
    return weave.save(outs, name=pipeline_name)

pipeline_1_w = evaluate(pipeline_1, 'pipeline_1')
pipeline_2_w = evaluate(pipeline_2, 'pipeline_2')

We use `weave.save(<responses and metrics>, name=<name>)` to save the responses and metrics to Weave. We can choose any name for each pipeline, and we collect the responses and metrics in a list of Python dictionaries.

Weave picks a panel type based on the shape of the data. A list of dictionaries is shown as a **Weave Table** panel, so we can view `pipeline_1_w` as a table.
To view a panel in a notebook, enter just the variable name on a line by itself and run the cell:

In [None]:
pipeline_1_w

In this table, you can page through examples, filter results, and even create new columns by using data in other columns.
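
As a plain-Python illustration of a column derived from other columns, the sketch below adds a `correct` field and filters on it. The rows are hypothetical sample data in the same shape that `evaluate` collects. This is not how the Weave UI computes columns, only the equivalent logic written by hand.

```python
# Hypothetical rows in the same shape that `evaluate` collects.
rows = [
    {"label": "positive", "prompt": "I absolutely love this product!",
     "response": "positive", "latency": 1.2, "tokens": 4},
    {"label": "negative", "prompt": "I'm really disappointed with this service.",
     "response": "neutral", "latency": 0.8, "tokens": 7},
    {"label": "neutral", "prompt": "The movie was just average.",
     "response": "neutral", "latency": 2.5, "tokens": 3},
]

# Derived column: True when the response matches the label.
with_correct = [{**row, "correct": row["response"] == row["label"]} for row in rows]

# Filter down to the prompts of the misclassified examples.
mistakes = [row["prompt"] for row in with_correct if not row["correct"]]

# Fraction of examples where the response matched the label.
accuracy = sum(row["correct"] for row in with_correct) / len(with_correct)
```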

# Join prediction tables

Because we want to compare our pipelines for each prompt, we'll use a **Weave Op** (operation) to join the two tables on the `prompt` column. We'll give the pipelines the aliases `1` and `2`.

In [None]:
joined = weave.ops.join_2(
    pipeline_1_w,
    pipeline_2_w,
    lambda row: row['prompt'], 
    lambda row: row['prompt'],
    '1',
    '2',
    False,
    False)

joined
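
To make the join concrete, here is a plain-Python sketch of an inner join on a key with two aliases. It assumes that the two trailing `False` arguments above mean "no outer join" on either side, which this tutorial does not state explicitly. `inner_join` and the sample rows are hypothetical, and the sketch illustrates the idea rather than Weave's implementation.

```python
def inner_join(left, right, key_left, key_right, alias_left, alias_right):
    """Pair every left row with every right row that shares its key."""
    # Index the right-hand rows by key so each lookup is a dict access.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(key_right(row), []).append(row)

    joined = []
    for row in left:
        # Rows without a match on the other side are dropped (inner join).
        for match in right_by_key.get(key_left(row), []):
            joined.append({alias_left: row, alias_right: match})
    return joined


# Hypothetical outputs from two pipelines; "p3" has no match on the right.
left = [
    {"prompt": "p1", "response": "positive"},
    {"prompt": "p2", "response": "neutral"},
    {"prompt": "p3", "response": "negative"},
]
right = [
    {"prompt": "p1", "response": "negative"},
    {"prompt": "p2", "response": "neutral"},
]

joined_rows = inner_join(
    left, right,
    lambda row: row["prompt"], lambda row: row["prompt"],
    "1", "2",
)

# Do the two pipelines agree on each shared prompt?
agreement = [row["1"]["response"] == row["2"]["response"] for row in joined_rows]
```

In this sketch each joined row nests the original rows under their aliases, so `row["1"]["response"]` plays the role of the `"1.response"` key used with the Weave table.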

We only want to display the prompts, labels, and each pipeline's response, so we'll use `weave.panels.Table`. The `columns` argument may look a bit complex, but it is a flexible way to define which columns appear in the table: each entry is a function that takes a row and returns the value to display.

In [None]:
table = weave.panels.Table(
    joined,
    columns=[
        lambda row: row["1.label"],
        lambda row: row["1.prompt"],
        lambda row: row["1.response"],
        lambda row: row["1.response"] == row["1.label"],
        lambda row: row["2.response"],
        lambda row: row["2.response"] == row["2.label"],
    ],
)
table

In [None]:
latency_bar = weave.ops.dict_(
    pipeline_1=joined["1.latency"].avg(),
    pipeline_2=joined["2.latency"].avg(),
)
latency_bar

This might seem a bit magical, but it's the same behavior we saw above: Weave makes a best guess at how to display your data. Here, we created a dict like `{'pipeline_1': metric_a, 'pipeline_2': metric_b}`, and Weave chose to display it as a bar chart.

In [None]:
token_count_bar = weave.ops.dict_(
    pipeline_1=joined["1.tokens"].sum(),
    pipeline_2=joined["2.tokens"].sum(),
)
token_count_bar

Again, we created a dictionary with Weave, and it was displayed as a bar chart.
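
For a quick sanity check of the numbers behind these charts, the same aggregates can be computed by hand. The sketch below uses hypothetical rows and only the standard library. It mirrors the `{'pipeline_1': ..., 'pipeline_2': ...}` shape of the dictionaries above but does not call Weave.

```python
from statistics import mean

# Hypothetical per-example metrics for two pipelines.
pipeline_1_rows = [{"latency": 1.0, "tokens": 4}, {"latency": 3.0, "tokens": 6}]
pipeline_2_rows = [{"latency": 2.0, "tokens": 5}, {"latency": 6.0, "tokens": 1}]

# Average latency per pipeline, one dictionary entry per bar.
latency_avg = {
    "pipeline_1": mean(row["latency"] for row in pipeline_1_rows),
    "pipeline_2": mean(row["latency"] for row in pipeline_2_rows),
}

# Total token count per pipeline.
token_sum = {
    "pipeline_1": sum(row["tokens"] for row in pipeline_1_rows),
    "pipeline_2": sum(row["tokens"] for row in pipeline_2_rows),
}
```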

# Putting it all together in a Board

We want to easily jump between comparing different pipelines, so we need a way to change which ones we're comparing. To do this, we'll use a **Weave Board**.

Weave Boards can have variables, or **vars**, which you can change dynamically to make your panels and plots update.
Here, we'll define which pipelines we're comparing, and we can change the local artifact path to load different ones. We'll put each of our panels in a **BoardPanel** and we'll define the layout.

In [None]:
weave.panels.Board(
    vars={
        "pipeline_1": pipeline_1_w,
        "pipeline_2": pipeline_2_w,
        "joined": lambda pipeline_1, pipeline_2: weave.ops.join_2(
            pipeline_1,
            pipeline_2,
            lambda row: row['prompt'],
            lambda row: row['prompt'],
            '1',
            '2',
            False,
            False,
        ),
    },
    panels=[
        weave.panels.BoardPanel(
            lambda joined: weave.panels.Table(
                joined,
                columns=[
                    lambda row: row["1.label"],
                    lambda row: row["1.prompt"],
                    lambda row: row["1.response"],
                    lambda row: row["1.response"] == row["1.label"],
                    lambda row: row["2.response"],
                    lambda row: row["2.response"] == row["2.label"],
                ],
            ),
            layout=weave.panels.BoardPanelLayout(x=0, y=0, w=24, h=9),
        ),
        weave.panels.BoardPanel(
            lambda joined: weave.ops.dict_(
                pipeline_1=joined["1.tokens"].sum(),
                pipeline_2=joined["2.tokens"].sum(),
            ),
            layout=weave.panels.BoardPanelLayout(x=0, y=9, w=12, h=8),
        ),
        weave.panels.BoardPanel(
            lambda joined: weave.ops.dict_(
                pipeline_1=joined["1.latency"].avg(),
                pipeline_2=joined["2.latency"].avg(),
            ),
            layout=weave.panels.BoardPanelLayout(x=12, y=9, w=12, h=8),
        ),
    ],
)
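
In the Board above, the `joined` var is a lambda whose parameter names match earlier vars, and each panel is a lambda that takes `joined`. One way to picture this dependency pattern is the plain-Python sketch below. `resolve_vars` is a hypothetical helper, and eager evaluation in insertion order is an assumption made for illustration. It is not a description of how Weave resolves vars internally.

```python
import inspect


def resolve_vars(vars_spec):
    """Resolve vars in insertion order; callables receive earlier vars by name."""
    resolved = {}
    for name, value in vars_spec.items():
        if callable(value):
            # Look up each parameter name among the vars resolved so far.
            params = inspect.signature(value).parameters
            value = value(**{param: resolved[param] for param in params})
        resolved[name] = value
    return resolved


# Hypothetical vars: two tiny "tables" and a var derived from both.
resolved = resolve_vars({
    "pipeline_1": [{"prompt": "a", "response": "positive"}],
    "pipeline_2": [{"prompt": "a", "response": "negative"}],
    "joined": lambda pipeline_1, pipeline_2: list(zip(pipeline_1, pipeline_2)),
})
```

Swapping the value of `pipeline_1` or `pipeline_2` and resolving again yields a new `joined`, which is the effect the Board provides when a var is changed.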

You can now open the Board in a new tab to make it full screen. Hover over the right side of the panel to expand a drawer menu and click the arrow to "Open in a new tab".

# Conclusion

And that's it. We've created a **Weave Board** to compare our pipeline responses, token counts, and latency. We've seen how Weave intelligently decides how to display data, whether that's in a table, bar chart or some other type of panel. We've learned how to define **Weave Panels** ourselves and how to define computations using **Weave Ops**.