# Compare LLMs

Creating programs that rely on LLMs is an iterative process, so we need a workflow to compare each new pipeline, prompt, or technique systematically.

In this tutorial, we'll experiment with models and create a workflow using Weave to:
- Run the same evaluation set on every new program and store responses, token counts, etc.
- Display a table with any two pipelines' responses side by side, with the ability to page through examples and group, sort, and filter in the UI
- Display a bar chart to compare each metric, like token count sum

# Evaluate different pipelines

So that we're comparing apples with apples, we'll create an evaluation dataset to run our pipelines on.

In [12]:
import weave
weave.use_frontend_devmode()
import random

classes = ['positive', 'negative', 'neutral']
prompts = ["I absolutely love this product!", 
           "I'm really disappointed with this service.", 
           "The movie was just average."]
labels = ['positive', 'negative', 'neutral']

For this example, we'll just mock a few pipelines, which also saves us from using up our LLM budget while we build it out.

In [77]:
def pipeline(prompt: str) -> tuple[str, float, int]:
    # Mock pipeline: ignores the prompt and returns random values
    latency = random.uniform(0, 10)
    tokens = random.choice(range(0, 10))
    response = random.choice(classes)
    return response, latency, tokens

pipeline_1 = pipeline_2 = pipeline
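
Because `pipeline_1` and `pipeline_2` above are the same function sharing the global random state, one optional variation is to give each mock its own seeded generator so that runs are repeatable. The sketch below assumes a hypothetical helper, `make_mock_pipeline`, which is not part of Weave or of this notebook; `mock_a` and `mock_b` are likewise illustrative names.

```python
import random

CLASSES = ["positive", "negative", "neutral"]


def make_mock_pipeline(seed: int, max_latency: float = 10.0, max_tokens: int = 10):
    """Hypothetical helper: build a mock pipeline with its own seeded RNG."""
    rng = random.Random(seed)  # private generator, independent of other mocks

    def mock_pipeline(prompt: str) -> tuple[str, float, int]:
        latency = rng.uniform(0, max_latency)  # fake latency value
        tokens = rng.randrange(0, max_tokens)  # fake token count in [0, max_tokens)
        response = rng.choice(CLASSES)         # fake predicted class
        return response, latency, tokens

    return mock_pipeline


# Two mocks with different seeds produce different, but repeatable, outputs.
mock_a = make_mock_pipeline(seed=1)
mock_b = make_mock_pipeline(seed=2)
```

Either mock could stand in for `pipeline_1` or `pipeline_2`, since it returns the same `(response, latency, tokens)` tuple.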

Now that we have our pipelines defined, we can run our evaluation prompts through them and capture the responses. Here we also capture the other metrics we care about, like latency and token count, so we can use them for comparison later.

In [78]:
def evaluate(pipeline, pipeline_name):
    outs = []
    for prompt, label in zip(prompts, labels):
        response, latency, tokens = pipeline(prompt)
        outs.append({'label': label,
                     'prompt': prompt,
                     'response': response,
                     'latency': latency,
                     'tokens': tokens})
    return weave.save(outs, name=pipeline_name)

pipeline_1_w = evaluate(pipeline_1, 'pipeline_1')
pipeline_2_w = evaluate(pipeline_2, 'pipeline_2')

We use `weave.save(<responses and metrics>, name=<name>)` to save the results to Weave. We choose the pipeline name ourselves and collect the responses and metrics in a list of Python dictionaries.
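
When a mock is swapped for a real pipeline, latency has to be measured rather than sampled. A minimal sketch, assuming the pipeline is a plain function from prompt to response; `timed_call` and `echo_pipeline` are hypothetical names, not Weave APIs:

```python
import time
from typing import Callable


def timed_call(fn: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Call fn(prompt) and return (response, elapsed time in seconds)."""
    start = time.perf_counter()
    response = fn(prompt)
    elapsed = time.perf_counter() - start
    return response, elapsed


def echo_pipeline(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call
    return "neutral"


response, latency = timed_call(echo_pipeline, "The movie was just average.")
```

The measured `latency` could then be stored in the same per-example dictionary that `evaluate` builds.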

Weave intelligently decides how to display our data, so we can view `pipeline_1_w` in a `weave` table. In a notebook, run:

In [79]:
pipeline_1_w

In this table, you can page through examples, filter results and even create new columns by using data in other columns.

# Join Prediction Tables

Because we want to compare our pipelines for each prompt, we'll use a `weave` Op (operation) to join the two tables on the `prompt`. We'll give the pipelines aliases `1` & `2`.

In [81]:
joined = weave.ops.join_2(
    pipeline_1_w,
    pipeline_2_w,
    lambda row: row['prompt'], 
    lambda row: row['prompt'],
    '1',
    '2',
    False,
    False)

joined
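
To make the join concrete, here is a plain-Python sketch of an inner join on `prompt`. It is not Weave's implementation: `inner_join` is a hypothetical helper, and nesting each side under its alias is an assumption about the result's shape, inferred from lookups such as `row["1.prompt"]` used in this tutorial.

```python
def inner_join(rows_1, rows_2, key_1, key_2, alias_1, alias_2):
    """Pair rows from two lists whose join keys match (inner join)."""
    # Index the second table by join key for fast lookup
    index = {}
    for row in rows_2:
        index.setdefault(key_2(row), []).append(row)

    joined_rows = []
    for row in rows_1:
        for match in index.get(key_1(row), []):
            # Keep each side under its alias so columns do not collide
            joined_rows.append({alias_1: row, alias_2: match})
    return joined_rows


rows_a = [{"prompt": "p1", "response": "positive"},
          {"prompt": "p2", "response": "neutral"}]
rows_b = [{"prompt": "p2", "response": "negative"},
          {"prompt": "p3", "response": "neutral"}]

joined_rows = inner_join(rows_a, rows_b,
                         lambda row: row["prompt"],
                         lambda row: row["prompt"],
                         "1", "2")
# Only "p2" appears in both inputs, so joined_rows has a single entry.
```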

We only want to display our prompts, labels, and each pipeline's response, so we'll use `weave.panels.Table`. The `columns` argument may look a bit strange, but it's how we define which columns appear in the table: each lambda takes a row and returns the value to show.

In [82]:
table = weave.panels.Table(
    joined,
    columns=[
        lambda row: row["1.label"],
        lambda row: row["1.prompt"],
        lambda row: row["1.response"],
        lambda row: row["1.response"] == row["1.label"],
        lambda row: row["2.response"],
        lambda row: row["2.response"] == row["2.label"],
    ],
)
table

Now we have a joined table with only the columns we want displayed, and even some computed columns that check whether each response matches its label.
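
The response-equals-label columns can also be reduced to a single accuracy number per pipeline. The following is a plain-Python sketch rather than a Weave op; the `accuracy` helper and the nested, alias-keyed row layout are assumptions made for illustration.

```python
def accuracy(rows, alias):
    """Fraction of rows where the aliased pipeline's response equals its label."""
    if not rows:
        return 0.0
    correct = sum(row[alias]["response"] == row[alias]["label"] for row in rows)
    return correct / len(rows)


# Hand-written joined rows, keyed by pipeline alias (illustrative data only)
joined_rows = [
    {"1": {"label": "positive", "response": "positive"},
     "2": {"label": "positive", "response": "neutral"}},
    {"1": {"label": "negative", "response": "negative"},
     "2": {"label": "negative", "response": "negative"}},
    {"1": {"label": "neutral", "response": "neutral"},
     "2": {"label": "neutral", "response": "neutral"}},
]

accuracy_1 = accuracy(joined_rows, "1")  # all three responses match
accuracy_2 = accuracy(joined_rows, "2")  # two of three responses match
```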

## Compare average latency & token count

We also want a bar chart to compare our metrics so we'll create a dictionary for our computed metrics.

In [83]:
latency_bar = weave.ops.dict_(
    pipeline_1=joined["1.latency"].avg(),
    pipeline_2=joined["2.latency"].avg(),
)
latency_bar

This might seem a bit like magic, but it's the same behavior we saw above: `weave` makes a best guess at how to display your data. Here, we created a dict like `{'pipeline_1': metric_a, 'pipeline_2': metric_b}` and it chooses to display it as a bar chart.

In [84]:
token_count_bar = weave.ops.dict_(
    pipeline_1=joined["1.tokens"].sum(),
    pipeline_2=joined["2.tokens"].sum(),
)
token_count_bar

Again, we created a dictionary with weave, and it knew to display it as a bar chart.
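
For reference, the two aggregations amount to the following in plain Python. This is only a sketch of the arithmetic, not of how Weave evaluates it; `summarize` and the sample rows are hypothetical.

```python
from statistics import mean

# Illustrative joined rows, keyed by pipeline alias
joined_rows = [
    {"1": {"latency": 2.0, "tokens": 4}, "2": {"latency": 1.0, "tokens": 7}},
    {"1": {"latency": 4.0, "tokens": 6}, "2": {"latency": 3.0, "tokens": 1}},
]


def summarize(rows, metric, reducer):
    """Apply reducer to one metric per pipeline and return a name -> value dict."""
    return {
        f"pipeline_{alias}": reducer([row[alias][metric] for row in rows])
        for alias in ("1", "2")
    }


latency_avg = summarize(joined_rows, "latency", mean)
token_sum = summarize(joined_rows, "tokens", sum)
print(latency_avg)  # {'pipeline_1': 3.0, 'pipeline_2': 2.0}
print(token_sum)    # {'pipeline_1': 10, 'pipeline_2': 8}
```

Each result has the same `{'pipeline_1': ..., 'pipeline_2': ...}` shape as the dictionaries shown as bar charts above.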

## Putting it all together in a Board

Because we want to easily jump between comparing different pipelines, we need some way to change which pipelines we're comparing. To do this, we'll use a Weave Board.

Weave Boards can have variables (`vars`), which you can change dynamically to make your panels and plots update.
Here, we'll define which pipelines we're comparing, and we can change the local artifact path to load different ones. We'll put each of our panels in a `BoardPanel` and define its layout.

In [85]:
weave.panels.Board(
    vars={
        "pipeline_1": pipeline_1_w,
        "pipeline_2": pipeline_2_w,
        "joined": lambda pipeline_1, pipeline_2: weave.ops.join_2(
            pipeline_1,
            pipeline_2,
            lambda row: row['prompt'],
            lambda row: row['prompt'],
            '1',
            '2',
            False,
            False,
        ),
    },
    panels=[
        weave.panels.BoardPanel(
            lambda joined: weave.panels.Table(
                joined,
                columns=[
                    lambda row: row["1.label"],
                    lambda row: row["1.prompt"],
                    lambda row: row["1.response"],
                    lambda row: row["1.response"] == row["1.label"],
                    lambda row: row["2.response"],
                    lambda row: row["2.response"] == row["2.label"],
                ],
            ),
            layout=weave.panels.BoardPanelLayout(x=0, y=0, w=24, h=9),
        ),
        weave.panels.BoardPanel(
            lambda joined: weave.ops.dict_(
                pipeline_1=joined["1.tokens"].sum(),
                pipeline_2=joined["2.tokens"].sum(),
            ),
            layout=weave.panels.BoardPanelLayout(x=0, y=9, w=12, h=8),
        ),
        weave.panels.BoardPanel(
            lambda joined: weave.ops.dict_(
                pipeline_1=joined["1.latency"].avg(),
                pipeline_2=joined["2.latency"].avg(),
            ),
            layout=weave.panels.BoardPanelLayout(x=12, y=9, w=12, h=8),
        ),
    ],
)

You can now open the Board in a new tab to make it full screen.
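
In the Board above, the `joined` variable is a lambda whose parameter names match the other variables, and each panel is a lambda taking `joined`. The idea of wiring values together by name can be sketched in plain Python as follows. This is only an illustration of the pattern, not how Weave resolves `vars` internally; `resolve_vars` is a hypothetical helper.

```python
import inspect


def resolve_vars(var_defs):
    """Evaluate variables in order; callables receive earlier variables by parameter name."""
    resolved = {}
    for name, value in var_defs.items():
        if callable(value):
            params = inspect.signature(value).parameters
            # Assumes every parameter names a variable defined earlier in var_defs
            value = value(**{param: resolved[param] for param in params})
        resolved[name] = value
    return resolved


var_defs = {
    "pipeline_1": [1, 2, 3],
    "pipeline_2": [4, 5, 6],
    "paired": lambda pipeline_1, pipeline_2: list(zip(pipeline_1, pipeline_2)),
}

first = resolve_vars(var_defs)

# Changing an input variable and resolving again updates the derived one.
var_defs["pipeline_2"] = [7, 8, 9]
second = resolve_vars(var_defs)
```

Here `first["paired"]` pairs the original lists, while `second["paired"]` reflects the changed `pipeline_2`, which mirrors how changing a Board variable updates the panels that depend on it.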

## Conclusion

And that's it. We've created a Weave Board to compare our pipelines' responses, token counts, and latency. We've seen how `weave` intelligently decides how to display data, whether that's in a table, a bar chart, or some other panel. We've learned how to define `panels` ourselves and how to define computations using weave `Ops`.