# Compare LLMs

Building programs that rely on LLMs is an iterative process, so we need a workflow to systematically compare each new pipeline, prompt, or technique.

In this tutorial, we'll experiment on models and create a workflow using Weave to:
- Run the same evaluation set on every new program and store responses, token counts, etc.
- Display a table with any two pipelines' responses side-by-side, with the ability to page through examples and group/sort/filter in the UI
- Display a bar chart to compare each metric, such as the total token count

# Evaluate different pipelines

So that we're comparing apples with apples, we'll create an evaluation dataset to run our pipelines on.

In [2]:
import weave
import random

classes = ['positive', 'negative', 'neutral']
prompts = ["I absolutely love this product!", 
           "I'm really disappointed with this service.", 
           "The movie was just average."]
labels = ['positive', 'negative', 'neutral']

TODO: save and load with weave so the data is versioned

For this example, we'll just mock a few pipelines. This also saves us from using up our LLM budget while we build out the example.

In [18]:
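from typing import Tuple

def pipeline(prompt: str) -> Tuple[str, float, int]:
    # Mock pipeline: ignores the prompt and returns random values.
    latency = random.uniform(0, 10)       # simulated latency
    tokens = random.choice(range(0, 10))  # simulated token count
    response = random.choice(classes)     # simulated predicted class
    return response, latency, tokens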

pipeline_1 = pipeline_2 = pipeline
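
Because `pipeline_1` and `pipeline_2` above are the same function, their outputs differ only by chance. One way to make the mocks distinct and reproducible is to give each its own seeded random generator. This is a minimal sketch: `make_mock_pipeline`, its `seed` parameter, and the names `mock_a` and `mock_b` are hypothetical helpers and are not part of Weave or the original notebook.

```python
import random

classes = ['positive', 'negative', 'neutral']

def make_mock_pipeline(seed: int):
    """Return a mock pipeline backed by its own seeded random generator (hypothetical helper)."""
    rng = random.Random(seed)  # private RNG, so each mock is reproducible

    def pipeline(prompt: str):
        # Same return shape as the mock above: (response, latency, tokens).
        latency = rng.uniform(0, 10)
        tokens = rng.choice(range(0, 10))
        response = rng.choice(classes)
        return response, latency, tokens

    return pipeline

# Two independent mocks with different seeds.
mock_a = make_mock_pipeline(seed=1)
mock_b = make_mock_pipeline(seed=2)
```

Either mock could stand in for `pipeline_1` or `pipeline_2` if you want repeatable numbers while building out the rest of the workflow.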

Now that we have our pipelines defined, we can run our prompts through them and capture the responses. This is also where we capture other metrics we care about, like latency and token count, so we can use them for comparison later.

In [23]:
def evaluate(pipeline, pipeline_name):
    outs = []
    for prompt, label in zip(prompts, labels):
        response, latency, tokens = pipeline(prompt)
        outs.append({'label': label,
                     'prompt': prompt,
                     'response': response,
                     'latency': latency,
                     'tokens': tokens})
    return weave.save(outs, name=pipeline_name)

pipeline_1_w = evaluate(pipeline_1, 'pipeline_1')
pipeline_2_w = evaluate(pipeline_2, 'pipeline_2')

We use `weave.save(<responses and metrics>, name=<name>)` to save them to Weave. We choose the name ourselves and collect our responses and metrics in a list of Python dictionaries.

Weave intelligently decides how to display our data, so we can view `pipeline_1_w` in a `weave` table. In a notebook, run:

In [39]:
pipeline_1_w

In this table, you can page through examples, filter results and even create new columns by using data in other columns.
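
To make "new columns from other columns" concrete, here is a plain-Python sketch of the same idea applied to the list of dictionaries we saved. It does not use Weave; `add_correct_column` and `accuracy` are hypothetical helper names, and the sample rows are made up for illustration.

```python
# Sample rows in the same shape as the evaluation output (illustrative values).
rows = [
    {'label': 'positive', 'response': 'positive'},
    {'label': 'negative', 'response': 'neutral'},
    {'label': 'neutral', 'response': 'neutral'},
]

def add_correct_column(rows):
    """Return new rows with a computed 'correct' column (response == label)."""
    return [{**row, 'correct': row['response'] == row['label']} for row in rows]

def accuracy(rows):
    """Fraction of rows whose response matches the label."""
    with_correct = add_correct_column(rows)
    return sum(row['correct'] for row in with_correct) / len(with_correct)

rows_with_correct = add_correct_column(rows)
```

With these sample rows, two of the three responses match their labels.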

# Join Prediction Tables

Because we want to compare our pipelines for each prompt, we'll use a `weave` Op (operation) to join the two tables on the `prompt`.

TODO: Do this dynamically in the Board

In [40]:
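res = weave.ops.join_2(
    pipeline_1_w, pipeline_2_w,  # the two saved tables from the evaluation step
    lambda row: row['prompt'],   # join key for the first table
    lambda row: row['prompt'],   # join key for the second table
    'model_a',                   # alias for the first table's columns
    'model_b',                   # alias for the second table's columns
    False,
    False)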

x = weave.use(res)
res_w = weave.save(x, 'res_w')
res_w
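
For intuition about what the join produces, here is a plain-Python sketch of an inner join on a key, with each side stored under an alias. It illustrates the semantics only and is not how Weave implements `join_2`. It assumes the two trailing `False` arguments above disable outer joins. `inner_join` is a hypothetical helper, and it nests each side under its alias (`row['model_a']['label']`), whereas the Weave table is addressed with dotted keys such as `row["model_a.label"]`.

```python
def inner_join(left, right, key, left_alias, right_alias):
    """Pair every left row with every right row that shares the same key."""
    # Index the right-hand rows by key so each lookup is cheap.
    right_index = {}
    for row in right:
        right_index.setdefault(key(row), []).append(row)

    joined = []
    for left_row in left:
        for right_row in right_index.get(key(left_row), []):
            joined.append({left_alias: left_row, right_alias: right_row})
    return joined

# Illustrative data: only the prompts present on both sides survive the join.
table_a = [
    {'prompt': 'p1', 'response': 'positive'},
    {'prompt': 'p2', 'response': 'negative'},
]
table_b = [
    {'prompt': 'p2', 'response': 'neutral'},
    {'prompt': 'p1', 'response': 'positive'},
    {'prompt': 'p3', 'response': 'neutral'},
]

joined = inner_join(table_a, table_b, lambda row: row['prompt'], 'model_a', 'model_b')
```

Each joined row holds both models' outputs for one prompt, which is the property the side-by-side table relies on.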

We only want to display our prompts, labels, and each model's response, so we'll use `weave.panels.Table`. The `columns` argument may look a bit strange, but it's how we define which columns appear in the table: each lambda produces one column from a row.

In [43]:
table = weave.panels.Table(
    res_w,
    columns=[
        lambda row: row["model_a.label"],
        lambda row: row["model_a.prompt"],
        lambda row: row["model_a.response"],
        lambda row: row["model_a.response"] == row["model_a.label"],
        lambda row: row["model_b.response"],
        lambda row: row["model_b.response"] == row["model_b.label"],
    ],
)
table

Now we have a joined table with only the columns we want displayed, plus some computed columns.

We also want a bar chart to compare our metrics so we'll create a dictionary for our computed metrics.

In [50]:
latency_bar = weave.ops.dict_(
    model_a=res_w["model_a.latency"].avg(),
    model_b=res_w["model_b.latency"].avg(),
)
latency_bar

This might seem a bit magic, but it's the same as we saw above: `weave` makes a best guess at how to display your data. Here, we created a dict like `{'model_a': metric_a, 'model_b': metric_b}`, and Weave chooses to display it as a bar chart.

In [52]:
token_count_bar = weave.ops.dict_(
    model_a=res_w["model_a.tokens"].sum(),
    model_b=res_w["model_b.tokens"].sum(),
)
token_count_bar

Again, we created a dictionary with weave, and it knew to display it as a bar chart.
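
The two dictionaries above are lazy Weave expressions. As a rough plain-Python equivalent, the sketch below computes the same kind of per-model dictionaries from already-joined rows. `metric_by_model` is a hypothetical helper, the rows are made-up sample data, and the nested `row['model_a']['latency']` layout is an assumption for illustration rather than Weave's own representation.

```python
from statistics import mean

# Illustrative joined rows: one entry per prompt, one sub-dict per model.
joined = [
    {'model_a': {'latency': 2.0, 'tokens': 3}, 'model_b': {'latency': 4.0, 'tokens': 5}},
    {'model_a': {'latency': 4.0, 'tokens': 7}, 'model_b': {'latency': 6.0, 'tokens': 1}},
]

def metric_by_model(joined, field, reducer, aliases=('model_a', 'model_b')):
    """Reduce one field per model, e.g. the mean latency or the sum of tokens."""
    return {
        alias: reducer([row[alias][field] for row in joined])
        for alias in aliases
    }

latency_avg = metric_by_model(joined, 'latency', mean)  # {'model_a': 3.0, 'model_b': 5.0}
token_sum = metric_by_model(joined, 'tokens', sum)      # {'model_a': 10, 'model_b': 6}
```

Each result has one numeric value per model, which is the shape that Weave displayed as a bar chart above.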

## Putting it all together in a Board

Because we want to easily jump between comparing different models, we'll need some way to change which models we're comparing. To do this, we'll use a weave Board.

Weave Boards can have variables (`vars`), which you can change dynamically to make your panels and plots update.
Here, the variable `res_w_0` holds the joined table for the models we're comparing. We'll put each of our panels in a `BoardPanel` and define its layout.

In [53]:
weave.panels.Board(
    vars={
        "res_w_0": res_w
    },
    panels=[
        weave.panels.BoardPanel(
            lambda res_w_0: table,
            layout=weave.panels.BoardPanelLayout(x=0, y=0, w=24, h=9)
        ),
        weave.panels.BoardPanel(
            lambda res_w_0: token_count_bar,
            layout=weave.panels.BoardPanelLayout(x=0, y=9, w=12, h=8)
        ),
        weave.panels.BoardPanel(
            lambda res_w_0: latency_bar,
            layout=weave.panels.BoardPanelLayout(x=12, y=9, w=12, h=8)
        ),
    ]
)

And that's it. We've created a Weave `Board` to compare our pipeline responses, token counts, and latency. We've seen how `weave` intelligently decides how to display data, whether that's in a table, a bar chart, or some other panel. We've learned how to define `panels` ourselves and how to define computations using weave `Ops`.