Tutorial: Build an Evaluation pipeline

To iterate on an application, we need a way to tell whether it is improving. A common practice is to test it against the same set of examples whenever something changes. Weave has a first-class way to track evaluations with the Model and Evaluation classes. The APIs make minimal assumptions so that they stay flexible enough to support a wide range of use cases.

Build a Model

Models store and version information about your system, such as prompts, temperatures, and more. Weave automatically captures when they are used and updates the version when there are changes.

Models are declared by subclassing Model and implementing a predict function, which takes one example and returns the response.


Known Issue: If you are using Google Colab, remove async from the following examples.


import json
import openai
import weave

# highlight-next-line
class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

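    # highlight-next-line
    @weave.op()
    # highlight-next-line
    async def predict(self, sentence: str) -> dict:
        client = openai.AsyncClient()

        # Send the formatted prompt to the chat model and request a JSON object back.
        response = await client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "user", "content": self.prompt_template.format(sentence=sentence)}
            ],
            response_format={ "type": "json_object" }
        )
        result = response.choices[0].message.content
        if result is None:
            raise ValueError("No response from model")
        # The model is asked for JSON, so parse the text into a dict.
        parsed = json.loads(result)
        return parsed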
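
The predict function above assumes the model returns valid JSON containing the requested fields. A minimal sketch of a stricter parsing step follows. `parse_fruit_json` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration and are not part of Weave or the OpenAI client.

```python
import json

# Hypothetical: the fields requested by the prompt template in this tutorial.
REQUIRED_FIELDS = ("fruit", "color", "flavor")


def parse_fruit_json(raw: str) -> dict:
    """Parse a model response and check that it has the expected fields."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Response is not valid JSON: {raw!r}") from exc
    if not isinstance(parsed, dict):
        raise ValueError(f"Expected a JSON object, got {type(parsed).__name__}")
    missing = [field for field in REQUIRED_FIELDS if field not in parsed]
    if missing:
        raise ValueError(f"Response is missing fields: {missing}")
    return parsed
```

A helper like this could replace the bare `json.loads(result)` call, so that malformed responses fail with a clear message instead of surfacing later as a `KeyError` inside a scoring function.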

You can instantiate Model objects as normal like this:

import asyncio
import weave

# Begin capturing data in the project intro-example.
weave.init('intro-example')

model = ExtractFruitsModel(model_name='gpt-3.5-turbo-1106',
                          prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}')
sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy."
print(asyncio.run(model.predict(sentence)))
# if you're in a Jupyter Notebook, run:
# await model.predict(sentence)

:::note
Check out the Models guide to learn more.
:::

Collect some examples

sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
    "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."]
labels = [
    {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
    {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
    {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
]
# Each example pairs an input sentence with its expected (target) fields.
examples = [
    {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
    {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
    {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
]
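
For larger example sets, writing each row by hand gets tedious. The sketch below builds the same row structure with a comprehension. The shortened sentences are placeholders for illustration, and the length check is an added safeguard rather than something Weave requires.

```python
# Shortened placeholder data with the same shape as the tutorial's examples.
sentences = [
    "Neoskizzles are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
]
labels = [
    {"fruit": "neoskizzles", "color": "purple", "flavor": "candy"},
    {"fruit": "pounits", "color": "bright green", "flavor": "savory"},
]

# zip() silently truncates to the shorter list, so check lengths first.
if len(sentences) != len(labels):
    raise ValueError("sentences and labels must have the same length")

# One row per sentence: a string id, the input sentence, and its target label.
examples = [
    {"id": str(i), "sentence": sentence, "target": label}
    for i, (sentence, label) in enumerate(zip(sentences, labels))
]
```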

Evaluate a Model

Evaluations assess a Model's performance on a set of examples using a list of specified scoring functions.

Here, we'll use a predefined scoring function, MultiTaskBinaryClassificationF1, and also define our own, fruit_name_score.

For each example, sentence is passed to the model's predict function and target is passed to the scoring function. These are matched to the examples by the argument names of the predict and scoring functions.
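
To make the name-based matching concrete, here is a minimal sketch of the idea using Python's `inspect` module. It is an illustration only, not Weave's implementation. `select_arguments` is a hypothetical helper, and `predict` is a stand-in that returns a fixed answer instead of calling a model.

```python
import inspect


def select_arguments(func, example: dict) -> dict:
    """Return the entries of `example` whose keys match `func`'s parameter names."""
    parameter_names = inspect.signature(func).parameters
    return {name: example[name] for name in parameter_names if name in example}


def predict(sentence: str) -> dict:
    # Stand-in for a model call; always returns the same answer.
    return {"fruit": "pounits"}


def fruit_name_score(target: dict, model_output: dict) -> dict:
    return {"correct": target["fruit"] == model_output["fruit"]}


example = {
    "id": "1",
    "sentence": "Pounits are bright green.",
    "target": {"fruit": "pounits"},
}

# Only `sentence` reaches predict; `id` and `target` are ignored.
model_output = predict(**select_arguments(predict, example))

# The scorer receives `target` from the example plus the model's output.
score = fruit_name_score(
    model_output=model_output, **select_arguments(fruit_name_score, example)
)
```

The practical consequence is that the keys in each example must match the parameter names you choose: renaming `sentence` in predict means renaming the key in the examples too.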

import asyncio
import weave
from weave.flow.scorer import MultiTaskBinaryClassificationF1

weave.init('intro-example')

# Custom scorer: checks whether the predicted fruit name matches the target exactly.
def fruit_name_score(target: dict, model_output: dict) -> dict:
    return {'correct': target['fruit'] == model_output['fruit']}

# highlight-next-line
evaluation = weave.Evaluation(
    # highlight-next-line
    dataset=examples, scorers=[MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), fruit_name_score],
# highlight-next-line
)
# highlight-next-line
print(asyncio.run(evaluation.evaluate(model)))
# if you're in a Jupyter Notebook, run:
# await evaluation.evaluate(model)
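
Before spending API calls on a full evaluation, it can help to sanity-check a custom scoring function on hand-written outputs. The sketch below runs fruit_name_score against made-up model outputs. `fake_outputs` is hypothetical data written for this check, not real model responses.

```python
def fruit_name_score(target: dict, model_output: dict) -> dict:
    return {"correct": target["fruit"] == model_output["fruit"]}


targets = [
    {"fruit": "neoskizzles", "color": "purple", "flavor": "candy"},
    {"fruit": "pounits", "color": "bright green", "flavor": "savory"},
    {"fruit": "glowls", "color": "pale orange", "flavor": "sour and bitter"},
]

# Hypothetical model outputs, written by hand for this check.
fake_outputs = [
    {"fruit": "neoskizzles", "color": "purple", "flavor": "candy"},
    {"fruit": "pounits", "color": "green", "flavor": "savory"},
    {"fruit": "Glowls", "color": "pale orange", "flavor": "sour"},
]

scores = [
    fruit_name_score(target, output)
    for target, output in zip(targets, fake_outputs)
]

# Fraction of examples where the fruit name matched exactly.
accuracy = sum(score["correct"] for score in scores) / len(scores)
```

The third output is scored as incorrect because the comparison is case-sensitive ("Glowls" versus "glowls"). If that is too strict for your use case, normalizing both strings inside the scorer is one option.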

Pulling it all together

import json
import asyncio
# highlight-next-line
import weave
# highlight-next-line
from weave.flow.scorer import MultiTaskBinaryClassificationF1
import openai

# We create a model class with one predict function.
# All inputs, predictions and parameters are automatically captured for easy inspection.

# highlight-next-line
class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

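    # highlight-next-line
    @weave.op()
    # highlight-next-line
    async def predict(self, sentence: str) -> dict:
        client = openai.AsyncClient()

        # Send the formatted prompt to the chat model and request a JSON object back.
        response = await client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "user", "content": self.prompt_template.format(sentence=sentence)}
            ],
            response_format={ "type": "json_object" }
        )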
        result = response.choices[0].message.content
        if result is None:
            raise ValueError("No response from model")
        parsed = json.loads(result)
        return parsed

# We call init to begin capturing data in the project, intro-example.
weave.init('intro-example')

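# We create our model with our prompt template.
# `name` is a display label; `model_name` is the model that is actually called.
# Assumption: model_name reuses the model from the earlier example in this tutorial.
model = ExtractFruitsModel(name='gpt4',
                           model_name='gpt-3.5-turbo-1106',
                           prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}')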
sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
    "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."]
labels = [
    {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
    {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
    {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
]
examples = [
    {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
    {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
    {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
]
# If you have already published the Dataset, you can run:
# dataset = weave.ref('example_labels').get()

# We define a scoring function to compare our model predictions with a ground truth label.
def fruit_name_score(target: dict, model_output: dict) -> dict:
    return {'correct': target['fruit'] == model_output['fruit']}

# Finally, we run an evaluation of this model.
# This will generate a prediction for each input example, and then score it with each scoring function.
# highlight-next-line
evaluation = weave.Evaluation(
    # highlight-next-line
    dataset=examples, scorers=[MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), fruit_name_score],
# highlight-next-line
)
print(asyncio.run(evaluation.evaluate(model)))
# if you're in a Jupyter Notebook, run:
# await evaluation.evaluate(model)

What's next?