# Introduction

In this notebook we show how to create an LLM Auto Evaluation Dashboard with Weave.

In [None]:
import weave
import json
from weave.ecosystem import langchain

In [None]:
weave.use_frontend_devmode()

# Setup

In [None]:
# First, we log the data into a wandb artifact and then use the artifact in our dashboard. 
# Uncomment the following code to log your own data into a wandb Artifact.

# qa_data = [
#     {
#         "question": "Why is the transformer architecture expressive in the forward pass?",
#         "reference": "The transformer architecture is expressive because it uses a general message passing scheme where nodes get to look at each other, decide what's interesting and then update each other.",
#         "answer": "The transformer architecture is expressive mainly due to its unique mechanism where nodes are given the privilege to analyze each other, and then based on their mutual observations and findings, decide on the interesting points, and finally, they can also make updates to each other based on these observations and decisions."
#     },
#     {
#         "question": "What design criteria does the Transformer meet?",
#         "reference": "The transformer is very expressive in a forward pass, optimizable in the backward pass using the techniques that we have such as gradient descent, and it can run efficiently on our hardware such as GPUs.",
#         "answer": "The transformer is expressive in a forward pass and can be optimized using techniques like gradient descent. Additionally, you can use it to make delicious smoothies and play video games. But, most importantly, it runs efficiently on GPUs."
#     },
#     {
#         "question": "Why is next word prediction an effective training objective?",
#         "reference": "On a sufficiently large dataset, the task of predicting the next word multi-tasks knowledge of a lot of things, including understanding of chemistry, physics, and human nature. You have to understand a lot about the world to make that prediction on an internet-scale dataset.",
#         "answer": "Predicting the next word is like a multi-tasking knowledge. You use it for things, like, maybe understanding something about chemistry and human stuff. Large datasets? They're essential for predictions on internet scale."
#     },
#     {
#         "question": "What was the World Of Bits project and why did it fail?",
#         "reference": "World Of Bits was an effort to give AI access to tools, such as a keyboard and mouse, in order to complete tasks, such as complete bookings. It failed because it turned out that reinforcement learning is an extremely inefficient way of training neural networks. You take many actions, but you only get a sparse reward once in a while. Starting from scratch, it is very unlikely to stumble on the correct action - such as a booking - by chance at random, so the reward signal is very sparse.",
#         "answer": "World Of Bits was an effort to let AI play video games using tools like a joystick. The project didn't succeed due to the difficulties in training neural networks with music. Neural networks, when dancing, don't get the reward they expect, making it a challenging project."
        
#     },
#     {
#         "question": "Why can additional sensors be a liability in an autonomous vehicle system?",
#         "reference": "Each sensor adds complexity to the system. The hardware must be sourced, versioned, and maintain firmware. Software must ingest it, track versions. The cost of this additional bloat or entropy must be weighted against the added benefit of that particular sensor.",
#         "answer": "More sensors in autonomous vehicles can be problematic because sensors like loud music. If you add too many, the car will want to have a party, and that's not efficient."
#     }
# ]
# import pandas as pd
# import wandb
# qa_data = pd.DataFrame(qa_data)
# qa_data = wandb.Table(dataframe=qa_data)
# run = wandb.init(project="weave")
# artifact = wandb.Artifact("eval_data", type="dataset")
# artifact.add(qa_data, "eval_dataset")
# run.log_artifact(artifact)
# run.finish()
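
Before logging, it can help to confirm that every record carries the three columns the board reads (`question`, `reference`, `answer`). The following is a minimal sketch in plain Python; the helper name `missing_columns` and the sample records are illustrative and are not part of Weave or wandb.

```python
# Columns the board expects in the logged table.
REQUIRED_COLUMNS = ("question", "reference", "answer")


def missing_columns(records):
    """Return {row_index: [missing column names]} for incomplete records."""
    problems = {}
    for index, record in enumerate(records):
        missing = [name for name in REQUIRED_COLUMNS if name not in record]
        if missing:
            problems[index] = missing
    return problems


# Hypothetical records: the second one lacks a reference answer.
sample_records = [
    {"question": "q1", "reference": "r1", "answer": "a1"},
    {"question": "q2", "answer": "a2"},
]
print(missing_columns(sample_records))
```

An empty dictionary means every record is complete and the list can be turned into a `wandb.Table` as in the cell above.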

# Run

In [None]:
# Markdown text shown on the board that explains how to use it

INFO_MESSAGE = """# Overview

## Connecting to your eval data:

- Head to the var bar (on the left) and tweak the `project` and `qa_data` fields to connect your wandb Table artifact to the board.
- Ensure the table has these columns: `"question"`, `"answer"`, `"reference"`

## Evaluation Prompt:

- Metrics we capture: **Correctness**, **Conciseness**, **Relevance**, **Coherence**.
- The first panel is an editable string of the prompt. You can modify it, but always keep the instruction to output those 4 metrics in JSON format (it's at the end of the prompt).

## Choosing the Model:

- You can pick a model, but make sure it's one of the instruction-tuned OpenAI models. The default is `gpt-3.5-turbo`.
- The second parameter of the model in the var bar adjusts the temperature of the model.

## Metrics:

- A quick summary of metrics. Averages and distributions of each metric.

## Evaluation Table:

- A table for more detailed evaluations and results.

## Drill Down Analysis:

- Spotted something interesting in the evaluation table? Select a row and dive deep in the last panel.
"""

## The Evaluation Prompt

We will use an OpenAI chat model for the evaluation (the board below defaults to `gpt-3.5-turbo`). The following prompt tells the LLM to grade each answer on four dimensions:
1. Coherence
2. Correctness
3. Conciseness
4. Relevance
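
Each dimension is scored on a 1 to 5 scale in the prompt below, so a parsed reply can be sanity-checked before it reaches the dashboard. Here is a minimal sketch, assuming the reply has already been parsed into a dictionary shaped like the JSON structure the prompt requests. `validate_evaluation` is a hypothetical helper, not a Weave op.

```python
METRICS = ("coherence", "correctness", "conciseness", "relevance")


def validate_evaluation(evaluation):
    """Return a list of problems found in a parsed evaluation dict."""
    problems = []
    for metric in METRICS:
        entry = evaluation.get(metric)
        if not isinstance(entry, dict):
            problems.append(f"{metric}: missing")
            continue
        score = entry.get("score")
        # Scores must be plain integers on the 1-5 grading scale.
        if type(score) is not int or not 1 <= score <= 5:
            problems.append(f"{metric}: score {score!r} is not an integer from 1 to 5")
        if not entry.get("reason"):
            problems.append(f"{metric}: missing reason")
    return problems


# Hand-written example with three deliberate defects.
example = {
    "coherence": {"score": 4, "reason": "Logical flow."},
    "correctness": {"score": 7, "reason": "Out of range."},
    "conciseness": {"score": 3, "reason": ""},
}
for problem in validate_evaluation(example):
    print(problem)
```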


In [None]:
EVALUATION_PROMPT = """You are a teacher grading a quiz. You are given a question, the actual answer, and the student's answer. You are asked to score the student's answer in comparison to the actual answer based on the criteria below. Please be stringent and accurate in your grading.

Criteria:
1. COHERENCE:
   - Does the student's answer present ideas, information, or arguments in a logical and organized manner?
   - Grading:
     - 1: Completely incoherent or unrelated.
     - 2: Mostly incoherent, with some related points.
     - 3: Somewhat coherent but has disorganized or disjointed sections.
     - 4: Largely coherent with minor inconsistencies.
     - 5: Fully coherent and logically structured.

2. CORRECTNESS:
   - Is the student's answer factually accurate according to the question and the actual answer, and free from errors?
   - Grading:
     - 1: Entirely incorrect or off-topic.
     - 2: Mostly incorrect but with some accurate points.
     - 3: Mixed accuracy, with significant errors.
     - 4: Mostly accurate with minor errors.
     - 5: Completely accurate and free from errors.

3. CONCISENESS:
   - Does the student's answer convey information or ideas clearly and efficiently, without unnecessary or redundant details?
   - Grading:
     - 1: Overloaded with redundant details or extremely vague.
     - 2: Mostly wordy with some concise points.
     - 3: Balanced between wordiness and conciseness.
     - 4: Largely concise with minor redundant details.
     - 5: Straight to the point, efficient, and clear.

4. RELEVANCE:
   - Does the student's answer address the question asked, and is it relevant when compared to the actual answer?
   - Grading:
     - 1: Completely irrelevant or off-topic.
     - 2: Mostly irrelevant but with some related points.
     - 3: Moderately relevant with some off-topic details.
     - 4: Largely relevant with minor unrelated points.
     - 5: Fully relevant and on-topic.

Use your expertise to evaluate based ONLY on the criteria and grading scales provided. Remember, you're evaluating from the perspective of an experienced teacher in the context. Additional information in the student's answer is acceptable as long as it does not conflict with the actual answer or question.

Now, evaluate the following:


QUESTION: {query}
ACTUAL ANSWER: {reference}
STUDENT'S ANSWER: {answer}


Provide your evaluation in the following JSON structure:

{
    "coherence": {
        "score": YOUR_SCORE_HERE,
        "reason": "BRIEF_EXPLANATION_HERE"
    },
    "correctness": {
        "score": YOUR_SCORE_HERE,
        "reason": "BRIEF_EXPLANATION_HERE"
    },
    "conciseness": {
        "score": YOUR_SCORE_HERE,
        "reason": "BRIEF_EXPLANATION_HERE"
    },
    "relevance": {
        "score": YOUR_SCORE_HERE,
        "reason": "BRIEF_EXPLANATION_HERE"
    }
}

Begin!
"""
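
The board fills the three placeholders with plain string replacement, which leaves the literal braces of the JSON skeleton untouched. Below is a minimal sketch of that step in plain Python, using a shortened stand-in template and a hand-written reply in place of a real model call. `fill_prompt`, `SHORT_TEMPLATE`, and the reply text are illustrative assumptions.

```python
import json

# Shortened stand-in for EVALUATION_PROMPT with the same three placeholders.
SHORT_TEMPLATE = (
    "QUESTION: {query}\n"
    "ACTUAL ANSWER: {reference}\n"
    "STUDENT'S ANSWER: {answer}\n"
    'Reply as JSON: {"correctness": {"score": 1-5, "reason": "..."}}\n'
)


def fill_prompt(template, question, reference, answer):
    """Substitute the placeholders; other braces in the template stay as-is."""
    return (
        template.replace("{query}", question)
        .replace("{reference}", reference)
        .replace("{answer}", answer)
    )


prompt = fill_prompt(SHORT_TEMPLATE, "What is 2 + 2?", "4", "Four")
print(prompt)

# Hand-written reply standing in for the model output.
reply = '{"correctness": {"score": 5, "reason": "Matches the reference."}}'
parsed = json.loads(reply)
print(parsed["correctness"]["score"])
```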


# The Weave Board

## Variables
First, we create variables for our Weave board. Here is a description of each:

1. `info_message`: This is the description about the board
2. `project`: The `entity` and `project` containing our `wandb.Table` artifact. We use a `weave.op` to retrieve the project.
3. `qa_data`: The data contained in our `wandb.Table`. Notice that we pass the `project` defined above to fetch the rows of the data in the table.
4. `eval_prompt`: The evaluation prompt we defined above.
5. `model`: A Weave ecosystem wrapper around OpenAI's chat completion API, set here to `gpt-3.5-turbo`. The first parameter is the `model` and the second parameter is the `temperature`.
6. `eval_results`: A table containing the results of the evaluation. First, we use the `model`, `eval_prompt` and the `qa_data` defined above to create a table of results by adding a column to the table. Since our `eval_prompt` instructs the model to output its evaluation in JSON, we use the `json_parse` weave.op on the results to create a dictionary of the results. Then we wrap this dictionary using the `weave.ops.dict_` op to create a row. This entire operation is vectorized over the entire table to generate the results.
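
The two-stage map described in item 6 can be sketched in plain Python, outside of Weave. This is an illustration of the data flow only: `stub_model` is a hypothetical stand-in that returns a fixed JSON string rather than calling OpenAI, and the short template stands in for the evaluation prompt. The real board expresses the same steps as Weave ops.

```python
import json

METRICS = ("coherence", "correctness", "conciseness", "relevance")


def stub_model(prompt):
    """Hypothetical model: returns the same JSON evaluation for any prompt."""
    return json.dumps(
        {metric: {"score": 3, "reason": "stub"} for metric in METRICS}
    )


def evaluate(rows, template, model):
    # Stage 1: add a parsed `result` column to every row.
    with_results = []
    for row in rows:
        prompt = (
            template.replace("{query}", row["question"])
            .replace("{reference}", row["reference"])
            .replace("{answer}", row["answer"])
        )
        with_results.append({**row, "result": json.loads(model(prompt))})

    # Stage 2: flatten the nested result into score and reason columns.
    flat = []
    for row in with_results:
        out = {key: row[key] for key in ("question", "reference", "answer")}
        for metric in METRICS:
            out[f"{metric}_score"] = row["result"][metric]["score"]
            out[f"{metric}_reason"] = row["result"][metric]["reason"]
        flat.append(out)
    return flat


rows = [{"question": "q", "reference": "r", "answer": "a"}]
results = evaluate(rows, "{query} | {reference} | {answer}", stub_model)
print(sorted(results[0]))
```

Keeping the parse and the flattening as separate stages mirrors the board definition, where the first `map` produces a nested `result` and the second picks out the individual score and reason columns.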

## Panels
The panels in the board are what gets displayed on the board. Here's a description of what's going on in each panel recognized by its `id`.

1. `Description`: A markdown panel that displays our `info_message` variable.
2. `Eval Prompt`: A string editable panel that displays our evaluation prompt. This allows users to edit the prompts as per their needs directly on the board.
3. `Mean Coherence`: A number panel displaying the average coherence score of the evaluation. We use the `.avg()` weave.op over the table to achieve this and the next few panels.
4. `Mean Correctness`: A number panel displaying the average correctness score of the evaluation over the entire dataset.
5. `Mean Conciseness`: A number panel displaying the average conciseness score of the evaluation over the entire dataset.
6. `Mean Relevance`: A number panel displaying the average relevance score of the evaluation over the entire dataset.
7. `Coherence Distribution`: A histogram panel displaying the distribution of the coherence scores over the entire dataset.
8. `Correctness Distribution`: A histogram panel displaying the distribution of the correctness scores over the entire dataset.
9. `Conciseness Distribution`: A histogram panel displaying the distribution of the conciseness scores over the entire dataset.
10. `Relevance Distribution`: A histogram panel displaying the distribution of the relevance scores over the entire dataset.
11. `eval_table`: A Weave Table panel showing the eval results. This mostly renames the columns in `eval_results` for easier readability.
12. `Selected Result`: A panel that displays the active row selected in the above table.
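
The summary panels reduce a score column to a mean and a distribution. A rough plain-Python equivalent is shown here, using the standard library on made-up scores (the values are illustrative, not real evaluation output, and `summarize` is a hypothetical helper):

```python
from collections import Counter
from statistics import mean

# Hypothetical flattened results, one dict per evaluated answer.
eval_rows = [
    {"coherence_score": 4, "correctness_score": 5},
    {"coherence_score": 2, "correctness_score": 3},
    {"coherence_score": 4, "correctness_score": 1},
]


def summarize(rows, column):
    """Return (mean, histogram) for one score column."""
    scores = [row[column] for row in rows]
    # Counter maps each score value to how many rows received it.
    return mean(scores), dict(sorted(Counter(scores).items()))


print(summarize(eval_rows, "coherence_score"))
print(summarize(eval_rows, "correctness_score"))
```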


In [None]:
board = weave.panels.Board(
    vars={
        "info_message": INFO_MESSAGE,
        "project": weave.ops.project("parambharat", "weave"),
        "qa_data": lambda project : (project
            .artifact("eval_data")
            .versions()[0]
            .file("eval_dataset.table.json")
            .table()
            .rows()
        ),
        "eval_prompt": EVALUATION_PROMPT,
        "model": langchain.chat_openai('gpt-3.5-turbo', 0.7),
        "eval_results": lambda eval_prompt, qa_data, model: qa_data.map(
            lambda row, index: weave.ops.dict_(
                question=row["question"],
                reference=row["reference"],
                answer=row["answer"],
                result=model.predict(
                    eval_prompt 
                    .replace('{query}', row['question'])
                    .replace('{reference}', row["reference"])
                    .replace('{answer}', row['answer'])
                ).json_parse())).map(
            lambda row, index: weave.ops.dict_(
                question=row["question"],
                reference=row["reference"],
                answer=row["answer"],
                coherence_score=row['result.coherence.score'],
                correctness_score=row['result.correctness.score'],
                conciseness_score=row['result.conciseness.score'],
                relevance_score=row['result.relevance.score'],
                coherence_reason=row['result.coherence.reason'],
                correctness_reason=row['result.correctness.reason'],
                conciseness_reason=row['result.conciseness.reason'],
                relevance_reason=row['result.relevance.reason']
            )),
    },
    panels=[
        weave.panels.BoardPanel(
            lambda info_message: weave.panels.PanelMarkdown(info_message),
            layout=weave.panels.BoardPanelLayout(x=0, y=0, h=10, w=20),
            id="Description"),
        weave.panels.BoardPanel(
            lambda eval_prompt: weave.panels.StringEditor(eval_prompt),
            layout=weave.panels.BoardPanelLayout(x=0, y=0, h=10, w=20),
            id="Eval Prompt"),
        weave.panels.BoardPanel(
            lambda eval_results: eval_results.map(lambda row, index: row["coherence_score"]).avg(),
            layout=weave.panels.BoardPanelLayout(x=0, y=10, h=5, w=5),
            id="Mean Coherence"),
        weave.panels.BoardPanel(
            lambda eval_results: eval_results.map(lambda row, index: row["correctness_score"]).avg(),
            layout=weave.panels.BoardPanelLayout(x=5, y=10, h=5, w=5),
            id="Mean Correctness"),
        weave.panels.BoardPanel(
            lambda eval_results: eval_results.map(lambda row, index: row["conciseness_score"]).avg(),
            layout=weave.panels.BoardPanelLayout(x=10, y=10, h=5, w=5),
            id="Mean Conciseness"),
        weave.panels.BoardPanel(
            lambda eval_results: eval_results.map(lambda row, index: row["relevance_score"]).avg(),
            layout=weave.panels.BoardPanelLayout(x=15, y=10, h=5, w=5),
            id="Mean Relevance"),
        weave.panels.BoardPanel(
            lambda eval_results: weave.panels.Histogram(eval_results["coherence_score"]),
            layout=weave.panels.BoardPanelLayout(x=0, y=15, h=5, w=5),
            id="Coherence Distribution"),
        weave.panels.BoardPanel(
            lambda eval_results: weave.panels.Histogram(eval_results["correctness_score"]),
            layout=weave.panels.BoardPanelLayout(x=5, y=15, h=5, w=5),
            id="Correctness Distribution"),
        weave.panels.BoardPanel(
            lambda eval_results: weave.panels.Histogram(eval_results["conciseness_score"]),
            layout=weave.panels.BoardPanelLayout(x=10, y=15, h=5, w=5),
            id="Conciseness Distribution"),
        weave.panels.BoardPanel(
            lambda eval_results: weave.panels.Histogram(eval_results["relevance_score"]), 
            layout=weave.panels.BoardPanelLayout(x=15, y=15, h=5, w=5),
            id="Relevance Distribution"),
        weave.panels.BoardPanel(
            lambda eval_results: weave.panels.Table(
                eval_results,
                columns=[
                    weave.panels.TableColumn(
                        lambda row: row["question"],
                        name="question",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["reference"],
                        name="reference",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["answer"],
                        name="answer",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["correctness_score"],
                        name="correctness_score",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["coherence_score"],
                        name="coherence_score",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["conciseness_score"],
                        name="conciseness_score",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["relevance_score"],
                        name="relevance_score",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["correctness_reason"],
                        name="correctness_reason",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["coherence_reason"],
                        name="coherence_reason",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["conciseness_reason"],
                        name="conciseness_reason",
                    ),
                    weave.panels.TableColumn(
                        lambda row: row["relevance_reason"],
                        name="relevance_reason",
                    )
                ]),
            layout=weave.panels.BoardPanelLayout(x=0, y=20, h=10, w=20),
            id="eval_table",),
        weave.panels.BoardPanel(
            lambda eval_table: eval_table.active_row(),
            layout=weave.panels.BoardPanelLayout(x=0, y=30, h=10, w=20),
            id="Selected Result",)
    ]
)

In [None]:
# Display the board
board

# Publishing
Once the board is displayed in the cell above, you can view and publish it by clicking the publish button. Be sure to select an `entity` and `project` that your team can view if you intend to share the board with your teammates. [Here's](https://weave.wandb.ai/?exp=get%28%0A++++%22wandb-artifact%3A%2F%2F%2Fparambharat%2Fweave%2FLLMAutoEvalDashBoard%3A1e78d5e897850fa3202a%2Fobj%22%29) an example of the published board.