## LLM-Powered Evaluation System

In this section, we build an LLM-powered evaluation system for the ML tagging system we built previously. We log and assess the results in Comet.

In [None]:
! pip install openai==0.28 comet-llm --quiet

In [None]:
import openai
import os
import IPython
import json
import pandas as pd
import numpy as np
from urllib.request import urlopen

# API configuration (replace the placeholder with your own OpenAI API key)
openai.api_key = "OPENAI_API_KEY"

Let's load the helper function to generate responses from the model:

In [None]:
def get_completion(messages, model="gpt-3.5-turbo", temperature=0, max_tokens=300):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message["content"]

### Load the Data

The code below defines a small helper for displaying Markdown and loads both the few-shot demonstrations and the validation dataset:

In [None]:
# print markdown
def print_markdown(text):
    """Prints text as markdown"""
    IPython.display.display(IPython.display.Markdown(text))

# load validation data
response = urlopen("https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/article-tags.json")
val_data =  json.loads(response.read())

# load few shot data
response = urlopen("https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/few_shot.json")
few_shot_data = json.loads(response.read())

### Few-shot

First, we define a few-shot template which will leverage the few-shot demonstration data loaded previously.

In [None]:
# function to define the few-shot template
def get_few_shot_template(few_shot_prefix, few_shot_suffix, few_shot_examples):

    return few_shot_prefix + "\n\n" + "\n".join([ "Abstract: "+ ex["abstract"] + "\n" + "Tags: " + str(ex["tags"]) + "\n" for ex in few_shot_examples]) + "\n\n" + few_shot_suffix

# function to sample n items (without replacement) from the given data
def random_sample_data(data, n):
    return np.random.choice(data, n, replace=False)

# the few-shot prefix and suffix
few_shot_prefix = """Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]"""
few_shot_suffix = """Abstract: {input}\nTags:"""

# load 3 samples from few shot data
few_shot_template = get_few_shot_template(few_shot_prefix, few_shot_suffix, random_sample_data(few_shot_data, 3))
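
To make the shape of the final prompt concrete, here is a minimal, self-contained sketch of how such a template is assembled and filled in. The demonstrations, the shortened prefix, and the `build_few_shot_template` name are hypothetical stand-ins for illustration; the real demonstrations come from `few_shot.json`.

```python
# Hypothetical demonstrations; the real ones are loaded from few_shot.json.
demo_examples = [
    {"abstract": "We introduce FooNet, a convolutional model.", "tags": ["FooNet"]},
    {"abstract": "We survey optimization methods.", "tags": ["NA"]},
]

def build_few_shot_template(prefix, suffix, examples):
    """Join the instruction, the demonstrations, and the input slot."""
    demos = "\n".join(
        f"Abstract: {ex['abstract']}\nTags: {ex['tags']}\n" for ex in examples
    )
    return f"{prefix}\n\n{demos}\n\n{suffix}"

# Shortened instruction, used here only to keep the example readable.
prefix = "Your task is to extract model names from machine learning paper abstracts."
suffix = "Abstract: {input}\nTags:"

template = build_few_shot_template(prefix, suffix, demo_examples)

# The {input} slot is filled in once per abstract at prediction time.
rendered = template.format(input="We fine-tune BarBERT on tagging.")
print(rendered)
```

Because the template is later filled in with `str.format`, a demonstration abstract that contains literal `{` or `}` characters would break the formatting step; escaping them as `{{` and `}}` is one way to guard against that.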

### Zero-shot


The code below defines the zero-shot template. It uses the same instruction as the few-shot prompt template but omits the demonstrations.

In [None]:
zero_shot_template = """
Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]

Abstract: {input}
Tags:
"""

### Evaluate

In this subsection, we perform the evaluation and log the results to Comet.

The following is a helper function to obtain the final predictions from the model given a prompt template (e.g., zero-shot or few-shot) and the provided input data.

In [None]:
def get_predictions(prompt_template, inputs):

    responses = []

    for i in range(len(inputs)):
        messages = [
            {
                "role": "system",
                "content": prompt_template.format(input=inputs[i])
            }
        ]
        response = get_completion(messages)
        responses.append(response)

    return responses

We then generate all the predictions using the validation data as inputs:

In [None]:
# extract abstract from val_data
abstracts = [val_data[i]["abstract"] for i in range(len(val_data))]
few_shot_predictions = get_predictions(few_shot_template, abstracts)
zero_shot_predictions = get_predictions(zero_shot_template, abstracts)
expected_tags = [str(val_data[i]["tags"]) for i in range(len(val_data))]

After obtaining the predictions, we build a system prompt that performs the automatic LLM-powered evaluation of the results obtained from the different prompt templates. Note that the system prompt takes both the expected answers and the predictions from the different prompts we tried.

In [None]:
# llm-powered evaluation

system_prompt = """You are a teacher grading a quiz. You will be given the expected answers (delimited by ```) and the answers from a student (delimited by ###). Your task is to grade the student. You will output either CORRECT or INCORRECT for each question. \n\nGrade the question as CORRECT if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem.\n\nGrade the question as INCORRECT if the student's answer is not factual or doesn't overlap with the expected answer.\n\nHere are the expected answers:\n```{expected_answers}```\n\nHere are the student's answers:\n###{predictions}###\n\nThe output format will be:\n[\"<grade for item 1>\", \"<grade for item 2>\",...]"""

# function to get the final llm grading
def get_llm_grading(expected_answers, predictions, system_prompt):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
        {
        "role": "system",
        "content": system_prompt.format(expected_answers=expected_answers, predictions=predictions)
        }
        ],
        temperature=0,
        max_tokens=256,
        frequency_penalty=0,
        presence_penalty=0
    )

    return response.choices[0].message["content"]

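# run the llm grading using the predictions obtained before
# ast.literal_eval parses the returned list literal without executing arbitrary code (unlike eval)
import ast

zero_shot_eval_predictions = ast.literal_eval(get_llm_grading(expected_tags, zero_shot_predictions, system_prompt))
few_shot_eval_predictions = ast.literal_eval(get_llm_grading(expected_tags, few_shot_predictions, system_prompt))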
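
The grader's reply is free-form text, so it can help to validate it before relying on it, for example by checking that it contains exactly one grade per validation item and only the two allowed labels. The following is a minimal sketch under those assumptions; `parse_grades`, `accuracy`, and the sample reply are hypothetical and not part of the original notebook.

```python
import ast

def parse_grades(raw_output, expected_count):
    """Parse the grader's reply into a list of 'CORRECT'/'INCORRECT' strings."""
    # literal_eval only accepts Python literals, so it never executes code.
    grades = ast.literal_eval(raw_output.strip())
    if not isinstance(grades, list):
        raise ValueError("grader reply is not a list")
    grades = [str(g).strip().upper() for g in grades]
    invalid = [g for g in grades if g not in {"CORRECT", "INCORRECT"}]
    if invalid:
        raise ValueError(f"unexpected grades: {invalid}")
    if len(grades) != expected_count:
        raise ValueError(f"expected {expected_count} grades, got {len(grades)}")
    return grades

def accuracy(grades):
    """Fraction of items graded CORRECT."""
    return sum(g == "CORRECT" for g in grades) / len(grades)

# Hypothetical grader reply for three validation items.
sample_reply = '["CORRECT", "INCORRECT", "CORRECT"]'
sample_grades = parse_grades(sample_reply, expected_count=3)
print(sample_grades, round(accuracy(sample_grades), 2))
```

A length mismatch is worth checking explicitly: the grader receives all answers in a single prompt with a capped `max_tokens`, so a long validation set could produce a truncated or misaligned list.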

### Logging Prompt Results to Comet

Once we have the grades from the LLM evaluator system prompt, we can log the results to Comet. We log several pieces of information, including the model, the prompt template type, the expected results, and the grading. All of this information will help us assess how good the LLM-powered evaluator is for this use case.

In [None]:
# log prediction for both few-shot and zero-shot using Comet
import comet_llm

# replace the placeholders with your own Comet workspace and API key
COMET_WORKSPACE = "COMET_WORKSPACE"
COMET_API_KEY = "COMET_API_KEY"

comet_llm.init(workspace=COMET_WORKSPACE, project="tagger-llm-evaluator", api_key=COMET_API_KEY)

for i in range(len(val_data)):
    # log zero-shot predictions
    comet_llm.log_prompt(
        prompt = system_prompt.format(expected_answers=expected_tags[i], predictions=zero_shot_predictions[i]),
        tags = ["gpt-3.5-turbo", "zero-shot"],
        metadata = {
            "model_name": "gpt-3.5-turbo",
            "temperature": 0,
            "expected_output": expected_tags[i],
            "model_output": zero_shot_predictions[i]
        },
        output = zero_shot_eval_predictions[i]
    )

    # log few-shot predictions
    comet_llm.log_prompt(
        prompt = system_prompt.format(expected_answers=expected_tags[i], predictions=few_shot_predictions[i]),
        tags = ["gpt-3.5-turbo", "few-shot"],
        metadata = {
            "model_name": "gpt-3.5-turbo",
            "temperature": 0,
            "expected_output": expected_tags[i],
            "model_output": few_shot_predictions[i]
        },
        output = few_shot_eval_predictions[i]
    )


Prompt logged to https://www.comet.com/omarsar/tagger-llm-evaluator
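
One possible way to assess how good the LLM-powered evaluator is, alongside inspecting the logged prompts in Comet, is to compare its grades against a simple deterministic rule. The sketch below assumes a rule that mirrors the grading instruction (a prediction is CORRECT when it covers every expected tag, ignoring case and order). The function names and the toy data are hypothetical and are not results from this notebook.

```python
import ast

def parse_tags(text):
    """Parse a string such as '["BERT", "GPT-2"]' into a set of lowercase tags."""
    try:
        value = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError, TypeError):
        return set()  # unparseable output is treated as "no tags"
    if not isinstance(value, (list, tuple)):
        return set()
    return {str(tag).strip().lower() for tag in value}

def rule_based_grade(expected, predicted):
    """CORRECT when the prediction covers every expected tag (extras allowed)."""
    expected_set = parse_tags(expected)
    predicted_set = parse_tags(predicted)
    covered = bool(expected_set) and expected_set <= predicted_set
    return "CORRECT" if covered else "INCORRECT"

def agreement_rate(llm_grades, expected, predicted):
    """Share of items where the LLM grade matches the rule-based grade."""
    rule_grades = [rule_based_grade(e, p) for e, p in zip(expected, predicted)]
    matches = sum(llm == rule for llm, rule in zip(llm_grades, rule_grades))
    return matches / len(rule_grades)

# Hypothetical data for illustration only.
toy_expected = ["['BERT']", "['NA']", "['ResNet', 'ViT']"]
toy_predicted = ['["BERT", "RoBERTa"]', '["GPT-2"]', '["ViT", "ResNet"]']
toy_llm_grades = ["CORRECT", "CORRECT", "CORRECT"]

print(agreement_rate(toy_llm_grades, toy_expected, toy_predicted))
```

Items where the two graders disagree are natural candidates for manual review, since either the rule is too strict (for example, differing model name spellings) or the LLM grader was too lenient.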
