# ChartQA Evaluation using Braintrust

Chart question answering is an increasingly common use case as the capabilities of modern vision language models keep improving. Today's models can visually analyze documents and start to reason about them. To assess how well models are doing, I decided to evaluate some of them using the [ChartQA benchmark dataset](https://github.com/vis-nlp/ChartQA).

Published benchmarks show GPT-4o getting about 85% accuracy. 
I wanted to run my own evaluation so that I could analyze the models' failure cases.

## Install dependencies

In [None]:
%pip install autoevals braintrust requests openai datasets pillow

## Set up LLM clients

We'll evaluate OpenAI's GPT-4o and GPT-4o mini on a subset of the ChartQA dataset. We will access these models
through the vanilla OpenAI client using Braintrust's proxy.

In [None]:
import os

import braintrust
import openai

client = braintrust.wrap_openai(
    openai.AsyncOpenAI(
        api_key=os.environ["BRAINTRUST_API_KEY"],
        base_url="https://api.braintrust.dev/v1/proxy",
    )
)

## Downloading the data and sanity testing it

I pull the ChartQA dataset from the Hugging Face Hub at [lmms-lab/ChartQA](https://huggingface.co/datasets/lmms-lab/ChartQA).

The dataset includes the question, answer, and image for each example. Let's load one and check that we can query GPT-4o with it.

In [None]:
from datasets import load_dataset
from PIL import Image

# Load ChartQA dataset
dataset = load_dataset("lmms-lab/ChartQA")

# Function to load question, answer, and image from ChartQA
def load_chart_qa_example(index):
    example = dataset['test'][index]
    question = example['question']
    answer = example['answer']
    image_data = example['image']
    
    # Check if image_data is a URL or an image object
    if isinstance(image_data, str):  # If it's a URL, fetch it
        import requests
        from io import BytesIO
        image_response = requests.get(image_data)
        image = Image.open(BytesIO(image_response.content))
    elif isinstance(image_data, Image.Image):  # If it's already an image object
        image = image_data
    else:
        raise ValueError("Unexpected image data type.")
    
    # Convert image to RGB if needed
    if image.mode != "RGB":
        image = image.convert("RGB")
    
    return question, answer, image

# Example usage
question, answer, image = load_chart_qa_example(0)
print("Question:", question)
print("Answer:", answer)
image.show()


### Adding LLM as a Judge Scorer
A common problem with ChartQA is that the model output is close to, but not perfectly aligned with, the correct answer. Let's add an LLM judge that tells us whether the output is close to the correct answer.

In [None]:
from braintrust import Eval
from autoevals import LLMClassifier
 
partialc = LLMClassifier(
    name="PartialCredit",
    prompt_template="You are going to judge the results of a QA task. The model answers on the basis of image and sometimes misses percentages, decimals, or other numerical transformations. You should ignore decimal places and percentage signs. If the answer is correct or similar/close to the answer, give partial credit. The scoring should be 0 for No credit and 1 for Full or Partial credit. An example of full credit would be an expected value of 3% and the output of 3 units \n\n Expected value: {{expected}} and output: {{output}}",
    choice_scores={"No": 0, "Partial": 1},
    use_cot=True,
)
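
The judge above relies on an LLM call for every example. As a cheap cross-check, a deterministic "relaxed match" scorer can cover the common numeric cases (percent signs, thousands separators, small rounding differences). The sketch below is not part of the original notebook: the names `to_number` and `relaxed_match` are hypothetical, and the 5% tolerance is an assumption chosen to mirror the relaxed-accuracy idea often used with chart benchmarks.

```python
def to_number(text):
    """Parse strings like '3%', '1,200', or '$4.5' as floats.

    Returns None when the text is not numeric. (Hypothetical helper.)
    """
    cleaned = text.strip().replace(",", "").rstrip("%").lstrip("$").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(output, expected, tolerance=0.05):
    """Return 1.0 if output matches expected, else 0.0.

    Numeric answers match when the relative error is within `tolerance`
    (an assumed 5% default). Non-numeric answers fall back to a
    case-insensitive exact string comparison.
    """
    out_num, exp_num = to_number(output), to_number(expected)
    if out_num is not None and exp_num is not None:
        if exp_num == 0:
            return 1.0 if out_num == 0 else 0.0
        relative_error = abs(out_num - exp_num) / abs(exp_num)
        return 1.0 if relative_error <= tolerance else 0.0
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


print(relaxed_match("3", "3%"))     # percent sign is ignored
print(relaxed_match("4", "3"))      # outside the tolerance
print(relaxed_match("Yes", "yes"))  # case-insensitive text match
```

Comparing this score against the LLM judge on the same examples is one way to spot cases where the judge is too lenient or too strict.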

In [None]:
import base64

# Models to evaluate
MODELS = [
    "gpt-4o",
    "gpt-4o-mini",
  #  "meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
  #  "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
  #  "pixtral-12b-2409",
]

SYSTEM_PROMPT = """Answer the following question based on the provided image. 
Do not decorate the output with any explanation, or markdown. Just return the answer. 
{key}
"""

# Function to encode the image as a base64 data URL
def encode_image(image):
    from io import BytesIO
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    encoded = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

# Function to call the API with an image
async def extract_value(model, key, base64_image):
    # `base64_image` is already a complete data URL (see encode_image),
    # so it is passed to the API as-is.
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT.format(key=key)
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": key},
                    {
                        "type": "image_url",
                        "image_url": {"url": base64_image}  # Already a data URL
                    }
                ]
            }
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example usage
question, answer, image = load_chart_qa_example(0)
print("Question:", question)
print("Answer:", answer)
image.show()

# Encode the loaded image
base64_image = encode_image(image)

# Iterate over each model and print the response
async def process_example(question):
    for model in MODELS:
        print("Running model:", model)
        result = await extract_value(model, question, base64_image)
        print("Model:", model, "| Answer:", result, "\n")

# Run with the example question
await process_example(question)


## Running the evaluation

Now that the basic sanity test works, let's run an evaluation! We'll use the `Levenshtein` and `Factuality` scorers, along with the `PartialCredit` judge defined above, to assess performance.
`Levenshtein` is a heuristic that tells us how closely the actual and expected strings match. Since some of the models will occasionally emit superfluous
explanation text, `Factuality`, which is LLM-based, should still be able to give us an accuracy measurement.
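
To make the string-matching heuristic concrete, here is a minimal, self-contained sketch of a normalized edit-distance score. It is an illustration only: I am assuming the score is `1 - distance / max(len)`, which may differ in detail from what the `autoevals` scorer computes, and the function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def similarity(output, expected):
    """Assumed normalization: 1.0 for identical strings, approaching 0.0
    as the strings diverge."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, expected) / longest


print(similarity("14", "14"))
print(similarity("The answer is 14", "14"))
```

The second call scores far lower than the first even though both contain the right value, which is why a string heuristic alone undercounts correct answers wrapped in extra text.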


In [None]:
from braintrust import Eval
from autoevals import Factuality, Levenshtein

NUM_EXAMPLES = 100

# Prepare data with base64-encoded images
data = []
for idx in range(NUM_EXAMPLES):
    question, answer, image = load_chart_qa_example(idx)
    base64_image = encode_image(image)  # Encode the image to base64
    
    data.append({
        "input": {
            "key": question,
            "img_data": base64_image,
        },
        "expected": answer,
        "metadata": {
            "idx": idx,
        },
    })

# Run evaluation for each model
for model in MODELS:

    async def task(input):
        # Use `img_data` as the encoded image
        return await extract_value(model, input["key"], input["img_data"])

    await Eval(
        "ChartQA Extraction",
        data=data,
        task=task,
        scores=[Levenshtein, Factuality, partialc],
        experiment_name=f"ChartQA Extraction - {model}",
        metadata={"model": model},
    )


## Interesting Takeaways from Using Braintrust

- I could use multiple models. I am passing the image directly, so some other vision models, such as those hosted on Together, would require reworking the dataset to use image URLs.

- I was able to evaluate a vision language model.

- It was easy to run across multiple models.

- It was easy to add my own scorer (LLM as a judge).

- I could follow improvements and regressions.

- I did this all from a notebook (I could also have set this up through the UI).

- Evaluations are easy to drill into: I can see specific examples and the actual text passed to the model.

- I was able to see how my custom model did.

- Obviously, there are many more comparisons that I could do; there is a lot more to explore with the tool here.
