# Evaluating LLMs on SimpleQA Using Batch Inference API
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Batch_Inference_Evals.ipynb)

## Overview
This notebook demonstrates how to use Together AI's Batch Inference API to evaluate large language models on OpenAI's SimpleQA benchmark. We'll evaluate DeepSeek-V3-0324's performance on factual question answering and use an LLM-as-a-Judge grading system.

By the end of this tutorial, you'll understand how to:
- Format requests for the Batch Inference API
- Evaluate LLMs on factual QA benchmarks
- Implement automated grading workflows
- Calculate evaluation metrics for model performance

### What is SimpleQA?
SimpleQA is a benchmark consisting of factual questions with short answers. It tests a model's ability to provide accurate, concise responses to straightforward factual queries. The benchmark includes questions across various domains like science, geography, sports, and arts.

### Methodology
Our evaluation follows a two-step process:
1. **Answer Generation**: Use DeepSeek-V3-0324 to generate answers to SimpleQA questions
2. **Automated Grading**: Use Llama 3.3 70B as a judge to grade the answers as CORRECT, INCORRECT, or NOT_ATTEMPTED

## 🚀 Setup and Dependencies

In [1]:
!pip install -qU together pandas

## 📊 Understanding the Dataset
Let's examine the SimpleQA dataset structure to understand what we're working with.

In [3]:
import pandas as pd
import json
import uuid

# Load the SimpleQA dataset
df = pd.read_csv("https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv")
df.head()

| | metadata | problem | answer |
|---|---|---|---|
| 0 | {'topic': 'Science and technology', 'answer_ty... | Who received the IEEE Frank Rosenblatt Award i... | Michio Sugeno |
| 1 | {'topic': 'Science and technology', 'answer_ty... | Who was awarded the Oceanography Society's Jer... | Annick Bricaud |
| 2 | {'topic': 'Geography', 'answer_type': 'Place',... | What's the name of the women's liberal arts co... | Radcliffe College |
| 3 | {'topic': 'Sports', 'answer_type': 'Person', '... | In whose honor was the Leipzig 1877 tournament... | Adolf Anderssen |
| 4 | {'topic': 'Art', 'answer_type': 'Person', 'url... | According to Karl Küchler, what did Empress El... | Poet Henrich Heine. |


Here we can see that every row has a factual question and a corresponding answer.
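The `metadata` column appears to hold a stringified Python dict rather than JSON (note the single quotes in the preview). Below is a minimal sketch for parsing it, assuming every row follows the shape visible above. The helper name `parse_metadata` and the sample string are illustrative; the sample only reuses keys shown in the preview and is not a real row.

```python
import ast

def parse_metadata(raw: str) -> dict:
    """Parse a stringified Python dict such as "{'topic': 'Geography', ...}".

    ast.literal_eval only evaluates literals, so it does not execute
    arbitrary code the way eval() would.
    """
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a dict, got {type(parsed).__name__}")
    return parsed

# Illustrative value built from the keys visible in the preview (not a real row)
sample = "{'topic': 'Geography', 'answer_type': 'Place'}"
meta = parse_metadata(sample)
print(meta["topic"], "|", meta["answer_type"])
```

With the DataFrame loaded, something like `df['topic'] = df['metadata'].map(lambda m: parse_metadata(m)['topic'])` would give a column to group results by later on.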

## 🎯 Step 1: Preparing Student Model Requests
Now we'll create batch requests for DeepSeek V3 to answer all the SimpleQA questions.

### Batch API Request Format
The Batch API expects JSONL format where each line contains a request with:
- `custom_id`: Unique identifier for tracking responses
- `body`: The actual API request payload

```json
{"custom_id": "request-1", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Hello, world!"}]}}
```
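The format above can be sketched as a small builder plus a sanity check. This is only an illustration: `make_batch_line` and `validate_batch_lines` are hypothetical helper names, and the checks cover just the two fields described here, not everything the service may validate.

```python
import json

def make_batch_line(custom_id: str, model: str, prompt: str) -> str:
    """Serialize one chat-completion request as a single JSONL line."""
    request = {
        "custom_id": custom_id,
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    # json.dumps escapes embedded newlines, so the record stays on one line
    return json.dumps(request)

def validate_batch_lines(lines) -> int:
    """Check that each line parses, has both keys, and custom_ids are unique."""
    seen = set()
    for n, line in enumerate(lines, start=1):
        record = json.loads(line)
        missing = {"custom_id", "body"} - record.keys()
        if missing:
            raise ValueError(f"line {n}: missing keys {sorted(missing)}")
        if record["custom_id"] in seen:
            raise ValueError(f"line {n}: duplicate custom_id {record['custom_id']!r}")
        seen.add(record["custom_id"])
    return len(seen)

lines = [
    make_batch_line("request-1", "deepseek-ai/DeepSeek-V3", "Hello,\nworld!"),
    make_batch_line("request-2", "deepseek-ai/DeepSeek-V3", "Second prompt"),
]
print(validate_batch_lines(lines))
```

Running a check like this before uploading catches duplicate IDs locally, which matters because `custom_id` is the only key available for matching responses back to questions.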

In [None]:
# Prompt template that is provided to the student model for simpleqa

STUDENT_TEMPLATE = """
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusing on concision and correctness. Do not output anything else 
besides your answer to the question. 
Question: {question}
Predicted Answer: 
"""

In [4]:
# Create batch API format file
output_filename = "simpleqa_batch_student.jsonl"

with open(output_filename, 'w') as f:
    for idx, row in df.iterrows():
        # Generate unique UUID for each request
        custom_id = str(uuid.uuid4())
        
        # Create the batch API request format
        batch_request = {
            "custom_id": custom_id,
            "body": {
                "model": "deepseek-ai/DeepSeek-V3",
                "messages": [
                    {
                        "role": "user",
                        "content": STUDENT_TEMPLATE.format(question=row["problem"])
                    }
                ]
            }
        }
        
        # Write as JSONL (one JSON object per line)
        f.write(json.dumps(batch_request) + '\n')

print(f"Created {output_filename} with {len(df)} batch requests")
print(f"Each request uses model: deepseek-ai/DeepSeek-V3")

Created simpleqa_batch_student.jsonl with 4326 batch requests
Each request uses model: deepseek-ai/DeepSeek-V3


In [5]:
print(f"Sample first few requests:")

# Show first few requests as verification
with open(output_filename, 'r') as f:
    for i, line in enumerate(f):
        if i < 3:  # Show first 3 requests
            request = json.loads(line)
            print(f"Custom ID: {request['custom_id']}")
            print(f"Question: {request['body']['messages'][0]['content']}")
            print("---")
        else:
            break


Sample first few requests:
Custom ID: 042fffda-ce74-4c38-8f42-45429b8ee3dc
Question: 
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusing on concision and correctness. Do not output anything else 
besides your answer to the question. 
Question: Who received the IEEE Frank Rosenblatt Award in 2010?
Predicted Answer: 

---
Custom ID: f1305d3b-0688-439b-8fb1-5fe5725575a7
Question: 
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusing on concision and correctness. Do not output anything else 
besides your answer to the question. 
Question: Who was awarded the Oceanography Society's Jerlov Award in 2018?
Predicted Answer: 

---
Custom ID: 917ccf80-be0e-4043-8836-481275015830
Question: 
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusin

In [21]:
# Load the JSONL file into a DataFrame
batch_df = pd.read_json('simpleqa_batch_student.jsonl', lines=True)

# Add custom_id column to df so we can match the output to the input.
# This relies on the JSONL file preserving the row order of df.
df['custom_id'] = batch_df['custom_id']

batch_df.head()

| | custom_id | body |
|---|---|---|
| 0 | 042fffda-ce74-4c38-8f42-45429b8ee3dc | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
| 1 | f1305d3b-0688-439b-8fb1-5fe5725575a7 | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
| 2 | 917ccf80-be0e-4043-8836-481275015830 | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
| 3 | 451ba7ff-9385-43f9-80bf-03bafccf7942 | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
| 4 | 20ec64a2-2a87-4d28-be0a-0fe145e57ea2 | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
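The cell above attaches IDs by position, so it depends on the JSONL file and `df` sharing the same row order. One alternative is to derive each `custom_id` from the row index with `uuid.uuid5`, so the mapping is reproducible across reruns. This is a sketch, not what the notebook does: the namespace label `simpleqa-eval` and the helper `stable_custom_id` are hypothetical.

```python
import uuid

# Hypothetical namespace label; any fixed string works as long as it never changes
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "simpleqa-eval")

def stable_custom_id(row_index: int) -> str:
    """Return the same UUID string for the same row index on every run."""
    return str(uuid.uuid5(NAMESPACE, str(row_index)))

ids = [stable_custom_id(i) for i in range(5)]
print(len(set(ids)) == len(ids))       # distinct rows get distinct IDs
print(ids[0] == stable_custom_id(0))   # recomputing yields the same ID
```

With this approach, `df['custom_id'] = [stable_custom_id(i) for i in df.index]` could be assigned before the file is written, so nothing depends on re-reading the file in order.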


## ⚡ Step 2: Running the Student Model Batch Job
Time to submit our batch job and wait for DeepSeek-V3-0324 to generate answers.

In [6]:
from together import Together

client = Together()

# Upload the batch file
batch_file = client.files.upload(file="simpleqa_batch_student.jsonl", purpose="batch-api")

print(f"Batch file ID: {batch_file.id}")

Uploading file simpleqa_batch_student.jsonl: 100%|██████████| 2.24M/2.24M [00:00<00:00, 4.24MB/s]


Batch file ID: file-73585c5e-0a77-49f6-ac0c-604ebd49727b


In [None]:
# Create the batch job
batch = client.batches.create_batch(file_id=batch_file.id, endpoint="/v1/chat/completions")

print(f"Batch created with ID: {batch.id}")

Batch created with ID: 1ea94710-592e-4f11-a92f-1a91a4ba3454


In [9]:
# monitor the batch status
batch_stat = client.batches.get_batch(batch.id)

print(batch_stat.status)

BatchJobStatus.IN_PROGRESS
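A single status check like the one above usually has to be repeated until the job finishes. The sketch below wraps that in a reusable polling helper with a timeout. It rests on two assumptions to verify against the Batch API documentation: the set of terminal state names, and a 24-hour default timeout (chosen to mirror the `job_deadline` gap visible in the batch listings in this notebook). `wait_for_batch` and `status_name` are hypothetical helpers, and the fake status sequence exists only to make the example runnable offline.

```python
import time

# Assumption: these are the terminal states; confirm against the Batch API docs
TERMINAL_STATES = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}

def status_name(status) -> str:
    """Normalize an enum member or a plain string to its bare state name."""
    return str(getattr(status, "value", status)).split(".")[-1]

def wait_for_batch(fetch_status, poll_seconds=60, timeout_seconds=24 * 3600,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a terminal state or the timeout is reached."""
    deadline = clock() + timeout_seconds
    while True:
        state = status_name(fetch_status())
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"batch still {state} after {timeout_seconds}s")
        sleep(poll_seconds)

# Offline demonstration with a fake status sequence and a no-op sleep
fake_statuses = iter(["VALIDATING", "IN_PROGRESS", "COMPLETED"])
print(wait_for_batch(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

Against the real client, the call would look something like `wait_for_batch(lambda: client.batches.get_batch(batch.id).status)`. Injecting `sleep` and `clock` keeps the helper testable without waiting.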


In [10]:
# List all batches - contains other batches as well
client.batches.list_batches()

[BatchJob(id='a2e03a80-a083-4c73-8790-43cac5d40715', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-c6131920-f4b3-44ff-8daa-f8e3f5cc4e70', file_size_bytes=1056749, status=<BatchJobStatus.COMPLETED: 'COMPLETED'>, job_deadline=datetime.datetime(2025, 6, 10, 21, 14, 25, 429867, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 9, 21, 14, 25, 429872, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=100.0, model_id='deepseek-ai/DeepSeek-R1', output_file_id='file-94924b8e-ac08-4166-932d-74055bbda804', error_file_id='file-ac6939fc-b542-438d-9487-cb034afad215', error=None, completed_at=datetime.datetime(2025, 6, 9, 23, 8, 52, 900997, tzinfo=TzInfo(UTC))),
 BatchJob(id='a14c0f60-63c3-4ce4-9181-d6a69d48655e', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-c6131920-f4b3-44ff-8daa-f8e3f5cc4e70', file_size_bytes=1056749, status=<BatchJobStatus.COMPLETED: 'COMPLETED'>, job_deadline=datetime.datetime(2025, 6, 10, 21, 13, 54, 381593, tzinfo=TzInfo(UTC)), create

In [15]:
# Refresh the status, then download the output once the job has completed
batch_stat = client.batches.get_batch(batch.id)
if batch_stat.status == 'COMPLETED':
    output_response = client.files.retrieve_content(id=batch_stat.output_file_id,
                                                    output="simpleqa_v3_output.jsonl")



Downloading file simpleqa_v3_output.jsonl: 100%|██████████| 2.39M/2.39M [00:00<00:00, 6.00MB/s]


## 🔍 Step 3: Analyzing Model Responses
Let's examine some of the generated responses before we grade them.

In [None]:
# Load the JSONL file into a dataframe
df_output = pd.read_json('simpleqa_v3_output.jsonl', lines=True)

# Merge with original dataframe and batch dataframe using custom_id
df_combined = pd.merge(df_output, df, on='custom_id', how='inner')
df_combined = pd.merge(df_combined, batch_df, on='custom_id', how='inner')

# Display the first few rows
print("First few rows of the combined output:")
display(df_combined.head())

First few rows of the combined output:


| | id | custom_id | response | metadata | problem | answer | body |
|---|---|---|---|---|---|---|---|
| 0 | br_bJdmq2ZaOJVtnquDCyMUaZBXCdhoRejs5DEaetspnR0 | 042fffda-ce74-4c38-8f42-45429b8ee3dc | {'status_code': 200, 'body': {'choices': [{'fi... | {'topic': 'Science and technology', 'answer_ty... | Who received the IEEE Frank Rosenblatt Award i... | Michio Sugeno | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
| 1 | br_GTayN4efdayplpuNnIvXAC6W1V552BX4f6UNTafNIso | f1305d3b-0688-439b-8fb1-5fe5725575a7 | {'status_code': 200, 'body': {'choices': [{'fi... | {'topic': 'Science and technology', 'answer_ty... | Who was awarded the Oceanography Society's Jer... | Annick Bricaud | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
| 2 | br_Ex2m6NyCx9XKARt8HIvIaaxy8A1YQymlC5IQAcGXyQM | 917ccf80-be0e-4043-8836-481275015830 | {'status_code': 200, 'body': {'choices': [{'fi... | {'topic': 'Geography', 'answer_type': 'Place',... | What's the name of the women's liberal arts co... | Radcliffe College | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
| 3 | br_rlKv8Cvs2euv2fsuyQkGtK16SH4Zfbp6kfBjf7estG0 | 451ba7ff-9385-43f9-80bf-03bafccf7942 | {'status_code': 200, 'body': {'choices': [{'fi... | {'topic': 'Sports', 'answer_type': 'Person', '... | In whose honor was the Leipzig 1877 tournament... | Adolf Anderssen | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |
| 4 | br_hvXdufXYdjAZz3uQwWQrzAcdxrLW1P-mFQEJ7Wsb6pg | 20ec64a2-2a87-4d28-be0a-0fe145e57ea2 | {'status_code': 200, 'body': {'choices': [{'fi... | {'topic': 'Art', 'answer_type': 'Person', 'url... | According to Karl Küchler, what did Empress El... | Poet Henrich Heine. | {'model': 'deepseek-ai/DeepSeek-V3', 'messages... |


In [45]:
print("Input Prompt:", df_combined['body'].iloc[100]['messages'][0]['content'])


Input Prompt: 
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusing on concision and correctness. Do not output anything else 
besides your answer to the question. 
Question: Which architects designed the Abasto?
Predicted Answer: 



In [51]:
print("Predicted Answer:", df_combined['response'].iloc[100]['body']['choices'][0]['message']['content'])

Predicted Answer: Juan Antonio Buschiazzo and Virgilio Colombo.


In [52]:
print("Gold Answer:", df_combined['answer'].iloc[100])


Gold Answer: José Luis Delpini, Viktor Sulčič and Raúl Bes


As you can see, the model got this particular example wrong!
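The nested lookups used above assume every request succeeded. The sketch below is a more defensive accessor based only on the record shape visible in the combined output (`status_code`, then `body`, `choices`, `message`, `content`). `extract_answer` is a hypothetical helper, and the failure record is an invented example of what a non-200 entry might look like.

```python
from typing import Optional

def extract_answer(response: dict) -> Optional[str]:
    """Return the assistant text from one batch output record, or None.

    Assumes the record shape shown above:
    {'status_code': 200, 'body': {'choices': [{'message': {'content': ...}}]}}
    """
    if not isinstance(response, dict) or response.get("status_code") != 200:
        return None
    choices = (response.get("body") or {}).get("choices") or []
    if not choices:
        return None
    content = (choices[0].get("message") or {}).get("content")
    return content.strip() if isinstance(content, str) else None

ok = {"status_code": 200,
      "body": {"choices": [{"message": {"content": " example answer "}}]}}
failed = {"status_code": 500, "body": {}}  # hypothetical failure record

print(repr(extract_answer(ok)))
print(repr(extract_answer(failed)))
```

Applied as `df_combined['response'].map(extract_answer)`, this would surface failed or empty responses as `None`, so they can be counted before grading instead of raising a `KeyError` midway through the loop.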

## 📋 Step 4: Preparing Grader Requests
Now we'll create batch requests for our judge model to evaluate the answers.

### Grading Criteria
The grader uses strict criteria from the SimpleQA paper:
- **CORRECT**: Answer contains all required information without contradictions
- **INCORRECT**: Answer contains factual errors or contradictions
- **NOT_ATTEMPTED**: Answer doesn't provide the required information but doesn't contradict

In [16]:
GRADER_TEMPLATE = """
Your job is to look at a question, a gold target, and a predicted answer, and then assign a
grade of either ["CORRECT", "INCORRECT", "NOT_ATTEMPTED"].
First, I will give examples of each grade, and then you will grade a new example.
The following are examples of CORRECT predicted answers.
‘‘‘
Question: What are the names of Barack Obama’s children?
Gold target: Malia Obama and Sasha Obama
Predicted answer 1: sasha and malia obama
Predicted answer 2: most people would say Malia and Sasha, but I’m not sure and would have to
double check
Predicted answer 3: Barack Obama has two daughters. Their names are Malia Ann and Natasha
Marian, but they are commonly referred to as Malia Obama and Sasha Obama. Malia was born
on July 4, 1998, and Sasha was born on June 10, 2001.
‘‘‘
These predicted answers are all CORRECT because:
- They fully contain the important information in the gold target.
- They do not contain any information that contradicts the gold target.
- Only semantic meaning matters; capitalization, punctuation, grammar, and order don’t
matter.
- Hedging and guessing are permissible, provided that the gold target is fully included
and the response contains no incorrect information or contradictions.
The following are examples of INCORRECT predicted answers.
‘‘‘
Question: What are the names of Barack Obama’s children?
Gold target: Malia and Sasha
Predicted answer 1: Malia.
Predicted answer 2: Malia, Sasha, and Susan.
Predicted answer 3: Barack Obama does not have any children.
Predicted answer 4: I think it’s either Malia and Sasha. Or it could be Malia and Jackie. Or
it could be Joey and Malia.
Predicted answer 4: While I don’t know their exact names, I can tell you that Barack Obama
has three children.
Predicted answer 5: It’s possible you may mean Betsy and Olivia. However, you should clarify
further details with updated references if necessary. Is that the correct answer?
Predicted answer 6: It may be the case that Obama’s child is named James. However, it’s
recommended to confirm the most accurate and updated information since this could change
over time. This model may not always reflect the most current information.
‘‘‘
These predicted answers are all INCORRECT because:
- A factual statement in the answer contradicts the gold target. Incorrect statements
that have some hedging (e.g., "it is possible that", "although i’m not sure, i think
") are also considered incorrect.
The following are examples of NOT_ATTEMPTED predicted answers.
‘‘‘
Question: What are the names of Barack Obama’s children?
Gold target: Malia and Sasha
Predicted answer 1: I don’t know.
Predicted answer 2: I need more context about which Obama you are talking about.
Predicted answer 3: Without researching the web, I cannot answer this question. However, I
can tell you that Barack Obama has two children.
Predicted answer 4: Barack Obama has two children. I know that one of them is Malia, but I’m
not sure about the other one.
‘‘‘
These predicted answers are all NOT_ATTEMPTED because:
- The important information in the gold target is not included in the answer.
- No statements in the answer contradict the gold target.
Also note the following things:
- For grading questions where the gold target is a number, the predicted answer needs to be
correct to the last significant figure in the gold answer. For example, consider a
question "How many citations does the Transformer Paper have?" with gold target "120k".
- Predicted answers "120k", "124k", and 115k" are all CORRECT.
- Predicted answers "100k" and "113k" are INCORRECT.
- Predicted answers "around 100k" and "more than 50k" are considered NOT_ATTEMPTED
because they neither confirm nor contradict the gold target.
- The gold target may contain more information than the question. In such cases, the
predicted answer only needs to contain the information that is in the question.
- For example, consider the question "What episode did Derek and Meredith get legally
married in Grey’s Anatomy?" with gold target "Season 7, Episode 20: White Wedding".
Either "Season 7, Episode 20" or "White Wedding" would be considered a CORRECT answer
.
- Do not punish predicted answers if they omit information that would be clearly inferred
from the question.
- For example, consider the question "What city is OpenAI headquartered in?" and the gold
target "San Francisco, California". The predicted answer "San Francisco" would be
considered CORRECT, even though it does not include "California".
- Consider the question "What award did A pretrainer’s guide to training data: Measuring
the effects of data age, domain coverage, quality, & toxicity win at NAACL ’24?", the
gold target is "Outstanding Paper Award". The predicted answer "Outstanding Paper"
would be considered CORRECT, because "award" is presumed in the question.
- For the question "What is the height of Jason Wei in meters?", the gold target is "1.73
m". The predicted answer "1.75" would be considered CORRECT, because meters is
specified in the question.
- For the question "What is the name of Barack Obama’s wife?", the gold target is "
Michelle Obama". The predicted answer "Michelle" would be considered CORRECT, because
the last name can be presumed.
- Do not punish for typos in people’s name if it’s clearly the same name.
- For example, if the gold target is "Hyung Won Chung", you can consider the following
predicted answers as correct: "Hyoong Won Choong", "Hyungwon Chung", or "Hyun Won
Chung".
Here is a new example. Simply reply with either CORRECT, INCORRECT, NOT ATTEMPTED. Don’t
apologize or correct yourself if there was a mistake; we are just trying to grade the
answer.
‘‘‘
Question: {question}
Gold target: {target}
Predicted answer: {predicted_answer}
‘‘‘
Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
C: NOT_ATTEMPTED
Just return the one of the letters "A", "B", or "C", with no text around it.
"""

In [None]:
# Create batch API format file
output_filename = "simpleqa_batch_grader.jsonl"

with open(output_filename, 'w') as f:
    for idx, row in df_combined.iterrows():
        # Use existing custom_id from combined dataframe
        custom_id = row['custom_id']
        
        # Get model response and gold answer
        model_response = row['response']['body']['choices'][0]['message']['content']
        gold_answer = row['answer']
        question = row['problem']
        
        # Create the batch API request format for grading
        batch_request = {
            "custom_id": custom_id,
            "body": {
                "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
                "messages": [
                    { # the judge model uses a system prompt to guide its behavior
                        "role": "system",
                        "content": "You are an AI judge evaluating the correctness of answers to factual questions."
                    },
                    {
                        "role": "user", 
                        "content": GRADER_TEMPLATE.format(
                            question=question,
                            target=gold_answer,
                            predicted_answer=model_response
                        )
                    }
                ]
            }
        }
        
        # Write as JSONL (one JSON object per line)
        f.write(json.dumps(batch_request) + '\n')

print(f"Created {output_filename} with {len(df_combined)} batch requests")
print(f"Each request uses model: meta-llama/Llama-3.3-70B-Instruct-Turbo")

Created simpleqa_batch_grader.jsonl with 4326 batch requests
Each request uses model: meta-llama/Llama-3.3-70B-Instruct-Turbo


## ⚖️ Step 5: Running the Grader Batch Job
Submit the grading requests and wait for evaluation results.

In [34]:
# Step 1: Upload the grader batch file
print("Uploading simpleqa_batch_grader.jsonl...")
grader_batch_file = client.files.upload(file="simpleqa_batch_grader.jsonl", purpose="batch-api")
print(f"Grader batch file ID: {grader_batch_file.id}")

# Step 2: Create the batch job
print("Creating batch job...")
grader_batch = client.batches.create_batch(file_id=grader_batch_file.id, endpoint="/v1/chat/completions")
print(f"Grader batch created with ID: {grader_batch.id}")

# Step 3: Check initial status
grader_batch_stat = client.batches.get_batch(grader_batch.id)
print(f"Initial status: {grader_batch_stat.status}")

Uploading simpleqa_batch_grader.jsonl...


Uploading file simpleqa_batch_grader.jsonl: 100%|██████████| 28.9M/28.9M [00:05<00:00, 5.64MB/s]


Grader batch file ID: file-45de36cc-41a7-4b34-9b00-2f730903d0d3
Creating batch job...
Grader batch created with ID: f5b32360-48b0-40c2-8c27-92d0d0f7e029
Initial status: BatchJobStatus.IN_PROGRESS
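The upload log shows the grader file at 28.9M versus 2.24M for the student file, since every grading request repeats the full prompt with its examples. If a larger benchmark pushed a file past whatever request-count or size limit applies, the requests could be split into several batch files. This is a sketch under that assumption: `chunk_jsonl` is a hypothetical helper, and both limits are caller-supplied placeholders to look up in the Batch API documentation.

```python
def chunk_jsonl(lines, max_requests, max_bytes):
    """Greedily group JSONL lines so each chunk respects both limits.

    max_requests and max_bytes are caller-supplied; look up the current
    Batch API limits rather than trusting any hard-coded default.
    """
    chunks, current, current_bytes = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if size > max_bytes:
            raise ValueError("a single request exceeds max_bytes")
        # Start a new chunk when adding this line would break either limit
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks

# Toy data: five identical 7-byte records, limited to two requests per chunk
toy_lines = ['{"a":1}'] * 5
print([len(chunk) for chunk in chunk_jsonl(toy_lines, max_requests=2, max_bytes=1000)])
```

Each chunk would then be written to its own file and submitted as a separate batch job. Because results are matched by `custom_id`, the outputs can be concatenated afterwards in any order.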


In [36]:
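# Step 4: Poll for completion and download results when ready
import time

while True:
    grader_batch_stat = client.batches.get_batch(grader_batch.id)
    print(f"Current status: {grader_batch_stat.status}")
    
    if grader_batch_stat.status == 'COMPLETED':
        print("Batch is complete! Downloading results...")
        
        # Download the output file
        grader_output_response = client.files.retrieve_content(
            id=grader_batch_stat.output_file_id,
            output="simpleqa_grader_output.jsonl"
        )
        break
    
    # Stop polling if the job can no longer complete, instead of looping forever.
    # These state names are an assumption; check them against the Batch API docs.
    if grader_batch_stat.status in ('FAILED', 'EXPIRED', 'CANCELLED'):
        raise RuntimeError(f"Grader batch ended with status {grader_batch_stat.status}")
    
    print("Waiting 1 minute before next check...")
    time.sleep(60)  # Wait 1 minute before checking again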

Batch is complete! Downloading results...


Downloading file simpleqa_grader_output.jsonl: 100%|██████████| 2.34M/2.34M [00:00<00:00, 4.39MB/s]


## 📈 Step 6: Final Results Analysis
Let's analyze DeepSeek V3's performance on SimpleQA.

### Key Metrics
- **Accuracy**: Correct answers divided by all questions
- **Accuracy for attempted**: Correct answers divided by the questions the model attempted (correct + incorrect), excluding NOT_ATTEMPTED
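The two metrics can be written as a small function over the three grade counts. The sketch below also returns their harmonic mean, which the SimpleQA paper reports as an F-score; that value is an optional extra this notebook does not compute. The function name and the toy counts are illustrative only.

```python
def simpleqa_metrics(n_correct: int, n_incorrect: int, n_not_attempted: int) -> dict:
    """Compute accuracy, accuracy on attempted questions, and their harmonic mean."""
    total = n_correct + n_incorrect + n_not_attempted
    attempted = n_correct + n_incorrect  # NOT_ATTEMPTED answers are excluded here

    accuracy = n_correct / total if total else 0.0
    accuracy_attempted = n_correct / attempted if attempted else 0.0

    # Harmonic mean of the two rates (optional extra, not computed in this notebook)
    denom = accuracy + accuracy_attempted
    f_score = 2 * accuracy * accuracy_attempted / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "accuracy_attempted": accuracy_attempted,
        "f_score": f_score,
    }

# Toy counts for illustration only: 6 correct, 2 incorrect, 2 not attempted
for name, value in simpleqa_metrics(6, 2, 2).items():
    print(f"{name}: {value:.3f}")
```

The gap between the two accuracies grows with the number of NOT_ATTEMPTED answers, so a model that abstains often can look much better on attempted accuracy than on overall accuracy.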

In [50]:
# Read and process the grader output file
nright, nidk, nwrong = 0, 0, 0  # correspond to [A, C, B] in grading schema

with open('simpleqa_grader_output.jsonl', 'r') as f:
    for line in f:
        response = json.loads(line)
        # Extract the grader's response from the output
        grader_response = response['response']['body']['choices'][0]['message']['content'].strip()
        
        if grader_response == "A":  # correct
            nright += 1
        elif grader_response == "B":  # incorrect
            nwrong += 1
        elif grader_response == "C":  # not attempted
            nidk += 1

# Final results
print(f'\nFinal Results:')
print(f'Correct: {nright}')
print(f'Incorrect: {nwrong}')
print(f'Not Attempted: {nidk}')
print(f'Total: {nright + nwrong + nidk}')

# Calculate probabilities for summary
total = nright + nwrong + nidk
attempted = nright + nwrong

p_right = nright / total if total > 0 else 0
p_right_attempted = nright / attempted if attempted > 0 else 0

print(f'Accuracy = {p_right*100:.1f}%')
print(f'Accuracy (attempted) = {p_right_attempted*100:.1f}%')


Final Results:
Correct: 919
Incorrect: 3333
Not Attempted: 74
Total: 4326
Accuracy = 21.2%
Accuracy (attempted) = 21.6%
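The counting loop above matches only the exact strings "A", "B", and "C", so any other judge reply is silently dropped from the totals. The sketch below is a more tolerant parser that also tracks unparseable replies. `parse_grade`, `GRADE_LABELS`, and the sample replies are hypothetical, and the accepted variants are a guess at how a judge model might deviate from the requested format.

```python
import re
from collections import Counter

GRADE_LABELS = {"A": "CORRECT", "B": "INCORRECT", "C": "NOT_ATTEMPTED"}

# A letter, optional punctuation, and optionally the spelled-out label after it
LETTER_PATTERN = re.compile(
    r"([ABC])[.:)]?(?:\s*(?:CORRECT|INCORRECT|NOT[_ ]ATTEMPTED))?"
)

def parse_grade(raw: str) -> str:
    """Map a judge reply to a grade label, or 'UNPARSED' if nothing matches."""
    text = raw.strip().upper()
    match = LETTER_PATTERN.fullmatch(text)
    if match:
        return GRADE_LABELS[match.group(1)]
    # Fall back to a spelled-out label with no letter, e.g. "NOT ATTEMPTED"
    normalized = text.replace(" ", "_")
    if normalized in GRADE_LABELS.values():
        return normalized
    return "UNPARSED"

# Invented replies covering the variants handled above
replies = ["A", "B", " c\n", "A: CORRECT", "INCORRECT", "I cannot grade this"]
counts = Counter(parse_grade(reply) for reply in replies)
print(dict(counts))
```

A non-zero `UNPARSED` count is a signal to inspect those replies by hand before trusting the accuracy figures. Note that this parser trusts the letter when a reply contains both a letter and a label.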


## 🎉 Conclusion

We evaluated **DeepSeek-V3-0324** on 4,326 SimpleQA questions using Together AI's Batch Inference API, achieving an **overall accuracy** of 21.2% (919/4,326 questions).

## 🔑 Key Takeaways

- **Batch Processing Efficiency**: Batch inference reduces cost and avoids per-request rate limiting for large-scale evaluations
- **Automated Evaluation**: LLM-as-a-judge enables scalable assessment without manual annotation
- **Systematic Methodology**: Proper request formatting, progress monitoring, and result alignment are crucial for reliable evaluations
- **Evaluation Complexity**: Even straightforward factual questions pose significant challenges for state-of-the-art models

## 📚 Learn More

For detailed documentation on batch inference capabilities and implementation:
👉 **[Together AI Batch Inference Documentation](https://docs.together.ai/docs/batch-inference)**