# Model Comparison on Summarization Tasks
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Evals/Compare_Evals.ipynb)

<img src="../images/compare_eval.png" width="750">

## Introduction

This notebook demonstrates how to compare the quality of a small base model, a fine-tuned model, and a proprietary model on a conversational dataset using the Together Evaluations service. We'll showcase:
- Fine-tuning a base model on the HelpSteer3 dataset
- Using external models (for example, GPT from OpenAI or Gemini from Google) for comparison
- Leveraging an LLM-as-a-Judge (set in `JUDGE_MODEL` below) to evaluate model outputs
- Comparing against a proprietary baseline model (set in `PROPRIETARY_BASE_MODEL` below)

You can also find out more about the Evaluations API in the [docs](https://docs.together.ai/docs/ai-evaluations)!

The full list of supported models can be found [here](https://docs.together.ai/docs/evaluations-supported-models).


**Concepts Covered:**
- **LLM-as-a-Judge**: Using a strong judge model (`JUDGE_MODEL` below) to evaluate and compare outputs from other models
- **Compare Evaluation**: Head-to-head comparison between multiple models to determine which performs better
- **Fine-tuning**: Training a base model on domain-specific data to improve performance
- **External Model Integration**: Using models from different providers (OpenAI, Google) alongside Together AI models
- **Multi-Model Comparison**: Evaluating a base open-source model, its fine-tuned version, and a proprietary model against the same golden answers


## Installation

To set up the environment:
1. Navigate to the same folder as this notebook
2. Run the installation script: `bash install.sh`
   - This will create a virtual environment called `env_cookbook_evals`
   - It will install all dependencies from `requirements.txt`
3. Activate the environment: `source env_cookbook_evals/bin/activate`
4. You're ready to run this notebook!

In [None]:
import os

# API keys are assumed to be set as environment variables with these names
TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
WANDB_API_KEY = os.environ.get("WANDB_API_KEY")  # optional, used only for logging

JUDGE_MODEL = "moonshotai/Kimi-K2-Instruct-0905"
PROPRIETARY_BASE_MODEL = "openai/gpt-5-mini"
# You can try Qwen/Qwen3-Next-80B-A3B-Instruct if you have a lot of time.
# It is available for fine-tuning and on serverless inference, but fine-tuning takes a couple of hours,
# and the fine-tuned version requires a DE endpoint to be launched manually.
BASE_OSS_MODEL_FOR_INFERENCE = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
# We use this model because it is available on LoRA serverless,
# but DE models also work if you launch an endpoint manually
BASE_OSS_MODEL_FOR_FT = "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference"

In [38]:
import together

together_client = together.Client(api_key=TOGETHER_API_KEY)

## 📊 Understanding the HelpSteer3 Edit Dataset

The HelpSteer3 Edit dataset contains conversational contexts paired with multiple response options that can be compared and evaluated.

In our setup we use only the `edited_response` column, which represents the ideal response the user is looking for (the "golden answer"). In other settings, this reference could be written by a human or generated by a strong proprietary model (e.g., GPT-4).

For our evaluation, we'll use this dataset to compare how different models respond to the same prompts and assess which produces higher-quality outputs.

In [3]:
from datasets import load_dataset
hs3_edit = load_dataset("nvidia/HelpSteer3", "edit")



In [4]:
# Print a sample to understand the structure
sample = hs3_edit['train'][0]
print("Sample context:")
print(sample['context'])
print("\nSample edited_response:")
print(sample['edited_response'])


Sample context:

Sample edited_response:
Certainly! Here's the updated code with a function that displays additional information and a link when the answer is incorrect:

```html
<!DOCTYPE html>
<html>
<head>
  <style>
    body {
      font-family: Arial, sans-serif;
      text-align: center;
    }

    h1 {
      color: #333;
    }

    input[type="text"] {
      padding: 10px;
      font-size: 16px;
      border-radius: 5px;
      border: 1px solid #ccc;
      margin-bottom: 10px;
    }

    button {
      padding: 10px 20px;
      font-size: 16px;
      border-radius: 5px;
      border: none;
      background-color: #4CAF50;
      color: white;
      cursor: pointer;
    }

    button:hover {
      background-color: #45a049;
    }

    #result {
      margin-top: 20px;
    }

    .incorrect-message {
      margin-top: 20px;
      color: red;
    }
  </style>
</head>
<body>

  <h1>Simple Quiz</h1>

  <p>Enter your response:</p>

  <input type="text" id="response">

  <br>

  <button
```

## 🔧 SFT Fine-tuning with HelpSteer3 Edit Dataset

We'll use the `context` and `edited_response` columns to create training data for Supervised Fine-Tuning (SFT). The context already contains conversation messages, and we'll append the edited_response as the final assistant message.


In [5]:
from together import Together

together_client = Together(api_key=TOGETHER_API_KEY)


In [6]:
def map_to_sft_format(row):
    """
    Convert HelpSteer3 edit row to SFT (Supervised Fine-Tuning) chat format.
    
    This function transforms a single row from the HelpSteer3 edit dataset into the 
    chat message format required for fine-tuning language models.
    
    Args:
        row: A dictionary containing 'context' and 'edited_response' keys from the dataset
        
    Process:
        1. Extracts the 'context' field, which contains a list of conversation messages
           (typically alternating between user and assistant roles)
        2. Creates a copy of these context messages to avoid modifying the original data
        3. Appends the 'edited_response' as a new assistant message to complete the conversation
           This edited_response represents the ideal/golden response for training
        
    Returns:
        A dictionary with a single 'messages' key containing the complete conversation,
        formatted as a list of message dictionaries with 'role' and 'content' fields
        
    Example:
        Input row: {
            'context': [
                {'role': 'user', 'content': 'Hello'},
                {'role': 'assistant', 'content': 'Hi there'},
                {'role': 'user', 'content': 'Can you help me?'}
            ],
            'edited_response': 'Of course! What do you need?'
        }
        
        Output: {
            'messages': [
                {'role': 'user', 'content': 'Hello'},
                {'role': 'assistant', 'content': 'Hi there'},
                {'role': 'user', 'content': 'Can you help me?'},
                {'role': 'assistant', 'content': 'Of course! What do you need?'}
            ]
        }
    """
    messages = list(row['context'])  # Copy the context messages
    messages.append({"role": "assistant", "content": row['edited_response']})
    return {"messages": messages}

# Apply transformation to the dataset
train_sft = hs3_edit['train'].map(map_to_sft_format, remove_columns=hs3_edit['train'].column_names)
print(f"Transformed dataset size: {len(train_sft)}")
print(f"Sample messages count: {len(train_sft[0]['messages'])}")


Transformed dataset size: 13740
Sample messages count: 6


In [7]:
# Validate the dataset format
assert 'messages' in train_sft.column_names, "Dataset must contain 'messages' column"
assert len(train_sft) > 0, "Dataset must not be empty"
assert isinstance(train_sft[0]['messages'], list), "Messages must be a list"
assert all('role' in msg and 'content' in msg for msg in train_sft[0]['messages']), "Each message must have 'role' and 'content'"
print("✓ Dataset format validation passed")

✓ Dataset format validation passed
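
The assertions above inspect only the first row. As a minimal sketch, the same checks can be run over every conversation, together with a check that the final message is the assistant turn that SFT trains on. The helper name `find_invalid_conversations` is hypothetical and not part of the Together or `datasets` APIs, and the demo rows are made up for illustration.

```python
def find_invalid_conversations(rows):
    """Return (index, reason) pairs for rows that fail basic SFT format checks.

    `rows` is any iterable of dicts with a 'messages' list, e.g. `train_sft`.
    """
    problems = []
    for i, row in enumerate(rows):
        messages = row.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append((i, "messages must be a non-empty list"))
            continue
        if not all(isinstance(m, dict) and "role" in m and "content" in m for m in messages):
            problems.append((i, "each message needs 'role' and 'content'"))
            continue
        if messages[-1]["role"] != "assistant":
            problems.append((i, "last message should be the assistant response"))
    return problems


# Tiny illustrative input (not taken from the dataset)
demo_rows = [
    {"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]},
    {"messages": [{"role": "user", "content": "Hi"}]},
]
print(find_invalid_conversations(demo_rows))
```

Calling `find_invalid_conversations(train_sft)` should return an empty list if every row is well formed.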


In [10]:
train_sft[0]['messages']

[{'content': 'Good morning! Can you please write me a simple html and javascript that creates a quiz for students, using the variables I have defined below? The quiz will need to calculate the correct response, which is the average of 7 random numbers. Students will enter their response which will be checked against the calculated response, then a gif image will be displayed for "correct" or "incorrect".',
  'role': 'user'},
  'role': 'assistant'},
 {'content': 'Thank you! Are you able to add some CSS so it looks a bit nicer?',
  'role': 'user'},
  'role': 'assistant'},
 {'content': 'Great! Can you please add a function so that if the answer is incorrect, the student is given some text with a link to find more information? This text and the link URL need to be stored in a variable.',
  'role': 'user'},
 {'content': 'Certainly! Here\'s the updated code with a function that displays additional information and a link when the answer is incorrect:\n\n```html\n<!DOCTYPE html>\r\n<html>\r\n<

In [11]:
# Save to JSONL file
SFT_TRAIN_FILE = "helpsteer3_sft_train.jsonl"
train_sft.to_json(SFT_TRAIN_FILE)
print(f"Saved training data to {SFT_TRAIN_FILE}")


Creating json from Arrow format: 100%|██████████| 14/14 [00:00<00:00, 29.96ba/s]

Saved training data to helpsteer3_sft_train.jsonl





In [12]:
# Upload file to Together AI
train_file_resp = together_client.files.upload(SFT_TRAIN_FILE, purpose='fine-tune', check=True)
print(f"Uploaded file ID: {train_file_resp.id}")


Validating file: 13740 lines [00:00, 58238.33 lines/s]
Uploading file helpsteer3_sft_train.jsonl: 100%|██████████| 79.0M/79.0M [00:05<00:00, 13.9MB/s]


Uploaded file ID: file-d4186e06-621d-48e2-aa4e-f6d0c2afa325


### Launch Fine-tuning Job

Configure and start the SFT fine-tuning job using the uploaded HelpSteer3 data.


In [13]:
train_file_id = train_file_resp.id

### Training takes about 15 minutes, so we can start it and proceed to the Evaluations part without waiting.

In [39]:

ft_resp = together_client.fine_tuning.create(
    training_file=train_file_resp.id,
    model=BASE_OSS_MODEL_FOR_FT,
    train_on_inputs=False,
    n_epochs=1,
    n_checkpoints=1,
    wandb_api_key=WANDB_API_KEY if WANDB_API_KEY else None,
    lora=True,
    suffix="helpsteer3-sft",
)

print(f"Fine-tuning job ID: {ft_resp.id}")
print(f"Status: {ft_resp.status}")

Fine-tuning job ID: ft-2e1fb7c5-2e58
Status: pending


### Monitor Fine-tuning Progress


In [40]:
# Check job status
job_status = together_client.fine_tuning.retrieve(ft_resp.id)
print(f"Status: {job_status.status}")

# List events/logs
for event in together_client.fine_tuning.list_events(id=ft_resp.id).data:
    print(event.message)

print(ft_resp.id)

Status: pending
Fine-tuning request created
ft-2e1fb7c5-2e58


### Inference with Fine-tuned Model (when fine-tuning is finished)

Once the job completes, test the fine-tuned model.


In [None]:
import json
job_status = together_client.fine_tuning.retrieve(ft_resp.id)
status_dict = job_status.model_dump()
status_dict.pop('events', None)
print(json.dumps(status_dict, indent=2, default=str))

{
  "id": "ft-179f58d2-6079",
  "training_file": "file-04348668-ed71-4526-9e77-4fcc824ab9bd",
  "validation_file": "file-ef6acfbc-1eaa-42cc-9623-277ff9bd32b0",
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
  "output_name": "ivprov/Meta-Llama-3.1-8B-Instruct-Reference-talkspace-dpo-8B-aa4a26b5",
  "adapter_output_name": null,
  "n_epochs": 1,
  "n_checkpoints": 1,
  "n_evals": 10,
  "batch_size": 8,
  "learning_rate": 1e-05,
  "lr_scheduler": {
    "lr_scheduler_type": "cosine",
    "lr_scheduler_args": {
      "min_lr_ratio": 0.0,
      "num_cycles": 0.5
    }
  },
  "warmup_ratio": 0.0,
  "max_grad_norm": 1.0,
  "weight_decay": 0.0,
  "eval_steps": 0,
  "training_type": {
    "type": "Lora"
  },
  "created_at": "2025-11-04T12:57:17.129Z",
  "updated_at": "2025-11-04T15:11:47.746Z",
  "status": "completed",
  "job_id": "ft-179f58d2-6079",
  "token_count": 11579857,
  "param_count": 8030261248,
  "total_price": 21368624400,
  "total_steps": 5471,
  "steps_completed": 547

In [18]:
finetuned_model = job_status.output_name
print(f"Fine-tuned model: {finetuned_model}")

Fine-tuned model: ivprov/Meta-Llama-3.1-8B-Instruct-Reference-helpsteer3-sft-d5865876


In [10]:
# Test with a sample prompt, LoRA adapter takes time to load
response = together_client.chat.completions.create(
    model=finetuned_model,
    messages=[{"role": "user", "content": "Write a Python function to calculate factorial."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

**Factorial Function in Python**

Here's a simple Python function to calculate the factorial of a given number:

```python
def factorial(n):
    """
    Calculate the factorial of a given number.

    Args:
        n (int): The number to calculate the factorial of.

    Returns:
        int: The factorial of n.

    Raises:
        ValueError: If n is a negative number.
    """
    if not isinstance(n, int):
        raise TypeError("Input must be an integer.")
    if n < 0:
        raise ValueError("Input must be a non-negative integer.")
    elif n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n-1)
```

**Example Use Cases**
--------------------

```python
print(factorial(5))  # Output: 120
print(factorial(0))  # Output: 1
print(factorial(1))  # Output: 1
```

**Alternative Recursive Implementation**
--------------------------------------

If you prefer a more concise recursive implementation, you can use the following function:

```python
def factorial(n):
```

## 🔄 Preparing Data for Evaluation

We'll create a test subset from the HelpSteer3 dataset and prepare it for evaluation. The evaluation will compare:
1. Proprietary model (`PROPRIETARY_BASE_MODEL`) vs golden answer (edited_response)
2. Base OSS model vs golden answer (edited_response), as a baseline comparison
3. Fine-tuned OSS model vs golden answer (edited_response), to measure fine-tuning effectiveness

We will use the model set in `JUDGE_MODEL` as the judge for all comparisons.

We need to:
- Apply a chat template to convert the context messages into a formatted prompt string
- Include the golden answer (edited_response) for comparison
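
To make the chat-template step listed above concrete, here is a simplified, hand-written sketch of what a Llama 3 style template produces. It is only an illustration: the real template, applied through `tokenizer.apply_chat_template`, also inserts a default system header, and the function name `format_llama3_style` is hypothetical.

```python
def format_llama3_style(messages):
    """Join chat messages into one prompt string using Llama 3 style special tokens."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n{msg['content']}<|eot_id|>"
        )
    return "".join(parts)


demo = [{"role": "user", "content": "Hello"}]
print(format_llama3_style(demo))
```

In the notebook itself the tokenizer's own template is used, so that the prompt matches what the model saw during training.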

In [19]:
# Let's use 50 samples for validation for speed
VALIDATION_SIZE = 50
test_data = hs3_edit['validation'].select(range(VALIDATION_SIZE))
print(f"Test subset size: {len(test_data)}")

Test subset size: 50


In [21]:
BASE_OSS_MODEL_FOR_FT

'meta-llama/Meta-Llama-3.1-8B-Instruct-Reference'

In [22]:
from transformers import AutoTokenizer

# Load tokenizer for the base model to apply proper chat template
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

def prepare_eval_data(row):
    """Prepare a single row for evaluation with formatted context and golden answer.
    
    Steps:
    1. Extract the conversation context from the row (list of message dicts with 'role' and 'content')
    2. Apply the chat template to format the context into a single prompt string
       - tokenize=False: Return formatted string instead of token IDs
       - add_generation_prompt=False: Don't add the assistant's generation prompt at the end
    3. Extract the golden answer (edited_response) from the row
    4. Return a dictionary with:
       - 'context_formatted': The formatted conversation prompt as a string
       - 'golden_answer': The reference response to compare against
    """
    context_formatted = tokenizer.apply_chat_template(
        row['context'], 
        tokenize=False, 
        add_generation_prompt=False
    )
    return {
        'context_formatted': context_formatted,
        'golden_answer': row['edited_response']
    }

# Transform test data for evaluation
eval_data = [prepare_eval_data(row) for row in test_data]
print(f"Prepared {len(eval_data)} samples for evaluation")

Prepared 50 samples for evaluation


In [23]:
# Verify the eval_data format
assert len(eval_data) > 0, "eval_data should not be empty"
assert all('context_formatted' in item for item in eval_data), "All items must have 'context_formatted' key"
assert all('golden_answer' in item for item in eval_data), "All items must have 'golden_answer' key"
assert all(isinstance(item['context_formatted'], str) for item in eval_data), "context_formatted must be strings"
assert all(isinstance(item['golden_answer'], str) for item in eval_data), "golden_answer must be strings"
print("✓ Eval data format validated successfully")

✓ Eval data format validated successfully


In [24]:
import json
# Save evaluation data to JSONL and upload
# Note: each line of the eval file is a JSON object with 'context_formatted' and 'golden_answer' fields
EVAL_FILE = "helpsteer3_eval.jsonl"
with open(EVAL_FILE, 'w') as f:
    for eval_item in eval_data:
        json.dump(eval_item, f)
        f.write('\n')

uploaded_eval_file = together_client.files.upload(file=EVAL_FILE, purpose='eval', check=False)
print(f"Uploaded eval file ID: {uploaded_eval_file.id}")

Uploading file helpsteer3_eval.jsonl: 100%|██████████| 400k/400k [00:01<00:00, 232kB/s]


Uploaded eval file ID: file-3d3546f2-c399-4bfc-8827-6c4607b192d9


In [25]:
# Model and judge configuration

# The input template references the context_formatted field so the model can use it for generation.
# The judge template also references context_formatted so the judge
# can see the context the response was generated for.

# To reference a dataset field from a Python f-string, use four braces: {{{{field_name}}}}

JUDGE_TEMPLATE = f"""You are an expert judge evaluating the quality of AI assistant responses.

EVALUATION CRITERIA:
1. **Helpfulness**: Does the response fully address the user's request?
2. **Accuracy**: Is the information correct and free from errors?
3. **Clarity**: Is the response well-organized and easy to understand?
4. **Completeness**: Does it cover all relevant aspects of the query?
5. **Safety**: Is the response appropriate and safe?

INSTRUCTIONS:
- Read the conversation context carefully
- Compare Response A and Response B
- Determine which response better satisfies the user's needs
- Provide a brief justification (2-3 sentences)

The context for the reply was:
{{{{context_formatted}}}}
"""

# Model config for generation from context
generation_system_template = "You are a helpful AI assistant."
input_template = f"{{{{context_formatted}}}}"
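
The four-brace rule follows from how Python f-strings escape braces: each `{{` collapses to a single literal `{`, so `{{{{field}}}}` becomes the `{{field}}` placeholder that the template engine later fills in. A quick check using only built-in string behavior:

```python
field_reference = f"{{{{context_formatted}}}}"  # f-string: four braces collapse to two
plain_reference = "{{context_formatted}}"       # regular string: braces are kept as written

print(field_reference)
assert field_reference == plain_reference
```

If a template has no Python variables to interpolate, a plain string with two braces is equivalent and may be easier to read.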

In [26]:
# Test that jinja actually works
from jinja2 import Template
test_context = "<<This is a test context>>"
test_template = Template(JUDGE_TEMPLATE)
rendered = test_template.render(context_formatted=test_context)
assert test_context in rendered, "Jinja template rendering failed"
print(rendered)


You are an expert judge evaluating the quality of AI assistant responses.

EVALUATION CRITERIA:
1. **Helpfulness**: Does the response fully address the user's request?
2. **Accuracy**: Is the information correct and free from errors?
3. **Clarity**: Is the response well-organized and easy to understand?
4. **Completeness**: Does it cover all relevant aspects of the query?
5. **Safety**: Is the response appropriate and safe?

INSTRUCTIONS:
- Read the conversation context carefully
- Compare Response A and Response B
- Determine which response better satisfies the user's needs
- Provide a brief justification (2-3 sentences)

The context for the reply was:
<<This is a test context>>


In [27]:
MAX_TOKENS = 8096
TEMPERATURE = 0.7

## 🏃‍♂️ Evaluation 1: Proprietary Model vs Golden Answer

Compare the proprietary model's output against the golden answer (edited_response) from the dataset.

In [None]:
# Evaluation 1: Proprietary model (generates) vs Golden answer (from data)
proprietary_model_config = {
    "model": PROPRIETARY_BASE_MODEL,
    "model_source": "external",
    "system_template": generation_system_template,
    "input_template": input_template,
    "max_tokens": MAX_TOKENS,
    "temperature": TEMPERATURE,
    "external_api_token": OPENAI_API_KEY
}

eval_proprietary_vs_golden = together_client.evaluation.create(
    type="compare",
    input_data_file_path=uploaded_eval_file.id,
    judge_model=JUDGE_MODEL,
    judge_model_source="serverless",
    judge_system_template=JUDGE_TEMPLATE,
    model_a=proprietary_model_config,
    model_b="golden_answer"
)

print(f"Eval 1 (Proprietary vs Golden) ID: {eval_proprietary_vs_golden.workflow_id}")
print(f"Status: {eval_proprietary_vs_golden.status}")

Eval 1 (Proprietary vs Golden) ID: eval-5afc-1766166945
Status: pending


## 🏃‍♂️ Evaluation 2: Base Model vs Golden Answer

Compare the base OSS model's output against the golden answer (edited_response) to establish a baseline.

In [48]:
# Evaluation 2: Base OSS model (generates) vs Golden answer (from data)
base_model_config = {
    "model": BASE_OSS_MODEL_FOR_INFERENCE,
    "model_source": "serverless",
    "system_template": generation_system_template,
    "input_template": input_template,
    "max_tokens": MAX_TOKENS,
    "temperature": TEMPERATURE
}

eval_base_vs_golden = together_client.evaluation.create(
    type="compare",
    input_data_file_path=uploaded_eval_file.id,
    judge_model=JUDGE_MODEL,
    judge_model_source="serverless",
    judge_system_template=JUDGE_TEMPLATE,
    model_a=base_model_config,
    model_b="golden_answer"
)

print(f"Eval 2 (Base vs Golden) ID: {eval_base_vs_golden.workflow_id}")
print(f"Status: {eval_base_vs_golden.status}")

Eval 2 (Base vs Golden) ID: eval-deac-1766168048
Status: pending


## 🏃‍♂️ Evaluation 3: Fine-tuned Model vs Golden Answer

Compare the fine-tuned model's output against the golden answer (edited_response) to measure fine-tuning effectiveness.

In [49]:
# You can customize which model you want to use
# This is an example of a fine-tuned model that is available on serverless
finetuned_model = "ivprov/Meta-Llama-3.1-8B-Instruct-Reference-helpsteer3-sft-d5865876"

In [50]:
# Evaluation 3: Fine-tuned model (generates) vs Golden answer (from data)
finetuned_model_config = {
    "model": finetuned_model,  # From fine-tuning job
    "model_source": "serverless",
    "system_template": generation_system_template,
    "input_template": input_template,
    "max_tokens": MAX_TOKENS,
    "temperature": TEMPERATURE
}

eval_finetuned_vs_golden = together_client.evaluation.create(
    type="compare",
    input_data_file_path=uploaded_eval_file.id,
    judge_model=JUDGE_MODEL,
    judge_model_source="serverless",
    judge_system_template=JUDGE_TEMPLATE,
    model_a=finetuned_model_config,
    model_b="golden_answer"
)

print(f"Eval 3 (Fine-tuned vs Golden) ID: {eval_finetuned_vs_golden.workflow_id}")
print(f"Status: {eval_finetuned_vs_golden.status}")


Eval 3 (Fine-tuned vs Golden) ID: eval-0535-1766168059
Status: pending


In [51]:
# Get status for all evaluations and wait until they have results
import time

while True:
    status_proprietary = together_client.evaluation.status(eval_proprietary_vs_golden.workflow_id)
    status_base = together_client.evaluation.status(eval_base_vs_golden.workflow_id)
    status_finetuned = together_client.evaluation.status(eval_finetuned_vs_golden.workflow_id)
    
    if (status_proprietary.results and status_base.results and status_finetuned.results):
        break
    
    print("Waiting for evaluations to complete...")
    time.sleep(10)

print("All evaluations completed!")
print(status_proprietary)
print(status_base)
print(status_finetuned)


All evaluations completed!
status=<EvaluationStatus.COMPLETED: 'completed'> results={'A_wins': 45, 'B_wins': 0, 'Ties': 5, 'generation_fail_count': 0, 'judge_fail_count': 0, 'result_file_id': 'file-edec6f11-f25d-4892-9fa6-8d7f3f655406'}
status=<EvaluationStatus.COMPLETED: 'completed'> results={'A_wins': 2, 'B_wins': 31, 'Ties': 17, 'generation_fail_count': 0, 'judge_fail_count': 0, 'result_file_id': 'file-30ce9ba7-460e-4ca3-9703-b0b640a33e10'}
status=<EvaluationStatus.COMPLETED: 'completed'> results={'A_wins': 1, 'B_wins': 37, 'Ties': 12, 'generation_fail_count': 0, 'judge_fail_count': 0, 'result_file_id': 'file-5f377571-d96b-40b9-9d49-2d9518c76002'}
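
The loop above waits indefinitely, which can hang a notebook if a job fails without ever producing results. A minimal sketch of a bounded wait follows. `wait_for` is a hypothetical helper, and `fetch` stands in for any zero-argument callable, such as a lambda wrapping `together_client.evaluation.status(...)`; the demo uses a fake status source so it runs without network access.

```python
import time


def wait_for(fetch, is_done, interval=10.0, timeout=1800.0):
    """Call `fetch()` until `is_done(result)` is true or `timeout` seconds pass.

    Returns the last fetched result; raises TimeoutError if the deadline is hit.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        time.sleep(interval)


# Stand-in for a status call: reports results on the third poll
_polls = iter([None, None, {"A_wins": 1}])
final = wait_for(lambda: next(_polls), is_done=lambda r: r is not None, interval=0.01)
print(final)
```

With the evaluations above, `is_done` could be `lambda s: bool(s.results)`, mirroring the condition used in the original loop.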


In [46]:
def print_comparison_summary(status, eval_name, model_a_name, model_b_name):
    """Print a summary of comparison results."""
    if not status.results:
        print(f"{eval_name}: Results not available yet")
        return
    
    results = status.results
    total = results.get('A_wins', 0) + results.get('B_wins', 0) + results.get('Ties', 0)
    if total == 0:
        print(f"{eval_name}: No results yet")
        return
    
    a_wins = results.get('A_wins', 0)
    b_wins = results.get('B_wins', 0)
    ties = results.get('Ties', 0)
    
    print(f"\n{'='*60}")
    print(f"{eval_name}")
    print(f"{'='*60}")
    print(f"Total samples: {total}")
    print(f"{model_a_name} wins: {a_wins} ({a_wins/total*100:.1f}%)")
    print(f"{model_b_name} wins: {b_wins} ({b_wins/total*100:.1f}%)")
    print(f"Ties: {ties} ({ties/total*100:.1f}%)")
    
    if a_wins > b_wins:
        print(f"✅ Winner: {model_a_name}")
    elif b_wins > a_wins:
        print(f"✅ Winner: {model_b_name}")
    else:
        print("🤝 It's a tie!")

# Display results summary using status from previous cell
print_comparison_summary(status_proprietary, "Eval 1: Proprietary vs Golden", "Proprietary Model", "Golden Answer")
print_comparison_summary(status_base, "Eval 2: Base vs Golden", "Base Model", "Golden Answer")
print_comparison_summary(status_finetuned, "Eval 3: Fine-tuned vs Golden", "Fine-tuned Model", "Golden Answer")


Eval 1: Proprietary vs Golden
Total samples: 50
Proprietary Model wins: 45 (90.0%)
Golden Answer wins: 0 (0.0%)
Ties: 5 (10.0%)
✅ Winner: Proprietary Model

Eval 2: Base vs Golden
Total samples: 50
Base Model wins: 2 (4.0%)
Golden Answer wins: 31 (62.0%)
Ties: 17 (34.0%)
✅ Winner: Golden Answer

Eval 3: Fine-tuned vs Golden
Total samples: 50
Fine-tuned Model wins: 1 (2.0%)
Golden Answer wins: 37 (74.0%)
Ties: 12 (24.0%)
✅ Winner: Golden Answer


### We can also download the file with the results and the LLM judge's feedback on each decision.

In [55]:
status_finetuned.results['result_file_id']

'file-5f377571-d96b-40b9-9d49-2d9518c76002'

In [58]:
# Download results from the comparison of the fine-tuned model against golden answer
OUTPUT_FILE_NAME = "./helpsteer3_finetuned_vs_golden_results.jsonl"
results_finetuned = together_client.files.retrieve_content(
    status_finetuned.results['result_file_id'],
    output=OUTPUT_FILE_NAME
)

Downloading file helpsteer3_finetuned_vs_golden_results.jsonl: 100%|██████████| 594k/594k [00:00<00:00, 3.23MB/s]


In [59]:
# Read the results file and print columns for the first line
import json

with open(OUTPUT_FILE_NAME, 'r') as f:
    first_line = json.loads(f.readline())
    print("Columns in the results file:")
    for key in first_line.keys():
        print(f"  - {key}")


Columns in the results file:
  - context_formatted
  - golden_answer
  - MODEL_TO_EVALUATE_OUTPUT_A
  - evaluation_successful
  - choice_original
  - judge_feedback_original_order
  - choice_flipped
  - judge_feedback_flipped_order
  - final_decision
  - is_invalid_judge_output
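
The column names suggest that each pair is judged in both presentation orders (`choice_original`, `choice_flipped`) before a `final_decision` is recorded. As a sketch, the per-row decisions in the downloaded JSONL file can be tallied with the standard library. `tally_field` is a hypothetical helper, and the decision labels in the demo are placeholders rather than the service's actual values.

```python
import json
from collections import Counter


def tally_field(lines, field="final_decision"):
    """Count the values of `field` across an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line).get(field)] += 1
    return counts


# Placeholder rows; real rows would come from the downloaded results file
demo_lines = [
    '{"final_decision": "A"}',
    '{"final_decision": "B"}',
    '{"final_decision": "B"}',
]
print(tally_field(demo_lines))
```

To run it on the real results, pass an open file handle, for example `with open(OUTPUT_FILE_NAME) as f: print(tally_field(f))`.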


In [61]:
# Print specific fields from the first row
print("\n" + "="*80)
print("FIRST ROW DETAILS")
print("="*80)
print("\n\n\n!!!!!!!!!!!!!! Context Formatted:")
print(first_line['context_formatted'])
print("\n\n\n!!!!!!!!!!!!!! MODEL_TO_EVALUATE_OUTPUT_A:")
print(first_line['MODEL_TO_EVALUATE_OUTPUT_A'])
print("\n\n\n!!!!!!!!!!!!!! Golden Answer:")
print(first_line['golden_answer'])
print("\n\n\n!!!!!!!!!!!!!! Judge Feedback (Original Order):")
print(first_line['judge_feedback_original_order'])



FIRST ROW DETAILS



!!!!!!!!!!!!!! Context Formatted:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Can you tell me what is wrong with this Go code:

package main

import (
	"fmt"
	"os"
)

type Page struct {
	Title string
	Body  []byte
}

func (p *Page) save() error {
	filename := p.Title + ".txt"
	return os.WriteFile(filename, p.Body, 0600)
}

func loadPage(title string) *Page {
	filename := title + ".txt"
	body, err := os.ReadFile(filename)
	if err != nil {
		return nil, err
	}
	return &Page{Title: title, Body: body}, nil
}

func main() {
	p1 := &Page{Title: "Test Page", Body: []byte("This is a sample page.")}
	p1.save()
	p2, _ := loadPage("TestPage")
	fmt.Println(string(p2.Body))
}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

There are a couple of issues with the provided Go code:

1. The `loadPage` function is declared to return a single `*P