# Comparing GTICL and TICL Implementations

This notebook provides a focused comparison of GTICL (GRPO Trial-Error-Explain In-Context Learning) and TICL (Trial-Error-Explain In-Context Learning) using:

1. The same dataset for training
2. The same tasks for inference
3. The same LLM provider

In [12]:
# Import necessary libraries
import os
import sys
import logging
import dotenv
import random
import time
from typing import List, Dict, Tuple

# Add the current working directory to the path so the local gticl and ticl packages can be imported
sys.path.append(os.path.abspath('.'))

# Import GTICL and TICL modules
from gticl.gticl_writing import GTICL, GTICLConfig
from gticl.llm_providers import OpenAIProvider
from gticl.dataset_utils import save_dataset
from ticl.ticl import TICL

# Load environment variables
dotenv.load_dotenv()

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger("GTICL-TICL-Comparison")

## Configure LLM Provider

Set up a single LLM provider to be used by both GTICL and TICL.

In [13]:
# Configure OpenAI provider with OpenRouter
llm_provider = OpenAIProvider(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY"),
    model_id=os.getenv("LLM_MODEL_ID")
)

print(f"Using model: {os.getenv('LLM_MODEL_ID')} via OpenRouter")

Using model: google/gemini-2.0-flash-001 via OpenRouter
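Both settings above are read with `os.getenv`, which returns `None` when a variable is unset, so a missing key only surfaces later as a failed API call. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the `gticl` package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of gticl or ticl.
    """
    value = os.getenv(name)
    if not value:
        # Treat both unset and empty-string values as missing
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could be called as `require_env("OPENROUTER_API_KEY")` and `require_env("LLM_MODEL_ID")` before constructing the provider.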


## Create Dataset

We'll create a single dataset of 12 casual email examples, each pairing a task prompt with a reference email, and use it for both GTICL and TICL.

In [14]:
# Create a dataset with casual email examples
casual_emails_dataset = [
    # Work-related emails
    {
        "task": "Write a casual email to Sam discussing yesterday's meeting where Alex proposed a risky new marketing push. Suggest meeting for coffee tomorrow after 2pm to discuss it further.",
        "reference_output": "Subject: thoughts on Friday's meeting?\n\nHey Sam,\n\nSooo... that meeting yesterday was interesting, huh? 😅 Not exactly what I expected!\n\nAnyway, I've been thinking about what Alex said about the new marketing push. Do you think we should go for it? Seems kinda risky to me, but then again, what do I know lol.\n\nWanna grab coffee tomorrow and chat about it? I'm free after 2.\n\nLater!\nJamie"
    },
    {
        "task": "Write an email to Taylor asking for feedback on a presentation you've been struggling with for 3 hours. Mention you need their help because they're better at this, and you'd appreciate feedback by Friday.",
        "reference_output": "Subject: can you help me with this?\n\nHey Taylor,\n\nI'm totally stuck on this presentation. Been staring at my screen for like 3 hours now! 🤦‍♀️\n\nYou're way better at this stuff than me... any chance you could take a quick look? Just need someone to tell me if it makes sense or if I sound like I'm rambling.\n\nNo rush - sometime before Friday would be amazing tho.\n\nThanks a million!\nAlex"
    },
    {
        "task": "Write an email apologizing for missing a scheduled call this morning because your alarm didn't go off (or you hit snooze too many times). Suggest rescheduling for tomorrow or Thursday after lunch.",
        "reference_output": "Subject: sry for missing the call\n\nHey,\n\nCrap, I totally missed our call this morning! My alarm didn't go off... or maybe I just hit snooze too many times? Either way, my bad.\n\nCan we reschedule? I'm free pretty much anytime tomorrow. Or Thursday after lunch maybe?\n\nLet me know what works for you.\n\nSorry again!\nRiley"
    },
    {
        "task": "Write an email to your team explaining that you'll miss the project deadline due to backend issues taking longer than expected. Suggest you need 1-2 more weeks and offer to discuss options on a call.",
        "reference_output": "Subject: sooo about that deadline...\n\nHey team,\n\nUgh, I hate to be the bearer of bad news, but we're gonna miss our target date. 😬 The backend issues are taking wayyy longer to fix than we thought.\n\nI'm thinking we need another week, maybe two? I know that's not what anyone wants to hear, but better to be realistic than promise something we can't deliver, right?\n\nLemme know your thoughts. We can hop on a quick call if you wanna discuss options.\n\nThanks for understanding (hopefully lol),\nJordan"
    },
    {
        "task": "Write an email to your boss requesting vacation days from May 15-19 because you're feeling burnt out. Assure them this is before the product launch and you'll have your work completed before leaving.",
        "reference_output": "Subject: vacation days next month??\n\nHey boss,\n\nSoooo I'm thinking about taking a little break before I completely lose my mind lol. 😅 Been feeling super burnt out lately.\n\nCould I take off May 15-19? That's the week before the big product launch, so everything should be pretty much wrapped up by then. I'll make sure all my stuff is done before I go, obvi.\n\nLet me know if that works or if you need me to shift the dates around!\n\nThanks a bunch,\nCasey"
    },
    {
        "task": "Write an email to the dev team about app issues you discovered during testing, where it crashes every 5 minutes on the checkout page. Mention you've tried troubleshooting with no success and need this fixed before tomorrow's launch.",
        "reference_output": "Subject: umm our app is kinda broken??\n\nHey dev team,\n\nSo I've been testing the app all morning and... yikes. 🙈 It's crashing like every 5 minutes on the checkout page.\n\nI tried all the usual stuff - cleared cache, restarted, even that weird thing with holding two buttons that Mark taught me lol. Nothing's working!\n\nCan someone take a look asap? Customers are gonna freeeeak if this goes live tomorrow.\n\nThanks for saving the day (again),\nJamie"
    },
    
    # Social/team emails
    {
        "task": "Write an email to your overworked team suggesting a happy hour this Thursday at the new place on 5th street that has deals until 7pm. Offer to buy the first round and make it clear there's no pressure to attend.",
        "reference_output": "Subject: drinks this week??\n\nHey everyone!\n\nSooo we've all been working crazy hours lately, and I think we deserve a break! 🍹\n\nAnyone up for happy hour this Thursday? That new place on 5th has amazing deals till 7pm. First round's on me (don't tell accounting lol).\n\nDrop a quick reply if you can make it. No pressure if you can't, but it'd be awesome to hang with you all outside this crazy office!\n\nCheers!\nCasey"
    },
    {
        "task": "Write an email to your team about Sarah's birthday next Friday. Suggest decorating her desk before she arrives (she's usually late), having lunch at her favorite Thai restaurant, and chipping in for a plant as a gift.",
        "reference_output": "Subject: sarah's bday surprise!!\n\nHeyyy team,\n\nSooo it's Sarah's birthday next Friday and we HAVE to do something fun! 🎂\n\nI'm thinking we could decorate her desk before she gets in (she always comes in late lol) and maybe grab lunch at that Thai place she loves?\n\nAnyone wanna chip in for a gift? I found this super cute plant for her desk - she's always talking about wanting more green stuff around.\n\nLet me know if you're in! And shhh... don't say anything to her!\n\nThx!\nTaylor"
    },
    {
        "task": "Write an email suggesting a zombie-themed escape room downtown for the next team-building activity, followed by dinner. Mention that everyone can bring a plus-one and you need responses by Tuesday to make the booking for next Friday after work.",
        "reference_output": "Subject: team bonding idea!!\n\nHey all,\n\nOk so I had this random thought last night... what if we did an escape room as our next team thing?? 🔎\n\nThere's this new place downtown that's supposed to be AMAZING. They have this zombie apocalypse theme that sounds super fun (and kinda terrifying tbh).\n\nWe could go next Friday after work and grab dinner after? Everyone can bring their plus-one if they want.\n\nWho's in?? Lemme know by Tuesday so I can book it!\n\nCan't wait!\nJordan"
    },
    
    # Personal/external emails
    {
        "task": "Write an excited email to a friend announcing your trip to their area from May 10-15. Ask to try the brunch place they posted about recently, and request to stay at their place for a few nights since hotels are expensive. Promise to bring their favorite cookies.",
        "reference_output": "Subject: my trip next month!!\n\nHeyyyy!\n\nOMG I just booked my flights for next month and I'm soooo excited! 🛫 Gonna be in your area May 10-15!\n\nPlease tell me you're free to hang?? I've been dying to try that brunch place you posted about last week - those pancakes looked INSANE.\n\nAlso, any chance I could crash at your place for a couple nights? Hotels are crazy expensive rn. Promise I'm still a good roomie lol. I'll bring those cookies you love!\n\nCan't wait to see youuuu!\nRiley"
    },
    {
        "task": "Write an email apologizing for completely forgetting your friend's birthday yesterday because your phone was dead all day. Offer to make it up with dinner at the new Italian restaurant next week.",
        "reference_output": "Subject: i'm the WORST friend ever 😭\n\nHey you,\n\nOmg I just realized I completely missed your birthday yesterday and I feel TERRIBLE!!! 😩 My phone was dead all day (classic me, right?) and I totally spaced.\n\nYou know I'd never forget on purpose, right? You're literally my favorite human!\n\nCan I make it up to you with dinner next week? That new Italian place just opened and I heard their pasta is to dieeee for.\n\nPretty please forgive me? 🙏\n\nLove youuuu!\nTaylor"
    },
    {
        "task": "Write a polite complaint email to a local store about finding bacon in your veggie sandwich that you purchased yesterday around 3pm. Mention you'd like to return despite this issue since people say their food is amazing.",
        "reference_output": "Subject: issue with my order yesterday\n\nHi there,\n\nSooo I was in your store yesterday around 3pm and had kind of a weird experience? 🙃\n\nI ordered the veggie sandwich but when I got home and unwrapped it... definitely not vegetarian! There was bacon all over it lol. I mean, bacon's great but not when you're not expecting it!\n\nNot sure if you guys were super busy or what, but thought you should know! I'd love to come back and try again since everyone says your food is amazing.\n\nThanks for listening to my little rant haha!\n\nAll the best,\nAlex"
    }
]

# Save dataset to a file
dataset_path = "dataset/comparison_dataset.json"
save_dataset(casual_emails_dataset, dataset_path)
print(f"Dataset with {len(casual_emails_dataset)} examples saved to {dataset_path}")

Dataset with 12 examples saved to dataset/comparison_dataset.json
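Each entry in the dataset is a dict with a `task` and a `reference_output`. Since both frameworks consume the same list, a quick structural check before loading can catch a missing or empty field early. The sketch below is an illustration only; `validate_dataset` is a hypothetical helper, and the two-item `example` list is stand-in data rather than the real dataset.

```python
from typing import Dict, List

REQUIRED_KEYS = ("task", "reference_output")


def validate_dataset(dataset: List[Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    problems = []
    for i, sample in enumerate(dataset):
        for key in REQUIRED_KEYS:
            value = sample.get(key)
            # Each required field must be a non-empty string
            if not isinstance(value, str) or not value.strip():
                problems.append(f"sample {i}: missing or empty '{key}'")
    return problems


# Stand-in data: the second entry has an empty task on purpose
example = [
    {"task": "Write a casual email to Sam.", "reference_output": "Hey Sam,\n..."},
    {"task": "", "reference_output": "Hey,"},
]
print(validate_dataset(example))
```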


## Initialize GTICL and TICL

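Configure both frameworks with matching `iterations`, `examples_per_prompt`, and `capacity` for a fair comparison. GTICL additionally samples `k_candidates=3` candidates per input, a setting TICL does not take.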

In [15]:
# Configure GTICL
gticl_config = GTICLConfig(
    k_candidates=3,           # Generate 3 candidates per input
    capacity=12,              # Store up to 12 negative samples per task
    iterations=2,             # Run 2 refinement iterations
    use_rewrite_update=True,  # Enable rewrite refinement
    examples_per_prompt=2     # Use 2 examples per prompt
)

# Initialize GTICL
gticl = GTICL(llm_provider=llm_provider, config=gticl_config)

# Initialize TICL with matching parameters
ticl = TICL(
    llm_provider=llm_provider, 
    iterations=2,            # Match GTICL's iterations
    examples_per_prompt=2,   # Match GTICL's examples_per_prompt
    capacity=12               # Match GTICL's capacity
)

# Load the dataset into both
gticl.load_dataset(dataset=casual_emails_dataset)
ticl.load_dataset(casual_emails_dataset)

print("Both GTICL and TICL initialized with the same dataset and parameters")

INFO:gticl.gticl_writing:Loaded 12 samples into GTICL dataset
INFO:ticl.ticl:Loaded 12 samples into TICL dataset


Both GTICL and TICL initialized with the same dataset and parameters


## Run Training

Train both frameworks to collect negative samples and improve their understanding of the writing style.

In [16]:
# Helper that runs a callable and returns (result, elapsed wall-clock seconds)
def time_execution(fn, *args, **kwargs):
    start_time = time.perf_counter()  # Monotonic clock, suited to measuring intervals
    result = fn(*args, **kwargs)
    execution_time = time.perf_counter() - start_time
    return result, execution_time

# Run GTICL alignment
print("\n--- Running GTICL Alignment ---")
gticl_results, gticl_time = time_execution(
    gticl.run_alignment  # run_alignment takes no task argument
)

# Run TICL
print("\n--- Running TICL Training ---")
ticl_results, ticl_time = time_execution(ticl.run)

# Compare training results
gticl_negative_samples = sum(len(samples) for samples in gticl.negative_samples_store.values())
ticl_negative_samples = sum(len(sample.generated_outputs) for sample in ticl.dataset)

print("\n--- Training Results ---")
print(f"GTICL Training Time: {gticl_time:.2f} seconds")
print(f"TICL Training Time: {ticl_time:.2f} seconds")
print(f"GTICL Negative Samples Collected: {gticl_negative_samples}")
print(f"TICL Negative Samples Collected: {ticl_negative_samples}")

INFO:gticl.gticl_writing:Starting iteration 1/2
INFO:gticl.gticl_writing:Processing sample 1/12



--- Running GTICL Alignment ---


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:gticl.gticl_writing:Candidate 1/3: Subject: that meeting yesterday...

Hey Sam,

What was up with Alex's idea yesterday?? A risky marketing push?? Seriously?

I was thinking we should chat more about it. Wanna grab coffee tomorrow? I'm free after 2pm.

Let me know!
Jamie
INFO:gticl.gticl_writing:Candidate 2/3: Subject: that meeting yday...

Hey Sam,

What do you think about Alex's crazy marketing idea from yesterday? Seems kinda risky, right?

I was wondering if you had time to grab coffee tomorrow and chat about it? Maybe after 2pm?

Let me know if you're free.

Talk soon,
[Your Name]
INFO:gticl.gticl_writing:Candidate 3/3: Subject: that meeting yesterday...

Hey Sam,

What was up with Alex's idea yeste


--- Running TICL Training ---


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ticl.ticl:Adding negative sample for sample 1
INFO:ticl.ticl:Processing sample 2/12
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ticl.ticl:Adding negative sample for sample 2
INFO:ticl.ticl:Processing sample 3/12
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ticl.ticl:Adding negative sample for sample 3
INFO:ticl.ticl:Processing sample 4/12
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ticl.ticl:Adding negative sample for sample 4
INFO:ticl.ticl:Processin


--- Training Results ---
GTICL Training Time: 184.75 seconds
TICL Training Time: 77.93 seconds
GTICL Negative Samples Collected: 24
TICL Negative Samples Collected: 24
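Both runs collected 24 negative samples, so the timings can be compared directly. The short sketch below derives the relative cost from the figures printed above. The numbers are copied from this run's output; wall-clock time includes network latency to the provider, so treat the result as a rough indicator rather than a benchmark.

```python
# Figures copied from the training results printed above
gticl_time = 184.75   # seconds
ticl_time = 77.93     # seconds
negative_samples = 24  # collected by each framework in this run


def seconds_per_sample(total_seconds: float, n_samples: int) -> float:
    """Average wall-clock seconds spent per collected negative sample."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return total_seconds / n_samples


ratio = gticl_time / ticl_time
print(f"GTICL took {ratio:.2f}x as long as TICL")
print(f"GTICL: {seconds_per_sample(gticl_time, negative_samples):.2f} s per negative sample")
print(f"TICL:  {seconds_per_sample(ticl_time, negative_samples):.2f} s per negative sample")
```

In this run GTICL took roughly 2.4 times as long as TICL, which is consistent with GTICL generating three candidates per input.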


## Compare Negative Samples Storage

Let's examine how GTICL and TICL store their negative samples after training.

In [17]:
import pprint
from tabulate import tabulate

pp = pprint.PrettyPrinter(indent=2, width=120)

print("\n===== GTICL Storage Details =====\n")
print(f"Number of tasks with negative samples: {len(gticl.negative_samples_store)}")

for task, samples in gticl.negative_samples_store.items():
    print(f"\nTask: {task}")
    print(f"Number of negative samples: {len(samples)}")
    
    data = []
    for i, sample in enumerate(samples):
        # Format candidate and explanation for better readability
        candidate = sample['candidate'].replace('\n', '\n  ')[:100] + "..." if len(sample['candidate']) > 100 else sample['candidate']
        explanation = sample['explanation'].replace('\n', '\n  ')[:100] + "..." if len(sample['explanation']) > 100 else sample['explanation']
        
        data.append([f"Sample {i+1}", candidate, explanation])
    
    print(tabulate(data, headers=["ID", "Candidate (truncated)", "Explanation (truncated)"], tablefmt="grid"))
    
    # Show discarded samples summary if any
    if gticl.negative_samples_summary.get(task):
        print("\nDiscarded Samples Summary:")
        print(gticl.negative_samples_summary[task])

print("\n\n===== TICL Storage Details =====\n")
print(f"Number of samples with negative examples: {sum(1 for sample in ticl.dataset if len(sample.generated_outputs) > 0)}")

for i, sample in enumerate(ticl.dataset):
    if len(sample.generated_outputs) > 0:
        print(f"\nSample {i+1}: {sample.task[:50]}...")
        print(f"Number of negative outputs: {len(sample.generated_outputs)}")
        
        data = []
        for j, neg in enumerate(sample.generated_outputs):
            # Format output and explanation for better readability
            output = neg.output.replace('\n', '\n  ')[:100] + "..." if len(neg.output) > 100 else neg.output
            explanation = neg.explanation.replace('\n', '\n  ')[:100] + "..." if len(neg.explanation) > 100 else neg.explanation
            
            data.append([f"Negative {j+1}", output, explanation])
        
        print(tabulate(data, headers=["ID", "Output (truncated)", "Explanation (truncated)"], tablefmt="grid"))

# Show discarded samples for TICL if any
if hasattr(ticl, "discarded_samples") and ticl.discarded_samples:
    print("\n\nTICL Discarded Samples:")
    for task, samples in ticl.discarded_samples.items():
        print(f"\nTask: {task[:50]}...")
        print(f"Number of discarded samples: {len(samples)}")
        pp.pprint(samples)

print("\n\n===== Storage Comparison =====\n")
print(f"GTICL total negative samples: {sum(len(samples) for samples in gticl.negative_samples_store.values())}")
print(f"TICL total negative samples: {sum(len(sample.generated_outputs) for sample in ticl.dataset)}")


===== GTICL Storage Details =====

Number of tasks with negative samples: 12

Task: Write a casual email to Sam discussing yesterday's meeting where Alex proposed a risky new marketing push. Suggest meeting for coffee tomorrow after 2pm to discuss it further.
Number of negative samples: 2
+----------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| ID       | Candidate (truncated)                                    | Explanation (truncated)                                                                                 |
| Sample 1 | Subject: that meeting yday...                            | The sign-off "Talk soon, [Your Name]" should be removed, and the question "Seems kinda risky, right?... |
|          |                                                          |                                                                                                        
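The inspection cell repeats the same truncate-and-indent expression four times. A small helper can express it once. This is a sketch, not code from either package: `truncate` is a hypothetical name, and unlike the inline expression it truncates before indenting, so the 100-character limit applies to the original text rather than to the indented text.

```python
def truncate(text: str, limit: int = 100, indent: str = "  ") -> str:
    """Shorten text to `limit` characters (adding '...') and indent continuation lines."""
    shortened = text if len(text) <= limit else text[:limit] + "..."
    # Indent every line after the first so multi-line text stays aligned in a table cell
    return shortened.replace("\n", "\n" + indent)


print(truncate("Subject: hi\n\nHey Sam,"))
print(truncate("x" * 150))
```

With this helper, a row could be built as `[f"Sample {i+1}", truncate(sample['candidate']), truncate(sample['explanation'])]`.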

## Official Evaluation Test

Use the specified prompt template to evaluate which model's generations are more similar to the author's writing style.
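The evaluation function defined later in this notebook accepts a `randomize_order` flag. One common reason to randomize is to keep a judge from favoring whichever output it sees first. The sketch below shows one way to do that under stated assumptions: the labels "A" and "B" and the helper name `assign_positions` are hypothetical and are not taken from the notebook's prompt template.

```python
import random
from typing import Dict, Optional, Tuple


def assign_positions(
    gticl_output: str,
    ticl_output: str,
    rng: Optional[random.Random] = None,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle the two outputs into slots A and B.

    Returns (texts, sources): texts maps each label to the email shown to the judge,
    sources maps each label back to the framework that produced it.
    """
    rng = rng or random.Random()
    pairs = [("GTICL", gticl_output), ("TICL", ticl_output)]
    rng.shuffle(pairs)  # In-place shuffle decides which framework lands in slot A
    labels = ["A", "B"]
    texts = {label: text for label, (_, text) in zip(labels, pairs)}
    sources = {label: name for label, (name, _) in zip(labels, pairs)}
    return texts, sources


texts, sources = assign_positions("email from gticl", "email from ticl", random.Random(0))
print(sources)
```

Keeping the `sources` mapping separate from the text shown to the judge lets the verdict ("A" or "B") be translated back to a framework afterwards.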

In [19]:
# Define test tasks for the official evaluation
official_eval_tasks = [
    {
        "name": "Project Update",
        "task": "Write an email to update the team on current project status, highlight challenges, and request input on next steps."
    },
    {
        "name": "Office Move",
        "task": "Write an email to inform everyone about our upcoming office move to downtown next month."
    },
    {
        "name": "New Team Member",
        "task": "Write an email to introduce Sarah, our new UX designer who's starting on Monday."
    },
    {
        "name": "Team Outing",
        "task": "Write an email to gather suggestions for our next team outing and schedule a planning meeting."
    },
    {
        "name": "Apology for Birthday",
        "task": "Write an email apologizing for completely forgetting your friend's birthday yesterday because your phone was dead all day. Offer to make it up with dinner at the new Italian restaurant next week.",
    },
    {
        "name": "Vegetarian Sandwich Complaint",
        "task": "Write a polite complaint email to a local store about finding bacon in your veggie sandwich that you purchased yesterday around 3pm. Mention you'd like to return despite this issue since people say their food is amazing.",
    },
    {
        "name": "Vacation Request",
        "task": "Write an email to your boss requesting vacation days from May 15-19 because you're feeling burnt out. Assure them this is before the product launch and you'll have your work completed before leaving.",
    },
    {
        "name": "Feedback Request",
        "task": "Write an email to your mentor asking for feedback on your recent project submission and any areas for improvement."
    },
    {
        "name": "Meeting Follow-Up",
        "task": "Write an email to follow up on the meeting with the client yesterday, summarize key points, and confirm next steps."
    },
    {
        "name": "Project Closure",
        "task": "Write an email to the team announcing the closure of the project, thanking everyone for their hard work, and discussing next steps for future projects."
    },
    {
        "name": "Task Review",
        "task": "Write an email to the project manager suggesting a review of all tasks completed in the last quarter to assess performance and set new goals."
    },
    {
        "name": "Training Session Request",
        "task": "Write an email to HR requesting a training session on effective communication skills for all team members."
    },
    {
        "name": "Weekly Status Update",
        "task": "Write an email to your team providing a status update on ongoing projects and any upcoming deadlines."
    },
    {
        "name": "Dentist Support",
        "task": "Write an email to your colleague offering support as they're nervous about their dentist appointment tomorrow."
    },
    {
        "name": "Pickup Boyfriend's Suit",
        "task": "Write an email to your boyfriend reminding him to pick up his suit from the dry cleaners before the wedding this weekend."
    },
    {
        "name": "Last Minute Gift Reminder",
        "task": "Write an email to your friend reminding them to get a last-minute gift for the wedding."
    },
    {
        "name": "Post-Wedding Thank You",
        "task": "Write an email to guests thanking them for attending the wedding and for their gifts."
    },
    {
        "name": "Birthday Party Invitation",
        "task": "Write an email inviting friends to your birthday party next weekend at the new karaoke place downtown."
    },
    {
        "name": "Dinner Plans",
        "task": "Write an email to your friend confirming dinner plans for tonight at the new Italian restaurant."
    },
    {
        "name": "Movie Night",
        "task": "Write an email to your friends inviting them to a movie night at your place this Saturday."
    },
    {
        "name": "Book Club Meeting",
        "task": "Write an email to book club members reminding them of the meeting next week and suggesting discussion topics."
    },
    {
        "name": "Game Night",
        "task": "Write an email to friends inviting them to a game night at your place this Friday."
    },
    {
        "name": "Coffee Date",
        "task": "Write an email to your crush asking them out for coffee this weekend."
    },
    {
        "name": "Networking Event Invitation",
        "task": "Write an email inviting colleagues to an upcoming networking event next month."
    }
]

print(f"Created {len(official_eval_tasks)} tasks for official evaluation")

Created 24 tasks for official evaluation
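Three of the 24 evaluation tasks (Apology for Birthday, Vegetarian Sandwich Complaint, Vacation Request) reuse task prompts that also appear verbatim in the training dataset, so both frameworks have already seen reference emails for them. A sketch for flagging such overlap is shown below; `find_overlapping_tasks` is a hypothetical helper, and the short lists are stand-in data rather than the real task lists.

```python
from typing import Dict, List


def find_overlapping_tasks(
    train: List[Dict[str, str]],
    eval_tasks: List[Dict[str, str]],
) -> List[str]:
    """Return the names of evaluation tasks whose prompt also appears in the training data."""
    seen = {sample["task"].strip() for sample in train}
    return [t["name"] for t in eval_tasks if t["task"].strip() in seen]


# Stand-in data for illustration
train_example = [{"task": "Write an email requesting vacation days.", "reference_output": "..."}]
eval_example = [
    {"name": "Vacation Request", "task": "Write an email requesting vacation days."},
    {"name": "Office Move", "task": "Write an email about the office move."},
]
print(find_overlapping_tasks(train_example, eval_example))
```

Run against the real lists, the call would be `find_overlapping_tasks(casual_emails_dataset, official_eval_tasks)`; results on overlapping tasks may be worth reporting separately from results on unseen tasks.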


In [20]:
# Generate outputs from GTICL and TICL for each evaluation task
def generate_outputs_for_evaluation(task_info):
    name = task_info["name"]
    task = task_info["task"]
    
    print(f"\n--- Generating outputs for: {name} ---")
    print(f"Task: {task}")
    
    # Generate with GTICL
    gticl_result = gticl.generate_with_alignment(
        task=task
    )
    gticl_output = gticl_result["output"]
    
    # Generate with TICL
    ticl_output = ticl.get_final_output(task)
    
    return {
        "name": name,
        "task": task,
        "gticl_output": gticl_output,
        "ticl_output": ticl_output
    }

# Generate outputs for all evaluation tasks
evaluation_outputs = [generate_outputs_for_evaluation(task) for task in official_eval_tasks]
print(f"Generated outputs for {len(evaluation_outputs)} evaluation tasks")


--- Generating outputs for: Project Update ---
Task: Write an email to update the team on current project status, highlight challenges, and request input on next steps.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Office Move ---
Task: Write an email to inform everyone about our upcoming office move to downtown next month.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: New Team Member ---
Task: Write an email to introduce Sarah, our new UX designer who's starting on Monday.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Team Outing ---
Task: Write an email to gather suggestions for our next team outing and schedule a planning meeting.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Apology for Birthday ---
Task: Write an email apologizing for completely forgetting your friend's birthday yesterday because your phone was dead all day. Offer to make it up with dinner at the new Italian restaurant next week.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Vegetarian Sandwich Complaint ---
Task: Write a polite complaint email to a local store about finding bacon in your veggie sandwich that you purchased yesterday around 3pm. Mention you'd like to return despite this issue since people say their food is amazing.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Vacation Request ---
Task: Write an email to your boss requesting vacation days from May 15-19 because you're feeling burnt out. Assure them this is before the product launch and you'll have your work completed before leaving.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Feedback Request ---
Task: Write an email to your mentor asking for feedback on your recent project submission and any areas for improvement.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Meeting Follow-Up ---
Task: Write an email to follow up on the meeting with the client yesterday, summarize key points, and confirm next steps.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Project Closure ---
Task: Write an email to the team announcing the closure of the project, thanking everyone for their hard work, and discussing next steps for future projects.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Task Review ---
Task: Write an email to the project manager suggesting a review of all tasks completed in the last quarter to assess performance and set new goals.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Training Session Request ---
Task: Write an email to HR requesting a training session on effective communication skills for all team members.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Weekly Status Update ---
Task: Write an email to your team providing a status update on ongoing projects and any upcoming deadlines.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Dentist Support ---
Task: Write an email to your colleague offering support as they're nervous about their dentist appointment tomorrow.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Pickup Boyfriend's Suit ---
Task: Write an email to your boyfriend reminding him to pick up his suit from the dry cleaners before the wedding this weekend.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Last Minute Gift Reminder ---
Task: Write an email to your friend reminding them to get a last-minute gift for the wedding.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Post-Wedding Thank You ---
Task: Write an email to guests thanking them for attending the wedding and for their gifts.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Birthday Party Invitation ---
Task: Write an email inviting friends to your birthday party next weekend at the new karaoke place downtown.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Dinner Plans ---
Task: Write an email to your friend confirming dinner plans for tonight at the new Italian restaurant.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Movie Night ---
Task: Write an email to your friends inviting them to a movie night at your place this Saturday.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Book Club Meeting ---
Task: Write an email to book club members reminding them of the meeting next week and suggesting discussion topics.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Game Night ---
Task: Write an email to friends inviting them to a game night at your place this Friday.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Coffee Date ---
Task: Write an email to your crush asking them out for coffee this weekend.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"



--- Generating outputs for: Networking Event Invitation ---
Task: Write an email inviting colleagues to an upcoming networking event next month.


INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"


Generated outputs for 24 evaluation tasks


In [21]:
# Function to run the official evaluation using the specified template
def run_official_evaluation(task_output, randomize_order=True):
    # Format author's writing examples from our dataset
    author_examples = "\n\n".join([
        f"EXAMPLE {i+1}: {sample['reference_output']}" 
        for i, sample in enumerate(casual_emails_dataset)
    ])
    
    # Determine which output is A and which is B (randomize to prevent bias)
    if randomize_order and random.random() < 0.5:
        option_a = task_output["ticl_output"]
        option_b = task_output["gticl_output"]
        key_a, key_b = "ticl", "gticl"
    else:
        option_a = task_output["gticl_output"]
        option_b = task_output["ticl_output"]
        key_a, key_b = "gticl", "ticl"
    
    # Format the official evaluation prompt
    evaluation_prompt = f"""You are an impartial evaluator of style similarity. Below are samples of an author's writing and two options.

Author's Writing:
{author_examples}

Option A:
{option_a}

Option B:
{option_b}

Task
Which option is more likely to have been written by the author based on style similarity to the samples given as AUTHOR'S WRITING above? Consider each option's similarity with regards to the (1) length, (2) format, (3) paragraph structure, (4) sentence structure, (5) punctuation, (6) syntax, (7) voice, and (8) diction of the author's writing, but NOT the content it covers. If one option has incoherent/odd text or formatting (e.g., random dashes, repetitive text, random signatures, etc.) that is not present in the author's writing while the other doesn't, it should be considered less similar.

{{
  "explanation": {{
    "length": "<Explanation>",
    "format": "<Explanation>",
    "paragraph structure": "<Explanation>",
    "sentence structure": "<Explanation>",
    "punctuation": "<Explanation>",
    "syntax": "<Explanation>",
    "voice": "<Explanation>",
    "diction": "<Explanation>",
    "odd incoherent text/formatting": "<Explanation>"
  }},
  "answer": "<The option more similar to the AUTHOR'S WRITING; either A or B>"
}}

ALWAYS REMAIN IMPARTIAL WHEN EVALUATING OUTPUTS AND PENALIZE ODD INCOHERENT TEXT."""
    
    # Send to LLM for evaluation
    messages = [
        {"role": "system", "content": "You are an impartial evaluator of writing style similarity."},
        {"role": "user", "content": evaluation_prompt}
    ]
    
    response_format = {"type": "json_object"}
    evaluation = llm_provider.generate(messages=messages, response_format=response_format)
    
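    # Process response (could be string or dict depending on provider implementation)
    if isinstance(evaluation, str):
        import json  # imported before the try so the except clause can resolve json.JSONDecodeError
        try:
            evaluation = json.loads(evaluation)
        except json.JSONDecodeError:
            logger.error(f"Failed to parse evaluation response: {evaluation[:100]}...")
            # Return the same shape as the success path so callers can still read
            # result['name'] and result['evaluation']['error']
            return {
                **task_output,
                "evaluation": {
                    "error": "Invalid JSON response",
                    "raw_response": evaluation
                },
                "a_was": key_a,
                "b_was": key_b
            }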
    
    # Map the answer back to actual model names
    if isinstance(evaluation, dict) and "answer" in evaluation:
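        # Coerce to str and upper-case so answers like "a" or a non-string value don't break the mapping
        winner = str(evaluation["answer"]).strip().upper()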
        if winner == "A":
            evaluation["winner"] = key_a
            evaluation["winner_output"] = option_a
            evaluation["loser"] = key_b
            evaluation["loser_output"] = option_b
        elif winner == "B":
            evaluation["winner"] = key_b
            evaluation["winner_output"] = option_b
            evaluation["loser"] = key_a
            evaluation["loser_output"] = option_a
        else:
            evaluation["error"] = f"Unexpected answer: {winner}"
    else:
        evaluation = {
            "error": "Missing 'answer' in response",
            "raw_response": evaluation
        }
    
    return {
        **task_output,  # Include original task info
        "evaluation": evaluation,
        "a_was": key_a,
        "b_was": key_b
    }

# Run the evaluation for each task
official_evaluations = [run_official_evaluation(output) for output in evaluation_outputs]

# Print evaluation results
print("\n=== Official Evaluation Results ===\n")
for result in official_evaluations:
    print(f"Task: {result['name']}")
    if "error" in result.get("evaluation", {}):
        print(f"Error: {result['evaluation']['error']}")
    else:
        winner = result["evaluation"].get("winner", "unknown")
        print(f"Winner: {winner.upper()}")
        if "explanation" in result["evaluation"]:
            print("Key differences:")
            for key, explanation in result["evaluation"]["explanation"].items():
                if len(explanation) > 100:
                    explanation = explanation[:97] + "..."
                print(f"  - {key}: {explanation}")
    print()

INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/c


=== Official Evaluation Results ===

Task: Project Update
Winner: GTICL
Key differences:
  - length: Both options are similar in length to the provided examples.
  - format: Both options follow a similar email format with a subject line, greeting, body, and closing.
  - paragraph structure: Option B's paragraph structure is more similar to the examples, with a slightly longer first para...
  - sentence structure: Option B has sentence structures that are closer to the author's writing, including more elaborat...
  - punctuation: Both options use similar punctuation, including emojis and exclamation points, but option B uses ...
  - syntax: Option B's syntax is more conversational and less formal, aligning better with the provided examp...
  - voice: Option B has a more casual and collaborative voice, similar to the author's tone.
  - diction: Option B uses more specific and relatable language ("running into some issues with integrating th...
  - odd incoherent text/formatting: Option 
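
The evaluation function above only accepts an answer of exactly `A` or `B` and records anything else as an error. A judge model may phrase the same verdict as `Option A` or `B.`, so a small normalizer can reduce avoidable errors. The sketch below is illustrative only: `normalize_answer` is a hypothetical helper, not part of GTICL or TICL, and the set of accepted phrasings is an assumption.

```python
import re
from typing import Optional


def normalize_answer(raw: object) -> Optional[str]:
    """Map a judge answer such as 'A', 'b', 'Option A', or 'B.' to 'A' or 'B'.

    Returns None when the answer cannot be mapped unambiguously.
    (Hypothetical helper; the accepted phrasings are an assumption.)
    """
    text = str(raw).strip().upper()
    # Accept an optional "OPTION " prefix and an optional trailing "." or ")"
    match = re.fullmatch(r"(?:OPTION\s+)?([AB])[.)]?", text)
    return match.group(1) if match else None
```

Inside `run_official_evaluation`, the result of `normalize_answer(evaluation["answer"])` could replace the stripped answer, with `None` routed to the existing "Unexpected answer" branch.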

In [22]:
print(official_evaluations)

[{'name': 'Project Update', 'task': 'Write an email to update the team on current project status, highlight challenges, and request input on next steps.', 'gticl_output': "Subject: Project Update - Need your thoughts!\n\nHey Team,\n\nJust wanted to give a quick update on where we're at with the project. Things are moving along, but we've hit a couple snags. 😬\n\nSpecifically, we're running into some issues with integrating the new software. It's proving to be a bigger challenge than we initially thought!\n\nAny brilliant ideas on how to tackle this? Maybe we can hop on a quick call this week to brainstorm next steps? Let me know when you're free!\n\nThanks!\n[Your Name]", 'ticl_output': "```\nSubject: project update + need your brains!\n\nHey team,\n\nQuick update on where we're at with the project. We're mostly on track, but hitting a few snags. 😬\n\nBiggest challenge rn is [insert challenge here]. Anyone have thoughts on how to tackle this?\n\nAlso, need input on next steps for [inse

In [None]:
# Summarize the official evaluation results
def summarize_evaluations(evaluations):
    total = len(evaluations)
    valid_evals = [e for e in evaluations if "error" not in e.get("evaluation", {})]
    valid_count = len(valid_evals)
    
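    if valid_count == 0:
        # Keep the full summary shape so callers that index into it don't raise KeyError
        return {
            "error": "No valid evaluations found",
            "total_tasks": total, "valid_evaluations": 0,
            "gticl_wins": 0, "ticl_wins": 0,
            "gticl_win_percentage": 0, "ticl_win_percentage": 0,
            "categories": {}
        }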
    
    # Count wins by framework
    gticl_wins = sum(1 for e in valid_evals if e["evaluation"].get("winner") == "gticl")
    ticl_wins = sum(1 for e in valid_evals if e["evaluation"].get("winner") == "ticl")
    
    # Calculate percentages
    gticl_win_pct = (gticl_wins / valid_count) * 100 if valid_count > 0 else 0
    ticl_win_pct = (ticl_wins / valid_count) * 100 if valid_count > 0 else 0
    
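    # Calculate by category (assuming 10 tasks per category, in their original order).
    # Slice the full list first and filter afterwards, so a failed evaluation
    # doesn't shift later tasks into the wrong category.
    def _valid(evals):
        return [e for e in evals if "error" not in e.get("evaluation", {})]

    categories = {
        "Work-related": _valid(evaluations[:10]),
        "Social/team": _valid(evaluations[10:20]),
        "Personal/external": _valid(evaluations[20:30])
    }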
    
    category_results = {}
    
    for category_name, category_evals in categories.items():
        if not category_evals:
            continue
            
        cat_total = len(category_evals)
        cat_gticl_wins = sum(1 for e in category_evals if e["evaluation"].get("winner") == "gticl")
        cat_ticl_wins = sum(1 for e in category_evals if e["evaluation"].get("winner") == "ticl")
        
        cat_gticl_pct = (cat_gticl_wins / cat_total) * 100 if cat_total > 0 else 0
        cat_ticl_pct = (cat_ticl_wins / cat_total) * 100 if cat_total > 0 else 0
        
        category_results[category_name] = {
            "total": cat_total,
            "gticl_wins": cat_gticl_wins,
            "ticl_wins": cat_ticl_wins,
            "gticl_win_percentage": cat_gticl_pct,
            "ticl_win_percentage": cat_ticl_pct
        }
    
    # Create detailed report
    summary = {
        "total_tasks": total,
        "valid_evaluations": valid_count,
        "gticl_wins": gticl_wins,
        "ticl_wins": ticl_wins,
        "gticl_win_percentage": gticl_win_pct,
        "ticl_win_percentage": ticl_win_pct,
        "categories": category_results
    }
    
    return summary

# Generate and display summary
eval_summary = summarize_evaluations(official_evaluations)

print("\n===== Official Evaluation Summary =====\n")
print(f"Total tasks evaluated: {eval_summary['total_tasks']}")
print(f"Valid evaluations: {eval_summary['valid_evaluations']}")
print(f"GTICL wins: {eval_summary['gticl_wins']} ({eval_summary['gticl_win_percentage']:.1f}%)")
print(f"TICL wins: {eval_summary['ticl_wins']} ({eval_summary['ticl_win_percentage']:.1f}%)")

# Print category results
print("\nResults by Category:")
for category_name, results in eval_summary.get('categories', {}).items():
    print(f"\n{category_name}:")
    print(f"  Tasks: {results['total']}")
    print(f"  GTICL wins: {results['gticl_wins']} ({results['gticl_win_percentage']:.1f}%)")
    print(f"  TICL wins: {results['ticl_wins']} ({results['ticl_win_percentage']:.1f}%)")

# Determine overall winner
if eval_summary.get('gticl_win_percentage', 0) > eval_summary.get('ticl_win_percentage', 0):
    print("\nOverall winner: GTICL")
elif eval_summary.get('ticl_win_percentage', 0) > eval_summary.get('gticl_win_percentage', 0):
    print("\nOverall winner: TICL")
else:
    print("\nResult: Tie")

In [None]:
# Function to print detailed results with formatted outputs
def print_detailed_comparison(eval_result):
    print(f"\n===== Detailed Comparison: {eval_result['name']} =====\n")
    print(f"Task: {eval_result['task']}")
    # 'input' may be absent from the task output, so avoid a KeyError
    print(f"Input: {eval_result.get('input', 'N/A')}")
    
    # Print the winner's output first
    if "evaluation" in eval_result and "winner" in eval_result["evaluation"]:
        winner = eval_result["evaluation"]["winner"]
        loser = eval_result["evaluation"]["loser"]
        
        print(f"\n🏆 {winner.upper()} Output (WINNER):")
        print(f"{eval_result[winner + '_output']}")
        
        print(f"\n{loser.upper()} Output:")
        print(f"{eval_result[loser + '_output']}")
        
        # Print evaluation details
        print("\nEvaluation Details:")
        for key, explanation in eval_result["evaluation"].get("explanation", {}).items():
            print(f"  • {key.title()}: {explanation}")
    else:
        # If no evaluation available, just print both outputs
        print("\nGTICL Output:")
        print(eval_result["gticl_output"])
        print("\nTICL Output:")
        print(eval_result["ticl_output"])

# Print detailed comparison for each task
for eval_result in official_evaluations:
    print_detailed_comparison(eval_result)
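
The final report cell below reads a `significance` dictionary with the keys `sample_size`, `gticl_win_rate`, `confidence_interval_95`, `p_value`, `significant_05`, and `significant_01`. One way to produce such a dictionary from the win counts is sketched here, using only the standard library. The function name `compute_significance` is hypothetical, and the choice of an exact two-sided binomial test against a 50% win rate with a Wilson score interval is an assumption, not necessarily the method used in the original experiment.

```python
import math


def compute_significance(gticl_wins: int, ticl_wins: int) -> dict:
    """Exact two-sided binomial test of H0: GTICL win rate = 0.5,
    plus a Wilson 95% confidence interval for the win rate.

    Hypothetical helper; the test choice is an assumption.
    """
    n = gticl_wins + ticl_wins
    if n == 0:
        return {"error": "No valid evaluations to test"}

    rate = gticl_wins / n

    # Probability of each possible win count under H0 (p = 0.5)
    probs = [math.comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = probs[gticl_wins]
    # Two-sided p-value: total probability of outcomes no more likely than the observed one
    p_value = min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))

    # Wilson score interval at the 95% level
    z = 1.959963984540054
    denom = 1 + z ** 2 / n
    centre = (rate + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(rate * (1 - rate) / n + z ** 2 / (4 * n ** 2)) / denom

    return {
        "sample_size": n,
        "gticl_win_rate": rate,
        "confidence_interval_95": (max(0.0, centre - half), min(1.0, centre + half)),
        "p_value": p_value,
        "significant_05": p_value < 0.05,
        "significant_01": p_value < 0.01,
    }
```

With the summary computed earlier, a call such as `compute_significance(eval_summary["gticl_wins"], eval_summary["ticl_wins"])` would yield a dictionary in the shape the report expects. With only a couple of dozen pairwise judgments the interval is wide, so a modest win-rate gap is unlikely to reach significance.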

In [None]:
# Generate a comprehensive final report
def generate_final_report(summary, significance, examples_count=12, eval_tasks_count=30):
    """Generate a final report on the GTICL vs TICL comparison."""
    
    report = f"""# GTICL vs TICL Comparison Final Report

## Experiment Setup
- Training examples: {examples_count} casual email writing samples with consistent author style
- Evaluation tasks: {eval_tasks_count} email writing tasks across three categories
- Categories: Work-related, Social/team, and Personal/external communications
- LLM Model: {os.getenv('LLM_MODEL_ID')} via OpenRouter
- GTICL Configuration: k_candidates={gticl_config.k_candidates}, iterations={gticl_config.iterations}, examples_per_prompt={gticl_config.examples_per_prompt}
- TICL Configuration: iterations={ticl.iterations}, examples_per_prompt={ticl.examples_per_prompt}

## Overall Results
- Valid evaluations: {summary['valid_evaluations']} out of {summary['total_tasks']}
- GTICL wins: {summary['gticl_wins']} ({summary['gticl_win_percentage']:.1f}%)
- TICL wins: {summary['ticl_wins']} ({summary['ticl_win_percentage']:.1f}%)

## Results by Category
"""
    
    for category_name, results in summary.get('categories', {}).items():
        report += f"### {category_name}\n"
        report += f"- Tasks: {results['total']}\n"
        report += f"- GTICL wins: {results['gticl_wins']} ({results['gticl_win_percentage']:.1f}%)\n"
        report += f"- TICL wins: {results['ticl_wins']} ({results['ticl_win_percentage']:.1f}%)\n\n"
    
    report += "## Statistical Significance\n"
    
    if "error" not in significance:
        report += f"- Sample size: {significance['sample_size']} evaluations\n"
        report += f"- GTICL win rate: {significance['gticl_win_rate']:.2f}\n"
        report += f"- 95% Confidence interval: ({significance['confidence_interval_95'][0]:.2f}, {significance['confidence_interval_95'][1]:.2f})\n"
        report += f"- P-value: {significance['p_value']:.4f}\n\n"
        
        if significance["significant_05"]:
            if significance["significant_01"]:
                report += "Result is statistically significant at the 1% level.\n"
            else:
                report += "Result is statistically significant at the 5% level.\n"
        else:
            report += "Result is not statistically significant at the 5% level.\n"
    else:
        report += significance["error"] + "\n"
    
    report += "\n## Conclusion\n"
    
    # Determine the winner for the conclusion
    if summary['gticl_win_percentage'] > summary['ticl_win_percentage']:
        if significance.get("significant_05", False):
            report += "GTICL significantly outperforms TICL in matching the author's casual email writing style. "
        else:
            report += "GTICL appears to outperform TICL in matching the author's casual email writing style, but the difference is not statistically significant. "
    elif summary['ticl_win_percentage'] > summary['gticl_win_percentage']:
        if significance.get("significant_05", False):
            report += "TICL significantly outperforms GTICL in matching the author's casual email writing style. "
        else:
            report += "TICL appears to outperform GTICL in matching the author's casual email writing style, but the difference is not statistically significant. "
    else:
        report += "Both frameworks perform equally in matching the author's casual email writing style. "
    
    # Add category-specific insights
    category_insights = []
    for category_name, results in summary.get('categories', {}).items():
        if results['gticl_win_percentage'] > results['ticl_win_percentage'] + 10:
            category_insights.append(f"GTICL performed notably better for {category_name.lower()} emails")
        elif results['ticl_win_percentage'] > results['gticl_win_percentage'] + 10:
            category_insights.append(f"TICL performed notably better for {category_name.lower()} emails")
    
    if category_insights:
        report += "Category-specific insights: " + "; ".join(category_insights) + ".\n\n"
    
    # Add final recommendation
    report += "\n## Recommendation\n"
    
    if "error" not in significance and significance.get("significant_05", False):
        if summary['gticl_win_percentage'] > summary['ticl_win_percentage']:
            report += "Based on statistically significant results, GTICL is recommended for style matching tasks in casual email writing."
        else:
            report += "Based on statistically significant results, TICL is recommended for style matching tasks in casual email writing."
    else:
        report += "Further testing with a larger sample size is recommended to establish a statistically significant difference between the frameworks."
    
    return report

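# Generate and display final report, passing the actual dataset and task counts
# instead of relying on the hard-coded defaults (12 and 30)
final_report = generate_final_report(
    eval_summary,
    significance_results,
    examples_count=len(casual_emails_dataset),
    eval_tasks_count=len(official_evaluations)
)
print(final_report)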

# Save the report to a file
report_path = "gticl_vs_ticl_comparison_report.md"
with open(report_path, "w") as f:
    f.write(final_report)

print(f"Final report saved to '{report_path}'")