In [None]:
# For licensing see accompanying LICENSE file.
# Copyright (C) 2025 Apple Inc. All Rights Reserved.

# Creating "synthetic" or "virtual" agents

When creating experiments there are typically two types of annotators: agents (our agent) and baselines (e.g. the AlpacaEval 2.0 annotator).
Often we want to test the agent with *multiple* underlying baseline annotators. However, running each agent and baseline
combination separately repeats a lot of work: the agent code stays the same, and only the annotations by the baseline change.

Thus, instead of (expensively and slowly) re-running the agent code for each baseline, we can swap out the baseline inside
the agent and re-use the results we got from running the agent once (or once per seed, if using multiple seeds). We refer to this
as a "synthetic" agent. It is "synthetic" in the sense that we didn't run this configuration separately, but rather
combined two existing results into a new one. Note that this is equivalent to actually having run the configuration.

The one difference shows up when comparing across different agents and baselines: differences between synthetic agent
configurations are then not due to the inherent noise of the agent itself (as the same agent results, and thus the same seeds,
are re-used) but only to the baseline annotator used. This caveat is worth pointing out, but should not affect the validity of our results.
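
The swapping idea described above can be illustrated with a toy sketch. This is not the `ageval` implementation: the field names (`used_baseline`, `preference`), the baseline names, and the rule that the agent defers to its baseline on some datapoints are all assumptions made for illustration only.

```python
# Toy illustration only: the data layout and the "defer to baseline" rule
# are assumptions, not the actual ageval data format or logic.

def make_synth_annotations(agent_results, baseline_preferences):
    """Combine one existing agent run with one baseline's annotations."""
    combined = {}
    for item_id, result in agent_results.items():
        if result["used_baseline"]:
            # The agent deferred here, so we plug in the swapped-in baseline.
            combined[item_id] = baseline_preferences[item_id]
        else:
            # The agent decided itself; this result is re-used unchanged.
            combined[item_id] = result["preference"]
    return combined


# One (hypothetical) agent run, executed only once.
agent_results = {
    "q1": {"used_baseline": False, "preference": "text_a"},
    "q2": {"used_baseline": True, "preference": None},
    "q3": {"used_baseline": True, "preference": None},
}

# Annotations from two (hypothetical) baseline annotators.
baselines = {
    "baseline_x": {"q1": "text_b", "q2": "text_a", "q3": "text_b"},
    "baseline_y": {"q1": "text_a", "q2": "text_b", "q3": "text_b"},
}

# One synthetic agent per baseline, without re-running the agent.
synth_agents = {
    name: make_synth_annotations(agent_results, prefs)
    for name, prefs in baselines.items()
}
```

In this sketch, both synthetic agents share the agent's own decision on `q1` and differ only where the baseline was consulted, which mirrors the caveat above about the source of differences between configurations.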

In [1]:
import ageval.analysis.post_processing
import ageval.analysis.data_loader


# first we set some multirun paths (ADAPT THIS TO YOUR MULTIRUN PATHS)
multirun_paths = [
    "../../../project-agent-evaluator-results/2024_08_19_presentation_v5/0410_gsm8k_baseline_alpacaeval",
    "../../../project-agent-evaluator-results/2024_08_19_presentation_v5/0400_gsm8k_agent",
]
multirun_path = ", ".join(multirun_paths)

# we load the results using the data_loader module
# (this module is built for loading experiment run data)
annotation_dfs, metric_dfs, results_dict = (
    ageval.analysis.data_loader.load_experiments_from_multirun(multirun_path)
)

# name of the agent we want to base our synthetic agents on
# Not sure what your agent is called? Print some of the metric_dfs loaded above
# to see the available agent names in your runs.
AGENT_NAME = "agent_gpt-4o-2024-05-13_math_checker_base-basic"

# save path to save the results to
BASE_SAVE_DIR = "../../../project-agent-evaluator-results/2024_08_19_presentation_v5/0950_gsm8k_agent_synth_tutorial"

# variable to save output paths in
paths = []

# iterate over annotation dfs 
# (one dict entry per dataset, with annotations of multiple annotator(s) (seeds))
for data_path, df_an in annotation_dfs.items():
    # get metric df for given dataset
    df_met = metric_dfs[data_path]
    short_data_path = data_path.split("/")[-1].split(".csv")[0]

    # create save path (so the synth agents are easy to find by dataset)
    save_dir = f"{BASE_SAVE_DIR}/{short_data_path}"

    # generate the actual synthetic agents, for every baseline 
    # inside the annotation df (df_an).
    ageval.analysis.post_processing.generate_synth_agent_results(
        agent_name=AGENT_NAME, 
        annotation_df=df_an,
        metric_df=df_met, # used to get the unique model names
        original_data_path=data_path,
        save_dir=save_dir,
    )
    paths.append(save_dir)

# The synthetic agents are now saved under BASE_SAVE_DIR
# (one subdirectory per dataset, collected in `paths`).
# To plot the results, use tutorial notebook 0003.