# Performing Model Comparison Query and Query Results Approximation in Task-Me-Anything

In this notebook, we show how to perform a “Model Comparison Query” in Task-Me-Anything. We compare the performance of `gemma-2-9b-it` with the baseline model `Meta-Llama-3-8B-Instruct` over the circle-geometry task plans generated below, by finding the task plans on which `gemma-2-9b-it` performs significantly better than the baseline. After that, we use the `Fit` and `Active` query results approximation algorithms to approximate the performance on the task plans within a budget of only 500 evaluations.

## Generate tasks

This section generates the task plans. The generation process itself is explained in the `generate` part of the demo.

In this step, we enumerate the task plans defined by the circle templates. Each task plan contains all the configuration and content needed to generate a question-answer pair (test instance).

In [1]:
import sys
# set the working directory to the root of the project
sys.path.append("../..")

from tma.mathqa.math_metadata import MathTemplateMetaData
from tma.task_store import TaskStore
from tma.mathqa.geometry_task import CircleGenerator, AngleGenerator, IntersectionGenerator, MidpointGenerator, PerimeterGenerator

template_path = "../../annotations/math_annotations/circle_templates.json"
metadata = MathTemplateMetaData(template_path)
generator = CircleGenerator(metadata)
task_store = TaskStore(CircleGenerator.schema)
generator.enumerate_task_plans(task_store)
df = task_store.return_df()

df

Enumerating templates with 1 params: 100%|██████████| 100/100 [00:00<00:00, 269556.81it/s]
Enumerating templates with 1 params: 100%|██████████| 100/100 [00:00<00:00, 702563.48it/s]
Enumerating templates with 1 params: 100%|██████████| 100/100 [00:00<00:00, 747647.77it/s]
Enumerating templates with 1 params: 100%|██████████| 100/100 [00:00<00:00, 721911.19it/s]


| | question_template | radius | circumference | area |
|---|---|---|---|---|
| 0 | What is the perimeter (circumference) of a cir... | 1.0 | 6.283185307179586 | 3.141592653589793 |
| 1 | What is the perimeter (circumference) of a cir... | 2.0 | 12.566370614359172 | 12.566370614359172 |
| 2 | What is the perimeter (circumference) of a cir... | 3.0 | 18.84955592153876 | 28.274333882308138 |
| 3 | What is the perimeter (circumference) of a cir... | 4.0 | 25.132741228718345 | 50.26548245743669 |
| 4 | What is the perimeter (circumference) of a cir... | 5.0 | 31.41592653589793 | 78.53981633974483 |
| ... | ... | ... | ... | ... |
| 95 | What is the perimeter (circumference) of a cir... | 96.0 | 603.1857894892403 | 28952.917895483533 |
| 96 | What is the perimeter (circumference) of a cir... | 97.0 | 609.4689747964198 | 29559.245277626364 |
| 97 | What is the perimeter (circumference) of a cir... | 98.0 | 615.7521601035994 | 30171.855845076374 |
| 98 | What is the perimeter (circumference) of a cir... | 99.0 | 622.0353454107791 | 30790.74959783356 |
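
As a sanity check on the columns above, the `circumference` and `area` values follow the usual circle formulas. The sketch below is not part of `tma`; `circle_fields` is a hypothetical helper that only reproduces the arithmetic behind one row.

```python
import math

def circle_fields(radius: float) -> dict:
    """Return the circumference and area of a circle with the given radius."""
    return {
        "radius": radius,
        "circumference": 2 * math.pi * radius,  # C = 2 * pi * r
        "area": math.pi * radius ** 2,          # A = pi * r^2
    }

# Recompute the row with radius 3.0 and print it for comparison with the table
row = circle_fields(3.0)
print(row)
```

For radius 2.0 the two columns coincide, because 2πr and πr² are equal when r = 2.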


## Embed the tasks and create a QATaskEvaluator


The task evaluator takes the models and the task plans as input, evaluates the models on the test instances generated from those task plans, and answers queries about their performance.



<!-- Because we want to fit a performance regressor, we need to embed the tasks. We will use the Cohere API to embed the tasks. First you need to set the `api_key` parameter to your Cohere API key. You can also using other embedding API or models to embed the tasks. (e.g Openai embedding API, BERT, etc.)

Then you should create a `VQATaskEvaluator` object. `VQATaskEvaluator` is a class designed to evaluate a model's performance on task. It can handle the details in evaluate the model such as create the embedding of the tasks, fit the performance regressor, etc.

Notice that `VQATaskEvaluator` can cache the embeddings to avoid redundant requests to the OpenAI API. You can change the path of the cache file by setting the `cache_path` parameter. -->

In [2]:
from tma.task_evaluator import QATaskEvaluator

task_evaluator = QATaskEvaluator(
    task_plan_df=df,  # DataFrame of task plans to evaluate
    task_generator=generator,  # task generator, used to generate test instances for each task plan
    embedding_name='st',  # use sentence transformers (st) to embed the questions
    embedding_batch_size=10000,  # batch size for embedding
    n_instance_per_task=5,  # number of test instances generated per task plan
    n_trials_per_instance=3,  # number of trials per test instance
    cache_path_root=".cache",  # enter your path for cache
    seed=42  # random seed
)
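
To make `n_instance_per_task` and `n_trials_per_instance` concrete, here is a minimal sketch of how a per-task-plan score could be aggregated. It assumes the score is the plain average of correctness over all instances and trials. That is an assumption for illustration, not a description of what `QATaskEvaluator` does internally, and `task_plan_accuracy` is a hypothetical helper.

```python
from statistics import mean

def task_plan_accuracy(correct):
    """Average correctness over all instances and trials of one task plan.

    `correct[i][t]` is True when trial t on instance i was answered correctly.
    Assumption: every instance has the same number of trials.
    """
    return mean(mean(1.0 if c else 0.0 for c in trials) for trials in correct)

# 5 instances x 3 trials, mirroring n_instance_per_task=5 and n_trials_per_instance=3
outcomes = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, False],
    [True, True, True],
]
score = task_plan_accuracy(outcomes)
print(f"{score:.3f}")
```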

## Evaluating the model on all the task plans

In this step, we compute the ground truth for the query. We do not use query approximation algorithms here. Instead, we evaluate both models on all the task plans and select the task plans on which the compared model outperforms the baseline by more than the threshold.

You can call `list_textqa_models()` from `tma.models.qa_model.text_qa_model` to list all the available text QA models.

In [3]:
from tma.models.qa_model.text_qa_model import list_textqa_models

# list all available models
list_textqa_models()

['Meta-Llama-3-8B-Instruct', 'gemma-2-9b-it', 'Qwen2-7B-Instruct']

For this showcase, we use `Meta-Llama-3-8B-Instruct` as the baseline model and `gemma-2-9b-it` as the model to compare. You can substitute any other available models, or compare multiple models at once.

In [4]:
%load_ext autoreload
%autoreload 2
from tma.models.qa_model.text_qa_model import TextQAModel
from tma.models.qa_model import prompt
import torch

# single model
baseline_model = TextQAModel(model_name='Meta-Llama-3-8B-Instruct', precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
model_to_compare = TextQAModel(model_name='gemma-2-9b-it', precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")


# # multiple models
# # Notice: If you have multiple GPUs, you can set the torch_device for each model to avoid running out of GPU memory.
# model1 = ImageQAModel(model_name='llavav1.5-7b', torch_device=0, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
# model2 = ImageQAModel(model_name='llavav1.5-13b', torch_device=1, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
 
# baseline_models = [model1, model2]


# model3 = ImageQAModel(model_name='qwenvl', torch_device=3, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
# model4 = ImageQAModel(model_name='qwenvl-chat', torch_device=4, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
 
# models_to_compare = [model3, model4]

[IMPORTANT] model cache is enabled, cache path: .cache/
Loading Meta-Llama-3-8B-Instruct...
HuggingFace meta-llama/Meta-Llama-3-8B-Instruct


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  5.02it/s]


Finish loading Meta-Llama-3-8B-Instruct
[IMPORTANT] model cache is enabled, cache path: .cache/
Loading gemma-2-9b-it...
HuggingFace google/gemma-2-9b-it


Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  5.01it/s]


Finish loading gemma-2-9b-it


After loading the models, we can start evaluating all the task plans.
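
For intuition, a model comparison query can be thought of as a filter on per-task-plan score gaps. The sketch below assumes a single baseline and an absolute score difference. `compare_query` is a hypothetical helper, and how `model_compare` actually combines several baselines or interprets `threshold` is not shown in this notebook.

```python
def compare_query(model_scores, baseline_scores, threshold, greater_than=True):
    """Return indices of task plans whose score gap exceeds `threshold`.

    With greater_than=True the gap is model - baseline; otherwise baseline - model.
    """
    selected = []
    for idx, (m, b) in enumerate(zip(model_scores, baseline_scores)):
        gap = m - b if greater_than else b - m
        if gap > threshold:
            selected.append(idx)
    return selected

# Toy per-task-plan accuracies for four task plans
model_scores = [0.9, 0.5, 0.8, 0.2]
baseline_scores = [0.4, 0.45, 0.6, 0.7]
better = compare_query(model_scores, baseline_scores, threshold=0.3)
worse = compare_query(model_scores, baseline_scores, threshold=0.3, greater_than=False)
print(better, worse)
```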

In [9]:
%load_ext autoreload
%autoreload 2
import numpy as np

# find the task plans on which model_to_compare outperforms baseline_model by more than the threshold (0.3, i.e. 30%)

ground_truth_results = task_evaluator.model_compare(
    x_indices=np.arange(len(df)),
    greater_than=True,
    threshold = 0.3,
    baselines=[baseline_model],
    model = model_to_compare,
    fit_function_approximator=False
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload





[A[A[A

{'question': 'What is the perimeter (circumference) of a circle with radius 1.0?', 'answer': '6.283185307179586', 'task_plan': '{"question_template": "What is the perimeter (circumference) of a circle with radius {param1}?", "radius": "1.0", "circumference": "6.283185307179586", "area": "3.141592653589793"}', 'math_metadata': <tma.mathqa.math_metadata.MathTemplateMetaData object at 0x7fe6844e4090>}


KeyError: 'options'

In [None]:
def display_results(results):
    pattern_stats = results[0]
    # Determine the headers
    headers = ["Pattern", "Times"]
    
    # Calculate the maximum length for formatting
    max_pattern_length = max(len(str(plan[1])) for plan in pattern_stats)
    
    # Print the headers
    print(f"{headers[0]:<{max_pattern_length}} {headers[1]}")
    print("-" * (max_pattern_length + len(headers[1]) + 1))
    
    # Iterate over the task plans and print each plan
    for plan in pattern_stats:
        task_id, attributes = plan
        pattern = ', '.join([f"{attr[0]}: {attr[1]}" for attr in attributes])
        print(f"{pattern:<{max_pattern_length}} {task_id}")
        
display_results(ground_truth_results)

## Apply query approximation algorithms

Query approximation algorithms evaluate the model on only a subset of the task plans and use those results to approximate its performance on all of them.

We will use the `Fit` and `Active` algorithms to approximate the model comparison query, and compare the results of both methods with the ground truth. Each algorithm gets a budget of 500, which means it can evaluate at most 500 task plans.

* In the `Fit` approach, we randomly select 500 task plans and fit the function approximator.
* In the `Active` approach, we start with 200 task plans and then gradually add more task plans to the training set based on the function approximator's predictions.
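
The `Fit` idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: each task plan is reduced to a single numeric feature (a stand-in for its embedding), the function approximator is a least-squares line, and `fit_line` and `fit_approximation` are hypothetical names.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on 1-D features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def fit_approximation(features, evaluate, budget, seed=42):
    """Evaluate a random subset of task plans, then predict the rest."""
    rng = random.Random(seed)
    sampled = rng.sample(range(len(features)), budget)
    scores = {i: evaluate(i) for i in sampled}  # the only real evaluations
    a, b = fit_line([features[i] for i in sampled], [scores[i] for i in sampled])
    # Keep measured scores where available, use predictions elsewhere
    return [scores.get(i, a * features[i] + b) for i in range(len(features))]

# Toy setup: 100 task plans whose true score drops linearly with the feature
features = [i / 100 for i in range(100)]
true_score = lambda i: 1.0 - 0.8 * features[i]
approx = fit_approximation(features, true_score, budget=20)
max_err = max(abs(approx[i] - true_score(i)) for i in range(100))
print(f"max abs error: {max_err:.2e}")
```

Because the toy scores are exactly linear, 20 evaluations are enough to recover all 100 of them. Real task-plan scores are noisier, so the quality of the approximation depends on the embedding and on the approximator.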

In [None]:
# Helper functions that compare the approximation results with the ground truth
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compare_metric(gt, pred):

    # The second element of each result holds the indices of the selected task plans
    gt_selection = gt[1]
    pred_selection = pred[1]

    # Determine the maximum index for array sizing
    max_index = max(max(gt_selection, default=0), max(pred_selection, default=0))

    # Initialize the labels based on the maximum index
    gt_label = np.zeros(max_index + 1)
    pred_label = np.zeros(max_index + 1)

    for k in gt_selection:
        gt_label[k] = 1

    for k in pred_selection:
        pred_label[k] = 1

    f1 = f1_score(gt_label, pred_label) * 100
    acc = accuracy_score(gt_label, pred_label) * 100
    precision = precision_score(gt_label, pred_label) * 100
    recall = recall_score(gt_label, pred_label) * 100

    return precision, recall, f1, acc

def print_metrics(precision, recall, f1, acc):
    print(f"{'Metric':<15} {'Value':<10}")
    print("-" * 25)
    print(f"{'Precision:':<15} {precision:.2f}%")
    print(f"{'Recall:':<15} {recall:.2f}%")
    print(f"{'F1 Score:':<15} {f1:.2f}%")
    print(f"{'Accuracy:':<15} {acc:.2f}%")
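
The metrics above treat the selected task-plan indices as positive labels. An equivalent set-based computation, written without scikit-learn, may make the definitions easier to follow. `selection_metrics` is a hypothetical helper used only for this illustration.

```python
def selection_metrics(gt_indices, pred_indices):
    """Precision, recall and F1 (in percent) for two sets of selected indices."""
    gt, pred = set(gt_indices), set(pred_indices)
    tp = len(gt & pred)  # task plans selected by both
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision * 100, recall * 100, f1 * 100

# Ground truth selects 4 task plans; the approximation selects 3, of which 2 are correct
p, r, f1 = selection_metrics(gt_indices=[1, 4, 7, 9], pred_indices=[1, 4, 8])
print(f"precision={p:.2f}% recall={r:.2f}% f1={f1:.2f}%")
```

Note that the accuracy returned by `compare_metric` is computed only over indices from 0 up to the largest selected index, so it depends on where the selected task plans happen to fall.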

### Use "Fit" approximation algorithm

In [None]:
budget = 500
np.random.seed(42)
perm = np.random.permutation(len(df))
x_indices = perm[:budget]

fit_results = task_evaluator.model_compare(
    x_indices=x_indices,
    greater_than=True,
    threshold = 0.2,
    baselines=[baseline_model],
    model = model_to_compare,
    fit_function_approximator=True
)
precision, recall, f1, acc = compare_metric(ground_truth_results, fit_results)
print_metrics(precision, recall, f1, acc)
display_results(fit_results)

### Use "Active" approximation algorithm

In [None]:
warmup_budget=200
active_results = task_evaluator.active_model_compare(
    k=10,
    warmup_budget=warmup_budget,
    budget=budget-warmup_budget,
    greater_than=True,
    threshold = 0.2,
    baselines=[baseline_model],
    model = model_to_compare,
)

precision, recall, f1, acc = compare_metric(ground_truth_results, active_results)
print_metrics(precision, recall, f1, acc)
display_results(active_results)
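
For intuition, the `Active` loop can be sketched as a warm-up phase followed by rounds of "fit, pick the most promising unevaluated task plans, evaluate them". Everything below is a hypothetical illustration: the 1-D feature, the least-squares line, and the reading of `k` as the number of task plans added per round are assumptions, and `active_model_compare` may work differently.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on 1-D features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def active_approximation(features, evaluate, warmup_budget, budget, k, seed=42):
    """Warm up on random task plans, then spend `budget` more evaluations in rounds of k."""
    rng = random.Random(seed)
    n = len(features)
    evaluated = {i: evaluate(i) for i in rng.sample(range(n), warmup_budget)}
    remaining = budget
    while remaining > 0:
        idx = list(evaluated)
        a, b = fit_line([features[i] for i in idx], [evaluated[i] for i in idx])
        candidates = [i for i in range(n) if i not in evaluated]
        if not candidates:
            break
        # Prefer the task plans with the highest predicted score gap
        candidates.sort(key=lambda i: a * features[i] + b, reverse=True)
        batch = candidates[:min(k, remaining)]
        for i in batch:
            evaluated[i] = evaluate(i)
        remaining -= len(batch)
    return evaluated

# Toy setup: the score gap is largest for the task plans with the smallest feature
features = [i / 100 for i in range(100)]
true_gap = lambda i: 0.5 - features[i]
evaluated = active_approximation(features, true_gap, warmup_budget=10, budget=20, k=5)
print(len(evaluated), "task plans evaluated")
```

Compared with random sampling, this loop concentrates its evaluations where the approximator expects the query to be satisfied, which is the motivation for splitting the budget into a warm-up part and an actively chosen part.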