# Baseline model without RAG

This is a baseline model for the QA task without the RAG pipeline.

For comparison, we test the four models that serve as backbone models in the RAG pipeline: `meta-llama/Llama-3.2-3B-Instruct`, `meta-llama/Llama-3.1-8B-Instruct`, `gemini-2.0-flash`, and `gemini-2.0-flash-thinking-exp-01-21`. We keep the same data type (fp16), the same tokenizer configuration, and the same prompt format as in the RAG pipeline, so the only difference is that no retrieved context is added to the prompt.

### meta-llama/Llama-3.2-3B-Instruct model

In [None]:
import os  # needed for os.getenv below
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from huggingface_hub import login


model_name = "meta-llama/Llama-3.2-3B-Instruct"

login(token=os.getenv("HUGGINGFACE_TOKEN", "your_huggingface_token"))


model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


In [None]:
# Step 3: load qa annotation test set
import pandas as pd
# qa_df = pd.read_csv("../data/annotated/QA_pairs_1.csv")
qa_df = pd.read_csv("./data/test/test_questions.csv")

# doc_ids = qa_df["Doc_id"].tolist()
questions = qa_df["Question"].tolist()
# answers = qa_df["Reference_Answers"].tolist()

# # random sample 10 qa pairs
# import random
# sample_size = 10
# random.seed(747)
# sample_indices = random.sample(range(len(questions)), sample_size)
# sample_doc_ids = [doc_ids[i] for i in sample_indices]
# sample_questions = [questions[i] for i in sample_indices]
# sample_answers = [answers[i] for i in sample_indices]

In [None]:
template = """
You are an expert assistant answering factual questions about various aspects of Pittsburgh or Carnegie Mellon University (CMU), including history, policy, culture, events, and more.
If you do not know the answer, just say "I don't know."

Important Instructions:
- Answer concisely without repeating the question.
- Do **not** use complete sentences. Provide only the word, name, date, or phrase that directly answers the question. For example, given the question "When was Carnegie Mellon University founded?", you should only answer "1900".

Examples:
Question: Who is Pittsburgh named after?
Answer: William Pitt
Question: What famous machine learning venue had its first conference in Pittsburgh in 1980?
Answer: ICML
Question: What musical artist is performing at PPG Arena on October 13?
Answer: Billie Eilish

Question: {question} \n\n
Answer:
"""

In [None]:
# use the template to generate the answers
from tqdm import tqdm
generated_answers = []
for question in tqdm(questions):
    full_prompt = template.format(question=question)
    messages = [
        {"role": "user", "content": full_prompt},
        ]
    output = generation_pipe(messages, max_new_tokens=50)
    generated_answers.append(output[0]["generated_text"][1]['content'])

  2%|▏         | 10/574 [00:03<02:11,  4.29it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 574/574 [02:04<00:00,  4.60it/s]
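
The warning in the log above suggests that calling the pipeline once per question leaves GPU throughput unused. A minimal sketch of batching the prompts is shown below. `chunked`, `generate_in_batches`, and `generate_fn` are hypothetical names introduced here; `generate_fn` stands in for the real pipeline call, and the sketch assumes it returns one list of candidate dicts per conversation.

```python
from typing import Callable, Iterator, List


def chunked(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_batches(
    questions: List[str],
    template: str,
    generate_fn: Callable[[List[list]], List[list]],
    batch_size: int = 8,
) -> List[str]:
    """Format every question with `template` and generate answers batch by batch.

    `generate_fn` is a hypothetical stand-in for the pipeline call. It is
    assumed to take a list of conversations and to return, for each
    conversation, a list of dicts whose "generated_text" holds the full
    message list with the assistant reply last.
    """
    answers: List[str] = []
    for batch in chunked(questions, batch_size):
        conversations = [
            [{"role": "user", "content": template.format(question=q)}]
            for q in batch
        ]
        outputs = generate_fn(conversations)
        # Keep only the assistant reply of the first candidate per conversation.
        answers.extend(out[0]["generated_text"][-1]["content"] for out in outputs)
    return answers
```

With the real pipeline, `generate_fn` could be something like `lambda convs: generation_pipe(convs, max_new_tokens=50, batch_size=len(convs))`. This relies on the pipeline accepting a list of conversations together with a `batch_size` argument, which should be checked against the installed `transformers` version. The left padding configured on the tokenizer above matters here, because batched generation with a decoder-only model pads prompts on the left.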


In [None]:
# write all columns to a csv file
# results_df = pd.DataFrame({
#         "Doc_id": doc_ids,
#         "Question": questions,
#         "Reference_Answers": answers,
#         "Generated_Answer": generated_answers,
#     })

results_df = pd.DataFrame({
        "Question": questions,
        "Generated_Answer": generated_answers,
    })

# save the results to a csv file
results_df.to_csv("./output/closebook_baseline.csv", index=False)

In [None]:
results_df

| | Question | Generated_Answer |
|---|---|---|
| 0 | What bank, which is the 5th largest in the US,... | PNC Bank |
| 1 | How many bridges does Pittsburgh have? | 403 |
| 2 | Who named the city of Pittsburgh? | General Robert Moore |
| 3 | At what park do the three rivers converge in P... | Point State Park |
| 4 | How many neighborhoods does Pittsburgh have? | 19 |
| ... | ... | ... |
| 569 | What is the primary focus of the event at the ... | Pittsburgh JazzLive |
| 570 | Where and when is the Pittsburgh Veg Fair held... | Pennsylvania State Farm Show Complex |
| 571 | How can restaurants get involved with Pittsbur... | Register online through VisitPittsburgh |
| 572 | What are the benefits of sponsoring the Pittsb... | I don't know |



### meta-llama/Llama-3.1-8B-Instruct

In [None]:
import os  # needed for os.getenv below
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from huggingface_hub import login


model_name = "meta-llama/Llama-3.1-8B-Instruct"

login(token=os.getenv("HUGGINGFACE_TOKEN", "your_huggingface_token"))

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16
)

In [None]:
# Step 3: load qa annotation test set
import pandas as pd
qa_df = pd.read_csv("./data/annotated/QA_pairs_1.csv")
#qa_df = pd.read_csv("./data/test/test_questions.csv")

doc_ids = qa_df["Doc_id"].tolist()
questions = qa_df["Question"].tolist()
answers = qa_df["Reference_Answers"].tolist()

# randomly sample `sample_size` QA pairs
import random
sample_size = 100
random.seed(747)
sample_indices = random.sample(range(len(questions)), sample_size)
sample_doc_ids = [doc_ids[i] for i in sample_indices]
sample_questions = [questions[i] for i in sample_indices]
sample_answers = [answers[i] for i in sample_indices]
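
The sampling above seeds the global `random` module, so any other code that draws from it before `random.sample` would change the sample. A minimal sketch of the same idea with a local generator is shown below. `sample_indices` and `take` are hypothetical helper names, not part of the notebook.

```python
import random
from typing import List, Sequence


def sample_indices(n_items: int, sample_size: int, seed: int) -> List[int]:
    """Draw `sample_size` distinct indices from range(n_items), reproducibly.

    A private random.Random instance is used, so the result depends only on
    `seed` and not on other code that touches the global RNG state.
    """
    rng = random.Random(seed)
    return rng.sample(range(n_items), sample_size)


def take(values: Sequence, indices: List[int]) -> list:
    """Select the entries of `values` at `indices`, keeping the index order."""
    return [values[i] for i in indices]
```

Applying the same index list to `doc_ids`, `questions`, and `answers` keeps the three columns aligned, exactly as the list comprehensions above do.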

In [None]:
template = """
You are an expert assistant answering factual questions about various aspects of Pittsburgh or Carnegie Mellon University (CMU), including history, policy, culture, events, and more.
If you do not know the answer, just say "I don't know."

Important Instructions:
- Answer concisely without repeating the question.
- Do **not** use complete sentences. Provide only the word, name, date, or phrase that directly answers the question. For example, given the question "When was Carnegie Mellon University founded?", you should only answer "1900".

Examples:
Question: Who is Pittsburgh named after?
Answer: William Pitt
Question: What famous machine learning venue had its first conference in Pittsburgh in 1980?
Answer: ICML
Question: What musical artist is performing at PPG Arena on October 13?
Answer: Billie Eilish

Question: {question} \n\n
Answer:
"""

In [None]:
# use the template to generate the answers
from tqdm import tqdm
generated_answers = []
for question in tqdm(questions):
    full_prompt = template.format(question=question)
    messages = [
        {"role": "user", "content": full_prompt},
        ]
    output = generation_pipe(messages, max_new_tokens=50)
    generated_answers.append(output[0]["generated_text"][1]['content'])
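
The expression `output[0]["generated_text"][1]['content']` works because each prompt here is a single user message, so the assistant reply sits at index 1. A slightly more defensive sketch is shown below. `extract_answer` and `mock_output` are hypothetical, and the mock only imitates the structure the loop above relies on.

```python
from typing import Any, Dict, List


def extract_answer(output: List[Dict[str, Any]]) -> str:
    """Return the assistant reply from a chat-style pipeline output.

    Assumes `output[0]["generated_text"]` is the full message list, with the
    newly generated assistant message appended at the end.
    """
    messages = output[0]["generated_text"]
    last = messages[-1]
    if last.get("role") != "assistant":
        raise ValueError("no assistant reply found in pipeline output")
    return last["content"].strip()


# Hand-written mock that mirrors the structure indexed in the loop above.
mock_output = [
    {
        "generated_text": [
            {"role": "user", "content": "Question: Who is Pittsburgh named after?"},
            {"role": "assistant", "content": " William Pitt\n"},
        ]
    }
]
```

Taking the last message instead of index 1 keeps the extraction correct if a system message is ever added in front of the user message.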

In [None]:
# write all columns to a csv file
results_df = pd.DataFrame({
        "Doc_id": doc_ids,
        "Question": questions,
        "Reference_Answers": answers,
        "Generated_Answer": generated_answers,
    })

# results_df = pd.DataFrame({
#         "Question": questions,
#         "Generated_Answer": generated_answers,
#     })

# save the results to a csv file
results_df.to_csv("./output/llama31_8B_baseline.csv", index=False)

In [None]:
results_df
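
Since `results_df` holds both `Reference_Answers` and `Generated_Answer`, the two columns can be compared directly. The sketch below shows one common way to do so, normalized exact match. It is not necessarily the metric used in this project, and `normalize`, `exact_match`, and `exact_match_rate` are hypothetical names. It also assumes each question's references are already available as a list of strings.

```python
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def exact_match_rate(predictions: List[str], references: List[List[str]]) -> float:
    """Fraction of predictions that exactly match one of their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)
```

Exact match is strict: a correct answer phrased differently from every reference counts as a miss, so it is usually read alongside a softer score.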

In [None]:
import os  # needed for os.getenv inside run_llama_baseline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from huggingface_hub import login
import pandas as pd
import random
from tqdm import tqdm

def run_llama_baseline(model_name, output_file, qa_file="./data/annotated/QA_pairs_1.csv", sample_size=100):
    """
    Generate QA baseline answers using a specified model and save the results to a CSV file.

    Args:
        model_name (str): Name of the Hugging Face model to use.
        output_file (str): Path to save the generated answers.
        qa_file (str): Path to the QA annotation test set CSV file.
        sample_size (int): Number of QA pairs to sample for evaluation.
    """
    # Login to Hugging Face Hub
    login(token=os.getenv("HUGGINGFACE_TOKEN", "your_huggingface_token"))

    # Load the model and tokenizer
    print(f"Loading model: {model_name}")
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    # Initialize the text generation pipeline
    generation_pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16
    )

    # Load QA annotation test set
    print(f"Loading QA file: {qa_file}")
    qa_df = pd.read_csv(qa_file)

    # Randomly sample QA pairs; the fixed random_state makes the sample reproducible
    qa_df = qa_df.sample(sample_size, random_state=221)
    sample_doc_ids = qa_df["Doc_id"].tolist()
    sample_questions = qa_df["Question"].tolist()
    sample_answers = qa_df["Reference_Answers"].tolist()

    # Define the prompt template
    template = """
    You are an expert assistant answering factual questions about various aspects of Pittsburgh or Carnegie Mellon University (CMU), including history, policy, culture, events, and more.
    If you do not know the answer, just say "I don't know."

    Important Instructions:
    - Answer concisely without repeating the question.
    - Do **not** use complete sentences. Provide only the word, name, date, or phrase that directly answers the question. For example, given the question "When was Carnegie Mellon University founded?", you should only answer "1900".

    Examples:
    Question: Who is Pittsburgh named after?
    Answer: William Pitt
    Question: What famous machine learning venue had its first conference in Pittsburgh in 1980?
    Answer: ICML
    Question: What musical artist is performing at PPG Arena on October 13?
    Answer: Billie Eilish

    Question: {question} \n\n
    Answer:
    """

    # Generate answers
    print("Generating answers...")
    generated_answers = []
    for question in tqdm(sample_questions):
        full_prompt = template.format(question=question)
        messages = [
            {"role": "user", "content": full_prompt},
        ]
        output = generation_pipe(messages, max_new_tokens=50)
        generated_answers.append(output[0]["generated_text"][1]['content'])

    # Save results to a CSV file
    results_df = pd.DataFrame({
        "Doc_id": sample_doc_ids,
        "Question": sample_questions,
        "Reference_Answers": sample_answers,
        "Generated_Answer": generated_answers,
    })
    results_df.to_csv(output_file, index=False)
    print(f"Results saved to {output_file}")
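
One detail of the function above: because the template is written as an indented triple-quoted string, every prompt line starts with four extra spaces, unlike the top-level template used in the earlier cells. A minimal sketch of removing that indentation with `textwrap.dedent` is shown below. `build_prompt` is a hypothetical helper, and the shortened template text is only a placeholder for the real one.

```python
import textwrap


def build_prompt(question: str) -> str:
    """Format a prompt from an indented template without the indentation.

    The template text is a shortened placeholder. dedent() is applied before
    format() so that the inserted question cannot change the common margin.
    """
    template = """
    Answer concisely without repeating the question.

    Question: {question}
    Answer:
    """
    return textwrap.dedent(template).format(question=question)
```

Whether the extra indentation changes model answers is not measured here. Dedenting simply makes the prompt text identical regardless of where the template is defined.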

In [4]:
# Define the models and output file names
models = [
    ("meta-llama/Llama-3.2-3B-Instruct", "./output/llama32_3B_baseline.csv"),
    ("meta-llama/Llama-3.1-8B-Instruct", "./output/llama31_8B_baseline.csv")#,
    #("meta-llama/Llama-3.3-70B-Instruct", "./output/llama33_70B_baseline.csv")
]

# Run the function for each model
for model_name, output_file in models:
    run_llama_baseline(model_name, output_file)

Loading model: meta-llama/Llama-3.1-8B-Instruct


Device set to use cuda:0


Loading QA file: ./data/annotated/QA_pairs_1.csv
Generating answers...


 10%|█         | 10/100 [00:04<00:40,  2.21it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 100/100 [00:24<00:00,  4.09it/s]


Results saved to ./output/llama31_8B_baseline.csv


### meta-llama/Llama-3.3-70B-Instruct

In [None]:
run_llama_baseline("meta-llama/Llama-3.3-70B-Instruct", "./output/llama33_70B_baseline.csv")

Loading model: meta-llama/Llama-3.3-70B-Instruct


Device set to use cuda:0


Loading QA file: ./data/annotated/QA_pairs_1.csv
Generating answers...


  0%|          | 0/100 [00:00<?, ?it/s]

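Google Colab Pro with an A100 cannot run Llama 3.3 70B, likely because the fp16 weights (roughly 140 GB for 70B parameters) far exceed the GPU memory. Colab Pro+ might be able to, but we have not verified this.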
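
A rough back-of-the-envelope check makes the memory problem concrete. The sketch below counts weights only and ignores activations and the KV cache. The parameter counts are the nominal ones from the model names, and the 40 GB figure for the Colab A100 is an assumption. `weight_memory_gb` and `fits_on_gpu` are hypothetical helpers.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 corresponds to fp16, the data type used in this notebook.
    """
    return n_params * bytes_per_param / 1e9


def fits_on_gpu(n_params: float, gpu_memory_gb: float, bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit in GPU memory (a necessary condition only)."""
    return weight_memory_gb(n_params, bytes_per_param) <= gpu_memory_gb


# Nominal parameter counts taken from the model names.
for name, n_params in [("3B", 3e9), ("8B", 8e9), ("70B", 70e9)]:
    needed = weight_memory_gb(n_params)
    # 40.0 GB is an assumed A100 memory size, not a value from the notebook.
    print(f"{name}: ~{needed:.0f} GB in fp16, fits on 40 GB: {fits_on_gpu(n_params, 40.0)}")
```

Under these assumptions the 3B and 8B models need about 6 GB and 16 GB, while the 70B model needs about 140 GB, so with `device_map="auto"` most of its layers would have to be offloaded away from the GPU.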

### gemini-2.0-flash

In [None]:
import os
import random
import time

import pandas as pd
from google import genai
from tqdm import tqdm

def call_gemini_with_interval(client, model, contents, config, interval=20, retries=3, backoff_factor=2):
    """
    Call Google Gemini API with a fixed interval between requests and retry logic for handling errors.

    Args:
        client (genai.Client): The Google Gemini client.
        model (str): The model name (e.g., "gemini-2.0-flash").
        contents (str): The input prompt or question.
        config (dict): Configuration for the API call (e.g., tools, temperature, etc.).
        interval (int): Time in seconds to wait between API requests.
        retries (int): Number of retries for handling quota exhaustion or other errors.
        backoff_factor (int): Factor for exponential backoff in case of retries.

    Returns:
        response: The response from the Google Gemini API.
    """
    for attempt in range(retries):
        try:
            # Send the API request
            response = client.models.generate_content(
                model=model,
                contents=contents,
                config=config,
            )
            print(f"Request succeeded. Waiting {interval} seconds before the next request...")
            time.sleep(interval)  # Wait for the specified interval
            return response
        except Exception as e:
            if "RESOURCE_EXHAUSTED" in str(e):
                print(f"Quota exceeded. Retrying in {backoff_factor ** attempt} seconds...")
                time.sleep(backoff_factor ** attempt)
            else:
                raise e
    raise Exception("Max retries exceeded. Quota still exhausted.")



def run_gemini_baseline(api_key, output_file, qa_file="./data/annotated/QA_pairs_1.csv", sample_size=100):
    """
    Generate QA baseline answers using Google's Gemini model and save the results to a CSV file.

    Args:
        api_key (str): API key for accessing Google's Gemini model.
        output_file (str): Path to save the generated answers.
        qa_file (str): Path to the QA annotation test set CSV file.
        sample_size (int): Number of QA pairs to sample for evaluation.
    """
    # Initialize the Gemini client
    client = genai.Client(api_key=api_key)

    # Load QA annotation test set
    print(f"Loading QA file: {qa_file}")
    # Randomly sample QA pairs
    random.seed(42)
    qa_df = pd.read_csv(qa_file)
    qa_df = qa_df.sample(sample_size, random_state=221)
    sample_doc_ids = qa_df["Doc_id"].tolist()
    sample_questions = qa_df["Question"].tolist()
    sample_answers = qa_df["Reference_Answers"].tolist()

    # Define the prompt template
    template = """
    You are an expert assistant answering factual questions about various aspects of Pittsburgh or Carnegie Mellon University (CMU), including history, policy, culture, events, and more.
    If you do not know the answer, just say "I don't know."

    Important Instructions:
    - Answer concisely without repeating the question.
    - Do **not** use complete sentences. Provide only the word, name, date, or phrase that directly answers the question. For example, given the question "When was Carnegie Mellon University founded?", you should only answer "1900".

    Examples:
    Question: Who is Pittsburgh named after?
    Answer: William Pitt
    Question: What famous machine learning venue had its first conference in Pittsburgh in 1980?
    Answer: ICML
    Question: What musical artist is performing at PPG Arena on October 13?
    Answer: Billie Eilish

    Question: {question} \n\n
    Answer:
    """

    # Generate answers
    print("Generating answers using Google Gemini...")
    generated_answers = []
    for question in tqdm(sample_questions):
        full_prompt = template.format(question=question)
        response = call_gemini_with_interval(
          client,
          model="gemini-2.0-flash-thinking-exp-01-21",
          contents=full_prompt,
          config={}, # {"tools": [{"google_search": {}}]},
          interval=4,  # Wait 4 seconds between requests
          retries=3,    # Retry up to 3 times if quota is exceeded
          backoff_factor=2  # Exponential backoff for retries
        )

        # Extract the generated answer
        generated_answers.append(response.text)

    # Save results to a CSV file
    results_df = pd.DataFrame({
        "Doc_id": sample_doc_ids,
        "Question": sample_questions,
        "Reference_Answers": sample_answers,
        "Generated_Answer": generated_answers,
    })
    results_df.to_csv(output_file, index=False)
    print(f"Results saved to {output_file}")
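
The timing behaviour of `call_gemini_with_interval` can be worked out without calling the API. The sketch below lists the waits used by the retry loop (`backoff_factor ** attempt`) and the minimum runtime imposed by the fixed interval. `backoff_delays` and `min_runtime_seconds` are hypothetical helpers added for illustration.

```python
from typing import List


def backoff_delays(retries: int = 3, backoff_factor: int = 2) -> List[int]:
    """Waits (in seconds) before each retry, matching backoff_factor ** attempt."""
    return [backoff_factor ** attempt for attempt in range(retries)]


def min_runtime_seconds(n_requests: int, interval: int) -> int:
    """Lower bound on total runtime: one fixed sleep after every successful request."""
    return n_requests * interval


print(backoff_delays(3, 2))         # [1, 2, 4]
print(min_runtime_seconds(100, 4))  # 400
```

With `retries=3` and `backoff_factor=2` the function waits at most 1 + 2 + 4 = 7 seconds in total before giving up on a request, which may be short for a per-minute quota. With `interval=4`, 100 questions take at least 400 seconds of sleeping on top of the API latency.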

In [None]:
# Example usage
api_key = "your gemini API key"
output_file = "./output/gemini_2_flash_thinking_baseline.csv"
qa_file = "./data/annotated/QA_pairs_1.csv"
sample_size = 100

run_gemini_baseline(api_key, output_file, qa_file, sample_size)

Loading QA file: ./data/annotated/QA_pairs_1.csv
Generating answers using Google Gemini...


  0%|          | 0/100 [00:00<?, ?it/s]

Request succeeded. Waiting 4 seconds before the next request...


  1%|          | 1/100 [00:05<09:48,  5.94s/it]

Request succeeded. Waiting 4 seconds before the next request...


 94%|█████████▍| 94/100 [09:50<00:40,  6.76s/it]

Request succeeded. Waiting 4 seconds before the next request...


 95%|█████████▌| 95/100 [09:55<00:30,  6.17s/it]

Request succeeded. Waiting 4 seconds before the next request...


 96%|█████████▌| 96/100 [10:00<00:23,  5.88s/it]

Request succeeded. Waiting 4 seconds before the next request...


 97%|█████████▋| 97/100 [10:07<00:18,  6.01s/it]

Request succeeded. Waiting 4 seconds before the next request...


 98%|█████████▊| 98/100 [10:12<00:11,  5.92s/it]

Request succeeded. Waiting 4 seconds before the next request...


 99%|█████████▉| 99/100 [10:17<00:05,  5.62s/it]

Request succeeded. Waiting 4 seconds before the next request...


100%|██████████| 100/100 [10:28<00:00,  6.28s/it]

Results saved to ./output/gemini_2_flash_thinking_baseline.csv





In [None]:
import os
import time
import pandas as pd
import random
from tqdm import tqdm
from google import genai


def call_gemini_with_interval(client, model, contents, config, interval=20, retries=3, backoff_factor=2):
    """
    Call Google Gemini API with a fixed interval between requests and retry logic for handling errors.

    Args:
        client (genai.Client): The Google Gemini client.
        model (str): The model name (e.g., "gemini-2.0-flash").
        contents (str): The input prompt or question.
        config (dict): Configuration for the API call (e.g., tools, temperature, etc.).
        interval (int): Time in seconds to wait between API requests.
        retries (int): Number of retries for handling quota exhaustion or other errors.
        backoff_factor (int): Factor for exponential backoff in case of retries.

    Returns:
        response: The response from the Google Gemini API.
    """
    for attempt in range(retries):
        try:
            # Send the API request
            response = client.models.generate_content(
                model=model,
                contents=contents,
                config=config,
            )
            print(f"Request succeeded. Waiting {interval} seconds before the next request...")
            time.sleep(interval)  # Wait for the specified interval
            return response
        except Exception as e:
            if "RESOURCE_EXHAUSTED" in str(e):
                print(f"Quota exceeded. Retrying in {backoff_factor ** attempt} seconds...")
                time.sleep(backoff_factor ** attempt)
            else:
                raise e
    raise Exception("Max retries exceeded. Quota still exhausted.")



def run_gemini_baseline_with_search(api_key, output_file, qa_file="./data/annotated/QA_pairs_1.csv", sample_size=100):
    """
    Generate QA baseline answers using Google's Gemini model with Google Search grounding and save the results to a CSV file.

    Args:
        api_key (str): API key for accessing Google's Gemini model.
        output_file (str): Path to save the generated answers.
        qa_file (str): Path to the QA annotation test set CSV file.
        sample_size (int): Number of QA pairs to sample for evaluation.
    """
    # Initialize the Gemini client
    client = genai.Client(api_key=api_key)

    # Load QA annotation test set
    print(f"Loading QA file: {qa_file}")
    qa_df = pd.read_csv(qa_file)
    # Randomly sample QA pairs (random_state makes the sample reproducible)
    random.seed(42)
    qa_df = qa_df.sample(sample_size, random_state=221)
    sample_doc_ids = qa_df["Doc_id"].tolist()
    sample_questions = qa_df["Question"].tolist()
    sample_answers = qa_df["Reference_Answers"].tolist()

    # Define the prompt template
    template = """
    You are an expert assistant answering factual questions about various aspects of Pittsburgh or Carnegie Mellon University (CMU), including history, policy, culture, events, and more.
    If you do not know the answer, just say "I don't know."

    Important Instructions:
    - Answer concisely without repeating the question.
    - Do **not** use complete sentences. Provide only the word, name, date, or phrase that directly answers the question. For example, given the question "When was Carnegie Mellon University founded?", you should only answer "1900".

    Examples:
    Question: Who is Pittsburgh named after?
    Answer: William Pitt
    Question: What famous machine learning venue had its first conference in Pittsburgh in 1980?
    Answer: ICML
    Question: What musical artist is performing at PPG Arena on October 13?
    Answer: Billie Eilish

    Question: {question} \n\n
    Answer:
    """

    # Generate answers
    print("Generating answers using Google Gemini with Google Search grounding...")
    generated_answers = []
    search_queries = []
    search_pages = []

    for question in tqdm(sample_questions):
        full_prompt = template.format(question=question)
        response = call_gemini_with_interval(
          client,
          model="gemini-2.0-flash-thinking-exp-01-21",
          contents=full_prompt,
          config={"tools": [{"google_search": {}}]}, # Enable Google Search grounding
          interval=4,  # Wait 4 seconds between requests
          retries=3,    # Retry up to 3 times if quota is exceeded
          backoff_factor=2  # Exponential backoff for retries
        )

        # Extract the generated answer
        generated_answers.append(response.text)

    # Save results to a CSV file
    results_df = pd.DataFrame({
        "Doc_id": sample_doc_ids,
        "Question": sample_questions,
        "Reference_Answers": sample_answers,
        "Generated_Answer": generated_answers
    })
    results_df.to_csv(output_file, index=False)
    print(f"Results saved to {output_file}")
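
The `search_queries` and `search_pages` lists in the function above are created but never filled. Below is a minimal sketch of how they might be populated, assuming the grounded response exposes `candidates[0].grounding_metadata` with `web_search_queries` and `grounding_chunks[i].web.uri`. Those attribute names are assumptions about the `google-genai` response object and should be checked against the installed SDK version; `extract_grounding` is a hypothetical helper, not part of this notebook. A stand-in object is used so the sketch runs without an API call.

```python
from types import SimpleNamespace


def extract_grounding(response):
    """Return (queries, page_uris) from a Gemini response, or two empty lists.

    Hypothetical helper: the attribute names are assumptions about the
    google-genai response object, so they are read defensively with getattr.
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return [], []
    metadata = getattr(candidates[0], "grounding_metadata", None)
    if metadata is None:
        return [], []
    queries = list(getattr(metadata, "web_search_queries", None) or [])
    uris = []
    for chunk in getattr(metadata, "grounding_chunks", None) or []:
        web = getattr(chunk, "web", None)
        uri = getattr(web, "uri", None)
        if uri:
            uris.append(uri)
    return queries, uris


# Stand-in response so the sketch can be run offline.
fake_response = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            grounding_metadata=SimpleNamespace(
                web_search_queries=["when was CMU founded"],
                grounding_chunks=[
                    SimpleNamespace(web=SimpleNamespace(uri="https://example.com/a"))
                ],
            )
        )
    ]
)
print(extract_grounding(fake_response))
```

Inside the generation loop, the two returned lists could be appended to `search_queries` and `search_pages` and written out as extra CSV columns.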

In [None]:
# Example usage
api_key = os.getenv("GEMINI_API_KEY", "your_gemini_api_key")
output_file = "./output/gemini_baseline_with_search.csv"
qa_file = "./data/annotated/QA_pairs_1.csv"
sample_size = 100

run_gemini_baseline_with_search(api_key, output_file, qa_file, sample_size)

Loading QA file: ./data/annotated/QA_pairs_1.csv
Generating answers using Google Gemini with Google Search grounding...


  0%|          | 0/100 [00:00<?, ?it/s]

100%|██████████| 100/100 [08:21<00:00,  5.02s/it]

Results saved to ./output/gemini_baseline_with_search.csv





Adding Google Search grounding to Gemini did not improve accuracy.
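
The notebook does not show how accuracy is computed here. A minimal sketch of one common way to score short answers, assuming SQuAD-style normalized exact match and token-level F1 (an assumption, not necessarily the metric used for this comparison). The function names are hypothetical and the pairs below are toy data, not results from these runs.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = str(text).lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_pairs(pairs):
    """Average exact match and F1 over (generated, reference) pairs."""
    pairs = list(pairs)
    em = sum(exact_match(g, r) for g, r in pairs) / len(pairs)
    f1 = sum(token_f1(g, r) for g, r in pairs) / len(pairs)
    return em, f1


# Toy pairs for illustration only.
toy = [("William Pitt", "William Pitt"), ("In 1900.", "1900")]
print(score_pairs(toy))
```

To compare the two runs, each saved CSV could be loaded with `pd.read_csv` and scored with `score_pairs(zip(df["Generated_Answer"], df["Reference_Answers"]))`. If a reference cell holds several acceptable answers, it would need to be split first and the best score per question kept.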

### gemini-2.0-flash-thinking-exp-01-21 and gemini-2.0-flash

Run either a Llama or a Gemini model through a single `run_baseline` function.

In [None]:
import os
import pandas as pd
import random
from tqdm import tqdm
from google import genai
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from huggingface_hub import login


def call_gemini_with_interval(client, model, contents, config, interval=20, retries=3, backoff_factor=2):
    """
    Call Google Gemini API with a fixed interval between requests and retry logic for handling errors.

    Args:
        client (genai.Client): The Google Gemini client.
        model (str): The model name (e.g., "gemini-2.0-flash").
        contents (str): The input prompt or question.
        config (dict): Configuration for the API call (e.g., tools, temperature, etc.).
        interval (int): Time in seconds to wait between API requests.
        retries (int): Number of retries for handling quota exhaustion or other errors.
        backoff_factor (int): Factor for exponential backoff in case of retries.

    Returns:
        response: The response from the Google Gemini API.
    """
    for attempt in range(retries):
        try:
            # Send the API request
            response = client.models.generate_content(
                model=model,
                contents=contents,
                config=config,
            )
            print(f"Request succeeded. Waiting {interval} seconds before the next request...")
            time.sleep(interval)  # Wait for the specified interval
            return response
        except Exception as e:
            if "RESOURCE_EXHAUSTED" in str(e):
                print(f"Quota exceeded. Retrying in {backoff_factor ** attempt} seconds...")
                time.sleep(backoff_factor ** attempt)
            else:
                raise e
    raise Exception("Max retries exceeded. Quota still exhausted.")


def run_baseline(
    model_name,
    output_file,
    qa_file="./data/annotated/QA_pairs_1.csv",
    sample_size=100
):
    """
    Generate QA baseline answers using either Hugging Face LLaMA or Google Gemini and save the results to a CSV file.

    Args:
        model_name (str): Model name for LLaMA or Gemini (e.g., "gemini-2.0-flash" or "meta-llama/Llama-3.1-8B-Instruct").
        output_file (str): Path to save the generated answers.
        qa_file (str): Path to the QA annotation test set CSV file.
        sample_size (int): Number of QA pairs to sample for evaluation.
    """
    # Load QA annotation test set
    print(f"Loading QA file: {qa_file}")
    qa_df = pd.read_csv(qa_file)
    qa_df = qa_df.sample(sample_size, random_state=221)
    sample_doc_ids = qa_df["Doc_id"].tolist()
    sample_questions = qa_df["Question"].tolist()
    sample_answers = qa_df["Reference_Answers"].tolist()

    # Define the prompt template
    template = """
    You are an expert assistant answering factual questions about various aspects of Pittsburgh or Carnegie Mellon University (CMU), including history, policy, culture, events, and more.
    If you do not know the answer, just say "I don't know."

    Important Instructions:
    - Answer concisely without repeating the question.
    - Do **not** use complete sentences. Provide only the word, name, date, or phrase that directly answers the question. For example, given the question "When was Carnegie Mellon University founded?", you should only answer "1900".

    Examples:
    Question: Who is Pittsburgh named after?
    Answer: William Pitt
    Question: What famous machine learning venue had its first conference in Pittsburgh in 1980?
    Answer: ICML
    Question: What musical artist is performing at PPG Arena on October 13?
    Answer: Billie Eilish

    Question: {question} \n\n
    Answer:
    """

    # Determine the model type based on the model name
    if "gemini" in model_name.lower():
        # Initialize the Gemini client
        print("Initializing Google Gemini client...")
        client = genai.Client(api_key=os.getenv("GEMINI_API_KEY", "your_api_key_here"))

        # Generate answers using Google Gemini
        print("Generating answers using Google Gemini...")
        generated_answers = []
        for question in tqdm(sample_questions):
            full_prompt = template.format(question=question)
            response = call_gemini_with_interval(
                client,
                model=model_name,  # Use the provided Gemini model name
                contents=full_prompt,
                config={},  # {"tools": [{"google_search": {}}]},
                interval=4,  # Wait 4 seconds between requests
                retries=3,  # Retry up to 3 times if quota is exceeded
                backoff_factor=2  # Exponential backoff for retries
            )
            # Extract the generated answer
            generated_answers.append(response.text)

    elif "llama" in model_name.lower():
        # Login to Hugging Face Hub
        print("Initializing Hugging Face LLaMA model...")
        login(token=os.getenv("HUGGINGFACE_TOKEN", "your_huggingface_token_here"))

        # Load the model and tokenizer
        print(f"Loading LLaMA model: {model_name}")
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "left"

        # Initialize the text generation pipeline
        generation_pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            torch_dtype=torch.float16
        )

        # Generate answers using LLaMA
        print("Generating answers using LLaMA...")
        generated_answers = []
        for question in tqdm(sample_questions):
            full_prompt = template.format(question=question)
            messages = [
                {"role": "user", "content": full_prompt},
            ]
            output = generation_pipe(messages, max_new_tokens=50)
            # The pipeline returns the full chat; the last message is the assistant's reply
            generated_answers.append(output[0]["generated_text"][-1]["content"])

    else:
        raise ValueError("Invalid model_name. Please ensure it contains 'gemini' or 'llama'.")

    # Save results to a CSV file
    results_df = pd.DataFrame({
        "Doc_id": sample_doc_ids,
        "Question": sample_questions,
        "Reference_Answers": sample_answers,
        "Generated_Answer": generated_answers,
    })
    results_df.to_csv(output_file, index=False)
    print(f"Results saved to {output_file}")
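
`call_gemini_with_interval` waits `backoff_factor ** attempt` seconds after each quota error, so with `retries=3` and `backoff_factor=2` the waits are 1, 2, and 4 seconds. The sketch below separates that retry loop from the API call so it can be exercised without network access. `backoff_delays`, `retry_with_backoff`, `is_retryable`, and the injectable `sleep` argument are hypothetical names introduced for illustration; the simulated `flaky` function stands in for the Gemini request.

```python
import time


def backoff_delays(retries=3, backoff_factor=2):
    """Delays in seconds applied after the 1st, 2nd, ... failed attempt."""
    return [backoff_factor ** attempt for attempt in range(retries)]


def retry_with_backoff(
    func,
    retries=3,
    backoff_factor=2,
    is_retryable=lambda exc: "RESOURCE_EXHAUSTED" in str(exc),
    sleep=time.sleep,
):
    """Call func() until it succeeds or the retries are used up."""
    for delay in backoff_delays(retries, backoff_factor):
        try:
            return func()
        except Exception as exc:
            if not is_retryable(exc):
                raise  # Unrelated errors propagate immediately.
            sleep(delay)
    raise RuntimeError("Max retries exceeded. Quota still exhausted.")


# Simulated request that hits the quota twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 RESOURCE_EXHAUSTED")
    return "ok"


waited = []  # Record the delays instead of actually sleeping.
result = retry_with_backoff(flaky, sleep=waited.append)
print(result, waited)
```

Passing `sleep` as an argument keeps the default behaviour (`time.sleep`) while letting a test record the delays instead of waiting for them.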

In [None]:
run_baseline(
    model_name="meta-llama/Llama-3.2-3B-Instruct",
    output_file="./output/llama32_3B_baseline.csv",
    qa_file="./data/annotated/QA_pairs_1.csv",
    sample_size=100
)

In [4]:
run_baseline(
    model_name="gemini-2.0-flash-thinking-exp-01-21",
    output_file="./output/gemini_2_flash_thinking_baseline.csv",
    #qa_file="./data/annotated/QA_pairs_1.csv",
    #sample_size=100
)

Loading QA file: ./data/annotated/QA_pairs_1.csv
Initializing Google Gemini client...
Generating answers using Google Gemini...


  0%|          | 0/100 [00:00<?, ?it/s]



 61%|██████    | 61/100 [06:32<03:53,  5.98s/it]

Request succeeded. Waiting 4 seconds before the next request...


 62%|██████▏   | 62/100 [06:38<03:50,  6.06s/it]

Request succeeded. Waiting 4 seconds before the next request...


 63%|██████▎   | 63/100 [06:48<04:28,  7.26s/it]

Request succeeded. Waiting 4 seconds before the next request...


 64%|██████▍   | 64/100 [06:54<04:06,  6.86s/it]

Request succeeded. Waiting 4 seconds before the next request...


 65%|██████▌   | 65/100 [07:01<03:57,  6.79s/it]

Request succeeded. Waiting 4 seconds before the next request...


 66%|██████▌   | 66/100 [07:06<03:35,  6.33s/it]

Request succeeded. Waiting 4 seconds before the next request...


 67%|██████▋   | 67/100 [07:13<03:36,  6.57s/it]

Request succeeded. Waiting 4 seconds before the next request...


 68%|██████▊   | 68/100 [07:22<03:55,  7.36s/it]

Request succeeded. Waiting 4 seconds before the next request...


 69%|██████▉   | 69/100 [07:27<03:26,  6.65s/it]

Request succeeded. Waiting 4 seconds before the next request...


 70%|███████   | 70/100 [07:32<03:03,  6.10s/it]

Request succeeded. Waiting 4 seconds before the next request...


 71%|███████   | 71/100 [07:44<03:50,  7.94s/it]

Request succeeded. Waiting 4 seconds before the next request...


 72%|███████▏  | 72/100 [07:49<03:16,  7.01s/it]

Request succeeded. Waiting 4 seconds before the next request...


 73%|███████▎  | 73/100 [07:58<03:23,  7.52s/it]

Request succeeded. Waiting 4 seconds before the next request...


 74%|███████▍  | 74/100 [08:03<02:55,  6.74s/it]

Request succeeded. Waiting 4 seconds before the next request...


 75%|███████▌  | 75/100 [08:07<02:33,  6.13s/it]

Request succeeded. Waiting 4 seconds before the next request...


 76%|███████▌  | 76/100 [08:15<02:35,  6.47s/it]

Request succeeded. Waiting 4 seconds before the next request...


 77%|███████▋  | 77/100 [08:24<02:50,  7.42s/it]

Request succeeded. Waiting 4 seconds before the next request...


 78%|███████▊  | 78/100 [08:29<02:28,  6.73s/it]

Request succeeded. Waiting 4 seconds before the next request...


 79%|███████▉  | 79/100 [08:36<02:20,  6.68s/it]

Request succeeded. Waiting 4 seconds before the next request...


 80%|████████  | 80/100 [08:41<02:03,  6.20s/it]

Request succeeded. Waiting 4 seconds before the next request...


 81%|████████  | 81/100 [08:48<02:05,  6.58s/it]

Request succeeded. Waiting 4 seconds before the next request...


 82%|████████▏ | 82/100 [08:54<01:51,  6.18s/it]

Request succeeded. Waiting 4 seconds before the next request...


 83%|████████▎ | 83/100 [08:59<01:40,  5.93s/it]

Request succeeded. Waiting 4 seconds before the next request...


 84%|████████▍ | 84/100 [09:05<01:33,  5.82s/it]

Request succeeded. Waiting 4 seconds before the next request...


 85%|████████▌ | 85/100 [09:10<01:24,  5.64s/it]

Request succeeded. Waiting 4 seconds before the next request...


 86%|████████▌ | 86/100 [09:15<01:15,  5.38s/it]

Request succeeded. Waiting 4 seconds before the next request...


 87%|████████▋ | 87/100 [09:20<01:08,  5.24s/it]

Request succeeded. Waiting 4 seconds before the next request...


 88%|████████▊ | 88/100 [09:27<01:09,  5.78s/it]

Request succeeded. Waiting 4 seconds before the next request...


 89%|████████▉ | 89/100 [09:35<01:13,  6.65s/it]

Request succeeded. Waiting 4 seconds before the next request...


 90%|█████████ | 90/100 [09:41<01:04,  6.43s/it]

Request succeeded. Waiting 4 seconds before the next request...


 91%|█████████ | 91/100 [09:47<00:56,  6.32s/it]

Request succeeded. Waiting 4 seconds before the next request...


 92%|█████████▏| 92/100 [09:52<00:47,  5.88s/it]

Request succeeded. Waiting 4 seconds before the next request...


 93%|█████████▎| 93/100 [10:03<00:52,  7.46s/it]

Request succeeded. Waiting 4 seconds before the next request...


 94%|█████████▍| 94/100 [10:08<00:39,  6.64s/it]

Request succeeded. Waiting 4 seconds before the next request...


 95%|█████████▌| 95/100 [10:15<00:33,  6.73s/it]

Request succeeded. Waiting 4 seconds before the next request...


 96%|█████████▌| 96/100 [10:20<00:25,  6.30s/it]

Request succeeded. Waiting 4 seconds before the next request...


 97%|█████████▋| 97/100 [10:28<00:20,  6.85s/it]

Request succeeded. Waiting 4 seconds before the next request...


 98%|█████████▊| 98/100 [10:41<00:17,  8.60s/it]

Request succeeded. Waiting 4 seconds before the next request...


 99%|█████████▉| 99/100 [10:46<00:07,  7.52s/it]

Request succeeded. Waiting 4 seconds before the next request...


100%|██████████| 100/100 [10:51<00:00,  6.52s/it]


Results saved to ./output/gemini_2_flash_thinking_baseline.csv
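
The log above shows a fixed 4-second pause after each successful Gemini request. A minimal sketch of such a throttled loop follows. It is not the notebook's actual implementation: `run_throttled`, `ask`, and the injectable `sleep` parameter are hypothetical names introduced here for illustration, and `ask` stands in for whatever function sends one prompt to the API.

```python
import time
from typing import Callable, Iterable, List


def run_throttled(
    prompts: Iterable[str],
    ask: Callable[[str], str],
    delay_s: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call `ask` once per prompt, pausing `delay_s` seconds between requests."""
    prompts = list(prompts)
    answers = []
    for i, prompt in enumerate(prompts):
        answers.append(ask(prompt))
        # Assumption: skip the pause after the final request, since nothing follows it.
        if i < len(prompts) - 1:
            print(f"Request succeeded. Waiting {delay_s:g} seconds before the next request...")
            sleep(delay_s)
    return answers


# Usage with a stand-in `ask` and a no-op sleep so the example returns immediately.
demo = run_throttled(["q1", "q2"], ask=lambda p: p.upper(), sleep=lambda s: None)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in a real run the default `time.sleep` applies.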


In [None]:
# Define the models and their output file names
models = [
    ("meta-llama/Llama-3.2-3B-Instruct", "./output/llama32_3B_baseline.csv"),
    ("meta-llama/Llama-3.1-8B-Instruct", "./output/llama31_8B_baseline.csv"),
    ("gemini-2.0-flash", "./output/gemini_2_flash_baseline.csv"),
    ("gemini-2.0-flash-thinking-exp-01-21", "./output/gemini_2_flash_thinking_baseline.csv"),
]

# Run the baseline for each model
for model_name, output_file in models:
    print(f"Running baseline for model: {model_name}")
    run_baseline(
        model_name=model_name,
        output_file=output_file,
        # Optional overrides, left commented out so the function's defaults apply:
        # qa_file="./data/annotated/QA_pairs_1.csv",
        # sample_size=100,
    )
    print(f"Completed baseline for model: {model_name}\n")

Running baseline for model: meta-llama/Llama-3.2-3B-Instruct
Loading QA file: ./data/annotated/QA_pairs_1.csv
Initializing Hugging Face LLaMA model...
Loading LLaMA model: meta-llama/Llama-3.2-3B-Instruct



Device set to use cuda:0


Generating answers using LLaMA...


 10%|█         | 10/100 [00:03<00:18,  4.78it/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 100/100 [00:21<00:00,  4.60it/s]


Results saved to ./output/llama32_3B_baseline.csv
Completed baseline for model: meta-llama/Llama-3.2-3B-Instruct

Running baseline for model: meta-llama/Llama-3.1-8B-Instruct
Loading QA file: ./data/annotated/QA_pairs_1.csv
Initializing Hugging Face LLaMA model...
Loading LLaMA model: meta-llama/Llama-3.1-8B-Instruct


Device set to use cuda:0


Generating answers using LLaMA...


100%|██████████| 100/100 [00:21<00:00,  4.55it/s]


Results saved to ./output/llama31_8B_baseline.csv
Completed baseline for model: meta-llama/Llama-3.1-8B-Instruct
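
The pipeline warning in the Llama-3.2-3B run ("You seem to be using the pipelines sequentially on GPU") appears when prompts are passed one at a time. One possible remedy is to send prompts in batches. The sketch below shows only the batching logic. `chunked`, `generate_in_batches`, and `generate_batch` are hypothetical names, and `generate_batch` is a stand-in for a call that accepts a list of prompts (a Hugging Face pipeline can be called with a list of inputs and a `batch_size` argument). Whether batching actually speeds up this workload was not measured here.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def generate_in_batches(
    prompts: Sequence[str],
    generate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Run `generate_batch` on each chunk and return the answers in prompt order."""
    answers: List[str] = []
    for batch in chunked(prompts, batch_size):
        answers.extend(generate_batch(batch))
    return answers


# Usage with a stand-in generator that echoes each prompt.
echo = generate_in_batches(["q1", "q2", "q3"], lambda batch: [p + "?" for p in batch], batch_size=2)
```

Batched generation with a decoder-only model relies on the left-padding tokenizer setup already used in this notebook (`padding_side = "left"` with a defined `pad_token`).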

