# Baseline model without RAG

This is a closed-book baseline for the QA task: the model answers without the RAG pipeline.

To keep the comparison fair, we use the same backbone model as the RAG pipeline: `meta-llama/Llama-3.1-8B-Instruct`. We also use the same data type (FP16), the same tokenizer configuration, and the same prompt format as the RAG pipeline.

In [3]:
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from huggingface_hub import login


model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Log in to the Hugging Face Hub
# The token is read from the HUGGINGFACE_TOKEN environment variable
login(token=os.getenv("HUGGINGFACE_TOKEN"))

# Load the pre-trained model with specified configurations
# - `torch_dtype=torch.float16`: Use half-precision floating-point (FP16) for faster computation and less memory usage
# - `device_map="auto"`: Automatically map the model to the available device (e.g., GPU if available)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Set the padding token to the end-of-sequence (EOS) token
# This is necessary for handling variable-length inputs during batch processing
tokenizer.pad_token = tokenizer.eos_token
# Set the padding side to "left" to ensure padding is added to the left side of the input
tokenizer.padding_side = "left"

# Create a text-generation pipeline using the loaded model and tokenizer
# - `torch_dtype=torch.float16`: Use FP16 for the pipeline as well
generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


In [14]:
# Load the QA test questions
import pandas as pd
# qa_df = pd.read_csv("../data/annotated/QA_pairs_1.csv")
qa_df = pd.read_csv("./data/test/test_questions.csv")

# doc_ids = qa_df["Doc_id"].tolist()
questions = qa_df["Question"].tolist()
# answers = qa_df["Reference_Answers"].tolist()

# # random sample 10 qa pairs
# import random
# sample_size = 10
# random.seed(747)
# sample_indices = random.sample(range(len(questions)), sample_size)
# sample_doc_ids = [doc_ids[i] for i in sample_indices]
# sample_questions = [questions[i] for i in sample_indices]
# sample_answers = [answers[i] for i in sample_indices]

In [15]:
# Define a template for generating answers
# The model will generate answers based on the template

template = """
You are an expert assistant answering factual questions about various aspects of Pittsburgh or Carnegie Mellon University (CMU), including history, policy, culture, events, and more.
If you do not know the answer, just say "I don't know."

Important Instructions:
- Answer concisely without repeating the question.
- Do **not** use complete sentences. Provide only the word, name, date, or phrase that directly answers the question. For example, given the question "When was Carnegie Mellon University founded?", you should only answer "1900".

Examples:
Question: Who is Pittsburgh named after?
Answer: William Pitt
Question: What famous machine learning venue had its first conference in Pittsburgh in 1980?
Answer: ICML
Question: What musical artist is performing at PPG Arena on October 13?
Answer: Billie Eilish

Question: {question} \n\n
Answer:
"""
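As a quick sanity check on how the `{question}` placeholder is filled, the sketch below renders a single prompt and wraps it in the chat format. It uses a shortened stand-in template (`demo_template`) and a hypothetical helper `build_messages`; neither is part of the notebook.

```python
# Shortened stand-in for the full template; only the structure matters here.
demo_template = """
You are an expert assistant answering factual questions about Pittsburgh or CMU.

Question: {question} \n\n
Answer:
"""


def build_messages(question: str, template: str = demo_template) -> list[dict]:
    """Fill the {question} placeholder and wrap the prompt as one user turn."""
    prompt = template.format(question=question.strip())
    return [{"role": "user", "content": prompt}]


messages = build_messages("Who is Pittsburgh named after?")
# repr() makes the whitespace visible: inside a regular (non-raw) triple-quoted
# string, "\n\n" is turned into two real newline characters.
print(repr(messages[0]["content"]))
```

Printing the `repr` shows that the question is followed by a space and three newlines (two from the escape sequences, one from the line break) before `Answer:`.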

In [16]:
# Use the template to generate answers for each question
from tqdm import tqdm

# Initialize an empty list to store the generated answers
generated_answers = []

# Loop through each question in the list with a progress bar
for question in tqdm(questions):

    # Format the template with the current question
    full_prompt = template.format(question=question)
    
    # Prepare the input message for the model
    messages = [
        {"role": "user", "content": full_prompt},
    ]

    # Generate the answer using the text-generation pipeline
    # - `max_new_tokens=50`: Limit the response to 50 tokens to keep answers concise
    output = generation_pipe(messages, max_new_tokens=50)

    # The pipeline returns the whole conversation; the last message is the assistant's reply
    generated_answers.append(output[0]["generated_text"][-1]["content"])

  2%|▏         | 10/574 [00:03<02:20,  4.02it/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 574/574 [02:16<00:00,  4.20it/s]
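The warning in the output above suggests that calling the pipeline once per question leaves GPU throughput unused. One possible alternative, sketched below, is to pass all chats in a single call with a `batch_size`. The helper `generate_batched` and the `fake_pipe` stand-in are hypothetical. The sketch assumes the pipeline returns one list of candidates per chat, each holding the full conversation under `generated_text`. The batch size of 8 is only an illustrative value; whether batching helps depends on prompt lengths and GPU memory.

```python
def generate_batched(pipe, prompts, batch_size=8, max_new_tokens=50):
    """Send all prompts to a chat-style text-generation pipeline in one call."""
    chats = [[{"role": "user", "content": p}] for p in prompts]
    outputs = pipe(chats, batch_size=batch_size, max_new_tokens=max_new_tokens)
    # Assumed output shape: one list per chat; the first candidate holds the
    # conversation, and its last message is the assistant's reply.
    return [out[0]["generated_text"][-1]["content"] for out in outputs]


def fake_pipe(chats, **kwargs):
    """Stand-in for the real pipeline so the sketch runs without a GPU."""
    return [
        [{"generated_text": chat + [{"role": "assistant", "content": "stub answer"}]}]
        for chat in chats
    ]


answers = generate_batched(fake_pipe, ["Question A", "Question B"])
print(answers)
```

With the notebook's objects, the call would look like `generate_batched(generation_pipe, [template.format(question=q) for q in questions])`. Batched generation with a decoder-only model relies on the left padding configured on the tokenizer earlier.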


In [18]:
# Write all columns to a CSV file
# results_df = pd.DataFrame({
#         "Doc_id": doc_ids,
#         "Question": questions,
#         "Reference_Answers": answers,
#         "Generated_Answer": generated_answers,
#     })

# Create a DataFrame to store the results
results_df = pd.DataFrame({
        "Question": questions,
        "Generated_Answer": generated_answers,
    })

# Save the results to a CSV file
results_df.to_csv("./output/submission/closebook_baseline.csv", index=False)

In [19]:
# Display the results DataFrame
results_df

Unnamed: 0,Question,Generated_Answer
0,"What bank, which is the 5th largest in the US,...",PNC Bank
1,How many bridges does Pittsburgh have?,46
2,Who named the city of Pittsburgh?,William Pitt
3,At what park do the three rivers converge in P...,Point State Park
4,How many neighborhoods does Pittsburgh have?,90
...,...,...
569,What is the primary focus of the event at the ...,I don't know.
570,Where and when is the Pittsburgh Veg Fair held...,I don't know
571,How can restaurants get involved with Pittsbur...,Pittsburgh Restaurant Week is a self-directed ...
572,What are the benefits of sponsoring the Pittsb...,"Increased exposure for local businesses, reven..."
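
The displayed rows show the fallback phrase both with and without a trailing period ("I don't know." and "I don't know"). A small sketch of how such answers could be counted consistently, assuming a hypothetical helper `is_abstention` and a few answers copied from the rows above:

```python
def is_abstention(answer: str) -> bool:
    """Return True if the answer is the "I don't know" fallback, ignoring case,
    surrounding whitespace, and trailing punctuation."""
    normalized = answer.strip().rstrip(".!").strip().lower()
    return normalized == "i don't know"


# A few generated answers taken from the displayed rows
sample_answers = ["PNC Bank", "46", "I don't know.", "I don't know"]
abstained = sum(is_abstention(a) for a in sample_answers)
print(f"{abstained}/{len(sample_answers)} abstentions")  # prints "2/4 abstentions"
```

Applied to the full `Generated_Answer` column, the same check would give the share of questions the closed-book model declined to answer.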
