# Question Generator - AP Art History

Authors: Oliver Long and Shaamil Karim


Description: In this notebook, we evaluate how well different LLMs generate questions in the style of the AP Art History exam. We use a dataset spanning the last 25 years that pairs each artwork with the questions relevant to it. We collected 688 questions and release the dataset for open use and further work [here](https://huggingface.co/datasets/shaamil101/AP_ArtHistory_Questions_With_Choices) through Hugging Face. The Hugging Face dataset also includes the multiple-choice options for each question, but this notebook uses only the artworks and the list of questions relevant to each artwork.

Additionally, we split the dataset into train and test sets and fine-tune a Llama 1B model on the training data.

We then use the Gemini 2.0 Flash model to judge the relevancy of the generated questions by giving it each generated question set together with the ground truth. We tally these scores at the end and find that the Llama 1B model outperforms models with 22x as many parameters.

# Setting up packages and API keys

In [None]:
!pip install langchain_dartmouth
!pip install datasets
!pip install bitsandbytes
!pip install accelerate
!pip install transformers
!pip install pandas
!pip install torch
!pip install peft
# Follow instructions here to get API keys: https://pypi.org/project/langchain-dartmouth/


In [None]:
import os
from getpass import getpass

# Never hard-code credentials in a shared notebook; prompt for them at runtime instead.
os.environ['DARTMOUTH_CHAT_API_KEY'] = getpass('DARTMOUTH_CHAT_API_KEY: ')
os.environ['DARTMOUTH_API_KEY'] = getpass('DARTMOUTH_API_KEY: ')

In [None]:
!pip install --upgrade huggingface_hub
from huggingface_hub import notebook_login

notebook_login()

Collecting huggingface_hub
  Downloading huggingface_hub-0.29.2-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.29.2-py3-none-any.whl (468 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.26.3
    Uninstalling huggingface-hub-0.26.3:
      Successfully uninstalled huggingface-hub-0.26.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-dartmouth 0.2.13 requires huggingface-hub==0.26.3, but you have huggingface-hub 0.29.2 which is incompatible.
Successfully installed huggingface_hub-0.29.2



# Preparing the dataset

In [None]:
import pandas as pd

df = pd.read_csv('final_df_with_new_descriptions.csv')

In [None]:
# Format data for instruction fine-tuning
from datasets import Dataset

def format_instruction(row):
    return {
        'instruction':"""Using the artwork information provided below, generate five multiple-choice style questions that are relevant to the artwork.

              Return a list of the five questions alone without the options, style of question or other information.
            Ensure the questions reflect the nuances of the artwork's background, and align with AP art history question standards.""",

        'input': f"""Artwork title: {row['artwork_title']}
              Artwork description: {row['artwork_description']}""",
        'output': row['question_prompt']
    }

formatted_data = df.apply(format_instruction, axis=1).tolist()
dataset = Dataset.from_pandas(pd.DataFrame(formatted_data))

# Split into train and evaluation sets
dataset = dataset.train_test_split(test_size=0.2)



In [None]:
test_df = pd.read_csv('final_test_df.csv')

Unnamed: 0,instruction,input,output
0,"Using the artwork information provided below, ...","Artwork title: Portrait of a husband and wife,...",1. The wall painting on the right was located ...
1,"Using the artwork information provided below, ...",Artwork title: Mérode Altarpiece\n ...,1. The medium of the work on the right is\n2. ...
2,"Using the artwork information provided below, ...","Artwork title: Sarcophagus of Junius Bassus, R...",1. The sculptural work originally functioned a...
3,"Using the artwork information provided below, ...",Artwork title: The Scream\n Artwo...,1. The artist of the work on the left is\n2. T...
4,"Using the artwork information provided below, ...",Artwork title: Virgin of Jeanne d'Evreux\n ...,1. Both sculptures are from which art-historic...


# The Question Relevancy Evaluator LLM

In [None]:
test_df['FinetunedLlama_1b_judging'] = None
test_df['FinetunedLlama_1b_questions'] = None
test_df['FinetunedLlama_1b_score'] = None
test_df['Llama_1b_judging'] = None
test_df['Llama_1b_questions'] = None
test_df['Llama_1b_score'] = None

In [None]:
import json
import re
import pandas as pd
from langchain_dartmouth.llms import ChatDartmouthCloud

gemini = ChatDartmouthCloud(model_name="google_genai.gemini-2.0-flash-001")

def judge(row, question_set_col, ground_truth_col):
    prompt = f"""You are an evaluator tasked with assessing the relevance of the question. Read through each question in the question set and determine if it is relevant to the ground truth.
Instructions:

Assess Relevance:
For each question in the set, determine if it is relevant to at least one question in the ground truth.

Score Each Question:
Assign a binary score for each question:
1: If the question is relevant.
0: If the question is not relevant.


Question set: {row[question_set_col]}
Ground truth set: {row[ground_truth_col]}

Stick to the strict json format below.
{{
  "reasoning": "Question 1: Is relevant because... Question 2: Is not relevant because... Question 3: Is relevant because... Question 4: Is not relevant because... Question 5: Is not relevant because...",
  "final_answer": "2"
}}

The "reasoning" tag is your quick and brief reasoning behind the list of questions that are deemed relevant.
The "final_answer" tag is your final aggregated score (the sum of the binary scores of the five questions) for the set.

Be strict about how relevant the question set is to the ground truth. Ensure you follow the json format and no other text.
"""
    response = gemini.invoke(prompt)
    json_str = re.sub(r'^```json\n|```$', '', response.content).strip()
    result = json.loads(json_str)
    return pd.Series(result)
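
The parsing step at the end of `judge` assumes the reply is either bare JSON or JSON wrapped in a json code fence. A slightly more defensive variant is sketched below. The helper name `parse_judge_response` and the 0 to 5 range check are assumptions added for illustration and are not part of the original pipeline.

```python
import json
import re


def parse_judge_response(text: str, max_score: int = 5) -> dict:
    """Extract the judge's JSON object from a raw reply and validate the score."""
    # Grab the outermost {...} span so code fences or stray prose are ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    result = json.loads(match.group(0))

    # The prompt asks for a string such as "2"; an int is accepted as well.
    score = int(result["final_answer"])
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} outside 0..{max_score}")
    return {"reasoning": str(result["reasoning"]), "final_answer": score}


# Hypothetical reply, wrapped in a code fence the way chat models often do.
fence = "`" * 3
reply = (
    fence + "json\n"
    '{"reasoning": "Question 1: Is relevant because...", "final_answer": "2"}\n'
    + fence
)
parsed = parse_judge_response(reply)
print(parsed["final_answer"])
```

Returning the score as an `int` would also make the later `.astype(int)` call unnecessary, although keeping that call is harmless.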

# Base Model - Llama 1b

In [None]:
test_df = pd.read_csv('final_test_df_with_questions_and_judging (2).csv')

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", trust_remote_code=True, device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
def base_generate_questions(row):
    # Wrap the prompt in the same [INST] ... [/INST] markers used during training,
    # plus an explicit reminder to return only the questions
    prompt = f"<s>[INST] {row['instruction']}.  {row['input']}. Return only the list of five questions and nothing else[/INST]"

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)

    # Generate the response
    with torch.no_grad():
        outputs = base_model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract just the generated part (after the instruction)
    response = generated_text.split("[/INST]")[-1].strip()
    print(f"Generated response for row {row.name}")
    return response
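
The generation function returns the model's reply as one block of text. If individual questions are needed later (for example, to inspect them one by one), the numbered list can be split with a regular expression. This is a minimal sketch: `split_numbered_questions` is a hypothetical helper, and it assumes the model numbers its questions as `1.` or `1)` at the start of a line.

```python
import re


def split_numbered_questions(text: str) -> list[str]:
    """Split a reply such as '1. ...\\n2. ...' into a list of question strings."""
    # Match "1." or "1)" at the start of a line and capture everything up to
    # the next numbered item or the end of the text.
    pattern = re.compile(
        r"^\s*\d+[.)]\s*(.+?)(?=^\s*\d+[.)]\s|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    # Collapse internal line breaks and repeated spaces in each question.
    return [" ".join(chunk.split()) for chunk in pattern.findall(text)]


sample = (
    "Here are the questions:\n"
    "1. The medium of the work on the right is\n"
    "2) The work was\n"
    "   commissioned by\n"
)
questions = split_numbered_questions(sample)
print(questions)
```

Any preamble before the first numbered item is ignored, and a reply with no numbered items yields an empty list.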

In [None]:
import torch
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Generating the questions for Llama 1b
test_df['Llama_1b_questions'] = test_df.apply(base_generate_questions, axis=1)


Generated response for row 0
Generated response for row 1
Generated response for row 2
Generated response for row 3
Generated response for row 4
Generated response for row 5
Generated response for row 6
Generated response for row 7
Generated response for row 8
Generated response for row 9
Generated response for row 10
Generated response for row 11
Generated response for row 12
Generated response for row 13
Generated response for row 14
Generated response for row 15
Generated response for row 16
Generated response for row 17
Generated response for row 18
Generated response for row 19
Generated response for row 20
Generated response for row 21


In [None]:
# Passing through the questions to the judge LLM

test_df[['Llama_1b_judging', 'Llama_1b_score']] = test_df.apply(
    lambda row: judge(row, 'Llama_1b_questions', 'output'), axis=1)
column_totals = test_df['Llama_1b_score'].astype(int).sum()
print(column_totals)

# Finetuning Llama 1b

In [None]:
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", trust_remote_code=True, device_map="auto")


In [None]:

# 3. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 5. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# 6. Apply LoRA to the model
model = get_peft_model(model, lora_config)

# 7. Print trainable parameters to verify setup
model.print_trainable_parameters()
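
As a rough cross-check on the trainable-parameter count, the number of LoRA weights can be worked out by hand: each adapted linear layer of shape `d_out x d_in` gains two small matrices with `r * (d_in + d_out)` entries in total. The layer shapes below are assumptions about Llama-3.2-1B that are not stated in this notebook, so verify them against the model's `config.json` before relying on the result.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a d_out x d_in weight."""
    return r * (d_in + d_out)


# Assumed Llama-3.2-1B shapes (hypothetical here; check config.json).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 2048, 8192, 512, 16
R = 16  # matches r=16 in the LoraConfig above

# (d_in, d_out) for every module listed in target_modules.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"{per_layer:,} LoRA parameters per layer, {total:,} in total")
```

Under these assumed shapes the adapters add about 11.3 million parameters, a small fraction of the base model's size.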

In [None]:

# 8. Data collator for instruction fine-tuning
def data_collator(features):
    # Format examples for Llama instruction format
    formatted_examples = []

    for f in features:
        # Format as instruction following Llama's template
        prompt = f"<s>[INST] {f['instruction']}\n\n{f['input']} [/INST]"
        response = f"{f['output']}</s>"
        formatted_examples.append(prompt + response)

    # Tokenize
    batch = tokenizer(
        formatted_examples,
        padding="longest",
        max_length=1024,
        truncation=True,
        return_tensors="pt"
    )

    # Create labels: -100 for prompt tokens and padding so neither contributes to the loss
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100

    for i in range(len(formatted_examples)):
        # Find where the response starts in each example
        prompt = f"<s>[INST] {features[i]['instruction']}\n\n{features[i]['input']} [/INST]"
        # Tokenize with the same special-token settings as the batch so the lengths line up
        prompt_len = len(tokenizer(prompt)["input_ids"])
        labels[i, :prompt_len] = -100

    batch["labels"] = labels
    return batch

# 9. Training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-1b-ap-history",
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    save_strategy="epoch",
    eval_strategy="epoch",  # called evaluation_strategy in transformers releases before 4.41
    logging_dir="./logs",
    logging_steps=10,
    fp16=True,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    report_to="tensorboard",
    remove_unused_columns=False,
)
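
The label-masking idea in the data collator can be illustrated without any tensors. Positions labelled `-100` are skipped by PyTorch's cross-entropy loss by default, so only the response tokens drive the gradient. The sketch below uses plain lists and made-up token ids, and it masks padding positions as well as the prompt, which is a common extra step. `mask_labels` is a hypothetical helper, not part of the training code.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids into labels, hiding prompt and padding positions."""
    labels = list(input_ids)
    for pos in range(len(labels)):
        is_prompt = pos < prompt_len
        is_padding = attention_mask[pos] == 0
        if is_prompt or is_padding:
            labels[pos] = IGNORE_INDEX
    return labels


# Made-up example: 4 prompt tokens, 3 response tokens, 1 padding token.
input_ids = [101, 7, 8, 9, 42, 43, 2, 2]
attention_mask = [1, 1, 1, 1, 1, 1, 1, 0]
labels = mask_labels(input_ids, attention_mask, prompt_len=4)
print(labels)
```

Only the three response positions keep their token ids, so the loss is computed on the generated questions alone.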

In [None]:

# 10. Initialize trainer and train
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

# 11. Save the fine-tuned model (LoRA weights only)
model.save_pretrained("./final-llama-ap-history-lora")
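
With `per_device_train_batch_size=2` and `gradient_accumulation_steps=8`, each optimizer update sees 16 examples on a single device. The sketch below works through that arithmetic and gives a rough estimate of the number of updates. The training-set size of 88 is a hypothetical figure chosen only for illustration, and the exact step count reported by `Trainer` may differ slightly because of how it rounds partial batches.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_optimizer_steps(num_examples, per_device_batch, grad_accum_steps,
                           epochs, num_devices=1):
    """Rough estimate of total optimizer updates over all epochs."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_examples / batch) * epochs


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=8)
steps = approx_optimizer_steps(num_examples=88, per_device_batch=2,
                               grad_accum_steps=8, epochs=3)
print(batch, steps)
```

With so few updates, `logging_steps=10` produces only a handful of log entries, which is worth keeping in mind when reading the training curves.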

In [None]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

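# Load a fresh base model and attach the saved LoRA adapter weights,
# rather than re-wrapping the model that is already LoRA-wrapped
adapter_path = "./final-llama-ap-history-lora"
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")
model = PeftModel.from_pretrained(base, adapter_path)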

# Merge adapter with base model for faster inference (optional)
# model = model.merge_and_unload()

In [None]:
def generate_questions(row):
    # Wrap the prompt in the same [INST] ... [/INST] markers used during training,
    # plus an explicit reminder to return only the questions
    prompt = f"<s>[INST] {row['instruction']}.  {row['input']}. Return only the list of five questions and nothing else[/INST]"

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate the response
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract just the generated part (after the instruction)
    response = generated_text.split("[/INST]")[-1].strip()
    print(f"Generated response for row {row.name}")
    return response

In [None]:
# Check if the model has the LoRA adapters
if hasattr(model, 'peft_config'):
    print("LoRA adapters are loaded!")
    print(f"Active adapters: {model.active_adapters}")
else:
    print("WARNING: Model does not have LoRA adapters loaded!")

In [None]:
# Generating questions for Finetuned Llama 1b
test_df['FinetunedLlama_1b_questions'] = test_df.apply(generate_questions, axis=1)

In [None]:
# Scoring relevancy of questions generated by Finetuned Llama 1b

test_df[['FinetunedLlama_1b_judging', 'FinetunedLlama_1b_score']] = test_df.apply(
    lambda row: judge(row, 'FinetunedLlama_1b_questions', 'output'), axis=1)
column_totals = test_df['FinetunedLlama_1b_score'].astype(int).sum()
print(column_totals)

70


# Testing other large language models

In [None]:
test_df['Llama_11b_questions'] = None
test_df['Llama_11b_score'] = None
test_df['Llama_11b_judging'] = None
test_df['Mistral_small_questions'] = None
test_df['Mistral_small_score'] = None
test_df['Mistral_small_judging'] = None
test_df['Claude_Haiku_questions'] = None
test_df['Claude_Haiku_score'] = None
test_df['Claude_Haiku_judging'] = None

In [None]:
#Generating questions using Claude Haiku

claude = ChatDartmouthCloud(model_name="anthropic.claude-3-5-haiku-20241022")

def generate_questions(row):
    prompt = f"{row['instruction']}.  {row['input']}. Return only the list of five questions and nothing else"

    response = claude.invoke(prompt)
    if isinstance(response.content, tuple):
        return "\n\n".join(response.content)
    return f"{response.content}"


test_df['Claude_Haiku_questions'] = test_df.apply(generate_questions, axis=1)

In [None]:
#Generating questions using Llama 11b vision instruct

from langchain_dartmouth.llms import ChatDartmouth

llama_11b = ChatDartmouth(model_name="llama-3-2-11b-vision-instruct")

def generate_questions(row):
    prompt = f"{row['instruction']}.  {row['input']}. Return only the list of five questions and nothing else"

    response = llama_11b.invoke(prompt)
    if isinstance(response.content, tuple):
        return "\n\n".join(response.content)
    return f"{response.content}"


test_df['Llama_11b_questions'] = test_df.apply(generate_questions, axis=1)

In [None]:
#Generating questions using Mistral small

mistral_small = ChatDartmouthCloud(model_name="mistral.mistral-small-2409")

def generate_questions(row):
    prompt = f"{row['instruction']}.  {row['input']}. Return only the list of five questions and nothing else"

    response =  mistral_small.invoke(prompt)
    if isinstance(response.content, tuple):
        return "\n\n".join(response.content)
    return f"{response.content}"


test_df['Mistral_small_questions'] = test_df.apply(generate_questions, axis=1)

In [None]:
#Judging questions generated by Mistral Small

test_df[['Mistral_small_judging', 'Mistral_small_score']] = test_df.apply(
    lambda row: judge(row, 'Mistral_small_questions', 'output'), axis=1)
column_totals = test_df['Mistral_small_score'].astype(int).sum()
print(f"Mistral small score: {column_totals}")


In [None]:
#Judging questions generated by Llama 11b vision instruct

test_df[['Llama_11b_judging', 'Llama_11b_score']] = test_df.apply(
    lambda row: judge(row, 'Llama_11b_questions', 'output'), axis=1)
column_totals = test_df['Llama_11b_score'].astype(int).sum()
print(f"Llama 11b score: {column_totals}")

In [None]:
#Judging questions generated by Claude Haiku

test_df[['Claude_Haiku_judging', 'Claude_Haiku_score']] = test_df.apply(
    lambda row: judge(row, 'Claude_Haiku_questions', 'output'), axis=1)
column_totals = test_df['Claude_Haiku_score'].astype(int).sum()
print(f"Claude Haiku score: {column_totals}")

In [None]:
test_df.to_csv('final_test_df_with_questions_and_judging.csv', index=False)
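
To compare models side by side, the per-model totals printed above can be gathered into one summary. The sketch below works on a list of row dictionaries (for example, from `test_df.to_dict("records")`) and reports each total against the maximum possible score of five per artwork. The `tally_scores` helper and the toy rows are hypothetical and do not reproduce the notebook's actual results.

```python
def tally_scores(rows, score_columns, max_per_row=5):
    """Sum each score column and report it against the maximum possible."""
    summary = {}
    for col in score_columns:
        # Skip rows where the judge produced no score for this model.
        scores = [int(row[col]) for row in rows if row.get(col) not in (None, "")]
        total = sum(scores)
        possible = max_per_row * len(scores)
        summary[col] = {
            "total": total,
            "possible": possible,
            "fraction": total / possible if possible else 0.0,
        }
    return summary


# Toy rows with made-up scores, only to show the shape of the output.
toy_rows = [
    {"Llama_1b_score": "2", "FinetunedLlama_1b_score": "4"},
    {"Llama_1b_score": "1", "FinetunedLlama_1b_score": "3"},
]
summary = tally_scores(toy_rows, ["Llama_1b_score", "FinetunedLlama_1b_score"])
for name, stats in summary.items():
    print(f"{name}: {stats['total']}/{stats['possible']} ({stats['fraction']:.0%})")
```

Reporting the fraction alongside the raw total makes the comparison robust if some rows fail to parse and are skipped.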