# TODO Table of Contents

# Google Colab Setup
The following technical setup is required to run Exercises 1, 2, and 3 when the notebook runs on Google Colab or Kaggle.

The fine-tuning part has its own additional setup later in the notebook.

In [None]:
# additional colab/kaggle setup

# Importing the sys module to access system-specific parameters and functions
import sys
# Importing the os module to interact with the operating system
import os

# Downloads the data from the GitHub repository of this notebook
def download_data():
    !git clone https://github.com/tobihol/aapor-finetuning.git
    %mv aapor-finetuning/data/ .
    %rm -rf aapor-finetuning/
    return

# Checks if the code is running in a Kaggle environment
def is_running_in_kaggle():
    return 'KAGGLE_KERNEL_RUN_TYPE' in os.environ

# Checks if the code is running in a Google Colab environment
def is_running_in_colab():
    return "google.colab" in sys.modules

# Check if running in Colab or Kaggle and download data accordingly
if is_running_in_colab() or is_running_in_kaggle():
    print("Running on Colab/Kaggle")
    download_data()
else:
    print("Not running in Colab/Kaggle")

# Exercise 1: Model Selection

Choose the right model based on your use case.

Before selecting a model for your task, it’s important to reflect on your goals, resources, and technical environment. Here are key questions to guide your decision:

Do you want to run the model on your own hardware?
- Open-source models (like LLaMA, Mistral, or Qwen) can often be downloaded and run locally. This gives you more control over data privacy and fine-tuning, but also requires sufficient computing power (especially GPU memory) and setup effort.

Do you have funds to use a proprietary model?
- Commercial models like OpenAI’s GPT-4, Anthropic’s Claude, or Google’s Gemini offer strong performance and easy-to-use APIs, but they come with usage costs. This can be worthwhile if you want strong performance without managing infrastructure.

Do you have strong programming skills?
- Open-source models typically require setting up environments (e.g., using PyTorch and Hugging Face Transformers), handling tokenization, managing memory usage, and sometimes fine-tuning. If you’re comfortable with Python and command-line tools, these are manageable. If not, hosted APIs might be easier to work with.

Think about your use case and pick a model. 

**Resources**

The following resources give an overview of available models and how they compare on different metrics of performance and cost.

- The popular benchmark **Chatbot Arena** _(Chiang et al., 2024)_ provides a continuously updated Elo score for each model, based on crowd-sourced pairwise comparisons of model responses.
    - Trade-off Plot: https://lmarena.ai/price
- The **Stanford HELM Benchmark** evaluates language models on over 50 tasks, including understanding and reasoning. It provides insights into model performance in terms of accuracy, fairness, and robustness.
    - https://crfm.stanford.edu/helm/
- The **Artificial Analysis** platform provides independent analysis of AI models and API providers. The main metrics are 'intelligence', speed, and price. 
    - https://artificialanalysis.ai/
- The **Open LLM Leaderboard** on Hugging Face runs multiple well-established benchmarks on open-source models hosted on the site. It is most useful for comparing smaller open-source LLMs that are not featured on Chatbot Arena, or for comparing different fine-tunes of a model.
    - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/

**Notes**

Feel free to make notes here, etc.

**What we are picking**

Our use case is to predict vote choices in the US presidential election based on ANES survey answers. We want to fine-tune a model for _free_ and therefore do not have a lot of computing resources (the Google Colab free tier gives access to one T4 GPU). We therefore use the smallest model in the Llama 3 family, **Llama 3.2 1B**, as our model of choice.


# Exercise 2: Excel to JSON

What would the corresponding JSON entry look like?

|   Age | Gender   | State     | Vote Choice   |
|------:|:---------|:----------|:--------------|
|    27 | female   | Louisiana | Trump         |
|    45 | male     | nan       | Clinton       |
|   nan | female   | Ohio      | Non-voter     |


**Example**

Excel
|   Age | Gender | Education | Occupation |
|------:|:-------|:----------|:-----------|
|    30 | male   | bachelor  | accountant |
|    24 | female | master    | engineer   |

JSON
```json
{
    "participants": [
        {
            "age": 30,
            "gender": "male",
            "education": "bachelor",
            "occupation": "accountant"
        },
        {
            "age": 24,
            "gender": "female",
            "education": "master",
            "occupation": "engineer"
        }
    ]
}
```

**Extra Question:**
Are there different ways to construct the JSON?
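
As one way to think about the extra question, the sketch below encodes the example table above (not the exercise table) in two layouts using Python's standard `json` module: a list of row objects and an object of columns. The variable names are illustrative assumptions and not part of the workshop data.

```python
import json

# Rows of the example table above (not the exercise table).
rows = [
    {"age": 30, "gender": "male", "education": "bachelor", "occupation": "accountant"},
    {"age": 24, "gender": "female", "education": "master", "occupation": "engineer"},
]

# Layout 1: row-oriented, one object per participant.
row_oriented = {"participants": rows}

# Layout 2: column-oriented, one list of values per column.
column_oriented = {key: [row[key] for row in rows] for key in rows[0]}

print(json.dumps(row_oriented, indent=4))
print(json.dumps(column_oriented, indent=4))
```

Both layouts carry the same information. A missing cell can be written as Python `None`, which `json.dumps` serializes as `null`.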

In [22]:
# type up your answer here

json_data = {
    "participants": [
        # TODO
    ]
}

In [None]:
# In practice this would be done automatically;
# here are two different examples of how to do so

import pandas as pd

data = pd.read_excel("data/json_task.xlsx", index_col=0)

print("--- Records Orientation ---")
print(data.to_json(orient="records"))
print()

print("--- Columns Orientation ---")
print(data.to_json(orient="columns"))
print()

# Exercise 3: JSON Prompt

Build a prompt for your task.

**Questions to consider:**
- What should the output look like?
- Should the model act as an expert in a certain field? Who does it substitute?


**Example use-case:**
- Predicting vote choice in the US presidential election based on ANES survey answers

**Example Conversion**

SURVEY DATA
```json
{
    "participants": [
        {
            "age": 30,
            "gender": "male",
            "education": "bachelor",
            "occupation": "accountant"
        }
    ]
}
```

PROMPT
```json
{
    "prompt": [
        {
            "role": "system",
            "content": "You are a survey participant. Reply to the user's question with a short answer."
        },
        {
            "role": "user",
            "content": "What is your age?"
        },
        {
            "role": "assistant",
            "content": "I am 30 years old."
        },
        {
            "role": "user",
            "content": "What is your gender?"
        },
        {
            "role": "assistant",
            "content": "I am a male."
        },
        {
            "role": "user",
            "content": "What is your occupation?"
        },
        {
            "role": "assistant",
            "content": "I am an accountant."
        }
    ]
}
```
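
The conversion above can also be done programmatically. The following is a minimal sketch, assuming a hypothetical mapping from survey fields to question wording (`QUESTIONS`) and a hypothetical helper `record_to_messages`; neither is part of the workshop code. Unlike the hand-written example, it uses the raw values as answers instead of full sentences.

```python
# Hypothetical question wording for each survey field (an assumption for illustration).
QUESTIONS = {
    "age": "What is your age?",
    "gender": "What is your gender?",
    "occupation": "What is your occupation?",
}

SYSTEM_PROMPT = (
    "You are a survey participant. "
    "Reply to the user's question with a short answer."
)


def record_to_messages(record: dict, questions: dict = QUESTIONS) -> list[dict]:
    """Turn one participant record into a list of chat messages."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for field, question in questions.items():
        if field not in record:
            continue  # skip fields that are absent from the record
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": str(record[field])})
    return messages


participant = {"age": 30, "gender": "male", "education": "bachelor", "occupation": "accountant"}
example_prompt = {"prompt": record_to_messages(participant)}
for message in example_prompt["prompt"]:
    print(message)
```

Fields without an entry in `QUESTIONS` (here `education`) are left out of the conversation, which mirrors the hand-written example.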

In [30]:
# type up your answer here

json_prompt = {
    "prompt": [
        # TODO
    ]
}

We can test our prompt with a small open-source model.

In [28]:
from transformers import pipeline

# id of a model hosted on Hugging Face
model_id = "unsloth/Llama-3.2-1B-Instruct"

# Create a pipeline for an instruct model using the json_prompt as input
instruct_pipeline = pipeline("text-generation", model=model_id, device_map="auto")

In [None]:
# Generate a response using the instruct model pipeline
responses = instruct_pipeline(json_prompt["prompt"], num_return_sequences=10)

# Show the last message (the model's reply) of each of the 10 generated conversations
print("10 different generated responses:")
[resp["generated_text"][-1] for resp in responses]

# Fine-tuning Pipeline
In the following, we go through the steps required to set up a fine-tuning pipeline to impute missing survey data.

In [None]:
# additional colab/kaggle setup
import sys
import os

def install_dependencies():
    import torch
    if not torch.cuda.is_available():
        print("CUDA is not available. \nPick a GPU before running this notebook. \nGo to 'Runtime' -> 'Change runtime type' to do this. (Colab)")
        return
    %pip install bitsandbytes
    %pip install accelerate
    %pip install transformers
    %pip install datasets
    %pip install evaluate
    %pip install peft
    %pip install trl
    %pip install scikit-learn
    %pip install wandb
    return



def is_running_in_kaggle():
    return 'KAGGLE_KERNEL_RUN_TYPE' in os.environ

def is_running_in_colab():
    return "google.colab" in sys.modules

if is_running_in_colab() or is_running_in_kaggle():
    print("Running on Colab/Kaggle")
    install_dependencies()
else:
    print("Not running in Colab/Kaggle")

In [1]:
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict

seed = 24 # Please set your own favorite seed!
transformers.set_seed(seed)

## Data preparation
We use survey data from the 2016 American National Election Studies (ANES), specifically the subset that Argyle et al. (2022) used in their Study 2 (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JPV20K).

In [None]:
df_survey = pd.read_csv("data/2016_anes_argyle.csv")
df_survey

In [None]:
# null values and data types
df_survey.info()

In [4]:
features = [
    "race",
    "discuss_politics",
    "ideology",
    "party",
    "church_goer",
    "age",
    "gender",
    "political_interest",
    "patriotism",
    "state",
]
label = "ground_truth" # this is the vote choice in this dataset

In [None]:
# we treat missing values as a category
df_survey_processed = (
    df_survey
    .astype({"age": str})
    .fillna("missing")
)
df_survey_processed

### Train/Test split

Any manipulation of the training data should be done in this step.

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_survey_processed, test_size=0.2, random_state=seed)

# we can modify the training data here to do different experiments
# for example excluding republican voters
# leans_republican = df_train["party"].apply(lambda x: "Republican" in x)
# df_train = df_train[~leans_republican]

dataset = DatasetDict({
    "train": Dataset.from_pandas(df_train, preserve_index=False),
    "test": Dataset.from_pandas(df_test, preserve_index=False),
})
dataset

### Prompt Design

We will use an instruction-tuned model and therefore define an instruction prompt here. Each example has distinct `system`, `user`, and `assistant` messages: the `system` message carries the task instruction, the `user` message contains the survey answers that condition the model, and the `assistant` message is the expected answer.

In [None]:
instruction = (
    "Please perform a classification task. "
    "Given the 2016 survey answers from the American National Election Studies, "
    "return which candidate the person voted for. "
    "Return a label from ['Trump', 'Clinton', 'Non-voter'] only without any other text.\n"
)
print(instruction)

In [None]:
column_name_map = {
    "race": "Race",
    "discuss_politics": "Discusses politics",
    "ideology": "Ideology",
    "party": "Party",
    "church_goer": "Church",
    "age": "Age",
    "gender": "Gender",
    "political_interest": "Political interest",
    "patriotism": "American Flag",
    "state": "State",
    "ground_truth": "Vote",
}


def build_prompt_completion(
    row: dict,
    system_prompt: str = instruction,
) -> dict[str, list[dict]]:
    user_prompt = "\n".join(
        [f"{column_name_map[k]}: {v}" for k, v in row.items() if k != label]
    )
    assistant_prompt = row[label]
    return {
        "prompt": [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            },
        ],
        "completion": [
            {
                "role": "assistant",
                "content": assistant_prompt,
            },
        ],
    }


build_prompt_completion(
    row=dataset["train"][0],
)

In [None]:
dataset_llm = dataset.map(build_prompt_completion).remove_columns(features+[label])
dataset_llm

## Fine-tuning the model

### Model Setup

We use a very small (1B parameters) model here for demonstration purposes.

In [10]:
# model selection
model_id = "unsloth/Llama-3.2-1B-Instruct"

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    # revision=revision, # NOTE: revision should be set for a reproducible experiment
    trust_remote_code=True,
)

### Quantization
Quantization reduces the memory required to store the model (Dettmers et al., 2022). Model weights are typically stored in 16-bit precision, so a 70B-parameter model needs:

$$\frac{16 \text{ bits}}{8 \text{ bits/byte}} \times 70 \times 10^9 \text{ parameters} = 140 \text{ GB of VRAM}$$

With 4-bit quantization, all parameters are stored in 4-bit precision, reducing the memory requirement to:

$$\frac{4 \text{ bits}}{8 \text{ bits/byte}} \times 70 \times 10^9 \text{ parameters} = 35 \text{ GB of VRAM}$$
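
The same arithmetic can be written as a small helper. This is a rough sketch that counts the stored weights only; the function name `weight_memory_gb` is an assumption for illustration, and the nominal parameter counts (70B, 1B) are round numbers rather than exact model sizes.

```python
def weight_memory_gb(n_parameters: float, bits_per_parameter: int) -> float:
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    bytes_per_parameter = bits_per_parameter / 8
    return n_parameters * bytes_per_parameter / 1e9


# Nominal model sizes; real parameter counts differ slightly.
for name, n_params in [("70B", 70e9), ("1B", 1e9)]:
    for bits in (16, 4):
        print(f"{name} model at {bits}-bit: {weight_memory_gb(n_params, bits):.1f} GB")
```

Training needs more memory than this estimate, because activations, gradients, and optimizer states are stored in addition to the weights.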



In [None]:
# load model in 4bit
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
    ),
    trust_remote_code=True,
    device_map="auto",
)

model

### LoRA
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method (Hu et al., 2021). Instead of fine-tuning all model weights, LoRA keeps them frozen and trains only the weights of small adapter layers. This requires less memory and allows for faster fine-tuning.

![LoRA Diagram](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png) 
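
To get a feeling for the savings, the sketch below counts trainable parameters for a single linear layer with and without LoRA. The layer size is an illustrative assumption and is not read from the model configuration; the helper names are hypothetical.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_update_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


# Illustrative layer size (an assumption, not read from the model config).
d_in, d_out = 2048, 2048
rank, alpha = 8, 8  # same values as lora_rank and lora_alpha in this notebook

full = full_update_params(d_in, d_out)
lora = lora_update_params(d_in, d_out, rank)
scaling = alpha / rank  # the low-rank update B @ A is multiplied by alpha / rank

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4f}  scaling: {scaling}")
```

For this layer size, the adapter has under 1% of the parameters of the full weight matrix, which is why far fewer gradients and optimizer states need to be kept in memory.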


In [None]:
lora_rank = 8
lora_alpha = 8

lora_config = LoraConfig(
    r=lora_rank,
    lora_alpha=lora_alpha,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules="all-linear",
)

lora_config

## Metrics

For simplicity, we evaluate only the _first token_ of the LLM-generated answer. If the token with the highest predicted probability matches the first token of the true answer, the example counts as correctly classified. We can therefore use typical classification metrics, i.e., _accuracy_ and _F1 score_.
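
As a reminder of what these two metrics measure, here is a minimal pure-Python sketch. It treats each label as a stand-in for its first token and uses made-up predictions; it is not the `evaluate` implementation used in this notebook, and the macro-averaged variant of F1 is one choice among several.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only (not results from the notebook).
y_true = ["Trump", "Clinton", "Non-voter", "Clinton"]
y_pred = ["Trump", "Clinton", "Clinton", "Clinton"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Accuracy rewards getting the frequent classes right, while macro F1 weights every class equally, so a model that never predicts `Non-voter` is penalized more strongly by macro F1.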

In [None]:
import evaluate

hf_metrics = [
    evaluate.load("accuracy"),
    # evaluate.load("f1"),
]

for metric in hf_metrics:
    print(metric.description)

## Additional Training Setup
This Python code is not covered in the workshop; it sets up some additional details for fine-tuning and evaluation.
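
One detail of the evaluation code in this section is the alignment of predictions and labels: in a causal language model, the prediction at position `t` refers to the token at position `t + 1`, and positions labelled `-100` are excluded from the metric. The sketch below illustrates this for a single sequence with made-up token ids; the helper `masked_next_token_accuracy` is hypothetical and works on plain lists rather than tensors.

```python
IGNORE_INDEX = -100  # label value marking positions that are excluded from the metric


def masked_next_token_accuracy(pred_ids: list[int], label_ids: list[int]) -> float:
    """Accuracy over the labelled positions of one sequence.

    pred_ids[t] is the model's prediction for the token at position t + 1,
    so predictions are compared with the labels shifted by one position.
    """
    shifted_preds = pred_ids[:-1]
    shifted_labels = label_ids[1:]
    scored = [(p, l) for p, l in zip(shifted_preds, shifted_labels) if l != IGNORE_INDEX]
    if not scored:
        return 0.0
    return sum(p == l for p, l in scored) / len(scored)


# Toy token ids (made up): three masked prompt tokens followed by a two-token answer.
label_ids = [-100, -100, -100, 51, 2]
pred_ids = [7, 9, 51, 2, 0]
print(masked_next_token_accuracy(pred_ids, label_ids))
```

Only the two answer tokens are scored here; the predictions made at the masked prompt positions have no effect on the result.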

In [33]:
import wandb

# set pad token id
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
if getattr(model.config, "pad_token_id") is None:
    model.config.pad_token_id = tokenizer.pad_token_id

labels = df_survey_processed.ground_truth.unique()

first_token_ids = [
    tokenizer.encode(label, add_special_tokens=False)[0]
    for label in labels
]

def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):
        # Depending on the model and config, logits may contain extra tensors,
        # like past_key_values, but logits always come first
        logits = logits[0]
    # Return highest probability token in answer options
    # return logits[:, :, first_token_ids]
    return logits.argmax(dim=-1)

y_probas = []

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # preds have the same shape as the labels, after the argmax(-1) has been calculated
    # by preprocess_logits_for_metrics but we need to shift the labels
    labels = labels[:, 1:]
    preds = preds[:, :-1]

    # -100 is a default value for ignore_index used by DataCollatorForCompletionOnlyLM
    mask = labels == -100
    labels[mask] = tokenizer.pad_token_id
    preds[mask] = tokenizer.pad_token_id

    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    results = {}
    for metric in hf_metrics:
        results |= metric.compute(predictions=preds[~mask], references=labels[~mask])

    return results

## Training the model
To run the training without wandb logging, call `wandb.init(mode='disabled')` instead of the `wandb.init(project=...)` call.

In [None]:
from trl import SFTConfig, SFTTrainer
from datetime import datetime

# key hyperparameters
learning_rate = 2e-5
batch_size = 8
epochs = 1 # NOTE you should run multiple epochs in practice

now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
dataset_name = "argyle_anes_2016"
run_name = f"{model_id}_{dataset_name}_seed_{seed}_{now}"

# wandb.init(
#     mode='disabled',
# )
wandb.init(
    project="aapor-finetuning",
    name=run_name,
)

training_args = SFTConfig(
    # training
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs, 

    # evaluation
    do_eval=True,
    eval_strategy="steps",
    eval_steps=1 / 3,  # a value below 1 is a fraction of the total training steps (here: every third of the run)

    # logging
    logging_steps=10,
    report_to="wandb",
    run_name=run_name,

    output_dir="./results",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_llm["train"],
    eval_dataset=dataset_llm["test"],
    args=training_args,
    peft_config=lora_config,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)

trainer.evaluate()
trainer.train()
trainer.evaluate()

wandb.finish()

All fine-tuning results can be found live at: https://wandb.ai/tobihol/aapor-finetuning