# Exercise 1: Model Selection

Think about your use case and pick a model.

**Questions to consider:**

- Do you want to run it on your own hardware?
- Do you have funds to use for a proprietary model?
- Do you have strong programming skills?

**Example use-case:**

- Predicting vote choice in the US presidential election based on ANES survey answers.

**Resources**

- The popular benchmark **Chatbot Arena** _(Chiang et al., 2024)_ provides a live-updated Elo score for each model, based on crowdsourced pairwise comparisons of model responses.
    - Trade-off Plot: https://lmarena.ai/price
- The **Artificial Analysis** platform provides independent analysis of AI models and API providers. The main metrics are 'intelligence', speed, and price. 
    - https://artificialanalysis.ai/
- The **Open LLM Leaderboard** on Hugging Face runs multiple well-established benchmarks on open-source models hosted on the site. It is most useful for comparing smaller open-source LLMs that are not featured on Chatbot Arena, or for comparing different fine-tunes of a model.
    - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/
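
To make the Elo idea concrete, here is a minimal sketch of the classic online Elo update for a single pairwise comparison. It is an illustration only: the function names, the K-factor of 32, and the starting ratings are assumptions, and the rating procedure a leaderboard actually uses may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.5 for a tie, and 0.0 if B was preferred.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's score and expectation are the complements of A's
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# two models start with equal (hypothetical) ratings and A wins one vote
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```

The winner gains exactly what the loser gives up, and an upset against a higher-rated model moves the ratings more than an expected win.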

**Notes**

Feel free to make notes here, etc.

# Dependencies Setup

In [None]:
# additional colab/kaggle setup
import sys
import os

def install_dependencies():
    import torch
    if not torch.cuda.is_available():
        print("CUDA is not available. \nPick a GPU before running this notebook. \nGo to 'Runtime' -> 'Change runtime type' to do this. (Colab)")
        return
    %pip install numpy==1.* # lighteval is not compatible with numpy 2.0
    %pip install lighteval
    %pip install transformers
    %pip install datasets
    %pip install peft
    %pip install bitsandbytes
    %pip install evaluate
    %pip install wandb
    return



def is_running_in_kaggle():
    return 'KAGGLE_KERNEL_RUN_TYPE' in os.environ

def is_running_in_colab():
    return "google.colab" in sys.modules

if is_running_in_colab() or is_running_in_kaggle():
    print("Running on Colab/Kaggle")
    install_dependencies()
else:
    print("Not running in Colab/Kaggle")

In [None]:
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, prepare_model_for_kbit_training, get_peft_model
import torch
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict

seed = 24 # Please set your own favorite seed!
transformers.set_seed(seed)

# Data preparation
We use survey data from the 2016 American National Election Studies (ANES), specifically the subset that Argyle et al. (2022) used in their Study 2 (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JPV20K).

In [None]:
# download dataset
!curl -L -o 2016_anes_argyle.pkl https://github.com/tobihol/survai-finetuning/raw/main/2016_anes_argyle.pkl

In [None]:
df_survey = pd.read_pickle("2016_anes_argyle.pkl")
df_survey

In [None]:
# null values and data types
df_survey.info()

In [None]:
features = [
    "race",
    "discuss_politics",
    "ideology",
    "party",
    "church_goer",
    "age",
    "gender",
    "political_interest",
    "patriotism",
    "state",
]
label = "ground_truth"

In [None]:
# we treat missing values as a category
df_survey_processed = (
    df_survey
    .astype({"age": str})
    .fillna("missing")
)
df_survey_processed

### Train/Test split

Any manipulation of the training data should be done in this step.

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_survey_processed, test_size=0.2, random_state=seed)

# we can modify the training data here to do different experiments
# for example excluding republican voters
# leans_republican = df_train["party"].apply(lambda x: "Republican" in x)
# df_train = df_train[~leans_republican]

dataset = DatasetDict({
    "train": Dataset.from_pandas(df_train, preserve_index=False),
    "test": Dataset.from_pandas(df_test, preserve_index=False),
})
dataset

## Prompt Design

We will use an instruction-tuned model, so we define an instruction prompt with distinct `user` and `assistant` texts: the `user` prompt contains everything the model is conditioned on, and the `assistant` text is the expected answer.

In [None]:
instruction = (
    "Please perform a classification task. "
    "Given the 2016 survey answers from the American National Election Studies, "
    "return which candidate the person voted for. "
    "Return a label from ['Trump', 'Clinton', 'Non-voter'] only without any other text.\n"
)
print(instruction)

In [None]:
column_name_map = {
    "race": "Race",
    "discuss_politics": "Discusses politics",
    "ideology": "Ideology",
    "party": "Party",
    "church_goer": "Church",
    "age": "Age",
    "gender": "Gender",
    "political_interest": "Political interest",
    "patriotism": "American Flag",
    "state": "State",
    "ground_truth": "Vote",
}

def map_to_prompt(row):
    user_prompt = instruction
    user_prompt += "\n".join([f"{column_name_map[k]}: {v}" for k, v in row.items() if k != label])
    assistant_prompt = row[label]
    return {
        "text": user_prompt, 
        "label": assistant_prompt,
        }

map_to_prompt(dataset['train'][0])

In [None]:
print(map_to_prompt(dataset['train'][0])['text'])

In [None]:
dataset_llm = dataset.map(map_to_prompt).remove_columns(features+[label])
dataset_llm

# Fine-tuning the model

## Model Selection

### Which model should I fine-tune? 

https://huggingface.co/models

State-of-the-Art open-source model: **Llama 3 model family** *(Dubey et al., 2024)*
- Best performance, useful for testing the best performance achievable on your data
- First-party instruction-tuned models available

Research model: **Pythia model family** *(Biderman et al., 2023)*
- Openly available training data
- Multiple smaller model sizes available
- Enables testing your finetuning pipeline more efficiently
- Enables comparing the effects of model size on performance
- Easy to test for data contamination
- Drawback: May not give a good representation of what is possible with SOTA models



### Which models currently perform best?
Popular benchmarks:
- https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- https://lmarena.ai/
- https://crfm.stanford.edu/helm/
    - Imputation Benchmark: https://crfm.stanford.edu/helm/classic/latest/#/groups/entity_data_imputation

In [None]:
model_id = "unsloth/Llama-3.2-1B-Instruct"

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    # revision=revision, # NOTE: revision should be set for a reproducible experiment
    padding_side="left",
    trust_remote_code=True,
)

if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

### Problem with model versioning

A problem I encountered while implementing my pipeline with https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main:

In [None]:
# skip this if you don't have a huggingface account
if False:
    from transformers import AutoTokenizer

    chat = [
        {"role": "user", "content": "Hello world"},
        {"role": "assistant", "content": "Hello"},
    ]

    tokenizer_mistral_old = AutoTokenizer.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.2",
        revision="41b61a33a2483885c981aa79e0df6b32407ed873",
    )

    untokenized_output_mistral_old = tokenizer_mistral_old.apply_chat_template(
        chat,
        tokenize=False,
    )
    print(f"Untokenized output: {untokenized_output_mistral_old}")

    tokenized_output_mistral_old = tokenizer_mistral_old.apply_chat_template(
        chat,
        tokenize=True,
    )
    print(f"Tokenized output: {tokenized_output_mistral_old}")

Cell Output

Untokenized output: \<s\>[INST] Hello world [/INST]Hello\</s\>

Tokenized output: [1, 733, 16289, 28793, 22557, 1526, 733, 28748, 16289, 28793, 16230, 2]

In [None]:
# skip this if you don't have a huggingface account
if False:
    tokenizer_mistral_new = AutoTokenizer.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.2", revision="main"
    )

    untokenized_output_mistral_new = tokenizer_mistral_new.apply_chat_template(
        chat,
        tokenize=False,
    )
    print(f"Untokenized output: {untokenized_output_mistral_new}")

    tokenized_output_mistral_new = tokenizer_mistral_new.apply_chat_template(
        chat,
        tokenize=True,
    )
    print(f"Tokenized output: {tokenized_output_mistral_new}")

Cell Output

Untokenized output: \<s\> [INST] Hello world [/INST] Hello\</s\>

Tokenized output: [1, 733, 16289, 28793, 22557, 1526, 733, 28748, 16289, 28793, 22557, 2]

-> The token for the assistant's `Hello` answer is different!

Report not only the dependency versions but also the model version: even small changes in the tokenizer can cause major changes in the output and make a finding irreproducible.
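
One way to act on this is to store a small record of the pinned model revision and the installed package versions next to your results. The sketch below uses only the standard library; the helper name `experiment_record` and the choice of packages are assumptions for illustration, and the revision is the commit hash used in the example above.

```python
import json
import platform
from importlib import metadata


def experiment_record(model_id, revision, packages):
    """Collect the information needed to rerun an experiment (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "model_id": model_id,
        "revision": revision,  # commit hash of the model repository, not "main"
        "python": platform.python_version(),
        "packages": versions,
    }


record = experiment_record(
    model_id="mistralai/Mistral-7B-Instruct-v0.2",
    revision="41b61a33a2483885c981aa79e0df6b32407ed873",
    packages=["transformers", "peft", "torch"],
)
print(json.dumps(record, indent=2))
```

Passing the same commit hash as `revision` to `from_pretrained` for both the tokenizer and the model keeps the two in sync with what the record reports.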

## Tokenization

In [None]:
def basic_tokenize_function(examples):
    prompt = f"{examples['text']} \nVote: {examples['label']} {tokenizer.eos_token}"
    return tokenizer(prompt)


def instruct_tokenize_function(examples):
    prompt = [
        {
            "role": "user",
            "content": examples["text"],
        },
        {
            "role": "assistant",
            "content": examples["label"],
        },
    ]
    inputs_ids = tokenizer.apply_chat_template(
        prompt,
        add_generation_prompt=False,
    )
    attention_mask = np.ones_like(inputs_ids)
    return {
        "input_ids": inputs_ids,
        "attention_mask": attention_mask,
    }


tokenized_dataset_llm = dataset_llm.map(instruct_tokenize_function).remove_columns(
    ["text", "label"]
)
tokenized_dataset_llm

In [None]:
print(tokenizer.decode(tokenized_dataset_llm["train"][0]['input_ids']))

## Quantization
Quantization reduces the memory required to store the model (Dettmers et al., 2022). Models are typically stored in 16-bit precision, so the weights of a 70B-parameter model require:

$$\frac{16 \text{ bits}}{8 \text{ bits/byte}} \times 70 \times 10^9 \text{ parameters} = 140 \text{ GB of VRAM}$$

With 4-bit quantization, all parameters are stored in 4-bit precision, reducing the memory requirement to:

$$\frac{4 \text{ bits}}{8 \text{ bits/byte}} \times 70 \times 10^9 \text{ parameters} = 35 \text{ GB of VRAM}$$
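
The same arithmetic can be written as a small helper to try other model sizes and precisions. This is a rough sketch that counts the weights only; activations, optimizer state, and the quantization constants stored alongside the 4-bit weights are ignored, and the function name is an assumption.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# weights of a 70B-parameter model at different precisions
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```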



In [None]:
# load model in 4bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)

if getattr(model.config, "pad_token_id") is None:
    model.config.pad_token_id = tokenizer.pad_token_id

## LoRA
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method (Hu et al., 2021). Instead of fine-tuning all model weights, LoRA freezes them and trains only the weights of small adapter layers. This requires less memory and allows for faster fine-tuning.

![LoRA Diagram](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png) 
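
To get a feeling for the savings, the sketch below counts the trainable parameters for a single linear layer. LoRA keeps the weight matrix `W` frozen and learns the update `(alpha / r) * B @ A`, where `A` has shape `(r, d_in)` and `B` has shape `(d_out, r)`. The 2048 x 2048 layer size is a hypothetical example, not taken from a specific model.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (frozen, trainable) parameter counts for one adapted linear layer."""
    full = d_in * d_out              # frozen weight matrix W
    adapter = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return full, adapter


full, adapter = lora_param_counts(d_in=2048, d_out=2048, rank=8)
print(f"frozen: {full:,}  trainable: {adapter:,}  ratio: {adapter / full:.2%}")
```

The adapter size grows linearly with the rank, so doubling `r` doubles the trainable parameters while the frozen weights stay the same.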


In [None]:
lora_rank = 8
lora_alpha = 8

lora_config = LoraConfig(
    r=lora_rank,
    lora_alpha=lora_alpha,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules="all-linear",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
model.config.use_cache = False

## The Answer Extraction Problem

<!-- - https://arxiv.org/pdf/2307.09702, https://github.com/dottxt-ai/outlines -->
Different approaches to answer extraction:
- https://blog.eleuther.ai/multiple-choice-normalization/
- https://github.com/huggingface/lighteval

Problem 1: How many tokens are needed to answer the question?
- One-token solutions:
    - are less compute-intensive
    - do not require normalization
    - only work if the first tokens of all labels are distinct
- Multi-token solutions:
    - are more compute-intensive (cost is multiplied by the number of labels)
    - might require normalization
    - do not require the first tokens to be distinct

Problem 2: How to evaluate multi-token extraction (see code below)
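
Why normalization matters can be sketched in plain Python before turning to the library code. Longer answers tend to receive lower sequence probabilities, so dividing each log-probability by the number of characters in the answer can change which label wins. The probabilities are the same toy values used in the lighteval cell that follows; the `pick` helper is a hypothetical illustration, not lighteval's implementation.

```python
import math

choices = ["Trump", "Clinton", "Non-voter"]
# toy sequence probabilities, one per choice
log_probs = [math.log(p) for p in (0.34, 0.33, 0.32)]


def pick(choices, log_probs, char_normalize=False):
    """Return the choice with the highest (optionally length-normalized) score."""
    scores = [
        lp / len(choice) if char_normalize else lp
        for choice, lp in zip(choices, log_probs)
    ]
    return choices[scores.index(max(scores))]


print("without normalization:", pick(choices, log_probs))
print("with character normalization:", pick(choices, log_probs, char_normalize=True))
```

With these numbers the raw scores favor the shortest label, while the normalized scores favor the longest one, so the choice of normalization directly affects the measured accuracy.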

In [None]:
from lighteval.metrics.metrics_sample import LoglikelihoodAcc
from lighteval.metrics.normalizations import (
    LogProbCharNorm,
    # LogProbTokenNorm,
    # LogProbPMINorm,
)
from lighteval.tasks.requests import Doc
import numpy as np

choices = ["Trump", "Clinton", "Non-voter"]
log_prob_predictions = np.log([0.34, 0.33, 0.32])
correct_choice = "Non-voter"

doc = Doc(query="...", choices=choices, gold_index=[choices.index(correct_choice)])

In [None]:
acc_without_normalization = LoglikelihoodAcc(
    # LogProbCharNorm(ignore_first_space=False),
).compute(
    gold_ixs=doc.gold_index,
    choices_logprob=log_prob_predictions,
    unconditioned_logprob=None,
    choices_tokens=None,
    formatted_doc=doc,
)
print(f"Accuracy score without normalization: {acc_without_normalization}")

In [None]:
acc_with_normalization = LoglikelihoodAcc(
    LogProbCharNorm(ignore_first_space=False),
).compute(
    gold_ixs=doc.gold_index,
    choices_logprob=log_prob_predictions,
    unconditioned_logprob=None,
    choices_tokens=None,
    formatted_doc=doc,
)
print(f"Accuracy score with normalization: {acc_with_normalization}")

## Metrics

In [None]:
import evaluate
from functools import partial

hf_metrics = [
    evaluate.load("accuracy"),
    # additional metrics can be added here
]

## Training helper functions

In [None]:
from typing import Tuple


def instruct_tokenization(
    data: DatasetDict,
    tokenizer: AutoTokenizer,
) -> Tuple[DatasetDict, DatasetDict, list]:
    def tokenize_function(examples, is_inference=False):
        prompt = [
            {"role": "user", "content": examples["text"]},
        ]
        if not is_inference:
            prompt.append(
                {
                    "role": "assistant",
                    "content": examples["label"],
                }
            )
        inputs_ids = tokenizer.apply_chat_template(
            prompt,
            add_generation_prompt=is_inference,
        )
        attention_mask = np.ones_like(inputs_ids)
        return {
            "input_ids": inputs_ids,
            "attention_mask": attention_mask,
        }

    column_names = list(data.column_names.values())[0]
    training_data = data.map(tokenize_function, remove_columns=column_names)
    from functools import partial

    inference_data = data.map(
        partial(tokenize_function, is_inference=True), remove_columns=column_names
    )

    answer_tokens = list(
        {
            training_ids[len(inference_ids)]
            for inference_ids, training_ids in zip(
                inference_data["train"]["input_ids"]
                + inference_data["test"]["input_ids"],
                training_data["train"]["input_ids"]
                + training_data["test"]["input_ids"],
            )
        }
    )
    assert len(answer_tokens) == len(
        set(data["test"]["label"] + data["train"]["label"])
    )

    return training_data, inference_data, answer_tokens


training_data, inference_data, answer_tokens = instruct_tokenization(
    dataset_llm, tokenizer
)

In [None]:
def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):
        # Depending on the model and config, logits may contain extra tensors,
        # like past_key_values, but logits always come first
        logits = logits[0]
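    # restrict the vocabulary to the answer tokens and pick the most likely one
    answer_idx = logits[:, :, answer_tokens].argmax(dim=-1)

    # map the position within answer_tokens back to the vocabulary token id;
    # use the device of the logits instead of hard-coding "cuda"
    return torch.tensor(answer_tokens, device=logits.device)[answer_idx]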

# we use the first answer token of the assistant as the ground truth
ground_truth = [training_ids[len(inference_ids)] for training_ids, inference_ids in zip(training_data["test"]["input_ids"], inference_data["test"]["input_ids"])]

def compute_metrics_inference(eval_preds):
    preds, labels = eval_preds

    # NOTE: we assume that the eval dataset does not get shuffled, so we can directly compare with the ground truth
    y_true = ground_truth
    # we calculate the prediction by taking the last non-padding token
    y_pred = [[token for token in row if token != -100][-1] for row in preds]

    results = {}
    for metric in hf_metrics:
        results |= metric.compute(predictions=y_pred, references=y_true)
    return results

## Training the model
To run the training without wandb logging, call `wandb.init(mode='disabled')` instead.

In [None]:
import wandb
from datetime import datetime

now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
dataset_name = "argyle_anes_2016" # change the name here to your dataset
run_name = f"{model_id}_{dataset_name}_seed_{seed}_{now}"

# wandb.init(
#     mode='disabled',
# )
wandb.init(
    project="survai-finetuning",
    name=run_name,
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=training_data["train"],
    eval_dataset=inference_data["test"],
    args=transformers.TrainingArguments(
        output_dir="./results",
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        fp16=True,
        optim="paged_adamw_8bit",

        # train/eval settings
        num_train_epochs=1, # NOTE you should run multiple epochs in practice
        do_eval=True,
        eval_strategy="steps",
        eval_steps=1 / 3,  # after each third of an epoch

        # logging
        logging_steps=10,
        report_to="wandb",
        run_name=run_name,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(
        tokenizer, mlm=False
    ),
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics_inference,
)

trainer.evaluate()
trainer.train()
trainer.evaluate()

wandb.finish()

All fine-tuning results can be found live at: https://wandb.ai/tobihol/survai-finetuning

## Systematic Non-responses Experiment

Party affiliation is (unsurprisingly) a strong predictor of vote choice. In the Argyle et al. (2022) study, GPT-3 mainly relied on a person's party affiliation and ideology to predict their vote choice.

In this experiment, we remove respondents who identify as Republican from the training set. We therefore train only on Democrats and independents and check whether the model still performs well.

In [None]:
df_train["party"].value_counts()

In [None]:
leans_republican = df_train["party"].apply(lambda x: "Republican" in x)
df_train[~leans_republican]

Rerun the notebook with the modified training split to run this experiment.

# Things we did not cover

Some parts of the pipeline were skipped because of time constraints but should be done in practice:
- Hyperparameter search
- Cross-validation
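
As a starting point, the two steps can be combined into a grid search that is evaluated with k-fold cross-validation. The sketch below only builds the configurations and the index splits in plain Python; the grid values, the helper name `k_fold_indices`, and the sample count are assumptions, and the actual fine-tuning call is left as a placeholder.

```python
import random
from itertools import product


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# hypothetical search space
grid = {"lora_rank": [4, 8, 16], "learning_rate": [1e-4, 2e-4]}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

for config in configs:
    for fold, (train_idx, val_idx) in enumerate(k_fold_indices(n_samples=10, k=5)):
        # placeholder: fine-tune on train_idx with `config`, then evaluate on val_idx
        pass

print(f"{len(configs)} configurations x 5 folds")
```

The test split should stay untouched during the search and only be used once, to evaluate the configuration that was selected on the validation folds.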