# AAPOR Short Course: Fine-tuning LLMs for Data Augmentation

The following contains exercises 1, 2, and 3, and the fine-tuning pipeline we use during this short course.

This Jupyter Notebook is designed to be worked from top to bottom. However, the code of each section can be run independently, i.e., you can run the notebook starting from the [Exercises](#Exercises) or the [Fine-tuning Pipeline](#Fine-tuning-Pipeline) section.

Some parts of the pipeline require a GPU server. To select one on Google Colab, go to `Runtime` -> `Change runtime type`.

# Exercises

The following contains exercises 1, 2, and 3 of the short course. Each `TODO` marks a place where an answer should be written or coded.

## Google Colab Setup
Before starting the exercises, some technical setup is required when running the notebook on Google Colab or Kaggle.

For the fine-tuning, we have additional technical setup later.

In [None]:
# Importing the sys module to access system-specific parameters and functions
import sys
# Importing the os module to interact with the operating system
import os

# Downloads the data from the GitHub repository of this notebook
def download_data():
    !git clone https://github.com/tobihol/aapor-finetuning.git
    %mv aapor-finetuning/data/ .
    %rm -rf aapor-finetuning/
    return

# This function checks if CUDA (GPU support) is available and installs the dependencies needed for the exercises
# If CUDA is not available, it warns the user to select a GPU runtime
# The function installs the following packages:
# - transformers: Hugging Face's library for working with pre-trained models
# - openpyxl: Library for reading and writing Excel files (xlsx/xlsm/xltx/xltm)
def install_dependencies():
    import torch
    if not torch.cuda.is_available():
      print("CUDA is not available. \nPick a GPU before running this notebook. \nGo to 'Runtime' -> 'Change runtime type' to do this. (Colab)")
      return 
    %pip install transformers
    %pip install openpyxl
    return

# Checks if the code is running in a Kaggle environment
def is_running_in_kaggle():
    return 'KAGGLE_KERNEL_RUN_TYPE' in os.environ

# Checks if the code is running in a Google Colab environment
def is_running_in_colab():
    return "google.colab" in sys.modules

# Check if running in Colab or Kaggle and download data accordingly
if is_running_in_colab() or is_running_in_kaggle():
    print("Running on Colab/Kaggle")
    download_data()
    install_dependencies()
else:
    print("Not running in Colab/Kaggle")

## Exercise 1: Model Selection

Before selecting a model for your task, it’s important to reflect on your goals, resources, and technical environment. Here are key questions to guide your decision:

Do you want to run the model on your own hardware?
- Open-source models (like LLaMA, Mistral, or Qwen) can often be downloaded and run locally. This gives you more control over data privacy and fine-tuning but also requires sufficient computing power (especially GPU memory) and setup effort.

Do you have funds to use a proprietary model?
- Commercial models like OpenAI’s GPT-4, Anthropic’s Claude, or Google’s Gemini offer strong performance and easy-to-use APIs, but they come with usage costs. This can be worthwhile if you want strong performance without managing infrastructure.

Do you have strong programming skills?
- Open-source models typically require setting up environments (e.g., using PyTorch, HuggingFace Transformers), handling tokenization, managing memory usage, and sometimes fine-tuning. If you’re comfortable with Python and command-line tools, these are manageable. If not, hosted APIs might be easier to work with.

**Resources**

The following resources give an overview of available models and how they compare on different metrics of performance and cost.

- The popular benchmark **Chatbot Arena** _(Chiang et al., 2024)_ provides a live-updated Elo score for LLMs, based on crowd-sourced pairwise comparisons of their responses.
    - Trade-off Plot: https://lmarena.ai/price
- The **Stanford HELM Benchmark** evaluates language models on over 50 tasks, including understanding and reasoning. It provides insights into model performance in terms of accuracy, fairness, and robustness.
    - https://crfm.stanford.edu/helm/
- The **Artificial Analysis** platform provides independent analysis of AI models and API providers. The main metrics are 'intelligence', speed, and price. 
    - https://artificialanalysis.ai/
- The **Open LLM Leaderboard** on Hugging Face runs multiple well-established benchmarks for open-source models hosted on the site. It is most useful for comparing smaller open-source LLMs that are not featured on Chatbot Arena, or for comparing different fine-tunes of a model.
    - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/

**Your Task**

Choose the right model based on your use case. Think about what constraints you have and use the resources provided above. 

Feel free to make notes in this markdown cell.

- TODO note

**What we are picking**

For the fine-tuning pipeline later, our use case is predicting vote choices in the US presidential election based on ANES survey answers. We want to fine-tune a model for _free_ and therefore do not have a lot of computing resources (the Google Colab free tier gives access to one T4 GPU). We therefore use the smallest model in the Llama 3 family, **Llama 3.2 1B**, as our model of choice.

## Exercise 2: Excel to JSON

Data processed by LLMs is typically expected to be in JSON format. JSON is a standardized, lightweight, and language-independent format that allows structured data (text, numbers, lists, or nested objects) to be easily transmitted between systems. Most APIs for LLMs expect JSON because it is both human-readable and machine-readable, making it ideal for sending configuration settings, user input, or model instructions in a clear and organized way.

To get a feeling for how data is structured in JSON as opposed to tabular data (Excel, CSV) we want to convert a small table to JSON in this task.

**Example**

Excel participants table:
|   Age | Gender | Education | Occupation |
|------:|:-------|:----------|:-----------|
|    30 | male   | bachelor  | accountant |
|    24 | female | master    | engineer   |

Converted to JSON:
```json
{
    "participants": [
        {
            "age": 30,
            "gender": "male",
            "education": "bachelor",
            "occupation": "accountant"
        },
        {
            "age": 24,
            "gender": "female",
            "education": "master",
            "occupation": "engineer"
        }
    ]
}
```

**Your Task**

Given the following table of participant answers in a survey:

|   Age | Gender   | State     | Vote Choice   |
|------:|:---------|:----------|:--------------|
|    27 | female   | Louisiana | Trump         |
|    45 | male     | NaN       | Clinton       |
|   NaN | female   | Ohio      | Non-voter     |

What would a corresponding JSON entry look like?

In [22]:
# type up your answer here

json_data = {
    "participants": [
        # TODO
    ]
}

**What is done in practice**

Of course, tables are not manually converted into JSON data in practice. Typically, Python libraries are used to handle data manipulation tasks. In this short course, we will use the popular library _pandas_ for parsing tabular data. In the following, two ways of transforming Excel data into JSON data are shown using the library.

Records orientation
- Each row of the table is converted into a JSON object, with the column names as keys and the row values as the corresponding values. This results in a list of JSON objects, where each object represents a row in the table.

Columns orientation
- The table is converted into a JSON object where each key is a column name, and the value is an object containing the index-value pairs for that column. This orientation is useful when you want to access data by columns rather than by rows.
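
To make the difference between the two orientations concrete, here is a minimal sketch that builds both by hand with Python's built-in `json` module, using the two-row example table from above. The row labels `0` and `1` are an assumption for illustration; with pandas, the keys in the columns orientation depend on the index of the DataFrame.

```python
import json

# The two example participants from the table above, one dict per row
rows = [
    {"age": 30, "gender": "male", "education": "bachelor", "occupation": "accountant"},
    {"age": 24, "gender": "female", "education": "master", "occupation": "engineer"},
]

# Records orientation: a list with one JSON object per row
records_json = json.dumps(rows)

# Columns orientation: one JSON object per column, mapping row label -> value
# (row labels 0 and 1 are assumed here; JSON object keys are always strings)
columns = {
    column: {str(index): row[column] for index, row in enumerate(rows)}
    for column in rows[0]
}
columns_json = json.dumps(columns)

print("--- Records Orientation ---")
print(records_json)
print("--- Columns Orientation ---")
print(columns_json)
```

The records orientation keeps all values of one participant together, which is usually what we want when each participant later becomes one prompt.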


In [None]:
# Import the popular data manipulation library pandas
import pandas as pd

# Read the Excel file into a DataFrame, using the first column as the index
data = pd.read_excel("data/json_task.xlsx", index_col=0)

# Convert the DataFrame to JSON using records orientation
# Each row is converted into a JSON object with column names as keys
print("--- Records Orientation ---")
print(data.to_json(orient="records"))
print()

# Convert the DataFrame to JSON using columns orientation
# Each column is converted into a JSON object with index-value pairs
print("--- Columns Orientation ---")
print(data.to_json(orient="columns"))
print()

## Exercise 3: Prompt Design

In this exercise, we convert our JSON survey participant data into the structure we will give to the LLM. We use an _instruction-tuned_ LLM, which is specifically trained to follow human instructions, as opposed to a base model, which only predicts the next token of a continuous text. Instruction-tuned models take a _chat format_ as input and often share a similar prompt format, described in the following.

**The OpenAI Chat Format**

The OpenAI JSON chat format has become the standard way to structure conversations with AI models. This format is used not only by OpenAI's models but also by many open-source models available on Hugging Face. This standardized format makes it easy to switch between different models (OpenAI, open-source models on Hugging Face, etc.), maintain conversation history in a structured way, and control the behaviour of the model through system messages.

The format consists of a list of messages, each with a "role" and "content":
- **role**: Identifies who is speaking (system, user, or assistant)
- **content**: Contains the actual message text

Overview of the three roles:
- The **system** role provides initial instructions to the model. It sets the behaviour and context for the AI assistant and defines the assistant's role, capabilities, and constraints. In chatbots like ChatGPT, the system message is not shown to the end user, but it guides how the model responds.
- The **user** role represents inputs from the human.
- The **assistant** role contains the model's responses.

**Example Conversion**

An example of how to construct a prompt based on survey data in JSON. Here we only consider one participant to make the example more manageable.

Survey data:
```json
{
    "participants": [
        {
            "age": 30,
            "gender": "male",
            "education": "bachelor",
            "occupation": "accountant"
        }
    ]
}
```

An example prompting setup where we want the LLM to be a survey participant that answers questions.
```json
{
    "prompt": [
        {
            "role": "system",
            "content": "You are a survey participant. Reply to the user's question with a short answer."
        },
        {
            "role": "user",
            "content": "What is your age?"
        },
        {
            "role": "assistant",
            "content": "I am 30 years old."
        },
        {
            "role": "user",
            "content": "What is your gender?"
        },
        {
            "role": "assistant",
            "content": "I am a male."
        },
        {
            "role": "user",
            "content": "What is your occupation?"
        },
        {
            "role": "assistant",
            "content": "I am an accountant."
        }
    ]
}
```
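
The conversion from a participant record to the chat structure above can also be done programmatically. The following is a minimal sketch, not part of the course pipeline: the helper names `with_article` and `participant_to_messages` are hypothetical, and the question and answer wording simply mirrors the example above.

```python
# Hypothetical helpers for illustration only; wording mirrors the example above.
SYSTEM_PROMPT = "You are a survey participant. Reply to the user's question with a short answer."


def with_article(noun: str) -> str:
    """Prefix a noun with 'a' or 'an', decided by its first letter."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def participant_to_messages(participant: dict) -> list[dict]:
    """Turn one participant record into a system message plus user/assistant turns."""
    turns = [
        ("What is your age?", f"I am {participant['age']} years old."),
        ("What is your gender?", f"I am {with_article(participant['gender'])}."),
        ("What is your occupation?", f"I am {with_article(participant['occupation'])}."),
    ]
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for question, answer in turns:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    return messages


example_participant = {
    "age": 30,
    "gender": "male",
    "education": "bachelor",
    "occupation": "accountant",
}
example_prompt = {"prompt": participant_to_messages(example_participant)}

for message in example_prompt["prompt"]:
    print(f"{message['role']:>9}: {message['content']}")
```

Writing the conversion as a function pays off once there are many participants, because the same template is then applied to every record.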

**Your Task**

Using the same survey data as provided in the example above, create a prompting setup that tasks the LLM with predicting the vote choice of the participant instead.

Survey data:
```json
{
    "participants": [
        {
            "age": 30,
            "gender": "male",
            "education": "bachelor",
            "occupation": "accountant"
        }
    ]
}
```

Questions to consider:
- What should the assistant answer look like?
- Should the model act as an expert in a certain field? Who does it substitute?


In [30]:
# type up your answer here

json_prompt = {
    "prompt": [
        # TODO
    ]
}

**Using our prompt with an LLM**

We can test our prompt with a small open-source model. To do this, do _not_ include the final "assistant" message in the prompt; that answer will instead be generated by the LLM. We use the model we chose back in Exercise 1, in its instruction-tuned version, as described earlier in this exercise.

We load the model _Llama-3.2-1B-Instruct_ with the `transformers` library. The transformers library is a popular open-source library developed by Hugging Face that provides pre-trained models for natural language processing (NLP) and other machine learning tasks. It includes tools for fine-tuning, inference optimization, and model deployment. In this notebook, we'll use transformers to load and run the Llama-3.2-1B-Instruct model for text generation.

In [28]:
# Import the pipeline module from the transformers library
# This provides a simple way to use pre-trained models for various NLP tasks
# such as text generation, without having to manually handle model loading and inference
from transformers import pipeline

# id of a model hosted on Hugging Face
model_id = "unsloth/Llama-3.2-1B-Instruct"  # NOTE: we use the unsloth version of the model, because it does not require an API key, as opposed to the official meta-llama version

# Create a pipeline for an instruct model using the json_prompt as input
instruct_pipeline = pipeline(
    "text-generation",  # Specifies the task type for the pipeline (generating text)
    model=model_id,     # Sets which model to use (Llama-3.2-1B-Instruct in this case)
    device_map="auto",  # Automatically determines the best device (CPU/GPU) for running the model
)

After loading the model, we now generate 10 responses to our prompt `json_prompt`. The responses can differ, as they are sampled probabilistically.

In [None]:
# Generate a response using the instruct model pipeline
responses = instruct_pipeline(json_prompt["prompt"], num_return_sequences=10)

# Print the generated response
print("10 different generated responses:")
[resp["generated_text"][-1] for resp in responses]

# Fine-tuning Pipeline

In the following, we go through the steps required to set up a fine-tuning pipeline to impute missing survey data.

## Google Colab Setup

The following is the technical setup required for the fine-tuning pipeline, when running the notebook on Google Colab or Kaggle. 

This setup is built to be runnable even when the exercise part of this notebook has not been run. Therefore, some code is repeated.

In [None]:
# additional colab/kaggle setup

# Importing the sys module to access system-specific parameters and functions
import sys
# Importing the os module to interact with the operating system
import os

# Downloads the data from the GitHub repository of this notebook
def download_data():
    !git clone https://github.com/tobihol/aapor-finetuning.git
    %mv aapor-finetuning/data/ .
    %rm -rf aapor-finetuning/
    return

# This function checks if CUDA (GPU support) is available and installs necessary dependencies for fine-tuning
# If CUDA is not available, it warns the user to select a GPU runtime
# The function installs the following packages:
# - bitsandbytes: Library for quantization and memory-efficient operations
# - accelerate: Library for distributed training and mixed precision
# - transformers: Hugging Face's library for working with pre-trained models
# - datasets: Library for accessing and processing datasets
# - evaluate: Library for model evaluation
# - peft: Parameter-Efficient Fine-Tuning library
# - trl: Transformer Reinforcement Learning library
# - scikit-learn: Machine learning library for evaluation metrics
# - wandb: Weights & Biases for experiment tracking
def install_dependencies():
    import torch
    if not torch.cuda.is_available():
      print("CUDA is not available. \nPick a GPU before running this notebook. \nGo to 'Runtime' -> 'Change runtime type' to do this. (Colab)")
      return 
    %pip install bitsandbytes
    %pip install accelerate
    %pip install transformers
    %pip install datasets
    %pip install evaluate
    %pip install peft
    %pip install trl
    %pip install scikit-learn
    %pip install wandb
    return



def is_running_in_kaggle():
    return 'KAGGLE_KERNEL_RUN_TYPE' in os.environ

def is_running_in_colab():
    return "google.colab" in sys.modules

if is_running_in_colab() or is_running_in_kaggle():
    print("Running on Colab/Kaggle")
    install_dependencies()
else:
    print("Not running in Colab/Kaggle")

We can now import the installed dependencies. Before we import any libraries we will set a seed.
Setting a consistent seed ensures that:
- Model initialization will be the same across runs
- Data shuffling will follow the same order
- Random operations like dropout will behave consistently
- This helps in debugging and comparing model performance across experiments

The transformers.set_seed() function sets a random seed for reproducibility across multiple libraries:
- PyTorch (both CPU and GPU operations)
- NumPy (for numerical operations)
- Python's random module
- TensorFlow (if used)
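
To illustrate what a fixed seed buys us, here is a minimal sketch that uses only Python's built-in `random` module. This is an assumption made for illustration: `transformers.set_seed()` covers more libraries than this, and the helper `draw_numbers` is hypothetical.

```python
import random


def draw_numbers(seed_value: int, n: int = 5) -> list[int]:
    """Create a seeded random generator and draw n pseudo-random integers."""
    generator = random.Random(seed_value)  # local generator, global state stays untouched
    return [generator.randint(0, 99) for _ in range(n)]


first_run = draw_numbers(seed_value=24)
second_run = draw_numbers(seed_value=24)
other_seed = draw_numbers(seed_value=25)

print("Same seed gives same numbers:", first_run == second_run)
print("Seed 24:", first_run)
print("Seed 25:", other_seed)
```

The same principle applies to weight initialization, data shuffling, and dropout: with the seed fixed, two runs draw the same sequence of pseudo-random numbers.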

In [2]:
import transformers

seed = 24 # Please set your own favorite seed!

# set the seed
transformers.set_seed(seed)

## Data preparation
We use 2016 American National Election Studies (ANES) survey data, specifically the subset that Argyle et al. (2022) used in Study 2 (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JPV20K).

The code below loads the 2016 American National Election Studies survey data from a CSV file,
displays it as a DataFrame, and then performs exploratory analysis on it. This dataset contains
demographic information, political views, and voting behavior of respondents from the 2016 US
presidential election, which will be used to build a predictive model for voting choices
(Trump, Clinton, or Non-voter).

In [None]:
# Import pandas for data manipulation and analysis
# This library provides data structures like DataFrames that make working with structured data easy
import pandas as pd

# Read the CSV file into a pandas DataFrame
df_survey = pd.read_csv("data/2016_anes_argyle.csv")

# Display some rows of the DataFrame
df_survey

**Data Exploration**

In the cell below, we're examining the structure and content of our survey dataset:

1. We use `df_survey.info()` to get a summary of the DataFrame, which will show:
   - Total number of entries (rows)
   - Column names and their data types
   - Number of non-null values in each column
   - Memory usage of the DataFrame

This information helps us understand what kind of data we're working with and identify any potential issues like missing values that need to be addressed during preprocessing. It's an essential step before building our model to predict voting behavior.


In [None]:
df_survey.info()

**Feature Selection and Target Variable**

In the next cell, we're defining which columns from our dataset will be used:

1. We create a list called `features` containing the demographic and political variables we'll use as predictors:
   - Demographic information: race, age, gender, state
   - Political attributes: discussion habits, ideology, party affiliation
   - Religious behavior: church attendance
   - Attitudes: political interest, patriotism

2. We define `label` as "ground_truth", which represents the actual voting choice of respondents (Trump, Clinton, or Non-voter)

This selection of features will be used to train our model to predict voting behavior based on these characteristics.

In [5]:
features = [
    "race",
    "discuss_politics",
    "ideology",
    "party",
    "church_goer",
    "age",
    "gender",
    "political_interest",
    "patriotism",
    "state",
]
label = "ground_truth" # this is the vote choice in this dataset

**Data Preprocessing**

In the next cell, we prepare our data for modeling by:

1. Converting the 'age' column to string type to treat it as a categorical variable
2. Filling all missing values (NaN) with the string "missing" so they can be treated as their own category

This preprocessing step ensures our data is ready for the machine learning model and handles the missing values appropriately. It also allows us to include all data points in our analysis without dropping rows that have missing information.

In [None]:
df_survey_processed = (
    df_survey
    .astype({"age": str})
    .fillna("missing")
)
df_survey_processed

### Train/Test Split

Any manipulation of the training data should be done in this step.

We split our processed survey data into training and testing sets:

1. We use scikit-learn's train_test_split function to divide our data, with 80% for training and 20% for testing
2. We set a random seed for reproducibility
3. There's commented code showing how we could modify the training data for different experiments (e.g., excluding Republican voters)
4. We convert our pandas DataFrames to Hugging Face Dataset objects for compatibility with transformer models
5. The datasets are organized in a DatasetDict with "train" and "test" keys
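
The idea behind a seeded 80/20 split can be sketched with the standard library alone. This is a simplified illustration, not scikit-learn's actual implementation, and the helper `simple_train_test_split` is hypothetical.

```python
import random


def simple_train_test_split(rows: list, test_size: float, seed_value: int) -> tuple[list, list]:
    """Shuffle a copy of the rows with a seeded generator, then cut off the test share."""
    shuffled = rows[:]  # copy, so the input list is left unchanged
    random.Random(seed_value).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


toy_rows = list(range(10))  # stand-in for 10 survey respondents
train_rows, test_rows = simple_train_test_split(toy_rows, test_size=0.2, seed_value=24)

print("train:", train_rows)
print("test: ", test_rows)
```

Because the shuffle is seeded, rerunning the cell assigns the same respondents to the test set, which keeps evaluation results comparable across experiments.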


In [None]:
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict

df_train, df_test = train_test_split(
    df_survey_processed,  # The processed survey dataframe to split into train and test sets
    test_size=0.2,        # Allocate 20% of the data to the test set
    random_state=seed,    # Set a random seed for reproducibility of the split
)

# we can modify the training data here to do different experiments
# for example excluding republican voters
# leans_republican = df_train["party"].apply(lambda x: "Republican" in x)
# df_train = df_train[~leans_republican]

dataset = DatasetDict({
    "train": Dataset.from_pandas(df_train, preserve_index=False),
    "test": Dataset.from_pandas(df_test, preserve_index=False),
})

dataset

### Prompt Design

We will use an instruction-tuned model and will therefore define an instruction prompt here, like we did in exercise 3.

We design a prompt for our instruction-tuned model to classify voter behavior.
The prompt will include:
1. A clear instruction for the classification task
2. Context about the data (2016 survey answers from American National Election Studies)
3. Expected output format (one of three labels: 'Trump', 'Clinton', or 'Non-voter')
4. A structured format to present the survey data to the model


In [None]:
instruction = (
    "Please perform a classification task. "
    "Given the 2016 survey answers from the American National Election Studies, "
    "return which candidate the person voted for. "
    "Return a label from ['Trump', 'Clinton', 'Non-voter'] only without any other text.\n"
)
print(instruction)

We'll also create a mapping between our internal column names and more human-readable labels
to make the prompts more understandable for the model.

In [9]:
column_name_map = {
    "race": "Race",
    "discuss_politics": "Discusses politics",
    "ideology": "Ideology",
    "party": "Party",
    "church_goer": "Church",
    "age": "Age",
    "gender": "Gender",
    "political_interest": "Political interest",
    "patriotism": "American Flag",
    "state": "State",
    "ground_truth": "Vote",
}

We define a function to build prompt-completion pairs for our model.
This function will:
1. Take a row of survey data and format it using our column name mapping
2. Create a structured prompt with system and user messages
3. Set the ground truth vote as the expected completion
4. Return the formatted prompt-completion pair for model training/evaluation

In [None]:
def build_prompt_completion(
    row: dict,
    system_prompt: str = instruction,
) -> dict[str, list[dict]]:
    user_prompt = "\n".join(
        [f"{column_name_map[k]}: {v}" for k, v in row.items() if k != label]
    )
    assistant_prompt = row[label]
    return {
        "prompt": [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            },
        ],
        "completion": [
            {
                "role": "assistant",
                "content": assistant_prompt,
            },
        ],
    }


build_prompt_completion(
    row=dataset["train"][0],
)
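
To see the formatting step in isolation, here is a minimal sketch on a made-up row with only two features. The `toy_` names and values are hypothetical and chosen so that the variables of the pipeline are not overwritten.

```python
# Hypothetical toy inputs; the real pipeline uses column_name_map, label, and dataset rows
toy_name_map = {"age": "Age", "state": "State", "ground_truth": "Vote"}
toy_label = "ground_truth"
toy_row = {"age": "27", "state": "Louisiana", "ground_truth": "Trump"}

# One "Readable name: value" line per feature, skipping the label column
toy_user_prompt = "\n".join(
    f"{toy_name_map[key]}: {value}" for key, value in toy_row.items() if key != toy_label
)
# The label becomes the text the assistant is expected to produce
toy_completion = toy_row[toy_label]

print(toy_user_prompt)
print("->", toy_completion)
```

Keeping the label out of the user message matters: if the vote choice leaked into the prompt, the model could copy it instead of predicting it.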

Now we convert our dataset to the format needed for LLM training.
The code below:
1. Applies our build_prompt_completion function to each row in the dataset
2. Removes the original feature columns and label column
3. Creates a new dataset with only the prompt-completion pairs

This transforms our tabular data into a format suitable for fine-tuning.


In [None]:
dataset_llm = dataset.map(build_prompt_completion).remove_columns(features+[label])
dataset_llm

## Preparing the Model

### Model Selection

We use `Llama-3.2-1B-Instruct` for this experiment, because:
1. It's a small model (1B parameters) that can run on limited hardware (one T4 GPU in our case, see exercise 1)
2. It's an instruction-tuned model, which makes it suitable for our classification task (see exercise 3)
3. It provides a good balance between performance and computational requirements
4. It's based on the Llama architecture which has shown strong performance on various NLP tasks
5. The model is open and accessible, making it suitable for educational purposes

In [12]:
model_id = "unsloth/Llama-3.2-1B-Instruct"

### Loading Tokenizer

The code below loads the tokenizer for our model.
The tokenizer is responsible for converting text into tokens that the model can understand.
We use `AutoTokenizer` to automatically select the appropriate tokenizer for our chosen model.

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_id,  # The ID of the model to load the tokenizer for
    # revision=revision, # NOTE: revision should be set for a reproducible experiment
    trust_remote_code=True,  # Allow the tokenizer to execute remote code from the model repository
)

### Loading the Quantized Model

Quantization reduces the memory required to store the model (Dettmers et al., 2022). Typically a model is stored in 16-bit precision, therefore for a 1B parameter model:

$$\frac{16 \text{ bits}}{8 \text{ bits/byte}} \times 1 \times 10^9 \text{ parameters} = 2 \text{ GB of VRAM}$$

With 4-bit quantization, all parameters are stored in 4-bit precision, reducing the memory requirement to:

$$\frac{4 \text{ bits}}{8 \text{ bits/byte}} \times 1 \times 10^9 \text{ parameters} = 0.5 \text{ GB of VRAM}$$

The code below loads our model with 4-bit quantization using the BitsAndBytesConfig,
which significantly reduces memory usage while maintaining reasonable performance.
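
The memory arithmetic above can be written as a small function. This is a back-of-the-envelope sketch that counts the stored weights only; activations, optimizer state, and the small overhead of quantization constants are ignored. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_parameter = bits_per_parameter / 8
    return num_parameters * bytes_per_parameter / 1e9


# A 1B parameter model at different precisions
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(1e9, bits):.1f} GB")
```

The function scales linearly in both arguments, so the same estimate for a hypothetical 8B parameter model is simply eight times larger.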



In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load model in 4bit
model = AutoModelForCausalLM.from_pretrained(
    model_id,  # The ID of the model to load
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,  # Enable 4-bit quantization to reduce memory usage
    ),
    trust_remote_code=True,  # Allow execution of remote code from the model repository
    device_map="auto",  # Automatically determine whether to use CPU or GPU
)

# Overview of the model architecture
model

### LoRA
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method (Hu et al., 2021). Instead of fine-tuning all model weights, LoRA trains only the weights of small adapter matrices that are added alongside the existing layers. This requires less memory and allows for faster fine-tuning.

![LoRA Diagram](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png) 

The code below sets up the LoRA configuration for fine-tuning the model.
It defines parameters like rank (r), which controls the size of the low-rank matrices,
alpha, which scales the LoRA contribution, and dropout for regularization,
and it specifies that we apply LoRA to all linear layers in the model.
This configuration will be used to create trainable adapter layers while
keeping the original model weights frozen during fine-tuning.
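
A short calculation shows why LoRA is cheap. For a linear layer with a weight matrix of shape `d_out x d_in`, LoRA trains two small matrices of shapes `d_out x r` and `r x d_in` instead. The following sketch assumes a single 2048 x 2048 layer purely for illustration, and the helper `lora_parameter_count` is hypothetical.

```python
def lora_parameter_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in = 2048, 2048      # assumed layer shape, for illustration only
toy_rank, toy_alpha = 8, 8    # same values as the key hyperparameters used in this notebook

full_params = d_out * d_in
adapter_params = lora_parameter_count(d_out, d_in, toy_rank)
scaling = toy_alpha / toy_rank  # factor applied to the low-rank update in standard LoRA

print(f"Full weight matrix: {full_params:,} parameters")
print(f"LoRA adapter:       {adapter_params:,} parameters")
print(f"Trainable share:    {adapter_params / full_params:.2%}")
print(f"Scaling alpha / r:  {scaling}")
```

Doubling the rank doubles the number of trainable adapter parameters, so the rank is the main knob for trading capacity against memory.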


In [None]:
from peft import LoraConfig, TaskType

# key hyperparameters
lora_rank = 8
lora_alpha = 8

lora_config = LoraConfig(
    r=lora_rank,  # Rank of the low-rank matrices
    lora_alpha=lora_alpha,  # Scaling factor for the LoRA contribution
    lora_dropout=0.05,  # Dropout probability for regularization
    bias="none",  # Do not train any bias parameters (default)
    task_type=TaskType.CAUSAL_LM,  # Specify that we're fine-tuning a causal language model
    target_modules="all-linear",  # Apply LoRA to all linear layers in the model
)

lora_config

## Metrics

For simplicity, we evaluate on the _first token_ of the LLM-generated answer only. If the highest-probability first token matches the first token of the true label, we count the prediction as correct. We can therefore use typical classification metrics, i.e., _Accuracy_ and _F1 score_.

### Accuracy Score
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

Where:
- $TP$: True positive
- $TN$: True negative
- $FP$: False positive
- $FN$: False negative


### F1 Score
F1 score is the harmonic mean of precision and recall, providing a balance between these two metrics:
$$
\text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

Where:
- $\text{Precision} = TP / (TP + FP)$ - the ability of the model to avoid false positives
- $\text{Recall} = TP / (TP + FN)$ - the ability of the model to find all positive samples

Macro F1 score calculates the F1 score independently for each class and then takes the average, treating all classes equally regardless of their support:
$$
\text{Macro F1} = \frac{1}{n} \sum_{i=1}^{n} \text{F1}_i
$$
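
The formulas above can be checked by hand on a small example. The following sketch implements accuracy and macro F1 in plain Python; the six predictions are made up for illustration, and the pipeline itself uses the Hugging Face `evaluate` library instead.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: list, y_pred: list, cls: str) -> float:
    """F1 of one class, treating that class as 'positive' and all others as 'negative'."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    # 2TP / (2TP + FP + FN) is algebraically equal to 2PR / (P + R)
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels and predictions for illustration
toy_true = ["Trump", "Clinton", "Non-voter", "Trump", "Clinton", "Non-voter"]
toy_pred = ["Trump", "Clinton", "Trump", "Trump", "Non-voter", "Non-voter"]

print(f"Accuracy: {accuracy(toy_true, toy_pred):.3f}")
for cls in ["Trump", "Clinton", "Non-voter"]:
    print(f"F1 {cls}: {f1_for_class(toy_true, toy_pred, cls):.3f}")
print(f"Macro F1: {macro_f1(toy_true, toy_pred):.3f}")
```

In this toy example the per-class F1 scores are 0.8, 0.667, and 0.5, so macro F1 (about 0.656) sits slightly below accuracy (about 0.667) because the weakest class counts as much as the strongest.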


The cell below loads evaluation metrics from the Hugging Face 'evaluate' library.

These metrics will be used to evaluate our fine-tuned model's performance on the classification task.

In [None]:
import evaluate

hf_metrics = [
    evaluate.load("accuracy"),
    evaluate.load("f1", average="macro"),
]

# give description of each metric used
for metric in hf_metrics:
    print(metric.description)

## Additional Training Setup
This Python code is not covered in the workshop. It sets up some additional details for fine-tuning and evaluation.

In [None]:
import wandb

# set pad token id
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
if getattr(model.config, "pad_token_id") is None:
    model.config.pad_token_id = tokenizer.pad_token_id

labels = df_survey_processed.ground_truth.unique()

first_token_ids = [
    tokenizer.encode(label, add_special_tokens=False)[0]
    for label in labels
]

def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):
        # Depending on the model and config, logits may contain extra tensors,
        # like past_key_values, but logits always come first
        logits = logits[0]
    # Return highest probability token in answer options
    # return logits[:, :, first_token_ids]
    return logits.argmax(dim=-1)

y_probas = []

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # preds have the same shape as the labels, after the argmax(-1) has been calculated
    # by preprocess_logits_for_metrics but we need to shift the labels
    labels = labels[:, 1:]
    preds = preds[:, :-1]

    # -100 is a default value for ignore_index used by DataCollatorForCompletionOnlyLM
    mask = labels == -100
    labels[mask] = tokenizer.pad_token_id
    preds[mask] = tokenizer.pad_token_id

    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    results = {}
    for metric in hf_metrics:
        results |= metric.compute(predictions=preds[~mask], references=labels[~mask])

    return results
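
The label shift and the `-100` mask in `compute_metrics` are easier to follow on a toy sequence. The sketch below uses plain Python lists instead of arrays, and all token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # marks label positions that should not be scored

# One toy sequence: three prompt tokens (masked), then two completion tokens
toy_token_labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 51, 52]
# toy_token_preds[i] is the model's guess for the token at position i + 1
toy_token_preds = [11, 12, 51, 99, 13]

# Align guesses with the tokens they try to predict
shifted_labels = toy_token_labels[1:]   # position 0 has no guess pointing at it
shifted_preds = toy_token_preds[:-1]    # the last guess points past the sequence

# Keep only positions that belong to the completion
scored_pairs = [
    (pred, true)
    for pred, true in zip(shifted_preds, shifted_labels)
    if true != IGNORE_INDEX
]
print(scored_pairs)
```

Here the guess for the first completion token is correct (51 vs. 51) and the second is not (99 vs. 52); only these two pairs would enter the metric computation.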

## Training the model

The code below sets up and runs a fine-tuning process for a language model:
1. Configures SFTTrainer with hyperparameters (learning rate, batch size, epochs)
2. Sets up Weights & Biases (wandb) for experiment tracking
3. Creates training arguments with evaluation strategy and logging settings
4. Initializes the SFTTrainer with the model, datasets, and LoRA configuration
5. Starts the training process to fine-tune the model on the ANES 2016 survey data

To run the training without wandb logging, call `wandb.init(mode='disabled')` instead of the `wandb.init(...)` call with project and run name.

In [None]:
from trl import SFTConfig, SFTTrainer
from datetime import datetime

# key hyperparameters
learning_rate = 2e-5  # Learning rate for optimizer - controls how quickly model parameters are updated
batch_size = 8        # Number of samples processed in each training batch
epochs = 1            # Number of complete passes through the training dataset (NOTE you should run multiple epochs in practice)

now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
dataset_name = "argyle_anes_2016"
run_name = f"{model_id}_{dataset_name}_seed_{seed}_{now}"

# wandb.init(
#     mode='disabled',
# )
wandb.init(
    project="aapor-finetuning",
    name=run_name,
)

training_args = SFTConfig(
    # training parameters
    per_device_train_batch_size=batch_size,  # number of samples per batch on each device during training
    per_device_eval_batch_size=batch_size,   # number of samples per batch on each device during evaluation
    num_train_epochs=epochs,                 # total number of training epochs
    
    # evaluation settings
    do_eval=True,                            # whether to run evaluation
    eval_strategy="steps",                   # when to run evaluation (after certain steps)
    eval_steps=1 / 3,                        # a float below 1 is a ratio of total training steps: evaluate after each third of training
    
    # logging configuration
    logging_steps=10,                        # log metrics every 10 steps
    report_to="wandb",                       # send logs to Weights & Biases
    run_name=run_name,                       # name of the run for tracking
    
    output_dir="./results",                  # directory to save model checkpoints and logs
)

trainer = SFTTrainer(
    model=model,                                # The pre-trained model to fine-tune
    train_dataset=dataset_llm["train"],         # Training dataset with prompts and completions
    eval_dataset=dataset_llm["test"],           # Evaluation dataset for testing model performance
    args=training_args,                         # Training configuration settings
    peft_config=lora_config,                    # LoRA configuration for parameter-efficient fine-tuning
    compute_metrics=compute_metrics,            # Function to compute evaluation metrics (not covered in this course)
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,  # Function to preprocess model outputs for metric calculation (not covered in this course)
)

trainer.evaluate()
trainer.train()
trainer.evaluate()

wandb.finish()
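
To get a feeling for how often evaluation runs, the step arithmetic can be sketched as follows. The number of training rows is a made-up placeholder, the sketch assumes a single GPU without gradient accumulation, and the rounding with `math.ceil` is an assumption about how the ratio is turned into a step count. The helper `evaluation_interval` is hypothetical.

```python
import math


def evaluation_interval(
    num_train_rows: int, batch_size: int, epochs: int, eval_ratio: float
) -> tuple[int, int]:
    """Return (total training steps, number of steps between two evaluations)."""
    steps_per_epoch = math.ceil(num_train_rows / batch_size)
    total_steps = steps_per_epoch * epochs
    # Assumption: a ratio below 1 is multiplied by the total steps and rounded up
    return total_steps, math.ceil(total_steps * eval_ratio)


# 1000 training rows is a placeholder, not the size of the ANES subset
total_steps, eval_every = evaluation_interval(
    num_train_rows=1000, batch_size=8, epochs=1, eval_ratio=1 / 3
)
print(f"Total training steps: {total_steps}")
print(f"Evaluate every {eval_every} steps")
```

With more epochs the total number of steps grows, so the same ratio of one third then spans more than a third of a single epoch.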

All fine-tuning results can be found live at: https://wandb.ai/tobihol/aapor-finetuning