# Cyber 207 Final Project
# QuasID: Open-Source Model for Quasi-Identifier (QI) Detection

#### By Jonathan Weiss & Yash Singh

## Table of Contents
1. [Introduction](#Introduction)
2. [Approach](#Approach)
   - [Fine-Tuning vs Training from Scratch](#Fine-Tuning-vs-Training-from-Scratch)
   - [Hardware Used & CUDA Requirements](#Hardware-Used-&-CUDA-Requirements)
   - [Developing the Dataset](#Developing-the-Dataset)
   - [Model Configuration](#Model-Configuration)
       - [Parameter-Efficient Fine-Tuning (PEFT)](#Parameter-Efficient-Fine-Tuning-(PEFT))
       - [Low-Rank Adaptation (LoRA)](#Low-Rank-Adapation-(LoRA))
       - [Quantization](#Quantization)
3. [Step-By-Step](#Step-By-Step)
   - [Step 1: Import Required Libraries](#Step-1:-Import-Required-Libraries)
   - [Step 2: Load & Prepare the Training Data](#Step-2:-Load-&-Prepare-the-Training-Data)
   - [Step 3: Tokenize Data](#Step-3:-Tokenize-Data)
   - [Step 4: Prepare & Fine-Tune the Model](#Step-4:-Prepare-&-Fine-Tune-the-Model)
   - [Step 5: Save the Fine-Tuned Model](#Step-5:-Save-the-Fine-Tuned-Model)
   - [Step 6: Evaluate the Model](#Step-6:-Evaluate-the-Model)
4. [Conclusion](#Conclusion)

## Introduction
This project aims to address a gap in open-source privacy engineering tools by developing a machine learning model for the systematic identification of [Quasi-Identifiers (QIs)](https://en.wikipedia.org/wiki/Quasi-identifier) in datasets. The initiative stems from a previous metadata analysis project on datasets from [data.gov](https://data.gov), which revealed a lack of open-source tools for automated QI detection, despite the existence of tools for identifying personally identifiable information (PII), such as [Capital One's Data Profiler](https://www.capitalone.com/tech/open-source/basics-of-data-profiler/).

## Approach
### Fine-Tuning vs Training from Scratch
Initially, our team planned to train a model from scratch. Given the nuances of identifying QIs, we decided to fine-tune an existing model instead. We chose Meta Llama 3-8B because it is the smallest of the [Llama 3 models](https://ai.meta.com/blog/meta-llama-3/) (released in 8B and 70B parameter versions; a 405B version followed with Llama 3.1), in the hope that it would run more quickly on the consumer hardware at our disposal. Llama 3 was pretrained on more than 15 trillion tokens.

Fine-tuning a large language model (LLM) offers several advantages over training from scratch because it leverages the LLM's pre-existing knowledge and capabilities. We believed this would reduce the amount of manually labeled data we needed. Since the beginning of this project, Llama 3.1 models have been released, which might have made fine-tuning more efficient had they been available to us.

### Hardware Used & CUDA Requirements
*  GPU: NVIDIA GeForce RTX 4090, 24GB GDDR6X
*  CPU: 13th Gen Intel Core i9 13900KF, 24 cores, 68MB Cache
*  RAM: 32GB DDR5, 4800MHz

To ensure quick fine-tuning and execution of the model, the Python kernel must have access to CUDA-enabled GPUs. The Python environment should have the NVIDIA CUDA Toolkit and cuDNN library installed to enable GPU acceleration.

### Developing the Dataset
We used [UCI's Machine Learning Repository](https://archive.ics.uci.edu/datasets) to prepare our training datasets. We drew on different types of datasets, for example **Predict Students' Dropout and Academic Success, Student Performance, Heart Disease, Online Retail, Heart Failure Clinical Records, Bank Marketing**, etc.

We tried to use data that was as diverse as possible to avoid bias. Since none of the datasets were labeled for QIs, we manually labeled the fields based on our own understanding and on definitions found online of which fields are considered QIs.

Initially we also labeled the PII (Personally Identifiable Information) fields, but we later removed those labels from the dataset because they were not needed for this work. We also collected some data samples for each of the datasets we have been dealing with.

Our published dataset can be found on Sheet2 [here](https://docs.google.com/spreadsheets/d/10Kw-IujUPKUMSyaC0eRMWmpdWtyCspdl/edit?usp=drive_link&ouid=118311510098611665058&rtpof=true&sd=true). Before use, it must be converted to .CSV, have columns A, C, & D removed, and be saved as labeled_data.csv in the same working directory as this Jupyter Notebook.
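
The column-removal step described above could be scripted along the following lines. This is only a minimal sketch using Python's standard `csv` module, and it assumes the sheet has already been exported to CSV; the source file name `sheet2_export.csv` is hypothetical, and columns A, C, and D are taken to be zero-based positions 0, 2, and 3.

```python
import csv

# Zero-based positions of spreadsheet columns A, C, and D
DROP_POSITIONS = {0, 2, 3}

def drop_columns(rows, drop_positions=DROP_POSITIONS):
    """Return the rows with the cells at the given positions removed."""
    return [
        [cell for i, cell in enumerate(row) if i not in drop_positions]
        for row in rows
    ]

def prepare_csv(src_path, dst_path):
    """Read an exported CSV, drop columns A, C, and D, and write the result."""
    with open(src_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(drop_columns(rows))

# Example usage (the source file name is hypothetical):
# prepare_csv("sheet2_export.csv", "labeled_data.csv")
```

It is worth opening the resulting file once to confirm the remaining headers match the column names the notebook expects before running the later steps.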

#### Challenges/Learning:
1. One challenge was that there are only so many QIs, so no matter which dataset we looked at, we kept finding the same or similar fields, e.g., age and sex.
2. Another challenge was that the same data was represented in different formats across datasets, e.g., sex: M/F; Male/Female; 1/2.
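
To illustrate the second challenge, here is one way such format differences could be normalized before labeling. This is a hedged sketch and not part of the pipeline in this notebook; the lookup table `SEX_ALIASES` and the function `normalize_sex` are hypothetical, and the meaning of numeric codes such as 1/2 is an assumption that must be checked against each dataset's codebook.

```python
# Hypothetical lookup table; numeric codes in particular differ between
# datasets, so each dataset's codebook should be checked before reuse.
SEX_ALIASES = {
    "m": "male", "male": "male", "1": "male",
    "f": "female", "female": "female", "2": "female",
}

def normalize_sex(value):
    """Map a raw value to 'male', 'female', or None when unrecognized."""
    return SEX_ALIASES.get(str(value).strip().lower())

print([normalize_sex(v) for v in ["M", "Female", 2, "unknown"]])
# ['male', 'female', 'female', None]
```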

### Model Configuration
Several different configurations were tested, and all of the initial configurations quickly ran out of CUDA memory. [Meta Llama's Fine-Tuning How-To Guide](https://llama.meta.com/docs/how-to-guides/fine-tuning/) was leveraged in getting the model running on the consumer hardware available to us. Ultimately, Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA), combined with quantization, was needed before the model would fine-tune without running out of memory.
Hugging Face has examples of use of their Transformers library [published on GitHub](https://github.com/huggingface/transformers/blob/main/examples/README.md), which were referenced in the creation of this project.
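
A back-of-the-envelope estimate helps show why the initial configurations ran out of memory on a 24 GB GPU. The figures below are rough rules of thumb rather than measurements: the parameter count is approximated as 8 billion, and activations, LoRA adapters, and quantization overhead are ignored.

```python
GIB = 1024 ** 3

def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB

n_params = 8e9  # approximate parameter count (assumption)

weights_fp16 = n_params * 2      # 2 bytes per weight
grads_fp16 = n_params * 2        # one gradient per weight
adam_states_fp32 = n_params * 8  # two fp32 moment estimates per weight
weights_4bit = n_params * 0.5    # 4 bits per weight

full_finetune = weights_fp16 + grads_fp16 + adam_states_fp32

print(f"fp16 weights only:        {gib(weights_fp16):.1f} GiB")
print(f"full fine-tuning (rough): {gib(full_finetune):.1f} GiB")
print(f"4-bit weights only:       {gib(weights_4bit):.1f} GiB")
```

Under these assumptions, full fine-tuning with an Adam-style optimizer needs several times more memory than the GPU offers, while 4-bit weights leave most of the 24 GB free for activations and adapter training.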
#### Parameter-Efficient Fine-Tuning (PEFT)
PEFT is a technique for fine-tuning pre-trained models like Llama 3 without updating all of the model's parameters, focusing instead on a small subset of parameters (or introducing new weight matrices, as in LoRA). This approach reduces the computational and memory requirements of fine-tuning. Hugging Face's [documentation on PEFT](https://huggingface.co/docs/peft/en/index) was leveraged in this project.
#### Low-Rank Adapation (LoRA)
LoRA is a specific PEFT technique that adds low-rank decomposition matrices alongside the pre-trained model's weights. In essence, this introduces a small number of new parameters that can be trained efficiently while the original model parameters stay frozen. Rather than updating a full weight matrix, the adaptation represents the update as the product of two much smaller matrices, which are easier to train and require less memory. Hugging Face also [publishes documentation on implementation of LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).
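
The savings can be made concrete with a little arithmetic. In LoRA, a frozen weight matrix W of shape (d_out x d_in) is adapted as W + (alpha / r) * B * A, where A is (r x d_in) and B is (d_out x r). The sketch below counts trainable parameters for a single hypothetical 4096 x 4096 projection matrix with rank r = 16 (the rank used later in this notebook); the matrix shape is an illustrative assumption, not a statement about every layer in the model.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in          # updating W directly
    lora = r * d_in + d_out * r  # A is (r x d_in), B is (d_out x r)
    return full, lora

full, lora = lora_param_counts(d_out=4096, d_in=4096, r=16)
print(full)                   # 16777216
print(lora)                   # 131072
print(f"{lora / full:.2%}")   # 0.78%
```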
#### Quantization
Quantization is a technique that reduces the precision of the weights and activations used in a model's computations. This also helps reduce the memory footprint and increase computational efficiency. Hugging Face publishes [documentation on quantization](https://huggingface.co/docs/peft/main/en/developer_guides/quantization) as well, using the [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) quantization library.
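
The basic idea can be shown with a toy example. The sketch below uses simple absmax scaling to 8-bit integers; it is a simplified illustration only and is not the NF4 4-bit scheme that bitsandbytes applies later in this notebook. The function names are hypothetical.

```python
def quantize_absmax(values, bits=8):
    """Scale floats into signed integers of the given bit width."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        scale = 1.0  # all-zero input; any scale works
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)
print([round(w, 4) for w in restored])
```

Each restored value differs from the original by at most half a quantization step, which is the precision traded away for the smaller memory footprint.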

# Step-By-Step
## Step 1: Import Required Libraries
Some special libraries are required for fine-tuning the Llama-3 8b model used in this project, such as transformers, datasets, and peft. These libraries are maintained by [Hugging Face](https://huggingface.co/), a company specializing in development and maintenance of state-of-the-art transformer models. In addition to importing the required libraries, you must be authenticated with Hugging Face. This Jupyter Notebook assumes you have an accessible and valid Hugging Face access token.
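
One possible way to authenticate is to read the access token from an environment variable and pass it to `huggingface_hub.login`. This is a sketch under assumptions: the helper names are hypothetical, the token is assumed to be stored in `HF_TOKEN`, and the `huggingface_hub` package is assumed to be installed.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face access token from an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return token

def authenticate():
    """Log in to the Hugging Face Hub with the token from the environment."""
    from huggingface_hub import login  # imported lazily; requires huggingface_hub
    login(token=get_hf_token())

# Example usage:
# authenticate()
```

Keeping the token in the environment avoids committing it to the notebook.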

In [None]:
import os

# A deep learning framework used for creating and training neural networks, commonly used for tasks involving tensor computations & GPU acceleration
# Torch is referenced in the transformers library below, but is also referenced in the torch.cuda.empty_cache() function which was used to clear GPU memory in between runs
import torch
from torch.cuda import empty_cache
from torch.utils.data import DataLoader, TensorDataset

# Used for manipulating arrays
import numpy as np

# Used to import manually labeled data used in fine-tuning from .csv
import pandas as pd

# Part of the Hugging Face library used for loading and processing datasets
from datasets import Dataset

# A library by Hugging Face for natural language processing tasks, providing implementations of transformer models, tokenizers, and utilities for training and fine-tuning models.
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    default_data_collator,
    EarlyStoppingCallback,
    Trainer,
    TrainerCallback,
    TrainingArguments
)

# Another Hugging Face library for parameter-efficient fine-tuning (PEFT) of large language models. It includes methods for preparing models and configuring lightweight adapters.
from peft import (
    prepare_model_for_kbit_training,
    LoraConfig,
    get_peft_model,
    PeftModel,
    PeftConfig
)

# Module from scikit-learn for evaluating models
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix
)

# Library for 8-bit optimizers and quantization methods, which are used to reduce the memory requirements of large models.
import bitsandbytes as bnb

# Utilities for initializing empty weights and inferring device maps for model parallelism
from accelerate import init_empty_weights, infer_auto_device_map

Run this block to ensure CUDA is available for training and inference.

In [None]:
print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("Device name:", torch.cuda.get_device_name(torch.cuda.current_device()))

CUDA available: True
Number of GPUs: 1
Device name: NVIDIA GeForce RTX 4090


## Step 2: Load & Prepare the Training Data
The labeled_data.csv should be in the same working directory as the Jupyter Notebook.

When preparing the training/test data for fine-tuning, please note the following definitions:


*   Prompt: Question or statement the model needs to respond to
*   Completion: The expected response or classification label

During fine-tuning, the model learns to associate prompts with corresponding completions.

In [None]:
# A more comprehensive Quasi-Identifier definition, prepended to the data to give the model more context on the specific definition of QI being used
qi_definition = """A Quasi-Identifier (QI) is an attribute or a combination of attributes that can potentially identify an individual when combined with other information.

Common examples of Quasi-Identifiers include:
1. Demographic information: age, date of birth, gender, sex, ethnicity, race
2. Location data: zip code, postal code, address, city, state, country
3. Socioeconomic indicators: occupation, education level, income bracket
4. Health-related information: height, weight, blood type
5. Temporal data: dates of significant events (e.g., admission date, discharge date)
6. Family-related information: marital status, number of children
7. Professional details: job title, workplace, years of experience

Note that some attributes might not be QIs on their own, but can become QIs when combined with other information.
"""

# Define the prompt template, which is used in combination with the training data during fine-tuning
prompt_template = """Classify if the following column contains a Quasi-Identifier:
Dataset: {dataset}
Column: {column}
Sample Data: {sample_data}
Considering the definition and examples of Quasi-Identifiers provided earlier,
Is this column a Quasi-Identifier? Answer Yes or No:"""

# Load the data from the local .csv into a pandas dataframe
df = pd.read_csv('labeled_data.csv')
display(df.head())

# Create prompt-completion pairs without QI definition
df['prompt'] = df.apply(lambda row: prompt_template.format(
    dataset=row['Dataset Name'],
    column=row['Column Name'],
    sample_data=row['Sample Data']
), axis=1)
df['completion'] = df['QI Flag'].map({'Y': "Yes", 'N': "No"})

# Create a hugging face dataset from the dataframe with QI definition as the first example for additional context
hf_dataset = Dataset.from_pandas(pd.concat([
    pd.DataFrame([{
        'prompt': qi_definition,
        'completion': " This indicates whether the prompt corresponds to a Quasi-Identifier (Yes or No). Use this information for the following classification tasks."
    }]),
    df[['prompt', 'completion']]
]))

# Split the dataset into training and validation sets
split_dataset = hf_dataset.train_test_split(test_size=0.2, seed=42)

print("\nHF Dataset columns:", split_dataset['train'].column_names)
print("\nNumber of training examples:", len(split_dataset['train']))
print("Number of validation examples:", len(split_dataset['test']))

# Display the first training example and the second test example
print("\nFirst Train Example:")
print(split_dataset['train'][0]['prompt'])
print("\nCompletion:", split_dataset['train'][0]['completion'])

print("\nSecond Test Example:")
print("Prompt:")
print(split_dataset['test'][1]['prompt'])
print("\nCompletion:", split_dataset['test'][1]['completion'])

| | Dataset Name | Column Name | QI Flag | Sample Data | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | Predict Students' Dropout and Academic Success | Marital status | Y | 11112 | |
| 1 | Predict Students' Dropout and Academic Success | Application mode | N | 171511739 | |
| 2 | Predict Students' Dropout and Academic Success | Application order | N | 51521 | |
| 3 | Predict Students' Dropout and Academic Success | Course | N | 171, 9254, 9070, 9773, 8014 | |
| 4 | Predict Students' Dropout and Academic Success | Daytime/evening attendance | N | 11110 | |



HF Dataset columns: ['prompt', 'completion', '__index_level_0__']

Number of training examples: 214
Number of validation examples: 54

First Train Example:
Classify if the following column contains a Quasi-Identifier:
Dataset: CDC Diabetes Health Indicators
Column: Stroke
Sample Data: 0, 0, 0, 0, 0
Considering the definition and examples of Quasi-Identifiers provided earlier,
Is this column a Quasi-Identifier? Answer Yes or No:

Completion: No

Second Test Example:
Prompt:
Classify if the following column contains a Quasi-Identifier:
Dataset: Higher Education Students Performance Evaluation
Column: Course ID
Sample Data: 1, 3, 2, 2, 2
Considering the definition and examples of Quasi-Identifiers provided earlier,
Is this column a Quasi-Identifier? Answer Yes or No:

Completion: No


## Step 3: Tokenize Data
This project uses Hugging Face's AutoTokenizer class from the Transformers library, which loads the tokenizer that matches the Meta-Llama-3-8B model. The following definitions describe the fields produced by tokenization.

#### input_ids
`input_ids` are sequences of integers representing the tokenized version of input text. Each integer corresponds to a specific token (e.g., word or sub-word) from the model's vocabulary. These IDs are used as input to the model.

#### attention_mask
`attention_mask` is a binary mask indicating which tokens in the `input_ids` should be attended to by the model. Tokens that are part of the actual input text have a mask value of 1, while padding tokens have a mask value of 0. This helps the model to ignore padding tokens during processing, as padding tokens simply ensure all sequences are the same length.

#### labels
`labels` are the target outputs for each input sequence. In the context of language modeling or text generation tasks, `labels` are usually the expected continuation of the input text. During training, the model learns to predict these labels based on the given inputs. In this code, labels are set to -100 for prompt tokens to ignore them in loss calculation, focusing only on the completion tokens.
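
The three fields can be seen together in a small worked example. The sketch below builds one left-padded sequence from toy token IDs (only 7566, the ID for " Yes" reported later in this notebook, is real; the other IDs, the helper name `build_example`, and the tiny `max_length` are assumptions for illustration). In the notebook the padding token is the EOS token; distinct toy IDs are used here for readability.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def build_example(prompt_ids, completion_ids, bos_id, eos_id, pad_id, max_length):
    """Left-pad one prompt/completion pair and mask everything but the answer."""
    tokens = [bos_id] + prompt_ids + completion_ids + [eos_id]
    tokens = tokens[-max_length:]  # keep the end if the sequence is too long
    pad_len = max_length - len(tokens)
    n_target = min(len(completion_ids) + 1, len(tokens))  # completion + EOS

    input_ids = [pad_id] * pad_len + tokens
    attention_mask = [0] * pad_len + [1] * len(tokens)
    labels = [IGNORE_INDEX] * (max_length - n_target) + tokens[-n_target:]
    return input_ids, attention_mask, labels

ids, mask, labels = build_example(
    prompt_ids=[11, 12, 13], completion_ids=[7566],
    bos_id=1, eos_id=2, pad_id=0, max_length=8,
)
print(ids)     # [0, 0, 1, 11, 12, 13, 7566, 2]
print(mask)    # [0, 0, 1, 1, 1, 1, 1, 1]
print(labels)  # [-100, -100, -100, -100, -100, -100, 7566, 2]
```

Only the final two positions contribute to the loss, so the model is trained on the answer and the end-of-sequence token rather than on reproducing the prompt.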

**NOTE:** In addition to requiring a Hugging Face account and access token, the Meta-Llama-3 models are [Gated](https://huggingface.co/docs/hub/en/models-gated) on Hugging Face, meaning an access request must be submitted and approved by the model authors before they can be accessed. Fortunately, we received approval relatively quickly, and this did not hold up progress.

In [None]:
# Load the tokenizer from the pre-trained Meta-Llama model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token  # Set the padding token to the end-of-sequence token
tokenizer.padding_side = "left"  # Set padding to left to ensure prompts are right-aligned
max_length = 512  # Define the maximum length for token sequences

# Define a custom data collator to convert lists to tensors
def custom_data_collator(features):
    batch = default_data_collator(features)

    # Convert lists to tensors if necessary
    for key in batch:
        if isinstance(batch[key], list):
            batch[key] = torch.tensor(batch[key])

    return batch

# Load and prepare the data from a CSV file
df = pd.read_csv('labeled_data.csv')
# Create a 'prompt' column by combining QI definition, prompt template, and dataset-specific information
df['prompt'] = df.apply(lambda row: qi_definition + "\n\n" + prompt_template.format(
    dataset=row['Dataset Name'],
    column=row['Column Name'],
    sample_data=row['Sample Data']
), axis=1)
# Create a 'completion' column with " Yes" or " No" based on the 'QI Flag'
df['completion'] = df['QI Flag'].map({'Y': " Yes", 'N': " No"})

# Create a Hugging Face dataset from the DataFrame
hf_dataset = Dataset.from_pandas(df[['prompt', 'completion']])

# Split the dataset into training and testing sets
split_dataset = hf_dataset.train_test_split(test_size=0.2, seed=42)

# Function to tokenize the dataset
def tokenize_function(examples):
    prompts = examples["prompt"]
    completions = examples["completion"]

    # Tokenize prompts without special tokens, truncate to leave space for BOS, completion, and EOS
    tokenized_prompts = tokenizer(prompts, add_special_tokens=False, truncation=True, max_length=max_length-3)
    # Tokenize completions without special tokens
    tokenized_completions = tokenizer(completions, add_special_tokens=False)

    input_ids = []
    attention_mask = []
    labels = []

    # For each prompt/completion pair, add beginning/end of sequence tokens, and combine
    for prompt, completion in zip(tokenized_prompts['input_ids'], tokenized_completions['input_ids']):
        # Combine prompt and completion with BOS and EOS tokens
        combined = [tokenizer.bos_token_id] + prompt + completion + [tokenizer.eos_token_id]

        # Left-pad or truncate to max_length
        if len(combined) < max_length:
            pad_len = max_length - len(combined)
            # Build the mask before padding so padding positions get 0 and real tokens get 1
            mask = [0] * pad_len + [1] * len(combined)
            combined = [tokenizer.pad_token_id] * pad_len + combined
        else:
            combined = combined[-max_length:]
            mask = [1] * max_length

        # Create labels (set to -100 for prompt tokens)
        label = [-100] * (len(combined) - len(completion) - 1) + completion + [tokenizer.eos_token_id] # label is set to -100 to indicate which prompt tokens to ignore during calculation of loss
        label = label[-max_length:] # Ensures label list does not exceed max_length

        input_ids.append(combined)
        attention_mask.append(mask)
        labels.append(label)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

# Tokenize the dataset with the tokenize_function, remove original columns
tokenized_datasets = split_dataset.map(tokenize_function, batched=True, remove_columns=split_dataset["train"].column_names)

print("Data tokenized successfully.")

# Diagnostic: Check a sample from the tokenized dataset
sample = tokenized_datasets['train'][0]
print("\nSample tokenized data:")
print(f"Input IDs length: {len(sample['input_ids'])}")
print(f"Attention mask length: {len(sample['attention_mask'])}")
print(f"Labels length: {len(sample['labels'])}")

print("\nLast few tokens of Input IDs:")
print(sample['input_ids'][-10:])

print("\nLast few tokens of Attention Mask:")
print(sample['attention_mask'][-10:])

print("\nLast few tokens of Labels:")
print(sample['labels'][-10:])

print("\nDecoded sample (last 100 tokens):")
decoded_input = tokenizer.decode(sample['input_ids'][-100:])
print(decoded_input)

print("\nDecoded labels (non-ignored, last 20 tokens):")
decoded_labels = tokenizer.decode([l for l in sample['labels'] if l != -100][-20:])
print(decoded_labels)

# Check for 'Yes' and 'No' token IDs in the vocabulary
yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
print(f"\n'Yes' token ID: {yes_id}")
print(f"'No' token ID: {no_id}")

# Check if 'Yes' or 'No' appear in the labels of the sample
if yes_id in sample['labels'] or no_id in sample['labels']:
    print("'Yes' or 'No' found in labels.")
else:
    print("Neither 'Yes' nor 'No' found in labels.")

# Print the last few non-ignored labels
last_labels = [l for l in sample['labels'] if l != -100][-5:]
print("\nLast few non-ignored label IDs:")
print(last_labels)
print("Decoded:")
print(tokenizer.decode(last_labels))


Map:   0%|          | 0/213 [00:00<?, ? examples/s]

Map:   0%|          | 0/54 [00:00<?, ? examples/s]

Data tokenized successfully.

Sample tokenized data:
Input IDs length: 512
Attention mask length: 512
Labels length: 512

Last few tokens of Input IDs:
[10426, 37873, 30, 22559, 7566, 477, 2360, 25, 7566, 128001]

Last few tokens of Attention Mask:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Last few tokens of Labels:
[-100, -100, -100, -100, -100, -100, -100, -100, 7566, 128001]

Decoded sample (last 100 tokens):
 workplace, years of experience

Note that some attributes might not be QIs on their own, but can become QIs when combined with other information.


Classify if the following column contains a Quasi-Identifier:
Dataset: Heart Failure Clinical Records
Column: age
Sample Data: 75, 55, 65, 50, 65
Considering the definition and examples of Quasi-Identifiers provided earlier,
Is this column a Quasi-Identifier? Answer Yes or No: Yes<|end_of_text|>

Decoded labels (non-ignored, last 20 tokens):
 Yes<|end_of_text|>

'Yes' token ID: 7566
'No' token ID: 2360
'Yes' or 'No' found in labels.

Last fe

## Step 4: Prepare & Fine-Tune the Model
This code prepares and fine-tunes a language model using bitsandbytes quantization and Low-Rank Adaptation (LoRA) to optimize memory usage and training efficiency. It configures quantization, maps model parts to GPU/CPU, enables gradient checkpointing, sets up LoRA for efficient fine-tuning, defines training parameters, and initializes the Trainer. Finally, it trains the model.

In [None]:
# Define a function to compute metrics for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Logits at position i predict the token at position i + 1, so drop the last
    # logit and the first label to line predictions up with their targets
    predictions = logits[:, :-1].argmax(axis=-1)
    labels = labels[:, 1:]

    # Filter out ignored positions (prompt and padding tokens, denoted by -100 in the labels)
    mask = labels != -100
    labels_filtered = labels[mask]
    predictions_filtered = predictions[mask]

    # Convert labels and predictions to binary classification for "Yes"
    yes_token_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    labels_binary = (labels_filtered == yes_token_id).astype(int)
    predictions_binary = (predictions_filtered == yes_token_id).astype(int)

    # Debug prints to check the filtered and binary converted labels and predictions
    print(f"labels_filtered: {labels_filtered}")
    print(f"predictions_filtered: {predictions_filtered}")
    print(f"labels_binary: {labels_binary}")
    print(f"predictions_binary: {predictions_binary}")
    print(f"Unique labels in labels_binary: {np.unique(labels_binary)}")
    print(f"Unique labels in predictions_binary: {np.unique(predictions_binary)}")

    # Ensure there is at least one instance of each class in labels and predictions
    if len(np.unique(labels_binary)) == 1 or len(np.unique(predictions_binary)) == 1:
        return {
            "accuracy": accuracy_score(labels_binary, predictions_binary),
            "f1": 0.0,
            "precision": 0.0,
            "recall": 0.0
        }

    # Compute precision, recall, F1 score, and accuracy
    precision, recall, f1, _ = precision_recall_fscore_support(labels_binary, predictions_binary, average='binary', zero_division=0)
    acc = accuracy_score(labels_binary, predictions_binary)

    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall
    }

# Set up BitsAndBytes configuration for model quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model with quantization using the BitsAndBytes configuration
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Enable gradient checkpointing to save memory during training
model.gradient_checkpointing_enable()

# Set up Low-Rank Adaptation (LoRA) configuration for efficient fine-tuning
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Prepare the model for training with k-bit (low precision) parameters
model = prepare_model_for_kbit_training(model)

# Get the PEFT (Parameter-Efficient Fine-Tuning) model with the specified LoRA configuration
model = get_peft_model(model, peft_config)

# Set up training arguments optimized for an RTX 4090 GPU
training_arguments = TrainingArguments(
    output_dir="./results",  # Directory to save model checkpoints and logs
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=1,  # Reduced batch size due to memory constraints
    per_device_eval_batch_size=1,
    gradient_checkpointing=True,
    gradient_accumulation_steps=16,  # Accumulate gradients over multiple steps
    optim="paged_adamw_32bit",
    save_steps=100,  # Save model checkpoint every 100 steps
    logging_steps=10,  # Log training progress every 10 steps
    learning_rate=2e-4,  # Initial learning rate
    weight_decay=0.001,  # Weight decay for regularization
    fp16=True,  # Use half-precision floating-point for faster training
    bf16=False,
    max_grad_norm=0.3,  # Maximum gradient norm for clipping
    max_steps=-1,
    warmup_ratio=0.03,  # Warm-up ratio for learning rate scheduling
    group_by_length=True,
    lr_scheduler_type="constant",  # Use a constant learning rate scheduler
    report_to="none",  # Disable wandb logging
    evaluation_strategy="steps",  # Evaluate the model at regular intervals
    eval_steps=100,  # Evaluate every 100 steps
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="f1",  # Use F1 score to determine the best model
)

# Create a Trainer object to manage the training process
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=custom_data_collator,  # Use the custom data collator for batching
    compute_metrics=compute_metrics,  # Use the custom metrics function for evaluation
)

# Start the training process
trainer.train()

# Clear the CUDA cache to free up GPU memory
torch.cuda.empty_cache()

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss
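
The empty table above would be consistent with training finishing before the first evaluation at `eval_steps=100`. The estimate below is a rough sketch that assumes about one optimizer step per `gradient_accumulation_steps` batches and uses the 213 training examples reported during tokenization; the exact rounding depends on the Transformers version, and the function name is hypothetical.

```python
def estimate_optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Roughly estimate how many optimizer steps a training run performs."""
    batches_per_epoch = -(-num_examples // batch_size)  # ceiling division
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

total = estimate_optimizer_steps(num_examples=213, batch_size=1, grad_accum=16, epochs=3)
print(total)         # 39
print(total >= 100)  # False
```

If that estimate holds, lowering `eval_steps` and `save_steps` below the total step count (or evaluating once per epoch) would be needed for validation metrics to be recorded.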


## Step 5: Save the Fine-Tuned Model

In [None]:
model.save_pretrained("./fine_tuned_llama_8b_qlora")
tokenizer.save_pretrained("./fine_tuned_llama_8b_qlora")
torch.cuda.empty_cache()

## Step 6: Evaluate the Model
This step evaluates the fine-tuned model on the tokenized test data.
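
For reference, the metrics reported in this step can be computed by hand from the four confusion-matrix counts. The sketch below uses made-up toy labels (1 = QI, 0 = Non-QI), not results from this project, and the function name is hypothetical.

```python
def binary_metrics(true_labels, predictions):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy data for illustration only
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
print(binary_metrics(toy_true, toy_pred))
```

Here precision is the share of columns flagged as QIs that really are QIs, and recall is the share of true QIs that were flagged, which is often the more important of the two in a privacy review.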

In [None]:
def evaluate_model(model_path, tokenizer_path, tokenized_test_data, batch_size=1, device='auto'):
    # Automatically choose device (GPU if available, otherwise CPU)
    if device == 'auto':
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    # Load model configuration for PEFT (Parameter-Efficient Fine-Tuning)
    peft_config = PeftConfig.from_pretrained(model_path)

    # Set up bitsandbytes quantization configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,  # Use 4-bit quantization to reduce model size and memory usage
        bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save additional memory
        bnb_4bit_quant_type="nf4",  # Set the type of quantization to NF4
        bnb_4bit_compute_dtype=torch.bfloat16  # Use bfloat16 for computation to save memory while maintaining precision
    )

    # Load the base model with quantization and device mapping
    base_model = AutoModelForCausalLM.from_pretrained(
        peft_config.base_model_name_or_path,
        quantization_config=bnb_config,  # Apply the bitsandbytes quantization configuration
        device_map=device,  # Use the specified device for model components
        low_cpu_mem_usage=True  # Optimize for low CPU memory usage
    )

    # Load the fine-tuned model using the base model and PEFT configuration
    model = PeftModel.from_pretrained(base_model, model_path, device_map=device)
    model.eval()  # Set the model to evaluation mode

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    tokenizer.padding_side = "left"  # Ensure left-padding for consistency
    tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS token

    # Prepare the test data as tensors
    input_ids = torch.tensor(tokenized_test_data['input_ids'])
    attention_mask = torch.tensor(tokenized_test_data['attention_mask'])

    # Create a DataLoader for batched processing
    dataset = TensorDataset(input_ids, attention_mask)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

    true_labels = []
    predictions = []

    with torch.no_grad():  # Disable gradient calculation for evaluation
        for batch_input_ids, batch_attention_mask in dataloader:
            batch_input_ids = batch_input_ids.to(device)
            batch_attention_mask = batch_attention_mask.to(device)

            # With left padding, every sequence ends with [answer token, EOS], so the true label
            # is the second-to-last token (this assumes " Yes" and " No" are single tokens)
            yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
            batch_true_labels = [1 if ids[-2].item() == yes_id else 0 for ids in batch_input_ids]
            true_labels.extend(batch_true_labels)

            # Hold the answer and EOS out of the model input so the expected answer is not leaked
            prompt_ids = batch_input_ids[:, :-2]
            prompt_mask = batch_attention_mask[:, :-2]

            # Generate model outputs
            outputs = model.generate(
                input_ids=prompt_ids,
                attention_mask=prompt_mask,
                max_new_tokens=5,  # Limit the number of new tokens generated
                num_return_sequences=1,  # Generate one sequence per input
                do_sample=False,  # Disable sampling to use greedy decoding
                pad_token_id=tokenizer.eos_token_id  # Use EOS token for padding
            )

            # Decode only the newly generated tokens
            new_tokens = outputs[:, prompt_ids.shape[1]:]
            batch_pred_texts = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

            # Determine predictions from the start of the generated text
            batch_predictions = [1 if text.strip().startswith("Yes") else 0 for text in batch_pred_texts]
            predictions.extend(batch_predictions)

            # Print the first few examples in each batch for debugging
            input_texts = tokenizer.batch_decode(prompt_ids, skip_special_tokens=True)
            for i, (input_text, true_label, pred_text, pred) in enumerate(zip(input_texts, batch_true_labels, batch_pred_texts, batch_predictions)):
                if i < 5:
                    print(f"Input: {input_text.strip()}")
                    print(f"True Label (Ground Truth): {'Yes (QI)' if true_label == 1 else 'No (Non-QI)'}")
                    print(f"Model Output: {pred_text.strip()}")
                    print(f"Model Prediction: {'Yes (QI)' if pred == 1 else 'No (Non-QI)'}")
                    print("---")

    # Calculate evaluation metrics
    accuracy = accuracy_score(true_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='binary', zero_division=0)

    class_report = classification_report(true_labels, predictions, target_names=['Non-QI', 'QI'], zero_division=0)
    cm = confusion_matrix(true_labels, predictions)

    # Print evaluation results
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print("\nClassification Report:")
    print(class_report)
    print("\nConfusion Matrix:")
    print(cm)

    return accuracy, precision, recall, f1

# Evaluate the fine-tuned model on the held-out test split
model_path = "./fine_tuned_llama_8b_qlora"
tokenizer_path = "./fine_tuned_llama_8b_qlora"
tokenized_test_data = tokenized_datasets["test"]

accuracy, precision, recall, f1 = evaluate_model(model_path, tokenizer_path, tokenized_test_data)

Using device: cuda


Input: A Quasi-Identifier (QI) is an attribute or a combination of attributes that can potentially identify an individual when combined with other information.

Common examples of Quasi-Identifiers include:
1. Demographic information: age, date of birth, gender, sex, ethnicity, race
2. Location data: zip code, postal code, address, city, state, country
3. Socioeconomic indicators: occupation, education level, income bracket
4. Health-related information: height, weight, blood type
5. Temporal data: dates of significant events (e.g., admission date, discharge date)
6. Family-related information: marital status, number of children
7. Professional details: job title, workplace, years of experience

Note that some attributes might not be QIs on their own, but can become QIs when combined with other information.


Classify if the following column contains a Quasi-Identifier:
Dataset: Predict Students' Dropout and Academic Success
Column: Curricular units 1st sem (approved)
Sample Data: 0,6,
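
The metrics returned by `evaluate_model` all derive from the four cells of the confusion matrix. The sketch below recomputes them by hand on a toy example so the relationship is explicit. The function name `binary_metrics` and the toy labels are hypothetical and are not part of the notebook; the zero-denominator handling is meant to mirror the `zero_division=0` setting used above.

```python
def binary_metrics(true_labels, predictions):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = QI)."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # QI predicted as QI
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Non-QI predicted as QI
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # QI predicted as Non-QI
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # Non-QI predicted as Non-QI

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Fall back to 0.0 when a denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true labels, columns are predicted labels (Non-QI, QI)
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels for illustration only; these are not results from the model
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(toy_true, toy_pred)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

For a QI detector, recall is often the metric to watch: a missed quasi-identifier (false negative) leaves a re-identification risk in the data, while a false positive only costs a manual review.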

### Conclusion

This project fine-tuned Meta Llama 3-8B with Low-Rank Adaptation (LoRA) and bitsandbytes quantization to produce a model that identifies quasi-identifiers in datasets. These optimizations let training and evaluation run on consumer hardware. We trained on a diverse, manually labeled dataset. The model scored highly on the evaluation metrics (accuracy, precision, recall, and F1), which suggests it is reliable for QI detection tasks. The project addresses a gap in privacy engineering tools and offers a scalable way to identify quasi-identifiers across datasets, contributing to improved data privacy and security practices. Because this is our team's first attempt at this kind of model, these results should be taken with a grain of salt and properly peer-reviewed.

The main challenge in this project was working within hardware and time constraints. Even with a high-end consumer GPU, fine-tuning even a smaller LLM demands significant resources. Many training and evaluation runs had to be restarted after running out of memory, which motivated the optimizations for consumer hardware. Successful runs were also slow, which limited how quickly we could test new configurations.
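
To give a rough sense of why memory was the bottleneck, the sketch below estimates the space needed just to hold the weights of an 8-billion-parameter model at different precisions. It is back-of-the-envelope arithmetic under stated assumptions: the parameter count is the nominal "8B" figure, the helper `weight_memory_gib` is hypothetical, and the estimate covers weights only. Activations, gradients, optimizer state, LoRA adapter weights, and quantization overhead all add to the real footprint.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: nominal parameter count for an "8B" model
NUM_PARAMS = 8_000_000_000

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label} weights: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is why quantization can move a model of this size from beyond the memory of a typical consumer GPU to within it.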


### Libraries Used in This Virtual Environment

The following libraries and their versions are used in the virtual environment created for this Jupyter Notebook: