# LLAMA3 Fine-tuning for text classification using QLORA

This notebook is an implementation of:
https://arxiv.org/abs/2305.14314

This notebook is derived from:
https://github.com/adidror005/youtube-videos/blob/main/LLAMA_3_Fine_Tuning_for_Sequence_Classification_Actual_Video.ipynb

The problem and dataset are adapted from:
https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners

### Requirements:
* A GPU with at least 12 GB of VRAM; an NVIDIA Tesla T4 or better should work.

### Installs
* The upstream notebook suggests using the latest version of transformers.
* Restart the kernel after installing, because the accelerate package used by the Hugging Face Trainer requires it.

### Google Colab
For convenience, open this notebook in Google Colab with this link:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkyamog/ml-experiments/blob/main/fine-tuning-qlora/LLAMA_3_Fine_Tuning_for_Sequence_Classification.ipynb)

Add the following to Colab's secrets:
* huggingface - Hugging Face token to access models
* kaggle_user - Kaggle username from Settings > Account > API
* kaggle_key - Kaggle key from Settings > Account > API

In [1]:
# Install Pytorch
#%pip install "torch==2.2.2" tensorboard

# Install Hugging Face libraries
#%pip install  --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"


In [2]:
%pip list | grep "transformers\\|datasets\\|accelerate\\|evaluate\\|bitsandbytes\\|huggingface-hub\\|trl\\|peft"

accelerate                1.5.2
bitsandbytes              0.45.3
datasets                  3.4.1
evaluate                  0.4.3
huggingface-hub           0.29.3
peft                      0.14.0
transformers              4.49.0
trl                       0.15.2
Note: you may need to restart the kernel to use updated packages.


### Big Picture Overview of Parameter Efficient Fine Tuning Methods like LoRA and QLoRA Fine Tuning for Sequence Classification

**The Essence of Fine-tuning**
- LLMs are pre-trained on vast amounts of data for broad language understanding.
- Fine-tuning is crucial for specializing in specific domains or tasks, involving adjustments with smaller, relevant datasets.

**Model Fine-tuning with PEFT: Exploring LoRA and QLoRA**
- Traditional fine-tuning is resource-intensive; PEFT (Parameter Efficient Fine-tuning) makes the process faster and less demanding.
- Focus on two PEFT methods: LoRA and QLoRA.

**The Power of PEFT**
- PEFT trains only a small number of parameters (a subset of the model's weights, or small added modules), which speeds up training and reduces memory demands, making fine-tuning feasible on less powerful devices.

**LoRA: Efficiency through Adapters**
- **Low-Rank Adaptation (LoRA):** Injects small trainable adapters into the pre-trained model.
- **Equation:** For a weight matrix $W$, LoRA parameterizes the fine-tuned weights as $W = W_0 + BA$, where $W_0$ is the frozen original weight matrix and $BA$ is a low-rank update built from trainable matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$.
- Adapters learn task nuances while keeping the majority of the LLM unchanged, minimizing overhead.
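
To make the equation concrete, here is a minimal NumPy sketch of a LoRA forward pass on a toy layer. The sizes, the random initialization, and the `alpha / r` scaling factor (the convention used by common LoRA implementations, not shown in the equation above) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 8, 6, 2, 4   # toy sizes; real layers are far larger

W0 = rng.normal(size=(d_out, d_in))  # frozen pre-trained weight
A = rng.normal(size=(r, d_in))       # trainable, randomly initialized
B = np.zeros((d_out, r))             # trainable, zero-initialized so BA starts at 0


def lora_forward(x, W0, A, B, alpha, r):
    """Compute W0 x + (alpha / r) * B (A x) without materializing BA."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=(d_in,))

# With B = 0 the adapter is a no-op, so the output equals the base layer's.
assert np.allclose(lora_forward(x, W0, A, B, alpha, r), W0 @ x)

# After B changes, the result matches the merged weight W0 + (alpha / r) * BA.
B = rng.normal(size=(d_out, r))
W_merged = W0 + (alpha / r) * (B @ A)
assert np.allclose(lora_forward(x, W0, A, B, alpha, r), W_merged @ x)

full_params = W0.size           # parameters in the full weight matrix
lora_params = A.size + B.size   # parameters actually trained by LoRA
print(full_params, lora_params)
```

On real layers the gap is much larger, because `r` stays small while the layer dimensions grow.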

**QLoRA: Compression and Speed**
- **Quantized LoRA (QLoRA):** Extends LoRA by quantizing the frozen base model's weights to 4 bits, further reducing memory use during fine-tuning.
- **Innovations in QLoRA:**
  1. **4-bit Quantization:** Uses a 4-bit data type, NormalFloat (NF4), for optimal weight quantization, drastically reducing memory usage.
  2. **Low-Rank Adapters:** Fine-tuned with 16-bit precision to effectively capture task-specific nuances.
  3. **Double Quantization:** Reduces quantization constants from 32-bit to 8-bit, saving additional memory without accuracy loss.
  4. **Paged Optimizers:** Manages memory efficiently during training, optimizing for large tasks.
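
The memory effect of double quantization can be checked with a little arithmetic. The sketch below uses the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block); the 8-billion parameter count is a round number assumed for illustration.

```python
BLOCK = 64     # weights sharing one first-level quantization constant
BLOCK2 = 256   # first-level constants sharing one second-level constant

# Without double quantization: one 32-bit constant per block of 64 weights.
overhead_plain = 32 / BLOCK                       # bits per parameter

# With double quantization: 8-bit constants, plus one 32-bit constant
# for every 256 of them.
overhead_dq = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # bits per parameter

n_params = 8e9  # assumed round parameter count for an 8B model


def gigabytes(bits_per_param):
    """Convert a per-parameter bit cost into total gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


print(f"4-bit weights:            {gigabytes(4):.3f} GB")
print(f"constants, plain:         {gigabytes(overhead_plain):.3f} GB")
print(f"constants, double quant.: {gigabytes(overhead_dq):.3f} GB")
```

The saving is small next to the 4-bit weights themselves, but it is essentially free.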

**Why PEFT Matters**
- **Rapid Learning:** Speeds up model adaptation.
- **Smaller Footprint:** Eases deployment with reduced model size.
- **Edge-Friendly:** Fits better on devices with limited resources, enhancing accessibility.

**Conclusion**
- PEFT methods like LoRA and QLoRA revolutionize LLM fine-tuning by focusing on efficiency, facilitating faster adaptability, smaller models, and broader device compatibility.




### Fine-tuning for Text Classification:


#### 1. Text Generation with Classification Label as part of text
- **Approach**: Train the model to generate text that naturally appends the classification label at the end.
- **Input**: "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
- **Output**: "Lorem ipsum dolor sit amet, consectetur adipiscing elit 0.25"
- **Use Case**: This method is useful for classifying text with a generative model, without adding a classification head.


#### 2. Sequence Classification Head
- **Approach**: Add a sequence classification head (a linear layer) on top of the Llama transformer. This setup is similar to GPT-2 and classifies the sequence based on its last relevant token.
    - **Token Positioning**:
        - **With pad_token_id**: The model identifies and ignores padding tokens, using the last non-padding token for classification.
        - **Without pad_token_id**: It defaults to the last token in each sequence.
        - **inputs_embeds**: If embeddings are directly passed (without input_ids), the model cannot identify padding tokens and takes the last embedding in each sequence as the input for classification.
- **Input**: Specific sentences (e.g., "Lorem ipsum dolor sit amet, consectetur adipiscing elit").
- **Output**: Direct classification (e.g., "0.25", "0.50").
- **Training Objective**: Minimize cross-entropy loss between the predicted and the actual class labels.
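
The token-positioning rule can be sketched in NumPy. This mirrors the "rightmost token that is not the pad token" logic; the pad id `0`, the toy batch, and the fake logits are made up for illustration.

```python
import numpy as np


def last_non_pad_index(input_ids, pad_token_id):
    """Index of the rightmost non-pad token in each row (left or right padding)."""
    input_ids = np.asarray(input_ids)
    non_pad_mask = (input_ids != pad_token_id).astype(np.int32)
    token_indices = np.arange(input_ids.shape[-1])
    # Pad positions are zeroed out, so argmax picks the largest surviving index.
    return (token_indices * non_pad_mask).argmax(-1)


PAD = 0  # hypothetical pad token id
batch = np.array([
    [5, 6, 7, PAD, PAD],    # right-padded
    [PAD, PAD, 8, 9, 10],   # left-padded
])
idx = last_non_pad_index(batch, PAD)

# Fake per-token logits of shape (batch, seq_len, num_labels).
logits = np.arange(2 * 5 * 3).reshape(2, 5, 3)
pooled = logits[np.arange(batch.shape[0]), idx]   # one row of logits per sequence
print(idx, pooled.shape)
```

Only the pooled row for each sequence is compared against the label; logits at other positions are ignored.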

https://huggingface.co/docs/transformers/main/en/model_doc/llama

### PEFT Configs
* bitsandbytes config for quantization
* LoRA config for the adapters

### Going to use the Hugging Face Transformers Trainer class: main components
* Hugging Face dataset (for train + eval)
* Data collator
* Compute metrics
* Class weights, since we use a custom trainer with a custom weighted loss
* TrainingArguments: number of epochs, learning rate, weight decay, etc.



### Log in to the Hugging Face Hub with your token so we can access the pre-trained Llama 3 8B model

In [3]:
#device_map="DDP" # for DDP and running with `accelerate launch test_sft.py`
device_map='auto' # for PP and running with `python test_sft.py`

if device_map == "DDP":
    # PartialState comes from accelerate and reports this process's index
    from accelerate import PartialState
    device_string = PartialState().process_index
    device_map={'':device_string}

In [4]:
#from google.colab import userdata
#hugginface_token = userdata.get('huggingface')
#!huggingface-cli login --token $hugginface_token

###### Imports

In [5]:
import os
import random
import functools
import csv
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
import evaluate

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix, classification_report, balanced_accuracy_score, accuracy_score

from scipy.stats import pearsonr
from datasets import Dataset, DatasetDict
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)


The easiest way to download Kaggle datasets is to use the Kaggle API. You can install this using pip by running this in a notebook cell:

!pip install kaggle
You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, and choose My Account, then click Create New API Token. This will save a file called kaggle.json to your PC. You need to copy this key to your GPU server. To do so, open the file you downloaded, copy the contents, and paste them in the following cell (e.g., creds = '{"username":"xxx","key":"xxx"}'):


In [6]:
#creds = '{"username":"' + userdata.get('kaggle_user') + '","key":"' + userdata.get('kaggle_key') + '"}'

Then execute this cell (this only needs to be run once):

In [7]:
#!pip install kaggle

import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

iskaggle

from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
  cred_path.parent.mkdir(exist_ok=True)
  cred_path.write_text(creds)
  cred_path.chmod(0o600)

cred_path

PosixPath('/home/kotech/.kaggle/kaggle.json')

Now you can download datasets from Kaggle.

And use the Kaggle API to download the dataset to that path, and extract it:

In [8]:
from pathlib import Path
path = Path('us-patent-phrase-to-phrase-matching')

if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

In [9]:
from pathlib import Path
path = Path('us-patent-phrase-to-phrase-matching')

!ls {path}

df = pd.read_csv(path/'train.csv')



sample_submission.csv  test.csv  train.csv


* Also add a numeric version (0, 1, 2, 3, 4) of the label, since we will need it later for fine-tuning. We save it in 'score_category'.

In [10]:
df['score_ascat']=df['score'].astype('category')
df['score_category']=df['score_ascat'].cat.codes

df

Unnamed: 0,id,anchor,target,context,score,score_ascat,score_category
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50,0.50,2
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75,0.75,3
2,36d72442aefd8232,abatement,active catalyst,A47,0.25,0.25,1
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50,0.50,2
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00,0.00,0
...,...,...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00,1.00,4
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50,0.50,2
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50,0.50,2
36471,756ec035e694722b,wood article,wooden material,B44,0.75,0.75,3


* Keep the category values so that class codes can be decoded back to scores later.

In [11]:
df['score_ascat'].cat.categories
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,8d135da0b55b8c88,component composite coating,composition,H01
freq,1,152,24,2186


In [12]:
category_map = {code: category for code, category in enumerate(df['score_ascat'].cat.categories)}
category_map

{0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}
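
As a small illustration of the decoding step, the mapping above can turn predicted class codes back into similarity scores. The `decode_codes` helper and the sample predictions are hypothetical and not part of the notebook.

```python
# Same mapping as printed above, written out so the snippet runs on its own.
category_map = {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}


def decode_codes(codes, mapping=category_map):
    """Map integer class codes back to the original score values."""
    return [mapping[int(c)] for c in codes]


predicted_codes = [2, 4, 0, 1]   # made-up model predictions
scores = decode_codes(predicted_codes)
print(scores)
```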

### Convert from Pandas DataFrame to Hugging Face Dataset
* Also let's shuffle the training set.
* We put the components train,val,test into a DatasetDict so we can access them later with HF trainer.
* Later we will add a tokenized dataset


In [13]:
df_test = pd.read_csv(path/'test.csv')

train_size = 0.8 # 80% of data
test_size = 0.2 # 20% of data
df_train, df_val = train_test_split(pd.read_csv(path/'train.csv'), train_size=train_size, test_size=test_size, random_state=42)

def generate_features(df):
  df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
  if 'score' in df.columns:
    df['score_ascat']=df['score'].astype('category')
    df['score_category']=df['score_ascat'].cat.codes
  else:
    df['score_category'] = pd.NA

generate_features(df_train)
generate_features(df_val)


# Converting pandas DataFrames into Hugging Face Dataset objects:
dataset_train = Dataset.from_pandas(df_train.drop(['score_ascat', 'score'],axis=1).reset_index(drop=True))
dataset_val = Dataset.from_pandas(df_val.drop(['score_ascat', 'score'],axis=1).reset_index(drop=True))

In [14]:
# Combine them into a single DatasetDict
dataset = DatasetDict({
    'train': dataset_train,
    'val': dataset_val,
})
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'input', 'score_category'],
        num_rows: 29178
    })
    val: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'input', 'score_category'],
        num_rows: 7295
    })
})

* Since our classes are not balanced let's calculate class weights based on inverse value counts
* Convert to pytorch tensor since we will need it

In [15]:
df_train.score_category.value_counts(normalize=True)

score_category
2    0.337994
1    0.315854
0    0.204538
3    0.109500
4    0.032113
Name: proportion, dtype: float64

In [16]:
class_weights=(1/df_train.score_category.value_counts(normalize=True).sort_index()).tolist()
class_weights=torch.tensor(class_weights)
class_weights=class_weights/class_weights.sum()
class_weights


tensor([0.0953, 0.0617, 0.0577, 0.1781, 0.6072])
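
The weights above can be reproduced by hand from the class proportions printed earlier. This plain-Python sketch takes the inverse of each proportion and normalizes so that the weights sum to 1; the proportions are copied from the `value_counts` output.

```python
# Class proportions from the value_counts output, keyed by class code.
proportions = {0: 0.204538, 1: 0.315854, 2: 0.337994, 3: 0.109500, 4: 0.032113}

# Inverse frequency: rarer classes get larger raw weights.
inverse = [1.0 / proportions[c] for c in sorted(proportions)]

# Normalize so the weights sum to 1.
total = sum(inverse)
weights = [w / total for w in inverse]

print([round(w, 4) for w in weights])
```

The rarest class (code 4, score 1.0) ends up with roughly ten times the weight of the most common one.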

## Load the Llama model with 4-bit quantization as specified in the bitsandbytes config and prepare the model for PEFT training

### Model Name

In [17]:
model_name = "meta-llama/Meta-Llama-3-8B"

#### Quantization Config (for QLORA)

In [18]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, # enable 4-bit quantization
    bnb_4bit_quant_type = 'nf4', # information theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant = True, # quantize quantized weights //insert xzibit meme
    bnb_4bit_compute_dtype = torch.bfloat16 # optimized fp format for ML
)


#### Lora Config

In [19]:
lora_config = LoraConfig(
    r = 16, # the dimension of the low-rank matrices
    lora_alpha = 8, # scaling factor for LoRA activations vs pre-trained weight activations
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout = 0.05, # dropout probability of the LoRA layers
    bias = 'none', # whether to train bias weights, set to 'none' for attention layers
    task_type = 'SEQ_CLS'
)
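
For a sense of scale, the sketch below estimates how many adapter weights this config would add if it were applied. It assumes the projection shapes shown in the model printout further down (hidden size 4096, key/value width 1024, 32 layers) and counts the LoRA matrices only, not the classification head.

```python
r, lora_alpha = 16, 8
num_layers = 32

# (in_features, out_features) of each targeted projection, per decoder layer.
shapes = {
    'q_proj': (4096, 4096),
    'k_proj': (4096, 1024),
    'v_proj': (4096, 1024),
    'o_proj': (4096, 4096),
}


def lora_param_count(in_features, out_features, rank):
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
scaling = lora_alpha / r   # factor applied to the low-rank update

print(per_layer, total, scaling)
```

That is about 13.6 million adapter weights, a small fraction of the roughly 8 billion in the base model.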

#### Load model
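* A custom `LlamaForSequenceClassification2`, modeled on the class that `AutoModelForSequenceClassification` would load, with one extra decoder layer (`outlayers`) before the classification head
* `num_labels` is the number of classes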


In [20]:
from transformers.models.llama import LlamaPreTrainedModel
from typing import Callable, List, Optional, Tuple, Union
from transformers.cache_utils import Cache, DynamicCache, StaticCache
from transformers.modeling_outputs import (
    #BaseModelOutputWithPast,
    #CausalLMOutputWithPast,
    #QuestionAnsweringModelOutput,
    SequenceClassifierOutputWithPast,
    #TokenClassifierOutput,
)
from transformers.models.llama import LlamaModel
from transformers.utils import logging
from torch import nn
from transformers.models.llama.modeling_llama import LlamaDecoderLayer, LlamaRotaryEmbedding

# forward() calls logger.warning_once when inputs_embeds is passed without input_ids
logger = logging.get_logger(__name__)

class LlamaForSequenceClassification2(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size
        
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        
        self.out_num_layers = 1
        self.outlayers = nn.ModuleList(
            [LlamaDecoderLayer(config, layer_idx+config.num_hidden_layers) for layer_idx in range(self.out_num_layers)]
        )
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
        self.rotary_emb = self.model.rotary_emb

        # Initialize weights and apply final processing
        self.post_init()
        
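        # from_pretrained() loads the checkpoint after __init__, so this copies the
        # freshly initialized weights; the copy is repeated once the checkpoint is loaded.
        self.outlayers[0].load_state_dict(self.model.layers[-1].state_dict())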

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        '''
        if inputs_embeds is None:
            inputs_embeds = self.model.embed_tokens(input_ids)
        
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        if use_cache and past_key_values is None:
            past_key_values = DynamicCache()
            
        cache_position = None
        if cache_position is None:
            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
            cache_position = torch.arange(
                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
            )
            
        if position_ids is None:
            position_ids = cache_position.unsqueeze(0)
        '''

        transformer_outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            #cache_position = cache_position,
        )
        hidden_states = transformer_outputs[0]

        #use_cache = use_cache if use_cache is not None else self.config.use_cache
        use_cache = False
        past_key_values = None
        
        if use_cache and past_key_values is None:
            past_key_values = DynamicCache()
            
        cache_position = None
        if cache_position is None:
            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
            cache_position = torch.arange(
                past_seen_tokens, past_seen_tokens + hidden_states.shape[1], device=hidden_states.device
            )
            
        if position_ids is None:
            position_ids = cache_position.unsqueeze(0)
        
        causal_mask = self.model._update_causal_mask(
            attention_mask, hidden_states, cache_position, past_key_values, output_attentions
        )

        # create position embeddings to be shared across the decoder layers
        #position_embeddings = self.rotary_emb(inputs_embeds, position_ids)
        position_embeddings = self.rotary_emb(hidden_states, position_ids)
        #print('input shape =', hidden_states.shape)
        #print('hidden_states =', hidden_states)
        #print('causal_mask =', causal_mask)
        #print('position_ids =', position_ids)
        #print('past_key_values=', past_key_values)
        #print('output_attentions=', output_attentions)
        #print('use_cache=', use_cache)
        #print('cache_position=', cache_position)
        #print('position_embeddings=', position_embeddings)

        #print('input shape =', hidden_states.shape)
        for decoder_layer in self.outlayers[: self.out_num_layers]:
            layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=causal_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_values,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                    cache_position=cache_position,
                    position_embeddings=position_embeddings,
                    #**flash_attn_kwargs,
                )
            hidden_states = layer_outputs[0]
        #print('output shape =', hidden_states.shape)
        #print('hidden_states =', hidden_states)
        hidden_states = self.model.norm(hidden_states)
        '''
        '''
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]
            
        #print('config.pad_token_id =', self.config.pad_token_id)
        #print('batch_size =', batch_size)

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            last_non_pad_token = -1
        elif input_ids is not None:
            # To handle both left- and right- padding, we take the rightmost token that is not equal to pad_token_id
            non_pad_mask = (input_ids != self.config.pad_token_id).to(logits.device, torch.int32)
            token_indices = torch.arange(input_ids.shape[-1], device=logits.device)
            last_non_pad_token = (token_indices * non_pad_mask).argmax(-1)
        else:
            last_non_pad_token = -1
            logger.warning_once(
                f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
                "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
            )

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), last_non_pad_token]
        #print('pooled_logits.shape =', pooled_logits.shape)
        #print('pooled_logits =', pooled_logits)

        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, pooled_logits=pooled_logits, config=self.config)

        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        #raise

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
        


In [21]:
model = LlamaForSequenceClassification2.from_pretrained(
    model_name,
    #quantization_config=quantization_config,
    num_labels=len(category_map),
    device_map=device_map,
)

model

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification2 were not initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B and are newly initialized: ['outlayers.0.input_layernorm.weight', 'outlayers.0.mlp.down_proj.weight', 'outlayers.0.mlp.gate_proj.weight', 'outlayers.0.mlp.up_proj.weight', 'outlayers.0.post_attention_layernorm.weight', 'outlayers.0.self_attn.k_proj.weight', 'outlayers.0.self_attn.o_proj.weight', 'outlayers.0.self_attn.q_proj.weight', 'outlayers.0.self_attn.v_proj.weight', 'score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


LlamaForSequenceClassification2(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-0

* `prepare_model_for_kbit_training()` preprocesses a quantized model for training. It is commented out below because the model is loaded without quantization in this run.

In [22]:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)

model.embed_tokens.weight
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.o_proj.weight
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.down_proj.weight
model.layers.0.input_layernorm.weight
model.layers.0.post_attention_layernorm.weight
model.layers.1.self_attn.q_proj.weight
model.layers.1.self_attn.k_proj.weight
model.layers.1.self_attn.v_proj.weight
model.layers.1.self_attn.o_proj.weight
model.layers.1.mlp.gate_proj.weight
model.layers.1.mlp.up_proj.weight
model.layers.1.mlp.down_proj.weight
model.layers.1.input_layernorm.weight
model.layers.1.post_attention_layernorm.weight
model.layers.2.self_attn.q_proj.weight
model.layers.2.self_attn.k_proj.weight
model.layers.2.self_attn.v_proj.weight
model.layers.2.self_attn.o_proj.weight
model.layers.2.mlp.gate_proj.weight
model.layers.2.mlp.up_proj.weight
model.layers.2.mlp.down_proj.weight
model.layers.2.inp

In [23]:
#model = prepare_model_for_kbit_training(model)
#model

In [24]:
for param in model.parameters():
    param.requires_grad = False
for param in model.score.parameters():
    param.requires_grad = True
#for param in model.score1.parameters():
#    param.requires_grad = True
#for param in model.score2.parameters():
#    param.requires_grad = True
for param in model.outlayers.parameters():
    param.requires_grad = True

In [25]:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)

outlayers.0.self_attn.q_proj.weight
outlayers.0.self_attn.k_proj.weight
outlayers.0.self_attn.v_proj.weight
outlayers.0.self_attn.o_proj.weight
outlayers.0.mlp.gate_proj.weight
outlayers.0.mlp.up_proj.weight
outlayers.0.mlp.down_proj.weight
outlayers.0.input_layernorm.weight
outlayers.0.post_attention_layernorm.weight
score.weight
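
A rough count of the trainable weights listed above, using the layer shapes from the model printout and the five classes of this task. This is back-of-the-envelope arithmetic, not output from the notebook.

```python
import math

hidden, kv_dim, mlp_dim, num_labels = 4096, 1024, 14336, 5

# Shapes (out_features, in_features) of every parameter left trainable.
trainable_shapes = {
    'outlayers.0.self_attn.q_proj.weight': (hidden, hidden),
    'outlayers.0.self_attn.k_proj.weight': (kv_dim, hidden),
    'outlayers.0.self_attn.v_proj.weight': (kv_dim, hidden),
    'outlayers.0.self_attn.o_proj.weight': (hidden, hidden),
    'outlayers.0.mlp.gate_proj.weight': (mlp_dim, hidden),
    'outlayers.0.mlp.up_proj.weight': (mlp_dim, hidden),
    'outlayers.0.mlp.down_proj.weight': (hidden, mlp_dim),
    'outlayers.0.input_layernorm.weight': (hidden,),
    'outlayers.0.post_attention_layernorm.weight': (hidden,),
    'score.weight': (num_labels, hidden),
}

trainable = sum(math.prod(shape) for shape in trainable_shapes.values())
print(f"{trainable:,} trainable parameters")
```

One full decoder layer plus the head is about 218 million weights, so this setup trains far more parameters than a rank-16 LoRA on the attention projections would.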


* `get_peft_model` prepares a model for training with a PEFT method such as LoRA by wrapping the base model with the PEFT configuration. It is commented out below; this run instead trains only `outlayers` and `score`.

In [26]:
#model = get_peft_model(model, lora_config)
#model

In [27]:
model.model.layers[31].state_dict()

OrderedDict([('self_attn.q_proj.weight',
              tensor([[ 8.3618e-03, -3.4790e-03,  7.8735e-03,  ..., -6.8665e-04,
                       -1.4114e-03, -1.9684e-03],
                      [ 3.1128e-03, -1.5717e-03,  1.3428e-02,  ..., -1.5991e-02,
                       -3.8452e-03,  1.1902e-03],
                      [ 7.1716e-03,  1.2207e-02, -1.0376e-02,  ...,  2.9504e-06,
                       -2.6733e-02,  3.2501e-03],
                      ...,
                      [-6.0425e-03,  1.6846e-02,  1.4221e-02,  ..., -1.0132e-02,
                       -1.6602e-02,  2.4048e-02],
                      [-1.7944e-02, -2.4902e-02,  1.0803e-02,  ..., -3.4424e-02,
                        2.3956e-03, -2.2217e-02],
                      [-3.8757e-03, -1.7090e-02,  1.1414e-02,  ...,  2.1973e-03,
                       -5.9891e-04, -6.9427e-04]], device='cuda:3')),
             ('self_attn.k_proj.weight',
              tensor([[ 0.0126, -0.0272,  0.0303,  ...,  0.0481, -0.0062,  0.0065],
 

In [28]:
model.outlayers[0].load_state_dict(model.model.layers[31].state_dict())

<All keys matched successfully>

In [29]:
for param in model.outlayers.parameters():
    print(param)

Parameter containing:
tensor([[ 8.3618e-03, -3.4790e-03,  7.8735e-03,  ..., -6.8665e-04,
         -1.4114e-03, -1.9684e-03],
        [ 3.1128e-03, -1.5717e-03,  1.3428e-02,  ..., -1.5991e-02,
         -3.8452e-03,  1.1902e-03],
        [ 7.1716e-03,  1.2207e-02, -1.0376e-02,  ...,  2.9504e-06,
         -2.6733e-02,  3.2501e-03],
        ...,
        [-6.0425e-03,  1.6846e-02,  1.4221e-02,  ..., -1.0132e-02,
         -1.6602e-02,  2.4048e-02],
        [-1.7944e-02, -2.4902e-02,  1.0803e-02,  ..., -3.4424e-02,
          2.3956e-03, -2.2217e-02],
        [-3.8757e-03, -1.7090e-02,  1.1414e-02,  ...,  2.1973e-03,
         -5.9891e-04, -6.9427e-04]], device='cuda:3', requires_grad=True)
Parameter containing:
tensor([[ 0.0126, -0.0272,  0.0303,  ...,  0.0481, -0.0062,  0.0065],
        [-0.0266, -0.0155, -0.0435,  ..., -0.0216, -0.0515,  0.0091],
        [ 0.0107,  0.0070,  0.0159,  ..., -0.0048, -0.0025, -0.0025],
        ...,
        [-0.0197,  0.0422, -0.0142,  ..., -0.0164, -0.0437,  0.0

In [30]:
for param in model.outlayers.parameters():
    if param.requires_grad:
        print(param)

Parameter containing:
tensor([[ 8.3618e-03, -3.4790e-03,  7.8735e-03,  ..., -6.8665e-04,
         -1.4114e-03, -1.9684e-03],
        [ 3.1128e-03, -1.5717e-03,  1.3428e-02,  ..., -1.5991e-02,
         -3.8452e-03,  1.1902e-03],
        [ 7.1716e-03,  1.2207e-02, -1.0376e-02,  ...,  2.9504e-06,
         -2.6733e-02,  3.2501e-03],
        ...,
        [-6.0425e-03,  1.6846e-02,  1.4221e-02,  ..., -1.0132e-02,
         -1.6602e-02,  2.4048e-02],
        [-1.7944e-02, -2.4902e-02,  1.0803e-02,  ..., -3.4424e-02,
          2.3956e-03, -2.2217e-02],
        [-3.8757e-03, -1.7090e-02,  1.1414e-02,  ...,  2.1973e-03,
         -5.9891e-04, -6.9427e-04]], device='cuda:3', requires_grad=True)
Parameter containing:
tensor([[ 0.0126, -0.0272,  0.0303,  ...,  0.0481, -0.0062,  0.0065],
        [-0.0266, -0.0155, -0.0435,  ..., -0.0216, -0.0515,  0.0091],
        [ 0.0107,  0.0070,  0.0159,  ..., -0.0048, -0.0025, -0.0025],
        ...,
        [-0.0197,  0.0422, -0.0142,  ..., -0.0164, -0.0437,  0.0

### Load the tokenizer

#### Since the LLAMA3 tokenizer doesn't define a pad token
* Set the pad_token_id to eos_token_id
* Set pad token to eos_token

In [31]:
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)

tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

#### Update some model configs
* Must set `use_cache = False` as below, or training crashes in my experience

In [32]:
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False
model.config.pretraining_tp = 1

# Trainer Components
* model
* tokenizer
* training arguments
* train dataset
* eval dataset
* Data collator
* Compute metrics
* class_weights: to use a weighted loss, we subclass Trainer and define the custom loss there.

#### Create the tokenized dataset that holds our train/val splits during training

In [33]:
MAX_LEN = 512
col_to_delete = ['id', 'anchor', 'context', 'target']

def llama_preprocessing_function(examples):
    return tokenizer(examples['input'], truncation=True, max_length=MAX_LEN)

tokenized_datasets = dataset.map(llama_preprocessing_function, batched=True, remove_columns=col_to_delete)
tokenized_datasets = tokenized_datasets.rename_column("score_category", "label")
tokenized_datasets.set_format("torch")

Map:   0%|          | 0/29178 [00:00<?, ? examples/s]

Map:   0%|          | 0/7295 [00:00<?, ? examples/s]

## Data Collator
A **data collator** prepares batches of data for training or inference in machine learning, ensuring uniform formatting and adherence to model input requirements. This is especially crucial for variable-sized inputs like text sequences.

### Functions of Data Collator

1. **Padding:** Uniformly pads sequences to the length of the longest sequence using a special token, allowing simultaneous batch processing.
2. **Batching:** Groups individual data points into batches for efficient processing.
3. **Handling Special Tokens:** Adds necessary special tokens to sequences.
4. **Converting to Tensor:** Transforms data into tensors, the required format for machine learning frameworks.

### `DataCollatorWithPadding`

The `DataCollatorWithPadding` specifically manages padding, using a tokenizer to ensure that all sequences are padded to the same length for consistent model input.

- **Syntax:** `collate_fn = DataCollatorWithPadding(tokenizer=tokenizer)`
- **Purpose:** Automatically pads text data to the longest sequence in a batch, crucial for models like BERT or GPT.
- **Tokenizer:** Uses the provided `tokenizer` for sequence processing, respecting model-specific vocabulary and formatting rules.

This collator is commonly used with libraries like Hugging Face's Transformers, facilitating data preprocessing for various NLP models.
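
To make the padding step concrete, here is a minimal pure-Python sketch of what dynamic padding does to one batch. The helper name `pad_batch` and the pad id `0` are illustrative assumptions, not the Transformers implementation; the real collator delegates to `tokenizer.pad` and returns tensors. Padding on the right is also an assumption here, since the tokenizer's `padding_side` decides this.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one in the batch (hypothetical helper)."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[5, 6, 7], [8], [9, 10]])
print(padded["input_ids"])       # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(padded["attention_mask"])  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Because the target length is the longest sequence in each batch rather than `MAX_LEN`, short batches stay short, which saves memory and compute.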


In [34]:
collate_fn = DataCollatorWithPadding(tokenizer=tokenizer)


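# Define which metrics to compute for evaluation
* During training we report the Pearson correlation between the predicted class index and the label; balanced accuracy and accuracy are computed after training.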

In [35]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    try:
        # it's a classification task, take the argmax
        predictions_processed = np.argmax(predictions, axis=1)

        # Calculate Pearson correlation
        pearson, _ = pearsonr(predictions_processed, labels)

        return {'pearson': pearson}
    except Exception as e:
        print(f"Error in compute_metrics: {e}")
        return {'pearson': None}
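
For intuition about what `pearsonr` measures here, the following sketch computes the same coefficient by hand in plain Python. The helper name `pearson` and the toy labels are made up for illustration, and the sketch assumes neither list is constant (otherwise the denominator is zero).

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


labels = [0, 1, 2, 3, 4]
predicted = [0, 1, 1, 3, 4]  # toy argmax outputs, one class off for the third example
print(round(pearson(predicted, labels), 4))  # 0.9623
```

The metric is meaningful on argmax outputs only because the class indices are ordered: a higher `score_category` means a higher similarity score.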

### Define custom trainer with class weights
* The custom loss function applies the class weights, which are passed to the constructor as an additional argument.

In [36]:
class CustomTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
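        # Ensure class_weights is a float tensor on the training device
        # (as_tensor avoids the copy-construct warning when a tensor is passed in)
        if class_weights is not None:
            self.class_weights = torch.as_tensor(class_weights, dtype=torch.float32).to(self.args.device)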
        else:
            self.class_weights = None

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Extract labels and convert them to long type for cross_entropy
        labels = inputs.pop("labels").long()

        # Forward pass
        outputs = model(**inputs)

        # Extract logits assuming they are directly outputted by the model
        logits = outputs.get('logits')

        # Compute custom loss with class weights for imbalanced data handling
        if self.class_weights is not None:
            loss = F.cross_entropy(logits, labels, weight=self.class_weights)
        else:
            loss = F.cross_entropy(logits, labels)

        return (loss, outputs) if return_outputs else loss
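
The effect of `weight=` in `F.cross_entropy` can be reproduced by hand. The sketch below, in plain Python, follows the documented behaviour for the default `reduction='mean'`: each example's negative log-likelihood is scaled by the weight of its true class, and the total is divided by the sum of those weights. The function name and the toy numbers are illustrative assumptions.

```python
import math
from typing import List


def weighted_cross_entropy(logits: List[List[float]], labels: List[int],
                           weights: List[float]) -> float:
    """Weighted mean of per-example negative log-likelihoods (hypothetical helper)."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp computed stably by subtracting the row maximum
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[label]
        total += weights[label] * nll
        weight_sum += weights[label]
    return total / weight_sum


logits = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
labels = [0, 1, 1]  # the last example (true class 1) is misclassified
uniform = weighted_cross_entropy(logits, labels, [1.0, 1.0])
rare_up = weighted_cross_entropy(logits, labels, [1.0, 3.0])
print(uniform < rare_up)  # True
```

Raising the weight of class 1 makes the mistake on that class count for more, which is the mechanism the custom trainer relies on for imbalanced data.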


# Define training args

In [37]:
training_args = TrainingArguments(
    output_dir = 'sequence_classification',
    learning_rate = 1e-3,
    #lr_scheduler_type = 'cosine',
    lr_scheduler_type = 'polynomial',
    lr_scheduler_kwargs = { 'power': 2, 'lr_end': 1e-5 },
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 8,
    num_train_epochs = 15,
    weight_decay = 0.01,
    eval_strategy = 'epoch',  # named `evaluation_strategy` before transformers 4.41
    save_strategy = 'epoch',
    load_best_model_at_end = True
)
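
To see what the `polynomial` scheduler with `power = 2` and `lr_end = 1e-5` does to the learning rate, here is a sketch of the decay formula in plain Python. It assumes no warmup, a single device, and no gradient accumulation, so the total step count is `ceil(29178 / 8) * 15`; the function name `polynomial_lr` is hypothetical and this is not the Transformers implementation itself.

```python
import math


def polynomial_lr(step: int, total_steps: int, lr_init: float = 1e-3,
                  lr_end: float = 1e-5, power: float = 2.0) -> float:
    """Decay from lr_init to lr_end following (1 - progress) ** power."""
    if step >= total_steps:
        return lr_end
    remaining = 1.0 - step / total_steps
    return (lr_init - lr_end) * remaining ** power + lr_end


steps_per_epoch = math.ceil(29178 / 8)  # training examples / per-device batch size
total_steps = steps_per_epoch * 15      # num_train_epochs
for epoch in (0, 5, 10, 15):
    lr = polynomial_lr(epoch * steps_per_epoch, total_steps)
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

With a quadratic decay the rate drops quickly early on: a third of the way through training it is already below half of its initial value.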



#### Define custom trainer

In [38]:
trainer = CustomTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['val'],
    tokenizer = tokenizer,
    data_collator = collate_fn,
    compute_metrics = compute_metrics,
    class_weights=class_weights,
)



* https://huggingface.co/docs/transformers/en/training

### Run trainer!

In [None]:
train_result = trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: sdfsdfrr to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Epoch,Training Loss,Validation Loss,Pearson
1,1.5824,1.543643,0.399051
2,1.305,1.196669,0.567383
3,1.1903,1.173575,0.588461
4,1.1037,1.199388,0.618502
5,1.0486,1.06914,0.656459
6,0.9736,0.975858,0.672222
7,0.8678,0.99905,0.679241
8,0.897,0.998553,0.700248
9,0.8662,0.994286,0.676203
10,0.8088,0.987652,0.69485
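
With `load_best_model_at_end = True` and no `metric_for_best_model` set, the Trainer picks the checkpoint by validation loss by default. The sketch below applies both selection rules to the ten epochs logged above to show that they can disagree; the variable names are illustrative.

```python
# (epoch, validation loss, Pearson) copied from the training log above
history = [
    (1, 1.543643, 0.399051),
    (2, 1.196669, 0.567383),
    (3, 1.173575, 0.588461),
    (4, 1.199388, 0.618502),
    (5, 1.069140, 0.656459),
    (6, 0.975858, 0.672222),
    (7, 0.999050, 0.679241),
    (8, 0.998553, 0.700248),
    (9, 0.994286, 0.676203),
    (10, 0.987652, 0.694850),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_pearson = max(history, key=lambda row: row[2])
print("lowest validation loss: epoch", best_by_loss[0])  # epoch 6
print("highest Pearson: epoch", best_by_pearson[0])      # epoch 8
```

If Pearson is the metric that matters, one option is to pass `metric_for_best_model = 'pearson'` and `greater_is_better = True` to `TrainingArguments`.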


In [42]:
train_result = trainer.train()

Epoch,Training Loss,Validation Loss,Pearson
1,1.3592,1.379263,0.538239
2,1.1927,1.203274,0.568286
3,1.0774,1.244954,0.628035
4,1.0665,1.141556,0.629707


KeyboardInterrupt: 

In [45]:
train_result = trainer.train()

Epoch,Training Loss,Validation Loss,Pearson
1,0.9063,1.012215,0.680267
2,0.873,0.997269,0.686252


KeyboardInterrupt: 

In [None]:
train_result = trainer.train()


Epoch,Training Loss,Validation Loss,Pearson
1,2.1312,1.77823,0.561356
2,1.0153,1.050993,0.662856


In [37]:
train_result = trainer.train()


Epoch,Training Loss,Validation Loss,Pearson
1,1.3105,1.272402,0.505867
2,1.1184,1.191806,0.541483


#### Let's check the results
* I wrapped the prediction step in a function that adds a `predictions` column to the dataframe.

In [40]:
def make_predictions(model, df):
    # Convert the input texts to a list
    sentences = df.input.tolist()

    # Define the batch size
    batch_size = 32  # You can adjust this based on your system's memory capacity

    # Initialize an empty list to store the model outputs
    all_outputs = []

    # Switch to evaluation mode so dropout is disabled during inference
    model.eval()

    # Process the sentences in batches
    for i in range(0, len(sentences), batch_size):
        # Get the batch of sentences
        batch_sentences = sentences[i:i + batch_size]

        # Tokenize the batch with the same length limit used for training
        inputs = tokenizer(batch_sentences, return_tensors="pt", padding=True, truncation=True, max_length=MAX_LEN)

        # Move tensors to the device where the model is (e.g., GPU or CPU)
        inputs = {k: v.to('cuda' if torch.cuda.is_available() else 'cpu') for k, v in inputs.items()}

        # Perform inference and store the logits
        with torch.no_grad():
            outputs = model(**inputs)
            all_outputs.append(outputs['logits'])

    # Concatenate the per-batch logits, take the most likely class, and map it back to a score
    final_outputs = torch.cat(all_outputs, dim=0)
    df['predictions'] = final_outputs.argmax(dim=1).cpu().numpy()
    df['predictions'] = df['predictions'].apply(lambda l: category_map[l])




### Analyze performance

In [41]:
def get_performance_metrics(df_test):
  y_test = df_test.score.round()
  y_pred = df_test.predictions.round()
  print(f"comparing test {y_test} and pred {y_pred}")

  print("Confusion Matrix:")
  print(confusion_matrix(y_test, y_pred))

  print("\nClassification Report:")
  print(classification_report(y_test, y_pred))

  print("Balanced Accuracy Score:", balanced_accuracy_score(y_test, y_pred))
  print("Accuracy Score:", accuracy_score(y_test, y_pred))
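
Note that `.round()` collapses the five score levels into two classes, which is why the confusion matrix and report produced by this function are binary. The sketch below uses Python's built-in `round`, which, like NumPy and pandas, rounds halves to the nearest even number, so 0.5 maps to 0. The score levels are the ones in the dataset; the variable names are illustrative.

```python
score_levels = [0.0, 0.25, 0.5, 0.75, 1.0]

# Half-to-even rounding sends 0.5 down to 0, so only 0.75 and 1.0 become class 1
binary = {score: round(score) for score in score_levels}
print(binary)  # {0.0: 0, 0.25: 0, 0.5: 0, 0.75: 1, 1.0: 1}
```

In other words, the evaluation asks whether a pair is a close match (score 0.75 or above) rather than whether the exact score level is right.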

In [42]:
make_predictions(model,df_val)

get_performance_metrics(df_val)
df_val

comparing test 33511    0.0
18670    0.0
18049    0.0
31660    1.0
15573    0.0
        ... 
5040     0.0
33907    1.0
9090     0.0
25999    0.0
22135    0.0
Name: score, Length: 7295, dtype: float64 and pred 33511    0.0
18670    0.0
18049    0.0
31660    0.0
15573    0.0
        ... 
5040     0.0
33907    1.0
9090     0.0
25999    0.0
22135    0.0
Name: predictions, Length: 7295, dtype: float64
Confusion Matrix:
[[5465  779]
 [ 306  745]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.95      0.88      0.91      6244
         1.0       0.49      0.71      0.58      1051

    accuracy                           0.85      7295
   macro avg       0.72      0.79      0.74      7295
weighted avg       0.88      0.85      0.86      7295

Balanced Accuracy Score: 0.7920444730652179
Accuracy Score: 0.8512679917751885


Unnamed: 0,id,anchor,target,context,score,input,score_ascat,score_category,predictions
33511,ed1c4e525eb105fe,transmit alarm,display indicator,G08,0.00,TEXT1: G08; TEXT2: display indicator; ANC1: tr...,0.00,0,0.50
18670,5386316f318f5221,locking formation,retaining element,B60,0.25,TEXT1: B60; TEXT2: retaining element; ANC1: lo...,0.25,1,0.25
18049,1544ca6753fcbddd,lateral power,transducer,H01,0.25,TEXT1: H01; TEXT2: transducer; ANC1: lateral p...,0.25,1,0.25
31660,f9d8979b94cec923,spreader body,spreader,A01,0.75,TEXT1: A01; TEXT2: spreader; ANC1: spreader body,0.75,3,0.50
15573,e151ca5ea5cc0f08,high gradient magnetic separators,magnetic filtration,B03,0.50,TEXT1: B03; TEXT2: magnetic filtration; ANC1: ...,0.50,2,0.50
...,...,...,...,...,...,...,...,...,...
5040,f297c5e94dd07e6e,cervical support,gel pack,A47,0.25,TEXT1: A47; TEXT2: gel pack; ANC1: cervical su...,0.25,1,0.25
33907,06779da2bf614d00,trommel screen,trommel screen,B03,1.00,TEXT1: B03; TEXT2: trommel screen; ANC1: tromm...,1.00,4,1.00
9090,ed6245a94e7e5a77,different conductivity,conductive,H03,0.50,TEXT1: H03; TEXT2: conductive; ANC1: different...,0.50,2,0.50
25999,f93d71c0e9af4923,prolog,sliding window,H03,0.00,TEXT1: H03; TEXT2: sliding window; ANC1: prolog,0.00,0,0.50
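
The two summary scores can be recomputed from the confusion matrix, which helps show why they differ on imbalanced data. This is a sketch in plain Python using the counts printed above, with rows as true classes and columns as predicted classes, following the scikit-learn convention.

```python
confusion = [[5465, 779],
             [306, 745]]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Balanced accuracy is the unweighted mean of the per-class recalls
recalls = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]
balanced_accuracy = sum(recalls) / len(recalls)

print(f"accuracy          = {accuracy:.4f}")           # 0.8513
print(f"balanced accuracy = {balanced_accuracy:.4f}")  # 0.7920
```

Accuracy is dominated by the 6244 examples of class 0, while balanced accuracy gives the 1051 examples of class 1 equal say, so the lower recall on class 1 pulls it down.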


### Saving the trainer state and model adapters

In [43]:
metrics = train_result.metrics
metrics["train_samples"] = len(dataset_train)
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

***** train metrics *****
  epoch                    =        15.0
  total_flos               = 387392093GF
  train_loss               =      1.0205
  train_runtime            =  4:13:36.48
  train_samples            =       29178
  train_samples_per_second =      28.763
  train_steps_per_second   =       3.596


#### Saving the adapter model
* Note this doesn't save the entire model. It only saves the adapters.

In [63]:
trainer.save_model("saved_model")



wandb: 🚀 View run sequence_classification at: https://wandb.ai/sdfsdfrr/huggingface/runs/d0zscpin
wandb: Find logs at: wandb/run-20250317_171149-d0zscpin/logs


### Save to Google Drive
Before disconnecting and deleting the Colab runtime, copy the model to Google Drive so you can use it for inference later.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
!cp -r sequence_classification /content/drive/MyDrive/Colab

In [None]:
!cp -r saved_model /content/drive/MyDrive/Colab
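
If you prefer to stay in Python rather than shell out to `cp`, a rough equivalent using only the standard library is sketched below. The helper name `backup_dirs` is hypothetical, and the demo uses temporary directories and a placeholder file as stand-ins for the real paths; on Colab the destination would be the mounted Drive folder used above.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def backup_dirs(sources: Iterable[str], destination_root: str) -> List[Path]:
    """Copy each source directory into destination_root, keeping its name."""
    root = Path(destination_root)
    root.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        if not src.is_dir():
            print(f"skipping {src}: not a directory")
            continue
        target = root / src.name
        # dirs_exist_ok lets a later run refresh an earlier backup (Python 3.8+)
        shutil.copytree(src, target, dirs_exist_ok=True)
        copied.append(target)
    return copied


# Self-contained demo: temporary directories stand in for the real paths
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "saved_model"
    src.mkdir()
    (src / "adapter_config.json").write_text("{}")  # placeholder file
    copied = backup_dirs([str(src)], str(Path(tmp) / "drive" / "Colab"))
    demo_files = sorted(p.name for p in copied[0].iterdir())
print(demo_files)  # ['adapter_config.json']

# On Colab, after mounting Drive:
# backup_dirs(["sequence_classification", "saved_model"], "/content/drive/MyDrive/Colab")
```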