# Lightweight Fine-Tuning Project

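The choices made for this project are:

* PEFT technique: LoRA (rank 32) applied to the attention `query`, `key`, and `value` projections
* Model: `FacebookAI/xlm-roberta-base` with a 7-class sequence classification head
* Evaluation approach: accuracy on the test split, computed with `Trainer.evaluate()` before and after fine-tuning
* Fine-tuning dataset: `Davlan/sib200`, Telugu subset (`tel_Telu`), with 7 topic categories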

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

### Loading the dataset and adapting it for the model

In [3]:
# Python imports
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
import math
import numpy as np


# Load the Telugu subset (tel_Telu) of the SIB-200 dataset
ds = load_dataset('Davlan/sib200', 'tel_Telu')
# Show ds
ds

DatasetDict({
    train: Dataset({
        features: ['index_id', 'category', 'text'],
        num_rows: 701
    })
    validation: Dataset({
        features: ['index_id', 'category', 'text'],
        num_rows: 99
    })
    test: Dataset({
        features: ['index_id', 'category', 'text'],
        num_rows: 204
    })
})

In [4]:
# Map each category name to an integer label id (and back)
categories = ["science/technology", "travel", "politics", "sports", "health", "entertainment", "geography"]

label2id = {label: idx for idx, label in enumerate(categories)}
id2label = {idx: label for label, idx in label2id.items()}

def convert_labels(example):
    example["labels"] = label2id[example["category"]]
    return example

ds = ds.map(convert_labels)
ds = ds.remove_columns('category')
ds

DatasetDict({
    train: Dataset({
        features: ['index_id', 'text', 'labels'],
        num_rows: 701
    })
    validation: Dataset({
        features: ['index_id', 'text', 'labels'],
        num_rows: 99
    })
    test: Dataset({
        features: ['index_id', 'text', 'labels'],
        num_rows: 204
    })
})
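`convert_labels` raises a `KeyError` if a split contains a category that is missing from `categories`. Below is a minimal sketch of a round-trip check that could be run before the `category` column is removed. The `sample_categories` list is a hypothetical stand-in for `ds['train']['category']`, not real dataset rows.

```python
categories = ["science/technology", "travel", "politics", "sports",
              "health", "entertainment", "geography"]
label2id = {label: idx for idx, label in enumerate(categories)}
id2label = {idx: label for label, idx in label2id.items()}

# Hypothetical sample of the 'category' column (not real dataset rows)
sample_categories = ["travel", "health", "science/technology", "travel"]

# Categories that the mapping does not cover would trigger a KeyError later
unknown = sorted(set(sample_categories) - set(label2id))

# Encode to integer ids, then decode back to check the round trip
encoded = [label2id[c] for c in sample_categories]
decoded = [id2label[i] for i in encoded]

print(unknown)                       # []
print(encoded)                       # [1, 4, 0, 1]
print(decoded == sample_categories)  # True
```

An empty `unknown` list means every category in the sample has an id.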

In [5]:
## Checking data types to solve the error "Unable to create tensors"
# [type(label) for label in ds['train']["labels"]]

### Using the XLM-RoBERTa model
(https://huggingface.co/FacebookAI/xlm-roberta-base)

In [7]:
# Load tokenizer for the XLM-RoBERTa model
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
# Load XLM-RoBERTa foundation model 
model = AutoModelForSequenceClassification.from_pretrained("FacebookAI/xlm-roberta-base", num_labels=7,
    id2label=id2label,
    label2id=label2id)
# Print the model
model

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


XLMRobertaForSequenceClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768,

In [8]:
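# Tokenize the data
def preprocessing(data):
    # Truncate to the model's 512-token limit; padding is left to
    # DataCollatorWithPadding, which pads each batch at collation time
    tokenized = tokenizer(data['text'], truncation=True, max_length=512)
    tokenized['labels'] = data['labels']
    return tokenized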

tokenized_data = ds.map(preprocessing, batched=True, remove_columns=['text','index_id'])
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 701
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 99
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 204
    })
})
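The `DataCollatorWithPadding` created further down pads each batch to the length of its longest sequence when the batch is assembled. The idea can be sketched in plain Python as follows. `pad_batch` is a hypothetical helper written for illustration, not the library's implementation, and the token ids are toy values. The pad id of 1 is taken from `padding_idx=1` in the model printout above.

```python
PAD_ID = 1  # matches padding_idx=1 in the printed XLM-RoBERTa embeddings

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq))
                      for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids (hypothetical, not produced by the real tokenizer)
batch = pad_batch([[0, 581, 2], [0, 581, 7986, 10, 2]])

print(batch["input_ids"])       # [[0, 581, 2, 1, 1], [0, 581, 7986, 10, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than per dataset keeps short batches short, which tends to reduce wasted computation on pad tokens.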

In [33]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"eval_accuracy": (predictions == labels).mean()}
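As a quick sanity check, `compute_metrics` can be called on a small hand-made batch before it is passed to the `Trainer`. The logits and labels below are made up for illustration; only the function body comes from the notebook.

```python
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"eval_accuracy": (predictions == labels).mean()}

# Toy logits for 4 examples and 3 classes (hypothetical values)
toy_logits = np.array([
    [0.1, 2.0, -1.0],   # argmax -> 1
    [1.5, 0.2, 0.3],    # argmax -> 0
    [0.0, 0.1, 3.0],    # argmax -> 2
    [0.9, 0.8, 0.7],    # argmax -> 0
])
toy_labels = np.array([1, 0, 2, 1])

# Three of the four predictions match their labels
toy_accuracy = float(compute_metrics((toy_logits, toy_labels))["eval_accuracy"])
print(toy_accuracy)  # 0.75
```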

In [10]:
# Data collator
data_collator = DataCollatorWithPadding(tokenizer)

# Evaluate the pretrained (not yet fine-tuned) model once on the test split
args = TrainingArguments(output_dir="./fm_telugu", per_device_eval_batch_size=8)
trainer = Trainer(
    model=model,
    args=args,
    data_collator=data_collator,
    eval_dataset=tokenized_data['test'],
    compute_metrics=compute_metrics
)

results = trainer.evaluate()

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [11]:
# Show results
print(f"Foundation Model Accuracy: {results['eval_accuracy']:.2%}")

Foundation Model Accuracy: 10.78%


### Conclusion
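Before fine-tuning, the XLM-RoBERTa model reaches only 10.78% accuracy on the test split. This is near the chance level for 7 classes (1/7 ≈ 14.3% under uniform guessing), which is expected because the classification head is newly initialized, as the warning above notes.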

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [39]:
# PEFT Model - based on the XLM-RoBERTa
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

## Config for PEFT
# bnb_config  = BitsAndBytesConfig(load_in_8bit=True,
#                                 bnb_8bit_use_double_quant=True)
lora_config = LoraConfig(r=32,
                         lora_alpha=64,
                         lora_dropout=0.2,
                         target_modules=["query", "key", "value"],                                
                         task_type=TaskType.SEQ_CLS,
                         bias="none")

## Get Base model
base_model = AutoModelForSequenceClassification.from_pretrained("FacebookAI/xlm-roberta-base",
                                                                num_labels=7,
                                                                id2label=id2label,
                                                                label2id=label2id, 
#                                                                 quantization_config=bnb_config,                                                                
                                                                device_map="auto")
## Prepare for k-bit training
# model = prepare_model_for_kbit_training(base_model)

## Get peft model
peft_model = get_peft_model(base_model, lora_config)

## Print trainable parameters
peft_model.print_trainable_parameters()

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 2,961,422 || all params: 280,414,478 || trainable%: 1.056087410722067
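The reported trainable-parameter count can be reproduced by hand. The sketch below uses the layer sizes from the model printout (768 hidden units, 12 encoder layers) and the LoRA rank from `lora_config`. It assumes the classification head consists of a 768→768 `dense` layer and a 768→7 `out_proj` layer, both with biases. The observation that the head appears to be counted twice is an inference from the arithmetic, not something verified against the PEFT source.

```python
hidden = 768            # in/out features of query, key, value
num_layers = 12         # encoder layers in xlm-roberta-base
num_labels = 7
r = 32                  # LoRA rank from lora_config
targets_per_layer = 3   # query, key, value

# Each adapted Linear gains two matrices: A (r x in) and B (out x r)
lora_params = num_layers * targets_per_layer * r * (hidden + hidden)

# Assumed classification head: dense (768 -> 768) + out_proj (768 -> 7), with biases
head_params = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

reported_trainable = 2_961_422
reported_total = 280_414_478

print(lora_params)  # 1769472
print(head_params)  # 595975
# The remainder after the LoRA matrices is exactly two copies of the head
print(reported_trainable - lora_params == 2 * head_params)  # True
print(f"{100 * reported_trainable / reported_total:.4f}%")  # 1.0561%
```

Under these assumptions, the LoRA matrices alone account for about 1.77M of the trainable parameters, well under 1% of the full model.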


In [41]:
# Reuse the DataCollatorWithPadding and compute_metrics defined above

# Training Arguments
training_args = TrainingArguments(
    output_dir="/tmp/peft_telugu",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    remove_unused_columns=False,
    fp16=True,
    gradient_checkpointing=False,
    optim="adamw_torch",
    ddp_find_unused_parameters=False
)

# Train PEFT model 
trainer = Trainer(
    model=peft_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['validation'],
    compute_metrics=compute_metrics
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.21434 | 0.787879 |
| 2 | No log | 1.189126 | 0.767677 |
| 3 | 0.327500 | 1.120553 | 0.818182 |
| 4 | 0.327500 | 1.381401 | 0.787879 |
| 5 | 0.327500 | 1.519963 | 0.777778 |
| 6 | 0.196700 | 1.298708 | 0.79798 |
| 7 | 0.196700 | 1.470326 | 0.79798 |
| 8 | 0.196700 | 1.357994 | 0.828283 |
| 9 | 0.123100 | 1.418454 | 0.808081 |
| 10 | 0.123100 | 1.405328 | 0.808081 |


Checkpoint destination directory /tmp/peft_telugu/checkpoint-176 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory /tmp/peft_telugu/checkpoint-352 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1760, training_loss=0.19673018022017044, metrics={'train_runtime': 141.3052, 'train_samples_per_second': 49.609, 'train_steps_per_second': 12.455, 'total_flos': 462751596165000.0, 'train_loss': 0.19673018022017044, 'epoch': 10.0})
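The step count and the checkpoint names in the log follow from the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation. It also shows which epoch `load_best_model_at_end=True` would pick: when `metric_for_best_model` is not set, the `Trainer` documentation says it falls back to the evaluation loss, so the lowest validation loss wins rather than the highest accuracy. The `history` rows are copied from the training log above.

```python
import math

train_rows, batch_size, epochs = 701, 4, 10

# One optimizer step per batch, assuming one device and no gradient accumulation
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 176 1760

# (epoch, validation loss, accuracy) copied from the training log
history = [
    (1, 1.21434, 0.787879), (2, 1.189126, 0.767677), (3, 1.120553, 0.818182),
    (4, 1.381401, 0.787879), (5, 1.519963, 0.777778), (6, 1.298708, 0.79798),
    (7, 1.470326, 0.79798), (8, 1.357994, 0.828283), (9, 1.418454, 0.808081),
    (10, 1.405328, 0.808081),
]

best_by_loss = min(history, key=lambda row: row[1])      # default selection rule
best_by_accuracy = max(history, key=lambda row: row[2])  # alternative rule
print(best_by_loss[0], best_by_accuracy[0])  # 3 8
```

By validation loss the best epoch is 3, while validation accuracy peaks at epoch 8. Setting `metric_for_best_model` to the accuracy key would be one way to select on accuracy instead. Validation loss also rises after epoch 3 while training loss keeps falling, which suggests some overfitting on this small training set.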

###  ⚠️ IMPORTANT ⚠️

Due to workspace storage constraints, you should not store the model weights in the same directory but rather use `/tmp` to avoid workspace crashes which are irrecoverable.
Ensure you save it in /tmp always.

In [42]:
# Saving the model
peft_model.save_pretrained("/tmp/peft_telugu")
tokenizer.save_pretrained("/tmp/peft_telugu")

('/tmp/peft_telugu/tokenizer_config.json',
 '/tmp/peft_telugu/special_tokens_map.json',
 '/tmp/peft_telugu/sentencepiece.bpe.model',
 '/tmp/peft_telugu/added_tokens.json',
 '/tmp/peft_telugu/tokenizer.json')

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [None]:
from peft import AutoPeftModelForSequenceClassification

# Load the fine-tuned model (base weights plus the saved LoRA adapter)
loaded_peft_model = AutoPeftModelForSequenceClassification.from_pretrained("/tmp/peft_telugu")

# Evaluate the peft model 
args = TrainingArguments(output_dir="./peft_telugu_eval", per_device_eval_batch_size=8)
trainer = Trainer(
    model=loaded_peft_model,
    args=args,
    data_collator=data_collator,
    eval_dataset=tokenized_data['test'],
    compute_metrics=compute_metrics
)

peft_results = trainer.evaluate()

In [44]:
# Show results
print(f"Foundation Model Accuracy: {results.get('eval_accuracy', results.get('accuracy', 'N/A'))}")
print(f"PEFT Model Accuracy:       {peft_results.get('eval_accuracy', peft_results.get('accuracy', 'N/A'))}")

Foundation Model Accuracy: 0.10784313725490197
PEFT Model Accuracy:       0.8284313725490197
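The two accuracies can be turned back into counts of correctly classified test examples, which makes the size of the change easier to read. The sketch below only uses the accuracy values printed above and the test split size of 204 rows.

```python
test_rows = 204
base_acc = 0.10784313725490197   # foundation model, from the output above
peft_acc = 0.8284313725490197    # PEFT model, from the output above

# Recover the number of correct predictions from each accuracy
base_correct = round(base_acc * test_rows)
peft_correct = round(peft_acc * test_rows)
gain_points = 100 * (peft_acc - base_acc)

print(base_correct, peft_correct)  # 22 169
print(f"{gain_points:.2f}")        # 72.06
```

The fine-tuned model classifies 169 of 204 test examples correctly, compared with 22 before fine-tuning.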


### Increase in performance on the dataset
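After LoRA fine-tuning, accuracy on the test split rises from 10.78% to 82.84%, a gain of about 72 percentage points over the foundation model with its untrained classification head.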