# Efficiently Fine-Tune Seq2Seq Models with Low Rank Adaptation (LoRA)

We are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft). 

You will learn how to:

1. Set up the development environment
2. Load and prepare the dataset
3. Fine-Tune Multilingual BERT with LoRA and bnb int-8
4. Evaluate & run Inference
5. Cost performance comparison

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
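
To make the LoRA idea concrete, here is a minimal NumPy sketch of the low-rank update described in the paper: the pre-trained weight `W` stays frozen and only two small matrices `A` and `B` are trained, so the effective weight is `W + (alpha / r) * B @ A`. The shapes, the initialization scale, and the names `d`, `r`, and `alpha` are illustrative assumptions and are not taken from PEFT's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 768, 16, 32            # hidden size, LoRA rank, scaling (illustrative values)

W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init so the update starts at zero

x = rng.normal(size=(d,))

# Forward pass: frozen path plus the scaled low-rank path
h = W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params / full_params:.2%}")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and training only has to learn the small correction.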

*Note: This tutorial was created and run on an NC24 VM on Azure with one NVIDIA A100 GPU.*

### Plan:

* Give an overview of the building blocks
  * Abstractive summarization
  * Evaluation metric (ROUGE)
  * [BERT](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)
  * The other LLM, which is yet to be decided (LLM2)
  * Prompt engineering
* Practical steps: [Starting with preparing this section first]
  * Install required libraries
  * Load and explore the dataset
  * SECTION 1: Abstractive summarization using BERT
    * Build the summarization flow using BERT
    * Train
    * Evaluate
  * SECTION 2: Abstractive summarization using LLM2
    * Construct summarization prompt(s)
    * Generate summaries
    * Evaluate

### TODOs

1. Test on colab T4
2. List compute requirements
3. Add requirements.txt and/or conda yaml

# Install required modules

In [1]:
# !pip install datasets
# !pip install arabert
# !pip install accelerate -U

In [2]:
# !pip install transformers[torch]

In [3]:
#install evaluation metric
# !pip install rouge_score

# Dataset

In [4]:
from datasets import load_dataset

dataset = load_dataset('Goud/Goud-sum')


In [5]:
#Data Exploration
print(dataset['train'][0])


{'article': 'منير العلمي من مراكش: تحول فضاء مقر الغرفة الفلاحية بمدينة مراكش، الذي يحتضن في هذه الأثناء، انتخاب رئيس وأعضاء المكتب المسير للغرفة الفلاحية بجهة مراكش آسفي، إلى حلبة للاشتباكات والملاسنات، بعد اشتداد الخلاف بين البرلمانيين حميد العكرود وعمر خفيف، اللذين ينتميان إلى حزب التجمع الوطني للأحرار، ما كاد يعصف بالاجتماع بعد انطلاق شرارة الاشتباك بالأيادي التي أجهضت في مهدها بتدخل بعض الحاضرين. وحسب شهود عيان، فإن عمر خفيف، الذي يشغل رئيس جماعة أكفاي، ومدعم الحبيب بن الطالب المنسق الاقليمي لحزب الأصالة والمعاصر الذي يتجه لتولي رئاسة الغرفة لولاية تانية، رفض دخول حميد العكرود للمنافسة على رئاسة الغرفة، واصفا إياه بـ “الأمي الذي لايفقه شيئا”، ليدخل الطرفان في ملاسنات كلامية قبل أن يتحول الصراع إلى تشابك بالأيدي. ', 'headline': 'برلمانيين من حزب الحمامة قلبوها بونيا قبل انتخاب رئيس وأعضاء غرفة الفلاحة بجهة مراكش آسفي (صور)', 'categories': "['آش واقع', 'الرئيسية']"}


In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'headline', 'categories'],
        num_rows: 139288
    })
    validation: Dataset({
        features: ['article', 'headline', 'categories'],
        num_rows: 9497
    })
    test: Dataset({
        features: ['article', 'headline', 'categories'],
        num_rows: 9497
    })
})

In [19]:
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

Train dataset size: 139288
Test dataset size: 9497


In [7]:
# Sample headline
dataset['train'][0]['headline']

'برلمانيين من حزب الحمامة قلبوها بونيا قبل انتخاب رئيس وأعضاء غرفة الفلاحة بجهة مراكش آسفي (صور)'

Preprocessing the data

# Abstractive summarization using BERT

The evaluation metric used here is ROUGE, which scores a generated summary by its overlap with the reference summary (in this dataset, the headline).
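
As a rough illustration of what ROUGE-N computes, the sketch below counts clipped n-gram matches between a prediction and a reference and turns them into precision, recall, and F1. It is a simplified stand-in, not the `rouge_score` implementation: the helper name `rouge_n` is hypothetical, and tokenization is plain whitespace splitting with no stemming.

```python
from collections import Counter


def rouge_n(prediction: str, reference: str, n: int = 1) -> dict:
    """ROUGE-N precision/recall/F1 on whitespace tokens (simplified sketch)."""
    def ngrams(text: str) -> Counter:
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

One thing worth verifying for Arabic text: to our knowledge the default tokenizer in `rouge_score` keeps only lowercase ASCII letters and digits, so non-Latin scripts may need a custom tokenizer to produce meaningful scores.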




In [8]:
#Evaluating model performance
#Defining Metrics:

#Explain the evaluation metrics you will use (e.g., ROUGE, BLEU).
from datasets import load_metric

In [9]:
#from transformers import BertTokenizer, AutoModelForSeq2SeqLM, pipeline
#from arabert.preprocess import ArabertPreprocessor

#model_name = "aubmindlab/bert-base-arabertv2"
#preprocessor = ArabertPreprocessor(model_name="")
#tokenizer = BertTokenizer.from_pretrained(model_name)

## Build summarization flow using BERT

In [10]:
from datasets import load_dataset, Dataset, load_metric
from transformers import BertTokenizer, EncoderDecoderModel, Trainer, TrainingArguments, DataCollatorForSeq2Seq


# Load the tokenizer and model
model_name = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(model_name, model_name)

# Set decoder_start_token_id
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bia

In [18]:
from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["article"], truncation=True), batched=True, remove_columns=["article", "categories"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# take the 85th percentile of the input lengths for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
# NOTE: this cell measures the "categories" column, while the training target
# used in preprocess_function is "headline".
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["categories"], truncation=True), batched=True, remove_columns=["article", "categories"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# take the 90th percentile of the target lengths for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/148785 [00:00<?, ? examples/s]

Max source length: 512


Map:   0%|          | 0/148785 [00:00<?, ? examples/s]

Max target length: 16
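
The percentile trick above can be illustrated on a toy example. Picking the maximum length from a percentile rather than from the longest example keeps padding small, at the cost of truncating a few outliers. The lengths below are made-up numbers, not taken from the dataset.

```python
import numpy as np

# Hypothetical token lengths for ten documents (made-up numbers)
lengths = [12, 15, 18, 20, 22, 25, 30, 40, 90, 400]

max_len = int(np.percentile(lengths, 85))
truncated = sum(length > max_len for length in lengths)

print(f"longest: {max(lengths)}, 85th percentile: {max_len}, truncated: {truncated}/{len(lengths)}")
```

Note that the cell above calls the tokenizer with `truncation=True`, so the measured lengths are presumably already capped at the tokenizer's maximum (512 for this checkpoint, as far as we know), which would explain the printed source length of 512.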


In [11]:
# Preprocess the data
def preprocess_function(examples):
    inputs = tokenizer(examples["article"], max_length=512, truncation=True, padding="max_length")
    outputs = tokenizer(examples["headline"], max_length=150, truncation=True, padding="max_length")

    inputs["decoder_input_ids"] = outputs["input_ids"]
    inputs["labels"] = outputs["input_ids"].copy()

    # replace padding token id's of the labels by -100 so it's ignored by the loss
    inputs["labels"] = [[(label if label != tokenizer.pad_token_id else -100) for label in labels] for labels in inputs["labels"]]

    return inputs
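
To show why padding ids in the labels are replaced by -100, and how decoder inputs relate to labels under teacher forcing, here is a plain-Python sketch on toy ids. PyTorch's cross-entropy loss skips targets equal to -100 by default. The helper names and the ids below are illustrative assumptions. As far as we understand, Transformers performs a comparable right shift internally when an encoder-decoder model receives only `labels`, but this is not the library's code.

```python
PAD_ID, START_ID, IGNORE = 0, 101, -100  # illustrative ids, not read from the tokenizer


def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE):
    """Replace padding positions so the loss skips them."""
    return [tok if tok != pad_id else ignore_index for tok in label_ids]


def shift_right(label_ids, start_id=START_ID, pad_id=PAD_ID, ignore_index=IGNORE):
    """Build decoder inputs: prepend the start token, drop the last label,
    and turn ignored positions back into real padding ids."""
    shifted = [start_id] + label_ids[:-1]
    return [pad_id if tok == ignore_index else tok for tok in shifted]


target = [7, 8, 9, 102, 0, 0]  # toy headline ids followed by padding
labels = mask_padding(target)
decoder_inputs = shift_right(labels)

print(labels)          # [7, 8, 9, 102, -100, -100]
print(decoder_inputs)  # [101, 7, 8, 9, 102, 0]
```

At each position the decoder sees the previous target token and is trained to predict the current one, while padded positions contribute nothing to the loss.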

In [20]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["article", "categories", "headline"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")
# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

Map:   0%|          | 0/139288 [00:00<?, ? examples/s]

Map:   0%|          | 0/9497 [00:00<?, ? examples/s]

Map:   0%|          | 0/9497 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'token_type_ids', 'attention_mask', 'decoder_input_ids', 'labels']


Saving the dataset (0/2 shards):   0%|          | 0/139288 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/9497 [00:00<?, ? examples/s]

In [12]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)


# Define data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    save_total_limit=3,
)




In [14]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query", "value"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
# Prepare the model for k-bit training. The model above is not loaded in int-8,
# so to our understanding this call mainly freezes the base weights here.
model = prepare_model_for_kbit_training(model)

# add LoRA adapter
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 1,769,472 || all params: 385,964,283 || trainable%: 0.4585
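
The trainable parameter count printed above can be reproduced by hand. The sketch assumes BERT-base dimensions (hidden size 768, 12 layers per stack) and assumes that `target_modules=["query", "value"]` matches the query and value projections in the encoder self-attention, the decoder self-attention, and the decoder cross-attention. Under those assumptions the arithmetic lands on the same number.

```python
hidden = 768   # BERT-base hidden size
rank = 16      # r in LoraConfig
layers = 12    # transformer layers per BERT-base stack

# Each adapted Linear(hidden, hidden) gains A (rank x hidden) and B (hidden x rank)
params_per_module = 2 * rank * hidden

encoder_modules = layers * 2   # self-attention query + value
decoder_modules = layers * 4   # self-attention and cross-attention query + value

trainable = (encoder_modules + decoder_modules) * params_per_module
total = 385_964_283            # "all params" as printed above

print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
```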


In [15]:
# Define ROUGE metric
rouge = load_metric("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = [[(label if label != -100 else tokenizer.pad_token_id) for label in labels] for labels in labels]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE scores
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract the ROUGE scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    return result





In [16]:
# Sample a subset of the tokenized training data
subset_fraction = 0.05  # 5% of the training data
train_subset = tokenized_datasets["train"].shuffle(seed=42).select(range(int(subset_fraction * len(tokenized_datasets["train"]))))
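
To get a feel for what the 5% subset buys, the sketch below estimates the number of optimizer steps for the full training split and for the subset. It assumes a single GPU, no gradient accumulation, the batch size of 4 and 3 epochs from the first `TrainingArguments` above, and that the last partial batch is kept. The helper `total_steps` is hypothetical.

```python
import math


def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


train_size = 139_288                   # train split size printed earlier
subset_size = int(0.05 * train_size)   # same expression as the cell above

full_run = total_steps(train_size, batch_size=4, epochs=3)
subset_run = total_steps(subset_size, batch_size=4, epochs=3)

print(f"subset size: {subset_size}")
print(f"steps on full data: {full_run}, steps on subset: {subset_run}")
```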



In [21]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir = "goud-bert"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # higher learning rate
    num_train_epochs=3,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],  # needed so trainer.evaluate() has data to run on
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

## Train

In [22]:
# Train the model
trainer.train()


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.




## Evaluate

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)


In [None]:

# Save the model
model.save_pretrained("./fine_tuned_bert2bert_model")
tokenizer.save_pretrained("./fine_tuned_bert2bert_model")