# **Low-resource NLP**

<img src="https://github.com/stevenkolawole/indaba-low-resource-nlp-prac/blob/main/content/lr_llm_header.png?raw=1" width="60%" align="center"/>


<a href="https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Indaba_2024_Prac_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

© Deep Learning Indaba 2024. Apache License 2.0.

**Authors:**
- Ali Zaidi
- Aya Salama
- Khalil Mrini
- Steven Kolawole

**Introduction:**

Low-resource NLP (Natural Language Processing) refers to the study and development of NLP models and systems for languages, tasks, or domains that have limited data and resources available. These can include languages with fewer digital text corpora, limited computational tools, or less-developed linguistic research.

**Key Challenges in Low-Resource NLP**

1. **Data Scarcity:**
   - **Limited Training Data:** Many languages lack large annotated corpora necessary for training NLP models.
   - **Lack of Pre-trained Models:** Popular NLP models like BERT, GPT, and others are often not available for low-resource languages.

2. **Linguistic Diversity:**
   - **Morphological Complexity:** Some languages have complex grammatical structures and morphological richness.
   - **Dialectal Variations:** A lack of standardized versions can complicate NLP tasks.

3. **Resource Limitations:**
   - **Computational Constraints:** Low-resource scenarios often involve limited access to computational power and storage.
   - **Expertise and Tools:** Fewer linguistic experts and fewer NLP tools are tailored for these languages.

**Topics:**

Content: [Natural Language Processing, Low-resource, Large Language Models, Parameter Efficient Finetuning, Adaptation]  
Level: [Intermediate]


**Aims/Learning Objectives:**

- Explore the challenges of data scarcity
- Explore compute resource limitations and address them with parameter-efficient fine-tuning
- Compare the performance of small (BERT) and large (GPT) language models on low-resource languages and tasks
  

**Prerequisites:**

[Knowledge required for this prac. You can link a relevant parallel track session, blogs, papers, courses, topics etc.]
- Link resources on LLMs, BERT, and Masakhane papers on low-resource conversations

**Outline:**

[Points that link to each section. Auto-generate following the instructions [here](https://stackoverflow.com/questions/67458990/how-to-automatically-generate-a-table-of-contents-in-colab-notebook).]


**Before you start:**

[Tasks just before starting.]


Storyline: we work on a task with scarce data: summarization in Moroccan Darija.
This task poses resource constraints, since Moroccan Darija can be considered a low-resource dialect of Arabic. We also set the task in a compute-poor environment, so training should be able to run on a commodity GPU.
We will use a parameter-efficient fine-tuning technique (LoRA) to make training feasible under these constraints.

# Setup

## Run the cell to set up the needed packages and resources

**Resource folders**:

The resources you'll need to run this practical will be downloaded when you run the next cell.
After downloading and extraction are complete, you'll have the following folders present in the "resources" folder in the parent directory:

- *models* folder: contains the pre-trained models used in the practical
- *dataset* folder: contains the Goud-sum dataset used in the practical
- *genrated_responses* folder: contains pre-generated summaries used in Section 2

In [1]:
!git clone https://github.com/stevenkolawole/indaba-low-resource-nlp-prac.git
%cd indaba-low-resource-nlp-prac

import utils

# Install the required packages
utils.install_requirements()

# Download and extract the zip file containing the resources
utils.download_and_extract_zip("https://dli2024prac.blob.core.windows.net/testres/resources.zip")
model, tokenizer, config = utils.load_models()


Cloning into 'indaba-low-resource-nlp-prac'...
remote: Enumerating objects: 70, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 70 (delta 25), reused 32 (delta 12), pack-reused 0 (from 0)
Receiving objects: 100% (70/70), 3.26 MiB | 7.96 MiB/s, done.
Resolving deltas: 100% (25/25), done.
/content/indaba-low-resource-nlp-prac
Collecting datasets==2.14.4 (from -r requirements.txt (line 1))
  Downloading datasets-2.14.4-py3-none-any.whl.metadata (19 kB)
Collecting peft==0.6.0 (from -r requirements.txt (line 2))
  Downloading peft-0.6.0-py3-none-any.whl.metadata (23 kB)
Collecting transformers==4.32.0 (from -r requirements.txt (line 3))
  Downloading transformers-4.32.0-py3-none-any.whl.metadata (118 kB)
Collecting tqdm==4.66.1 (from -r requirements.txt (line 4))
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)

## Imports

In [5]:

import numpy as np
import torch

from IPython.display import display, HTML
from tqdm import tqdm
from datasets import load_dataset, concatenate_datasets, load_metric, load_from_disk
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType, PeftConfig, PeftModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, EncoderDecoderModel, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

# Introduction

In this practical, we are interested in generating headlines for news articles featured on the news website [Goud.ma](https://www.goud.ma).
Refer to the [github.com/issam9/goud-summarization-dataset](https://github.com/issam9/goud-summarization-dataset) repository for the training code for the workshop paper [Goud.ma: a News Dataset for Summarization in Moroccan Darija](https://openreview.net/forum?id=BMVq5MELb9).

We will frame this as a summarization task where the input is the body of a news article and the output is an appropriate headline. The [Goud dataset](https://huggingface.co/datasets/Goud/Goud-sum) contains 158k articles and their headlines. All headlines are in Moroccan Darija, while articles may be in Moroccan Darija, in Modern Standard Arabic, or a mix of both (code-switched Moroccan Darija).

**Data Fields**
- *article*: a string containing the body of the news article
- *headline*: a string containing the article's headline
- *categories*: a list of strings containing the article's categories

In [3]:
display(HTML(
    """
    <iframe
    src="https://huggingface.co/datasets/Goud/Goud-sum/embed/viewer/default/train"
    frameborder="0"
    width="100%"
    height="560px"
    ></iframe>
    """
))

## What we will do:
<img src="https://github.com/stevenkolawole/indaba-low-resource-nlp-prac/blob/main/content/DLI_LR_llm_prac_1.png?raw=1" width="40%" />


## Evaluation Metric: ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of summaries by comparing them to reference (or ground truth) summaries. ROUGE is widely used in Natural Language Processing (NLP) tasks, particularly for evaluating the performance of text summarization models.

![ROUGE-Base](https://i0.wp.com/blog.uptrain.ai/wp-content/uploads/2024/01/rouge-n.webp?resize=700%2C228&ssl=1)

### Key ROUGE Variants

1. **ROUGE-N**: Measures the overlap of n-grams between the candidate summary and the reference summary.
   - **ROUGE-1**: Overlap of unigrams (1-grams).
   - **ROUGE-2**: Overlap of bigrams (2-grams).

![ROUGE-1](https://clementbm.github.io/assets/2021-12-23/rouge-unigrams.png)

*Caption:* $\text{ROUGE-1} = \frac{7}{10} = 0.7$

2. **ROUGE-L**: Measures the longest common subsequence (LCS) between the candidate summary and the reference summary. Unlike ROUGE-N, ROUGE-L captures sentence-level structure by finding the longest sequence of words that appears in the same order in both summaries, without requiring the words to be adjacent.

3. **ROUGE-W**: A weighted version of ROUGE-L that gives more credit to consecutive matches within the LCS.

4. **ROUGE-S**: Measures the overlap of skip-bigrams, which are pairs of words in their order of appearance that can have any number of gaps between them.
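
To make the LCS idea behind ROUGE-L concrete, here is a minimal sketch. It assumes plain whitespace tokenization and a balanced F1 score; real ROUGE implementations apply their own tokenization and weighting, so their numbers may differ. The function names `lcs_length` and `rouge_l_f1` are illustrative and not part of any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(candidate, reference):
    """Balanced F1 over the LCS of whitespace-split tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(lcs_length(candidate.split(), reference.split()))  # 5
print(round(rouge_l_f1(candidate, reference), 3))        # 0.833
```

The only mismatch is "lay" versus "sat", so five of the six tokens form a common subsequence even though they are not all adjacent.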

### How ROUGE is Computed

ROUGE metrics can be calculated in terms of three measures:

- **Recall**: The ratio of overlapping units (n-grams, LCS, or skip-bigrams) between the candidate summary and the reference summary to the total units in the reference summary. It answers, "How much of the reference summary is captured by the candidate summary?"

- **Precision**: The ratio of overlapping units between the candidate summary and the reference summary to the total units in the candidate summary. It answers, "How much of the candidate summary is relevant to the reference summary?"

- **F1-Score**: The harmonic mean of Precision and Recall. This gives a balanced measure that considers both precision and recall.
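
The three measures above can be sketched for ROUGE-N in a few lines. This is a simplified illustration that assumes whitespace tokenization and clipped n-gram counts; it is not the scoring code used later in this practical, and `rouge_n` is a hypothetical helper name.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of n-gram overlap between two strings."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    # Counter intersection clips each n-gram to its minimum count in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat on the mat", "the cat is on the mat today", n=1)
print({k: round(v, 3) for k, v in scores.items()})
# {'precision': 0.833, 'recall': 0.714, 'f1': 0.769}
```

Here 5 of the 6 candidate unigrams appear in the reference (precision 5/6), while 5 of the 7 reference unigrams are covered (recall 5/7).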

### Importance of ROUGE

ROUGE is essential for summarization tasks because it provides a standardized way to evaluate and compare different summarization models. Higher ROUGE scores generally indicate that the candidate summary is more similar to the reference summary, meaning the model is likely performing well.

#### NOTE: Caveat
<div style="display: flex; justify-content: space-between;">
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*8ZNpaag-Nr2GLs3A-sz0aQ.png" alt="limitation 1" width="250"/>
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*CLIKeyKYiR6sNA4yjIkCWg.png" alt="limitation 2" width="250"/>
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*667HMbjSLJhwR_xqBau3JQ.png" alt="limitation 3" width="250"/>
</div>

While ROUGE and other evaluation metrics (e.g., BLEU, METEOR) are valuable for quick and straightforward evaluation of language models, they have limitations. They fall short when assessing the fluency, coherence, and overall meaning of a passage, and they are relatively insensitive to word order. ROUGE primarily measures lexical overlap and may not capture the semantic meaning or quality of a summary. For these reasons, better metrics remain an active research topic.

These metrics are therefore not drop-in replacements for human evaluation; they are best used alongside human judgments for a more comprehensive assessment of summary quality.
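
A small sketch of the caveat above, assuming whitespace tokenization and a unigram F1 score (the helper `unigram_f1` is illustrative only): a scrambled sentence with a different meaning scores perfectly, while a reasonable paraphrase scores zero.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 of clipped unigram overlap between two whitespace-tokenized strings."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the dog bit the man"
scrambled = "the man bit the dog"          # same words, opposite meaning
paraphrase = "a canine attacked someone"   # similar meaning, no shared words

print(unigram_f1(scrambled, reference))    # 1.0
print(unigram_f1(paraphrase, reference))   # 0.0
```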

## Load dataset

In [None]:
# Load the dataset from disk
dataset = load_from_disk("./resources/data/Goud-sum/Goud-sum")

## Check dataset


In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['article', 'headline', 'categories'],
        num_rows: 139288
    })
    validation: Dataset({
        features: ['article', 'headline', 'categories'],
        num_rows: 9497
    })
    test: Dataset({
        features: ['article', 'headline', 'categories'],
        num_rows: 9497
    })
})


In [None]:
dataset['train'][0]

{'article': 'منير العلمي من مراكش: تحول فضاء مقر الغرفة الفلاحية بمدينة مراكش، الذي يحتضن في هذه الأثناء، انتخاب رئيس وأعضاء المكتب المسير للغرفة الفلاحية بجهة مراكش آسفي، إلى حلبة للاشتباكات والملاسنات، بعد اشتداد الخلاف بين البرلمانيين حميد العكرود وعمر خفيف، اللذين ينتميان إلى حزب التجمع الوطني للأحرار، ما كاد يعصف بالاجتماع بعد انطلاق شرارة الاشتباك بالأيادي التي أجهضت في مهدها بتدخل بعض الحاضرين. وحسب شهود عيان، فإن عمر خفيف، الذي يشغل رئيس جماعة أكفاي، ومدعم الحبيب بن الطالب المنسق الاقليمي لحزب الأصالة والمعاصر الذي يتجه لتولي رئاسة الغرفة لولاية تانية، رفض دخول حميد العكرود للمنافسة على رئاسة الغرفة، واصفا إياه بـ “الأمي الذي لايفقه شيئا”، ليدخل الطرفان في ملاسنات كلامية قبل أن يتحول الصراع إلى تشابك بالأيدي. ',
 'headline': 'برلمانيين من حزب الحمامة قلبوها بونيا قبل انتخاب رئيس وأعضاء غرفة الفلاحة بجهة مراكش آسفي (صور)',
 'categories': "['آش واقع', 'الرئيسية']"}

# Section 1: Efficiently Fine-Tune Seq2Seq Models with Low Rank Adaptation (LoRA)

The goal of this section is to fine-tune a base model for our summarization task using a parameter-efficient mechanism called low-rank adaptation (LoRA). An implementation of this technique is part of the Parameter-Efficient Fine-Tuning (PEFT) library from Hugging Face. We will leverage the 🤗 [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft) libraries in this section.

You will learn how to:

1. Setup Development Environment
2. Load and prepare the dataset
3. Fine-Tune mT5 with LoRA
4. Evaluate & run inference
5. Cost performance comparison

### Quick intro to PEFT or Parameter Efficient Fine-tuning
<div style="display: flex; justify-content: center; align-items: flex-start;">    <figure style="text-align: center;">
        <a href="https://arxiv.org/abs/2303.15647#" target="_blank">
            <img src="https://github.com/stevenkolawole/indaba-low-resource-nlp-prac/blob/main/content/PEFT_method.png?raw=1" width="90%" />
        </a>
        <figcaption><a href="https://arxiv.org/abs/2303.15647#" target="_blank">PEFT Methods, from the paper "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning"
</a></figcaption>
    </figure>
</div>

[PEFT](https://github.com/huggingface/peft), or Parameter-Efficient Fine-Tuning, is an open-source library from Hugging Face that enables efficient adaptation of pre-trained models, including but not limited to language models and diffusion models, to various downstream applications without fine-tuning all of the model's parameters. PEFT implements many methods and their variants, such as:

- LoRA: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/abs/2110.07602)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/abs/2103.10385)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691)

## Low-Rank Adaptation (LoRA)

While large language models (LLMs) have shown remarkable performance across a wide range of NLP tasks, they require significant computational resources to train, fine-tune, and deploy. In addition, many real-world use cases require adapting available LLMs to the target task in order to achieve the desired performance.

Fine-tuning an entire LLM is cost-prohibitive, even on small datasets. For example, fully fine-tuning the Llama 7B model requires 112GB of VRAM, i.e. at least two 80GB A100 GPUs. Fortunately, parameter-efficient fine-tuning methods such as LoRA allow users with modest resources to adapt an LLM to their target task efficiently and effectively.

In this tutorial we explore QLoRA, which is a parameter-efficient fine-tuning technique that reduces the number of parameters fine-tuned during the adaptation process, and additionally introduces quantization to further lower the memory footprint of the adapted model.

### How Does LoRA Work?

The paper [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) takes inspiration from the conjecture that over-parameterized models span a low-rank intrinsic dimension. A low intrinsic dimension means the data can be effectively represented or approximated by a lower-dimensional space while retaining most of its essential information or structure. In other words, this means we can decompose the new weight matrix for the adapted task into lower-dimensional (smaller) matrices without losing significant information.

Concretely, suppose $\Delta W$ is the weight update for an $A\times B$ weight matrix. Then a low-rank decomposition of $\Delta W$ can be expressed as $\Delta W = W_A W_B$, where $W_A$ is an $A\times k$ matrix and $W_B$ is a $k\times B$ matrix. Here, $k$ is the rank of the decomposition and is typically much smaller than $A$ and $B$, so only $k(A+B)$ parameters are trained instead of $AB$.

![Image courtesy of Sebastian Raschka's Lightning.AI tutorial on LoRA](https://lightningaidev.wpengine.com/wp-content/uploads/2023/04/lora-4-300x226@2x.png)
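
The decomposition can be sketched numerically with NumPy. The shapes below (`A=512`, `B=384`, `k=16`) are assumptions chosen for illustration, and the random values stand in for real weights; this is not how PEFT implements LoRA internally, only the underlying arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)

A, B, k = 512, 384, 16                  # hypothetical layer shape and rank

W = rng.normal(size=(A, B))             # frozen pre-trained weight matrix
W_A = rng.normal(size=(A, k)) * 0.01    # trainable low-rank factor (A x k)
W_B = rng.normal(size=(k, B)) * 0.01    # trainable low-rank factor (k x B)

delta_W = W_A @ W_B                     # full-size update with rank at most k
W_adapted = W + delta_W                 # weights used after adaptation

full_params = A * B                     # parameters in a full update
lora_params = A * k + k * B             # parameters in the two factors

print(delta_W.shape)                    # (512, 384)
print(full_params, lora_params)         # 196608 14336
print(np.linalg.matrix_rank(delta_W) <= k)
```

In this toy setting the two factors hold about 7% of the parameters of the full update, while `delta_W` still has the same shape as `W`.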

## Summarization Using MT5

Prior to fine-tuning our model, we need to select the model we will use as our base model. In this case, we will use the [MT5](https://huggingface.co/google/mt5-small) model, which is a multilingual variant of the T5 model. The MT5 model is trained on a large multilingual corpus and is capable of performing a wide range of NLP tasks, including summarization.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565




Next we prepare our datasets for training. This requires tokenizing the input and output sequences, padding them to the desired length, and then converting them into PyTorch Dataset objects.

In [None]:
# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    lambda x: tokenizer(x["article"], truncation=True),
    batched=True,
    remove_columns=["categories", "headline"],
)
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]

In [None]:
# take 85 percentile of max length for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

Max source length: 571


In [None]:
# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    lambda x: tokenizer(x["headline"], truncation=True),
    batched=True,
    remove_columns=["article", "categories"],
)
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# take 90 percentile of max length for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Max target length: 50
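
The reason for using a percentile instead of the maximum length can be seen on a toy example. The token counts below are made up for illustration: one very long outlier would otherwise force every sequence to be padded to its length.

```python
import numpy as np

# Hypothetical token counts for ten tokenized articles (one long outlier)
lengths = [120, 95, 310, 150, 80, 2048, 200, 175, 140, 260]

# Same recipe as above: cut off at the 85th percentile of observed lengths
max_len = int(np.percentile(lengths, 85))
truncated = sum(length > max_len for length in lengths)

print(f"longest sequence: {max(lengths)}")
print(f"85th-percentile cutoff: {max_len}")
print(f"sequences truncated: {truncated} of {len(lengths)}")
```

The cutoff stays far below the outlier, so padding is much cheaper, at the cost of truncating the few longest inputs.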


In [None]:
def preprocess_function(sample, padding="max_length"):
    # # add prefix to the input for t5
    # inputs = ["summarize: " + item for item in sample["article"]]

    # tokenize inputs (article bodies)
    model_inputs = tokenizer(sample['article'], max_length=max_source_length, padding=padding, truncation=True)

    # tokenize targets (headlines)
    labels = tokenizer(sample["headline"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["headline", "article", "categories"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
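
The label-masking step inside `preprocess_function` can be illustrated in isolation. The token ids and `PAD_ID = 0` below are hypothetical (in practice the value comes from `tokenizer.pad_token_id`); `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions holding it do not contribute to the loss.

```python
PAD_ID = 0           # hypothetical pad token id; use tokenizer.pad_token_id in practice
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace pad tokens in each label sequence so the loss skips them."""
    return [
        [tok if tok != pad_id else ignore_index for tok in seq]
        for seq in label_ids
    ]


# Two made-up label sequences, already padded to length 6
padded_labels = [
    [259, 1042, 87, 1, 0, 0],
    [512, 1, 0, 0, 0, 0],
]

masked = mask_padding(padded_labels)
print(masked)
# [[259, 1042, 87, 1, -100, -100], [512, 1, -100, -100, -100, -100]]
```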


In [None]:
# tokenized_dataset["test"].save_to_disk("arabic-goud-data/eval")

Saving the dataset (0/1 shards):   0%|          | 0/9497 [00:00<?, ? examples/s]

Finally we need to define our configuration for LoRA. The primary parameters for LoRA are:

* `r`: the rank of the decomposed matrices $W_A$ and $W_B$ learned during fine-tuning. A smaller value saves more GPU memory but might decrease performance.
* `lora_alpha`: the scaling coefficient for the learned $\Delta W$: the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen weights. A larger value typically results in a larger behavior change after fine-tuning.
* `lora_dropout`: the dropout ratio for the LoRA adapter layers $W_A$ and $W_B$.
* `target_modules`: which modules to learn the low-rank decomposition for. This could be all linear layers, for example, or specific modules in the base network

In [None]:
lora_config = LoraConfig(
  r=16,
  lora_alpha=32,
  target_modules=["q", "v"],
  lora_dropout=0.05,
  bias="none",
  task_type=TaskType.SEQ_2_SEQ_LM
)
# prepare the model for training: this freezes the base weights
# (the helper is designed for models loaded in 8-bit or 4-bit)
model = prepare_model_for_kbit_training(model)

# add the LoRA adapter
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 688,128 || all params: 300,864,896 || trainable%: 0.2287
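
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the mT5-small dimensions listed in the comments (check `model.config` to confirm them) and assumes that each targeted linear layer receives one pair of low-rank matrices with no bias terms, as configured with `bias="none"`.

```python
# Assumed mT5-small dimensions (verify against model.config)
d_model = 512                    # hidden size
num_heads = 6
d_kv = 64                        # per-head dimension
inner_dim = num_heads * d_kv     # output size of the q and v projections
encoder_layers = 8
decoder_layers = 8

# Values from the LoraConfig above
r = 16
lora_alpha = 32

# Each adapted projection gets a (d_model x r) and an (r x inner_dim) matrix
params_per_module = d_model * r + r * inner_dim

# Encoder layers have self-attention; decoder layers have self- and cross-attention
attention_blocks = encoder_layers + 2 * decoder_layers
adapted_modules = attention_blocks * 2        # "q" and "v" in every block

trainable = params_per_module * adapted_modules
total_params = 300_864_896                    # "all params" from the output above

print(f"trainable params: {trainable:,}")                        # 688,128
print(f"trainable%: {100 * trainable / total_params:.4f}")       # 0.2287
print(f"update scaling (lora_alpha / r): {lora_alpha / r}")      # 2.0
```

Under these assumptions the hand count matches the figure reported by `print_trainable_parameters()`.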


In [None]:
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8,
    return_tensors='pt'
)

In [None]:
NUM_EPOCHS = 10
output_dir = "lora-goud-mt5-small"

In [None]:
# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir="lora-mt5-goud",
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=NUM_EPOCHS,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to=["tensorboard",
              #  "wandb",
               ],
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset = tokenized_dataset["train"],
    eval_dataset = tokenized_dataset["validation"].select(range(20)),
    tokenizer = tokenizer,
)

In [8]:
display(HTML(
    """
    <iframe src="https://wandb.ai/alizaidi/huggingface/runs/2rwkxynz?nw=nwuseralizaidi" style="border:none;height:1024px;width:100%"></iframe>
    """))

In [None]:
# trainer.train()

wandb: wandb version 0.17.7 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.6
wandb: Run data is saved locally in /home/alizaidi/dev/nlp/llms/indaba/indaba-low-resource-nlp-prac/wandb/run-20240816_012514-brfaca9p
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run lora-mt5-goud
wandb: ⭐️ View project at https://wandb.ai/alizaidi/huggingface
wandb: 🚀 View run at https://wandb.ai/alizaidi/huggingface/runs/brfaca9p


| Step | Training Loss |
|------|---------------|
| 500  | 6.3641 |
| 1000 | 5.0197 |
| 1500 | 4.7336 |
| 2000 | 4.5739 |
| 2500 | 4.487  |
| 3000 | 4.441  |
| 3500 | 4.3759 |
| 4000 | 4.3706 |
| 4500 | 4.3127 |
| 5000 | 4.3099 |

TrainOutput(global_step=174110, training_loss=3.7825945418129274, metrics={'train_runtime': 17642.8655, 'train_samples_per_second': 78.949, 'train_steps_per_second': 9.869, 'total_flos': 8.31857984054231e+17, 'train_loss': 3.7825945418129274, 'epoch': 10.0})

In [None]:
peft_model_id="peft-lora-mt5-goud-results"

In [None]:
# trainer.model.save_pretrained(peft_model_id)
# tokenizer.save_pretrained(peft_model_id)

('peft-lora-mt5-goud-results/tokenizer_config.json',
 'peft-lora-mt5-goud-results/special_tokens_map.json',
 'peft-lora-mt5-goud-results/spiece.model',
 'peft-lora-mt5-goud-results/added_tokens.json',
 'peft-lora-mt5-goud-results/tokenizer.json')

In [None]:
# trainer.push_to_hub("alizaidi/lora-mt5-goud-ar")

tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/2.77M [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/5.30k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/alizaidi/lora-mt5-goud/commit/fee52adbeda959a97d9b839e0b8f0a5f315b0713', commit_message='alizaidi/lora-mt5-goud-ar', commit_description='', oid='fee52adbeda959a97d9b839e0b8f0a5f315b0713', pr_url=None, pr_revision=None, pr_num=None)

## Evaluation 1: LoRA Model


In [19]:
config = PeftConfig.from_pretrained("alizaidi/lora-mt5-goud")
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
device_map = {"": 0} if torch.cuda.is_available() else None
model = PeftModel.from_pretrained(base_model, "alizaidi/lora-mt5-goud", device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained("alizaidi/lora-mt5-goud")



In [21]:
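text = dataset["test"][0]["article"]
# keep the inputs on the same device as the model
device = next(model.parameters()).device
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=128,
    )
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])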

بالفيديو. طالبة كليات فاس خرجات للاحتجاج على فتح الأحياء والجامعات


In [15]:
dataset["test"][0]["headline"]

'روبورطاج.. البياتة بالليل على برا شكل احتجاجي جديد للمطالبة بفتح الأحياء الجامعية بفاس'

In [23]:
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """
    split the dataset into smaller batches that we can process simultaneously
    Yield successive batch-sized chunks from list_of_elements.
    """
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]


def calculate_metric_on_test_ds(
    dataset,
    metric,
    model,
    tokenizer,
    batch_size=16,
    device="cuda" if torch.cuda.is_available() else "cpu",
    column_text="article",
    column_summary="highlights",
):
    article_batches = list(
        generate_batch_sized_chunks(dataset[column_text], batch_size)
    )
    target_batches = list(
        generate_batch_sized_chunks(dataset[column_summary], batch_size)
    )

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)
    ):

        inputs = tokenizer(
            article_batch,
            max_length=512,
            truncation=True,
            padding=True,
            return_tensors="pt",
        )

        summaries = model.generate(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
            # length_penalty=0.8,
            # num_beams=8,
            # max_length=128,
        )
        # `length_penalty` (used with beam search) controls how much longer
        # sequences are favored.

        # Decode the generated texts and add them, together with the
        # references, to the metric.
        decoded_summaries = [
            tokenizer.decode(
                s, skip_special_tokens=True, clean_up_tokenization_spaces=True
            )
            for s in summaries
        ]

        # Normalize surrounding whitespace and case before scoring.
        decoded_summaries = [d.strip().lower() for d in decoded_summaries]
        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    # Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score


def evaluation(tokenizer, model, dataset):
    # run on the same device the model lives on
    device = next(model.parameters()).device

    # load the ROUGE metric
    rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    rouge_metric = load_metric("rouge")
    score = calculate_metric_on_test_ds(
        dataset["test"][0:10],
        rouge_metric,
        model,
        tokenizer,
        batch_size=2,
        column_text="article",
        column_summary="headline",
        device=device,
    )

    # keep the mid-point F-measure for each ROUGE variant
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)

    return rouge_dict

In [24]:
evaluation(tokenizer, model, dataset)

100%|██████████| 5/5 [00:03<00:00,  1.42it/s]


{'rouge1': 0.1, 'rouge2': 0.0, 'rougeL': 0.1, 'rougeLsum': 0.1}

## Evaluation 2: Evaluation on Already-Finetuned Models (AraBERT, DziriBERT, DarijaBERT)


In [23]:
# Load the dataset
dataset = load_dataset("Goud/Goud-sum")

In [24]:
# List of models to evaluate
models = [
    "Goud/AraBERT-summarization-goud",
    "Goud/DziriBERT-summarization-goud",
    "Goud/DarijaBERT-summarization-goud"
]

In [None]:
# Evaluate each model
for model_name in models:
    print(f"Evaluating model: {model_name}")

    if "AraBERT" in model_name or "DziriBERT" in model_name or "DarijaBERT" in model_name:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.model_max_length = 1024
        model = EncoderDecoderModel.from_pretrained(model_name)
        model.config.max_position_embeddings = 1024
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    model.to("cuda" if torch.cuda.is_available() else "cpu")

    rouge_scores = evaluation(tokenizer, model, dataset)
    print(f"ROUGE scores for {model_name}:")
    print(rouge_scores)
    print("\n")

Evaluating model: Goud/AraBERT-summarization-goud


100%|██████████| 4749/4749 [46:38<00:00,  1.70it/s]


ROUGE scores for Goud/AraBERT-summarization-goud:
{'rouge1': 0.022797056089286873, 'rouge2': 0.0014566003299287494, 'rougeL': 0.022827197556336394, 'rougeLsum': 0.02274151487234012}


Evaluating model: Goud/DziriBERT-summarization-goud


100%|██████████| 4749/4749 [48:19<00:00,  1.64it/s]


ROUGE scores for Goud/DziriBERT-summarization-goud:
{'rouge1': 0.018970946146984042, 'rouge2': 0.001262554089762682, 'rougeL': 0.018998986508380316, 'rougeLsum': 0.018859523893034066}


Evaluating model: Goud/DarijaBERT-summarization-goud


 23%|██▎       | 1077/4749 [11:28<38:57,  1.57it/s]

# Activity: Native Language Article Summarization

**Task:** Fetch an article in your native language and assess the summarization and headline generation capabilities of ChatGPT.

**Steps:**

1. **Select an Article:** Choose a relevant and recent article written in your native language. Ensure it is of short or medium length.

2. **Summarize with ChatGPT:** Use ChatGPT to generate a summary of the selected article.

3. **Evaluate Summarization Quality:**
    - **Impression:** Share your impression of the summarization quality. Consider the following:
        - **Accuracy:** Does the summary capture the main points and essence of the article?
        - **Clarity:** Is the summary clear and easy to understand?
        - **Coverage:** Does the summary include all critical information from the article?

4. **Provide Feedback:** Offer constructive feedback on the summarization. Highlight any discrepancies or areas for improvement.


In [None]:
# @title Generate Quiz Form. (Run Cell)
from IPython.display import HTML

HTML(
    """
<iframe
  src="https://forms.gle/sggLJWMFQ4JQCmHL8"
  width="80%"
  height="320px">
  Loading...
</iframe>
"""
)

# Section 2: Summarization using GPT

In this section, we will utilize both OpenAI's Large and Small language models to perform the same task of abstractive text summarization. We will then evaluate their performance and compare the results with those obtained from the models discussed in Section 1.

Unlike traditional fine-tuning approaches that involve updating model weights, the initial step in adapting a GPT-based model for a specific task is prompt engineering, which does not require weight updates.

<div style="display: flex; align-items: flex-start;">
    <figure style="margin-right: 10px; text-align: center;">
        <a href="https://arxiv.org/pdf/2005.14165" target="_blank">
            <img src="https://github.com/stevenkolawole/indaba-low-resource-nlp-prac/blob/main/content/traditional-finetuning.png?raw=1" width="80%" />
        </a>
        <figcaption><a href="https://arxiv.org/pdf/2005.14165" target="_blank">Traditional Fine-Tuning</a></figcaption>
    </figure>
    <figure style= "text-align: center;">
        <a href="https://arxiv.org/pdf/2005.14165" target="_blank">
            <img src="https://github.com/stevenkolawole/indaba-low-resource-nlp-prac/blob/main/content/prompting.png?raw=1" width="80%" />
        </a>
        <figcaption><a href="https://arxiv.org/pdf/2005.14165" target="_blank">Prompting</a></figcaption>
    </figure>
</div>
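
To make the contrast concrete, here is a minimal sketch of how a zero-shot and a few-shot prompt differ when written as a chat message list. The helper name `build_messages` and the placeholder strings are illustrative assumptions, not part of this practical's code. In both cases the model weights stay untouched; only the prompt changes.

```python
def build_messages(article, examples=()):
    """Build a chat prompt; `examples` is a sequence of (article, headline) pairs."""
    instruction = "Summarize the following news article into a headline:\n"
    messages = [{"role": "system", "content": "You write news headlines."}]
    # Each in-context example becomes a user turn followed by an assistant turn.
    for ex_article, ex_headline in examples:
        messages.append({"role": "user", "content": instruction + ex_article})
        messages.append({"role": "assistant", "content": ex_headline})
    # The article we actually want summarized always comes last.
    messages.append({"role": "user", "content": instruction + article})
    return messages


zero_shot = build_messages("<test article>")
two_shot = build_messages(
    "<test article>",
    [("<article A>", "<headline A>"), ("<article B>", "<headline B>")],
)
print(len(zero_shot), len(two_shot))  # 2 6
```

Each extra example adds two messages, so the prompt grows with the shot count and can eventually exceed the model's context window.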


## Imports

In [None]:
import os
import random
import time

import openai
import pandas as pd
from openai import OpenAI
from rouge_metric import PyRouge
from tqdm import tqdm

## Utility Functions

In [None]:
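def evaluate_rouge(hypotheses, references):
    """Print ROUGE-1, ROUGE-2 and ROUGE-L (recall, precision, F1) for the hypotheses."""
    # PyRouge expects a list of references for each hypothesis.
    these_refs = [[ref.strip().lower()] for ref in references]
    rouge = PyRouge(rouge_n=(1, 2), rouge_l=True)
    scores = rouge.evaluate(hypotheses, these_refs)
    print(scores)


def substring_after_colon(input_string):
    """Return the text after the first colon, or the whole string if there is none."""
    colon_index = input_string.find(':')
    if colon_index != -1:
        return input_string[colon_index + 1:]
    return input_string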
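
As a quick illustration of this post-processing helper, the sketch below re-defines `substring_after_colon` and applies it to two made-up model responses (both strings are hypothetical, chosen only for illustration). The helper splits on the first colon, so a headline that itself contains a colon loses its first part.

```python
def substring_after_colon(input_string):
    colon_index = input_string.find(':')
    if colon_index != -1:
        return input_string[colon_index + 1:]
    return input_string


# A response with a preamble before the actual headline.
with_preamble = 'Here is the headline:\n"Example headline"'
cleaned = substring_after_colon(with_preamble).replace('"', "").strip()
print(cleaned)  # Example headline

# A headline that contains a colon is cut at that colon.
truncated = substring_after_colon("Morocco: new law announced").strip()
print(truncated)  # new law announced
```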

In [None]:
# Define dataset and paths
DATASET = "Goud"
MAX_TRAIN = 0  # number of few-shot examples in the prompt (0 = zero-shot)
model_name = "gpt-4o-mini"

output_filename = f"./{DATASET}_{model_name}_test_generated_{MAX_TRAIN}.csv"

# Load dataset (`dataset` must already be defined earlier in the notebook)
goud_data = dataset
train_source = goud_data["train"]["article"]
train_target = goud_data["train"]["headline"]
test_source = goud_data["test"]["article"]

## Function to summarize news articles

In [None]:
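# Read the OpenAI API key from the environment, or paste it here. Never commit or share your key.
key = os.environ.get("OPENAI_API_KEY", "")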

In [None]:
def summarize_news_article(MAX_TRAIN=20):
    """Generate a Darija headline for every test article, using MAX_TRAIN few-shot examples per prompt."""
    # Note: output_filename is built from the global MAX_TRAIN, not from this argument.
    client = OpenAI(api_key=key)
    wait_time = 1
    df_lines = []
    existing_len = 0
    # Resume from a previous run if an output file already exists.
    if os.path.exists(output_filename):
        existing_df = pd.read_csv(output_filename)
        existing_len = existing_df.shape[0]
        df_lines = existing_df.to_dict('records')

    try:
        for data in tqdm(test_source[existing_len:], desc=f"Lines processed from {existing_len}-th line"):
            news_article = data.strip()
            made_error = True
            num_error = 0
            while made_error:
                messages = [{"role": "system", "content": "You are asked to summarize a news article written in Modern Standard Arabic and Moroccan Darija, and write that summary as a clickbait headline, in Moroccan Darija only.\n"}]
                # Each non-rate-limit error drops one few-shot example to shorten the prompt.
                for _ in range(max(MAX_TRAIN - num_error, 0)):
                    idx = random.randrange(len(train_source))
                    train_src = train_source[idx]
                    train_tgt = train_target[idx]
                    messages.append({"role": "user", "content": f"Summarize the following news article into a headline in Moroccan Darija only:\n\"{train_src}\""})
                    messages.append({"role": "assistant", "content": f"Absolutely! Here is the headline summarizing your news article:\n\"{train_tgt}\""})
                messages.append({"role": "user", "content": f"Summarize the following news article into a headline in Moroccan Darija only:\n\"{news_article}\""})
                try:
                    response = client.chat.completions.create(
                        messages=messages,
                        model=model_name,
                    )
                    headline = response.choices[0].message.content
                    df_lines.append({"article": news_article, "generated_headline": headline, "prompt_messages": messages})
                    made_error = False
                    wait_time = 1  # reset the backoff after a successful call
                except openai.RateLimitError:
                    print(f"Rate limit error: waiting {wait_time} seconds before retrying", flush=True)
                    time.sleep(wait_time)
                    wait_time *= 2
                except Exception as e:
                    print(e)
                    num_error += 1
                    if num_error > MAX_TRAIN + 3:
                        # No few-shot context left to drop and the call still fails:
                        # stop instead of looping forever.
                        raise
                    print("May be too long, reducing context to:", max(MAX_TRAIN - num_error, 0))
    finally:
        # Save what has been generated so far, so an interrupted run can be resumed.
        if df_lines:
            pd.DataFrame.from_dict(df_lines).to_csv(output_filename, index=False)
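
The rate-limit handling in the function above follows an exponential backoff pattern: wait, retry, and double the wait after each failure. Below is a minimal, self-contained sketch of that pattern on its own. The helper `call_with_backoff`, its parameters, and the toy `flaky` function are hypothetical names introduced for illustration; `TimeoutError` merely stands in for a rate-limit error, and the `sleep` argument is injectable so the demonstration runs instantly.

```python
import time


def call_with_backoff(fn, retry_on, max_retries=5, initial_wait=1.0, sleep=time.sleep):
    """Call fn(); on a `retry_on` exception wait, double the wait, and try again."""
    wait = initial_wait
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(wait)
            wait *= 2


# Toy demonstration: a function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


waits = []  # record the requested waits instead of actually sleeping
result = call_with_backoff(flaky, TimeoutError, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

Capping the number of retries keeps a persistent failure, such as an invalid API key, from looping indefinitely.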

## Execute the summarization


In [None]:
#record cell running time
import time
start_time = time.time()
summarize_news_article(0)
print("--- %s seconds ---" % (time.time() - start_time))

Lines processed from 0-th line: 100%|██████████| 9497/9497 [2:34:22<00:00,  1.03it/s]  


--- 9263.82370686531 seconds ---


## Load Generated output and evaluate ROUGE


In the "generated_responses" folder you will find the GPT responses for the 0-, 1-, 5-, and 20-shot prompts.

Evaluate the generated headlines with ROUGE and add the results to the results table.
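
To make the recall, precision, and F1 columns concrete, here is a simplified sketch of ROUGE-1 for a single hypothesis and reference, using whitespace tokenization. It is an illustration only: the function name `rouge1` and the tokenization are assumptions, and the `PyRouge` implementation used in this practical may tokenize and aggregate differently, so its numbers need not match.

```python
from collections import Counter


def rouge1(hypothesis, reference):
    """Unigram-overlap recall, precision and F1 for one hypothesis/reference pair."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((hyp & ref).values())
    recall = overlap / max(sum(ref.values()), 1)      # share of the reference that was recovered
    precision = overlap / max(sum(hyp.values()), 1)   # share of the hypothesis that is correct
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}


scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores)
```

In this toy example every hypothesis word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5). ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence.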

### ROUGE Metric Results: 0 Shot

| Metric   | Recall (r)        | Precision (p)     | F1-Score (f)      |
|----------|-------------------|-------------------|-------------------|
| ROUGE-1  | 0.1228            | 0.1069            | 0.1143            |
| ROUGE-2  | 0.0282            | 0.0235            | 0.0256            |
| ROUGE-L  | 0.1128            | 0.0980            | 0.1049            |


In [None]:
shot_count = 0
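# Build the path with os.path.join so it works on any operating system.
responses_path = os.path.join("generated_responses", model_name, f"Goud_{model_name}_test_generated_{shot_count}.csv")
hypotheses = pd.read_csv(responses_path, encoding="UTF-8")["generated_headline"].tolist()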
hypotheses = [substring_after_colon(hypo).replace("\"", "").strip() for hypo in hypotheses]
references = goud_data["test"]["headline"]
evaluate_rouge(hypotheses, references)


{'rouge-1': {'r': 0.11884854492859614, 'p': 0.12977184865821972, 'f': 0.12407023545586798}, 'rouge-2': {'r': 0.0325784137233658, 'p': 0.03389945881811894, 'f': 0.03322581039836068}, 'rouge-l': {'r': 0.11128523451125509, 'p': 0.12142149826298465, 'f': 0.11613260817866634}}


### ROUGE Metric Results: 20 Shot

Pick one or more of the files in the "generated_responses" folder that contain previously generated N-shot GPT responses, run the evaluation, and populate the table below.

| Metric   | Recall (r)        | Precision (p)     | F1-Score (f)      |
|----------|-------------------|-------------------|-------------------|
| ROUGE-1  |             |             |             |
| ROUGE-2  |            |             |            |
| ROUGE-L  |             |             |           |


In [None]:
shot_count = 20
model_name = "gpt4"
# Build the path with os.path.join so it works on any operating system.
responses_path = os.path.join("generated_responses", model_name, f"Goud_test_generated_{shot_count}.csv")
hypotheses = pd.read_csv(responses_path, encoding="UTF-8")["generated_headline"].tolist()
hypotheses = [substring_after_colon(hypo).replace("\"", "").strip() for hypo in hypotheses]
references = goud_data["test"]["headline"]
evaluate_rouge(hypotheses, references)

{'rouge-1': {'r': 0.13750397869020445, 'p': 0.1306205269799418, 'f': 0.13397389480280544}, 'rouge-2': {'r': 0.03387859852289136, 'p': 0.03194819653833152, 'f': 0.032885092553751126}, 'rouge-l': {'r': 0.1264128630910928, 'p': 0.11986076103165903, 'f': 0.12304965282629712}}


In [None]:
# @title Generate Quiz Form. (Run Cell)
from IPython.display import HTML

HTML(
    """
<iframe
	src="https://forms.gle/zbJoTSz3nfYq1VrY6"
  width="80%"
	height="1200px" >
	Loading...
</iframe>
"""
)

## Conclusion
**Summary:**

[Summary of the main points/takeaways from the prac.]

**Next Steps:**

[Next steps for people who have completed the prac, like optional reading (e.g. blogs, papers, courses, youtube videos). This could also link to other pracs.]

**Appendix:**

[Anything (probably math heavy stuff) we don't have space for in the main practical sections.]

**References:**

[References for any content used in the notebook.]

For other practicals from the Deep Learning Indaba, please visit [here](https://github.com/deep-learning-indaba/indaba-pracs-2022).

## Feedback

Please provide feedback that we can use to improve our practicals in the future.

In [None]:
# @title Generate Feedback Form. (Run Cell)
from IPython.display import HTML

HTML(
    """
<iframe
	src="https://forms.gle/WUpRupqfhFtbLXtN6"
  width="80%"
	height="1200px" >
	Loading...
</iframe>
"""
)

<img src="https://baobab.deeplearningindaba.com/static/media/indaba-logo-dark.d5a6196d.png" width="50%" />