# **Low-resource NLP**

<img src="https://github.com/stevenkolawole/indaba-low-resource-nlp-prac/blob/c426a3e79c3a06d38562f49d7fc57ec0c751622c/content/lr_llm_header.png?raw=1" width="60%" align="center"/>


<a href="https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Indaba_2024_Prac_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> [Change colab link to point to prac.]

© Deep Learning Indaba 2024. Apache License 2.0.

**Authors:**
- Ali Zaidi
- Aya Salama
- Khalil Mrini
- Steven Kolawole

**Introduction:**

Low-resource NLP (Natural Language Processing) refers to the study and development of NLP models and systems for languages, tasks, or domains that have limited data and resources available. These can include languages with fewer digital text corpora, limited computational tools, or less-developed linguistic research.

**Key Challenges in Low-Resource NLP**

1. **Data Scarcity:**
   - **Limited Training Data:** Many languages lack large annotated corpora necessary for training NLP models.
   - **Lack of Pre-trained Models:** Popular NLP models like BERT, GPT, and others are often not available for low-resource languages.

2. **Linguistic Diversity:**
   - **Morphological Complexity:** Some languages have complex grammatical structures and morphological richness.
   - **Dialectal Variations:** A lack of standardized versions can complicate NLP tasks.

3. **Resource Limitations:**
   - **Computational Constraints:** Low-resource scenarios often involve limited access to computational power and storage.
   - **Expertise and Tools:** Fewer linguistic experts and fewer NLP tools are tailored for these languages.

**Topics:**

Content: [Natural Language Processing, Large Language Models, Parameter-Efficient Fine-Tuning, Adaptation]  
Level: [Intermediate]


**Aims/Learning Objectives:**

- Explore data scarcity challenges
- Explore compute resource limitations
- Compare the performance of state-of-the-art LLMs on a low-resource language task (headline generation in Moroccan Darija)

**Prerequisites:**

[Knowledge required for this prac. You can link a relevant parallel track session, blogs, papers, courses, topics etc.]

**Outline:**

[Points that link to each section. Auto-generate following the instructions [here](https://stackoverflow.com/questions/67458990/how-to-automatically-generate-a-table-of-contents-in-colab-notebook).]


**Before you start:**

[Tasks just before starting.]


Storyline: we work on a task with scarce data, namely summarization in Moroccan Darija.
This task poses resource constraints, because Moroccan Darija can be considered a low-resource dialect of Arabic. We also place the task in a compute-poor environment, so training must be able to run on a commodity GPU.
We will use a parameter-efficient fine-tuning technique, LoRA, to make training feasible under these constraints.

## Installation and Imports [should download any needed resources]

In [None]:
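!pip install datasets
!pip install arabert
!pip install accelerate -U
!pip install transformers[torch]
!pip install rouge_score
!pip install peft evaluate  # LoRA adapters and the ROUGE metric loader used in Section 1
!pip install openai rouge-metric  # Azure OpenAI client and PyRouge used in Section 2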

In [None]:
# download the dataset
# download the model checkpoints
# download the GPT outputs


# Task and Dataset Overview

In this practical we are interested in generating headlines for news articles featured on the news website [Goud.ma](https://www.goud.ma).

We will frame this as a summarization task where the input is the body of a news article and the output is an appropriate headline. The [Goud dataset](https://github.com/issam9/goud-summarization-dataset) contains 158k articles and their headlines. All headlines are in Moroccan Darija, while articles may be in Moroccan Darija, in Modern Standard Arabic, or a mix of both (code-switched Moroccan Darija).

**Data Fields**
- *article*: a string containing the body of the news article
- *headline*: a string containing the article's headline
- *categories*: a list of strings containing the article's categories

## What we will do:

<img src="https://github.com/stevenkolawole/indaba-low-resource-nlp-prac/blob/c426a3e79c3a06d38562f49d7fc57ec0c751622c/content/DLI_LR_llm_prac_1.png?raw=1" width="40%" />

## Evaluation Metric: ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of summaries by comparing them to reference (or ground truth) summaries. ROUGE is widely used in Natural Language Processing (NLP) tasks, particularly for evaluating the performance of text summarization models.

![ROUGE-Base](https://i0.wp.com/blog.uptrain.ai/wp-content/uploads/2024/01/rouge-n.webp?resize=700%2C228&ssl=1)

### Key ROUGE Variants

1. **ROUGE-N**: Measures the overlap of n-grams between the candidate summary and the reference summary.
   - **ROUGE-1**: Overlap of unigrams (1-grams).
   - **ROUGE-2**: Overlap of bigrams (2-grams).

![ROUGE-1](https://clementbm.github.io/assets/2021-12-23/rouge-unigrams.png)

*Caption:* $ROUGE_1 = \frac{7}{10} = 0.7$

2. **ROUGE-L**: Measures the longest common subsequence (LCS) between the candidate summary and the reference summary. Unlike ROUGE-N, ROUGE-L considers sentence-level structure similarity by identifying the longest co-occurring sequence of words in both summaries.



3. **ROUGE-W**: A weighted version of ROUGE-L that gives more importance to the contiguous LCS.

4. **ROUGE-S**: Measures the overlap of skip-bigrams, which are pairs of words in their order of appearance that can have any number of gaps between them.
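
To make the longest-common-subsequence idea behind ROUGE-L concrete, here is a minimal sketch on whitespace-tokenized text. The helper names `lcs_length` and `rouge_l` are illustrative and not part of any library, the example sentences are made up, and the F1 shown is a plain harmonic mean. Library implementations such as `rouge_score` make additional tokenization and stemming choices that this sketch leaves out.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(candidate, reference):
    """ROUGE-L precision, recall and F1 for two whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# The shared subsequence is "the cat on the mat" (5 of 6 tokens on each side)
scores = rouge_l("the cat lay on the mat", "the cat sat on the mat")
print(scores)
```

Because the subsequence does not need to be contiguous, the single substituted word costs only one token of overlap.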

### How ROUGE is Computed

ROUGE metrics can be calculated in terms of three measures:

- **Recall**: The ratio of overlapping units (n-grams, LCS, or skip-bigrams) between the candidate summary and the reference summary to the total units in the reference summary. It answers, "How much of the reference summary is captured by the candidate summary?"

- **Precision**: The ratio of overlapping units between the candidate summary and the reference summary to the total units in the candidate summary. It answers, "How much of the candidate summary is relevant to the reference summary?"

- **F1-Score**: The harmonic mean of Precision and Recall. This gives a balanced measure that considers both precision and recall.
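
The three measures can be sketched for ROUGE-N in a few lines. This is an illustrative implementation on whitespace tokens, not the code used by any particular ROUGE package; `ngram_counts` and `rouge_n` are hypothetical helper names and the sentences are made up.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall and F1 for two whitespace-tokenized strings."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Counter intersection clips each n-gram to the smaller of its two counts
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


candidate = "the cat lay on the mat"
reference = "the cat sat on the mat"
r1 = rouge_n(candidate, reference, n=1)  # 5 of 6 unigrams match
r2 = rouge_n(candidate, reference, n=2)  # 3 of 5 bigrams match
print(r1)
print(r2)
```

Recall divides the overlap by the number of n-grams in the reference, while precision divides it by the number in the candidate. The two coincide here only because both sentences have the same length.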

### Importance of ROUGE

ROUGE is essential for summarization tasks because it provides a standardized way to evaluate and compare different summarization models. Higher ROUGE scores generally indicate that the candidate summary is more similar to the reference summary, meaning the model is likely performing well.

#### NOTE: Caveat
<div style="display: flex; justify-content: space-between;">
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*8ZNpaag-Nr2GLs3A-sz0aQ.png" alt="limitation 1" width="250"/>
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*CLIKeyKYiR6sNA4yjIkCWg.png" alt="limitation 2" width="250"/>
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*667HMbjSLJhwR_xqBau3JQ.png" alt="limitation 3" width="250"/>
</div>

While ROUGE and other automatic metrics (e.g., BLEU and METEOR) are valuable for quick and straightforward evaluation of language models, they have clear limitations. ROUGE primarily measures lexical overlap, so it cannot fully assess the fluency, coherence, or meaning of a summary, and variants built on single-word overlap, such as ROUGE-1, ignore word order entirely. For these reasons, improved evaluation metrics remain an active research topic.

These metrics are therefore not a drop-in replacement for human evaluation; they are best used alongside it for a more comprehensive assessment of summary quality.
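
A small made-up example illustrates the word-order limitation. The sketch below uses a hypothetical helper, `ngram_f1`, that computes an n-gram overlap F1 on whitespace tokens. It is a simplification of ROUGE-N, not a library implementation.

```python
from collections import Counter


def ngram_f1(candidate, reference, n):
    """F1 over clipped n-gram overlap between two whitespace-tokenized strings."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    cand = Counter(zip(*[cand_tokens[i:] for i in range(n)]))
    ref = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "dog bites man"
scrambled = "man bites dog"  # same words, opposite meaning

unigram_score = ngram_f1(scrambled, reference, n=1)
bigram_score = ngram_f1(scrambled, reference, n=2)
print(unigram_score, bigram_score)
```

The unigram score is perfect even though the meaning is reversed, while the bigram score drops to zero. Neither number says anything about whether the sentence is fluent or faithful.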

In [None]:
from datasets import load_dataset

dataset = load_dataset('Goud/Goud-sum')


In [None]:
#Data Exploration
print(dataset['train'][0])

In [None]:
dataset

# Section 1: Efficiently Fine-Tune Seq2Seq Models with Low Rank Adaptation (LoRA)
We are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft).

You will learn how to:

1. Setup Development Environment
2. Load and prepare the dataset
3. Fine-Tune Multilingual BERT with LoRA and bnb int-8
4. Evaluate & run Inference
5. Cost performance comparison

### Quick intro to PEFT or Parameter Efficient Fine-tuning
<img src="https://github.com/stevenkolawole/indaba-low-resource-nlp-prac/blob/c426a3e79c3a06d38562f49d7fc57ec0c751622c/content/PEFT_method.png?raw=1" width="60%" />

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
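
Of the techniques listed, LoRA is the one used in this practical. As a rough illustration of the idea from the LoRA paper, the sketch below keeps a frozen weight matrix `W0` and adds a scaled low-rank product `(alpha / r) * B @ A`, with `B` initialized to zero so that the adapted layer starts out identical to the original. Everything here is a toy assumption: the dimensions, the pure-Python `matmul` helper, and the `lora_weight` function are illustrative and are not PEFT's implementation.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, a, b, alpha, r):
    """Effective weight W0 + (alpha / r) * B @ A."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


d_out, d_in, r, alpha = 8, 8, 2, 4  # toy sizes, chosen only for illustration
rng = random.Random(0)
w0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
b = [[0.0] * r for _ in range(d_out)]                                # trainable, starts at zero

w_eff = lora_weight(w0, a, b, alpha, r)
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"full: {full_params}, LoRA: {lora_params}")
```

Only `A` and `B` would receive gradients during fine-tuning. For a 768 x 768 projection with `r=16`, the same arithmetic gives 24,576 trainable values instead of 589,824.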

## Summarization flow using BERT

In [None]:
from datasets import load_dataset, Dataset, load_metric
from transformers import BertTokenizer, EncoderDecoderModel, Trainer, TrainingArguments, DataCollatorForSeq2Seq


# Load the tokenizer and model
model_name = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(model_name, model_name)

# Set decoder_start_token_id
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bia

In [None]:
from datasets import concatenate_datasets
import numpy as np

# Measure tokenized article lengths over train + test to choose a source length budget.
# truncation=True caps every sequence at the tokenizer's model maximum.
all_examples = concatenate_datasets([dataset["train"], dataset["test"]])
tokenized_inputs = all_examples.map(lambda x: tokenizer(x["article"], truncation=True), batched=True, remove_columns=["article", "headline", "categories"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# Use the 85th percentile of lengths so that a few very long articles do not dictate the padding size
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# Do the same for the targets. The target of this task is the headline, not the categories.
tokenized_targets = all_examples.map(lambda x: tokenizer(x["headline"], truncation=True), batched=True, remove_columns=["article", "headline", "categories"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# Use the 90th percentile of lengths
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/148785 [00:00<?, ? examples/s]

Max source length: 512


Map:   0%|          | 0/148785 [00:00<?, ? examples/s]

Max target length: 16
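
The percentile-based length budget can be illustrated on toy data. The lengths below are made up, and `percentile_nearest_rank` is a hypothetical helper that uses the nearest-rank definition, which can differ slightly from the linear interpolation that `np.percentile` applies by default. The sketch also shows one plausible reading of the reported source length of 512: tokenizing with `truncation=True` caps every sequence at the model maximum, so a percentile can never exceed it.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct percent of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


MODEL_MAX = 512  # BERT's maximum sequence length
raw_lengths = [40, 95, 130, 210, 260, 330, 480, 700, 900, 1500]  # made-up token counts

# Truncation caps each length at the model maximum before the percentile is taken
capped = [min(n, MODEL_MAX) for n in raw_lengths]
max_source_length = percentile_nearest_rank(capped, 85)
truncated_share = sum(n > MODEL_MAX for n in raw_lengths) / len(raw_lengths)

print(max_source_length, truncated_share)
```

When the chosen percentile lands on the cap, as it does here, a noticeable share of articles is being cut off, which is worth keeping in mind when interpreting the results.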


In [None]:
# Preprocess the data
def preprocess_function(examples):
    inputs = tokenizer(examples["article"], max_length=512, truncation=True, padding="max_length")
    outputs = tokenizer(examples["headline"], max_length=150, truncation=True, padding="max_length")

    inputs["decoder_input_ids"] = outputs["input_ids"]
    inputs["labels"] = outputs["input_ids"].copy()

    # replace padding token id's of the labels by -100 so it's ignored by the loss
    inputs["labels"] = [[(label if label != tokenizer.pad_token_id else -100) for label in labels] for labels in inputs["labels"]]

    return inputs
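
The reason for replacing padding with -100 is that PyTorch's cross-entropy loss skips positions whose label equals its `ignore_index`, which defaults to -100. The sketch below mimics that behaviour in plain Python. The token ids and per-token losses are invented for illustration, and `mask_padding` and `mean_loss` are hypothetical helpers rather than library functions.

```python
PAD_ID = 0           # assumed padding id for this toy example
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding ids so the loss skips those positions."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]


def mean_loss(per_token_losses, labels):
    """Average the loss over positions that are not ignored."""
    kept = [loss for loss, lab in zip(per_token_losses, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


labels = mask_padding([101, 7, 8, 102, 0, 0])  # two trailing pad tokens
losses = [0.5, 1.0, 1.5, 1.0, 9.0, 9.0]        # made-up per-token losses

masked_mean = mean_loss(losses, labels)
naive_mean = sum(losses) / len(losses)
print(labels, masked_mean, naive_mean)
```

Without the mask, the large losses on padding positions would dominate the average and the model would spend capacity on predicting padding.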

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["article", "categories", "headline"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")
# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

Map:   0%|          | 0/139288 [00:00<?, ? examples/s]

Map:   0%|          | 0/9497 [00:00<?, ? examples/s]

Map:   0%|          | 0/9497 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'token_type_ids', 'attention_mask', 'decoder_input_ids', 'labels']


Saving the dataset (0/2 shards):   0%|          | 0/139288 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/9497 [00:00<?, ? examples/s]

In [None]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)


# Define data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    save_total_limit=3,
)




In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
 r=16,
 lora_alpha=32,
 target_modules=["query", "value"],
 lora_dropout=0.05,
 bias="none",
 task_type=TaskType.SEQ_2_SEQ_LM
)
# Freeze the base weights and prepare the model for PEFT training.
# Note: the model above is loaded in full precision, not int-8, so no quantization is involved here.
model = prepare_model_for_kbit_training(model)

# add LoRA adapter
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 1,769,472 || all params: 385,964,283 || trainable%: 0.4585
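
The trainable-parameter count above can be checked by hand. The sketch assumes the standard BERT-base shape (12 layers, hidden size 768) and that LoRA is attached to every `query` and `value` projection, including the decoder's cross-attention blocks that appear in the initialization warning earlier. The variable names are illustrative.

```python
hidden_size = 768  # BERT-base hidden width
num_layers = 12    # BERT-base depth
r = 16             # LoRA rank from the config above

# Each adapted projection gets A (r x hidden) and B (hidden x r)
params_per_module = r * (hidden_size + hidden_size)

# Encoder: self-attention query and value in every layer
encoder_modules = num_layers * 2
# Decoder: self-attention and cross-attention, each with query and value
decoder_modules = num_layers * 2 * 2

trainable = params_per_module * (encoder_modules + decoder_modules)
total = 385_964_283  # "all params" as reported in the output above

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / total:.4f}")
```

The result matches the printed count, which suggests the adapters were attached to all 72 query and value projections of the encoder-decoder model.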


In [None]:
# Define ROUGE metric
# Load ROUGE from the `evaluate` library (requires `pip install evaluate`), which supersedes `datasets.load_metric`
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = [[(token if token != -100 else tokenizer.pad_token_id) for token in label_seq] for label_seq in labels]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE scores
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # `evaluate` returns aggregated F-measures as floats; report them as percentages
    result = {key: value * 100 for key, value in result.items()}

    return result





In [None]:
# Sample a subset of the tokenized training data
subset_fraction = 0.05  # 5% of the training data
train_subset = tokenized_datasets["train"].shuffle(seed=42).select(range(int(subset_fraction * len(tokenized_datasets["train"]))))
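
For a sense of scale, the sketch below reproduces the subset arithmetic with the standard library. The training-set size is taken from the progress bars shown earlier. `sample_subset` is a hypothetical helper that mirrors the shuffle-then-select pattern, and it does not reproduce the exact ordering that `datasets` produces for the same seed.

```python
import random


def sample_subset(n_examples, fraction, seed=42):
    """Return a reproducible random sample of row indices."""
    k = int(fraction * n_examples)
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return indices[:k]


n_train = 139_288  # size of the Goud training split
subset = sample_subset(n_train, 0.05)
print(len(subset))
```

Fixing the seed makes the subset reproducible across runs, which keeps comparisons between experiments fair.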



In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="goud-bert"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=3,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_subset,  # train on the 5% sample defined above rather than the full training split
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

## Training [we will not be running training during the practical time]

In [None]:
# Train the model
trainer.train()


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Step,Training Loss


In [None]:
# Save the model
model.save_pretrained("./fine_tuned_bert2bert_model")
tokenizer.save_pretrained("./fine_tuned_bert2bert_model")

## Evaluation 1: early checkpoint of the model

In [None]:
#load an early checkpoint
#run evaluation

## Evaluation 2: final trained model

# Section 2: Summarization using GPT

In [None]:
from openai import AzureOpenAI
from tqdm import tqdm
import pandas as pd
import os
from datasets import load_dataset
import random
import openai
import time

DATASET = "Goud"
MAX_TRAIN = 20

goud_data = load_dataset("Goud/Goud-sum")

train_source = goud_data["train"]["article"]
train_target = goud_data["train"]["headline"]
train_len = len(train_source)

test_source = goud_data["test"]["article"]

output_filename = f"./{DATASET}_test_generated_{MAX_TRAIN}.csv"

def summarize_news_article():
    rewritten_prompt_count = 0
    line_count = 0
    wait_time = 1
    df_lines = []
    client = AzureOpenAI(
        azure_endpoint="",
        api_version="",
        api_key="",
    )
    existing_len = 0
    if os.path.exists(output_filename):
        existing_df = pd.read_csv(output_filename)
        existing_len = existing_df.shape[0]
        rewritten_prompt_count = existing_len
        line_count = existing_len
        df_lines = existing_df.to_dict('records')
    for data in tqdm(test_source[existing_len:], desc=f"Lines processed from {existing_len}-th line"):
        news_article = data.strip()
        line_count += 1
        made_error = True
        num_error = 0
        while made_error:
            messages = [{"role": "system", "content": "You are asked to summarize a news article written in Modern Standard Arabic and Moroccan Darija, and write that summary as a clickbait headline, in Moroccan Darija only.\n"}]
            if MAX_TRAIN - num_error > 0:
                for _ in range(MAX_TRAIN-num_error):
                    idx = random.choice(range(train_len))
                    train_src = train_source[idx]
                    train_tgt = train_target[idx]
                    messages.append({"role": "user", "content": f"Summarize the following news article into a headline in Moroccan Darija only:\n\"{train_src}\""})
                    messages.append({"role": "assistant", "content": f"Absolutely! Here is the headline summarizing your news article:\n\"{train_tgt}\""})
            messages.append({"role": "user", "content": f"Summarize the following news article into a headline in Moroccan Darija only:\n\"{news_article}\""})
            try:
                response = client.chat.completions.create(
                    messages=messages,
                    model="gpt-4-32k-0613",  # Azure deployment name; must be filled in (e.g. gpt-35-turbo, gpt-4, gpt-4-32k)
                )
                rewritten_prompt = response.choices[0].message.content
                df_lines.append({"article": news_article, "generated_headline": rewritten_prompt})
                rewritten_prompt_count += 1
                made_error = False
            except openai.RateLimitError:
                # Back off exponentially before retrying the same prompt
                print("Rate limit error")
                print(f"Waiting for {wait_time} seconds before retrying", flush=True)
                time.sleep(wait_time)
                wait_time *= 2
            except Exception as e:
                print(e)
                num_error += 1
                if num_error > MAX_TRAIN:
                    # The zero-shot prompt failed as well, so shrinking the context cannot help
                    raise
                print("May be too long, reducing context to:", MAX_TRAIN - num_error)


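    df = pd.DataFrame.from_dict(df_lines)
    # index=False keeps the file free of an extra index column when it is re-read to resume a run
    df.to_csv(output_filename, index=False)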

summarize_news_article()
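
The few-shot prompt built inside the loop above can be isolated into a small pure function, which makes it easier to inspect without calling the API. This is a sketch under assumptions: `build_messages` is a hypothetical helper, the example pairs are placeholders, and unlike the cell above it samples demonstrations without replacement.

```python
import random

SYSTEM_PROMPT = (
    "You are asked to summarize a news article written in Modern Standard Arabic "
    "and Moroccan Darija, and write that summary as a clickbait headline, "
    "in Moroccan Darija only.\n"
)
INSTRUCTION = "Summarize the following news article into a headline in Moroccan Darija only:"
ANSWER_PREFIX = "Absolutely! Here is the headline summarizing your news article:"


def build_messages(article, train_pairs, n_shots, rng):
    """Build a chat prompt with n_shots (article, headline) demonstrations."""
    k = max(0, min(n_shots, len(train_pairs)))
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for src, tgt in rng.sample(train_pairs, k=k):
        messages.append({"role": "user", "content": f'{INSTRUCTION}\n"{src}"'})
        messages.append({"role": "assistant", "content": f'{ANSWER_PREFIX}\n"{tgt}"'})
    # The article to summarize always comes last
    messages.append({"role": "user", "content": f'{INSTRUCTION}\n"{article}"'})
    return messages


# Placeholder data standing in for (article, headline) pairs from the training split
pairs = [("article one", "headline one"), ("article two", "headline two"), ("article three", "headline three")]
few_shot = build_messages("new article", pairs, n_shots=2, rng=random.Random(0))
zero_shot = build_messages("new article", pairs, n_shots=0, rng=random.Random(0))
print(len(few_shot), len(zero_shot))
```

Each demonstration adds one user turn and one assistant turn, so reducing `n_shots` after a context-length error shortens the prompt by two messages at a time.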

In [None]:
import pandas as pd
from datasets import load_dataset
from rouge_metric import PyRouge

def evaluate_rouge(hypotheses, references):
    these_refs = [[ref.strip().lower()] for ref in references]
    rouge = PyRouge(rouge_n=(1, 2), rouge_l=True)
    scores = rouge.evaluate(hypotheses, these_refs)
    print(scores)

def substring_after_colon(input_string):
    # Find the index of the first colon
    colon_index = input_string.find(':')

    # If a colon is found, return the substring starting just after it
    if colon_index != -1:
        return input_string[colon_index + 1:]
    else:
        # If no colon is found, return the input string unchanged
        return input_string


if __name__ == "__main__":
    DATASET = "Goud"
    hypotheses = pd.read_csv(f"./{DATASET}_test_generated_0.csv")["generated_headline"].tolist()
    hypotheses = [substring_after_colon(hypo).replace("\"", "").strip() for hypo in hypotheses]
    goud_data = load_dataset("Goud/Goud-sum")
    references = goud_data["test"]["headline"]
    evaluate_rouge(hypotheses, references)
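
One edge case in the post-processing above is a response that omits the preamble but contains a colon inside the headline itself: splitting at the first colon then drops part of the headline. The sketch below contrasts that behaviour with a hypothetical alternative, `extract_headline`, which only strips text up to the known preamble ending. The sample responses are invented, and whether the real GPT outputs contain such cases is an assumption worth checking on the CSV.

```python
PREAMBLE_END = "news article:"  # tail of the assistant preamble used in the few-shot prompt


def first_colon_split(response):
    """Mirror of the approach above: keep everything after the first colon."""
    idx = response.find(":")
    text = response[idx + 1:] if idx != -1 else response
    return text.replace('"', "").strip()


def extract_headline(response):
    """Strip the preamble only when it is actually present."""
    idx = response.find(PREAMBLE_END)
    if idx != -1:
        response = response[idx + len(PREAMBLE_END):]
    return response.replace('"', "").strip()


with_preamble = 'Absolutely! Here is the headline summarizing your news article:\n"Wydad: big win"'
bare = '"Wydad: big win"'

print(first_colon_split(with_preamble), "|", first_colon_split(bare))
print(extract_headline(with_preamble), "|", extract_headline(bare))
```

Both functions agree when the preamble is present, but only the second keeps the full headline when the model answers with the headline alone.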

**Group Task:**

Task that involves asking your neighbour or a group a question.

In [None]:
# @title Generate Quiz Form. (Run Cell)
from IPython.display import HTML

HTML(
    """
<iframe
  src="https://forms.gle/zbJoTSz3nfYq1VrY6"
  width="80%"
  height="1200px">
  Loading...
</iframe>
"""
)

## Conclusion
**Summary:**

[Summary of the main points/takeaways from the prac.]

**Next Steps:**

[Next steps for people who have completed the prac, like optional reading (e.g. blogs, papers, courses, youtube videos). This could also link to other pracs.]

**Appendix:**

[Anything (probably math heavy stuff) we don't have space for in the main practical sections.]

**References:**

[References for any content used in the notebook.]

For other practicals from the Deep Learning Indaba, please visit [here](https://github.com/deep-learning-indaba/indaba-pracs-2022).

## Feedback

Please provide feedback that we can use to improve our practicals in the future.

In [None]:
# @title Generate Feedback Form. (Run Cell)
from IPython.display import HTML

HTML(
    """
<iframe
  src="https://forms.gle/WUpRupqfhFtbLXtN6"
  width="80%"
  height="1200px">
  Loading...
</iframe>
"""
)

<img src="https://baobab.deeplearningindaba.com/static/media/indaba-logo-dark.d5a6196d.png" width="50%" />