# Fine-Tuning with *xlm-roberta-base*

## Installing the `datasets` Library

To fine-tune the model, we need to load and preprocess the dataset efficiently. The `datasets` library by Hugging Face provides tools to load, manipulate, and share datasets for NLP tasks, including Named Entity Recognition (NER).

In [1]:
!pip install datasets


## Loading CoNLL-formatted NER Dataset

The function `load_conll_dataset` is designed to load and parse data in **CoNLL format** for **Named Entity Recognition (NER)** tasks. This format is commonly used for NER, where each line contains a word and its corresponding entity label, and sentences are separated by blank lines.

### Breakdown of the Function:

1. **Parsing CoNLL Data**:  
   The inner function `parse_conll(file_path)` reads the CoNLL file line by line:
   - Each non-blank line contains exactly two columns, a **word** and its **label**, separated by whitespace.
   - Sentences are stored as lists of word-label pairs, and blank lines signify the end of a sentence.
   - After processing all lines, the sentences are grouped in a list, where each sentence is a list of tuples, with each tuple containing a word and its corresponding label.

2. **Data Formatting**:  
   The parsed data is converted into a dictionary:
   - **"tokens"**: A list of lists, where each inner list contains the tokens (words) of a sentence.
   - **"ner_tags"**: A list of lists, where each inner list contains the NER labels corresponding to the tokens in the same sentence.

3. **Dataset Creation**:  
   The dictionary is then passed to Hugging Face’s `Dataset.from_dict()` method, which converts the parsed data into a Hugging Face `Dataset` object. This allows us to utilize the powerful tools provided by the Hugging Face ecosystem for further processing, training, and evaluation of the model.
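
To make the three steps above concrete, here is a minimal sketch that parses a small CoNLL-style string in memory and builds the same `tokens` / `ner_tags` dictionary. The sample text and the helper name `parse_conll_text` are made up for illustration; the sketch stops before `Dataset.from_dict()` so that it runs without any extra dependencies.

```python
# Hypothetical two-sentence sample in CoNLL format (word, whitespace, label)
sample = """አይፎን B-Product
15 I-Product
ዋጋ B-PRICE

አዳማ B-LOC
"""

def parse_conll_text(text):
    """Group (word, label) pairs into sentences, splitting on blank lines."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # A blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        else:
            word, label = line.split()  # Expects exactly two columns
            current.append((word, label))
    if current:  # Catch a final sentence with no trailing blank line
        sentences.append(current)
    return sentences

sentences = parse_conll_text(sample)

# Same dictionary layout that is later passed to Dataset.from_dict()
data = {
    "tokens": [[word for word, _ in sentence] for sentence in sentences],
    "ner_tags": [[label for _, label in sentence] for sentence in sentences],
}
print(data)
```

Each inner list in `tokens` lines up position by position with the inner list at the same index in `ner_tags`.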


In [2]:
from datasets import Dataset
from transformers import AutoTokenizer

def load_conll_dataset(file_path):
    # Function to parse CoNLL data and return it as a Hugging Face Dataset
    def parse_conll(file_path):
        sentences = []
        current_sentence = []

        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line == "":  # New sentence
                    if current_sentence:
                        sentences.append(current_sentence)
                        current_sentence = []
                else:
                    word, label = line.split()  # Assumes exactly two whitespace-separated columns: word and label
                    current_sentence.append((word, label))

            if current_sentence:  # Catch any remaining sentence
                sentences.append(current_sentence)

        return sentences

    # Parse the data
    parsed_sentences = parse_conll(file_path)

    # Prepare the data in dictionary format
    data = {
        "tokens": [[word for word, label in sentence] for sentence in parsed_sentences],
        "ner_tags": [[label for word, label in sentence] for sentence in parsed_sentences],
    }

    # Create and return a Hugging Face dataset
    dataset = Dataset.from_dict(data)

    return dataset


In [3]:
#Load the dataset
file_path = "/content/conll_format_data.conll"
dataset = load_conll_dataset(file_path)


In [4]:
dataset

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 936
})

## Tokenizer Initialization

In this section, we initialize the tokenizer using a pre-trained model from Hugging Face. The **tokenizer** converts raw text (sentences or pre-split words) into the input tokens that the model processes during training and inference. For this project, we need a model whose tokenizer supports Amharic; we currently use the `xlm-roberta-base` model.

In [5]:
# Initialize the tokenizer (using a model that supports Amharic, such as XLM-R or bert-tiny-amharic)
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)




## Encoding NER Labels

The `encode_labels` function is responsible for converting human-readable Named Entity Recognition (NER) tags (such as `B-Product`, `I-Product`, etc.) into integer-encoded labels. This is necessary because machine learning models typically require labels in numerical form.

In [6]:
def encode_labels(data):
    # NER tags; the order of this list defines each tag's integer ID
    ner_tags = ['B-Product', 'I-Product', 'B-PRICE', 'I-PRICE', 'B-LOC', 'I-LOC', 'O']

    # Create a dictionary to map each tag to a unique integer
    label_to_id = {label: idx for idx, label in enumerate(ner_tags)}
    data['ner_tags'] = [label_to_id[tag] for tag in data['ner_tags']]

    return data

# Apply the function to the dataset
encoded_dataset = dataset.map(encode_labels)
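
The mapping can be illustrated on its own. The sketch below rebuilds `label_to_id` from the same tag list and also derives the inverse mapping, which is handy for turning predictions back into tag names. The names `id_to_label` and `example_tags` are introduced here for illustration only.

```python
# Same tag list as in encode_labels; list position is the integer ID
ner_tags = ['B-Product', 'I-Product', 'B-PRICE', 'I-PRICE', 'B-LOC', 'I-LOC', 'O']

label_to_id = {label: idx for idx, label in enumerate(ner_tags)}
id_to_label = {idx: label for label, idx in label_to_id.items()}  # Inverse mapping

# Hypothetical tag sequence for one sentence
example_tags = ['B-Product', 'I-Product', 'O', 'B-LOC']

encoded = [label_to_id[tag] for tag in example_tags]
decoded = [id_to_label[idx] for idx in encoded]

print(encoded)
print(decoded)
```

Because the IDs depend on list order, the same `ner_tags` list has to be used wherever labels are decoded.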


## Tokenizing and Aligning Labels

The `tokenize_and_align_labels` function tokenizes the input sentences while keeping the NER labels (now encoded as integers) aligned with the tokens after subword tokenization. This is essential for models such as XLM-RoBERTa, whose tokenizer splits words into subwords. By default, only the first subword of each word keeps the word's label; special tokens and the remaining subwords receive the label `-100`, which the loss function ignores during training.


In [7]:
def tokenize_and_align_labels(dataset, tokenizer, label_all_tokens=False):
    def tokenize_and_align(examples):
        tokenized_inputs = tokenizer(examples['tokens'], truncation=True, padding=True, is_split_into_words=True)

        labels = []
        for i, label in enumerate(examples["ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to words
            label_ids = []
            previous_word_idx = None
            for word_idx in word_ids:
                if word_idx is None:
                    label_ids.append(-100)  # Special and padding tokens
                elif word_idx != previous_word_idx:  # First subword of a word
                    label_ids.append(label[word_idx])
                else:
                    # Later subwords are labeled only when label_all_tokens is set
                    label_ids.append(label[word_idx] if label_all_tokens else -100)
                previous_word_idx = word_idx
            labels.append(label_ids)

        tokenized_inputs["labels"] = labels
        return tokenized_inputs

    tokenized_dataset = dataset.map(tokenize_and_align, batched=True)
    return tokenized_dataset

# Tokenize and align labels
tokenized_dataset = tokenize_and_align_labels(encoded_dataset, tokenizer)
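
To see the alignment rule in isolation, the sketch below applies it to a hand-written `word_ids` list instead of real tokenizer output. The helper `align_labels`, the `word_ids` values, and the word labels are all hypothetical; the real `word_ids` come from `tokenized_inputs.word_ids(...)`.

```python
IGNORE_INDEX = -100  # Label value that the loss function skips

def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Expand word-level labels to token-level labels."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special or padding token: no word behind it
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx:
            # First subword of a word carries the word's label
            aligned.append(word_labels[word_idx])
        else:
            # Later subwords are labeled only on request
            aligned.append(word_labels[word_idx] if label_all_tokens else IGNORE_INDEX)
        previous_word_idx = word_idx
    return aligned

# Hypothetical layout: <s>, word 0 (2 subwords), word 1, word 2 (2 subwords), </s>
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [0, 2, 4]  # One integer label per word

first_only = align_labels(word_ids, word_labels)
all_subwords = align_labels(word_ids, word_labels, label_all_tokens=True)

print(first_only)    # [-100, 0, -100, 2, 4, -100, -100]
print(all_subwords)  # [-100, 0, 0, 2, 4, 4, -100]
```

Labeling only the first subword means each word contributes exactly one position to the loss and to the metrics, regardless of how many pieces the tokenizer splits it into.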


## Splitting the Dataset into Training and Validation Sets

Once the dataset has been tokenized and the labels have been aligned, the next step is to split it into **training** and **validation** sets. This allows us to train the model on one portion of the data and validate its performance on another portion, which helps prevent overfitting.

In [8]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]

## Setting Up Training Arguments for Fine-Tuning

The `TrainingArguments` class from Hugging Face's `transformers` library provides an easy way to configure training hyperparameters for fine-tuning transformer models. In this section, we define key arguments that control the training process, such as the evaluation strategy, learning rate, batch size, number of epochs, and weight decay.

In [9]:
# Setting up training arguments
from transformers import TrainingArguments

def setup_training_args(output_dir):
    training_args = TrainingArguments(
        output_dir=output_dir,
        # Evaluate after every epoch (newer transformers releases rename this argument to eval_strategy)
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
    )
    return training_args

# Set up training arguments
output_dir = "./fine_tuned_model"
training_args = setup_training_args(output_dir)
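
As a rough check on these settings, the sketch below estimates how many optimizer steps the run performs. It assumes the 936-row dataset loaded above, an 80/20 split with the test share rounded up, a single device, and no gradient accumulation; `logging_steps_default = 500` reflects the `TrainingArguments` default for `logging_steps`. If the estimate holds, the run finishes before the first logging step, which would explain a `No log` entry in the training-loss column.

```python
import math

num_rows = 936        # Rows in the dataset loaded above
test_size = 0.2       # Share used for validation
batch_size = 16       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
logging_steps_default = 500  # Default logging_steps in TrainingArguments

# Assumption: the test share is rounded up and the rest goes to training
num_val = math.ceil(test_size * num_rows)
num_train = num_rows - num_val

# Assumption: single device, no gradient accumulation
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"train={num_train}, val={num_val}")
print(f"steps per epoch={steps_per_epoch}, total steps={total_steps}")
print("training loss logged:", total_steps >= logging_steps_default)
```

Passing a smaller `logging_steps` value to `TrainingArguments` would be one way to get training-loss entries for a run this short.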



## Custom Evaluation Function for Model Metrics

To assess the performance of our fine-tuned Named Entity Recognition (NER) model, we define a custom evaluation function `compute_metrics`. It calculates token-level accuracy, precision, recall, and F1 score over the positions that carry a real label (everything except `-100`). Precision, recall, and F1 are averaged across classes weighted by each class's support, so the weighted recall always equals the accuracy.

In [10]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np

# Custom evaluation function to calculate accuracy, precision, recall, and F1
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Drop positions labeled -100 (special tokens, padding, and non-first subwords)
    true_labels = [
        [lab for lab in label_row if lab != -100]
        for label_row in labels
    ]
    true_predictions = [
        [pred for pred, lab in zip(pred_row, label_row) if lab != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    # Flatten lists for metric calculation
    true_labels_flat = [label for sublist in true_labels for label in sublist]
    true_predictions_flat = [pred for sublist in true_predictions for pred in sublist]

    # zero_division=0 scores a class with no predicted samples as 0 without emitting a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        true_labels_flat, true_predictions_flat, average="weighted", zero_division=0
    )
    accuracy = accuracy_score(true_labels_flat, true_predictions_flat)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
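
The filtering step, and the relationship between support-weighted recall and accuracy, can be checked with a dependency-free sketch. The helpers `filter_ignored`, `accuracy`, and `weighted_recall` are hypothetical stand-ins written for this illustration, not the scikit-learn functions used above, and the prediction and label rows are made up.

```python
from collections import Counter

IGNORE_INDEX = -100

def filter_ignored(pred_rows, label_rows):
    """Keep only positions whose label is not the ignore index, flattened."""
    preds, labels = [], []
    for pred_row, label_row in zip(pred_rows, label_rows):
        for pred, lab in zip(pred_row, label_row):
            if lab != IGNORE_INDEX:
                preds.append(pred)
                labels.append(lab)
    return preds, labels

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def weighted_recall(preds, labels):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(labels)
    correct = Counter(l for p, l in zip(preds, labels) if p == l)
    total = len(labels)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)

# Hypothetical batch of two sequences; -100 marks positions to skip
pred_rows = [[6, 0, 1, 6, 6], [6, 2, 3, 6, 6]]
label_rows = [[-100, 0, 1, 6, -100], [-100, 2, 2, 6, -100]]

preds, labels = filter_ignored(pred_rows, label_rows)
print(preds, labels)
print(accuracy(preds, labels), weighted_recall(preds, labels))
```

Both printed scores agree (up to floating-point rounding), which is why the accuracy and recall columns in the training log are identical.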


## Fine-Tuning the Model for Named Entity Recognition (NER)

Fine-tuning a pre-trained transformer model on a specific task, Named Entity Recognition (NER) in our case, adapts the model to recognize the relevant entities in our dataset. In this section, we set up and run the fine-tuning process using Hugging Face’s `Trainer`.

In [11]:
# Fine-tune the model
from transformers import AutoModelForTokenClassification, Trainer

def fine_tune_model(model_name, train_dataset, val_dataset, training_args, tokenizer, num_labels):
    # Load the pre-trained model with a new token-classification head
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=num_labels)

    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    # Fine-tune the model
    trainer.train()

    return model, trainer


num_labels = 7  # Since we have seven tags
model, trainer = fine_tune_model(model_name, train_dataset, val_dataset, training_args, tokenizer, num_labels)



Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


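| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|-------|---------------|-----------------|----------|-----------|--------|----|
| 1 | No log | 0.408609 | 0.8543 | 0.804581 | 0.8543 | 0.795376 |
| 2 | No log | 0.214079 | 0.934258 | 0.922642 | 0.934258 | 0.927938 |
| 3 | No log | 0.18174 | 0.941065 | 0.93124 | 0.941065 | 0.9356 |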




## Evaluation
We run `trainer.evaluate()` on the validation set to report the loss, accuracy, precision, recall, and F1 score of the fine-tuned model.

In [12]:
# Evaluate the model
evaluation_results = trainer.evaluate()
print(evaluation_results)

{'eval_loss': 0.18173973262310028, 'eval_accuracy': 0.9410649663966913, 'eval_precision': 0.9312404925114645, 'eval_recall': 0.9410649663966913, 'eval_f1': 0.9356001764853812, 'eval_runtime': 4.9466, 'eval_samples_per_second': 38.006, 'eval_steps_per_second': 2.426, 'epoch': 3.0}




## Save Model and Tokenizer
We save the fine-tuned model and its tokenizer to `output_dir` so that both can be reloaded later.

In [13]:
# Save the fine-tuned model and tokenizer
def save_model(model, tokenizer, output_dir):
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)


save_model(model, tokenizer, output_dir)

## Test with a new message

In [14]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer from the saved directory
model = AutoModelForTokenClassification.from_pretrained("./fine_tuned_model")
tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_model")  # Same tokenizer that was saved with the model

def predict_ner(sentence):
    # Step 1: Tokenize the input sentence
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True, is_split_into_words=False)

    # Step 2: Make predictions
    with torch.no_grad():  # Disable gradient calculation
        outputs = model(**inputs)
        logits = outputs.logits

    # Get the predicted label indices
    predictions = torch.argmax(logits, dim=2)

    # Step 3: Map predictions back to labels
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    predicted_labels = []
    for i, token in enumerate(tokens):
        if token in tokenizer.special_tokens_map.values():  # Skip special tokens
            predicted_labels.append(-1)  # Assign a default value for special tokens
        else:
            # Append the prediction for the token (convert from tensor to int)
            predicted_labels.append(predictions[0][i].item())

    # Map the predicted indices to NER labels
    ner_tags = ['B-Product', 'I-Product', 'B-PRICE', 'I-PRICE', 'B-LOC', 'I-LOC', 'O']

    # Convert predicted label indices to string labels (use the correct mapping here)
    predicted_labels = [ner_tags[label] if label != -1 else "O" for label in predicted_labels]

    return tokens, predicted_labels

# Example usage
sentence = "አይፎን 15 ፕሮ ማክስ ዋጋ 10000 ብር አድርሻ አዳማ"
tokens, predicted_labels = predict_ner(sentence)

# Display the tokens and their predicted labels
for token, label in zip(tokens, predicted_labels):
    print(f"{token}: {label}")




<s>: O
▁አይ: I-Product
ፎ: I-Product
ን: I-Product
▁15: I-Product
▁: I-Product
ፕሮ: I-Product
▁ማ: I-Product
ክስ: I-Product
▁ዋጋ: B-PRICE
▁10000: I-PRICE
▁ብር: I-PRICE
▁አድር: I-PRICE
ሻ: I-PRICE
▁አዳ: B-LOC
ማ: I-Product
</s>: O
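
The output above is at the subword level, so one word can appear as several pieces. A possible post-processing step, sketched below, merges the pieces back into words and keeps the label of each word's first subword, mirroring how labels were assigned during training. The helper `merge_subwords` and the `SPECIAL_TOKENS` set are assumptions made for this sketch; the tokens and labels are copied from the output above.

```python
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}  # Assumed special tokens to drop
WORD_START = "▁"  # SentencePiece marker that begins a new word

def merge_subwords(tokens, labels):
    """Merge subword tokens into words, keeping the first subword's label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith(WORD_START) or not words:
            # Start a new word; strip the marker from its text
            words.append([token[len(WORD_START):], label])
        else:
            # Continuation piece: extend the text, keep the existing label
            words[-1][0] += token
    return [(text, label) for text, label in words]

# Tokens and predicted labels copied from the output above
tokens = ["<s>", "▁አይ", "ፎ", "ን", "▁15", "▁", "ፕሮ", "▁ማ", "ክስ", "▁ዋጋ",
          "▁10000", "▁ብር", "▁አድር", "ሻ", "▁አዳ", "ማ", "</s>"]
predicted_labels = ["O", "I-Product", "I-Product", "I-Product", "I-Product",
                    "I-Product", "I-Product", "I-Product", "I-Product", "B-PRICE",
                    "I-PRICE", "I-PRICE", "I-PRICE", "I-PRICE", "B-LOC",
                    "I-Product", "O"]

word_labels = merge_subwords(tokens, predicted_labels)
for word, label in word_labels:
    print(f"{word}: {label}")
```

With this rule the last word is reported as `B-LOC`, because the stray `I-Product` prediction sits on a continuation piece that was never trained with a label.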
