## Fine-Tuning an NER Model
This task fine-tunes a Named Entity Recognition (NER) model to extract key entities (products, prices, and locations) from Amharic Telegram messages. We start from a pre-trained language model and adapt it to identify business-related entities in Amharic.

In [1]:
# Import necessary libraries
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Folder name uses an underscore, matching the paths used in the cells below
scripts_path = '/content/drive/MyDrive/train_sample/scripts/'


In [None]:
! pip install datasets

In [5]:
# Add the correct scripts folder path
import sys
sys.path.append('/content/drive/MyDrive/train_sample/scripts')

In [6]:
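# Import the helper functions used in this notebook
from ner_finetuning import (
    load_conll_data,
    prepare_dataset,
    get_label_encodings,
    load_model_and_tokenizer,
    tokenize_and_align_labels,
    setup_trainer,
    predict_ner,
)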

In [7]:
#  Load and prepare the data
file_path = '/content/drive/MyDrive/train_sample/labeled_ner_data.conll'

### Prepare the data

In [8]:
# Parse the CoNLL file and build a Dataset with 'tokens' and 'ner_tags' columns
sentences, labels = load_conll_data(file_path)
dataset = prepare_dataset(sentences, labels)
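
The source of `load_conll_data` is not shown in this notebook. As a rough illustration of what a CoNLL loader typically does, here is a minimal sketch. It assumes one token and its tag per line (token in the first column, tag in the last) and a blank line between sentences. The function name `read_conll` and the sample labels are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple


def read_conll(lines: Iterable[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Parse CoNLL-style lines into parallel lists of tokens and tags.

    Hypothetical helper: the real load_conll_data may behave differently.
    """
    sentences: List[List[str]] = []
    labels: List[List[str]] = []
    tokens: List[str] = []
    tags: List[str] = []

    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        tags.append(parts[-1])    # last column: the NER tag

    # Flush the last sentence if the input has no trailing blank line.
    if tokens:
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels


# Illustrative input only; these labels are not taken from the real data file.
sample_lines = [
    "ማጠቢያ B-Product",
    "3350ብር B-Price",
    "",
    "ገርጂ B-LOC",
    "ኢምፔሪያል I-LOC",
]
sample_sentences, sample_tags = read_conll(sample_lines)
print(sample_sentences)
print(sample_tags)
```

Each sentence ends up as one row, which matches the `tokens` / `ner_tags` layout shown in the dataset output.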

In [9]:
selected_columns = dataset.select_columns(['tokens', 'ner_tags'])
print(selected_columns)


Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 1340
})


In [10]:
# Extract the 'tokens' and 'ner_tags' columns
tokens = dataset['tokens']
ner_tags = dataset['ner_tags']

# Print the first 5 entries as an example
for i in range(5):
    print(f"Tokens: {tokens[i]}")
    print(f"NER Tags: {ner_tags[i]}")


Tokens: ['ለኮንዶሚኒየም', 'ለጠባብ', 'ቤቶች', 'ገላግሌ', 'የሆነ', 'ከንፁህ', 'የሲልከን', 'ጥሬ', 'እቃ', 'የተሰራ', 'የልጆች', 'ማጠቢያ', 'ምስሉ', 'ላይ', 'እንደሚያዩት', 'መታጠፍ', 'መዘርጋት', 'የሚችል', '3350ብር', 'ይደውሉልን', 'እርሶ', 'መምጣት', 'ባይመቾ', 'እኛ', 'ያሉበት', 'ድረስ', 'እናደርስሎታለን', 'ስልክ', '0905707448', '0909003864', 'ሲና', 'የተመረጡና', 'ጥራታቸውን', 'የጠበቁ', 'የልጆች', 'እቃ', 'አስመጪ', '0909003864', '0905707448', 'እቃ', 'ለማዘዝ', 'ከስር', 'ያለውን', 'ሊንኮች', 'በመጫን', 'ማዘዝ', 'ትችላላቹ', '@', '@2', 'አድራሻ', 'ቁጥር', 'ገርጂ', 'ኢምፔሪያል', 'ከሳሚ', 'ህንፃ', 'ጎን', 'አልፎዝ', 'ፕላዛ', 'ግራውንድ', 'ላይ', 'እንደገቡ', 'ያገኙናል', '2ቁጥር2', '4ኪሎ', 'ቅድስት', 'ስላሴ', 'ህንፃ', 'ማለትም', 'ከብልፅግና', 'ዋናፅፈት', 'ቤት', 'ህንፃ', 'በስተ', 'ቀኝ', 'ባለው', 'አስፓልት', '20ሜትር', 'ዝቅ', 'እንዳሉ', 'ሀበሻ', 'ኮፊ', 'የሚገኝበት', 'ቀይ', 'ሸክላ', 'ህንፃ', '2ተኛ', 'ፎቅ', 'ላይ', 'ያገኙናል', '3ቁጥር3', 'ብስራተ', 'ገብርኤል', 'ላፍቶ', 'ሞል', 'መግቢያው', 'ፊት', 'ለፊት', 'የሚገኘው', 'የብስራተ', 'ገብርኤል', 'ቤተ', 'ክርስቲያን', 'ህንፃ', 'አንደኛ', 'ፎቅ', 'ላይ', 'ደረጃ', 'እንደወጣቹ', 'በስተግራ', 'በኩል', 'ሱቅ', 'ቁጥር', '-09', 'ክቡራን', 'ደምበኞቻችን', 'ገርጂ', 'አልፎዝ', 'ፕላዛ', 'ላይ', 'አራት', 'ኪሎ', 'ቅድስት', 'ስላሴ', 'እንዲሁም', 'ብስራተ', 'ገ

### Label encoding

In [11]:
#  Define label encoding
label_list, label2id, id2label = get_label_encodings()

In [12]:
# Create a DataFrame to store the mappings
df = pd.DataFrame({
    'Label': label_list,
    'Label to ID': [label2id[label] for label in label_list],
    'ID to Label': [id2label[label2id[label]] for label in label_list]
})

# Print the DataFrame (table format)
print(df)


       Label  Label to ID ID to Label
0          O            0           O
1  B-Product            1   B-Product
2  I-Product            2   I-Product
3      B-LOC            3       B-LOC
4      I-LOC            4       I-LOC
5    B-Price            5     B-Price
6    I-Price            6     I-Price
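
The mappings in the table above can be reproduced with two dictionary comprehensions. The sketch below is an assumption about how `get_label_encodings` might be built, not its actual source. The label order is copied from the printed table, and `build_label_encodings` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def build_label_encodings() -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
    """Return the label list plus label->id and id->label mappings.

    Hypothetical stand-in for get_label_encodings; labels follow the BIO scheme.
    """
    labels = ["O", "B-Product", "I-Product", "B-LOC", "I-LOC", "B-Price", "I-Price"]
    to_id = {label: idx for idx, label in enumerate(labels)}
    to_label = {idx: label for label, idx in to_id.items()}
    return labels, to_id, to_label


demo_labels, demo_label2id, demo_id2label = build_label_encodings()
print(demo_label2id["B-Price"])  # 5, as in the table above
print(demo_id2label[3])          # B-LOC
```

Keeping `O` at index 0 and each `B-`/`I-` pair adjacent makes the IDs easy to read when inspecting raw model outputs.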


### Load tokenizer and model

In [13]:
# Load tokenizer and model
# Alternative checkpoint: "bert-base-multilingual-cased"
model_name = "xlm-roberta-base"
tokenizer, model = load_model_and_tokenizer(model_name, len(label_list), id2label, label2id)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Tokenization and NER Label Alignment

This step tokenizes the input sentences with a Hugging Face tokenizer and aligns the word-level NER labels with the resulting subword tokens, since one word can be split into several subwords.


In [14]:
#  Tokenize the dataset
tokenized_dataset = dataset.map(
    lambda examples: tokenize_and_align_labels(examples, tokenizer, label2id),
    batched=True
)

Map:   0%|          | 0/1340 [00:00<?, ? examples/s]
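
The body of `tokenize_and_align_labels` is not shown here. A common pattern, sketched below under that caveat, uses the per-token word indices that fast Hugging Face tokenizers expose through `word_ids()`: special tokens get `-100` so the loss ignores them, the first subword of each word gets the word's label, and later subwords are either ignored or given the same label. The function `align_labels` and its `label_all_subwords` flag are hypothetical, and the word indices are passed in as a plain list so the example runs without downloading a tokenizer.

```python
from typing import Dict, List, Optional

IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips this target value by default


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[str],
    label2id: Dict[str, int],
    label_all_subwords: bool = False,
) -> List[int]:
    """Map word-level labels onto subword positions.

    word_ids has one entry per subword: the index of the source word,
    or None for special tokens such as <s> and </s>.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)                    # special token
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first subword of a word
        elif label_all_subwords:
            aligned.append(label2id[word_labels[word_id]])  # propagate label
        else:
            aligned.append(IGNORE_INDEX)                    # ignore continuation
        previous = word_id
    return aligned


# Illustrative input: three words split into five subwords plus two special tokens.
demo_map = {"O": 0, "B-Product": 1, "B-Price": 5}
demo_word_ids = [None, 0, 0, 1, 2, 2, None]
demo_word_labels = ["B-Product", "O", "B-Price"]
print(align_labels(demo_word_ids, demo_word_labels, demo_map))
# [-100, 1, -100, 0, 5, -100, -100]
```

Whether continuation subwords are ignored or labeled is a design choice. Ignoring them trains the model only on first subwords, while labeling them gives more training signal but makes subword-level predictions harder to read.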

In [15]:
#  Split the dataset into training and evaluation sets
train_test_split = tokenized_dataset.train_test_split(test_size=0.1)  # Splits into train and test sets

# Accessing train and eval datasets from the split
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

In [16]:
# Inspect the resulting train and test splits
train_test_split

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1206
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 134
    })
})

### Fine-tune the model (xlm-roberta-base)

In [17]:
# Set up trainer and fine-tune the model
output_dir = "./results"
trainer = setup_trainer(model, tokenizer, train_dataset, eval_dataset, output_dir)  # Pass both train and eval datasets
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.1077          |
| 2     | No log        | 0.048249        |
| 3     | No log        | 0.02252         |
| 4     | No log        | 0.020867        |
| 5     | No log        | 0.019684        |


TrainOutput(global_step=380, training_loss=0.11338071321186266, metrics={'train_runtime': 200.8385, 'train_samples_per_second': 30.024, 'train_steps_per_second': 1.892, 'total_flos': 393922667128320.0, 'train_loss': 0.11338071321186266, 'epoch': 5.0})
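
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below uses only values printed above, plus one assumption: an effective batch size of 16, which is not shown in the notebook but is consistent with 76 optimizer steps per epoch over 1206 training rows. If `setup_trainer` leaves the Trainer's `logging_steps` at its default of 500, that would also explain the "No log" entries, since the whole run has only 380 steps; this too is an assumption about the hidden configuration.

```python
import math

# Values copied from the outputs above.
train_rows = 1206
epochs = 5
global_step = 380
train_runtime = 200.8385  # seconds

# Assumption: effective batch size (per-device batch size x accumulation x devices).
assumed_batch_size = 16

steps_per_epoch = global_step // epochs
expected_steps = math.ceil(train_rows / assumed_batch_size)
samples_per_second = train_rows * epochs / train_runtime
steps_per_second = global_step / train_runtime

print(steps_per_epoch, expected_steps)   # 76 76
print(round(samples_per_second, 3))      # 30.024
print(round(steps_per_second, 3))        # 1.892
```

Both throughput figures match the reported `train_samples_per_second` and `train_steps_per_second`, so the run processed the full training split in each of the five epochs.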

In [18]:
# Save the fine-tuned model
model.save_pretrained("./amharic_ner_model")
tokenizer.save_pretrained("./amharic_ner_model")

('./amharic_ner_model/tokenizer_config.json',
 './amharic_ner_model/special_tokens_map.json',
 './amharic_ner_model/sentencepiece.bpe.model',
 './amharic_ner_model/added_tokens.json',
 './amharic_ner_model/tokenizer.json')

### Sample model prediction

In [19]:
# Run the fine-tuned model on a sample sentence
sample_text = "ለኮንዶሚኒየም ለጠባብ ቤቶች ገላግሌ የሆነ ከንፁህ የሲልከን ጥሬ እቃ የተሰራ"
predictions = predict_ner(sample_text, model, tokenizer, id2label)

In [20]:
# Print results
for token, label in predictions:
    print(f"{token}: {label}")

<s>: O
▁ለ: B-Product
ኮ: B-Product
ን: B-Product
ዶ: B-Product
ሚኒ: B-Product
የም: B-Product
▁: B-Product
ለጠ: O
ባብ: B-Product
▁ቤቶች: O
▁: B-Product
ገላ: O
ግ: O
ሌ: O
▁የሆነ: O
▁: B-Product
ከን: O
ፁ: B-Product
ህ: O
▁የ: I-Product
ሲ: B-Product
ል: I-Product
ከን: I-Product
▁ጥ: I-Product
ሬ: I-Product
▁እ: I-Product
ቃ: I-Product
▁የተ: O
ሰራ: O
</s>: O
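
The output above is at the subword level, so one word can carry several, sometimes conflicting, labels. One common way to get word-level results, sketched here as an assumption rather than as part of `predict_ner`, is to merge SentencePiece pieces back into words and keep the label of each word's first piece. The helper `merge_subwords` is hypothetical. It relies on the `▁` marker (U+2581) that the XLM-RoBERTa tokenizer places at the start of each word.

```python
from typing import List, Tuple

SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}
WORD_START = "\u2581"  # SentencePiece word-boundary marker


def merge_subwords(predictions: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge subword predictions into (word, label) pairs.

    Each word takes the label predicted for its first piece.
    """
    words: List[Tuple[str, str]] = []
    for token, label in predictions:
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith(WORD_START) or not words:
            # A new word starts here; drop the boundary marker from its text.
            text = token[len(WORD_START):] if token.startswith(WORD_START) else token
            words.append((text, label))
        else:
            # Continuation piece: extend the current word, keep its first label.
            text, first_label = words[-1]
            words[-1] = (text + token, first_label)
    return words


# First few predictions from the output above.
demo_predictions = [
    ("<s>", "O"),
    (WORD_START + "ለ", "B-Product"), ("ኮ", "B-Product"), ("ን", "B-Product"),
    ("ዶ", "B-Product"), ("ሚኒ", "B-Product"), ("የም", "B-Product"),
    (WORD_START, "B-Product"), ("ለጠ", "O"), ("ባብ", "B-Product"),
    (WORD_START + "ቤቶች", "O"),
    ("</s>", "O"),
]
for word, label in merge_subwords(demo_predictions):
    print(f"{word}: {label}")
```

Taking the first piece's label is the simplest rule. A majority vote over a word's pieces is another option when the first piece is only the bare `▁` marker.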
