## **Environment Setup**
 We'll begin by installing necessary libraries, including transformers, datasets, seqeval, and others needed for token classification and model evaluation.

In [1]:
!pip install transformers datasets seqeval sentences
!pip install pyarrow==14.0.1


Collecting seqeval
  Using cached seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
[31mERROR: Could not find a version that satisfies the requirement sentences (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for sentences[0m[31m
[0mCollecting pyarrow==14.0.1
  Using cached pyarrow-14.0.1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Using cached pyarrow-14.0.1-cp310-cp310-manylinux_2_28_x86_64.whl (38.0 MB)
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 17.0.0
    Uninstalling pyarrow-17.0.0:
      Successfully uninstalled pyarrow-17.0.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.6.1 requires pyarrow<16.2.0a0,>=16.1.0, but you have pyarrow 14.0.1 which is incompatible.
datasets 3.0.1 requires pyarro

## **Import Libraries**
Next, we import key libraries required for token classification, dataset management, and training.

In [2]:
!pip install datasets
# Import Libraries
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification
from datasets import Dataset

Collecting pyarrow>=15.0.0 (from datasets)
  Using cached pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Using cached pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 14.0.1
    Uninstalling pyarrow-14.0.1:
      Successfully uninstalled pyarrow-14.0.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.6.1 requires pyarrow<16.2.0a0,>=16.1.0, but you have pyarrow 17.0.0 which is incompatible.[0m[31m
[0mSuccessfully installed pyarrow-17.0.0


## **Upload Dataset**
We upload the dataset in .conll format that has already been labeled with entities (Product, Location, Price).

In [3]:
# Upload Dataset
from google.colab import files
uploaded = files.upload()

Saving combined_dataset.conll to combined_dataset.conll


## **Read Dataset and Preprocess**
The dataset in .conll format is read, and tokens are checked for any mismatches between the tokens and labels.

In [4]:
# Function to read the .conll file and check for mismatches
def read_conll_format(file_path):
    sentences = []
    labels = []
    with open(file_path, 'r', encoding='utf-8') as file:
        sentence = []
        label = []
        for line in file:
            if line.strip() == "":
                if sentence:  # Only append non-empty sentences
                    sentences.append(sentence)
                    labels.append(label)
                    sentence = []
                    label = []
            else:
                token, tag = line.split()
                sentence.append(token)
                label.append(tag)
    return sentences, labels

# Read the file
file_path = 'combined_dataset.conll'
sentences, labels = read_conll_format(file_path)

# Check for mismatches between tokens and labels
mismatches = []
for i, (sentence, label) in enumerate(zip(sentences, labels)):
    if len(sentence) != len(label):
        mismatches.append((i, len(sentence), len(label)))

if mismatches:
    print("Mismatches found:", mismatches)
else:
    print("No mismatches found!")


No mismatches found!


## **Label Normalization**
In this step, we map all label variations (e.g., B-PRICE, I-PRICE) to a consistent naming convention to avoid errors during training.

In [5]:
# Define a mapping for label standardization
label_mapping = {
    "O":"o",
    "B-LOC": "b-loc",
    "I-LOC": "i-loc",
    "B-PRODUCT": "b-product",
    "I-PRODUCT": "i-product",
    "B-PRICE": "b-price",
    "B-price": "b-price",
    "I-PRICE": "i-price",
}
combined_dataset = {"labels": labels}
# Normalize labels based on the mapping
combined_dataset["labels"] = [
    [label_mapping.get(label, label) for label in label_list]
    for label_list in combined_dataset["labels"]
]


## **Tokenizer Setup and Dataset Preparation**
We use the XLM-Roberta tokenizer to process the Amharic text. We also convert string labels to corresponding integer IDs, which the model can process.

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

# Tokenizer setup
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Label to ID mapping
label_to_id = {
    "o": 0,
    "b-loc": 1,
    "i-loc": 2,
    "b-product": 3,
    "b-prod": 3,  # Alias for b-product
    "i-product": 4,
    "b-price": 5,
    "i-price": 6,
}

# Convert string labels to their corresponding integer IDs
def convert_labels_to_ids(examples):
    return {
        "labels": [
            [label_to_id[label] for label in label_sequence]
            for label_sequence in examples["labels"]
        ]
    }

# Tokenize and align labels with conversion to integers
def tokenize_and_align_labels(examples):
    # Tokenize inputs with padding and truncation enabled
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        padding=True,  # Ensures uniform length
        is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples['labels']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to words
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Special tokens are ignored
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])  # Use the label of the word
            else:
                label_ids.append(-100)  # Ignore subwords
            previous_word_idx = word_idx

        labels.append(label_ids)

    # Add labels to the tokenized inputs
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Create a dataset dictionary
dataset_dict = {"tokens": sentences, "labels": labels}
dataset = Dataset.from_dict(dataset_dict)

# Step 1: Convert string labels to integer IDs
dataset = dataset.map(convert_labels_to_ids, batched=True)

# Step 2: Tokenize and align the labels
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)


Map:   0%|          | 0/47878 [00:00<?, ? examples/s]

Map:   0%|          | 0/47878 [00:00<?, ? examples/s]

## **Split Dataset**

In [None]:
split_dataset = tokenized_datasets.train_test_split(test_size=0.2)
train_dataset = split_dataset['train']
validation_dataset = split_dataset['test']


NameError: name 'tokenized_datasets' is not defined

## **Define Training Arguments**
We define the arguments for training the NER model, such as batch size, learning rate, and evaluation strategy.

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',  # Log directory
    logging_steps=10,
    save_strategy="epoch",
)




## **Define Compute Metrics**

In [None]:
!pip install evaluate
!pip install seqeval

import evaluate
import numpy as np

# Load the seqeval metric for token classification tasks
metric = evaluate.load("seqeval")
label_list = ["o", "b-loc", "i-loc", "b-product", "i-product", "b-price", "i-price"]

# Define a compute_metrics function
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = [
        [label_list[l] for l in label if l != -100]
        for label in labels
    ]
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return metric.compute(predictions=true_predictions, references=true_labels)


Collecting seqeval
  Using cached seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... [?25l[?25hdone
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=5bbcde8c2026d003da8a05cfb31208f1bb6e167e4aba4fee5fb2552eefc0d57b
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


## **Train the model**

In [None]:
# Initialize model
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=len(label_list))

# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()


model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Unnamed: 3,Loc,Price,Product,Overall Precision,Overall Recall,Overall F1,Overall Accuracy
1,0.0104,0.019542,"{'precision': 0.8750815394651011, 'recall': 0.8843111404087014, 'f1': 0.879672131147541, 'number': 6068}","{'precision': 0.9749928140270192, 'recall': 0.9803468208092485, 'f1': 0.9776624873901137, 'number': 3460}","{'precision': 0.9565217391304348, 'recall': 0.9513513513513514, 'f1': 0.953929539295393, 'number': 3145}","{'precision': 0.8547687861271677, 'recall': 0.8666666666666667, 'f1': 0.8606766096762459, 'number': 2730}",0.910299,0.916445,0.913361,0.994969
2,0.0337,0.009282,"{'precision': 0.9443349753694581, 'recall': 0.9477587343441002, 'f1': 0.9460437571969075, 'number': 6068}","{'precision': 0.9909988385598142, 'recall': 0.9864161849710983, 'f1': 0.988702201622248, 'number': 3460}","{'precision': 0.9781299524564184, 'recall': 0.9812400635930048, 'f1': 0.9796825396825397, 'number': 3145}","{'precision': 0.9322897706589006, 'recall': 0.9380952380952381, 'f1': 0.9351834946138398, 'number': 2730}",0.95951,0.961566,0.960537,0.997475
3,0.0007,0.007956,"{'precision': 0.9622672598451145, 'recall': 0.962425840474621, 'f1': 0.962346543626926, 'number': 6068}","{'precision': 0.9947643979057592, 'recall': 0.9884393063583815, 'f1': 0.9915917657291968, 'number': 3460}","{'precision': 0.979746835443038, 'recall': 0.9844197138314785, 'f1': 0.9820777160983346, 'number': 3145}","{'precision': 0.9636163175303197, 'recall': 0.9604395604395605, 'f1': 0.9620253164556962, 'number': 2730}",0.973356,0.972408,0.972882,0.998184


Trainer is attempting to log a value of "{'precision': 0.8750815394651011, 'recall': 0.8843111404087014, 'f1': 0.879672131147541, 'number': 6068}" of type <class 'dict'> for key "eval/_" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.9749928140270192, 'recall': 0.9803468208092485, 'f1': 0.9776624873901137, 'number': 3460}" of type <class 'dict'> for key "eval/loc" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.9565217391304348, 'recall': 0.9513513513513514, 'f1': 0.953929539295393, 'number': 3145}" of type <class 'dict'> for key "eval/price" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.8547687861271677, 'recall': 0.8666666666666667, 

TrainOutput(global_step=14364, training_loss=0.022145927235595577, metrics={'train_runtime': 5652.1102, 'train_samples_per_second': 20.33, 'train_steps_per_second': 2.541, 'total_flos': 1.0826212019003304e+16, 'train_loss': 0.022145927235595577, 'epoch': 3.0})

## **Model Evaluation**
The model is evaluated after training. Metrics such as F1-score, precision, and recall will be computed.

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)


Trainer is attempting to log a value of "{'precision': 0.9622672598451145, 'recall': 0.962425840474621, 'f1': 0.962346543626926, 'number': 6068}" of type <class 'dict'> for key "eval/_" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.9947643979057592, 'recall': 0.9884393063583815, 'f1': 0.9915917657291968, 'number': 3460}" of type <class 'dict'> for key "eval/loc" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.979746835443038, 'recall': 0.9844197138314785, 'f1': 0.9820777160983346, 'number': 3145}" of type <class 'dict'> for key "eval/price" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.9636163175303197, 'recall': 0.9604395604395605, '

{'eval_loss': 0.007955918088555336, 'eval__': {'precision': 0.9622672598451145, 'recall': 0.962425840474621, 'f1': 0.962346543626926, 'number': 6068}, 'eval_loc': {'precision': 0.9947643979057592, 'recall': 0.9884393063583815, 'f1': 0.9915917657291968, 'number': 3460}, 'eval_price': {'precision': 0.979746835443038, 'recall': 0.9844197138314785, 'f1': 0.9820777160983346, 'number': 3145}, 'eval_product': {'precision': 0.9636163175303197, 'recall': 0.9604395604395605, 'f1': 0.9620253164556962, 'number': 2730}, 'eval_overall_precision': 0.9733558617104237, 'eval_overall_recall': 0.9724079724728949, 'eval_overall_f1': 0.9728816862070085, 'eval_overall_accuracy': 0.9981839807021333, 'eval_runtime': 120.6055, 'eval_samples_per_second': 79.399, 'eval_steps_per_second': 9.925, 'epoch': 3.0}


## **Save the Fine-tuned Model**
Finally, we save the fine-tuned model and tokenizer for future use.

In [None]:
# Save the model
trainer.save_model("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")


('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/sentencepiece.bpe.model',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

## **Model Comparison & Selection**
XLM-Roberta: This is a strong multilingual model, often good for NER tasks across languages.  
DistilBERT: This is a distilled version of BERT, which is smaller and faster.  
mBERT (Multilingual BERT): This is a multilingual version of BERT, which is effective for low-resource languages like Amharic.

In [6]:
# Tokenizer setup for distilbert-base-multilingual-cased
from transformers import AutoTokenizer
from datasets import Dataset

# Use the tokenizer for distilbert
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Label mapping (you already have this)
label_to_id = {
    "o": 0,
    "b-loc": 1,
    "i-loc": 2,
    "b-product": 3,
    "b-prod": 3,  # Alias for b-product
    "i-product": 4,
    "b-price": 5,
    "i-price": 6,
}

# Tokenize and align the dataset with the correct labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        padding=True,
        is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples['labels']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Ignore special tokens
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])  # Label for the first wordpiece token
            else:
                label_ids.append(-100)  # Ignore subword tokens
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Create the dataset
dataset_dict = {"tokens": sentences, "labels": labels}
dataset = Dataset.from_dict(dataset_dict)

# Convert string labels to their corresponding integer IDs
def convert_labels_to_ids(examples):
    return {
        "labels": [
            [label_to_id[label] for label in label_sequence]
            for label_sequence in examples["labels"]
        ]
    }

# Map the conversion and tokenization functions to the dataset
dataset = dataset.map(convert_labels_to_ids, batched=True)
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)

# Inspect tokenized input to check for issues
print(tokenized_datasets[0])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/466 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]



Map:   0%|          | 0/47878 [00:00<?, ? examples/s]

Map:   0%|          | 0/47878 [00:00<?, ? examples/s]

{'tokens': ['ይሄንን', 'ተጭነው', 'ያድርጉ፣', 'ቤተሰብ', 'ይሁኑ'], 'labels': [-100, 0, 0, 0, -100, 0, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -10

In [7]:
split_dataset = tokenized_datasets.train_test_split(test_size=0.2)
train_dataset = split_dataset['train']
validation_dataset = split_dataset['test']


In [8]:
!pip install evaluate
!pip install seqeval

import evaluate
import numpy as np

# Load the seqeval metric for token classification tasks
metric = evaluate.load("seqeval")
label_list = ["o", "b-loc", "i-loc", "b-product", "i-product", "b-price", "i-price"]

# Define a compute_metrics function
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = [
        [label_list[l] for l in label if l != -100]
        for label in labels
    ]
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return metric.compute(predictions=true_predictions, references=true_labels)


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3
Collecting seqeval
  Using cached seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... [?25l[?25hdone
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=e137a76379b3f1e6a6932885c20860a310260c84f44a8999db8c3b941d69b63e
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

## **DistilBERT**

In [None]:
import os
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification

# Enable debugging for CUDA (if applicable)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Fine-tune distilbert-base-multilingual-cased
model_name = "distilbert-base-multilingual-cased"
print(f"Fine-tuning {model_name}")

# Load pre-trained model for token classification
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_to_id))

# Set up training arguments
training_args = TrainingArguments(
    output_dir=f"./results/{model_name}",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,  # Adjust batch size according to available memory
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir=f'./logs/{model_name}',
    logging_steps=10,
    save_total_limit=2,
    save_strategy="epoch",
    load_best_model_at_end=True
)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,  # Training data
    eval_dataset=tokenized_datasets,   # Use the same data for evaluation in this example
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # Ensure this function is defined to compute F1, accuracy, etc.
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer)
)

# Fine-tune the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()

# Print the evaluation results
print(f"Results for {model_name}: {eval_results}")


Fine-tuning distilbert-base-multilingual-cased


model.safetensors:   0%|          | 0.00/542M [00:00<?, ?B/s]

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Unnamed: 3,Loc,Price,Product,Overall Precision,Overall Recall,Overall F1,Overall Accuracy
1,0.095,0.053059,"{'precision': 0.5734590976267061, 'recall': 0.4203064767912326, 'f1': 0.4850813493887306, 'number': 31389}","{'precision': 0.7399820056751333, 'recall': 0.6012483832874094, 'f1': 0.6634400595681311, 'number': 17783}","{'precision': 0.8723829156735167, 'recall': 0.8418500559492726, 'f1': 0.856844569584612, 'number': 16086}","{'precision': 0.7940365823101979, 'recall': 0.4563980701375387, 'f1': 0.5796332708400018, 'number': 13887}",0.71793,0.552972,0.624746,0.981214


Trainer is attempting to log a value of "{'precision': 0.5734590976267061, 'recall': 0.4203064767912326, 'f1': 0.4850813493887306, 'number': 31389}" of type <class 'dict'> for key "eval/_" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7399820056751333, 'recall': 0.6012483832874094, 'f1': 0.6634400595681311, 'number': 17783}" of type <class 'dict'> for key "eval/loc" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.8723829156735167, 'recall': 0.8418500559492726, 'f1': 0.856844569584612, 'number': 16086}" of type <class 'dict'> for key "eval/price" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7940365823101979, 'recall': 0.45639807013753

In [None]:
# Save the fine-tuned model and tokenizer
output_dir = "./fine_tuned_model_2"  # Set the directory where you'd like to save the model
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model and tokenizer saved to {output_dir}")


In [None]:
import shutil

# Create a backup zip file
shutil.make_archive("fine_tuned_model_2_backup", 'zip', output_dir)


## **mBERT (Multilingual BERT)**

In [None]:
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification, AutoTokenizer

# Load the third pre-trained model and tokenizer (e.g., BERT or another one you're working with)
model_name = "bert-base-multilingual-cased"  # Or whichever model you're using
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_to_id))
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results_bert_multilingual",  # Folder to save results
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs_bert_multilingual",
    logging_steps=10,
    save_total_limit=2,
    load_best_model_at_end=True,
    save_strategy="epoch" # Make sure save_strategy matches evaluation_strategy
)

# Split the dataset into train and validation sets
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True) #Ensure the dataset is tokenized
split_dataset = tokenized_datasets.train_test_split(test_size=0.2) # Split dataset

# Create the Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset['train'], # Access the train split
    eval_dataset=split_dataset['test'], # Access the validation split
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer)
)

# Fine-tune the third model
trainer.train()

# Save the third fine-tuned model and tokenizer
output_dir_third = "./fine_tuned_model_3"  # Set the directory where you'd like to save the model
model.save_pretrained(output_dir_third)
tokenizer.save_pretrained(output_dir_third)

print(f"Third model and tokenizer saved to {output_dir_third}")

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]



Map:   0%|          | 0/47878 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,0.1194,0.069423
2,0.0748,0.051137


In [None]:
import shutil

# Create a backup zip file
shutil.make_archive("fine_tuned_model_3_backup", 'zip', output_dir)
