# Task 3: NER Model Training

This notebook covers:
- Loading the CoNLL-formatted training data
- Fine-tuning three multilingual transformer models for NER (XLM-RoBERTa, DistilBERT, mBERT)
- Checking the saved checkpoints

In [1]:
import sys
sys.path.append('..')

from pathlib import Path
from src.training.ner_trainer import NERTrainer
from config.config import model_config

In [2]:
# Load CoNLL data
with open("../data/labeled/train_data.conll", 'r', encoding='utf-8') as f:
    conll_data = f.read()

print(f"Loaded CoNLL data: {len(conll_data)} characters")
print(f"Models to train: {model_config.model_names}")

Loaded CoNLL data: 17223 characters
Models to train: ['xlm-roberta-base', 'distilbert-base-multilingual-cased', 'bert-base-multilingual-cased']
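
For orientation, the sketch below shows one way a CoNLL string such as `conll_data` could be split into sentences. It assumes the common layout of one whitespace-separated `token label` pair per line with a blank line between sentences. The helper `parse_conll` and the sample labels are hypothetical, and this is not necessarily how `NERTrainer` reads the data.

```python
def parse_conll(text):
    """Split CoNLL text into sentences of (token, label) pairs.

    Assumption: one "token label" pair per line, blank line between sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) < 2:
            # Skip malformed lines that carry no label column.
            continue
        # First column is the token, last column is the label.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample; the labels here are illustrative only.
sample = "Addis B-LOC\nAbaba I-LOC\nis O\n\nprice O\n500 B-PRICE\n"
sentences = parse_conll(sample)
print(len(sentences), sentences[0])
```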


## Train XLM-RoBERTa

In [3]:
# Train XLM-RoBERTa
model_name = 'xlm-roberta-base'
trainer = NERTrainer(model_name)
output_dir = f"../models/checkpoints/{model_name.replace('/', '_')}"

Path(output_dir).mkdir(parents=True, exist_ok=True)
trainer.train(conll_data, output_dir)

print(f"Training completed for {model_name}")

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 1.58069 | 0.0 | 0.0 | 0.0 |
| 2 | No log | 1.577596 | 0.0 | 0.0 | 0.0 |
| 3 | No log | 1.57225 | 0.0 | 0.0 | 0.0 |




Training completed for xlm-roberta-base
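
The `No log` entries in the Training Loss column are what the Hugging Face `Trainer` typically shows when no loss has been logged before an evaluation runs, which happens when the run has fewer optimizer steps than the logging interval. The arithmetic below is a sketch under assumed settings: a batch size of 8 and a logging interval of 500 are the library defaults for `per_device_train_batch_size` and `logging_steps`, but `NERTrainer` may override them. The 40 training examples and 3 epochs are taken from the output above; `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps for a run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Assumed settings: batch size 8 and logging every 500 steps (library defaults).
steps = total_optimizer_steps(num_examples=40, batch_size=8, epochs=3)
logging_steps = 500

# True means the run ends before the first training-loss log would be written.
print(steps, steps < logging_steps)
```

If that is the cause here, a smaller logging interval (for example `logging_steps=1` in `TrainingArguments`) would make the training loss appear in the table.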


## Train DistilBERT

In [4]:
# Train DistilBERT
model_name = 'distilbert-base-multilingual-cased'
trainer = NERTrainer(model_name)
output_dir = f"../models/checkpoints/{model_name.replace('/', '_')}"

Path(output_dir).mkdir(parents=True, exist_ok=True)
trainer.train(conll_data, output_dir)

print(f"Training completed for {model_name}")

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 2.004856 | 0.005618 | 0.181818 | 0.010899 |
| 2 | No log | 1.996393 | 0.005618 | 0.181818 | 0.010899 |
| 3 | No log | 1.981733 | 0.005714 | 0.181818 | 0.01108 |




Training completed for distilbert-base-multilingual-cased
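
The F1 column is the harmonic mean of precision and recall, which is why a recall of about 0.18 still yields an F1 near 0.01 when precision is this low. The snippet below reproduces the epoch 1 DistilBERT figure from the reported precision and recall; `f1_score` is a hypothetical helper name, not a function from the project.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Epoch 1 values reported for distilbert-base-multilingual-cased.
distilbert_f1 = f1_score(0.005618, 0.181818)
print(round(distilbert_f1, 6))

# The all-zero XLM-RoBERTa rows fall into the guarded branch.
print(f1_score(0.0, 0.0))
```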


## Train mBERT

In [5]:
# Train mBERT
model_name = 'bert-base-multilingual-cased'
trainer = NERTrainer(model_name)
output_dir = f"../models/checkpoints/{model_name.replace('/', '_')}"

Path(output_dir).mkdir(parents=True, exist_ok=True)
trainer.train(conll_data, output_dir)

print(f"Training completed for {model_name}")

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 2.18401 | 0.012085 | 0.363636 | 0.023392 |
| 2 | No log | 2.154725 | 0.012158 | 0.363636 | 0.023529 |
| 3 | No log | 2.10432 | 0.012158 | 0.363636 | 0.023529 |




Training completed for bert-base-multilingual-cased
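
The three training cells above differ only in `model_name`, so the same steps could be driven by a single loop over `model_config.model_names`. The sketch below relies only on what those cells show: a trainer is built from a model name and exposes `train(conll_data, output_dir)`. The function `train_all` and the stand-in `DummyTrainer` are hypothetical and exist so the block runs without the project code.

```python
import tempfile
from pathlib import Path


def train_all(model_names, conll_data, trainer_factory, checkpoints_root):
    """Train one model per name and return a mapping of name -> output directory."""
    output_dirs = {}
    for model_name in model_names:
        # Same directory naming as the cells above: '/' is replaced by '_'.
        output_dir = Path(checkpoints_root) / model_name.replace("/", "_")
        output_dir.mkdir(parents=True, exist_ok=True)
        trainer_factory(model_name).train(conll_data, str(output_dir))
        output_dirs[model_name] = output_dir
    return output_dirs


class DummyTrainer:
    """Stand-in for NERTrainer so the sketch is runnable on its own."""

    def __init__(self, model_name):
        self.model_name = model_name

    def train(self, conll_data, output_dir):
        # Write a marker file instead of running real training.
        (Path(output_dir) / "trained.txt").write_text(self.model_name)


with tempfile.TemporaryDirectory() as root:
    dirs = train_all(["org/model-a", "model-b"], "tok O\n", DummyTrainer, root)
    created = sorted(p.name for p in Path(root).iterdir())
    print(created)
```

In this notebook the call would look like `train_all(model_config.model_names, conll_data, NERTrainer, "../models/checkpoints")`.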


## Training Summary

In [6]:
# Check trained models
models_dir = Path("../models/checkpoints")
# Sort the names so the listing order is deterministic (iterdir() order is arbitrary)
trained_models = sorted(d.name for d in models_dir.iterdir() if d.is_dir())

print("Trained Models:")
for model in trained_models:
    model_path = models_dir / model
    files = list(model_path.glob('*'))
    print(f"  {model}: {len(files)} files")

print("\nAll model training completed!")

Trained Models:
  bert-base-multilingual-cased: 11 files
  distilbert-base-multilingual-cased: 11 files
  xlm-roberta-base: 10 files

All model training completed!
