### Homework 4 - 10 points

In this assignment you will fine-tune a transformer model for the NER task in several settings:

1. Train a NER model

- Load the [Collection5](https://github.com/natasha/corus?tab=readme-ov-file#load_ne5) dataset - **1 point**
- Split the dataset into train/test parts
- Fine-tune [rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) on the train part of the corpus for the NER task, and measure the NER metrics before and after fine-tuning - **2 points**

2. Try to improve model quality in the following ways:
- First fine-tune on the train part in MLM mode, then fine-tune on the NER task - **2 points**
- Generate synthetic labels* for a news corpus you consider suitable**, using a large, capable model for Russian NER***, and then use them together with the main dataset to fine-tune rubert-tiny2 - **2 points**

3. Finally, compare the results of the different approaches - **1 point**

*Run the dataset through a NER model, collect its predictions, and use them as labels

**You can use the lenta-ru dataset you already know; take at least 10_000 texts

***For example, you can take the DeepPavlov ner_collection3_bert model. Launch instructions are in the [documentation](https://docs.deeppavlov.ai/en/master/features/models/NER.html)

**General**

- Decisions are justified (why a particular architecture/hyperparameter/optimizer/transformation, etc. was chosen) - **1 point**
- The solution is reproducible: random_state is fixed, and the notebook runs from start to finish without errors - **1 point**

**Submission format**

- Each homework is a PR into a separate branch **hw_n**, where **n** is the homework number
- Add your mentor and pacifikus as reviewers
- Wait for the review; if everything is fine, merge into main
- If not, make the fixes and submit for review again

In [1]:
import os
import re
import numpy as np
import pandas as pd
import torch

from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

from datasets import Dataset, concatenate_datasets
import evaluate

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    AutoModelForMaskedLM,
    DataCollatorForTokenClassification,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
    pipeline
)

from seqeval.metrics import classification_report, accuracy_score, f1_score

# Load the data and prepare the labels

In [2]:
# with open('Collection5/156.txt', 'r', encoding='utf-8') as file:
#     content = file.read().replace('\n','\n ')
# len(content.split())

device = "cuda:0" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
SEED = 22

In [3]:
def create_bio_markup(text_file, ann_file):
    with open(text_file, 'r', encoding='utf-8') as f:
        text = f.read()
    
    annotations = []
    with open(ann_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = line.split('\t')
            if len(parts) < 3:
                continue
            tag_parts = parts[1].split()
            if len(tag_parts) < 3:
                continue
            tag = tag_parts[0]
            try:
                start = int(tag_parts[1])
                end = int(tag_parts[2])
            except ValueError:
                print(f"Failed to parse span offsets in line: {line}")
                continue
            annotations.append({'tag': tag, 'start': start, 'end': end})
    
    # Whitespace tokenization with character offsets. Each newline is padded with
    # a space, which shifts later offsets by one per line; presumably this keeps
    # the token offsets in step with the offsets used in the .ann files.
    tokens = []
    for match in re.finditer(r'\S+', text.replace('\n','\n ')):
        token = match.group()
        start_offset = match.start()
        end_offset = match.end()
        tokens.append((token, start_offset, end_offset))
    
    bio_labels = ["O"] * len(tokens)
    # A token belongs to an annotation when its midpoint falls inside the span,
    # so punctuation glued to a word does not prevent a match.
    for ann in annotations:
        matched_indices = []
        for i, (token, token_start, token_end) in enumerate(tokens):
            token_mid = (token_start + token_end) // 2
            if ann['start'] <= token_mid < ann['end']:
                matched_indices.append(i)
        if matched_indices:
            bio_labels[matched_indices[0]] = f"B-{ann['tag']}"
            for idx in matched_indices[1:]:
                bio_labels[idx] = f"I-{ann['tag']}"
    
    return tokens, bio_labels

def process_all_documents(directory: str):
    data = []

    files = [
        f for f in os.listdir(directory)
        if f.endswith(".txt") and os.path.splitext(f)[0].isdigit()
    ]
    files = sorted(files, key=lambda x: int(os.path.splitext(x)[0]))

    for file in tqdm(files, desc="Processing documents"):
        text_file = os.path.join(directory, file)
        ann_file = text_file.replace(".txt", ".ann")
        
        if os.path.exists(ann_file):
            tokens, bio_labels = create_bio_markup(text_file, ann_file)
            tokens = [token[0] for token in tokens]
            data.append((tokens, bio_labels))
        else:
            print(f"Warning: annotation file not found for {file}")
    
    return data

data = process_all_documents('Collection5')

Processing documents: 100%|██████████████████| 816/816 [00:01<00:00, 513.61it/s]
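
The midpoint rule used in `create_bio_markup` can be tried on a toy sentence. This is a minimal sketch, not the notebook's function: `bio_from_spans` and the character spans are hypothetical, and the input is a plain string rather than a Collection5 file pair.

```python
import re

def bio_from_spans(text, spans):
    """Label whitespace tokens with BIO tags from (tag, start, end) character spans."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(tokens)
    for tag, start, end in spans:
        # A token is assigned to the span when its midpoint lies inside it.
        inside = [i for i, (_, s, e) in enumerate(tokens) if start <= (s + e) // 2 < end]
        for rank, i in enumerate(inside):
            labels[i] = ("B-" if rank == 0 else "I-") + tag
    return [t for t, _, _ in tokens], labels

text = "Владимир Путин прибыл в Ташкент."
spans = [("PER", 0, 14), ("LOC", 24, 31)]  # hypothetical annotations
tokens, labels = bio_from_spans(text, spans)
print(list(zip(tokens, labels)))
```

The last token keeps its trailing period, yet it still receives `B-LOC` because its midpoint falls inside the annotated span.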


In [4]:
def prepare_dataset(data):
    all_words = []
    all_labels = []

    for tokens, tags in data:
        words = [str(t) for t in tokens]
        labels = [str(l) for l in tags]
        
        all_words.append(words)
        all_labels.append(labels)

    return {"tokens": all_words, "ner_tags": all_labels}


dataset = prepare_dataset(data)
dataset.keys()

dict_keys(['tokens', 'ner_tags'])

In [5]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
train_dataset = Dataset.from_dict(prepare_dataset(train_data))
test_dataset = Dataset.from_dict(prepare_dataset(test_data))

# Fine-tuning rubert-tiny2

In [6]:
model_id = "cointegrated/rubert-tiny2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
label_list = sorted(list(set(label for ex in dataset['ner_tags'] for label in ex)))

label2id = {
    'O': 0,
    'B-GEOPOLIT': 1,
    'I-GEOPOLIT': 2,
    'B-MEDIA': 3,
    'I-MEDIA': 4,
    'B-LOC': 5,
    'I-LOC': 6,
    'B-ORG': 7,
    'I-ORG': 8,
    'B-PER': 9,
    'I-PER': 10,
}

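id2label = {i: label for label, i in label2id.items()}
# Size the classification head from the full label map, so it matches label2id
# even if some tag never occurs in the corpus.
num_labels = len(label2id)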



In [7]:
len(id2label.keys()), len(label2id.keys())

(11, 11)

In [8]:
# Create a label for every subword token

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        add_special_tokens=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label2id[label[word_idx]])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs



train_dataset_tok = train_dataset.map(tokenize_and_align_labels, batched=True)
test_dataset_tok = test_dataset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/652 [00:00<?, ? examples/s]

Map:   0%|          | 0/164 [00:00<?, ? examples/s]
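
The alignment rule in `tokenize_and_align_labels` (label the first subword of each word, mask everything else with -100) can be checked without a tokenizer. In this sketch the `word_ids` list is written by hand and is only an assumption about how a word might be split; `align_labels` and `label2id_demo` are illustrative names.

```python
def align_labels(word_labels, word_ids, label2id, ignore_index=-100):
    """Expand word-level labels to subword positions, labeling first subwords only."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)  # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first subword of a word
        else:
            aligned.append(ignore_index)  # continuation subword
        previous = word_id
    return aligned

label2id_demo = {"O": 0, "B-PER": 9, "I-PER": 10}
# Hypothetical split: [CLS] Вла ##димир Путин прибыл [SEP]
word_ids = [None, 0, 0, 1, 2, None]
aligned = align_labels(["B-PER", "I-PER", "O"], word_ids, label2id_demo)
print(aligned)
```

Positions marked -100 are skipped by the cross-entropy loss and are filtered out again in `compute_metrics`.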

In [11]:
def convert_tags_to_ids(dataset, label2id):
    # Map every word-level tag string to its integer id, one list per document.
    return [[label2id[label] for label in example["ner_tags"]] for example in dataset]

train_num_ner_tags = convert_tags_to_ids(train_dataset, label2id)
test_num_ner_tags = convert_tags_to_ids(test_dataset, label2id)


# Flatten the training tags; they are used to compute the class weights.
all_labels = []
for example in train_num_ner_tags:
    all_labels.extend(example)

In [12]:
# Sorted ids, so that the weight vector lines up with the logit indices.
classes = np.array(sorted(label2id.values()))

In [13]:
# Define a weighted-loss Trainer and a custom metric for it



seqeval = evaluate.load("seqeval")
loss_weights = compute_class_weight(class_weight="balanced", classes=classes, y=all_labels)


class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Class-weighted cross-entropy; positions labeled -100 are ignored by default.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")

        weights = torch.tensor(loss_weights, dtype=torch.float, device=logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(
            logits.view(-1, self.model.config.num_labels),
            labels.to(logits.device).view(-1),
        )
        return (loss, outputs) if return_outputs else loss

def compute_metrics(p):
    predictions, labels = p.predictions, p.label_ids
    preds = np.argmax(predictions, axis=2)
    # Tag names indexed by label id.
    _TAGS = [id2label[i] for i in range(len(id2label))]

    # Keep only positions that carry a real label (first subword of each word).
    true_labels = [
        [_TAGS[l] for l in label if l != -100]
        for label in labels
    ]
    true_preds = [
        [_TAGS[q] for (q, l) in zip(pred, label) if l != -100]
        for pred, label in zip(preds, labels)
    ]

    results = seqeval.compute(predictions=true_preds, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
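
For intuition, the `"balanced"` heuristic behind `compute_class_weight` assigns each class the weight n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python on a toy label list; `balanced_class_weights` is an illustrative helper, not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count), as in the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Toy distribution: class 0 plays the role of the dominant "O" tag.
toy_labels = [0] * 8 + [1] + [2]
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare entity tags get a much larger weight than `O`, which pushes the model toward predicting entities more often. This may be one reason recall ends up well above precision in the runs in this notebook.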


In [14]:
model = AutoModelForTokenClassification.from_pretrained(
    model_id, 
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
).to(device)

training_args = TrainingArguments(
    "exp1-ner-rubert-tiny2",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=12,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_tok,
    eval_dataset=test_dataset_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at cointegrated/rubert-tiny2 were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at cointegrated/rubert-tiny2 and are new

In [15]:
ts_ds=test_dataset_tok.remove_columns(["tokens", "ner_tags"])

outputs = trainer.predict(ts_ds)
print("== Before fine-tuning ==")
print(compute_metrics(outputs))

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


== Before fine-tuning ==
{'precision': 0.017217514601127578, 'recall': 0.11556764106050306, 'f1': 0.02997002997002997, 'accuracy': 0.052194382812922775}


In [16]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 2.3177 | 2.106775 | 0.055223 | 0.340811 | 0.095046 | 0.225932 |
| 2 | 2.0384 | 1.812135 | 0.107792 | 0.479946 | 0.176045 | 0.470669 |
| 3 | 1.805 | 1.550176 | 0.192437 | 0.585769 | 0.289701 | 0.660236 |
| 4 | 1.5904 | 1.327061 | 0.265702 | 0.661455 | 0.379116 | 0.743168 |
| 5 | 1.3983 | 1.144892 | 0.320808 | 0.71584 | 0.443058 | 0.789464 |
| 6 | 1.2237 | 1.003715 | 0.351346 | 0.74847 | 0.478211 | 0.808079 |
| 7 | 1.1126 | 0.896132 | 0.379584 | 0.781101 | 0.510894 | 0.82196 |
| 8 | 0.9934 | 0.8187 | 0.393566 | 0.795604 | 0.526624 | 0.830077 |
| 9 | 0.9174 | 0.763236 | 0.407091 | 0.8092 | 0.541676 | 0.835705 |
| 10 | 0.8201 | 0.727122 | 0.413321 | 0.814185 | 0.548298 | 0.839142 |


TrainOutput(global_step=132, training_loss=1.2730676459543633, metrics={'train_runtime': 63.9156, 'train_samples_per_second': 122.411, 'train_steps_per_second': 2.065, 'total_flos': 132073898319072.0, 'train_loss': 1.2730676459543633, 'epoch': 12.0})

In [17]:
ts_ds=test_dataset_tok.remove_columns(["tokens", "ner_tags"])

outputs = trainer.predict(ts_ds)
print("== After fine-tuning ==")
print(compute_metrics(outputs))

== After fine-tuning ==
{'precision': 0.4178376495180583, 'recall': 0.8153183775209608, 'f1': 0.5525184275184275, 'accuracy': 0.841658098381947}
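
Precision, recall and F1 here are computed over whole entities, while accuracy is computed per token, which helps explain how accuracy can reach 0.84 while precision stays near 0.42. The sketch below is a simplified illustration of entity-level scoring on a toy pair of sequences; it is not seqeval's implementation, and `extract_spans` and `span_prf` are illustrative helpers.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        if tag.startswith("I-") and kind == tag[2:]:
            continue  # the current span continues
        if kind is not None:
            spans.add((kind, start, i))
            start, kind = None, None
        if tag != "O":
            start, kind = i, tag[2:]
    return spans

def span_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 with exact span matching."""
    gold, pred = extract_spans(true_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "B-LOC", "O"]
p, r, f = span_prf(gold, pred)
token_accuracy = sum(g == q for g, q in zip(gold, pred)) / len(gold)
print(p, r, f, token_accuracy)
```

A single spurious entity leaves five of six tokens correct but lowers entity-level precision to two thirds.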


# Try to improve model quality in the following ways:
- First fine-tune on the train part in MLM mode, then fine-tune on the NER task - 2 points
- Generate synthetic labels* for a news corpus you consider suitable**, using a large, capable model for Russian NER***, and then use them together with the main dataset to fine-tune rubert-tiny2 - 2 points

In [18]:
# Rebuild plain text from the word lists for MLM training.
train_full_text = [" ".join(example['tokens']) for example in train_dataset]
test_full_text = [" ".join(example['tokens']) for example in test_dataset]

train_mlm_dataset = Dataset.from_dict({"text": train_full_text})
test_mlm_dataset = Dataset.from_dict({"text": test_full_text})

In [19]:
def tokenize_mlm(example):
    return tokenizer(example['text'], return_special_tokens_mask=True)

train_tokenized_mlm_dataset = train_mlm_dataset.map(tokenize_mlm, batched=True, remove_columns=["text"])
test_tokenized_mlm_dataset = test_mlm_dataset.map(tokenize_mlm, batched=True, remove_columns=["text"])

Map:   0%|          | 0/652 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2308 > 2048). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/164 [00:00<?, ? examples/s]

In [20]:
block_size = 256

def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

train_mlm_dataset = train_tokenized_mlm_dataset.map(group_texts, batched=True)
test_mlm_dataset = test_tokenized_mlm_dataset.map(group_texts, batched=True)

Map:   0%|          | 0/652 [00:00<?, ? examples/s]

Map:   0%|          | 0/164 [00:00<?, ? examples/s]
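
What `group_texts` does to each batch can be seen on small integer lists: the token sequences are concatenated and cut into equal blocks. This is a simplified sketch with a hypothetical helper `chunk_into_blocks`; unlike `group_texts`, it works on a single list of sequences and always drops the incomplete tail.

```python
def chunk_into_blocks(sequences, block_size):
    """Concatenate token sequences and cut them into full blocks of block_size."""
    flat = [tok for seq in sequences for tok in seq]
    usable = (len(flat) // block_size) * block_size  # drop the incomplete tail
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

blocks = chunk_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], block_size=4)
print(blocks)
```

Document boundaries are not preserved, which is usually acceptable for masked language modeling, where only local context matters.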

In [21]:
from transformers import DataCollatorForLanguageModeling, AutoModelForMaskedLM, Trainer, TrainingArguments

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
model_mlm = AutoModelForMaskedLM.from_pretrained(model_id)

training_args_mlm = TrainingArguments(
    output_dir="./mlm-rubert-tiny2",
    evaluation_strategy="epoch", 
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    seed=SEED
)

trainer_mlm = Trainer(
    model=model_mlm,
    args=training_args_mlm,
    train_dataset=train_mlm_dataset,
    eval_dataset=test_mlm_dataset,
    data_collator=data_collator,
)

trainer_mlm.train()

Some weights of the model checkpoint at cointegrated/rubert-tiny2 were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 3.093219 |
| 2 | No log | 3.11269 |
| 3 | No log | 3.034633 |
| 4 | No log | 3.034324 |
| 5 | No log | 2.983469 |
| 6 | No log | 2.952465 |
| 7 | No log | 2.969285 |
| 8 | No log | 2.995662 |
| 9 | No log | 3.046781 |
| 10 | No log | 2.978892 |




TrainOutput(global_step=90, training_loss=3.2770406087239583, metrics={'train_runtime': 79.7551, 'train_samples_per_second': 97.298, 'train_steps_per_second': 1.128, 'total_flos': 29611075338240.0, 'train_loss': 3.2770406087239583, 'epoch': 10.0})

In [22]:
eval_result = trainer_mlm.evaluate()
print(f"Perplexity: {np.exp(eval_result['eval_loss']):.2f}")



Perplexity: 21.20
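
Perplexity is the exponential of the mean cross-entropy, so it can be related directly to the validation losses logged during MLM training. The sketch below applies the formula to the lowest logged validation loss. The printed 21.20 corresponds to a loss of about 3.054, slightly higher than any logged epoch, most likely because the MLM collator draws a new random mask on every evaluation pass.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)

best_eval_loss = 2.952465  # lowest validation loss logged during MLM training (epoch 6)
ppl = perplexity(best_eval_loss)
implied_loss = math.log(21.20)  # loss implied by the printed perplexity
print(f"{ppl:.2f} {implied_loss:.3f}")
```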


### After MLM, train the model on the NER task

In [25]:
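# NOTE: the checkpoint path below is specific to the original run; point it at a
# checkpoint that exists locally (trainer_mlm.state.best_model_checkpoint holds the best one).
model = AutoModelForTokenClassification.from_pretrained(
    "mlm-rubert-tiny2/checkpoint-40", 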
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
).to(device)

training_args = TrainingArguments(
    "exp2-ner-mlm-rubert-tiny2",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=12,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_tok,
    eval_dataset=test_dataset_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at mlm-rubert-tiny2/checkpoint-40 were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at mlm-rubert-tiny2/checkpoint-40 and are newly initialized: ['classifier.weight', 'classifier.

In [26]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 2.2698 | 2.037245 | 0.060122 | 0.384999 | 0.104003 | 0.210212 |
| 2 | 1.9927 | 1.746221 | 0.124706 | 0.540449 | 0.202651 | 0.498864 |
| 3 | 1.7591 | 1.496771 | 0.204677 | 0.626785 | 0.308585 | 0.675713 |
| 4 | 1.549 | 1.28053 | 0.28004 | 0.693406 | 0.398957 | 0.756399 |
| 5 | 1.3542 | 1.105619 | 0.332063 | 0.730569 | 0.456593 | 0.795227 |
| 6 | 1.2109 | 0.968489 | 0.356582 | 0.76116 | 0.48565 | 0.811732 |
| 7 | 1.0667 | 0.86336 | 0.383687 | 0.785633 | 0.515577 | 0.826019 |
| 8 | 0.9706 | 0.785594 | 0.411417 | 0.805121 | 0.544563 | 0.839737 |
| 9 | 0.8854 | 0.732605 | 0.417996 | 0.813732 | 0.552292 | 0.842632 |
| 10 | 0.7931 | 0.697218 | 0.428436 | 0.818717 | 0.56251 | 0.848287 |


TrainOutput(global_step=132, training_loss=1.2363191400513505, metrics={'train_runtime': 59.5053, 'train_samples_per_second': 131.484, 'train_steps_per_second': 2.218, 'total_flos': 130875234668400.0, 'train_loss': 1.2363191400513505, 'epoch': 12.0})

#### Let's generate synthetic data from lenta-ru and fine-tune the model again with the same pipeline (MLM + NER)

In [27]:
from deeppavlov import build_model

ner_model = build_model('ner_collection3_bert', download=True, install=True)

Ignoring transformers: markers 'python_version < "3.8"' don't match your environment


2025-04-14 09:21:04.92 INFO in 'deeppavlov.download'['download'] at line 138: Skipped http://files.deeppavlov.ai/v1/ner/ner_rus_bert_coll3_torch.tar.gz download because of matching hashes
Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you e

In [28]:
preds = ner_model(["Президент России Владимир Путин прибыл в Ташкент."])
print(preds)

[[['Президент', 'России', 'Владимир', 'Путин', 'прибыл', 'в', 'Ташкент', '.']], [['O', 'S-LOC', 'B-PER', 'E-PER', 'O', 'O', 'S-LOC', 'O']]]


In [29]:
df = pd.read_csv("cleaned-lenta-ru-news-100k.csv")
full_text = df['full_text'][:10000].tolist()

def prep_text(text:str):
    return text.replace('\xa0', ' ')

full_text = list(map(prep_text, full_text))


def split_text(text, max_tokens=512):
    if isinstance(text, list):
        text = " ".join(text)
    tokens = tokenizer.tokenize(text)
    chunks = []
    current_chunk = []
    for token in tqdm(tokens):
        current_chunk.append(token)
        if len(current_chunk) >= max_tokens:
            chunks.append(tokenizer.convert_tokens_to_string(current_chunk))
            current_chunk = []
    if current_chunk:
        chunks.append(tokenizer.convert_tokens_to_string(current_chunk))
    return chunks

In [30]:
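chunks = split_text(full_text)

all_preds = []
for chunk in tqdm(chunks):
    try:
        preds = ner_model([chunk])
    except Exception:
        # Skip chunks the teacher model fails on (presumably over-long inputs);
        # a bare `except:` would also swallow KeyboardInterrupt.
        continue
    all_preds.append(preds)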

100%|█████████████████████████████| 2689016/2689016 [00:02<00:00, 969133.23it/s]
100%|███████████████████████████████████████| 5252/5252 [03:03<00:00, 28.70it/s]
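
`split_text` joins all articles into one string and cuts it every 512 wordpieces, so a chunk can straddle two articles, and the text is rebuilt from wordpieces, which can alter it (for example, out-of-vocabulary symbols become `[UNK]`). One possible alternative, shown here only as a sketch, is to chunk each article separately on whitespace; `chunk_document` and the value of `max_words` are hypothetical and would need tuning against the teacher model's input limit.

```python
def chunk_document(text, max_words=200):
    """Split one document into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

docs = ["a b c d e", "f g"]  # toy stand-ins for news articles
doc_chunks = [chunk for doc in docs for chunk in chunk_document(doc, max_words=2)]
print(doc_chunks)
```

With this scheme every chunk belongs to exactly one article and contains the original surface text.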


In [31]:
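def replace_tag(tags: list):
    # Convert BIOES tags from the teacher model to BIO: E- becomes I-, S- becomes B-.
    # Only the prefix is rewritten, so entity names are never altered.
    prefix_map = {'E': 'I', 'S': 'B'}
    new_tag = []
    for tag in tags:
        prefix, sep, entity = tag.partition('-')
        new_tag.append(prefix_map.get(prefix, prefix) + sep + entity)
    return new_tag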

add_dataset = {
    'tokens': [],
    'ner_tags': []
}

for i in all_preds:
    ner_tag = replace_tag(i[1][0])
    add_dataset['tokens'].append(i[0][0])
    add_dataset['ner_tags'].append(ner_tag)

### Prepare the synthetic data for MLM training

In [32]:
from sklearn.model_selection import train_test_split

train_data_tokens, test_data_tokens, train_data_tags, test_data_tags = train_test_split(
    add_dataset["tokens"], add_dataset["ner_tags"], test_size=0.2, random_state=42
)

train_add_dataset = {"tokens": train_data_tokens, "ner_tags": train_data_tags}
test_add_data_dataset = {"tokens": test_data_tokens, "ner_tags": test_data_tags}

In [33]:
from datasets import Dataset

train_add_dataset = Dataset.from_dict(train_add_dataset)
test_add_dataset = Dataset.from_dict(test_add_data_dataset)

In [34]:
train_add_dataset_tok = train_add_dataset.map(tokenize_and_align_labels, batched=True)
test_add_dataset_tok = test_add_dataset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/3804 [00:00<?, ? examples/s]

Map:   0%|          | 0/951 [00:00<?, ? examples/s]

In [35]:
set(train_add_dataset_tok['labels'][0])

{-100, 0, 5, 6, 7, 8, 9, 10}
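
The ids printed above (0 and 5 to 10) map to `O`, `LOC`, `ORG` and `PER` in `label2id`. The teacher model appears to emit only these three entity types (in the sample prediction earlier, "России" is tagged `S-LOC`), so the synthetic data presumably contains no `GEOPOLIT` or `MEDIA` examples. A quick coverage check could look like the sketch below; `tag_type_counts` is an illustrative helper and the `synthetic` list is toy data, not the real predictions.

```python
from collections import Counter

def tag_type_counts(tag_sequences):
    """Count entity types (without B-/I- prefixes) over a list of tag sequences."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag != "O":
                counts[tag.split("-", 1)[1]] += 1
    return counts

gold_types = {"GEOPOLIT", "MEDIA", "LOC", "ORG", "PER"}  # entity types in label2id
synthetic = [["O", "B-LOC", "B-PER", "I-PER"], ["B-ORG", "O"]]  # toy stand-in
missing = gold_types - set(tag_type_counts(synthetic))
print(sorted(missing))
```

In the notebook, the same check could be run on `add_dataset['ner_tags']` to see which entity types the synthetic corpus actually covers.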

In [36]:
train_dataset_tok

Dataset({
    features: ['tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 652
})

In [37]:
train_merged_dataset = concatenate_datasets([train_add_dataset_tok, train_dataset_tok])
test_merged_dataset = concatenate_datasets([test_add_dataset_tok, test_dataset_tok])

## MLM training

In [38]:
# Rebuild plain text from the word lists of the merged corpus.
train_full_text = [" ".join(example['tokens']) for example in train_merged_dataset]
test_full_text = [" ".join(example['tokens']) for example in test_merged_dataset]

train_mlm_dataset = Dataset.from_dict({"text": train_full_text})
test_mlm_dataset = Dataset.from_dict({"text": test_full_text})

In [39]:
def tokenize_mlm(example):
    return tokenizer(example['text'], return_special_tokens_mask=True)

train_tokenized_mlm_dataset = train_mlm_dataset.map(tokenize_mlm, batched=True, remove_columns=["text"])
test_tokenized_mlm_dataset = test_mlm_dataset.map(tokenize_mlm, batched=True, remove_columns=["text"])

Map:   0%|          | 0/4456 [00:00<?, ? examples/s]

Map:   0%|          | 0/1115 [00:00<?, ? examples/s]

In [40]:
block_size = 256

def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

train_mlm_dataset = train_tokenized_mlm_dataset.map(group_texts, batched=True)
test_mlm_dataset = test_tokenized_mlm_dataset.map(group_texts, batched=True)

Map:   0%|          | 0/4456 [00:00<?, ? examples/s]

Map:   0%|          | 0/1115 [00:00<?, ? examples/s]

In [42]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
model_mlm = AutoModelForMaskedLM.from_pretrained(model_id)

training_args_mlm = TrainingArguments(
    output_dir="./exp3-mlm-rubert-tiny2",
    evaluation_strategy="epoch", 
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    seed=SEED
)

trainer_mlm = Trainer(
    model=model_mlm,
    args=training_args_mlm,
    train_dataset=train_mlm_dataset,
    eval_dataset=test_mlm_dataset,
    data_collator=data_collator,
)

trainer_mlm.train()

In [None]:
eval_result = trainer_mlm.evaluate()
print(f"Perplexity: {np.exp(eval_result['eval_loss']):.2f}")

In [None]:
ls exp3-mlm-rubert-tiny2/

## Moving on to NER fine-tuning

In [None]:
# NOTE: checkpoint path from the original run; adjust it to a checkpoint that exists locally.
model = AutoModelForTokenClassification.from_pretrained(
    "exp3-mlm-rubert-tiny2/checkpoint-3540/", 
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
).to(device)

training_args = TrainingArguments(
    "exp3-ner-rubert-tiny2",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=12,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED,
    # gradient_accumulation_steps=4,
    # fp16=True
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_merged_dataset,
    eval_dataset=test_merged_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

# Let's look at the result

In [None]:
model_path = "exp3-ner-rubert-tiny2/checkpoint-2232"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(
    model_path, 
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
)

In [None]:
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

text = "Генеральный директор Сбербанка Герман Греф на конференции в Москве заявил, что сотрудничество с Яндексом в области искусственного интеллекта выходит на новый уровень. Он также отметил, что правительство Российской Федерации поддерживает развитие цифровой экономики, особенно в рамках Евразийского экономического союза."
results = ner_pipeline(text)

for entity in results:
    print(entity)

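# Comparison
1) Fine-tuned on the small corpus (NER only):
- precision: 0.417838
- recall: 0.815318
- f1: 0.552518
- accuracy: 0.841658
2) Trained on the small corpus (MLM, then NER):
- precision: 0.435142
- recall: 0.823249
- f1: 0.569346
- accuracy: 0.850425
3) Trained on the much larger corpus with synthetic labels (MLM, then NER):
- precision: 0.793
- recall: 0.914
- f1: 0.849
- accuracy: 0.972

The third approach scored far higher than the other two. Note, however, that its trainer is configured with `eval_dataset=test_merged_dataset`, which also contains lenta-ru chunks labeled by the teacher model, while the first two approaches are evaluated on the Collection5 test split only. If the figures for the third approach come from that evaluation, they are not directly comparable with the first two.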
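
As a sanity check, each reported F1 should equal the harmonic mean of the reported precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the figures listed above; `harmonic_f1` and the dictionary keys are illustrative names.

```python
def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as listed in the comparison.
reported = {
    "NER only": (0.417838, 0.815318, 0.552518),
    "MLM + NER": (0.435142, 0.823249, 0.569346),
    "MLM + NER, synthetic data": (0.793, 0.914, 0.849),
}

recomputed = {name: harmonic_f1(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(f"{name}: reported {f1:.3f}, recomputed {recomputed[name]:.3f}")
```

All three reported values agree with the recomputed ones to three decimal places.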