### Homework 4 - 10 points

In this assignment you will fine-tune a transformer model for the NER task in several setups:

1. Train a NER model

- Load the [Collection5](https://github.com/natasha/corus?tab=readme-ov-file#load_ne5) dataset - **1 point**
- Split the dataset into train and test parts
- Fine-tune [rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) on the train part of the corpus for the NER task, and measure NER metrics before and after fine-tuning - **2 points**

2. Try to improve model quality in the following ways:
- First fine-tune on the train part in MLM mode, then fine-tune on the NER task - **2 points**
- Generate synthetic labels* for a news corpus you consider suitable**, using a large, strong model for Russian NER***, and then use them together with the main dataset to fine-tune rubert-tiny2 - **2 points**

3. Finally, compare the results of the different approaches - **1 point**

*Run the dataset through a NER model, collect its predictions, and use them as labels

**You can use the lenta-ru dataset you already know; it is best to take at least 10_000 texts

***For example, you can take the DeepPavlov ner_collection3_bert model. Launch instructions are in the [documentation](https://docs.deeppavlov.ai/en/master/features/models/NER.html)

**General**

- The decisions made are justified (why a particular architecture/hyperparameter/optimizer/transformation, etc. was chosen) - **1 point**
- The solution is reproducible: random_state is fixed and the notebook runs from start to finish without errors - **1 point**

**Submission format**

- Each homework is a PR into a separate branch **hw_n**, where **n** is the homework number
- Add your mentor and pacifikus as reviewers
- Wait for the review; if everything is fine, merge into main
- If not, make the fixes and send it for review again

In [97]:
import os
import re
import numpy as np
import pandas as pd
import torch

from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

from datasets import Dataset, concatenate_datasets
import evaluate

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    AutoModelForMaskedLM,
    DataCollatorForTokenClassification,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
    pipeline
)

from seqeval.metrics import classification_report, accuracy_score, f1_score

# Load the data and prepare the markup

In [72]:
# with open('Collection5/156.txt', 'r', encoding='utf-8') as file:
#     content = file.read().replace('\n','\n ')
# len(content.split())

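device = "cuda:0" if torch.cuda.is_available() else "cpu"
SEED = 22

# Fix the random state up front so that the freshly initialised classification head is reproducible too
np.random.seed(SEED)
torch.manual_seed(SEED)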

In [73]:
def create_bio_markup(text_file, ann_file):
    with open(text_file, 'r', encoding='utf-8') as f:
        text = f.read()
    
    annotations = []
    with open(ann_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = line.split('\t')
            if len(parts) < 3:
                continue
            tag_parts = parts[1].split()
            if len(tag_parts) < 3:
                continue
            tag = tag_parts[0]
            try:
                start = int(tag_parts[1])
                end = int(tag_parts[2])
            except ValueError:
                print(f"Failed to parse span indices in line: {line}")
                continue
            annotations.append({'tag': tag, 'start': start, 'end': end})
    
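    tokens = []
    # NOTE: replacing '\n' with '\n ' shifts every later offset by one character per line;
    # presumably this compensates for line endings that the .ann offsets count as two characters (assumption)
    for match in re.finditer(r'\S+', text.replace('\n','\n ')):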
        token = match.group()
        start_offset = match.start()
        end_offset = match.end()
        tokens.append((token, start_offset, end_offset))
    
    bio_labels = ["O"] * len(tokens)    
    # for ann in tqdm(annotations, desc="BIO markup"):
    for ann in annotations:
        matched_indices = []
        for i, (token, token_start, token_end) in enumerate(tokens):
            token_mid = (token_start + token_end) // 2
            if ann['start'] <= token_mid < ann['end']:
                matched_indices.append(i)
        if matched_indices:
            bio_labels[matched_indices[0]] = f"B-{ann['tag']}"
            for idx in matched_indices[1:]:
                bio_labels[idx] = f"I-{ann['tag']}"
    
    return tokens, bio_labels

def process_all_documents(directory: str):
    data = []

    files = [
        f for f in os.listdir(directory)
        if f.endswith(".txt") and os.path.splitext(f)[0].isdigit()
    ]
    files = sorted(files, key=lambda x: int(os.path.splitext(x)[0]))

    for file in tqdm(files, desc="Processing documents"):
        text_file = os.path.join(directory, file)
        ann_file = text_file.replace(".txt", ".ann")
        
        if os.path.exists(ann_file):
            tokens, bio_labels = create_bio_markup(text_file, ann_file)
            tokens = [token[0] for token in tokens]
            data.append((tokens, bio_labels))
        else:
            print(f"Warning: no annotation file found for {file}")
    
    return data

data = process_all_documents('Collection5')

Processing documents: 100%|██████████████████| 816/816 [00:01<00:00, 530.51it/s]
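
To make the midpoint rule in `create_bio_markup` concrete, here is a minimal, self-contained sketch that works on an in-memory string instead of files. The helper name `bio_from_spans`, the toy sentence, and its spans are illustrative assumptions, not part of the notebook's pipeline.

```python
import re


def bio_from_spans(text, spans):
    """Assign BIO labels to whitespace tokens from (tag, start, end) character spans.

    A token belongs to a span when the token's midpoint falls inside it,
    mirroring the rule used in create_bio_markup.
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(tokens)
    for tag, start, end in spans:
        inside = [
            i for i, (_, tok_start, tok_end) in enumerate(tokens)
            if start <= (tok_start + tok_end) // 2 < end
        ]
        # First matched token opens the entity, the rest continue it
        for rank, i in enumerate(inside):
            labels[i] = ("B-" if rank == 0 else "I-") + tag
    return [tok for tok, _, _ in tokens], labels


# Hypothetical example: character spans use an exclusive end offset
toy_text = "Герман Греф прибыл в Москву."
toy_spans = [("PER", 0, 11), ("LOC", 21, 27)]
toy_tokens, toy_labels = bio_from_spans(toy_text, toy_spans)
print(list(zip(toy_tokens, toy_labels)))
```

Because tokens are split on whitespace only, punctuation stays attached to its word (`Москву.`), so an entity label can also cover a trailing period.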


In [76]:
def prepare_dataset(data):
    all_words = []
    all_labels = []

    for tokens, tags in data:
        words = [str(t) for t in tokens]
        labels = [str(l) for l in tags]
        
        all_words.append(words)
        all_labels.append(labels)

    return {"tokens": all_words, "ner_tags": all_labels}


dataset = prepare_dataset(data)
dataset.keys()

dict_keys(['tokens', 'ner_tags'])

In [77]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
train_dataset = Dataset.from_dict(prepare_dataset(train_data))
test_dataset = Dataset.from_dict(prepare_dataset(test_data))

# Fine-tuning the rubert-tiny2 model

In [78]:
model_id = "cointegrated/rubert-tiny2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
label_list = sorted(list(set(label for ex in dataset['ner_tags'] for label in ex)))

label2id = {
    'O': 0,
    'B-GEOPOLIT': 1,
    'I-GEOPOLIT': 2,
    'B-MEDIA': 3,
    'I-MEDIA': 4,
    'B-LOC': 5,
    'I-LOC': 6,
    'B-ORG': 7,
    'I-ORG': 8,
    'B-PER': 9,
    'I-PER': 10,
}

id2label = {i: label for label, i in label2id.items()}
# Size the classification head from the mapping that is actually passed to the model
num_labels = len(label2id)

In [9]:
len(id2label.keys()), len(label2id.keys())

(11, 11)

In [79]:
# create a label for every token

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        add_special_tokens=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label2id[label[word_idx]])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs



train_dataset_tok = train_dataset.map(tokenize_and_align_labels, batched=True)
test_dataset_tok = test_dataset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/652 [00:00<?, ? examples/s]

Map:   0%|          | 0/164 [00:00<?, ? examples/s]
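
The alignment rule inside `tokenize_and_align_labels` can be checked without a tokenizer. The sketch below applies the same logic to a hand-written `word_ids` list; the function name `align_labels_with_word_ids` and the subword split shown in the comment are hypothetical and only serve the illustration.

```python
def align_labels_with_word_ids(word_ids, word_label_ids, ignore_index=-100):
    """Keep the label on the first subword of each word and mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] have no source word
            aligned.append(ignore_index)
        elif word_id != previous:
            # First subword of a new word carries the word's label
            aligned.append(word_label_ids[word_id])
        else:
            # Later subwords of the same word are ignored by the loss
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hypothetical word_ids for "[CLS] Вла ##димир Путин [SEP]"
toy_word_ids = [None, 0, 0, 1, None]
toy_word_labels = [9, 10]  # B-PER, I-PER in the notebook's label2id
aligned = align_labels_with_word_ids(toy_word_ids, toy_word_labels)
print(aligned)  # [-100, 9, -100, 10, -100]
```

Masking the non-first subwords means the loss and the metrics count each original word once, regardless of how many pieces the tokenizer splits it into.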

In [81]:
def convert_tags_to_ids(dataset, label2id):
    # Map every string tag of every example to its integer id
    return [[label2id[label] for label in example["ner_tags"]] for example in dataset]

train_num_ner_tags = convert_tags_to_ids(train_dataset, label2id)
test_num_ner_tags = convert_tags_to_ids(test_dataset, label2id)


all_labels = []
for example in train_num_ner_tags:
    all_labels.extend([label for label in example if label != -100])

In [82]:
# Sort explicitly so that weight i always corresponds to label id i
classes = np.array(sorted(label2id.values()))
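
For intuition about the class weights used in the loss, the sketch below reimplements the `"balanced"` heuristic of scikit-learn's `compute_class_weight`, which is `n_samples / (n_classes * count)`, in plain Python. The helper name `balanced_class_weights` and the toy label list are assumptions made for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Toy distribution: class 0 plays the role of the dominant "O" tag
toy_labels = [0] * 8 + [1] * 1 + [2] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare classes receive proportionally larger weights, so an error on an entity tag costs more than an error on the frequent `O` tag.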

In [83]:
# Define a class-weighted loss and a custom metric for Trainer



seqeval = evaluate.load("seqeval")
# "balanced" weights are inversely proportional to class frequency
loss_weights = compute_class_weight(class_weight="balanced", classes=classes, y=all_labels)


class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # Class-weighted cross-entropy; positions labelled -100 are skipped (default ignore_index)
        weights = torch.tensor(loss_weights, dtype=torch.float, device=logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(
            logits.view(-1, self.model.config.num_labels),
            labels.view(-1).to(logits.device),
        )
        return (loss, outputs) if return_outputs else loss

def compute_metrics(p):
    predictions, labels = p.predictions, p.label_ids
    preds = np.argmax(predictions, axis=2)

    # Drop positions labelled -100 (special tokens and non-first subword pieces)
    true_labels = [
        [id2label[int(l)] for l in label if l != -100]
        for label in labels
    ]
    true_preds = [
        [id2label[int(pr)] for pr, l in zip(pred, label) if l != -100]
        for pred, label in zip(preds, labels)
    ]

    results = seqeval.compute(predictions=true_preds, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
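
The metrics returned by `compute_metrics` mix two levels: precision, recall and F1 are computed over whole entities, while accuracy is computed per token. The sketch below is a simplified pure-Python approximation of that distinction, not seqeval's actual implementation; `extract_entities`, `entity_prf` and the toy sequences are assumptions made for the example.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO sequence (end exclusive)."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        begins = tag.startswith("B-")
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            entities.add((current, start, i))
            start, current = None, None
        # An orphan I- tag is treated as the start of a new entity
        if begins or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return entities


def entity_prf(true_seqs, pred_seqs):
    """Entity-level precision, recall and F1: a span counts only on an exact match."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents, pred_ents = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_pred += len(pred_ents)
        n_true += len(true_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "B-LOC"]]  # one spurious ORG entity
precision, recall, f1 = entity_prf(y_true, y_pred)
token_accuracy = sum(t == p for t, p in zip(y_true[0], y_pred[0])) / len(y_true[0])
print(precision, recall, f1, token_accuracy)
```

A single spurious entity leaves recall untouched but lowers precision, which is why a model that over-predicts entities can show high recall and token accuracy together with a much lower precision.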


In [85]:
model = AutoModelForTokenClassification.from_pretrained(
    model_id, 
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
).to(device)

training_args = TrainingArguments(
    "exp1-ner-rubert-tiny2",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=12,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_tok,
    eval_dataset=test_dataset_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at cointegrated/rubert-tiny2 were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at cointegrated/rubert-tiny2 and are new

In [86]:
ts_ds=test_dataset_tok.remove_columns(["tokens", "ner_tags"])

outputs = trainer.predict(ts_ds)
print("== Before fine-tuning ==")
print(compute_metrics(outputs))

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


== Before fine-tuning ==
{'precision': 0.01004879662558928, 'recall': 0.055064581917063225, 'f1': 0.01699597831788774, 'accuracy': 0.1335840684019698}


In [87]:
trainer.train()



Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,2.3455,2.120201,0.046588,0.287559,0.080185,0.195357
2,2.0649,1.827567,0.114644,0.493542,0.186066,0.466421
3,1.8296,1.571372,0.226326,0.598459,0.328442,0.709643
4,1.6203,1.349983,0.300841,0.656696,0.412644,0.782835
5,1.4228,1.164648,0.355865,0.712214,0.474594,0.819335
6,1.2489,1.022465,0.386856,0.740313,0.508166,0.837518
7,1.1394,0.909855,0.415569,0.779062,0.542015,0.847773
8,1.0154,0.832369,0.428171,0.796284,0.556894,0.852671
9,0.9384,0.776456,0.438878,0.811919,0.56977,0.856459
10,0.8403,0.739672,0.446211,0.82053,0.578065,0.857974


TrainOutput(global_step=132, training_loss=1.2967165907224019, metrics={'train_runtime': 61.0715, 'train_samples_per_second': 128.112, 'train_steps_per_second': 2.161, 'total_flos': 132073898319072.0, 'train_loss': 1.2967165907224019, 'epoch': 12.0})

In [88]:
ts_ds=test_dataset_tok.remove_columns(["tokens", "ner_tags"])

outputs = trainer.predict(ts_ds)
print("== After fine-tuning ==")
print(compute_metrics(outputs))

== After fine-tuning ==
{'precision': 0.4537773359840954, 'recall': 0.8275549512803082, 'f1': 0.5861487842067249, 'accuracy': 0.8610584988365171}


# Try to improve model quality in the following ways:
- First fine-tune on the train part in MLM mode, then fine-tune on the NER task - 2 points
- Generate synthetic labels* for a news corpus you consider suitable**, using a large, strong model for Russian NER***, and then use them together with the main dataset to fine-tune rubert-tiny2 - 2 points

In [89]:
train_full_text = [" ".join(data['tokens']) for data in train_dataset]
test_full_text = [" ".join(data['tokens']) for data in test_dataset]

train_mlm_dataset = Dataset.from_dict({"text": train_full_text})
test_mlm_dataset = Dataset.from_dict({"text": test_full_text})

In [90]:
def tokenize_mlm(example):
    return tokenizer(example['text'], return_special_tokens_mask=True)

train_tokenized_mlm_dataset = train_mlm_dataset.map(tokenize_mlm, batched=True, remove_columns=["text"])
test_tokenized_mlm_dataset = test_mlm_dataset.map(tokenize_mlm, batched=True, remove_columns=["text"])

Map:   0%|          | 0/652 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2308 > 2048). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/164 [00:00<?, ? examples/s]

In [91]:
block_size = 256

def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

train_mlm_dataset = train_tokenized_mlm_dataset.map(group_texts, batched=True)
test_mlm_dataset = test_tokenized_mlm_dataset.map(group_texts, batched=True)

Map:   0%|          | 0/652 [00:00<?, ? examples/s]

Map:   0%|          | 0/164 [00:00<?, ? examples/s]
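
The effect of `group_texts` is easier to see on toy data. The sketch below concatenates short sequences and cuts them into fixed-size blocks. The name `group_into_blocks` is hypothetical, and unlike `group_texts` this simplified version always drops the remainder, even when the total is shorter than one block.

```python
def group_into_blocks(sequences, block_size):
    """Concatenate token-id sequences and split the result into equal blocks."""
    flat = [tok for seq in sequences for tok in seq]
    # Keep only as many tokens as fill complete blocks
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


toy_sequences = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
blocks = group_into_blocks(toy_sequences, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Blocks cross document boundaries, and the leftover tail (`9, 10` here) is discarded. With `map(..., batched=True)` this happens once per batch of examples, so a little text is lost at the end of every batch.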

In [93]:
from transformers import DataCollatorForLanguageModeling, AutoModelForMaskedLM, Trainer, TrainingArguments

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
model_mlm = AutoModelForMaskedLM.from_pretrained(model_id)

training_args_mlm = TrainingArguments(
    output_dir="./mlm-rubert-tiny2",
    evaluation_strategy="epoch", 
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    seed=SEED
)

trainer_mlm = Trainer(
    model=model_mlm,
    args=training_args_mlm,
    train_dataset=train_mlm_dataset,
    eval_dataset=test_mlm_dataset,
    data_collator=data_collator,
)

trainer_mlm.train()

Some weights of the model checkpoint at cointegrated/rubert-tiny2 were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch,Training Loss,Validation Loss
1,No log,3.093219
2,No log,3.11269
3,No log,3.034633
4,No log,3.034324
5,No log,2.983469
6,No log,2.952465
7,No log,2.969285
8,No log,2.995662
9,No log,3.046781
10,No log,2.978892




TrainOutput(global_step=90, training_loss=3.2770406087239583, metrics={'train_runtime': 77.614, 'train_samples_per_second': 99.982, 'train_steps_per_second': 1.16, 'total_flos': 29611075338240.0, 'train_loss': 3.2770406087239583, 'epoch': 10.0})

In [95]:
eval_result = trainer_mlm.evaluate()
print(f"Perplexity: {np.exp(eval_result['eval_loss']):.2f}")



Perplexity: 21.24
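
Perplexity is the exponential of the mean masked-token cross-entropy, so the two quantities convert into each other directly. The sketch below uses two validation losses from the training log above; the function name `perplexity` is an assumption made for the example.

```python
import math


def perplexity(mean_nll):
    """Perplexity is exp of the mean negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


# Two validation losses taken from the MLM training log above
for loss in (3.093219, 2.952465):
    print(f"loss={loss:.4f} -> perplexity={perplexity(loss):.2f}")

# Going the other way: the reported perplexity implies this evaluation loss
implied_loss = math.log(21.24)
print(f"perplexity=21.24 -> loss={implied_loss:.3f}")
```

The reported perplexity of 21.24 corresponds to a loss of about 3.06, which does not match any row of the table exactly. One likely reason is that `DataCollatorForLanguageModeling` samples the masked positions anew on every pass, so repeated evaluations of the same model give slightly different losses.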


### After MLM, fine-tune the model on the NER task

In [None]:
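model = AutoModelForTokenClassification.from_pretrained(
    # NOTE: this must point to a checkpoint that exists on disk; the MLM run logged above
    # ends at global_step=90, so "checkpoint-40" appears to come from a different run
    "mlm-rubert-tiny2/checkpoint-40", 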
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
).to(device)

training_args = TrainingArguments(
    "exp2-ner-mlm-rubert-tiny2",
    evaluation_strategy="epoch",  # same spelling as the other cells; newer transformers releases rename it to eval_strategy
    learning_rate=5e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=12,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_tok,
    eval_dataset=test_dataset_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

#### Let's synthesize data from lenta-ru and fine-tune the model again through the same pipeline (MLM + FT)

In [17]:
from deeppavlov import build_model

ner_model = build_model('ner_collection3_bert', download=True, install=True)

Ignoring transformers: markers 'python_version < "3.8"' don't match your environment


2025-04-14 05:48:02.460 INFO in 'deeppavlov.download'['download'] at line 138: Skipped http://files.deeppavlov.ai/v1/ner/ner_rus_bert_coll3_torch.tar.gz download because of matching hashes
Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you 

In [18]:
preds = ner_model(["Президент России Владимир Путин прибыл в Ташкент."])
print(preds)

[[['Президент', 'России', 'Владимир', 'Путин', 'прибыл', 'в', 'Ташкент', '.']], [['O', 'S-LOC', 'B-PER', 'E-PER', 'O', 'O', 'S-LOC', 'O']]]
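
The tagger answers in the BIOES scheme: `S-` marks a single-token entity, and `B-` ... `E-` mark the first and last tokens of a longer one. The sketch below decodes the prediction printed above into entity strings; the helper name `decode_bioes` is an assumption, and the decoder is deliberately simple, dropping any span that is not closed properly.

```python
def decode_bioes(tokens, tags):
    """Collect (entity_text, entity_type) pairs from BIOES-tagged tokens."""
    entities, buffer, buffer_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "S":
            # Single-token entity
            entities.append((token, ent_type))
            buffer, buffer_type = [], None
        elif prefix == "B":
            buffer, buffer_type = [token], ent_type
        elif prefix in ("I", "E") and buffer_type == ent_type:
            buffer.append(token)
            if prefix == "E":
                entities.append((" ".join(buffer), ent_type))
                buffer, buffer_type = [], None
        else:
            # "O" or an inconsistent tag: drop any unfinished span
            buffer, buffer_type = [], None
    return entities


tokens = ['Президент', 'России', 'Владимир', 'Путин', 'прибыл', 'в', 'Ташкент', '.']
tags = ['O', 'S-LOC', 'B-PER', 'E-PER', 'O', 'O', 'S-LOC', 'O']
entities = decode_bioes(tokens, tags)
print(entities)
```

Since the Collection5 labels in this notebook use plain BIO, the `S-` and `E-` prefixes have to be mapped to `B-` and `I-` before the synthetic tags can be looked up in `label2id`.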


In [19]:
df = pd.read_csv("cleaned-lenta-ru-news-100k.csv")
full_text = df['full_text'][:10000].tolist()

def prep_text(text:str):
    return text.replace('\xa0', ' ')

full_text = list(map(prep_text, full_text))


def split_text(text, max_tokens=512):
    if isinstance(text, list):
        text = " ".join(text)
    tokens = tokenizer.tokenize(text)
    chunks = []
    current_chunk = []
    for token in tqdm(tokens):
        current_chunk.append(token)
        if len(current_chunk) >= max_tokens:
            chunks.append(tokenizer.convert_tokens_to_string(current_chunk))
            current_chunk = []
    if current_chunk:
        chunks.append(tokenizer.convert_tokens_to_string(current_chunk))
    return chunks

In [20]:
chunks = split_text(full_text)

all_preds = []
for chunk in tqdm(chunks):
    try:
        preds = ner_model([chunk])
    except Exception:
        # Skip chunks the tagger cannot process; note that failures are dropped silently
        continue
    all_preds.append(preds)

Token indices sequence length is longer than the specified maximum sequence length for this model (2689016 > 2048). Running this sequence through the model will result in indexing errors
100%|█████████████████████████████| 2689016/2689016 [00:02<00:00, 968454.60it/s]
100%|███████████████████████████████████████| 5252/5252 [02:57<00:00, 29.63it/s]


In [21]:
def replace_tag(tags: list):
    # Convert BIOES tags to BIO by rewriting only the prefix: E- -> I-, S- -> B-
    prefix_map = {'E-': 'I-', 'S-': 'B-'}
    return [prefix_map.get(tag[:2], tag[:2]) + tag[2:] for tag in tags]

add_dataset = {
    'tokens': [],
    'ner_tags': []
}

for i in all_preds:
    ner_tag = replace_tag(i[1][0])
    add_dataset['tokens'].append(i[0][0])
    add_dataset['ner_tags'].append(ner_tag)

### Move on to MLM training

In [24]:
from sklearn.model_selection import train_test_split

train_data_tokens, test_data_tokens, train_data_tags, test_data_tags = train_test_split(
    add_dataset["tokens"], add_dataset["ner_tags"], test_size=0.2, random_state=42
)

train_add_dataset = {"tokens": train_data_tokens, "ner_tags": train_data_tags}
test_add_data_dataset = {"tokens": test_data_tokens, "ner_tags": test_data_tags}

In [25]:
from datasets import Dataset

train_add_dataset = Dataset.from_dict(train_add_dataset)
test_add_dataset = Dataset.from_dict(test_add_data_dataset)

In [26]:
train_add_dataset_tok = train_add_dataset.map(tokenize_and_align_labels, batched=True)
test_add_dataset_tok = test_add_dataset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/3804 [00:00<?, ? examples/s]

Map:   0%|          | 0/951 [00:00<?, ? examples/s]

In [27]:
set(train_add_dataset_tok['labels'][0])

{-100, 0, 5, 6, 7, 8, 9, 10}

In [28]:
train_dataset_tok

Dataset({
    features: ['tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 652
})

In [30]:
train_merged_dataset = concatenate_datasets([train_add_dataset_tok, train_dataset_tok])
test_merged_dataset = concatenate_datasets([test_add_dataset_tok, test_dataset_tok])

## MLM training

In [31]:
train_full_text = [" ".join(data['tokens']) for data in train_merged_dataset]
test_full_text = [" ".join(data['tokens']) for data in test_merged_dataset]

train_mlm_dataset = Dataset.from_dict({"text": train_full_text})
test_mlm_dataset = Dataset.from_dict({"text": test_full_text})

In [32]:
def tokenize_mlm(example):
    return tokenizer(example['text'], return_special_tokens_mask=True)

train_tokenized_mlm_dataset = train_mlm_dataset.map(tokenize_mlm, batched=True, remove_columns=["text"])
test_tokenized_mlm_dataset = test_mlm_dataset.map(tokenize_mlm, batched=True, remove_columns=["text"])

Map:   0%|          | 0/4456 [00:00<?, ? examples/s]

Map:   0%|          | 0/1115 [00:00<?, ? examples/s]

In [33]:
block_size = 256

def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

train_mlm_dataset = train_tokenized_mlm_dataset.map(group_texts, batched=True)
test_mlm_dataset = test_tokenized_mlm_dataset.map(group_texts, batched=True)

Map:   0%|          | 0/4456 [00:00<?, ? examples/s]

Map:   0%|          | 0/1115 [00:00<?, ? examples/s]

In [34]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
model_mlm = AutoModelForMaskedLM.from_pretrained(model_id)

training_args_mlm = TrainingArguments(
    output_dir="./exp3-mlm-rubert-tiny2",
    evaluation_strategy="epoch", 
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    seed=SEED
)

trainer_mlm = Trainer(
    model=model_mlm,
    args=training_args_mlm,
    train_dataset=train_mlm_dataset,
    eval_dataset=test_mlm_dataset,
    data_collator=data_collator,
)

trainer_mlm.train()

In [35]:
eval_result = trainer_mlm.evaluate()
print(f"Perplexity: {np.exp(eval_result['eval_loss']):.2f}")

In [36]:
ls exp3-mlm-rubert-tiny2/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
checkpoint-1062/  checkpoint-1770/  checkpoint-2832/  checkpoint-3540/
checkpoint-1416/  checkpoint-2124/  checkpoint-3186/  checkpoint-531/
checkpoint-177/   checkpoint-2478/  checkpoint-354/   checkpoint-708/


## Move on to training on the NER task

In [49]:
model = AutoModelForTokenClassification.from_pretrained(
    "exp3-mlm-rubert-tiny2/checkpoint-3540/", 
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
).to(device)

training_args = TrainingArguments(
    "exp3-ner-rubert-tiny2",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=12,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED,
    # gradient_accumulation_steps=4,
    # fp16=True
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_merged_dataset,
    eval_dataset=test_merged_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at exp3-mlm-rubert-tiny2/checkpoint-3540/ were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at exp3-mlm-rubert-ti

In [50]:
trainer.train()



Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.4027,0.367783,0.603602,0.845057,0.704207,0.938915
2,0.2582,0.258535,0.639106,0.866841,0.735754,0.94495
3,0.204,0.213745,0.691225,0.886488,0.776773,0.955109
4,0.1622,0.190953,0.73379,0.88773,0.803453,0.962428
5,0.1441,0.182828,0.750827,0.901664,0.819362,0.965762
6,0.1209,0.183116,0.763709,0.905961,0.828776,0.967167
7,0.0974,0.17766,0.763815,0.906731,0.82916,0.967241
8,0.1079,0.174981,0.775119,0.909811,0.837081,0.969179
9,0.0864,0.179772,0.789218,0.9116,0.846006,0.971388
10,0.0815,0.179025,0.786179,0.911898,0.844385,0.970847


TrainOutput(global_step=2232, training_loss=0.19359060778119017, metrics={'train_runtime': 302.0502, 'train_samples_per_second': 177.03, 'train_steps_per_second': 7.39, 'total_flos': 447207895293888.0, 'train_loss': 0.19359060778119017, 'epoch': 12.0})
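
For the final comparison step of the assignment, the metrics logged in this notebook can be gathered into one table. The sketch below copies the reported numbers, rounded to four decimals. The `results` layout and the `format_table` helper are assumptions, and the MLM-then-NER run on Collection5 alone is left out because its cells have no recorded output.

```python
# Metrics copied (rounded to 4 decimals) from the outputs logged in this notebook.
# Columns: run, evaluation set, precision, recall, F1, accuracy
results = [
    ("rubert-tiny2 before fine-tuning", "Collection5 test", 0.0100, 0.0551, 0.0170, 0.1336),
    ("exp1: NER fine-tuning", "Collection5 test", 0.4538, 0.8276, 0.5861, 0.8611),
    ("exp3: MLM + NER, merged data (epoch 10)", "Collection5 + synthetic test", 0.7862, 0.9119, 0.8444, 0.9708),
]


def format_table(rows):
    """Render the rows as a fixed-width plain-text table."""
    header = f"{'run':<42} {'eval set':<28} {'P':>7} {'R':>7} {'F1':>7} {'Acc':>7}"
    lines = [header, "-" * len(header)]
    for run, eval_set, p, r, f1, acc in rows:
        lines.append(f"{run:<42} {eval_set:<28} {p:>7.4f} {r:>7.4f} {f1:>7.4f} {acc:>7.4f}")
    return "\n".join(lines)


table = format_table(results)
print(table)
```

The exp3 row is measured on the merged test set, which includes model-generated labels, so it is not directly comparable with the rows measured on the Collection5 test part alone. Re-running `trainer.predict` on `test_dataset_tok` would put all runs on the same footing.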

# Let's look at the result

In [63]:
model_path = "exp3-ner-rubert-tiny2/checkpoint-2232"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(
    model_path, 
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
)

In [67]:
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

text = "Генеральный директор Сбербанка Герман Греф на конференции в Москве заявил, что сотрудничество с Яндексом в области искусственного интеллекта выходит на новый уровень. Он также отметил, что правительство Российской Федерации поддерживает развитие цифровой экономики, особенно в рамках Евразийского экономического союза."
results = ner_pipeline(text)

for entity in results:
    print(entity)

{'entity_group': 'ORG', 'score': 0.951569, 'word': 'Сбербанка', 'start': 21, 'end': 30}
{'entity_group': 'PER', 'score': 0.9922959, 'word': 'Герман Греф', 'start': 31, 'end': 42}
{'entity_group': 'LOC', 'score': 0.60198957, 'word': 'Москве', 'start': 60, 'end': 66}
{'entity_group': 'ORG', 'score': 0.6973838, 'word': 'Яндексом', 'start': 96, 'end': 104}
{'entity_group': 'GEOPOLIT', 'score': 0.9631994, 'word': 'Российской Федерации', 'start': 203, 'end': 223}
{'entity_group': 'ORG', 'score': 0.85091865, 'word': 'Евразийского экономического союза.', 'start': 284, 'end': 318}
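
The last entity above ends with a period (`союза.`). This is consistent with the whitespace-only tokenization used to build the training labels, where punctuation stays glued to the word. One possible post-processing step is to trim punctuation from the edges of each predicted span, as in the sketch below; the `trim_entity` helper, the `PUNCT` set, and the shortened toy text are assumptions made for the example.

```python
import string

# ASCII punctuation plus a few characters common in Russian news text
PUNCT = string.punctuation + "«»…"


def trim_entity(entity, text):
    """Shrink an entity span so that it neither starts nor ends with punctuation."""
    start, end = entity["start"], entity["end"]
    while start < end and text[start] in PUNCT:
        start += 1
    while end > start and text[end - 1] in PUNCT:
        end -= 1
    return {**entity, "start": start, "end": end, "word": text[start:end]}


toy_text = "в рамках Евразийского экономического союза."
toy_entity = {
    "entity_group": "ORG",
    "word": "Евразийского экономического союза.",
    "start": toy_text.index("Евразийского"),
    "end": len(toy_text),
}
trimmed = trim_entity(toy_entity, toy_text)
print(trimmed["word"], trimmed["start"], trimmed["end"])
```

Trimming at inference time only hides the symptom. Separating punctuation from words when the BIO markup is built would address it at the source.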


In [None]:
# the model has been pushed to the Hugging Face Hub: r1char9/ner-rubert-tiny-RuNews