# Token Classification. Practical Assignment (PJ)

To consolidate the material from this module, solve the NER task on the provided dataset using any tools available to you. The model must be trained on train.txt and validated on dev.txt, and its quality must be evaluated on test.txt.

To get the best result, pay attention to tuning the hyperparameters, both those of the architecture and those of the training procedure.

<hr>
**Project grading criteria:**

- [x] overall code quality and PEP-8 compliance;
- [ ] recurrent networks are used;
- [x] architectures close to the state of the art for this task are used;
- [x] hyperparameters are tuned;
- [x] learning rate adjustment techniques (lr scheduler) are used;
- [x] a loss function appropriate to the task is used;
- [x] regularization techniques are used;
- [x] the model is validated correctly;
- [ ] ensemble techniques are used;
- [ ] additional data is used;
- [x] the final value of the quality metric is > 0.6 (F1).

<hr>

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

Let's read the files and count how many NER tags of each type occur in the text.

In [2]:
with open('train.txt', 'r', encoding="utf-8") as train:
    train_words = train.readlines()
with open('dev.txt', 'r', encoding="utf-8") as dev:
    dev_words = dev.readlines()
with open('test.txt', 'r', encoding="utf-8") as test:
    test_words = test.readlines()

In [3]:
tags = []
for word in train_words:
    if word != '\n':
        token, tag = word.split(' ')
        tags.append(tag.replace('\n', ''))

In [4]:
for word in dev_words:
    if word != '\n':
        token, tag = word.split(' ')
        tags.append(tag.replace('\n', ''))

In [5]:
from collections import Counter
Counter(tags)

Counter({'O': 176027,
         'B-PER': 8469,
         'B-ORG': 6713,
         'I-ORG': 4413,
         'B-LOC': 5729,
         'I-LOC': 1279,
         'I-PER': 5246})

The class imbalance is obvious, but we will not correct for it here: transformer models should cope with it without problems.
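To put a number on the imbalance, the counts printed above can be converted into shares. This is a minimal sketch; the variable names `tag_counts` and `shares` are illustrative and do not appear in the original notebook.

```python
from collections import Counter

# Tag counts copied from the Counter output above (train.txt + dev.txt).
tag_counts = Counter({
    "O": 176027, "B-PER": 8469, "B-ORG": 6713, "I-ORG": 4413,
    "B-LOC": 5729, "I-LOC": 1279, "I-PER": 5246,
})

total = sum(tag_counts.values())
shares = {tag: count / total for tag, count in tag_counts.items()}

# Print tags from most to least frequent.
for tag, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{tag:<6} {share:.3f}")
```

Roughly 85% of all tokens carry the `O` tag, which is why token-level accuracy alone says little about entity quality.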

We put the NER labels and their numeric ids into a separate dictionary. We reuse a ready-made mapping found online, since the tags in our data match it (the MISC tags from that mapping simply do not occur in the counts above).

In [6]:
label2id = {'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-MISC': 7,
 'I-MISC': 8}

id2label={0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}

label_list = list(label2id.keys())

Process the loaded files:
- split them into sentences on blank lines (\n)
- for each sentence, collect its tokens and ner_tags into separate lists
- replace each ner_tag with its numeric id
- convert all 3 files to the dataset format

In [7]:
def split_into_sentences(lst):
    """Group 'token tag' lines into sentences separated by blank lines."""
    temp_list_tokens = []
    temp_list_ner_tags = []
    res = []
    count = 0
    for word in lst:
        if word == '\n':
            # A blank line closes the current sentence.
            res.append({'id': count,
                        'tokens': temp_list_tokens,
                        'ner_tags': temp_list_ner_tags})
            temp_list_tokens = []
            temp_list_ner_tags = []
            count += 1
        else:
            token, ner_tag = word.split(' ')
            temp_list_tokens.append(token)
            # Store the numeric id of the tag instead of its name.
            temp_list_ner_tags.append(label2id[ner_tag.replace('\n', '')])

    return res
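Note that `split_into_sentences` emits a sentence only when it meets a blank line, so a file that does not end with a blank line would lose its last sentence, and consecutive blank lines would produce empty sentences. A hedged alternative sketch that handles both cases is shown below; `iter_sentences`, `toy_label2id` and `toy_lines` are hypothetical names used for illustration only.

```python
def iter_sentences(lines, label2id):
    """Yield one dict per sentence from CoNLL-style 'token tag' lines."""
    tokens, tags = [], []
    sentence_id = 0
    for line in lines:
        line = line.strip()
        if not line:
            if tokens:  # ignore repeated blank lines
                yield {"id": sentence_id, "tokens": tokens, "ner_tags": tags}
                sentence_id += 1
                tokens, tags = [], []
            continue
        # Split on the last space so the tag is always the final field.
        token, tag = line.rsplit(" ", 1)
        tokens.append(token)
        tags.append(label2id[tag])
    if tokens:  # flush the last sentence when there is no trailing blank line
        yield {"id": sentence_id, "tokens": tokens, "ner_tags": tags}


# Toy input: two sentences, the second one without a trailing blank line.
toy_label2id = {"O": 0, "B-PER": 1}
toy_lines = ["Если O\n", "Миронов B-PER\n", "\n", "занял O"]
sentences = list(iter_sentences(toy_lines, toy_label2id))
print(sentences)
```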

In [8]:
from datasets import Dataset, DatasetDict
import datasets

dataset = DatasetDict({
    "train": Dataset.from_pandas(pd.DataFrame(split_into_sentences(train_words))),
    "valid": Dataset.from_pandas(pd.DataFrame(split_into_sentences(dev_words))),
    'test': Dataset.from_pandas(pd.DataFrame(split_into_sentences(test_words)))
    })
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 7746
    })
    valid: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2582
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2582
    })
})

In [9]:
for token, ner_tag in zip(dataset['train'][0]['tokens'], dataset['train'][0]['ner_tags']):
    print(f'{token:_<40}{ner_tag}')

"_______________________________________0
Если____________________________________0
Миронов_________________________________1
занял___________________________________0
столь___________________________________0
оппозиционную___________________________0
позицию_________________________________0
,_______________________________________0
то______________________________________0
мне_____________________________________0
представляется__________________________0
,_______________________________________0
что_____________________________________0
для_____________________________________0
него____________________________________0
было____________________________________0
бы______________________________________0
порядочным______________________________0
и_______________________________________0
правильным______________________________0
уйти____________________________________0
в_______________________________________0
отставку________________________________0
с_________________________________

Everything looks fine so far, but there is a small technical difficulty to keep in mind in tasks like this one. During tokenization, for example with the WordPiece algorithm, some words are split into several tokens, so the number of tokens no longer matches the number of labels for the text. The same applies to special tokens such as [CLS] and [SEP] and to the padding token [PAD].

We therefore need an additional function that aligns the labels with the tokens produced by the tokenizer.
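The alignment idea can be illustrated without a tokenizer. The sketch below assumes a hand-written `word_ids` list (the kind of token-to-word mapping a fast tokenizer returns) and keeps a label only on the first sub-token of each word; every other position gets -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. The function name `align_labels` and the toy inputs are illustrative.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels):
    """Expand word-level labels to token level, labelling first sub-tokens only."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:            # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first sub-token of a new word
            aligned.append(word_labels[word_id])
        else:                          # continuation sub-token (e.g. ##очным)
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical tokenization of ["Миронов", "порядочным"]:
# [CLS] Миронов поряд ##очным [SEP]
word_ids = [None, 0, 1, 1, None]
word_labels = [1, 0]  # B-PER, O
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, 0, -100, -100]
```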

I will use the Russian-language model https://huggingface.co/DeepPavlov/rubert-base-cased

Let's look at an example of a single sentence.

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")

In [11]:
example = dataset["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

print(f'Token len from BERT tokenizer__{len(tokens)}')
print(f'Token len initial tokens__{len(example["tokens"])}\n')

print(' '.join(tokens))

Token len from BERT tokenizer__53
Token len initial tokens__47

[CLS] " Если Миронов занял столь оппозиционную позицию , то мне представляется , что для него было бы поряд ##очным и правильным уйти в отставку с заним ##аемого им поста , поста , который предоставлен ему сегодня " Единой Россией ' ' и никем больше ' ' , - заключает Исаев . [SEP]


In [12]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [13]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True,
                               remove_columns = ['id', 'tokens', 'ner_tags'])
tokenized_dataset

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7746
    })
    valid: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2582
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2582
    })
})

Let's check how the labels were assigned to the tokens after alignment:

In [14]:
for token, label in zip(tokenizer.convert_ids_to_tokens(tokenized_dataset['train'][0]['input_ids']), 
                        tokenized_dataset['train'][0]['labels']):
    print(f'{token:_<40}{label}')

[CLS]___________________________________-100
"_______________________________________0
Если____________________________________0
Миронов_________________________________1
занял___________________________________0
столь___________________________________0
оппозиционную___________________________0
позицию_________________________________0
,_______________________________________0
то______________________________________0
мне_____________________________________0
представляется__________________________0
,_______________________________________0
что_____________________________________0
для_____________________________________0
него____________________________________0
было____________________________________0
бы______________________________________0
поряд___________________________________0
##очным_________________________________-100
и_______________________________________0
правильным______________________________0
уйти____________________________________0
в___________________________

We will use the seqeval quality metric: https://huggingface.co/spaces/evaluate-metric/seqeval

<hr>
**For info:** seqeval is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.

Overall:
- accuracy: the average accuracy, on a scale between 0.0 and 1.0.
- precision: the average precision, on a scale between 0.0 and 1.0.
- recall: the average recall, on a scale between 0.0 and 1.0.
- f1: the average F1 score, which is the harmonic mean of the precision and recall. It also has a scale of 0.0 to 1.0.
<hr>
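To show why an entity-level score behaves differently from token accuracy, here is a simplified pure-Python sketch of span-based F1 over BIO tags. It is not seqeval's actual implementation: the function names are hypothetical, and treating a stray `I-` tag as the start of a new entity is an assumption of this sketch.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            entities.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ent_type = i, tag[2:]
    return entities


def entity_f1(true_tags, pred_tags):
    """Entity-level F1: a prediction counts only if type and span both match."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = ["B-PER", "I-PER", "O", "B-LOC"]
y_pred = ["B-PER", "O", "O", "B-LOC"]  # the person span is cut short

token_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = entity_f1(y_true, y_pred)
print(token_accuracy, f1)  # 0.75 0.5
```

One wrong token out of four costs a quarter of the accuracy but half of the entity-level F1, because the truncated person span no longer matches the reference span.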

Next, we define the metric function that the Trainer will call during evaluation:

In [15]:
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)

    # Drop positions labelled -100 (special tokens and non-first sub-tokens).
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [16]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [17]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

In [18]:
LR_VALUES = (2e-5, 5e-5)
DECAY_VALUES = (1e-4, 0.1)

In [19]:
from transformers import TrainingArguments, Trainer

for learning_rate in LR_VALUES:
    for weight_decay in DECAY_VALUES:
        # Start every run from the same pretrained checkpoint.
        model = AutoModelForTokenClassification.from_pretrained(
            "DeepPavlov/rubert-base-cased", num_labels=9, id2label=id2label, label2id=label2id)
        print(f'Log: training for l_r:{learning_rate}, w_d:{weight_decay}...')

        training_args = TrainingArguments(
            output_dir="token_class_model",
            learning_rate=learning_rate,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            num_train_epochs=3,
            weight_decay=weight_decay,
            evaluation_strategy="epoch",
            push_to_hub=False,
            save_strategy="no",
            group_by_length=True,
            warmup_ratio=0.1,
            optim="adamw_torch",
            lr_scheduler_type="cosine",
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_dataset["train"],
            eval_dataset=tokenized_dataset["valid"],
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_metrics,
        )

        trainer.train()

Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initializ

Log: training for l_r:2e-05, w_d:0.0001...


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.024101 | 0.96159 | 0.970261 | 0.965906 | 0.992869 |
| 2 | 0.217300 | 0.018902 | 0.97237 | 0.979087 | 0.975717 | 0.995428 |
| 3 | 0.012400 | 0.016491 | 0.978427 | 0.983308 | 0.980861 | 0.996054 |


Log: training for l_r:2e-05, w_d:0.1...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.023871 | 0.957632 | 0.971412 | 0.964473 | 0.993104 |
| 2 | 0.206100 | 0.019181 | 0.975708 | 0.978703 | 0.977203 | 0.995604 |
| 3 | 0.012500 | 0.018225 | 0.979143 | 0.981773 | 0.980456 | 0.996034 |


Log: training for l_r:5e-05, w_d:0.0001...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.027511 | 0.963979 | 0.970453 | 0.967205 | 0.992537 |
| 2 | 0.149300 | 0.018151 | 0.975568 | 0.980622 | 0.978088 | 0.995643 |
| 3 | 0.012300 | 0.016807 | 0.980865 | 0.9835 | 0.98218 | 0.996444 |


Log: training for l_r:5e-05, w_d:0.1...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.032065 | 0.959488 | 0.963354 | 0.961417 | 0.991345 |
| 2 | 0.150100 | 0.021206 | 0.974388 | 0.978127 | 0.976254 | 0.994471 |
| 3 | 0.011100 | 0.016206 | 0.981266 | 0.984843 | 0.983051 | 0.996601 |


So, the NER task turned out to be very easy for RuBERT on this dataset. The model performed well for every learning rate and weight decay pair tried: the final validation F1 ranged from 0.9805 to 0.9831. Judging by F1, the following parameters worked slightly better:

- learning rate = 5e-05
- weight decay = 0.1

We will use these parameters for the final evaluation on the test set.
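The selection can be made explicit with a few lines of Python. The sketch below hard-codes the final-epoch validation F1 values reported in the training logs above; the names `final_f1`, `best_params` and `spread` are illustrative.

```python
# Final-epoch validation F1 for each (learning_rate, weight_decay) pair,
# copied from the training logs above.
final_f1 = {
    (2e-5, 1e-4): 0.980861,
    (2e-5, 0.1): 0.980456,
    (5e-5, 1e-4): 0.982180,
    (5e-5, 0.1): 0.983051,
}

# Pick the pair with the highest F1 and measure how far apart the runs are.
best_params = max(final_f1, key=final_f1.get)
spread = max(final_f1.values()) - min(final_f1.values())

print(f"best (lr, weight_decay): {best_params}")
print(f"F1 spread across runs:   {spread:.4f}")
```

The spread between the best and the worst run is below 0.003, and each configuration was trained only once, so the ranking of the pairs should be read as indicative rather than conclusive.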

In [26]:
# Best pair from the search above, set explicitly instead of reusing loop variables.
best_learning_rate = 5e-05
best_weight_decay = 0.1

model = AutoModelForTokenClassification.from_pretrained(
    "DeepPavlov/rubert-base-cased", num_labels=9, id2label=id2label, label2id=label2id)
print(f'Log: training for l_r:{best_learning_rate}, w_d:{best_weight_decay}...')

training_args = TrainingArguments(
    output_dir="token_class_model",
    learning_rate=best_learning_rate,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=best_weight_decay,
    evaluation_strategy="epoch",
    push_to_hub=False,
    save_strategy="no",
    group_by_length=True,
    warmup_ratio=0.1,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Log: training for l_r:5e-05, w_d:0.1...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.02848 | 0.95509 | 0.96297 | 0.959014 | 0.991599 |
| 2 | 0.156200 | 0.017952 | 0.977029 | 0.979279 | 0.978153 | 0.995565 |
| 3 | 0.011000 | 0.015034 | 0.981818 | 0.984267 | 0.983041 | 0.996796 |


TrainOutput(global_step=1455, training_loss=0.05859080335938234, metrics={'train_runtime': 173.8455, 'train_samples_per_second': 133.67, 'train_steps_per_second': 8.37, 'total_flos': 323491566549528.0, 'train_loss': 0.05859080335938234, 'epoch': 3.0})
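The run above used `lr_scheduler_type="cosine"` with `warmup_ratio=0.1` over 1455 optimizer steps. The sketch below approximates the shape of such a schedule (linear warmup followed by a half-cosine decay to zero) in plain Python. It is an illustration under those assumptions, not the library's own code; the function name and constants are hypothetical.

```python
import math


def cosine_lr_with_warmup(step, base_lr, total_steps, warmup_ratio):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1455   # global_step reported by TrainOutput above
BASE_LR = 5e-5       # learning rate chosen for the final run
WARMUP_RATIO = 0.1   # same value as in TrainingArguments

for step in (0, 73, 146, 800, 1455):
    lr = cosine_lr_with_warmup(step, BASE_LR, TOTAL_STEPS, WARMUP_RATIO)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With these numbers the warmup lasts 146 steps, so the learning rate peaks at 5e-5 early in the first epoch and decays smoothly to zero by the last step.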

In [27]:
predictions = trainer.predict(test_dataset=tokenized_dataset["test"])

To assess quality on the test set, we look at the metrics returned by trainer.predict and also inspect the results on a few texts by eye. For the latter we use pipeline from transformers.

In [21]:
predictions.metrics

{'test_loss': 0.016201525926589966,
 'test_precision': 0.9802249637155298,
 'test_recall': 0.9848705796573095,
 'test_f1': 0.9825422804146208,
 'test_accuracy': 0.9964176628077982,
 'test_runtime': 5.3586,
 'test_samples_per_second': 481.839,
 'test_steps_per_second': 30.232}

In [28]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [29]:
from transformers import pipeline
ner = pipeline("ner", model=model, tokenizer=tokenizer, device=device)

In [33]:
ner('Меня зовут Шахова Екатерина и я родом из Москвы, работаю в компании Ромашки')

[{'entity': 'B-PER',
  'score': 0.9993605,
  'index': 3,
  'word': 'Шах',
  'start': 11,
  'end': 14},
 {'entity': 'I-PER',
  'score': 0.91396403,
  'index': 4,
  'word': '##ова',
  'start': 14,
  'end': 17},
 {'entity': 'I-PER',
  'score': 0.99700934,
  'index': 5,
  'word': 'Екатерина',
  'start': 18,
  'end': 27},
 {'entity': 'B-LOC',
  'score': 0.9990299,
  'index': 10,
  'word': 'Москвы',
  'start': 41,
  'end': 47},
 {'entity': 'B-PER',
  'score': 0.9988951,
  'index': 16,
  'word': 'Ромаш',
  'start': 68,
  'end': 73},
 {'entity': 'I-PER',
  'score': 0.94312793,
  'index': 17,
  'word': '##ки',
  'start': 73,
  'end': 75}]
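The pipeline returns one prediction per sub-token, so "Шах" and "##ова" appear as separate items. A minimal sketch of merging them back into whole entities using the character offsets is given below; `merge_entities` is a hypothetical helper, and the predictions are copied from the output above with the scores omitted. The pipeline also offers an `aggregation_strategy` argument intended for this kind of grouping.

```python
def merge_entities(text, token_predictions):
    """Merge B-/I- token predictions into (entity_type, surface_text) pairs."""
    spans = []  # each item: [entity_type, start, end]
    for pred in token_predictions:
        prefix, _, ent_type = pred["entity"].partition("-")
        continues = (
            prefix == "I"
            and spans
            and spans[-1][0] == ent_type
            and pred["start"] - spans[-1][2] <= 1  # adjacent or one space apart
        )
        if continues:
            spans[-1][2] = pred["end"]  # extend the current entity
        else:
            spans.append([ent_type, pred["start"], pred["end"]])
    return [(ent_type, text[start:end]) for ent_type, start, end in spans]


text = "Меня зовут Шахова Екатерина и я родом из Москвы, работаю в компании Ромашки"
token_predictions = [
    {"entity": "B-PER", "start": 11, "end": 14},
    {"entity": "I-PER", "start": 14, "end": 17},
    {"entity": "I-PER", "start": 18, "end": 27},
    {"entity": "B-LOC", "start": 41, "end": 47},
    {"entity": "B-PER", "start": 68, "end": 73},
    {"entity": "I-PER", "start": 73, "end": 75},
]

entities = merge_entities(text, token_predictions)
print(entities)
# [('PER', 'Шахова Екатерина'), ('LOC', 'Москвы'), ('PER', 'Ромашки')]
```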

In [25]:
ner(dataset['test'][2]['tokens'])

[[],
 [],
 [],
 [{'entity': 'B-ORG',
   'score': 0.95313144,
   'index': 1,
   'word': 'Yahoo',
   'start': 0,
   'end': 5}],
 [],
 [{'entity': 'B-ORG',
   'score': 0.98629606,
   'index': 3,
   'word': 'OR',
   'start': 2,
   'end': 4},
  {'entity': 'I-ORG',
   'score': 0.92113096,
   'index': 4,
   'word': '##G',
   'start': 4,
   'end': 5}],
 [],
 [{'entity': 'B-PER',
   'score': 0.66430175,
   'index': 1,
   'word': 'Барт',
   'start': 0,
   'end': 4}],
 [],
 [{'entity': 'B-PER',
   'score': 0.87051094,
   'index': 1,
   'word': 'Джерри',
   'start': 0,
   'end': 6}],
 [],
 [],
 [],
 [],
 []]

# Final conclusions:

The entity classification task turned out to be fairly easy for RuBERT despite the serious class imbalance. The metric on the test set is high (F1 ≈ 0.983), and the manual examples look good, although the model labeled the company "Ромашки" as a person rather than an organization.