# Task 1. Comparison Identification
## Yurchenko Vladislav
## @moomin_dad
<img src="img/dad.png" width="75" >

## Technical Report


Мои эксперименты:

| Checkpoint | Additional info | F1-average on dev | F1 on test |
| :- | :- | -: | -: |
| LSTM | lilaspourpre-baseline | - | 0.535994 |
| dslim/bert-base-NER | - | 0.724906 | - |
| distilbert-base-uncased | - | 0.741844 | 0.826632 |
| distilbert-base-uncased | Augment 1 word |  0.718097 | - |
| distilbert-base-uncased | Augment 3 word |  0.728500 | - |
| cointegrated/rubert-tiny2 | - | 0.612306 | |
| nsi319/distilbert-base-uncased-finetuned-app | - | 0.717692 | - |
| bert-base-uncased | - | 0.744809 | - |
| bert-base-uncased | lr=2e-5, epoch=7 |  0.755394 | 0.840702 |
| bert-base-uncased | dropout=0.5, 0.2 |  0.760007 | 0.84868 |
| **bert-base-uncased** | **dropout=0.5, 0.5, use dev set in train** |  0.760007 | **0.8515** |
| xlm-roberta-base | - | 0.654820 | - |
| liaad-srl-en_xlmr-large_report| - | 0.363216 | - |
| sberbank-ai-ruRoberta-large| - | 0.745017 | - |

В таблице приведены неиболее интересные результаты, промежуточные варианты опущены.

Общие замечания:
 - Хорошо обучается BERT-base-uncased (не мультиязычный),
 - Модель достаточно быстро дообучается под датасет, оптимальными параметрами выглядят:
   - config.hidden_dropout_prob = 0.5
   - attention_probs_dropout_prob = 0.2
   - epoch = 7
   - learning_rate = 2e-5
   - batch_size = 16
 - В принципе можно использовать более аггресивный lr=5e-5, но epochs = 3
 - При увеличении размера батча до 32 качество падает.
 - Аугментация датасета не дает значимого прироста, приводит к быстрому переобучению модели.
 - Усложнение модели не дает роста в качестве модели. Берт из коробки сразу дал хорошее качество, его тюнинг улучшил показатели. Кажется, что более сложные модели находят больше Object и Aspect, но они в разметке помечены как O. 
 - В целом возникло множество вопросов к разметке.

**Датасет**
- Датасет достаточно небольшой и несбалансирован по классам. Хотелось доразметить самому.
- На dev метрики просаживаются. На test хорошо коррелируют с train, если модель не переобучается!
- Распределение токенов - коррелирует с метриками качества на лидерборде по типам объектов - чем меньше объектов в выборке, тем ниже метрика

| Type | Count | Percent |
| :- | -: | -: |
| O | 48512 | 0.7943% |
| B-Object | 6174 | 0.1011% |
| B-Predicate | 3109 | 0.05091% |
| B-Aspect | 2069 | 0.03388% |
| I-Aspect | 591 | 0.009677% |
| I-Predicate | 427 | 0.006992%|
| I-Object | 192 | 0.003144%|


Какие идеи еще можно было проработать более детально:

**Идея 1**. Аугментация:
 - берем слова помеченные тегами в разметке
 - маскируем случайным образом от 1 до 3-х слов токеном [MASK]
 - с помощью модели bert-base-uncased восстанавливаем слово
 - записываем в датасет
Переобучение на такой аугментации возникает из-за того, что я адекватно не делил train / test, чтобы аугментированные версии одного и того же предложения не попадали в разные наборы.

**Идея 2**. Предсказания модели в ряде случаев начинались с некорректного тега - I-Object вместо B-Object. Для таких кейсов подправлена разметка функцией fix_bio_in_file

**Идея 3**. Использовать доп. информацию о pos_tags из nltk
 - B-Object относится к тегу NN
 - B-Predicate - к JJ
 - B-Aspect - к VB

**Идея 4**. Использовать KeyBERT для извлечения из предложения устойчивых словосочетаний (важных для предложения) и считать их кандидатами на разметку. Вместе с идеей 3 это может дать увеличение метрики.


In [None]:
!pip install nlpaug transformers datasets seqeval

In [None]:
!nvidia-smi

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from tqdm import tqdm
from sklearn.metrics import f1_score

In [None]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

In [None]:
def read_dataset(filename, splitter="\t"):
    data = []
    sentence = []
    tags = []
    with open(filename) as f:
        for line in f:
            if not line.isspace():
                word, tag = line.split(splitter)
                sentence.append(word)
                tags.append(tag.strip())
            else:
                data.append((sentence, tags))
                sentence = []
                tags = []
    return data

In [None]:
training_data = read_dataset("task1/train.tsv")
validation_data = read_dataset("task1/dev.tsv")

In [None]:
import pandas as pd
ner_data = pd.DataFrame(training_data, columns=['tokens', 'tags'])
val_data = pd.DataFrame(validation_data, columns=['tokens', 'tags'])
ner_data.shape, val_data.shape

In [None]:
from transformers import pipeline

def create_pipeline(model_name):
    return pipeline('fill-mask', model=model_name, device=0)

predict_mask = create_pipeline("bert-base-uncased")

In [None]:
import numpy as np
import itertools

def augment_sent_3_token(s, tag):
    bio = [i for i, s in enumerate(tag) if not s.startswith('O')]
    sent = s.copy()
    slen = len(sent)

    ch = np.random.choice(bio, 3,  replace=False)
    ch = sorted(ch)
    # print(ch)
    masked_words = []
    for c in ch:
        masked_words.append(sent[c])
        sent[c] = '[MASK]'
    # print(sent)
    # print(masked_words)
    result = predict_mask(" ".join(sent))  

    res_words = []
    for idx, mask in enumerate(result):
        p = []
        for pos in mask:
            if pos['score'] > 0.10:
                # print(pos)
                p.append(pos['token_str'])
        res_words.append(p)
        # print('---')
    # print(res_words)
    
    ret = []
    for pairs in itertools.product(res_words[0], res_words[1], res_words[2]):
        for idx, w in enumerate(pairs):
            sent[ch[idx]] = w
            # print(w)
        ret.append(sent.copy())
        # print(' '.join(s))
    return ret

pos = 22
augs = augment_sent_3_token(ner_data['tokens'][pos], ner_data['tags'][pos])
print(ner_data['tokens'][pos],'\n')
for au in augs:
    print(au)

In [None]:
import nlpaug.augmenter.word as naw

def augment_sent_nlpaug(sent):
    text = " ".join(sent)
    aug = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased', action="substitute")
    augmented_text = aug.augment(text)
    return [sent.split(" ") for sent in augmented_text]
    

pos = 24
augs = augment_sent_nlpaug(ner_data['tokens'][pos]) 
print(ner_data['tokens'][pos])
print(augs)

In [None]:
import numpy as np

def augment_sent(s, tag):
    bio = [i for i, s in enumerate(tag) if not s.startswith('O')]
    augmented = []
    for mask_token in bio:
        sent = s.copy()
        masked_word = sent[mask_token] 
        sent[mask_token] = '[MASK]'
        # print(masked_word)
        result = predict_mask(" ".join(sent))
        for r in result:
            if r['token_str'] != masked_word and r['score'] > 0.15:
                # print(r)
                sent[mask_token] = r['token_str'].replace("#", "")
                augmented.append(sent)
    return augmented

pos = 15
augs = augment_sent(ner_data['tokens'][pos], ner_data['tags'][pos])
print(ner_data['tokens'][pos],'\n')
for au in augs:
    print(au)
# for sent in ner_data['tokens']:
#     print(sent, '\n', augment_sent(sent), '\n\n')

# predict_mask("men should wear shirt and [MASK] for tomorrow's event.")

In [None]:
from tqdm import tqdm

training_data_2 = []
t = tqdm(total = len(ner_data['tokens']))
for idx, (sent, tags) in enumerate(zip(ner_data['tokens'], ner_data['tags'])):
    aug_sents = augment_sent_3_token(sent, tags)
    for i, aug_sent in enumerate(aug_sents):
        training_data_2.append((aug_sent, tags))
    aug_sents = augment_sent(sent, tags)
    for i, aug_sent in enumerate(aug_sents):
        training_data_2.append((aug_sent, tags))
    # if idx > 10:
    #     break
    t.update(1)
t.close()
         
ner_data_aug = pd.DataFrame(training_data_2, columns=['tokens', 'tags'])
ner_data = pd.concat([ner_data, ner_data_aug])
ner_data

In [None]:
from collections import Counter

label_list = []
for item in ner_data['tags']:
    label_list.extend(item)

c = Counter(label_list)
for (name, num) in c.most_common(7):
    print("{}\t{}\t{:.4}%".format(name, num, num/sum(c.values())))

In [None]:
from collections import Counter

label_list = []
for item in val_data['tags']:
    label_list.extend(item)

c = Counter(label_list)
for (name, num) in c.most_common(7):
    print("{}\t{}\t{:.4}%".format(name, num, num/sum(c.values())))

In [None]:
ner_data['str'] = ner_data['tokens'].apply(lambda x: " ".join(x))
val_data['str'] = val_data['tokens'].apply(lambda x: " ".join(x))

In [None]:
ner_data = ner_data.drop_duplicates(subset=['str'])
ner_data.shape

In [None]:
ner_data.reset_index().to_csv('task1/augmented.tsv', sep='\t', index=None)

In [None]:
import pandas as pd
ner_data = pd.read_csv('task1/augmented.tsv', sep='\t', index_col=None)

In [None]:
ner_data.tokens = ner_data.tokens.apply(eval)
ner_data.tags = ner_data.tags.apply(eval)

In [25]:
# случай без аугментации
## Далее перезагрузим данные -- аугментацию не будем использовать
training_data = read_dataset("task1/train.tsv")
validation_data = read_dataset("task1/dev.tsv")

In [26]:
import pandas as pd
ner_data = pd.DataFrame(training_data, columns=['tokens', 'tags'])
val_data = pd.DataFrame(validation_data, columns=['tokens', 'tags'])
ner_data.shape, val_data.shape

((2334, 2), (283, 2))

In [39]:
label_list = []
for item in ner_data['tags']:
    label_list.extend(item)
label_list = list(set(label_list))
if 'O' in label_list:
    label_list.remove('O')
    label_list = ['O'] + label_list
label_list

['O',
 'B-Aspect',
 'B-Object',
 'I-Aspect',
 'I-Predicate',
 'I-Object',
 'B-Predicate']

In [28]:
# val_data содержит много примеров которые не ловит train - можно попробовать сжульничать и подмешать dev
# ner_data = pd.concat([ner_data, val_data])

In [29]:
from sklearn.model_selection import train_test_split
ner_train, ner_test = train_test_split(ner_data, test_size=0.2, random_state=1)
# ner_train = ner_data
# ner_test  = val_data

In [30]:
ner_train.head(3)

Unnamed: 0,tokens,tags
485,"[i, also, preferred, the, psp, controls, to, t...","[O, O, O, O, B-Object, O, O, O, O, O, O, O, O,..."
1524,"[durability, solid, concrete, construction, en...","[O, O, B-Object, O, O, O, B-Predicate, O, B-As..."
366,"[bawalker, -, dude, you, know, ibm, has, maybe...","[O, O, O, O, O, B-Object, O, O, O, O, O, O, O,..."


In [40]:
from datasets import load_dataset, load_metric
from datasets import Dataset, DatasetDict

In [41]:
ner_dataset = DatasetDict({
    'train': Dataset.from_pandas(pd.DataFrame(ner_train)),
    'test': Dataset.from_pandas(pd.DataFrame(ner_test))
})
ner_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags', '__index_level_0__'],
        num_rows: 1867
    })
    test: Dataset({
        features: ['tokens', 'tags', '__index_level_0__'],
        num_rows: 467
    })
})

In [42]:
# from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, 
#                                                         num_labels=len(label_list),
#                                                        ignore_mismatched_sizes=True)
id2label = dict(enumerate(label_list))
label2id = {v: k for k, v in id2label.items()}
print(id2label, label2id)

{0: 'O', 1: 'B-Aspect', 2: 'B-Object', 3: 'I-Aspect', 4: 'I-Predicate', 5: 'I-Object', 6: 'B-Predicate'} {'O': 0, 'B-Aspect': 1, 'B-Object': 2, 'I-Aspect': 3, 'I-Predicate': 4, 'I-Object': 5, 'B-Predicate': 6}


In [43]:
batch_size = 16
from transformers import AutoTokenizer, BertForTokenClassification, DistilBertTokenizer, DistilBertModel, DistilBertForTokenClassification
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer, AutoConfig

# model_checkpoint = 'dslim/bert-base-NER'
# model_checkpoint = 'distilbert-base-uncased'
# model_checkpoint = "cointegrated/rubert-tiny2"
# model_checkpoint = 'nsi319/distilbert-base-uncased-finetuned-app'
model_checkpoint = 'bert-base-uncased'
# model_checkpoint = 'xlm-roberta-base'

result_name = model_checkpoint.replace("/", "-")

if model_checkpoint == "bert-base-uncased":
    # configuration = AutoConfig.from_pretrained(model_checkpoint)
    # print(configuration)
    # configuration.hidden_dropout_prob = 0.5
    # configuration.attention_probs_dropout_prob = 0.5
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    model = BertForTokenClassification.from_pretrained('bert-base-uncased', 
                                                       num_labels=len(id2label),
                                                       id2label=id2label,
                                                       label2id=label2id,
                                                       # config=configuration
                                                      )
    model.config.hidden_dropout_prob = 0.5
    model.config.attention_probs_dropout_prob = 0.5
    
else:    
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=False)
    model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, 
                                                            num_labels=len(label_list),
                                                            ignore_mismatched_sizes=True,
                                                            id2label=id2label,
                                                            label2id=label2id
                                                            )
    model.config.hidden_dropout_prob = 0.5
    model.config.attention_probs_dropout_prob = 0.5


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

In [44]:
tokenized_input = tokenizer(ner_dataset['test']["tokens"][10], is_split_into_words=True,  return_offsets_mapping=True, )
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
if model_checkpoint == 'xlm-roberta-base':
    PRE_WORD = '_'
else:
    PRE_WORD = '##'
# SOS_TOKEN = "[CLS]"
# EOS_TOKEN = "[SEP]"
print(tokenized_input)
print(tokens)

{'input_ids': [101, 3384, 2030, 11297, 2089, 2907, 2039, 2488, 2084, 3536, 2043, 6086, 2000, 4542, 1010, 3612, 1010, 1998, 2152, 8310, 1997, 3103, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 5), (0, 2), (0, 6), (0, 3), (0, 4), (0, 2), (0, 6), (0, 4), (0, 4), (0, 4), (0, 7), (0, 2), (0, 4), (0, 1), (0, 4), (0, 1), (0, 3), (0, 4), (0, 7), (0, 2), (0, 3), (0, 1), (0, 0)]}
['[CLS]', 'metal', 'or', 'cement', 'may', 'hold', 'up', 'better', 'than', 'wood', 'when', 'exposed', 'to', 'rain', ',', 'wind', ',', 'and', 'high', 'amounts', 'of', 'sun', '.', '[SEP]']


In [45]:
def tokenize_and_align_labels(examples, label_all_tokens=False):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples['tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        label_ids = [label_list.index(idx) if isinstance(idx, str) else idx for idx in label_ids]

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [47]:
tokenized_datasets = ner_dataset.map(tokenize_and_align_labels, batched=True)
# tokenized_datasets['train'][0]

Map:   0%|          | 0/1867 [00:00<?, ? examples/s]

Map:   0%|          | 0/467 [00:00<?, ? examples/s]

In [48]:
batch_size = 16

args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=7,
    warmup_steps=500, 
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
    include_inputs_for_metrics=True,
)

In [49]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [50]:
metric = load_metric("seqeval")

  """Entry point for launching an IPython kernel.


In [51]:
example = ner_dataset['train'][4]
labels = example['tags']
metric.compute(predictions=[labels], references=[labels])

{'Aspect': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'Object': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'Predicate': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

In [52]:
import numpy as np

def compute_metrics(p):
    predictions, labels, inputs = p.predictions, p.label_ids, p.inputs
    predictions = np.argmax(p.predictions, axis=2)

    # send only the first token of each word to the evaluation
    true_predictions = []
    true_labels = []
    for prediction, label, tokens in zip(predictions, labels, inputs):
        true_predictions.append([])
        true_labels.append([])
        # print(prediction, label, tokens)
        for (p, l, t) in zip(prediction, label, tokens):
            # print(l, p, tokenizer.convert_ids_to_tokens(int(t)))
            if l != -100 and not tokenizer.convert_ids_to_tokens(int(t)).startswith(PRE_WORD):
                # print('append')
                true_predictions[-1].append(label_list[p])
                true_labels[-1].append(label_list[l])
    # print(true_predictions, true_labels)

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [53]:
# разморозка
for param in model.parameters():
    param.requires_grad = True

In [54]:
batch_size = 16
args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=7,
    warmup_steps=200, 
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
    include_inputs_for_metrics=True,
)

In [55]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [56]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.287683,0.812857,0.753642,0.782131,0.915362
2,No log,0.182841,0.804729,0.871523,0.836795,0.939603
3,No log,0.163398,0.838791,0.882119,0.85991,0.945102
4,No log,0.174594,0.813445,0.90287,0.855828,0.942269
5,0.360100,0.178483,0.849873,0.884768,0.86697,0.947101
6,0.360100,0.188503,0.838923,0.894481,0.865812,0.947184
7,0.360100,0.193259,0.833265,0.895806,0.863404,0.946435


TrainOutput(global_step=819, training_loss=0.2432996845361805, metrics={'train_runtime': 38.1675, 'train_samples_per_second': 342.412, 'train_steps_per_second': 21.458, 'total_flos': 374257155869784.0, 'train_loss': 0.2432996845361805, 'epoch': 7.0})

In [57]:
trainer.evaluate()

{'eval_loss': 0.1932593286037445,
 'eval_precision': 0.833264887063655,
 'eval_recall': 0.8958057395143488,
 'eval_f1': 0.863404255319149,
 'eval_accuracy': 0.946434521826058,
 'eval_runtime': 0.4828,
 'eval_samples_per_second': 967.249,
 'eval_steps_per_second': 62.136,
 'epoch': 7.0}

In [58]:
trainer.save_model(f'task1/models/{result_name}')

In [497]:
model = AutoModelForTokenClassification.from_pretrained(f"task1/models/{result_name}").to('cuda:0')

In [None]:
DEV = True
if DEV:
    test_data = read_dataset("task1/dev_no_answers.tsv", splitter="\n")
else:
    test_data = read_dataset("task1/test_no_answers.tsv", splitter="\n")

In [None]:
label_list = []
for item in test_data['tags']:
    label_list.extend(item)

c = Counter(label_list)
for (name, num) in c.most_common(7):
    print("{}\t{}\t{:.4}%".format(name, num, num/sum(c.values())))

In [103]:
def predict_sentence(sentence):
    inputs = tokenizer(sentence,
                        is_split_into_words=True, 
                        return_offsets_mapping=True, 
                        padding='max_length', 
                        truncation=True, 
                        return_tensors="pt")
    # move to gpu
    ids = inputs["input_ids"].to('cuda:0')
    mask = inputs["attention_mask"].to('cuda:0')
    # forward pass
    outputs = model(ids, attention_mask=mask)
    logits = outputs[0]

    active_logits = logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
    flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size*seq_len,) - predictions at the token level

    tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
    token_predictions = [label_list[i] for i in flattened_predictions.cpu().numpy()]
    wp_preds = list(zip(tokens, token_predictions)) # list of tuples. Each tuple = (wordpiece, prediction)

    prediction = []
    # print(inputs["offset_mapping"].squeeze().tolist())
    for token_pred, mapping, logs in zip(wp_preds, inputs["offset_mapping"].squeeze().tolist(), active_logits):
        # print(torch.nn.functional.sigmoid(logs))
        # only predictions on first word pieces are important
        if mapping[0] == 0 and mapping[1] != 0:
            prediction.append(token_pred[1])
        else:
            continue
    return prediction

# sentence = ner_dataset['train'][15]['tokens'] #"@HuggingFace is a company based in New York, but is also has employees working in Paris"
# print(sentence)
# print(predict_sentence(sentence))

print(test_data[2][0])
print(predict_sentence(test_data[2][0]))

['(', 'of', 'course', ',', 'fox', 'may', 'be', 'even', 'worse', 'than', 'cnn', '.)']
['O', 'O', 'O', 'O', 'B-Object', 'O', 'O', 'O', 'B-Predicate', 'O', 'B-Object', 'O']
['i', 'have', 'tried', 'windows', '8', 'and', 'it', "'", 's', 'lighter', 'than', 'windows', 'xp', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Predicate', 'O', 'B-Object', 'O', 'O']


In [61]:
def predict_sentence_w_probs(sentence):
    inputs = tokenizer(sentence,
                        is_split_into_words=True, 
                        return_offsets_mapping=True, 
                        padding='max_length', 
                        truncation=True, 
                        return_tensors="pt")
    # move to gpu
    ids = inputs["input_ids"].to('cuda:0')
    mask = inputs["attention_mask"].to('cuda:0')
    # forward pass
    outputs = model(ids, attention_mask=mask)
    logits = outputs[0]

    active_logits = logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
    flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size*seq_len,) - predictions at the token level

    tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
    token_predictions = [label_list[i] for i in flattened_predictions.cpu().numpy()]
    wp_preds = list(zip(tokens, token_predictions)) # list of tuples. Each tuple = (wordpiece, prediction)

    prediction = []
    probs = []
    # print(inputs["offset_mapping"].squeeze().tolist())
    for token_pred, mapping, logs in zip(wp_preds, inputs["offset_mapping"].squeeze().tolist(), 
                                         nn.functional.sigmoid(active_logits).detach().cpu().numpy()):
        # print(torch.nn.functional.sigmoid(logs))
      #only predictions on first word pieces are important
        if mapping[0] == 0 and mapping[1] != 0:
            prediction.append(token_pred[1])
            probs.append(logs)
        else:
            continue
    return prediction, probs

# sentence = ner_dataset['train'][15]['tokens'] #"@HuggingFace is a company based in New York, but is also has employees working in Paris"
# print(sentence)
# labels, probs = predict_sentence_w_probs(sentence)
# for w, label, prob in zip(sentence, labels, probs):
#     print(w, label, prob)


In [62]:
with open(f"task1/{result_name}-logs.tsv", "w") as f:
    with torch.no_grad():
        for sentence in tqdm(test_data):
            # print(sentence[0])
            prediction, probs = predict_sentence_w_probs(sentence[0])
            for w,t,p in zip(sentence[0], prediction, probs):
                # print(w, '\t', t)
                f.write(w+'\t'+t+'\t'+'\t'.join(map(str, p)) +'\n')
            f.write('\n')


100%|██████████| 283/283 [00:02<00:00, 134.37it/s]


In [63]:
id2label

{0: 'O',
 1: 'B-Aspect',
 2: 'B-Object',
 3: 'I-Aspect',
 4: 'I-Predicate',
 5: 'I-Object',
 6: 'B-Predicate'}

In [64]:
# идея взять и дообучиться на вероятностях от 3-х моделей. 
# не успел по времени 
logits_files = ['dslim-bert-base-NER-logs.tsv',
                'cointegrated-rubert-tiny2-logs.tsv',
                'bert-base-uncased-logs.tsv']

match = []
for lf in logits_files:
    with open('task1/'+lf, 'r') as f:
        match.append(f.readlines())

        
for e1, e2, e3 in zip(match[0], match[1], match[2]):
    r1 = e1.strip('\n').split("\t")
    r2 = e2.strip('\n').split("\t")
    r3 = e3.strip('\n').split("\t")
    if len(r1) > 2:
        score1 = r1[2:]
        score2 = r2[2:]
        score3 = r3[2:]
        res_score = [np.mean([float(s1), float(s2), float(s3)]) for s1, s2, s3 in zip(score1, score2, score3)]
        # print(r1[:2], id2label[np.argmax(res_score)])
    else:
        pass
    
    

In [109]:
with open(f"task1/{result_name}.tsv", "w") as f:
    with torch.no_grad():
        for sentence in tqdm(test_data):
            prediction = predict_sentence(sentence[0])
            # prediction = predict_w_keyword(sentence[0])            
            for w,t in zip(sentence[0], prediction):
                # print(w, '\t', t)
                f.write(w+'\t'+t+'\n')
            f.write('\n')


100%|██████████| 283/283 [00:51<00:00,  5.54it/s]


In [110]:
def fix_bio_in_file(fn, fnout):
    words = []
    tags = []
    prev_tags = []
    with open(fn, "r") as f:
        for line in f:
            line = line.strip("\r\n").split("\t")
            if len(line) > 1:
                words.append(line[0])
                tags.append(line[1])
            else:
                words.append(line[0])
                tags.append('')

    with open(fnout, "w") as f:
        prev_tags = tags.copy()
        prev_tags.insert(0, 'O')
        # print(prev_tags)
        for w, t, p in zip(words, tags, prev_tags):
            # print(w,t,p)
            if p.startswith('O') and t.startswith('I'):
                print("correct I->B")
                print(w, t, p)
                t = 'B' + t[1:]
                print(w, t, p)
            if t.startswith('B') and t == p:
                print("correct B->I")
                t = 'I' + t[1:]
            if w == '':
                f.write('\n')
            else:
                f.write(w + '\t'+ t + '\n')

fix_bio_in_file(f"task1/{result_name}.tsv", f"task1/{result_name}-fix.tsv")             

correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct I->B
school I-Aspect O
school B-Aspect O
correct B->I
correct I->B
carolina I-Object O
carolina B-Object O
correct B->I
correct B->I
correct I->B
out I-Aspect O
out B-Aspect O
correct B->I
correct B->I
correct I->B
carolina I-Object O
carolina B-Object O
correct B->I
correct B->I
correct I->B
of I-Aspect O
of B-Aspect O
correct I->B
of I-Aspect O
of B-Aspect O
correct B->I
correct I->B
a I-Aspect O
a B-Aspect O
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct I->B
to I-Aspect O
to B-Aspect O
correct B->I
correct B->I
correct I->B
government I-Aspect O
government B-Aspect O
correct I->B
illness I-Aspect O
illness B-Aspect O
correct B->I
correct I->B
of I-Aspect O
of B-Aspect O
correct B->I
correct B->I
correct B->I
correct I->B
calls I-Aspect O
calls B-Aspect O
correct B->I
co

In [111]:
from task1.evaluation.evaluate_f1_partial import main

In [112]:
main('task1/dev.tsv', f'task1/{result_name}-fix.tsv', f'task1/{result_name}_report-no-fix.txt')

In [69]:
from zipfile import ZipFile

In [70]:
with ZipFile(f'task1/{result_name}-fix.zip', 'w') as zipObj2:
   # Add multiple files to the zip
   zipObj2.write(f'task1/{result_name}-fix.tsv', arcname=f'{result_name}-fix.tsv')

In [441]:
# набросок идеи, чтоб просто использовать результаты от 5 моделей, и использовать 
def read_res_file(fn):
    words = []
    tags = []
    with open(fn, "r") as f:
        for line in f:
            line = line.strip("\r\n").split("\t")
            if len(line) > 1:
                words.append(line[0])
                tags.append(line[1])
            else:
                words.append(line[0])
                tags.append('')
    return words, tags
    
    
words, tags1 = read_res_file("task1/bert-base-uncased-fix.tsv")
words, tags2 = read_res_file("task1/nsi319-distilbert-base-uncased-finetuned-app-fix.tsv")
words, tags3 = read_res_file("task1/distilbert-base-uncased-fix.tsv")
words, tags4 = read_res_file("task1/dslim-bert-base-NER-fix.tsv")
words, tags5 = read_res_file("task1/xlm-roberta-base-fix.tsv")

In [442]:
label2id = {v:k for k,v in enumerate(label_list)}
label2id

{'O': 0,
 'I-Aspect': 1,
 'B-Aspect': 2,
 'B-Predicate': 3,
 'B-Object': 4,
 'I-Object': 5,
 'I-Predicate': 6}

In [443]:
with open("task1/majority-class.tsv", "w") as f:
    for w, t1, t2, t3, t4, t5 in zip(words, tags1, tags2, tags3, tags4, tags5):
        if w != '':
            token_win = max(label2id[t1], label2id[t2], label2id[t3],label2id[t4],label2id[t5],)
            print(w, t1, t2, t3, t4, t5, '-->', label_list[token_win])
            f.write(w+"\t"+label_list[token_win]+"\n")
        else:
            f.write("\n")
            print('empty line')

meanwhile O O O O O --> O
, O O O O O --> O
though O O O O O --> O
windows B-Object B-Object B-Object B-Aspect O --> B-Object
8 O O O O B-Object --> B-Object
is O O O O I-Object --> I-Object
significantly O O O O O --> O
at O O O O O --> O
greater B-Predicate B-Predicate B-Predicate B-Predicate O --> B-Predicate
risk B-Aspect B-Aspect B-Aspect B-Aspect B-Predicate --> B-Predicate
( O O O O B-Aspect --> B-Aspect
1 O O O O O --> O
. O O O O O --> O
73 O O O O O --> O
percent O O O O O --> O
) O O O O O --> O
compared O O O O O --> O
to O O O O O --> O
windows B-Object O O O O --> B-Object
8 O O O O O --> O
. O O O O B-Object --> B-Object
1 O O O O O --> O
, O O O O O --> O
according O O O O O --> O
to O O O O O --> O
redmond O O O O O --> O
' O O O O O --> O
s O O O O O --> O
report O O O O O --> O
, O O O O O --> O
it O O O O O --> O
' O O O O O --> O
s O O O O O --> O
still O O O O O --> O
significantly O O O O O --> O
safer B-Predicate B-Predicate B-Predicate B-Predicate O --> B-Predi

In [444]:
fix_bio_in_file(f"task1/majority-class.tsv", f"task1/majority-class-fix.tsv")  

correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I
correct B->I

In [445]:
main('task1/dev.tsv', f'task1/majority-class.tsv', f'task1/majority-class-report.txt')

In [71]:
ner_data['token_str'] = ner_data['tokens'].apply(lambda x: ' '.join(x))
ner_data['tag_str']   = ner_data['tags'].apply(lambda x: ' '.join(x))
val_data['token_str'] = val_data['tokens'].apply(lambda x: ' '.join(x))
val_data['tag_str']   = val_data['tags'].apply(lambda x: ' '.join(x))

In [72]:
val_data['token_str'][0]

"meanwhile , though windows 8 is significantly at greater risk ( 1 . 73 percent ) compared to windows 8 . 1 , according to redmond ' s report , it ' s still significantly safer than windows 7 , windows xp , or windows vista ."

## Keyword Extraction (test)

In [85]:
!pip install keybert keybert[flair] nltk



In [88]:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """
# kw_model = KeyBERT()

import nltk
nltk.download('averaged_perceptron_tagger')

from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)

keywords = kw_model.extract_keywords(val_data['token_str'][0])
keywords

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('73', 0.9964),
 ('report', 0.9964),
 ('percent', 0.9964),
 ('windows', 0.9964),
 ('compared', 0.9964)]

In [None]:
def predict_w_keyword(words):
    query = ' '.join(words)
    keywords = kw_model.extract_keywords(query, keyphrase_ngram_range=(1, 2))
    keywords_dict = {k:rel for k, rel in keywords if rel > 0.2}
    pred = predict_sentence(words)
    # print(query , '\n', keywords_dict, '\n')
    recs = []
    ret = []
    for w, p, pos in zip(words, pred, nltk.pos_tag(words)):
        if w in keywords_dict:
            k = 'K'
        else:
            k = 'O'
        correct = p
        pos_tag = pos[1]
        if k == 'K' and pos_tag.startswith('NN'):
            correct = 'B-Object'
        if k == 'K' and pos_tag.startswith('JJ'):
            correct = 'B-Predicate'
        if k == 'K' and pos_tag.startswith('VB'):
            correct = 'B-Aspect'
        ret.append(correct)
    return ret

pos = 14
words  = val_data['tokens'][pos]
tags  = val_data['tags'][pos]

for (w, tt, t) in zip(words, tags, predict_w_keyword(words)):
    print(w,' true=' ,tt, ' pred=', t)
