# BERT for sequence labelling tasks (PyTorch example)
<sup>This notebook is part of the Natural Language Processing class at the University of Ljubljana, Faculty of Computer and Information Science. Please contact [slavko.zitnik@fri.uni-lj.si](mailto:slavko.zitnik@fri.uni-lj.si) with any comments.</sup>

We will use a [Kaggle dataset](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), which is based on the Groningen Meaning Bank dataset for named entity recognition.

The model example was inspired by, and parts of the code are taken from, [Tobias Sterbak's blog post](https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import os
os.chdir("/content/drive/MyDrive/colab/Final")

In [None]:
!pip install transformers
!pip install seqeval

Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm, trange

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertConfig

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import transformers
from transformers import BertForTokenClassification, AdamW

from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score
import json
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")

Transformers version: 4.19.2
PyTorch version: 1.11.0+cu113


In [None]:
def convert_data(data, test=False):
    dat = {"Sentence #": [], "Word": [], "POS": [], "Tag": []}
    picked = ['GENUS', 'HAS_FORM', 'HAS_LOCATION', 'HAS_CAUSE', 'COMPOSITION_MEDIUM', 'HAS_SIZE', 'HAS_FUNCTION','GENUS_rev', 'HAS_FORM_rev', 'HAS_LOCATION_rev', 'HAS_CAUSE_rev', 'COMPOSITION_MEDIUM_rev', 'HAS_SIZE_rev', 'HAS_FUNCTION_rev']
    # picked = ['GENUS']

    for i, obj in enumerate(data):
        # if obj["relation"] not in picked:
        #     continue

        dat["Sentence #"].append("Sentence: " + str(i+1))
        dat["Word"] = dat["Word"] + obj["token"]
        dat["POS"] = dat["POS"] + obj["stanford_pos"]
        dat["Sentence #"] = dat["Sentence #"] + [""] * (len(obj["token"]) - 1)

        tag = ["O"] * len(obj["token"])

        # Distinct loop variables avoid shadowing the sentence index `i`.
        for rel_idx, relation in enumerate(obj["relation"]):
            if relation in picked:
                start = obj["subj_start"][rel_idx]
                end = obj["subj_end"][rel_idx]
                tag[start] = "B-" + relation
                for k in range(start+1, end + 1):
                    tag[k] = "I-" + relation

        tag[obj["obj_start"]] = "B-DEFINIENDUM"
        for i in range(obj["obj_start"]+1, obj["obj_end"] + 1):
            tag[i] = "I-DEFINIENDUM"

        dat["Tag"] = dat["Tag"] + tag
        if len(obj["token"]) != len(obj["stanford_pos"]):
            print("Warning: token/POS length mismatch:", len(obj["token"]), "vs", len(obj["stanford_pos"]))
    return dat


with open("/content/drive/MyDrive/colab/Final/sl/karst_slo_tagger_train.json", encoding="utf-8") as f:
    data_train = json.load(f)
df_train = pd.DataFrame(convert_data(data_train))
df_train.to_csv("./karst_train_data_loc.csv", na_rep="", index=False)

with open("/content/drive/MyDrive/colab/Final/sl/karst_slo_tagger_test.json", encoding="utf-8") as f:
    data_test = json.load(f)
df_test = pd.DataFrame(convert_data(data_test, test=True))
df_test.to_csv("./karst_test_data_loc.csv", na_rep="", index=False)

In [None]:
def convert_data(data, test=False):
    dat = {"Sentence #": [], "Word": [], "POS": [], "Tag": []}
    picked = ['GENUS', 'HAS_FORM', 'HAS_LOCATION', 'HAS_CAUSE', 'COMPOSITION_MEDIUM', 'HAS_SIZE', 'HAS_FUNCTION','GENUS_rev', 'HAS_FORM_rev', 'HAS_LOCATION_rev', 'HAS_CAUSE_rev', 'COMPOSITION_MEDIUM_rev', 'HAS_SIZE_rev', 'HAS_FUNCTION_rev']
    # picked = ['HAS_FUNCTION']

    for i, obj in enumerate(data):
        # if obj["relation"] not in picked:
        #     continue

        dat["Sentence #"].append("Sentence: " + str(i+1))
        dat["Word"] = dat["Word"] + obj["token"]
        dat["POS"] = dat["POS"] + obj["stanford_pos"]
        dat["Sentence #"] = dat["Sentence #"] + [""] * (len(obj["token"]) - 1)

        tag = ["O"] * len(obj["token"])

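        # Distinct loop variables avoid shadowing the sentence index `i`.
        for rel_idx, relation in enumerate(obj["relation"]):
            if relation in picked:
                # Collapse every picked relation type into a single SUBJECT label.
                relation = "SUBJECT"
                start = obj["subj_start"][rel_idx]
                end = obj["subj_end"][rel_idx]
                tag[start] = "B-" + relation
                for k in range(start+1, end + 1):
                    tag[k] = "I-" + relation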

        tag[obj["obj_start"]] = "B-DEFINIENDUM"
        for i in range(obj["obj_start"]+1, obj["obj_end"] + 1):
            tag[i] = "I-DEFINIENDUM"

        dat["Tag"] = dat["Tag"] + tag
        if len(obj["token"]) != len(obj["stanford_pos"]):
            print("Warning: token/POS length mismatch:", len(obj["token"]), "vs", len(obj["stanford_pos"]))
    return dat


with open("/content/drive/MyDrive/colab/Final/sl/karst_slo_tagger_train.json", encoding="utf-8") as f:
    data_train = json.load(f)
df_train = pd.DataFrame(convert_data(data_train))
df_train.to_csv("./karst_train_data_objsub.csv", na_rep="", index=False)

with open("/content/drive/MyDrive/colab/Final/sl/karst_slo_tagger_test.json", encoding="utf-8") as f:
    data_test = json.load(f)
df_test = pd.DataFrame(convert_data(data_test, test=True))
df_test.to_csv("./karst_test_data_objsub.csv", na_rep="", index=False)
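The conversion cells above turn inclusive token spans into BIO tags. The following is a minimal, self-contained sketch of just that step; the helper name `spans_to_bio` and the toy input are illustrative and not part of the dataset code.

```python
def spans_to_bio(num_tokens, spans):
    """Turn inclusive (start, end, label) token spans into a list of BIO tags.

    Later spans overwrite earlier ones, mirroring how DEFINIENDUM is written
    after the relation spans in the cells above.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for k in range(start + 1, end + 1):
            tags[k] = "I-" + label
    return tags


toy_tokens = ["prenikajoča", "voda", ",", "pronicajoča", "voda"]
toy_tags = spans_to_bio(len(toy_tokens), [(0, 1, "DEFINIENDUM")])
print(list(zip(toy_tokens, toy_tags)))
```

Because spans are applied in order, an overlapping span written later wins, which is why the definiendum is tagged last.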

In [None]:
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"Found GPU device: {torch.cuda.get_device_name(i)}")

Found GPU device: Tesla K80


In [None]:
df_data = pd.read_csv("karst_train_data_objsub.csv", encoding="utf8").ffill()
df_data.shape

(18937, 4)

In [None]:
print(df_data.head())

    Sentence #         Word     POS            Tag
0  Sentence: 1  prenikajoča  Appfsn  B-DEFINIENDUM
1  Sentence: 1         voda   Ncfsn  I-DEFINIENDUM
2  Sentence: 1            ,       Z              O
3  Sentence: 1  pronicajoča  Appfsn              O
4  Sentence: 1         voda   Ncfsn              O


In [None]:
df_data_test = pd.read_csv("karst_test_data_objsub.csv", encoding="utf8").ffill()
df_data_test.shape

(1812, 4)

In [None]:
tag_list = df_data.Tag.unique()
tag_list = np.append(tag_list, "PAD")
print(f"Tags: {', '.join(map(str, tag_list))}")

Tags: B-DEFINIENDUM, I-DEFINIENDUM, O, B-SUBJECT, I-SUBJECT, PAD


In [None]:
x_train = df_data
x_test = df_data_test
x_train.shape, x_test.shape

((18937, 4), (1812, 4))

In [None]:
agg_func = lambda s: [ [w,t] for w,t in zip(s["Word"].values.tolist(),s["Tag"].values.tolist())]

In [None]:
x_train_grouped = x_train.groupby("Sentence #").apply(agg_func)
x_test_grouped = x_test.groupby("Sentence #").apply(agg_func)

In [None]:
x_train_sentences = [[s[0] for s in sent] for sent in x_train_grouped.values]
x_test_sentences = [[s[0] for s in sent] for sent in x_test_grouped.values]

In [None]:
x_train_tags = [[t[1] for t in tag] for tag in x_train_grouped.values]
x_test_tags = [[t[1] for t in tag] for tag in x_test_grouped.values]

In [None]:
x_train_sentences[0]

['prenikajoča',
 'voda',
 ',',
 'pronicajoča',
 'voda',
 ':',
 'Padavinska',
 'voda',
 ',',
 'ki',
 'v',
 'aeracijski',
 'coni',
 'počasi',
 'premika',
 'skozi',
 'vodoprepustne',
 'skalne',
 'gmote',
 '.',
 'Voda',
 ',',
 'ki',
 'kaplja',
 'z',
 'jamskega',
 'stropa',
 ';',
 'od',
 'tod',
 'jamarski',
 'izraz',
 'kapnica',
 '.',
 'Združeno',
 'v',
 'potoček',
 'jo',
 'imenujemo',
 'tudi',
 'vodni',
 'curek',
 ',',
 'v',
 'nasprotju',
 'od',
 'globinskih',
 'žil',
 '.']

In [None]:
x_train_tags[0]

['B-DEFINIENDUM',
 'I-DEFINIENDUM',
 'O',
 'O',
 'O',
 'O',
 'B-SUBJECT',
 'I-SUBJECT',
 'O',
 'O',
 'B-SUBJECT',
 'I-SUBJECT',
 'I-SUBJECT',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [None]:
label2code = {label: i for i, label in enumerate(tag_list)}
code2label = {v: k for k, v in label2code.items()}
label2code

{'B-DEFINIENDUM': 0,
 'B-SUBJECT': 3,
 'I-DEFINIENDUM': 1,
 'I-SUBJECT': 4,
 'O': 2,
 'PAD': 5}

In [None]:
num_labels = len(label2code)
print(f"Number of labels: {num_labels}")

Number of labels: 6


In [None]:
MAX_LENGTH = 128
BATCH_SIZE = 32

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

GPU device: Tesla K80


In [None]:
tokenizer = BertTokenizer.from_pretrained('EMBEDDIA/crosloengual-bert', do_lower_case=False)

Downloading:   0%|          | 0.00/321k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/46.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/433 [00:00<?, ?B/s]

In [None]:
def convert_to_input(sentences,tags):
    input_id_list = []
    attention_mask_list = []
    label_id_list = []
    
    for x,y in tqdm(zip(sentences,tags),total=len(tags)):
        tokens = []
        label_ids = []
        
        for word, label in zip(x, y):
            word_tokens = tokenizer.tokenize(word)
            tokens.extend(word_tokens)
            # Assign the word's label id to every sub-word token produced for that word
            label_ids.extend([label2code[label]] * len(word_tokens))

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        input_id_list.append(input_ids)
        label_id_list.append(label_ids)

    input_id_list = pad_sequences(input_id_list,
                          maxlen=MAX_LENGTH, dtype="long", value=0.0,
                          truncating="post", padding="post")
    label_id_list = pad_sequences(label_id_list,
                     maxlen=MAX_LENGTH, value=label2code["PAD"], padding="post",
                     dtype="long", truncating="post")
    attention_mask_list = [[float(i != 0.0) for i in ii] for ii in input_id_list]

    return input_id_list, attention_mask_list, label_id_list
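`convert_to_input` repeats the word-level label on every sub-word token. The sketch below illustrates that choice next to a common alternative (label only the first piece and mark the rest to be ignored). `toy_wordpiece` is a hypothetical stand-in for `tokenizer.tokenize`, and the `"X"` ignore marker is an assumption made for illustration only.

```python
def toy_wordpiece(word, max_piece=4):
    """Hypothetical stand-in for tokenizer.tokenize: fixed-size chunks, '##' on continuations."""
    pieces = [word[i:i + max_piece] for i in range(0, len(word), max_piece)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, labels, tokenize, first_only=False, ignore_label="X"):
    """Expand word-level labels to sub-token level.

    first_only=False repeats the word label on every piece (as in the cell above);
    first_only=True keeps it on the first piece and marks the rest with ignore_label.
    """
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        if first_only:
            token_labels.extend([label] + [ignore_label] * (len(pieces) - 1))
        else:
            token_labels.extend([label] * len(pieces))
    return tokens, token_labels


toy_words = ["kapnica", "je", "voda"]
toy_labels = ["B-DEFINIENDUM", "O", "B-SUBJECT"]
print(align_labels(toy_words, toy_labels, toy_wordpiece))
print(align_labels(toy_words, toy_labels, toy_wordpiece, first_only=True))
```

Repeating the label is the simpler option, but it means a word split into many pieces contributes more to the loss than a word kept whole.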

In [None]:
input_ids_train, attention_masks_train, label_ids_train = convert_to_input(x_train_sentences, x_train_tags)
input_ids_test, attention_masks_test, label_ids_test = convert_to_input(x_test_sentences, x_test_tags)

100%|██████████| 747/747 [00:02<00:00, 249.96it/s]
100%|██████████| 97/97 [00:00<00:00, 337.57it/s]


In [None]:
np.shape(input_ids_train), np.shape(attention_masks_train), np.shape(label_ids_train)

((747, 128), (747, 128), (747, 128))

In [None]:
# np.shape(input_ids_val), np.shape(attention_masks_val), np.shape(label_ids_val)

In [None]:
np.shape(input_ids_test), np.shape(attention_masks_test), np.shape(label_ids_test)

((97, 128), (97, 128), (97, 128))

In [None]:
train_inputs = torch.tensor(input_ids_train)
train_tags = torch.tensor(label_ids_train)
train_masks = torch.tensor(attention_masks_train)

# val_inputs = torch.tensor(input_ids_val)
# val_tags = torch.tensor(label_ids_val)
# val_masks = torch.tensor(attention_masks_val)

test_inputs = torch.tensor(input_ids_test)
test_tags = torch.tensor(label_ids_test)
test_masks = torch.tensor(attention_masks_test)

In [None]:
train_data = TensorDataset(train_inputs, train_masks, train_tags)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

# valid_data = TensorDataset(val_inputs, val_masks, val_tags)
# valid_sampler = SequentialSampler(valid_data)
# valid_dataloader = DataLoader(valid_data, sampler=valid_sampler, batch_size=BATCH_SIZE)

test_data = TensorDataset(test_inputs, test_masks, test_tags)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)

In [None]:
model = BertForTokenClassification.from_pretrained(
    "EMBEDDIA/crosloengual-bert",
    num_labels=len(label2code),
    output_attentions = False,
    output_hidden_states = False
)


Downloading:   0%|          | 0.00/476M [00:00<?, ?B/s]

Some weights of the model checkpoint at EMBEDDIA/crosloengual-bert were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at EM

In [None]:
if torch.cuda.is_available():
    model.cuda()

In the cell below we pass the parameters to be fine-tuned to the optimizer. If *FULL_FINETUNING* is set to False, only the classification head is fine-tuned; otherwise all model weights are updated.

The layer-normalization scale and shift parameters (called gamma and beta in the original *BERTLayerNorm*) and the biases should not be regularized with weight decay. Applying weight decay to all parameters instead typically gives similar results.
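The grouping described above boils down to matching substrings of parameter names. Here is a minimal sketch in plain Python, assuming Hugging Face-style names in which the layer-norm parameters are called `LayerNorm.weight` and `LayerNorm.bias`; the helper `split_decay_groups` and the toy parameter list are hypothetical, with strings standing in for tensors.

```python
def split_decay_groups(named_params, no_decay=("bias", "LayerNorm.weight"), weight_decay=0.01):
    """Split (name, param) pairs into a decayed and a non-decayed optimizer group."""
    decayed = [p for n, p in named_params if not any(nd in n for nd in no_decay)]
    exempt = [p for n, p in named_params if any(nd in n for nd in no_decay)]
    return [
        {"params": decayed, "weight_decay": weight_decay},
        {"params": exempt, "weight_decay": 0.0},
    ]


# Illustrative names in the style of model.named_parameters(); values are placeholders.
toy_params = [
    ("bert.encoder.layer.0.attention.self.query.weight", "W_q"),
    ("bert.encoder.layer.0.attention.self.query.bias", "b_q"),
    ("bert.encoder.layer.0.output.LayerNorm.weight", "ln_w"),
    ("bert.encoder.layer.0.output.LayerNorm.bias", "ln_b"),
    ("classifier.weight", "W_cls"),
]
toy_groups = split_decay_groups(toy_params)
print(toy_groups)
```

Printing the names that land in each group is a quick way to check that the substrings actually match the parameter names of the checkpoint in use.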

In [None]:
FULL_FINETUNING = True
if FULL_FINETUNING:
    param_optimizer = list(model.named_parameters())
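    # Biases and LayerNorm weights are excluded from weight decay. In Hugging Face BERT
    # checkpoints the LayerNorm parameters are named "LayerNorm.weight" / "LayerNorm.bias".
    no_decay = ['bias', 'LayerNorm.weight']
    # The optimizer reads the per-group 'weight_decay' key.
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}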
    ]
else:
    param_optimizer = list(model.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]

optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=3e-5,
    eps=1e-8
)




In [None]:
model_parameters = filter(lambda p: p.requires_grad, model.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])
print(f"The model has {params} trainable parameters")

model_classifier_parameters = filter(lambda p: p.requires_grad, model.classifier.parameters())
params_classifier = sum([np.prod(p.size()) for p in model_classifier_parameters])
print(f"The classifier-only model has {params_classifier} trainable parameters")

The model has 123548934 trainable parameters
The classifier-only model has 4614 trainable parameters


In [None]:
from transformers import get_linear_schedule_with_warmup

epochs = 20
max_grad_norm = 1.0

# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)
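To make the schedule concrete, the sketch below computes the learning-rate multiplier of a linear warmup followed by linear decay in plain Python. It reflects my understanding of what `get_linear_schedule_with_warmup` does and is not the library code; the function name `linear_schedule_factor` is hypothetical.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


example_lr = 3e-5
example_total_steps = 24 * 20  # 24 batches per epoch and 20 epochs, as in this notebook
for example_step in (0, 120, 240, 480):
    print(example_step, example_lr * linear_schedule_factor(example_step, 0, example_total_steps))
```

With `num_warmup_steps=0` the learning rate starts at its full value and decays linearly to zero at the last step.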


In [None]:
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
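Note that `flat_accuracy` also counts padded positions, which can inflate the score on short sentences. A minimal sketch of an accuracy restricted to non-padding positions is given below; `masked_accuracy` and the toy id matrices are illustrative assumptions, written in plain Python so the logic is easy to follow.

```python
def masked_accuracy(pred_ids, label_ids, pad_id):
    """Token accuracy over positions whose gold label is not the padding id."""
    correct = total = 0
    for pred_row, label_row in zip(pred_ids, label_ids):
        for p, l in zip(pred_row, label_row):
            if l == pad_id:
                continue  # skip padded positions
            total += 1
            correct += int(p == l)
    return correct / total if total else 0.0


EXAMPLE_PAD_ID = 5  # id of "PAD" in label2code in this notebook
toy_preds = [[0, 1, 2, 5, 5], [3, 4, 2, 2, 5]]
toy_golds = [[0, 1, 2, 5, 5], [3, 4, 4, 2, 5]]
print(masked_accuracy(toy_preds, toy_golds, EXAMPLE_PAD_ID))
```

On the toy data, six of the seven non-padding positions match, whereas counting the padded positions as well would give nine out of ten.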


In [None]:
## Store the average loss after each epoch so we can plot them.
loss_values, test_loss_values = [], []

for epoch_id in range(epochs):
    print(f"Epoch {epoch_id+1}")
    # ========================================
    #               Training
    # ========================================
    # Perform one full pass over the training set.

    # Put the model into training mode.
    model.train()
    # Reset the total loss for this epoch.
    total_loss = 0

    # Training loop
    for step, batch in tqdm(enumerate(train_dataloader)):
        # add batch to gpu
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        # Always clear any previously calculated gradients before performing a backward pass.
        model.zero_grad()
        # forward pass
        # This will return the loss (rather than the model output)
        # because we have provided the `labels`.
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask, labels=b_labels)
        # get the loss
        loss = outputs[0]
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # track train loss
        total_loss += loss.item()
        # Clip the norm of the gradient
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=max_grad_norm)
        # update parameters
        optimizer.step()
        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)
    print("Average train loss: {}".format(avg_train_loss))

    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)


Epoch 1


24it [00:14,  1.65it/s]


Average train loss: 0.5654871910810471
Epoch 2


24it [00:14,  1.64it/s]


Average train loss: 0.24584624357521534
Epoch 3


24it [00:14,  1.62it/s]


Average train loss: 0.18612874299287796
Epoch 4


24it [00:15,  1.59it/s]


Average train loss: 0.15213345053295294
Epoch 5


24it [00:15,  1.56it/s]


Average train loss: 0.12872941512614489
Epoch 6


24it [00:16,  1.49it/s]


Average train loss: 0.10060382665445407
Epoch 7


24it [00:15,  1.54it/s]


Average train loss: 0.08227872553591926
Epoch 8


24it [00:15,  1.52it/s]


Average train loss: 0.07074639325340588
Epoch 9


24it [00:16,  1.49it/s]


Average train loss: 0.05808649848525723
Epoch 10


24it [00:16,  1.46it/s]


Average train loss: 0.052795977952579655
Epoch 11


24it [00:16,  1.44it/s]


Average train loss: 0.04649575691049298
Epoch 12


24it [00:16,  1.42it/s]


Average train loss: 0.04333012279433509
Epoch 13


24it [00:16,  1.43it/s]


Average train loss: 0.03929912896516422
Epoch 14


24it [00:16,  1.42it/s]


Average train loss: 0.03670693258754909
Epoch 15


24it [00:16,  1.43it/s]


Average train loss: 0.033254939674710236
Epoch 16


24it [00:17,  1.40it/s]


Average train loss: 0.03020313537369172
Epoch 17


24it [00:17,  1.39it/s]


Average train loss: 0.028654250937203567
Epoch 18


24it [00:17,  1.41it/s]


Average train loss: 0.028086219254570704
Epoch 19


24it [00:17,  1.40it/s]


Average train loss: 0.026446854307626683
Epoch 20


24it [00:17,  1.39it/s]

Average train loss: 0.027228290447965264





In [None]:
# Save model
torch.save(model, 'karst_bert_all_subj_obj.pt')

In [None]:
# Loading a model (see docs for different options)
model = torch.load('karst_bert_all_subj_obj.pt', map_location=torch.device('cpu'))

In [None]:
# TEST
# Run on whichever device the model currently lives on (it was loaded onto the CPU above).
device = next(model.parameters()).device
print(f"Pytorch is using: {device}")

# Switch to evaluation mode so that dropout is disabled during inference.
model.eval()

predictions, true_labels = [], []
for batch in tqdm(test_dataloader):
    # Tensor.to() is not in-place, so the moved tensors must be re-bound.
    b_input_ids, b_input_mask, b_labels = (t.to(device) for t in batch)

    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask, labels=b_labels)

    logits = outputs[1].detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
    true_labels.extend(label_ids)

results_predicted = [[code2label[p_i] for (p_i, l_i) in zip(p, l) if code2label[l_i] != "PAD"] 
                                      for p, l in zip(predictions, true_labels)]
results_true = [[code2label[l_i] for l_i in l if code2label[l_i] != "PAD"] 
                                 for l in true_labels]

Pytorch is using: cuda


100%|██████████| 4/4 [00:37<00:00,  9.44s/it]


In [None]:
print(f"F1 score: {f1_score(results_true, results_predicted)}")
print(f"Accuracy score: {accuracy_score(results_true, results_predicted)}")
print(classification_report(results_true, results_predicted))

F1 score: 0.6057692307692308
Accuracy score: 0.6919482386772107
              precision    recall  f1-score   support

          AD       0.00      0.00      0.00         0
 DEFINIENDUM       0.86      0.90      0.88       244
     SUBJECT       0.42      0.44      0.43       365

   micro avg       0.59      0.62      0.61       609
   macro avg       0.42      0.44      0.43       609
weighted avg       0.59      0.62      0.61       609



  _warn_prf(average, modifier, msg_start, len(result))
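The F1 score above is computed on entity spans rather than on single tokens, which is why it can be lower than the token accuracy. The sketch below is a simplified, self-contained illustration of span-level scoring; it is not seqeval's implementation, and the helpers `bio_spans` and `span_f1` are hypothetical.

```python
def bio_spans(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence (simplified)."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing span
        prefix, _, ent_type = tag.partition("-")
        inside = prefix == "I" and ent_type == current
        if current is not None and not inside:
            spans.add((current, start, i - 1))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, ent_type
    return spans


def span_f1(true_seqs, pred_seqs):
    """Micro-averaged F1 over exact span matches (type and both boundaries)."""
    tp = n_true = n_pred = 0
    for t, p in zip(true_seqs, pred_seqs):
        ts, ps = bio_spans(t), bio_spans(p)
        tp += len(ts & ps)
        n_true += len(ts)
        n_pred += len(ps)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


toy_gold = [["B-DEFINIENDUM", "I-DEFINIENDUM", "O", "B-SUBJECT", "I-SUBJECT"]]
toy_pred = [["B-DEFINIENDUM", "I-DEFINIENDUM", "O", "B-SUBJECT", "O"]]
print(span_f1(toy_gold, toy_pred))
```

On this toy pair four of five tokens are correct, yet only one of the two gold spans is matched exactly. The `AD` row in the report above is likely an artifact of similar parsing: a predicted `PAD` label appears to be read as prefix `P` plus entity type `AD`.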


In [None]:
# MAKE PREDICTION CSV
import re
def make_pred_db(sentences, tags):
    dat = {"Sentence #": [], "Word": [], "Tag_predicted": []}
    for i, sent in enumerate(sentences):
        # tag = tags[i]

        dat["Sentence #"].append("Sentence: " + str(i+1))

        # dat["Word"] = dat["Word"] + sent
        # dat["Tag_real"] = dat["Tag_real"] + tag

        sentence_s = " ".join(sent)
        sentence_s = re.sub(r'\s([?.!"](?:\s|$))', r'\1', sentence_s)

        tokenized_sentence = tokenizer.encode(sentence_s)
        input_ids = torch.tensor([tokenized_sentence])
        output = model(input_ids)
        label_indices = np.argmax(output[0].to('cpu').detach().numpy(), axis=2)

        tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
        new_tokens, new_labels = [], []
        for token, label_idx in zip(tokens, label_indices[0]):
            if token.startswith("##"):
                new_tokens[-1] = new_tokens[-1] + token[2:]
            else:
                new_labels.append(code2label[label_idx])
                new_tokens.append(token)
        
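        # Drop only the [CLS]/[SEP] markers added by tokenizer.encode; popping from a
        # list while iterating over it can skip elements.
        special = {tokenizer.cls_token, tokenizer.sep_token}
        kept = [(t, l) for t, l in zip(new_tokens, new_labels) if t not in special]
        new_tokens = [t for t, _ in kept]
        new_labels = [l for _, l in kept]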

        dat["Sentence #"] = dat["Sentence #"] + [""] * (len(new_tokens) - 1)
        dat["Word"] = dat["Word"] + new_tokens
        dat["Tag_predicted"] = dat["Tag_predicted"] + new_labels

    return dat

df = make_pred_db(x_test_sentences, x_test_tags)
df = pd.DataFrame(df)
df.to_csv("./karst_predicted_subj_obj.csv", na_rep="", index=False)

In [None]:
# MAKE TXT
import re

def make_pred_db(sentences, tags):
    res = ""
    for i, sent in enumerate(sentences):
        sentence_s = " ".join(sent)
        sentence_s = re.sub(r'\s([?.!"](?:\s|$))', r'\1', sentence_s)

        tokenized_sentence = tokenizer.encode(sentence_s)
        input_ids = torch.tensor([tokenized_sentence])
        output = model(input_ids)
        label_indices = np.argmax(output[0].to('cpu').detach().numpy(), axis=2)

        tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])

        new_tokens, new_labels = [], []
        for token, label_idx in zip(tokens, label_indices[0]):
            if token.startswith("##"):
                new_tokens[-1] = new_tokens[-1] + token[2:]
            else:
                new_labels.append(code2label[label_idx])
                new_tokens.append(token)
        
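        # Drop only the [CLS]/[SEP] markers added by tokenizer.encode; popping from a
        # list while iterating over it can skip elements.
        special = {tokenizer.cls_token, tokenizer.sep_token}
        kept = [(t, l) for t, l in zip(new_tokens, new_labels) if t not in special]
        new_tokens = [t for t, _ in kept]
        new_labels = [l for _, l in kept]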

        tab = new_labels.copy()
        tab_lab = new_tokens.copy()
        for j, t in enumerate(tab):
            if t != "O" and t != "PAD":
                tab[j] = t[2:]
            if t == "PAD":
                tab[j] = "O"
        subjStart = True
        objStart = True
        j = 0
        indx_open = []
        indx_close = []
        while j < len(tab):
            if (tab[j] == "SUBJECT" and j == 0) or (tab[j] == "SUBJECT" and tab[j-1] != "SUBJECT"):
                tab.insert(j, "<e1>")
                tab_lab.insert(j, "<e1>")
                indx_open.append(j)
                j += 1
            if j != 0 and tab[j-1] == "SUBJECT" and tab[j] != "SUBJECT":
                tab.insert(j, "</e1>")
                tab_lab.insert(j, "</e1>")
                indx_close.append(j)
                j += 1

            if (tab[j] == "DEFINIENDUM" and j == 0) or (tab[j] == "DEFINIENDUM" and tab[j-1] != "DEFINIENDUM"):
                tab.insert(j, "<e2>")
                tab_lab.insert(j, "<e2>")
                indx_open.append(j)
                j += 1
            if j != 0 and tab[j-1] == "DEFINIENDUM" and tab[j] != "DEFINIENDUM":
                tab.insert(j, "</e2>")
                tab_lab.insert(j, "</e2>")
                indx_close.append(j)
                j += 1

            j += 1

        sentence = ""
        for j, s in enumerate(tab_lab):
            if j in indx_open:
                sentence += s
            elif j in indx_close:
                sentence = sentence[:-1]
                sentence += s
                sentence += " "
            else:
                sentence += f"{s} "
        
        sentence = sentence[:-1]
        sentence = re.sub(r'\s([?.!"](?:\s|$))', r'\1', sentence)

        res +=f"{i+1}\t\"{sentence}\"\n\n"


    return res
txt = make_pred_db(x_test_sentences, x_test_tags)

with open('subj_obj_input.txt', 'w') as f:
    f.write(txt)

<e1>Vsako obdobje otoplitve</e1> in s tem zmanjšanje obsega <e1>ledenikov</e1> imenujemo <e2>medledena doba</e2> ali <e2>interglacial</e2> .
Takšnim meandrom pravimo tudi <e2>ujeti</e2> meandri in jih najdemo tudi v nekaterih drugih slovenskih pokrajinah .
V leksikonu je <e2>spodmol</e2> definiran kot " <e1>kratka votlina z visoko previsno steno na vhodu</e1> " in kot " <e1>zaradi delovanja valov izpodjeden previs na morski</e1> , <e1>jezerski obali</e1> "
V geologiji je <e2>zemeljski plaz</e2> definiran kot <e1>območje preperine , usedline ali kamnine</e1> , <e1>ki se je hitro ali počasi premaknila s prvotnega kraja in ima vidno spremenjeno površje</e1> .
V najstarejši literaturi izraz <e2>spodmol</e2> pomeni <e1>majhno izravnavo na povešenih gorskih hrbtih</e1> in <e1>slemenih</e1> .
<e1>Spodmole</e1> opišejo kot <e1>vodoravne vdolbine</e1> / <e1>zajede v obliki črke C v navpičnem prerezu</e1> , <e1>izdolbene na pobočjih in v skalnih stenah</e1> . skalnih
Uporablja se pojem <e2>prebo

In [None]:
test_sentence = """
An aquifer is defined as a body of rock or unconsolidated sediment that has sufficient permeability to allow water to flow through it.
"""

In [None]:
tokenized_sentence = tokenizer.encode(test_sentence)


In [None]:
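# Place the input on the same device as the model and skip gradient tracking.
input_ids = torch.tensor([tokenized_sentence]).to(next(model.parameters()).device)
with torch.no_grad():
    output = model(input_ids)
label_indices = np.argmax(output[0].to('cpu').detach().numpy(), axis=2)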


In [None]:
# Join WordPiece sub-tokens (prefixed with "##") back into whole words
tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
new_tokens, new_labels = [], []
for token, label_idx in zip(tokens, label_indices[0]):
    if token.startswith("##"):
        new_tokens[-1] = new_tokens[-1] + token[2:]
    else:
        new_labels.append(code2label[label_idx])
        new_tokens.append(token)


In [None]:
new_tokens

['[CLS]',
 'An',
 'aquifer',
 'is',
 'defined',
 'as',
 'a',
 'body',
 'of',
 'rock',
 'or',
 'unconsolidated',
 'sediment',
 'that',
 'has',
 'sufficient',
 'permeability',
 'to',
 'allow',
 'water',
 'to',
 'flow',
 'through',
 'it',
 '.',
 '[SEP]']

In [None]:
for token, label in zip(new_tokens, new_labels):
    print("{}\t{}".format(label, token))


PAD	[CLS]
B-DEFINIENDUM	An
I-DEFINIENDUM	aquifer
O	is
O	defined
O	as
O	a
B-GENUS	body
B-COMPOSITION_MEDIUM	of
I-COMPOSITION_MEDIUM	rock
O	or
I-COMPOSITION_MEDIUM	unconsolidated
I-GENUS	sediment
O	that
O	has
O	sufficient
O	permeability
O	to
I-HAS_FUNCTION	allow
I-HAS_FUNCTION	water
O	to
I-HAS_FUNCTION	flow
I-HAS_FUNCTION	through
I-HAS_CAUSE	it
O	.
O	[SEP]


The algorithm is expected to recognize the following entities:

* (PER) Dr. Marko Robnik-Šikonja
* (ORG) NLP
* (ORG) University of Ljubljana
* (GEO) Slovenia
* (PER) Dr. Žitnik
* (TIM) Tuesday
* (TIM) Wednesday and Thursday
* (ORG) Televizija Slovenija

## Coreference resolution

Coreference resolution can be approached by classifying mention pairs. We have already seen the [Hugging Face example](https://huggingface.co/coref/), which is [open-sourced](https://github.com/huggingface/neuralcoref). Their model architecture is shown below, and training is described in [their Medium post](https://medium.com/huggingface/how-to-train-a-neural-coreference-model-neuralcoref-2-7bb30c1abdfe):

<img src="huggingface-coref.png" width="60%" />

Another simple approach would be a combination of BERT embeddings of mention pairs and additional features combined using dense layers:

<img src="bert-pair-coref.png" width="60%" />

The model is described along with two additional baselines in the report [BERT for Coreference Resolution by Arthi Sureb](bert-pair-coref.pdf) (Stanford's NLP/AI class). 

## Relationship extraction

There exists no general relationship extraction corpus for Slovene. We conducted an analysis with (semi-)automatic corpus creation and model training (see [https://github.com/RSDO-DS3/SloREL](https://github.com/RSDO-DS3/SloREL)); this is the work of Miha Štravs (MSc thesis, 2022, to appear). The model and training are based on [R-BERT](https://github.com/monologg/R-BERT), an unofficial implementation of [Enriching Pre-trained Language Model with Entity Information for Relation Classification](https://arxiv.org/abs/1905.08284).

The architecture of the model looks as follows:

<img src="r-bert.png" width="60%" />
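Relation classifiers of this kind expect the two entities to be delimited by marker tokens in the input text. The sketch below shows one way to insert such markers around inclusive token spans. The helper `mark_entities` is hypothetical, and the `<e1>`/`<e2>` strings are chosen to match the tagged text produced earlier in this notebook; which markers a given implementation expects is an assumption to verify.

```python
def mark_entities(tokens, e1_span, e2_span, e1=("<e1>", "</e1>"), e2=("<e2>", "</e2>")):
    """Wrap two inclusive, non-overlapping token spans in opening/closing markers."""
    opens = {e1_span[0]: e1[0], e2_span[0]: e2[0]}
    closes = {e1_span[1]: e1[1], e2_span[1]: e2[1]}
    out = []
    for i, tok in enumerate(tokens):
        if i in opens:
            out.append(opens[i])
        out.append(tok)
        if i in closes:
            out.append(closes[i])
    return out


toy_sentence = ["An", "aquifer", "is", "a", "body", "of", "rock"]
# e1 = subject span "body of rock", e2 = definiendum span "An aquifer"
marked = mark_entities(toy_sentence, e1_span=(4, 6), e2_span=(0, 1))
print(" ".join(marked))
```

Building the output in a single pass avoids index shifts that occur when markers are inserted into the list one at a time.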

## Aspect-based sentiment analysis

This task is quite novel, so I suggest trying some architectures of your own.

Still, you are free to transform the task into sequence classification. For example, you can represent a person by the sequence of their mentions together with the surrounding word neighbourhoods.
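One possible reading of this idea is sketched below: for each mention of an entity, take a small window of neighbouring words and concatenate the windows into a single input sequence for a sequence classifier. The helper `mention_contexts`, the window size, the separator string, and the toy sentence are all illustrative assumptions.

```python
def mention_contexts(tokens, mention_positions, window=2, sep="[SEP]"):
    """Concatenate +/- window word neighbourhoods around each mention into one string."""
    pieces = []
    for pos in mention_positions:
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        pieces.append(" ".join(tokens[lo:hi]))
    return f" {sep} ".join(pieces)


toy_text = "Ana praised the new mayor . Later she criticised him sharply".split()
# Hypothetical mention positions of the same person: "mayor" (4) and "him" (9)
entity_sequence = mention_contexts(toy_text, [4, 9], window=2)
print(entity_sequence)
```

The resulting string can then be labelled with the sentiment towards that entity and fed to any sequence classification model.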

There exist some BERT-based approaches ([an example](ABSA.pdf)) that deal with aspect-based sentiment analysis. Still, their task is a bit different and is based on the SemEval 2015 and SemEval 2016 tasks. Those tasks investigate different sentiment aspects for a given entity type in a review text.

## References

* [Transformers NER examples](https://github.com/huggingface/transformers/tree/master/examples/ner)
* [NER example in Tensorflow](https://androidkt.com/name-entity-recognition-with-bert-in-tensorflow/)
* [DeepPavlov models](http://docs.deeppavlov.ai/en/master/features/models/ner.html)

## Other interesting examples

* [Data Science workshop by Andrej Miščič and Luka Vranješ](https://github.com/andrejmiscic/NLP-workshop)