## Group 1

## Alireza Mousavizadeh - 97106284

## Fatemeh Tohidian - 97100354

## Amin Kashiri - 97101026

# Initialization

### Install Packages

Install the required packages listed in the `requirement.txt` file.

In [1]:
# !pip install transformers
# !pip install tqdm
# !pip install torch
# !pip install evaluate
# !pip install wandb -qU

### Import Libraries

In [2]:
import torch
import json
import evaluate as e
# from google.colab import drive
from transformers import AutoTokenizer, BertForTokenClassification
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.optim import SGD

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Provide WANDB_API_KEY through the environment before running this notebook;
# credentials should not be hard-coded in the source.



### Check Whether CUDA Is Available

Check whether CUDA is available and build the `device` object accordingly. This object is used for PyTorch tensor computations.

In [3]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

### Hyper-Parameter Setting

This section defines the hyper-parameters used for fine-tuning BERT. Hyper-parameter optimization (HPO) is done in later parts.


In [4]:
EXPERIMENT_NUM = 3
MAX_LEN = 128
LEARNING_RATE = 1e-4
EPOCHS = 5
BATCH_SIZE = 8
BIO = True

### Initialise WANDB

In [5]:
import wandb

wandb.login()

wandb.init(
  project="NER-Detection",
  name=f"experiment_{EXPERIMENT_NUM}",
  config={
      "experiment": EXPERIMENT_NUM,
      "learning_rate": LEARNING_RATE,
      "batch_size": BATCH_SIZE,
      "BIO": BIO,
      "epochs": EPOCHS,
      "max_len": MAX_LEN,
  })

config = wandb.config

[34m[1mwandb[0m: Currently logged in as: [33msamousavizade[0m. Use [1m`wandb login --relogin`[0m to force relogin


# Load Data

Load the train, validation, and test splits from the JSON file.

In [6]:
# path = "/content/drive/MyDrive/Colab Notebooks/dataset_annotated_splited.json"
# path = "/content/drive/MyDrive/NLP/HW5/dataset_annotated_splited.json"
path = "dataset_annotated_splited.json"
with open(path, "r") as f:
    data = json.load(f)
    train_data = data["train"][:]
    test_data = data["test"][:]
    val_data = data["eval"][:]

print(train_data[0].keys())
print(train_data[0]['annotations'])

dict_keys(['header', 'text', 'annotations'])
[{'header': [{'name': 'DAT', 'range': [27, 37]}], 'text': [{'name': 'ORG', 'range': [9, 24]}, {'name': 'ORG', 'range': [32, 49]}, {'name': 'PER', 'range': [51, 62]}, {'name': 'ORG', 'range': [68, 79]}, {'name': 'ORG', 'range': [153, 164]}, {'name': 'PER', 'range': [210, 222]}, {'name': 'DAT', 'range': [269, 280]}, {'name': 'ORG', 'range': [349, 360]}, {'name': 'PER', 'range': [369, 375]}, {'name': 'PER', 'range': [414, 426]}, {'name': 'TIM', 'range': [478, 485]}, {'name': 'DAT', 'range': [465, 470]}, {'name': 'ORG', 'range': [510, 521]}]}, {'header': [], 'text': [{'name': 'PER', 'range': [51, 62]}, {'name': 'PER', 'range': [210, 222]}, {'name': 'PER', 'range': [369, 375]}, {'name': 'PER', 'range': [414, 426]}]}]


### Label to ID Mapping

This section builds the label-to-id and id-to-label mappings that are used later when fine-tuning BERT.

In [7]:
if config.BIO:
    label_list = [
        "O", 
        "B_ORG", "B_PER", "B_DAT", "B_TIM", "B_LOC", "B_EVE", "B_mainLOC", "B_NAT",
        "I_ORG", "I_PER", "I_DAT", "I_TIM", "I_LOC", "I_EVE", "I_mainLOC", "I_NAT"
    ]
else:
    label_list = ["O", "ORG", "PER", "DAT", "TIM", "LOC", "EVE", "mainLOC", "NAT"]
labels_to_ids = {k: v for v, k in enumerate(label_list)}
ids_to_labels = {v: k for v, k in enumerate(label_list)}

LABELS = len(label_list)

print(labels_to_ids)
print(ids_to_labels)

{'O': 0, 'B_ORG': 1, 'B_PER': 2, 'B_DAT': 3, 'B_TIM': 4, 'B_LOC': 5, 'B_EVE': 6, 'B_mainLOC': 7, 'B_NAT': 8, 'I_ORG': 9, 'I_PER': 10, 'I_DAT': 11, 'I_TIM': 12, 'I_LOC': 13, 'I_EVE': 14, 'I_mainLOC': 15, 'I_NAT': 16}
{0: 'O', 1: 'B_ORG', 2: 'B_PER', 3: 'B_DAT', 4: 'B_TIM', 5: 'B_LOC', 6: 'B_EVE', 7: 'B_mainLOC', 8: 'B_NAT', 9: 'I_ORG', 10: 'I_PER', 11: 'I_DAT', 12: 'I_TIM', 13: 'I_LOC', 14: 'I_EVE', 15: 'I_mainLOC', 16: 'I_NAT'}
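
To make the BIO scheme concrete, the following minimal sketch shows how entity spans become runs of token labels. The helper `bio_tags` and the example spans are hypothetical and not part of the notebook; only the `B_`/`I_` prefix convention is taken from the label list above.

```python
def bio_tags(num_tokens, spans):
    """Return BIO labels for a sequence of `num_tokens` tokens.

    `spans` is a list of (start, end, name) token ranges, end exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end, name in spans:
        tags[start] = f"B_{name}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I_{name}"      # remaining tokens of the entity
    return tags


# Hypothetical 6-token sentence with a 2-token PER and a 1-token LOC entity.
tags = bio_tags(6, [(0, 2, "PER"), (4, 5, "LOC")])
print(tags)
```

With `BIO = False`, the same spans would simply carry the bare class name on every token, so the boundary between two adjacent entities of the same class could not be recovered.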


### Initialise Bert Tokenizer

This section uses the **ParsBERT (v2.0)** tokenizer. ParsBERT (v2.0) is a Transformer-based model for Persian language understanding. It rebuilt the vocabulary and fine-tuned ParsBERT v1.1 on new Persian corpora so that the model can be used in other scopes. See the ParsBERT repo for the latest information about previous and current models. The checkpoint loaded here, `HooshvareLab/bert-fa-base-uncased-clf-digimag`, is the variant fine-tuned for Persian text classification on the DigiMag dataset: a total of 8,515 articles scraped from the **Digikala** online magazine, labeled with seven classes.

In [8]:
tokenizer = AutoTokenizer.from_pretrained(
    "HooshvareLab/bert-fa-base-uncased-clf-digimag"
)

### Handle Overlaps between Named Entity Tags

This section defines helper functions that resolve overlapping NER tags. When two annotations overlap, the longer one is kept and the shorter one is removed, so nested inner tags are dropped in favor of the outermost tag.

In [9]:
def has_intersection(first, second):
    # Each argument is a [start, end] character range.
    if first[0] < second[0]:
        return first[1] > second[0]
    return first[0] < second[1]

def remove_annotation_overlap(annotations):
    annotations = sorted(annotations, key=lambda x: x["range"][0])
    n = len(annotations)
    if n == 0:
        return []
    i = 0
    j = 1
    while i < n and j < n:
        first = annotations[i]
        first_range = first["range"]
        second = annotations[j]
        second_range = second["range"]
        if has_intersection(first_range, second_range):
            new = first if (first_range[1]-first_range[0]) > (second_range[1]-second_range[0]) else second
            annotations[i]= new
            annotations[j] = None
        else:
            i = j
        j += 1

    annotations = [a for a in annotations if a is not None]
    return annotations
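
The policy described above can be sketched on a small example. This is a compact, self-contained variant written for illustration, not the notebook's implementation: the names `overlaps` and `keep_longest` and the three annotations are hypothetical, and ranges are assumed to be half-open character intervals. Ties go to the later annotation, as in the notebook's function.

```python
def overlaps(a, b):
    """True if the half-open ranges a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def keep_longest(annotations):
    """Drop the shorter of any two overlapping annotations (sketch)."""
    kept = []
    for ann in sorted(annotations, key=lambda x: x["range"][0]):
        if kept and overlaps(kept[-1]["range"], ann["range"]):
            prev = kept[-1]
            prev_len = prev["range"][1] - prev["range"][0]
            ann_len = ann["range"][1] - ann["range"][0]
            if ann_len >= prev_len:
                kept[-1] = ann  # the new annotation is at least as long
        else:
            kept.append(ann)
    return kept


# Hypothetical annotations: PER is nested inside ORG, DAT stands alone.
annotations = [
    {"name": "PER", "range": [12, 18]},
    {"name": "ORG", "range": [9, 24]},
    {"name": "DAT", "range": [30, 40]},
]
result = keep_longest(annotations)
print([a["name"] for a in result])
```

Here the nested `PER` tag is discarded because the enclosing `ORG` tag is longer, while `DAT` is untouched.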


### Character Level to Token Level Indexing

This section defines functions that convert character-level annotation ranges into token-level indices. The conversion has to account for the special tokens `[CLS]` and `[SEP]` that the tokenizer adds around the text.

In [10]:
def get_starting_token_index(tag_start, word_index, token_offsets):
    while word_index <= config.max_len - 1 and token_offsets[word_index][0] < tag_start:
        word_index += 1
    return word_index

def get_ending_token_index(tag_stop, word_index, token_offsets):
    while (
        word_index <= config.max_len - 1
        and token_offsets[word_index][1] < tag_stop
        and token_offsets[word_index][1] != 0
    ):
        word_index += 1
    return word_index

def get_final_label(encoding, annotation):
    token_offsets = encoding["offset_mapping"]
    input_ids = encoding["input_ids"]
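    # Number of real (non-padding) tokens, [CLS] and [SEP] included. Unlike
    # argmin over input_ids, this also works when the text is truncated and
    # therefore contains no padding token.
    end_element = int(encoding["attention_mask"].sum())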
    final_labels = [-100] * config.max_len
    final_labels[1:end_element] = [0] * (end_element - 1)

    annotations_without_overlap = remove_annotation_overlap(annotation)

    word_index = 1
    for label in annotations_without_overlap:
        interval = label["range"]
        label_name = label["name"]
        word_index = get_starting_token_index(interval[0], word_index, token_offsets)
        start_token_index = word_index
        if start_token_index == config.max_len:
          break
        word_index = get_ending_token_index(interval[1], word_index, token_offsets)
        end_token_index = word_index
        if config.BIO:
            final_labels[start_token_index:end_token_index+1] = [labels_to_ids["I_"+label_name]] * (end_token_index-start_token_index+1)
            final_labels[start_token_index] = labels_to_ids["B_"+label_name]
        else:
            final_labels[start_token_index:end_token_index+1] = [labels_to_ids[label_name]] * (end_token_index-start_token_index+1)

        word_index += 1
    return final_labels
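
The idea behind the character-to-token conversion can be illustrated without loading a tokenizer. The sketch below uses hand-written offsets and a hypothetical helper `char_span_to_tokens`; it selects tokens by overlap rather than by the pointer walk used above, and it assumes that special tokens carry the offset `(0, 0)`, as the `!= 0` check in `get_ending_token_index` suggests.

```python
def char_span_to_tokens(offsets, start, end):
    """Indices of tokens whose character span overlaps [start, end).

    Tokens with offset (0, 0) are treated as special tokens and skipped.
    """
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        if (tok_start, tok_end) != (0, 0) and tok_start < end and start < tok_end
    ]


# Hand-written offsets for "[CLS] Ali went to Tehran [SEP]" (hypothetical).
offsets = [(0, 0), (0, 3), (4, 8), (9, 11), (12, 18), (0, 0)]
tehran = char_span_to_tokens(offsets, 12, 18)
ali_went = char_span_to_tokens(offsets, 0, 8)
print(tehran, ali_went)
```

Because token index 0 is `[CLS]`, every token index is shifted by one relative to a plain word count, which is why the notebook starts its search at `word_index = 1`.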



### Define DataSequence and DataLoader

This section defines the `DataSequence` dataset and the data-loader wrapper used for fine-tuning BERT.

In [11]:
class DataSequence(torch.utils.data.Dataset):
    def __init__(self, news_list):
        labels = []
        texts = []
        for news in news_list:
            header = news["header"]
            text = news["text"]
            header_annotaition = []
            text_annotation = []
            for i in range(len(news["annotations"])): 
                header_annotaition.extend(news["annotations"][i]["header"])
                text_annotation.extend(news["annotations"][i]["text"])

            for t, annotation in [(header,header_annotaition), (text, text_annotation)]:
                encoding = tokenizer(
                    t,
                    return_offsets_mapping=True,
                    padding='max_length',
                    max_length=config.max_len,  # including [CLS] and [SEP]
                    truncation=True,
                    return_tensors="pt",
                )
                for key in ['input_ids', 'attention_mask', 'token_type_ids', 'offset_mapping']:
                    encoding[key] = encoding[key][0]
                label = get_final_label(encoding, annotation)
                texts.append(encoding)
                labels.append(label)


        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def get_batch_data(self, idx):
        return self.texts[idx]

    def get_batch_labels(self, idx):
        return torch.LongTensor(self.labels[idx])

    def __getitem__(self, idx):
        batch_data = self.get_batch_data(idx)
        batch_labels = self.get_batch_labels(idx)
        return batch_data, batch_labels

def to_device(data, device):
    if isinstance(data, (list,tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device)

class DeviceDataLoader():
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for b in self.dl: 
            yield to_device(b, self.device)

    def __len__(self):
        return len(self.dl)

### Initialise DataSequence and DataLoader Object

In [12]:
def get_dataloader(data, batch=None, cuda=True):
    dataset = DataSequence(data)
    print('len ds: ', len(dataset))
    if batch is None:
        batch = len(dataset)

    dataloader = DataLoader(
        dataset, num_workers=1, batch_size=batch, shuffle=True
    )
    if cuda:
        dataloader = DeviceDataLoader(dataloader, device)
    return dataloader

train_dataloader = get_dataloader(train_data, config.batch_size)
val_dataloader = get_dataloader(val_data, config.batch_size)
test_dataloader = get_dataloader(test_data, config.batch_size)

len ds:  2700
len ds:  150
len ds:  150


In [13]:
print(len(train_dataloader))
print(len(val_dataloader))
print(len(test_dataloader))

338
19
19
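
The batch counts above follow from the dataset sizes and the batch size of 8. A `DataLoader` keeps the last partial batch unless `drop_last=True` is passed, so the count is the ceiling of the division. The helper name `num_batches` is hypothetical.

```python
import math


def num_batches(num_examples, batch_size):
    """Batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


train_batches = num_batches(2700, 8)
eval_batches = num_batches(150, 8)
print(train_batches, eval_batches)
```

Note that `len ds` counts sequences, not news items: `DataSequence` encodes the header and the body of each item separately, so every item contributes two sequences.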


# Define Bert NER Model

In [14]:
from transformers import BertForTokenClassification

class BertNER(torch.nn.Module):
    def __init__(self):
        super(BertNER, self).__init__()
        self.bert = BertForTokenClassification.from_pretrained(
            "HooshvareLab/bert-fa-base-uncased-clf-digimag",
            num_labels=LABELS,
            ignore_mismatched_sizes=True,
        )

    def forward(self, input_batch, labels):
        input_ids = input_batch["input_ids"]
        mask = input_batch["attention_mask"]
        output = self.bert(
            input_ids=input_ids, attention_mask=mask, labels=labels, return_dict=False
        )
        return output

In [15]:
model = BertNER()
if use_cuda:
    model = model.cuda()

Some weights of BertForTokenClassification were not initialized from the model checkpoint at HooshvareLab/bert-fa-base-uncased-clf-digimag and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([7, 768]) in the checkpoint and torch.Size([17, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([7]) in the checkpoint and torch.Size([17]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Training

In [16]:
from torch.optim import SGD, Adam

def train_loop(model, train_dataloader, val_dataloader):
    optimizer = SGD(model.parameters(), lr=config.learning_rate)
    for epoch in range(config.epochs):
        total_loss_train = 0
        true_labels_list_train = []
        predictions_list_train = []

        model.train()
        for input_batch, batch_labels in tqdm(train_dataloader):
            optimizer.zero_grad()
            loss, logits = model(input_batch, batch_labels)
            logits_clean = logits[batch_labels != -100]
            true_labels = batch_labels[batch_labels != -100]

            predictions = logits_clean.argmax(dim=1)

            total_loss_train += loss.item()

            loss.backward()
            optimizer.step()

            predictions_list_train.append(predictions.tolist())
            true_labels_list_train.append(true_labels.tolist())

        prefix = "train"
        train_metrics = evaluate_metrics(predictions_list_train, true_labels_list_train, prefix=prefix)

        train_metrics[f"{prefix}/loss"] = total_loss_train / len(predictions_list_train)

        wandb.log(train_metrics, step=epoch + 1)

        total_loss_validation = 0
        true_labels_list_validation = []
        predictions_list_validation = []

        model.eval()
        # Gradients are not needed for validation; disabling them saves memory and time.
        with torch.no_grad():
            for input_batch, batch_labels in val_dataloader:
                loss, logits = model(input_batch, batch_labels)
                logits_clean = logits[batch_labels != -100]
                true_labels = batch_labels[batch_labels != -100]

                predictions = logits_clean.argmax(dim=1)

                total_loss_validation += loss.item()

                predictions_list_validation.append(predictions.tolist())
                true_labels_list_validation.append(true_labels.tolist())

        prefix = "validation"
        validation_metrics = evaluate_metrics(predictions_list_validation, true_labels_list_validation, prefix=prefix)

        validation_metrics[f"{prefix}/loss"] = total_loss_validation / len(predictions_list_validation)

        wandb.log(validation_metrics, step=epoch + 1)
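
The loop filters out positions labeled `-100` before collecting predictions. That value is the default `ignore_index` of PyTorch's cross-entropy loss, so special and padding tokens contribute neither to the loss nor to the metrics. The plain-Python sketch below mirrors the boolean-mask indexing on a single made-up sequence; `drop_ignored` is a hypothetical helper, not a function from the notebook.

```python
IGNORE_INDEX = -100  # label value that the loss and the metrics skip


def drop_ignored(predictions, labels, ignore_index=IGNORE_INDEX):
    """Keep only (prediction, label) pairs whose label is not ignored."""
    pairs = [(p, l) for p, l in zip(predictions, labels) if l != ignore_index]
    kept_predictions = [p for p, _ in pairs]
    kept_labels = [l for _, l in pairs]
    return kept_predictions, kept_labels


# Hypothetical sequence: [CLS], three real tokens, then two padding positions.
labels = [-100, 0, 2, 10, -100, -100]
predictions = [5, 0, 2, 0, 7, 7]
kept_predictions, kept_labels = drop_ignored(predictions, labels)
print(kept_predictions, kept_labels)
```

Whatever the model predicts at ignored positions (5 and 7 here) never reaches the metric computation.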


# Evaluation

Evaluation uses the `evaluate` library, which makes evaluating and comparing models and reporting their performance easier and more standardized.



In [17]:
import evaluate
from pprint import pprint

f1_metric = evaluate.load("f1")
accuracy_metric = evaluate.load("accuracy")

In [18]:
def evaluate_metrics(predictions, label_clean, prefix):
    metrics = {}

    n = len(predictions)

    for i in range(n):
        f1_metric.add_batch(
            references=label_clean[i],
            predictions=predictions[i]
        )
    metrics[f'{prefix}/f1_macro'] = f1_metric.compute(average="macro")["f1"]

    for i in range(n):
        f1_metric.add_batch(
            references=label_clean[i],
            predictions=predictions[i]
        )
    metrics[f'{prefix}/f1_micro'] = f1_metric.compute(average="micro")["f1"]

    for i in range(n):
        f1_metric.add_batch(
            references=label_clean[i],
            predictions=predictions[i]
        )
    metrics[f'{prefix}/f1_weighted'] = f1_metric.compute(average="weighted")["f1"]

    for i in range(n):
        accuracy_metric.add_batch(
            references=label_clean[i],
            predictions=predictions[i]
        )

    metrics[f'{prefix}/accuracy'] = accuracy_metric.compute()["accuracy"]

    return metrics
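
The three F1 averages can differ widely on imbalanced label sets. The toy sketch below (made-up labels, not the notebook's data, and a hand-rolled `f1_scores` helper rather than the `evaluate` API) computes micro and macro F1 for a model that predicts `O` everywhere. For single-label classification, micro F1 equals accuracy, while macro F1 averages the per-class scores and is pulled down by every class the model never gets right.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    # With exactly one label per token, micro F1 reduces to accuracy.
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Toy data: 8 "O" tokens (id 0), 2 entity tokens, and an all-"O" prediction.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10
micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3))
```

A large gap between micro and macro F1 is therefore a hint, though not proof, that a model is mostly predicting the majority class.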


In [19]:
print("training ...")

train_loop(model, train_dataloader, val_dataloader)

training ...


100%|██████████| 338/338 [01:38<00:00,  3.42it/s]
100%|██████████| 338/338 [01:43<00:00,  3.27it/s]
100%|██████████| 338/338 [01:39<00:00,  3.38it/s]
100%|██████████| 338/338 [01:40<00:00,  3.37it/s]
100%|██████████| 338/338 [01:42<00:00,  3.29it/s]


# Test

In [20]:
total_loss_test = 0
true_labels_list_test = []
predictions_list_test = []

model.eval()
# Gradients are not needed at test time; disabling them saves memory and time.
with torch.no_grad():
    for input_batch, batch_labels in test_dataloader:
        loss, logits = model(input_batch, batch_labels)
        logits_clean = logits[batch_labels != -100]
        true_labels = batch_labels[batch_labels != -100]

        predictions = logits_clean.argmax(dim=1)

        total_loss_test += loss.item()

        predictions_list_test.append(predictions.tolist())
        true_labels_list_test.append(true_labels.tolist())

prefix = "test"
test_metrics = evaluate_metrics(predictions_list_test, true_labels_list_test, prefix=prefix)

test_metrics[f"{prefix}/loss"] = total_loss_test / len(predictions_list_test)

for metric, value in test_metrics.items():
    wandb.run.summary[metric] = value

## Finish WANDB Session

Results are stored in W&B cloud storage and can be retrieved at any time from [this](https://wandb.ai/samousavizade/NER-Detection?workspace=user-samousavizade) link.

In [21]:
wandb.finish()


metric,per-epoch trend
train/accuracy,▁████
train/f1_macro,█▁▁▂▇
train/f1_micro,▁████
train/f1_weighted,▁▇▇▇█
train/loss,█▃▃▂▁
validation/accuracy,▁▁▁▂█
validation/f1_macro,▁▁▁▂█
validation/f1_micro,▁▁▁▂█
validation/f1_weighted,▁▁▁▂█
validation/loss,█▇▅▄▁

metric,final value
test/accuracy,0.64597
test/f1_macro,0.05614
test/f1_micro,0.64597
test/f1_weighted,0.51532
test/loss,1.35312
train/accuracy,0.64782
train/f1_macro,0.05164
train/f1_micro,0.64782
train/f1_weighted,0.51437
train/loss,1.42178


In [22]:
# from transformers import pipeline
# nlp = pipeline("ner", model=model.bert.to('cpu'), tokenizer=tokenizer)
# # example = "حسین تقوی به سازمان جهاد کشاورزی رفت."
# # example = 'امین به ایران آمد.'
# example = "به گزارش خبرنگار مهر، نماینده ولی فقیه در آذربایجان شرقی پیش از ظهر امروز در مراسم بزرگداشت یوم الله ۱۲ بهمن که در تالار اجتماعات مصلی اعظم امام خمینی ره برگزار شد گفت: مشکلات اقتصادی و تحریم ها در کشور وجود دارد اما باید قدردان انقلاب اسلامی ایران بود و به همین دلیل، باید توگه بیشتری به موفقیت های به دست آمده در طول دوران انقلاب اسلامی داشت. حجت الاسلام و المسلمین سید محمد علی آل هاشم ادامه داد: سرعت پیشرفت علم در ایران بعد از انقلاب اسلامی و هم اکنون، ۱۱ برابر دنیاست؛ امروزه جمهوری اسلامی ایران، هشتمین کشور تولید کننده اورانیوم ۲۰ درصد جهان است"
# # example = "به گزارش برنا؛ تقریبا از اسفند ماه سال گذشته واکسیناسیون عمومی در کشور با واردات واکسن های خارجی کرونا انجام شد و این روند به صورتی بود که محموله های جدید واکسن پس از خریداری شدن به کشور وارد می شد و تزریق ها برای گروه های اولویت دار انجام می گرفت البته در این میان جهش های جدیدی از ویروس در کشور زیاد شد و در مقابل واردات واکسن های خارجی با مشکلاتی مواجه بود و مسیر این اقدام با پستی ها و بلندی های زیادی رو به رو شد اما در حال حاضر با وجود همه اتفاقات بنا به گفته مسئولان ستاد مقابله با کرونا دو هفته ای از برنامه واکسیناسیون عقب هستیم و دلیل اصلی این اتفاق محدودیت وجود واکسن است. مسعود یونسیان، استاد اپیدمیولوژی دانشگاه علوم پزشکی تهران در گفت وگو با خبرنگار برنا درباره خرید واکسن های خارجی توسط شرکت های خصوصی گفت: دولت از خرید واکسن شرکت های خصوصی استقبال می کند و اصولا بسیاری از کشور های دیگر نیز واردات واکسن های خارجی را به شرکت های خصوصی سپرده اند اما در نهایت تحویل وزارت بهداشت می شود"
# ner_results = nlp(example)
# print(ner_results)
# pprint(list(zip(
#     list(map(lambda x: x['word'], ner_results)),
#     list(map(lambda x: ids_to_labels[int(x['entity'].split('_')[1])], ner_results))
# )))