<img src="../docs/sa_logo.png" width="250" align="left">

# Named Entity Recognition with HuggingFace BERT and SuperAnnotate

## Introduction

This tutorial shows how to solve a ```Named Entity Recognition``` (NER) task with [SuperAnnotate](https://www.superannotate.com/) and [HuggingFace](https://huggingface.co/).

The goal is to show how to annotate part of a dataset with ```SuperAnnotate``` tools, train a model with ```HuggingFace``` to annotate the rest of the data automatically, and upload the new annotations to the [SuperAnnotate platform](https://app.superannotate.com/). The automatically generated annotations can then be reviewed and corrected manually.

All experiments in this tutorial use the [Legal NER](https://paperswithcode.com/dataset/legal-ner) dataset, a corpus of 46545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.

![](../docs/legal-ner/folders_legal_ner.png)

The tutorial assumes that we have a partially annotated dataset of texts.
The data is stored in an S3 bucket and split into two parts: 
* **train** (~40%) $-$ annotated data for training
* **unlabeled** (~60%) $-$ data that will be annotated by the model

These folders are connected to an existing project on the [SuperAnnotate platform](https://app.superannotate.com/), and the train dataset has already been annotated manually. 

![](../docs/legal-ner/ner_text_example.png)

In the examples below we use ```SuperAnnotate SDK```, ```Boto3 SDK``` and ```HuggingFace```.
Some parts of the code used here are provided as examples in the [SuperAnnotate](https://doc.superannotate.com/docs/getting-started), [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) and [HuggingFace](https://huggingface.co/) documentation.

In this tutorial we will go through the following steps:

$\textbf{1.}$ [Environmental setup](#environmental_setup)

$\textbf{1.1}$ [User Variables Setup](#user_variables)

$\textbf{1.2}$ [Constants Setup](#constants_setup)

$\textbf{2.}$ [Download documents and labels from SuperAnnotate](#download_data)

$\textbf{2.1}$ [Get links to all files in S3 bucket](#list_all_files_s3)

$\textbf{2.2}$ [Download files](#download_files)

$\textbf{2.3}$ [Download labels from SuperAnnotate](#download_labels_from_sa)
   
$\textbf{3.}$ [Prepare data for Bert NER model](#prepare_data_for_bert_model)

$\textbf{4.}$ [Train model](#train_model)

$\textbf{5.}$ [Evaluate model](#evaluate_model)

$\textbf{6.}$ [Get predictions for unlabeled texts](#get_predictions_for_unlabeled_texts)

$\textbf{7.}$ [Make annotations in SuperAnnotate format](#make_annotations_sa_format)

$\textbf{8.}$ [Upload new annotations to SuperAnnotate platform](#upload_new_annotations_to_sa_platform)


## 1. Environmental setup
<a id='environmental_setup'></a>

In [None]:
! pip install superannotate==4.4.7 #SA SDK installation
! pip install boto3 # install boto3 client
! pip install transformers # HuggingFace transformers
! pip install seqeval # model evaluation

In [None]:
import boto3
import glob
import json
import os
import pandas as pd
import torch

from collections import Counter, defaultdict
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2
from sklearn.model_selection import train_test_split
from superannotate import SAClient
from transformers import BertTokenizerFast
from tqdm.notebook import tqdm
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.optim import SGD, Adam, NAdam
from transformers import BertForTokenClassification

### 1.1 User Variables Setup
<a id='user_variables'></a>

In [None]:
#SuperAnnotate SDK token
SA_TOKEN = "ADD_YOUR_TOKEN"

In [None]:
SA_PROJECT_NAME = "ADD_SUPERANNOTATE_PROJECT_NAME"

### 1.2 Constants Setup
<a id='constants_setup'></a>

SuperAnnotate Python SDK functions work within the team scope of the platform, so a team-level authorization is required.

To authorize the package in a given team scope, get the authorization token from the team settings page.

In [None]:
sa_client = SAClient(token=SA_TOKEN) ## SuperAnnotate client
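
Hard-coding the token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the token has been exported in an environment variable; the helper `read_sa_token` and the variable name `SA_TOKEN` are illustrative choices and not part of the SuperAnnotate SDK:

```python
import os


def read_sa_token(var_name="SA_TOKEN"):
    """Return the SDK token stored in the environment variable `var_name`.

    Both this helper and the variable name are hypothetical; adapt them
    to however your team stores secrets.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return token
```

The result could then be passed to the client, for example `SAClient(token=read_sa_token())`.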

## 2. Download documents and labels from SuperAnnotate
<a id='download_data'></a>

In [None]:
s3_client = boto3.client('s3')
bucket_name = 'sa-public-datasets'

The data shown on the SuperAnnotate page is actually stored in an AWS S3 bucket.
Here we provide the name of this bucket.

In [None]:
bucket_name = "ADD_YOUR_BUCKET_NAME" # bucket where the data is stored

We also need to create a client to work with AWS S3.

In [None]:
s3_client = boto3.client('s3') ## S3 client


### 2.1. Get links to all files in S3 bucket
<a id='list_all_files_s3'></a>

The texts shown on the SuperAnnotate page are stored in the S3 bucket.
We can download them to a local machine and train our model for legal entity recognition.

Before that, we need to get the keys of all the files.
Since the S3 SDK lists at most 1000 objects per request, we do this iteratively.

In [None]:
subset_names = ['train', 'unlabeled']

data_links_dict = {'train': [],
                   'unlabeled': []}

BUCKET_FOLDER_PATH = 'path/to/data'  # key prefix inside the bucket, without leading or trailing '/'

for subset_name in subset_names:
    print("Processing", subset_name)
    start_key = ''  # list from the beginning of each subset
    while True:
        response = s3_client.list_objects_v2(Bucket=bucket_name,
                                             Prefix=f'{BUCKET_FOLDER_PATH}/{subset_name}/',
                                             StartAfter=start_key)
        # 'Contents' is missing when no (more) objects match the prefix
        objects = response.get('Contents', [])
        for obj in objects:
            data_links_dict[subset_name].append(obj['Key'])
        print(f"\t{len(data_links_dict[subset_name])} files in {subset_name}")
        # stop when the listing is complete; otherwise continue after the last key
        if not response.get('IsTruncated') or not objects:
            break
        start_key = objects[-1]['Key']
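
The same iteration can also be delegated to a boto3 paginator, which requests the pages one after another. A minimal sketch, assuming a client created with `boto3.client('s3')` is passed in; the helper name `list_keys` is hypothetical:

```python
def list_keys(s3_client, bucket, prefix):
    """Collect every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

With this helper, each subset could be listed with a single call per prefix, at the cost of hiding the paging logic that the loop above makes explicit.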

### 2.2. Download files
<a id='download_files'></a>

Now we will use these keys to download all the files from the S3 bucket.

In [None]:
for subset_name in subset_names:
    print(f"Loading {subset_name} docs")
    save_dir = f'./{subset_name}_sa_docs'
    os.makedirs(save_dir, exist_ok=True)
    for file_key in tqdm(data_links_dict[subset_name]):
        # skip folder keys and anything that is not a text file
        if not file_key.endswith('.txt'):
            continue
        filename = os.path.basename(file_key)
        s3_client.download_file(Bucket=bucket_name, 
                                Key=file_key,
                                Filename=os.path.join(save_dir, filename))
        

### 2.3 Download labels from SuperAnnotate
<a id='download_labels_from_sa'></a>

Now we can download labels from SuperAnnotate for the train texts that were annotated manually. The annotations will be downloaded in [SuperAnnotate format](https://doc.superannotate.com/docs/sdk-export-annotations).

In [None]:
# Reuse the token defined in section 1.1 instead of pasting it a second time
sa_client = SAClient(token=SA_TOKEN)

In [None]:
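# Request annotations only for the text files (skip folder keys and other objects)
train_item_names = [os.path.basename(x) for x in data_links_dict['train']
                    if x.endswith('.txt')]

sa_response = sa_client.get_annotations(project=f"{SA_PROJECT_NAME}/train",
                                        items=train_item_names)

annotations = [i['instances'] for i in sa_response]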
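
To make the structure of these annotations concrete, here is a small sketch built on a made-up document. It uses only the fields this tutorial reads or writes (`metadata`, `instances`, `className`, `start`, `end`); the text, the label, and the assumption that `end` is exclusive are illustrative and should be checked against a real export:

```python
# Hypothetical document and annotation, shaped like the fields this tutorial uses.
sample_text = "The Supreme Court of India heard the appeal."
sample_annotation = {
    "metadata": {"name": "sample.txt"},
    "instances": [
        {"type": "entity", "className": "COURT", "start": 4, "end": 26, "attributes": []},
    ],
}


def entity_spans(text, annotation):
    """Return (className, covered text) pairs, assuming `end` is exclusive."""
    return [(inst["className"], text[inst["start"]:inst["end"]])
            for inst in annotation["instances"]]


print(entity_spans(sample_text, sample_annotation))
```

Printing the covered text for a few real items is a quick way to confirm how the character offsets of your export should be interpreted.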

In [None]:
unique_labels = set([entity['className'] for a in annotations for entity in a])
unique_labels.add('O')

print("All unique labels found in training data: ")
for label in unique_labels:
    print(f"\t{label}")

We will map each label to an id and each id back to its label for the BERT model.

In [None]:
label2id = {k: v for v, k in enumerate(sorted(unique_labels))}
id2label = {v: k for v, k in enumerate(sorted(unique_labels))}

for i,l in id2label.items():
    print(f"{i} : {l}")

## 3. Prepare data for Bert NER model
<a id='prepare_data_for_bert_model'></a>

We will use the pretrained ```bert-base-cased``` tokenizer for our data.

In [None]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

PAD_TOKEN_ID = -100

We need to align the named entities that we downloaded in [SuperAnnotate format](https://doc.superannotate.com/docs/sdk-export-annotations) with the tokens produced by the tokenizer.

In [None]:
def align_label(txt_tokenized, entities, label2id):
    label_ids = []
    cnt = 0
    for word_idx, (start,end) in zip(txt_tokenized.word_ids(), txt_tokenized['offset_mapping'][0]):
        if word_idx is None:
            label_ids.append(PAD_TOKEN_ID)
            continue
        found_entity = False
        for entity in entities:
            if entity['start'] <= int(start) and entity['end'] >= int(end) and not found_entity:
                label = entity['className']
                label_ids.append(label2id[label])
                found_entity = True
                break
        if not found_entity:
            label_ids.append(label2id['O'])
    return label_ids
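
The alignment rule is easier to follow on a tiny example without BERT. The sketch below uses whitespace splitting as a stand-in for the tokenizer's offset mapping and applies the same rule: a token receives the class of the first entity that fully covers it, otherwise `O`. The function names, the text and the entity are made up for illustration:

```python
import re


def whitespace_offsets(text):
    """Stand-in for a tokenizer's offset mapping: (start, end) per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def label_tokens(offsets, entities, outside="O"):
    """Give each token the class of the first entity that fully covers it."""
    labels = []
    for start, end in offsets:
        label = outside
        for entity in entities:
            if entity["start"] <= start and entity["end"] >= end:
                label = entity["className"]
                break
        labels.append(label)
    return labels


text = "Justice Rao sat in Delhi"
entities = [{"className": "JUDGE", "start": 8, "end": 11}]
print(label_tokens(whitespace_offsets(text), entities))
```

With a real subword tokenizer the same rule applies to every word piece, so a name split into several pieces yields several consecutive tokens with the same class.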

Now we can create a class for our dataset.

In [None]:
class DataSequence(torch.utils.data.Dataset):
    def __init__(self, texts, entities=None, label2id=None):
        if not entities:
            entities = [[] for t in texts]
        # tokenize the text passed to the lambda (the original referenced an undefined name)
        configured_tokenizer = lambda text: tokenizer(str(text),
                                                      padding='max_length',
                                                      max_length=512,
                                                      truncation=True,
                                                      return_tensors="pt",
                                                      return_offsets_mapping=True,
                                                      return_length=True)
        self.texts = [configured_tokenizer(text) for text in texts]
        self.labels = [align_label(i, j, label2id) for i,j in zip(self.texts, entities)]

        
    def __len__(self):
        return len(self.labels)

    def get_batch_data(self, idx):
        return self.texts[idx]

    def get_batch_labels(self, idx):
        return torch.LongTensor(self.labels[idx])

    def __getitem__(self, idx):
        batch_data = self.get_batch_data(idx)
        batch_labels = self.get_batch_labels(idx)
        return batch_data, batch_labels

Now we load the train texts that we downloaded from the S3 bucket and split them into train, validation and test sets.

In [None]:
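TRAIN_DOCS_FOLDER = './train_sa_docs'

texts = []

# Read the texts in the same order as `annotations`, so that texts[i]
# matches annotations[i] (item names are the downloaded file names).
for item in sa_response:
    filename = os.path.join(TRAIN_DOCS_FOLDER, item['metadata']['name'])
    with open(filename) as f:
        texts.append(f.read())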

In [None]:
train_texts, valid_texts, train_entities, valid_entities = train_test_split(texts, annotations, test_size=0.2)
val_texts, test_texts, val_entities, test_entities = train_test_split(valid_texts, valid_entities, test_size=0.5)

## 4. Train model
<a id='train_model'></a>

Let's now declare a class for our token classification model and implement the training loop.

In [None]:
class BertModel(torch.nn.Module):

    def __init__(self):
        super(BertModel, self).__init__()
        self.bert = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(unique_labels))

    def forward(self, input_id, mask, label):
        output = self.bert(input_ids=input_id,
                           attention_mask=mask,
                           labels=label,
                           return_dict=False)

        return output

In [None]:
def train_loop(model, train_texts, train_entities, val_texts, val_entities, label2id):

    train_dataset = DataSequence(train_texts, train_entities, label2id)
    val_dataset = DataSequence(val_texts, val_entities, label2id)

    train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                                   num_workers=4,
                                                   batch_size=BATCH_SIZE,
                                                   shuffle=True)
    
    val_dataloader = torch.utils.data.DataLoader(val_dataset,
                                                 num_workers=4,
                                                 batch_size=BATCH_SIZE)
    
    train_size = len(train_texts)
    val_size = len(val_texts)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    
    optimizer = NAdam(model.parameters(), lr=LEARNING_RATE)

    if use_cuda:
        model = model.cuda()

    best_acc = 0
    best_loss = 1000

    for epoch_num in range(EPOCHS):

        total_acc_train = 0
        total_loss_train = 0

        model.train()

        for train_data, train_label in tqdm(train_dataloader):
            train_label = train_label.to(device)
            mask = train_data['attention_mask'].squeeze(1).to(device)
            input_id = train_data['input_ids'].squeeze(1).to(device)

            optimizer.zero_grad()
            loss, logits = model(input_id, mask, train_label)

            for i in range(logits.shape[0]):

                logits_clean = logits[i][train_label[i] != PAD_TOKEN_ID]
                label_clean = train_label[i][train_label[i] != PAD_TOKEN_ID]

                predictions = logits_clean.argmax(dim=1)
                acc = (predictions == label_clean).float().mean()
                total_acc_train += acc
                total_loss_train += loss.item()

            loss.backward()
            optimizer.step()

        model.eval()

        total_acc_val = 0
        total_loss_val = 0

        # gradients are not needed for validation
        with torch.no_grad():
            for val_data, val_label in val_dataloader:

                val_label = val_label.to(device)
                mask = val_data['attention_mask'].squeeze(1).to(device)
                input_id = val_data['input_ids'].squeeze(1).to(device)

                loss, logits = model(input_id, mask, val_label)

                for i in range(logits.shape[0]):
                    logits_clean = logits[i][val_label[i] != PAD_TOKEN_ID]
                    label_clean = val_label[i][val_label[i] != PAD_TOKEN_ID]

                    predictions = logits_clean.argmax(dim=1)
                    acc = (predictions == label_clean).float().mean()
                    total_acc_val += acc
                    total_loss_val += loss.item()

        print(
            f'Epochs: {epoch_num + 1} | Loss: {total_loss_train / train_size: .3f} | Accuracy: {total_acc_train / train_size: .3f} | Val_Loss: {total_loss_val / val_size: .3f} | Val_Accuracy: {total_acc_val / val_size: .3f}')
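
The accuracy in the loop is computed only over positions whose label differs from `PAD_TOKEN_ID`, that is, special and padding tokens are excluded. The same idea in plain Python, as a sketch with hypothetical names and made-up ids:

```python
IGNORE_ID = -100  # same value as PAD_TOKEN_ID in this tutorial


def masked_accuracy(predicted_ids, label_ids, ignore_id=IGNORE_ID):
    """Fraction of correct predictions over positions whose label is not ignored."""
    kept = [(p, l) for p, l in zip(predicted_ids, label_ids) if l != ignore_id]
    if not kept:
        return 0.0
    return sum(p == l for p, l in kept) / len(kept)


# First and last positions stand for special tokens and are skipped.
print(masked_accuracy([0, 2, 1, 1, 0], [-100, 2, 1, 0, -100]))
```

Because most tokens in NER data carry the `O` label, this token-level accuracy tends to look optimistic; span-level metrics are usually a better indicator of quality.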

In [None]:
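LEARNING_RATE = 5e-3  # BERT fine-tuning usually uses much smaller rates (e.g. 2e-5 to 5e-5); lower this if training diverges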
EPOCHS = 7
BATCH_SIZE = 4

In [None]:
%env TOKENIZERS_PARALLELISM=true

In [None]:
model = BertModel()
train_loop(model, train_texts, train_entities, val_texts, val_entities, label2id)

## 5. Evaluate model
<a id='evaluate_model'></a>

After training is done we can evaluate our model on the test data. 
We can use [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) to get span-based metrics.

In [None]:
def evaluate(model, test_texts, test_labels, label2id):

    return_data = []

    test_dataset = DataSequence(test_texts, test_labels, label2id)

    test_dataloader = DataLoader(test_dataset, num_workers=4, batch_size=1)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    model = model.to(device)
    model.eval()  # disable dropout for inference

    # gradients are not needed for inference
    with torch.no_grad():
        for test_data, test_label in test_dataloader:
            test_label = test_label.to(device)
            mask = test_data['attention_mask'].squeeze(1).to(device)
            input_id = test_data['input_ids'].squeeze(1).to(device)

            loss, logits = model(input_id, mask, test_label)

            for i in range(logits.shape[0]):
                # keep only real tokens (special and padding tokens carry PAD_TOKEN_ID)
                logits_clean = logits[i][test_label[i] != PAD_TOKEN_ID]
                label_clean = test_label[i][test_label[i] != PAD_TOKEN_ID]
                predictions = logits_clean.argmax(dim=1)
                return_data.append((test_data, predictions, label_clean))

    return return_data

In [None]:
evaluation_data = evaluate(model, test_texts, test_entities, label2id)

In [None]:
test_texts = [x[0] for x in evaluation_data]
predictions = [x[1] for x in evaluation_data]
label_clean = [x[2] for x in evaluation_data]

In [None]:
predictions_iob = [[id2label[int(i)] for i in sent_lb] for sent_lb in predictions]
label_clean_iob = [[id2label[int(i)] for i in sent_lb] for sent_lb in label_clean]

In [None]:
# seqeval expects the ground truth first and the predictions second
print(classification_report(label_clean_iob, predictions_iob))
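
seqeval is designed around tagging schemes such as IOB2, where tags carry `B-` and `I-` prefixes, whereas the labels in this tutorial are bare class names. A minimal sketch of one way to convert them, assuming that consecutive tokens with the same class belong to one entity; the helper `to_iob2` and the sample labels are hypothetical:

```python
def to_iob2(labels, outside="O"):
    """Prefix token-level class names with B-/I- to form IOB2 tags."""
    tags = []
    previous = outside
    for label in labels:
        if label == outside:
            tags.append(outside)
        elif label == previous:
            tags.append(f"I-{label}")  # continues the entity started earlier
        else:
            tags.append(f"B-{label}")  # first token of a new entity
        previous = label
    return tags


print(to_iob2(["O", "COURT", "COURT", "O", "JUDGE"]))
```

One limitation of this assumption is that two adjacent entities of the same class are merged into a single span, because the token labels alone cannot tell them apart.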

## 6. Get predictions for unlabeled texts
<a id='get_predictions_for_unlabeled_texts'></a>

In [None]:
unlabeled_texts = []
names = []
for filename in glob.glob('./unlabeled_sa_docs/*.txt'):
    with open(filename) as f:
        unlabeled_texts.append(f.read())
        names.append(os.path.basename(filename)) 

In [None]:
output_unlabeled = evaluate(model=model,
                            test_texts=unlabeled_texts[:100],
                            test_labels=[],
                            label2id=label2id)

In [None]:
tokenized_texts = [x[0] for x in output_unlabeled]
predictions = [x[1] for x in output_unlabeled]

## 7. Make annotations in SuperAnnotate format
<a id='make_annotations_sa_format'></a>

Based on the predictions made by the model, we now create annotations in SuperAnnotate format so that we can upload them to SuperAnnotate.

In [None]:
def bert_pred_to_annotations(tokenized_texts, predictions, ids_to_labels, names):
    annotations = []
    for text, labels, name in zip(tokenized_texts, predictions, names):
        entities = []
        for i, label_id in enumerate(labels):
            start, end = text['offset_mapping'][0][0][i+1]
            label = ids_to_labels[label_id.item()]
            if not label == 'O':
                entities.append({"type": "entity",
                                 "className": label,
                                 "start": start.item(),
                                 "end": end.item() + 1,
                                 "attributes": []
                                 })
        annotations.append({'instances': entities,
                            'metadata': {'name' : name}})
    return annotations
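
The function above emits one entity per predicted token, so a multi-word name becomes several small annotations. If longer spans are preferred, neighbouring tokens of the same class can be merged afterwards. A sketch under stated assumptions: `end` is treated as exclusive, the entities are sorted by `start`, and the helper `merge_token_entities` is hypothetical rather than part of any SDK:

```python
def merge_token_entities(token_entities, max_gap=1):
    """Merge neighbouring same-class token entities into longer spans.

    `token_entities` must be sorted by `start`; `max_gap` is the largest
    number of characters (e.g. one space) allowed between merged tokens.
    """
    merged = []
    for entity in token_entities:
        last = merged[-1] if merged else None
        if (last is not None
                and last["className"] == entity["className"]
                and entity["start"] - last["end"] <= max_gap):
            # extend the previous span instead of starting a new one
            last["end"] = max(last["end"], entity["end"])
        else:
            merged.append(dict(entity))  # copy, so the input is left untouched
    return merged


tokens = [
    {"className": "COURT", "start": 4, "end": 11},
    {"className": "COURT", "start": 12, "end": 17},
    {"className": "JUDGE", "start": 30, "end": 33},
]
print(merge_token_entities(tokens))
```

Whether merging is desirable depends on the annotation guidelines of the project, so treat it as an optional post-processing step.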

In [None]:
new_annotations = bert_pred_to_annotations(tokenized_texts,predictions,id2label,names)

In [None]:
import json

ANNOTATIONS_FOLDER = 'PATH/TO/LOCAL/DIR/' # local folder to store .json files with annotations
os.makedirs(ANNOTATIONS_FOLDER, exist_ok=True)
for annotation in new_annotations:
    filename = annotation['metadata']['name']
    with open(os.path.join(ANNOTATIONS_FOLDER, f'{filename}.json'), 'w') as f:
        json.dump(annotation, f)

## 8. Upload new annotations to SuperAnnotate platform
<a id='upload_new_annotations_to_sa_platform'></a>

Now we can upload the annotations generated in the previous step back to SuperAnnotate.

In [None]:
def read_js(filename):
    with open(filename) as f:
        js = json.load(f)
    return js 

In [None]:
outputs = []
files = os.listdir(ANNOTATIONS_FOLDER)
files_per_step = 500

# step through the files in batches; range() never produces an empty batch
for start in range(0, len(files), files_per_step):
    end = min(start + files_per_step, len(files))

    batch = [read_js(os.path.join(ANNOTATIONS_FOLDER, f)) for f in files[start: end]]

    outputs.append(sa_client.upload_annotations(project=f'{SA_PROJECT_NAME}/unlabeled', annotations=batch))
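
Splitting the files into fixed-size batches can be isolated in a small generator. A sketch (the helper name `chunked` is hypothetical) that never yields an empty batch, including when the number of files is an exact multiple of the batch size:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


print(list(chunked(["a.json", "b.json", "c.json", "d.json", "e.json"], 2)))
```

Each yielded list of file names could then be read with `read_js` and passed to a single upload call.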

Now we can look at the unlabeled folder on the SuperAnnotate page and see the predictions made by our model.


![](../docs/legal-ner/labeled_unlabeled.png)

All files in the unlabeled folder have changed their status.

![](../docs/legal-ner/new_labels_example.png)