# NER
    - Text & Labels
    - Text $$\implies$$ vocab (unique words in the corpus) + unknown token + padding token
        - word2id and id2word
    - Labels $\implies$ tags (unique tags)
        - tag2id and id2tag
    
    - conversion of texts i.e. sentences into tokens
    - conversion of tokens into input ids with padding (maximum length of the sentences)
    
    - conversion of labels into label ids
    - addition of padded tokens with label ids
    
    - If using Attention, make sure of attention masks
    - Make sure of masks during loss calculation and metric calculation

## Basic LSTM

## Data

- `words.txt` and `tags.txt` $\rightarrow$ used to create the vocab and ner-tag maps
- `train`, `test` and `val` folders each containing two files. For replication purposes.
    - `sentences.txt` $\rightarrow$ sentences in each line
    - `labels.txt` $\rightarrow$ corresponding ner-tags of each sentences
    
**Note: As the dataset is already partitioned the dataset class that will manage the data operations can directly read the data and address the necessary modifications.**

## Vocab & Tag Maps

In [1]:
words_path = '../data/words.txt'
tags_path = '../data/tags.txt'

In [2]:
vocab = {}
with open(words_path) as f:
    for i, l in enumerate(f.read().splitlines()):
        vocab[l] = i
vocab['<PAD>'] = len(vocab)

print("Thousands: \t", vocab['Thousands'], "\tUknown:\t", vocab['UNK'], "\tPAD\t", vocab['<PAD>'])
print()
c = 0
for i in vocab.items():
    print(i)
    print('.'*50)
    c += 1
    if c > 5: break

Thousands: 	 0 	Uknown:	 35179 	PAD	 35180

('Thousands', 0)
..................................................
('of', 1)
..................................................
('demonstrators', 2)
..................................................
('have', 3)
..................................................
('marched', 4)
..................................................
('through', 5)
..................................................


- A `<PAD>` token is set which will be used to pad the sentences upto the maximum length of the sequences
- An `UNK` token is added to address out of vocabulary words/tokens

In [3]:
tag_map = {}
with open(tags_path) as f:
    for i, t in enumerate(f.read().splitlines()):
        tag_map[t] = i

tag_map

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'B-art': 8,
 'I-art': 9,
 'I-per': 10,
 'I-gpe': 11,
 'I-tim': 12,
 'B-nat': 13,
 'B-eve': 14,
 'I-eve': 15,
 'I-nat': 16}

## Loading Training data and performing Transformations
- Same will be applicable for both test and validation sets

In [4]:
# Load Data in lists
with open("../data/train/sentences.txt", "r") as f:
    sentences = f.read().splitlines()
    
for i in sentences[:5]:
    print(i)
    print('.' * 100)

Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .
....................................................................................................
Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings . "
....................................................................................................
They marched from the Houses of Parliament to a rally in Hyde Park .
....................................................................................................
Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 .
....................................................................................................
The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of 

In [5]:
# Load Data in lists
with open("../data/train/labels.txt", "r") as f:
    labels = f.read().splitlines()
    
for i in labels[:5]:
    print(i)
    print('.' * 100)

O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O
....................................................................................................
O O O O O O O O O O O O O O O O O O B-per O O O O O O O O O O O
....................................................................................................
O O O O O O O O O O O B-geo I-geo O
....................................................................................................
O O O O O O O O O O O O O O O
....................................................................................................
O O O O O O O O O O O B-geo O O B-org I-org O O O B-gpe O O O B-geo O
....................................................................................................


## Data Conversion
- Convert sentences into list of id's based on the word2id $\rightarrow$ vocab mapping
- Convert tags in to list of tag id's based on the tag2id $\rightarrow$ tags mapping

In [6]:
from typing import List, Dict
import random
random.seed(518123)

In [7]:
def convert_sentence2id(sentence: str, vocab_map: Dict) -> List:
    sentence_id = []
    for token in sentence.split(' '):
        if token in vocab_map:
            sentence_id.append(vocab_map[token])
        else:
            sentence_id.append(vocab_map['UNK'])
    return sentence_id            

In [8]:
def convert_tags2id(tag_list: str, tag_map: Dict) -> List:
    tag_id = []
    for label in tag_list.split(' '):
        tag_id.append(tag_map[label])
    return tag_id

In [9]:
rand_id = random.choice(range(len(sentences)))
rand_sent = sentences[rand_id]
print(rand_sent)
print('.'*len(rand_sent))
print(convert_sentence2id(rand_sent, vocab_map=vocab))
print()
rand_labels = labels[rand_id]
print(rand_labels)
print('.'*len(rand_labels))
print(convert_tags2id(rand_labels, tag_map=tag_map))

In a statement Monday , Mr. Peres said " there exists no basis in reality for the claims published " by the British newspaper , The Guardian .
..............................................................................................................................................
[345, 45, 1171, 1564, 93, 816, 8887, 172, 35, 596, 10871, 388, 2051, 11, 7814, 223, 9, 2865, 2573, 35, 191, 9, 16, 1765, 93, 61, 2646, 21]

O O O B-tim O B-per I-per O O O O O O O O O O O O O O O B-gpe O O B-org I-org O
...............................................................................
[0, 0, 0, 7, 0, 3, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 5, 6, 0]


In [10]:
assert len(sentences) == len(labels)
for s, l in zip(sentences, labels):
    try:
        assert len(s.split(' ')) == len(l.split(' '))
    except AssertionError:
        print(s, l)
        continue

## Dataclass
- Pass the lists into torch dataset/data-generator class

In [11]:
train_sentence = "../data/train/sentences.txt"
train_labels = "../data/train/labels.txt"

valid_sentence = "../data/val/sentences.txt"
valid_labels = "../data/val/labels.txt"

test_sentence = "../data/test/sentences.txt"
test_labels = "../data/test/labels.txt"

In [12]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

In [13]:
class ner_data(Dataset):
    
    def __init__(self, sentences_path: str, labels_path: str, vocab_map: Dict, tags_map: Dict) -> None:
        super().__init__()
        
        with open(sentences_path, "r") as f:
            self.sentences = f.read().splitlines()
        
        with open(labels_path, "r") as f:
            self.labels = f.read().splitlines()
            
        self.max_len = max([len(sentence) for sentence in self.sentences])
        self.vocab_map = vocab_map
        self.tags_map = tags_map
            
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        sentence_padded = np.array(self.max_len * [vocab["<PAD>"]])
        labels_padded = np.array(self.max_len * [vocab["<PAD>"]])
        
        sentence = convert_sentence2id(self.sentences[idx], self.vocab_map)
        labels = convert_tags2id(self.labels[idx], self.tags_map)
        
        assert len(sentence) == len(labels)
        
        sentence_padded[:len(sentence)] = sentence
        labels_padded[:len(labels)] = labels
        
        return torch.tensor(sentence_padded, dtype=torch.long), torch.tensor(labels_padded, dtype=torch.long)

In [14]:
ner_train = ner_data(sentences_path=train_sentence,
                     labels_path=train_labels,
                     vocab_map=vocab,
                     tags_map=tag_map)

In [15]:
ner_train.max_len, len(ner_train)

(541, 33570)

In [16]:
for i in range(10):
    data = ner_train.__getitem__(i)
    print(data[0].shape, data[1].shape)

torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])
torch.Size([541]) torch.Size([541])


In [17]:
batch_size = 512
train_loader = DataLoader(dataset=ner_train, batch_size=batch_size, shuffle=True, num_workers=4)

In [18]:
tl = iter(train_loader)

In [19]:
data_batch = next(tl)

In [20]:
data_batch[0].shape, data_batch[1].shape

(torch.Size([512, 541]), torch.Size([512, 541]))

In [21]:
def create_ner_data_loader(sentences_path, labels_path, vocab_map, tags_map,
                           batch_size=8, shuffle=True, num_workers=4):
    
    ner = ner_data(sentences_path, labels_path, vocab_map, tags_map)
    
    return DataLoader(dataset=ner, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)

In [22]:
train_loader = create_ner_data_loader(train_sentence, train_labels, vocab, tag_map, batch_size=batch_size)
valid_loader = create_ner_data_loader(valid_sentence, valid_labels, vocab, tag_map, batch_size=batch_size)
test_loader = create_ner_data_loader(test_sentence, test_labels, vocab, tag_map, batch_size=batch_size)

In [23]:
vl = iter(valid_loader)
valid_data = next(vl)

In [24]:
valid_data[0].shape, valid_data[1].shape

(torch.Size([512, 424]), torch.Size([512, 424]))

## Model

In [25]:
from torch import nn
import torch.nn.functional as F

class NER_Tagger(nn.Module):
    
    def __init__(self, vocab_size: int, embedding_size: int, dense_output_size: int, device: str = 'cpu') -> None:
        super().__init__()
        
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size)
        self.lstm = nn.LSTM(embedding_size, embedding_size//2, batch_first=True)
        self.dense = nn.Linear(embedding_size//2, dense_output_size)
        
        self.to(device)

    def forward(self, X):
        """
        processed --> will have 3 outputs
            - hidden states for each input sequence
            - final hidden state for each element in the sequence
            - final cell state for each element in the sequence
            
        processed[0].shape = (batch_size, sequence_length, h_out_size)
        cessed[1][0].shape = (1, batch_size, h_out_size)
        processed[1][1].shape = (1, batch_size, h_out_size)
        """
        
        embedded = self.embedding(X)
        processed = self.lstm(embedded)
        processed = self.dense(processed[0])
        return F.log_softmax(processed, dim=-1)

In [26]:
tagger = NER_Tagger(len(vocab), 50, len(tag_map))
tagger

NER_Tagger(
  (embedding): Embedding(35181, 50)
  (lstm): LSTM(50, 25, batch_first=True)
  (dense): Linear(in_features=25, out_features=17, bias=True)
)

In [27]:
predicted_tags = tagger(data_batch[0])
predicted_tags

tensor([[[-3.0022, -2.9203, -2.9438,  ..., -2.9484, -2.8224, -2.7614],
         [-2.9889, -2.6462, -2.9306,  ..., -2.7803, -2.8264, -2.8393],
         [-2.9105, -2.7030, -2.9933,  ..., -2.6913, -2.7873, -2.9711],
         ...,
         [-2.7820, -3.0221, -3.1874,  ..., -2.6801, -2.5884, -2.8529],
         [-2.7820, -3.0221, -3.1874,  ..., -2.6801, -2.5884, -2.8529],
         [-2.7820, -3.0221, -3.1874,  ..., -2.6801, -2.5884, -2.8529]],

        [[-3.1038, -2.7061, -2.7388,  ..., -2.8788, -2.8103, -2.7535],
         [-2.9590, -2.7851, -2.7770,  ..., -2.7818, -2.8435, -2.7850],
         [-3.0146, -2.7110, -2.7457,  ..., -2.7651, -2.8989, -2.7158],
         ...,
         [-2.7820, -3.0221, -3.1874,  ..., -2.6801, -2.5884, -2.8529],
         [-2.7820, -3.0221, -3.1874,  ..., -2.6801, -2.5884, -2.8529],
         [-2.7820, -3.0221, -3.1874,  ..., -2.6801, -2.5884, -2.8529]],

        [[-2.8991, -2.8902, -2.7869,  ..., -2.8971, -2.7908, -2.6116],
         [-2.9473, -2.8656, -2.9165,  ..., -2

In [28]:
predicted_tags.shape, predicted_tags[0].shape

(torch.Size([512, 541, 17]), torch.Size([541, 17]))

In [29]:
torch.argmax(predicted_tags, dim=-1).shape

torch.Size([512, 541])

In [30]:
data_batch[1]

tensor([[    0,     0,     0,  ..., 35180, 35180, 35180],
        [    0,     3,    10,  ..., 35180, 35180, 35180],
        [    0,     0,     0,  ..., 35180, 35180, 35180],
        ...,
        [    2,     0,     0,  ..., 35180, 35180, 35180],
        [    0,     0,     0,  ..., 35180, 35180, 35180],
        [    2,     0,     1,  ..., 35180, 35180, 35180]])

In [31]:
(torch.argmax(predicted_tags, dim=-1) == data_batch[1])

tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]])

In [32]:
pad_mask = data_batch[1] != vocab['<PAD>']
pad_mask

tensor([[ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False],
        ...,
        [ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False]])

In [33]:
((torch.argmax(predicted_tags, dim=-1) == data_batch[1]) * pad_mask).sum(), pad_mask.sum(), ((torch.argmax(predicted_tags, dim=-1) == data_batch[1]) * pad_mask).sum() / pad_mask.sum()

(tensor(134), tensor(10733), tensor(0.0125))

In [34]:
data_batch[1] * pad_mask

tensor([[ 0,  0,  0,  ...,  0,  0,  0],
        [ 0,  3, 10,  ...,  0,  0,  0],
        [ 0,  0,  0,  ...,  0,  0,  0],
        ...,
        [ 2,  0,  0,  ...,  0,  0,  0],
        [ 0,  0,  0,  ...,  0,  0,  0],
        [ 2,  0,  1,  ...,  0,  0,  0]])

In [35]:
type(data_batch[0])

torch.Tensor

## Check to see the operations on dimensions

In [36]:
X = torch.rand((2, 4, 3))
X

tensor([[[0.2670, 0.3764, 0.5874],
         [0.6756, 0.2998, 0.3733],
         [0.1682, 0.3479, 0.3727],
         [0.2941, 0.7707, 0.8221]],

        [[0.8323, 0.9966, 0.3166],
         [0.6257, 0.4427, 0.6233],
         [0.7246, 0.2789, 0.5361],
         [0.2861, 0.7712, 0.1433]]])

In [37]:
test_log_softmax = F.log_softmax(X, dim=-1)
test_log_softmax

tensor([[[-1.2509, -1.1414, -0.9305],
         [-0.8862, -1.2620, -1.1885],
         [-1.2307, -1.0510, -1.0263],
         [-1.4601, -0.9834, -0.9321]],

        [[-1.0209, -0.8566, -1.5366],
         [-1.0404, -1.2234, -1.0428],
         [-0.9036, -1.3494, -1.0922],
         [-1.2502, -0.7652, -1.3931]]])

In [38]:
torch.exp(test_log_softmax).sum(dim=-1)

tensor([[1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000]])

## Cost Function and Optimizers

### Device Setup, Learning Rate

In [39]:
from tqdm.notebook import tqdm

for i in tqdm(range(1000), dynamic_ncols=True):
    pass

  0%|                                                                                                         …

In [40]:
def prediction_evaluation(predictions, true_labels, pad_value):
    """
    Inputs:
        pred: prediction array with shape -> (num_examples, max sentence length in batch)
        labels: array of size (batch_size, seq_len)
        pad: integer representing pad character
    Outputs:
        accuracy: float
    """
    # Create mask matrix equal to the shape of the true-labels matrix
    pad_mask = true_labels != pad_value
    
    # Calculate Accuracy
    accuracy = ((predictions == true_labels) * pad_mask).sum() / pad_mask.sum()
    
    return accuracy

In [41]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# device = 'cpu'


lr = 1e-2
lr_step_size = 2           # every 2 epochs
lr_reduction_pc = 0.95     # 95% reduction of lr

grad_clip_threshold = 0.3

epochs = 5

tagger = NER_Tagger(vocab_size=len(vocab), embedding_size=50, dense_size=len(tag_map), device=device)

In [42]:
# Loss/Cost Function
loss_function = nn.CrossEntropyLoss(ignore_index=vocab['<PAD>']).to(device)

# Optimizer --> model parameters and learning rate
# Class weights can also be specified if necessary
optimizer = torch.optim.Adam(params=tagger.parameters(), lr=lr)

# learning rate scheduler --> slow reduction of learning rate from a higher value
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer=optimizer, step_size=lr_step_size, gamma=lr_reduction_pc)

# dl tricks
#    gradient clippings --> to avoid vanishing gradient or gradient explosion
gradient_clip = nn.utils.clip_grad_value_
gradient_clip(parameters=tagger.parameters(), clip_value=grad_clip_threshold)

## Train the Model

In [43]:
validation_metric_tracker = []
steps = 0

# Epoch Loop --> 1 epoch is one forward pass through all the training data
for epoch in tqdm(range(epochs), dynamic_ncols=True, desc='Epoch Progress'):

    # set model to training mode
    tagger.train()
    epoch_loss, train_predictions = [], []

    # training loop for each mini-batch
    for sentences, true_tags in tqdm(train_loader, dynamic_ncols=True, desc='Training Progress'):
        # set grads to zero
        optimizer.zero_grad()

        # forward pass
        sentences, true_tags = sentences.to(device), true_tags.to(device)
        predicted_tags = tagger(sentences)

        # loss calculation
        loss = loss_function(predicted_tags.view(-1, predicted_tags.shape[2]),
                             true_tags.view(-1))

        # Track Loss for each epoch
        epoch_loss.append(loss.item())

        # calculate training metrics
        train_accuracy = prediction_evaluation(torch.argmax(predicted_tags, dim=-1), true_tags, pad_value=vocab['<PAD>'])
        train_predictions.append(train_accuracy.item())

        # backward pass i.e. gradient calculations
        loss.backward()

        # optimizer step
        optimizer.step()

        # increment training steps
        steps += 1

    print('Training {:>4d} steps  --- Accuracy {:>5.4f}  |  Epoch Loss  {:>5.4f}'.format(steps, np.mean(train_predictions),
                                                                                     np.mean(epoch_loss)))
    
    # validation loop after each training epoch
    # set no grad
    with torch.no_grad():
        tagger.eval()
        validation_predictions = []
        
        # perform validation
        # validation loop for each mini-batch
        for val_sentences, val_true_tags in tqdm(train_loader, dynamic_ncols=True, desc='Validation Progress'):
            
            val_sentences, val_true_tags = val_sentences.to(device), val_true_tags.to(device)
            predicted_tags = tagger(val_sentences)
            
            # calculate validation metrics
            validation_accuracy = prediction_evaluation(torch.argmax(predicted_tags, dim=-1), val_true_tags, pad_value=vocab['<PAD>'])
            validation_predictions.append(validation_accuracy.item())
        
        print('Validation Accuracy {:>5.4f}'.format(np.mean(validation_predictions)))
    
    lr_scheduler.step()

# save checkpoint (conditioned on performance or time-steps)

Epoch Progress:   0%|                                                                                         …

Training Progress:   0%|                                                                                      …

Training   66 steps  --- Accuracy 0.8179  |  Epoch Loss  0.8612


Validation Progress:   0%|                                                                                    …

Validation Accuracy 0.8819


Training Progress:   0%|                                                                                      …

Training  132 steps  --- Accuracy 0.9111  |  Epoch Loss  0.3355


Validation Progress:   0%|                                                                                    …

Validation Accuracy 0.9393


Training Progress:   0%|                                                                                      …

Training  198 steps  --- Accuracy 0.9470  |  Epoch Loss  0.1983


Validation Progress:   0%|                                                                                    …

Validation Accuracy 0.9568


Training Progress:   0%|                                                                                      …

Training  264 steps  --- Accuracy 0.9586  |  Epoch Loss  0.1481


Validation Progress:   0%|                                                                                    …

Validation Accuracy 0.9638


Training Progress:   0%|                                                                                      …

Training  330 steps  --- Accuracy 0.9642  |  Epoch Loss  0.1235


Validation Progress:   0%|                                                                                    …

Validation Accuracy 0.9680


## Testing

In [44]:
validator = iter(test_loader)

In [45]:
x_test, y_test = validator.next()

In [46]:
x_test = x_test.to('cuda')
predicted_test = tagger(x_test)

In [47]:
prediction_evaluation(torch.argmax(predicted_test, dim=-1).detach().cpu(), y_test, pad_value=vocab['<PAD>'])

tensor(0.9502)

In [48]:
pred_tags = torch.argmax(predicted_test, dim=-1).detach().cpu()

In [49]:
mask = y_test[0] != vocab['<PAD>']

In [50]:
reverse_tag_map = {v:k for k, v in tag_map.items()}
reverse_tag_map

{0: 'O',
 1: 'B-geo',
 2: 'B-gpe',
 3: 'B-per',
 4: 'I-geo',
 5: 'B-org',
 6: 'I-org',
 7: 'B-tim',
 8: 'B-art',
 9: 'I-art',
 10: 'I-per',
 11: 'I-gpe',
 12: 'I-tim',
 13: 'B-nat',
 14: 'B-eve',
 15: 'I-eve',
 16: 'I-nat'}

In [51]:
reverse_vocab_map = {v:k for k, v in vocab.items()}
reverse_vocab_map

{0: 'Thousands',
 1: 'of',
 2: 'demonstrators',
 3: 'have',
 4: 'marched',
 5: 'through',
 6: 'London',
 7: 'to',
 8: 'protest',
 9: 'the',
 10: 'war',
 11: 'in',
 12: 'Iraq',
 13: 'and',
 14: 'demand',
 15: 'withdrawal',
 16: 'British',
 17: 'troops',
 18: 'from',
 19: 'that',
 20: 'country',
 21: '.',
 22: 'Families',
 23: 'soldiers',
 24: 'killed',
 25: 'conflict',
 26: 'joined',
 27: 'protesters',
 28: 'who',
 29: 'carried',
 30: 'banners',
 31: 'with',
 32: 'such',
 33: 'slogans',
 34: 'as',
 35: '"',
 36: 'Bush',
 37: 'Number',
 38: 'One',
 39: 'Terrorist',
 40: 'Stop',
 41: 'Bombings',
 42: 'They',
 43: 'Houses',
 44: 'Parliament',
 45: 'a',
 46: 'rally',
 47: 'Hyde',
 48: 'Park',
 49: 'Police',
 50: 'put',
 51: 'number',
 52: 'marchers',
 53: 'at',
 54: '10,000',
 55: 'while',
 56: 'organizers',
 57: 'claimed',
 58: 'it',
 59: 'was',
 60: '1,00,000',
 61: 'The',
 62: 'comes',
 63: 'on',
 64: 'eve',
 65: 'annual',
 66: 'conference',
 67: 'Britain',
 68: "'s",
 69: 'ruling',
 70:

In [52]:
red = "\033[1;31m"
green = "\033[1;32m"
purple= "\033[1;35m"
reset= "\033[0m"

In [53]:
sentence = [reverse_vocab_map[i.item()] for i in x_test[0][mask].detach().cpu()]
pred_tags_converted = [reverse_tag_map[i.item()] for i in pred_tags[0][mask]]
true_tags_converted = [reverse_tag_map[i.item()] for i in y_test[0][mask]]

print("{:>20} | {:>7} | {:>5}".format("Token", "Pred", "True NER Tag"))
print('-'*70)
for s, p, t in zip(sentence, pred_tags_converted, true_tags_converted):
    if p == t:
        p = green + p + reset
    else:
        p = red + p + reset
    t = purple + t + reset
    print("{:>20} | {:>17}  | {:>15}".format(s, p, t))

               Token |    Pred | True NER Tag
----------------------------------------------------------------------
                  In |      [1;32mO[0m  |    [1;35mO[0m
                1861 |      [1;31mO[0m  | [1;35mB-tim[0m
                   , |      [1;32mO[0m  |    [1;35mO[0m
                 the |      [1;32mO[0m  |    [1;35mO[0m
          Dominicans |      [1;31mO[0m  | [1;35mB-gpe[0m
         voluntarily |      [1;32mO[0m  |    [1;35mO[0m
            returned |      [1;32mO[0m  |    [1;35mO[0m
                  to |      [1;32mO[0m  |    [1;35mO[0m
                 the |      [1;32mO[0m  |    [1;35mO[0m
             Spanish |  [1;32mB-gpe[0m  | [1;35mB-gpe[0m
              Empire |      [1;32mO[0m  |    [1;35mO[0m
                   , |      [1;32mO[0m  |    [1;35mO[0m
                 but |      [1;32mO[0m  |    [1;35mO[0m
                 two |      [1;31mO[0m  | [1;35mB-tim[0m
               years |      [1;32mO[0

In [54]:
random_id = random.randint(0, len(test_loader))

In [55]:
mask = y_test[random_id] != vocab['<PAD>']
sentence = [reverse_vocab_map[i.item()] for i in x_test[random_id][mask].detach().cpu()]
pred_tags_converted = [reverse_tag_map[i.item()] for i in pred_tags[random_id][mask]]
true_tags_converted = [reverse_tag_map[i.item()] for i in y_test[random_id][mask]]

print("{:>20} | {:>7} | {:>5}".format("Token", "Pred", "True Tag"))
print('-'*70)
for s, p, t in zip(sentence, pred_tags_converted, true_tags_converted):
    if p == t:
        p = green + p + reset
    else:
        p = red + p + reset
    t = purple + t + reset
    print("{:>20} | {:>17}  | {:>15}".format(s, p, t))

               Token |    Pred | True Tag
----------------------------------------------------------------------
               There |      [1;32mO[0m  |    [1;35mO[0m
                 has |      [1;32mO[0m  |    [1;35mO[0m
                been |      [1;32mO[0m  |    [1;35mO[0m
                  no |      [1;32mO[0m  |    [1;35mO[0m
                U.S. |  [1;32mB-geo[0m  | [1;35mB-geo[0m
             comment |      [1;32mO[0m  |    [1;35mO[0m
                  on |      [1;32mO[0m  |    [1;35mO[0m
                 the |      [1;32mO[0m  |    [1;35mO[0m
              report |      [1;32mO[0m  |    [1;35mO[0m
                   . |      [1;32mO[0m  |    [1;35mO[0m


# Transformers (Hugging Face)

**These below steps will necessary for NER**

- Read text lines and split them into tokens i.e. pre-tokenize using spaces. Clean the data before if necessary
    - sentences = [list of [list of tokens for each sentence] each with dynamic length]
    - labels = [list of [list of label-tags for each token] each with dynamic length but same as its corresponding list]
    
- Re-tokenize eaech sentence with the corresponding model tokenizer of interest and get the offset mapping
    - *make auto padding* = True
    - *is_split_into_words* = True
    - *return_offset* = True
    - *truncation* = True

## Model Specific Tokenization

In [56]:
from transformers import AutoTokenizer

In [57]:
roberta = 'roberta-base'

In [58]:
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta)

In [59]:
train_loader.dataset.sentences[rand_id]

'In a statement Monday , Mr. Peres said " there exists no basis in reality for the claims published " by the British newspaper , The Guardian .'

In [60]:
print(roberta_tokenizer(train_loader.dataset.sentences[rand_id], truncation=True, padding=True).tokens())

['<s>', 'In', 'Ġa', 'Ġstatement', 'ĠMonday', 'Ġ,', 'ĠMr', '.', 'ĠPe', 'res', 'Ġsaid', 'Ġ"', 'Ġthere', 'Ġexists', 'Ġno', 'Ġbasis', 'Ġin', 'Ġreality', 'Ġfor', 'Ġthe', 'Ġclaims', 'Ġpublished', 'Ġ"', 'Ġby', 'Ġthe', 'ĠBritish', 'Ġnewspaper', 'Ġ,', 'ĠThe', 'ĠGuardian', 'Ġ.', '</s>']


In [61]:
roberta_tokenized = roberta_tokenizer(train_loader.dataset.sentences[rand_id], truncation=True, padding=True)

In [62]:
from transformers import RobertaTokenizerFast

In [63]:
roberta_bert = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)

In [163]:
sample_texts = [train_loader.dataset.sentences[i].split() for i in range(rand_id, rand_id + 10)]
roberta_encodings = roberta_bert(sample_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
roberta_encodings

{'input_ids': [[0, 96, 10, 445, 302, 2156, 427, 4, 4119, 1535, 26, 22, 89, 8785, 117, 1453, 11, 2015, 13, 5, 1449, 1027, 22, 30, 5, 1089, 2924, 2156, 20, 8137, 479, 2, 1, 1, 1, 1, 1, 1], [0, 427, 4, 4119, 1535, 26, 20, 8137, 875, 63, 1566, 716, 15, 5, 22, 21921, 13794, 9, 391, 1704, 2339, 8, 45, 15, 6369, 4905, 479, 22, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 20, 266, 1027, 395, 5304, 3862, 703, 391, 1704, 2339, 59, 299, 3556, 2891, 11, 61, 427, 4, 4119, 1535, 2346, 1661, 5, 37449, 7, 391, 1704, 503, 11, 14873, 479, 2, 1, 1, 1, 1, 1], [0, 1870, 34, 393, 1474, 50, 2296, 5, 3924, 547, 6563, 14, 24, 34, 1748, 2398, 479, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 391, 1327, 2226, 1748, 2398, 148, 1104, 5688, 2178, 2156, 53, 31088, 63, 1748, 586, 11, 9633, 479, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1557, 1939, 1984, 4282, 1284, 161, 114, 2736, 394, 37, 708, 7, 146, 1564, 4555, 13, 961, 479, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],

In [164]:
sample_labels = [train_loader.dataset.labels[i].split() for i in range(rand_id, rand_id + 10)]

In [165]:
sample_labels_id = [[tag_map[i] for i in label] for label in sample_labels]

In [166]:
encoded_labels_array = []
for labels, offset_map in zip(sample_labels_id, roberta_encodings.offset_mapping):
    # Creating empty array of size offset_map
    encoded_labels = np.ones(len(offset_map)) * -100
    offset_map = np.array(offset_map)
    # The offset maps will have starting index = 0 if tokenized normally
    # If subword tokenization happens then offset_i at index[0] = offset_(i-1) index[1]
    encoded_labels[(offset_map[:,0] == 0) & (offset_map[:,1] != 0)] = labels
    
    encoded_labels_array.append(encoded_labels)
    

In [167]:
class RobertaNERDataset(Dataset):
    def __init__(self, sentence_encodings, encoded_labels):
        self.sentence_encodings = sentence_encodings
        self.encoded_labels = encoded_labels

    def __getitem__(self, idx):
        """
        Sentence Encodings are structured in:
        
            {
            'input_ids': [[...], [...], [...]]
            'attention_mask': [[...], [...], [...]]            
            }
            
        Return a dictionary with the requested index
        
        {
            'input_ids': tensor([encoded sentence at idx])
            'attention_mask': tensor([attention at idx])
            'labels': tensor([encoded labels at idx])        
        }
        
        """
        item = {key: torch.tensor(val[idx], dtype=torch.long) for key, val in self.sentence_encodings.items()}
        item['labels'] = torch.tensor(self.encoded_labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.encoded_labels)

In [168]:
roberta_encodings.pop("offset_mapping") # we don't want to pass this to the model
train_dataset = RobertaNERDataset(roberta_encodings, encoded_labels_array)

In [65]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [66]:
print(bert_tokenizer(train_loader.dataset.sentences[rand_id]).tokens())

['[CLS]', 'In', 'a', 'statement', 'Monday', ',', 'Mr', '.', 'Per', '##es', 'said', '"', 'there', 'exists', 'no', 'basis', 'in', 'reality', 'for', 'the', 'claims', 'published', '"', 'by', 'the', 'British', 'newspaper', ',', 'The', 'Guardian', '.', '[SEP]']


## Defining Fine-Tune Model

In [67]:
import torch.nn as nn
from transformers import RobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

In [68]:
RobertaConfig.get_config_dict('roberta-base')

({'architectures': ['RobertaForMaskedLM'],
  'attention_probs_dropout_prob': 0.1,
  'bos_token_id': 0,
  'eos_token_id': 2,
  'hidden_act': 'gelu',
  'hidden_dropout_prob': 0.1,
  'hidden_size': 768,
  'initializer_range': 0.02,
  'intermediate_size': 3072,
  'layer_norm_eps': 1e-05,
  'max_position_embeddings': 514,
  'model_type': 'roberta',
  'num_attention_heads': 12,
  'num_hidden_layers': 12,
  'pad_token_id': 1,
  'type_vocab_size': 1,
  'vocab_size': 50265},
 {})

In [69]:
?TokenClassifierOutput

[0;31mInit signature:[0m
[0mTokenClassifierOutput[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mloss[0m[0;34m:[0m [0mUnion[0m[0;34m[[0m[0mtorch[0m[0;34m.[0m[0mFloatTensor[0m[0;34m,[0m [0mNoneType[0m[0;34m][0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mlogits[0m[0;34m:[0m [0mtorch[0m[0;34m.[0m[0mFloatTensor[0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mhidden_states[0m[0;34m:[0m [0mUnion[0m[0;34m[[0m[0mTuple[0m[0;34m[[0m[0mtorch[0m[0;34m.[0m[0mFloatTensor[0m[0;34m][0m[0;34m,[0m [0mNoneType[0m[0;34m][0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mattentions[0m[0;34m:[0m [0mUnion[0m[0;34m[[0m[0mTuple[0m[0;34m[[0m[0mtorch[0m[0;34m.[0m[0mFloatTensor[0m[0;34m][0m[0;34m,[0m [0mNoneType[0m[0;34m][0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m[0;34m)[0m [0;34m->[0m [0;32mNone[0m[0;34m[0m[0;34m[0m[0m
[0

In [70]:
class RobertaNER(RobertaPreTrainedModel):
    config_class = RobertaConfig
    
    def __init__(self, config):
        super().__init__(config)
        
        self.num_labels = config.num_labels
        
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
        self.init_weights()
        
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
        
        outputs = self.roberta(input_ids, attention_mask, token_type_ids, **kwargs)
        
        sequence_output = self.dropout(outputs[0])
        
        logits = self.classifier(sequence_output)
        
        loss = None
        
        if labels is not None:
            loss_function = nn.CrossEntropyLoss()
            loss = loss_function(logits.view(-1, logits.shape[2]), labels.view(-1))
        
        return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)

In [71]:
from transformers import AutoConfig

In [72]:
roberta_custom_config = AutoConfig.from_pretrained(roberta, num_labels=len(tag_map), id2label=tag_map, label2id=reverse_tag_map)

In [73]:
r_ner = RobertaNER.from_pretrained('roberta-base', config=roberta_custom_config).to(device)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaNER: ['lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaNER from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaNER from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaNER were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.weight', 'roberta.embeddings.position_ids', 'classifier.bias']
You should probably TRAIN this model on a down-str

In [74]:
roberta_custom_config

RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "B-art": 8,
    "B-eve": 14,
    "B-geo": 1,
    "B-gpe": 2,
    "B-nat": 13,
    "B-org": 5,
    "B-per": 3,
    "B-tim": 7,
    "I-art": 9,
    "I-eve": 15,
    "I-geo": 4,
    "I-gpe": 11,
    "I-nat": 16,
    "I-org": 6,
    "I-per": 10,
    "I-tim": 12,
    "O": 0
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "0": "O",
    "1": "B-geo",
    "2": "B-gpe",
    "3": "B-per",
    "4": "I-geo",
    "5": "B-org",
    "6": "I-org",
    "7": "B-tim",
    "8": "B-art",
    "9": "I-art",
    "10": "I-per",
    "11": "I-gpe",
    "12": "I-tim",
    "13": "B-nat",
    "14": "B-eve",
    "15": "I-eve",
    "16": "I-nat"
  },
  "layer_norm_eps": 1e-05,
  

In [75]:
roberta_custom_config.num_labels

17

In [76]:
input_ids = roberta_tokenizer.encode(train_loader.dataset.sentences[rand_id], truncation=True, padding=True, return_tensors='pt').to(device)

In [77]:
predictions = r_ner(input_ids).logits

In [78]:
pred_tags = predictions.argmax(dim=-1)

In [79]:
pred_tags

tensor([[13, 13, 13, 13, 13, 13, 13, 13, 13, 10, 13, 13, 13, 10, 10, 13, 11, 16,
         13, 11, 13, 10, 13, 13, 13, 13, 13, 13, 11,  6, 13, 13]],
       device='cuda:0')

In [80]:
len(train_loader.dataset.labels[rand_id].split())

28

In [81]:
len([reverse_tag_map[i] for i in pred_tags[0].detach().cpu().numpy()])

32

In [82]:
tokenized = roberta_tokenizer(train_loader.dataset.sentences[rand_id], truncation=True, padding=True)

In [83]:
roberta_tokenizer.convert_ids_to_tokens(tokenized['input_ids'])

['<s>',
 'In',
 'Ġa',
 'Ġstatement',
 'ĠMonday',
 'Ġ,',
 'ĠMr',
 '.',
 'ĠPe',
 'res',
 'Ġsaid',
 'Ġ"',
 'Ġthere',
 'Ġexists',
 'Ġno',
 'Ġbasis',
 'Ġin',
 'Ġreality',
 'Ġfor',
 'Ġthe',
 'Ġclaims',
 'Ġpublished',
 'Ġ"',
 'Ġby',
 'Ġthe',
 'ĠBritish',
 'Ġnewspaper',
 'Ġ,',
 'ĠThe',
 'ĠGuardian',
 'Ġ.',
 '</s>']

## Model Specific Detokenizer returning the actual sentence

In [84]:
roberta_tokenizer.decode(input_ids.detach().cpu()[0], skip_special_tokens=True)

'In a statement Monday, Mr. Peres said " there exists no basis in reality for the claims published " by the British newspaper, The Guardian.'

## Data Preparation for 🤗-Transformers

In [85]:
train_data = iter(train_loader)
sample_text_batch = train_data.next()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

In [86]:
sample_text = [reverse_vocab_map[i.item()] for i in sample_text_batch[0][10] if i.item() != vocab['<PAD>']]
' '.join(sample_text)

'" What is the matter with your shirt ? " inquired the Tramp .'

In [87]:
from transformers import DistilBertTokenizerFast
dbert_tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

In [88]:
train_dbert = dbert_tokenizer(sample_text, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)

In [89]:
train_roberta = roberta_tokenizer(' '.join(sample_text), return_offsets_mapping=True, padding=True, truncation=True)

In [90]:
train_roberta

{'input_ids': [0, 113, 653, 16, 5, 948, 19, 110, 6399, 17487, 22, 38276, 5, 2393, 3914, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 1), (2, 6), (7, 9), (10, 13), (14, 20), (21, 25), (26, 30), (31, 36), (37, 38), (39, 40), (41, 49), (50, 53), (54, 56), (56, 59), (60, 61), (0, 0)]}

In [91]:
len(sample_text), len(train_roberta.tokens())

(14, 17)

In [92]:
print(sample_text, train_roberta.tokens(), train_dbert.tokens(), sep='\n')

['"', 'What', 'is', 'the', 'matter', 'with', 'your', 'shirt', '?', '"', 'inquired', 'the', 'Tramp', '.']
['<s>', '"', 'ĠWhat', 'Ġis', 'Ġthe', 'Ġmatter', 'Ġwith', 'Ġyour', 'Ġshirt', 'Ġ?', 'Ġ"', 'Ġinquired', 'Ġthe', 'ĠTr', 'amp', 'Ġ.', '</s>']
['[CLS]', '"', 'What', 'is', 'the', 'matter', 'with', 'your', 'shirt', '?', '"', 'inquired', 'the', 'T', '##ram', '##p', '.', '[SEP]']


## Fine-Tuning using 🤗-`Trainer`

In [139]:
from transformers import Trainer, TrainingArguments

In [169]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
train_dataset = ner_data(sentences_path=train_sentence,
                         labels_path=train_labels,
                         vocab_map=vocab,
                         tags_map=tag_map)
valid_dataset = ner_data(sentences_path=valid_sentence,
                         labels_path=valid_labels,
                         vocab_map=vocab,
                         tags_map=tag_map)

In [170]:
trainer = Trainer(model=r_ner, 
                  args=training_args,
                  train_dataset=train_dataset,
                 )

In [171]:
trainer.train()

***** Running training *****
  Num examples = 10
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=3, training_loss=2.8865461349487305, metrics={'train_runtime': 0.7213, 'train_samples_per_second': 41.589, 'train_steps_per_second': 4.159, 'total_flos': 581872459320.0, 'train_loss': 2.8865461349487305, 'epoch': 3.0})

In [190]:
z = next(iter(train_dataset))

In [198]:
z['input_ids'] = z['input_ids'].view(1, -1).to('cuda')
z['attention_mask'] = z['attention_mask'].view(1, -1).to('cuda')
z.pop('labels')

tensor([-100,    0,    0,    0,    7,    0,    3, -100,   10, -100,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    2,    0,    0,    5,    6,    0, -100, -100, -100, -100, -100,
        -100, -100])

In [200]:
with torch.no_grad():
    r_ner.eval()
    preeds = r_ner(**z)

In [206]:
preds = preeds['logits'][0].detach().cpu().numpy()

In [211]:
true_labels = np.array([-100,    0,    0,    0,    7,    0,    3, -100,   10, -100,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    2,    0,    0,    5,    6,    0, -100, -100, -100, -100, -100,
        -100, -100])
mask = true_labels != -100

<span style="color:red;font-weight:700;font-size:20px">
   Need to keep track of the pre model-specific tokenized text for validation
</span>

In [245]:
true_text_reversed = sample_texts[0]
true_tags_reversed = [reverse_tag_map[i] for i in true_labels[mask]]
predicted_tags_reversed = [reverse_tag_map[i] for i in preds.argmax(-1)[mask]]

In [247]:
print("{:>20} | {:>7} | {:>5}".format("Token", "Pred", "True NER Tag"))
print('-'*70)
for s, p, t in zip(true_text_reversed, predicted_tags_reversed, true_tags_reversed):
    if p == t:
        p = green + p + reset
    else:
        p = red + p + reset
    t = purple + t + reset
    print("{:>20} | {:>17}  | {:>15}".format(s, p, t))

               Token |    Pred | True NER Tag
----------------------------------------------------------------------
                  In |  [1;31mI-per[0m  |    [1;35mO[0m
                   a |  [1;31mB-nat[0m  |    [1;35mO[0m
           statement |  [1;31mB-nat[0m  |    [1;35mO[0m
              Monday |  [1;31mB-nat[0m  | [1;35mB-tim[0m
                   , |  [1;31mB-nat[0m  |    [1;35mO[0m
                 Mr. |  [1;31mB-nat[0m  | [1;35mB-per[0m
               Peres |  [1;31mB-nat[0m  | [1;35mI-per[0m
                said |  [1;31mB-nat[0m  |    [1;35mO[0m
                   " |  [1;31mB-nat[0m  |    [1;35mO[0m
               there |  [1;31mB-nat[0m  |    [1;35mO[0m
              exists |  [1;31mI-per[0m  |    [1;35mO[0m
                  no |  [1;31mI-per[0m  |    [1;35mO[0m
               basis |  [1;31mB-nat[0m  |    [1;35mO[0m
                  in |  [1;31mI-gpe[0m  |    [1;35mO[0m
             reality |  [1;31mI-nat[0m

In [239]:
for t, g in zip(sample_texts[0], sample_labels[0]):
    print(f"{t:>20}, {g:>9}")

                  In,         O
                   a,         O
           statement,         O
              Monday,     B-tim
                   ,,         O
                 Mr.,     B-per
               Peres,     I-per
                said,         O
                   ",         O
               there,         O
              exists,         O
                  no,         O
               basis,         O
                  in,         O
             reality,         O
                 for,         O
                 the,         O
              claims,         O
           published,         O
                   ",         O
                  by,         O
                 the,         O
             British,     B-gpe
           newspaper,         O
                   ,,         O
                 The,     B-org
            Guardian,     I-org
                   .,         O


In [243]:
' '.join(sample_texts[0]), len(sample_texts[0])

('In a statement Monday , Mr. Peres said " there exists no basis in reality for the claims published " by the British newspaper , The Guardian .',
 28)

In [244]:
true_text_reversed, len(true_text_reversed.split())

(' In a statement Monday, Mr. Peres said " there exists no basis in reality for the claims published " by the British newspaper, The Guardian.',
 25)