<a href="https://colab.research.google.com/github/sangeetsaurabh/tweet_sentiment_extraction/blob/master/bilstm_pytorch/biLSTM_pytorch_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 - BiLSTM model to predict selected text

This is a BiLSTM model in PyTorch that reads each tweet and picks out the phrase that should be selected.

It is framed as a token classification problem: for each word in the tweet, the model predicts whether that word is part of the selected phrase.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
data_folder = "/content/drive/My Drive/tweet-sentiment-extraction/data"
tmp_folder = '/tmp'

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext import data
from torchtext import datasets

import spacy
import numpy as np

import time
import random

import pandas as pd

Next, we'll set the random seeds for reproducibility.

In [5]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

#### Define the spaCy tokenizer

In [6]:
spacy_en = spacy.load('en')
def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

#### Define the fields for the PyTorch data loader

In [7]:
TEXT = data.Field(tokenize = tokenize_en,
                  init_token = '<sos>', 
                  eos_token = '<eos>',   
                  lower = True) 
                  #include_lengths = True)
SEL_TEXT = data.Field(tokenize = tokenize_en,
                      init_token = '<sos>', 
                      eos_token = '<eos>',   
                      lower = True,)
LABEL = data.LabelField()
IDX = data.Field(sequential=False,use_vocab=False,preprocessing=int)

# Note: 'sel_text' reuses TEXT (not SEL_TEXT), so tweets and selected phrases share one vocabulary
fields = [('text', TEXT),('sel_text', TEXT), ('idx', IDX), ('label', LABEL)]

We use `TabularDataset` to read the CSV files.

In [8]:
train_dataset, valid_dataset, test_dataset = data.TabularDataset.splits(
                                        path = data_folder,
                                        train = 'train_transform.csv',
                                        validation = 'valid_transform.csv',
                                        test = 'test_transform.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

We can check how many examples are in each section of the dataset by checking their length.

In [9]:
print(f"Number of training examples: {len(train_dataset)}")
print(f"Number of validation examples: {len(valid_dataset)}")
print(f"Number of testing examples: {len(test_dataset)}")

Number of training examples: 26106
Number of validation examples: 1374
Number of testing examples: 3534


Let's print out an example:

In [10]:
print(vars(train_dataset.examples[0]))

{'text': ['neutral', '-pron-', '`', 'would', 'have', 'respond', ',', 'if', '-pron-', 'be', 'go', 'neutral'], 'sel_text': ['-pron-', '`', 'would', 'have', 'respond', ',', 'if', '-pron-', 'be', 'go'], 'idx': 0, 'label': 'neutral'}


In [11]:
print(vars(train_dataset.examples[5]))

{'text': ['neutral', 'url', '-', 'some', 'shameless', 'plug', 'for', 'the', 'best', 'ranger', 'forum', 'on', 'earth', 'neutral'], 'sel_text': ['url', '-', 'some', 'shameless', 'plug', 'for', 'the', 'best', 'ranger', 'forum', 'on', 'earth'], 'idx': 5, 'label': 'neutral'}


Next, we'll build the vocabulary - a mapping of tokens to integers. 


We also load the [GloVe](https://nlp.stanford.edu/projects/glove/) pre-trained token embeddings. Specifically, we use the 100-dimensional Twitter embeddings (`glove.twitter.27B.100d`), which were trained on 27 billion tokens. Using pre-trained embeddings usually leads to improved performance, although admittedly the dataset used in this tutorial is too small to take full advantage of them.

`unk_init` is used to initialize the embeddings of tokens that are not in the pre-trained embedding vocabulary. By default these embeddings are set to zeros. It is better not to initialize them all to the same value, so we draw them from a Normal (Gaussian) distribution.
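
To make the role of `unk_init` concrete, here is a minimal pure-Python sketch of filling an embedding table: tokens found in the pre-trained vectors copy them, and the rest are drawn from a standard normal. The helper `build_embedding_table`, the `pretrained` dictionary and the toy 4-dimensional vectors are illustrative assumptions, not torchtext internals.

```python
import random

def build_embedding_table(itos, pretrained, dim, seed=1234):
    """Return one vector per vocabulary entry.

    Tokens present in `pretrained` reuse their vector; missing tokens get
    values drawn from N(0, 1) instead of all zeros.
    """
    rng = random.Random(seed)
    table = []
    for token in itos:
        if token in pretrained:
            table.append(list(pretrained[token]))
        else:
            table.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return table

# Toy vocabulary and toy "pre-trained" vectors (made up for illustration).
itos = ['<unk>', '<pad>', 'scary', 'yellowish']
pretrained = {'scary': [0.1, 0.2, 0.3, 0.4]}

table = build_embedding_table(itos, pretrained, dim=4)
print(table[2])   # copied from the pre-trained vectors
print(table[3])   # randomly initialized because 'yellowish' is missing
```

With random initialization, two out-of-vocabulary tokens start from different points, so the model can learn to tell them apart during training.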


In [13]:
MIN_FREQ = 1

TEXT.build_vocab(train_dataset, 
                 min_freq = MIN_FREQ,
                 vectors = "glove.twitter.27B.100d",
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_dataset)


We can check how many tokens are in our vocabulary by getting its length:

In [14]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")

Unique tokens in TEXT vocabulary: 21271


Exploring the vocabulary, we can check the most common tokens within our texts:

In [15]:
print(TEXT.vocab.freqs.most_common(20))

[('-pron-', 62876), ('to', 24512), ('!', 22220), ('.', 21422), ('neutral', 21080), ('be', 19462), ('`', 17090), ('positive', 16409), ('negative', 14763), ('the', 12957), (',', 12334), ('going', 9845), ('*', 7350), ('and', 7152), ('have', 7008), ('?', 6316), ('s', 5803), ('in', 5540), ('...', 5535), ('for', 5244)]


The final part of data preparation is handling the iterator. 

This will be iterated over to return batches of data to process. Here, we set the batch size and the `device` - which is used to place the batches of tensors on our GPU. 

In [16]:
BATCH_SIZE = 32
#BATCH_SIZE = 4

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_dataset, valid_dataset, test_dataset), 
     batch_size = BATCH_SIZE,
     sort_within_batch = True,
     sort_key = lambda x : len(x.text),
     device = device)

cuda
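
As a rough illustration of why bucketing by length helps, the sketch below sorts examples by length before cutting them into batches and counts the padding needed. It is a simplification under stated assumptions: the helpers `bucket_batches`, `pad_batch` and `count_pads` are hypothetical, and the real `BucketIterator` also shuffles and works on tensors.

```python
def bucket_batches(examples, batch_size):
    """Sort examples by length, then cut them into consecutive batches."""
    ordered = sorted(examples, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def pad_batch(batch, pad_token='<pad>'):
    """Pad every example in a batch to the length of the longest one."""
    width = max(len(example) for example in batch)
    return [example + [pad_token] * (width - len(example)) for example in batch]

def count_pads(batches, pad_token='<pad>'):
    """Count how many pad tokens were added across all batches."""
    return sum(tok == pad_token for batch in batches for example in batch for tok in example)

examples = [
    ['a'],
    ['a', 'b', 'c', 'd', 'e'],
    ['a', 'b'],
    ['a', 'b', 'c', 'd'],
]

# Batches taken in the original order versus batches of similar length.
naive = [pad_batch(examples[i:i + 2]) for i in range(0, len(examples), 2)]
bucketed = [pad_batch(batch) for batch in bucket_batches(examples, batch_size=2)]

print(count_pads(naive), count_pads(bucketed))  # 6 2
```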


In [17]:
TEXT.vocab.itos[0:10]

['<unk>', '<pad>', '<sos>', '<eos>', '-pron-', 'to', '!', '.', 'neutral', 'be']
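
The ordering above (special tokens first, then tokens by decreasing frequency) can be mimicked with a few lines of plain Python. This is only a sketch of the idea: `build_vocab` and `numericalize` are hypothetical helpers, and the alphabetical tie-breaking is an assumption made to keep the order deterministic.

```python
from collections import Counter

def build_vocab(token_lists, specials=('<unk>', '<pad>', '<sos>', '<eos>'), min_freq=1):
    """Map tokens to integers: specials first, then most frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    frequent = sorted(
        (tok for tok, n in counts.items() if n >= min_freq),
        key=lambda tok: (-counts[tok], tok),  # frequency first, then alphabetical
    )
    itos = list(specials) + frequent
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token by its index, falling back to <unk>."""
    unk = stoi['<unk>']
    return [stoi.get(tok, unk) for tok in tokens]

corpus = [
    ['neutral', 'be', 'go', 'neutral'],
    ['positive', 'be', 'awesome', 'positive'],
]
itos, stoi = build_vocab(corpus)

print(itos[:6])                                  # ['<unk>', '<pad>', '<sos>', '<eos>', 'be', 'neutral']
print(numericalize(['be', 'yellowish'], stoi))   # [4, 0]
```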

In [18]:
unk_token_idx = TEXT.vocab.stoi['<unk>']
pad_token_idx = TEXT.vocab.stoi['<pad>']

#### Output label preparation

In [19]:
# Build one tag per token of an input tweet (given as vocabulary indices):
# 0 - token is not part of the selected phrase
# 1 - token is part of the selected phrase
# 2 - pad token
# 3 - unknown token
# The first two and last two positions are always tagged 0. Matching is by
# token membership, so a repeated token is tagged 1 at every occurrence.

def find_subset_index(haystack, needle): 
    output = [0,0]
    for item in haystack[2:len(haystack)-2]:
        if item == unk_token_idx:
          output.append(3)
        elif item == pad_token_idx:
          output.append(2)
        elif item in needle:
          output.append(1)
        else:
          output.append(0)
    output = output + [0,0]
    return output
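
A small worked example may help. The block below restates the function in self-contained form, assuming `<unk>` is index 0 and `<pad>` is index 1 as in this notebook's vocabulary, and runs it on a shortened, made-up tweet built from indices that appear later in the notebook (12 = negative, 124 = look, 21 = in, 957 = scary).

```python
UNK_IDX, PAD_IDX = 0, 1  # assumed positions of <unk> and <pad> in the vocabulary

def find_subset_index(haystack, needle):
    output = [0, 0]  # <sos> and the leading sentiment token
    for item in haystack[2:len(haystack) - 2]:
        if item == UNK_IDX:
            output.append(3)
        elif item == PAD_IDX:
            output.append(2)
        elif item in needle:
            output.append(1)
        else:
            output.append(0)
    return output + [0, 0]  # the last two positions

# <sos> negative look in <unk> look scary negative <eos>
tweet = [2, 12, 124, 21, 0, 124, 957, 12, 3]
# <sos> look scary <eos>
selected = [2, 124, 957, 3]

print(find_subset_index(tweet, selected))  # [0, 0, 1, 0, 3, 1, 1, 0, 0]
```

Both occurrences of "look" receive tag 1 even though only the second one belongs to the selected phrase, because the check is `item in needle` rather than a positional span match.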

In [20]:
# Function to prepare the output tags for a batch
def return_output(src, trg):
    output = []
    # [sent len, batch size] -> [batch size, sent len]
    src = src.permute(1,0)
    trg = trg.permute(1,0)
    for bs in range(src.shape[0]):
        trg_sentence = trg[bs,:].cpu().detach().numpy()
        src_sentence = src[bs,:].cpu().detach().numpy()

        # one tag per source token (find_subset_index always returns a list)
        trg_index = find_subset_index(src_sentence, trg_sentence)
        output.append(trg_index)
    # back to [sent len, batch size] to match the model's predictions
    output = torch.tensor(output).permute(1,0)
    return output


In [21]:
for i, batch in enumerate(valid_iterator):
  
  src = batch.text
  trg = batch.sel_text
  idx = batch.idx
  

  if i > 10:
    
    print(trg.permute(1,0))
    print("\n")
    x = return_output(src, trg)
    print(src.permute(1,0))
    print("\n")
    print(x.permute(1,0))
    print(idx)
    break

tensor([[    2,     4,   196,     9,   207,     3,     1,     1,     1,     1,
             1,     1],
        [    2,     4,  9972,     3,     1,     1,     1,     1,     1,     1,
             1,     1],
        [    2,   775,     3,     1,     1,     1,     1,     1,     1,     1,
             1,     1],
        [    2,    36, 19014,    38,    42,     7,     4,    80,    63,   340,
             7,     3],
        [    2,  4814,    37,   291,   179,  1567,    14,     0,    17,     0,
             7,     3],
        [    2,    44,     3,     1,     1,     1,     1,     1,     1,     1,
             1,     1],
        [    2,   907,    26,   952,     6,   952,     6,   952,     6,   225,
            47,     3],
        [    2,    60,    23,    13,  3047,     7,     3,     1,     1,     1,
             1,     1],
        [    2,  1384,   171,    22,     3,     1,     1,     1,     1,     1,
             1,     1],
        [    2,     4,  1013,     9,    91,     7,     3,     1,     1,  

## Building the Model


`nn.Embedding` is an embedding layer and the input dimension should be the size of the input (text) vocabulary. We tell it what the index of the padding token is so it does not update the padding token's embedding entry.

`nn.LSTM` is the LSTM. We apply dropout as regularization between the layers, if we are using more than one.

`nn.Linear` defines the linear layer to make predictions using the LSTM outputs. We double the size of the input if we are using a bi-directional LSTM. The output dimension should be the number of tag classes (four in this notebook).

We also define a dropout layer with `nn.Dropout`, which we use in the `forward` method to apply dropout to the embeddings and the outputs of the final layer of the LSTM.

In [22]:
class BiLSTMPOSTagger(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 n_layers, 
                 bidirectional, 
                 dropout, 
                 pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        
        self.lstm = nn.LSTM(embedding_dim, 
                            hidden_dim, 
                            num_layers = n_layers, 
                            bidirectional = bidirectional,
                            dropout = dropout if n_layers > 1 else 0)
        
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):

        #text = [sent len, batch size]
        
        #pass text through embedding layer
        embedded = self.dropout(self.embedding(text))
        
        #embedded = [sent len, batch size, emb dim]
        
        #pass embeddings into LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        
        #outputs holds the backward and forward hidden states in the final layer
        #hidden and cell are the backward and forward hidden and cell states at the final time-step
        
        #output = [sent len, batch size, hid dim * n directions]
        #hidden/cell = [n layers * n directions, batch size, hid dim]
        
        #we use our outputs to make a prediction of what the tag should be
        predictions = self.fc(self.dropout(outputs))
        
        #predictions = [sent len, batch size, output dim]
        
        return predictions
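
The shape comments in `forward` can be checked on random data. The sketch below rebuilds the three layers on their own with a toy vocabulary size, sentence length and batch size, all chosen arbitrarily for illustration, and prints the tensor shapes at each step.

```python
import torch
import torch.nn as nn

sent_len, batch_size = 7, 3
vocab_size, emb_dim, hid_dim, n_tags = 50, 100, 128, 4

embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=1)
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=2, bidirectional=True)
fc = nn.Linear(hid_dim * 2, n_tags)  # forward and backward states are concatenated

text = torch.randint(0, vocab_size, (sent_len, batch_size))  # [sent len, batch size]
embedded = embedding(text)
outputs, (hidden, cell) = lstm(embedded)
predictions = fc(outputs)

print(embedded.shape)     # torch.Size([7, 3, 100])
print(outputs.shape)      # torch.Size([7, 3, 256])
print(hidden.shape)       # torch.Size([4, 3, 128])
print(predictions.shape)  # torch.Size([7, 3, 4])
```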

## Training the Model

Next, we instantiate the model. We need to ensure the embedding dimension matches that of the GloVe embeddings we loaded earlier.

The rest of the hyperparameters have been chosen as sensible defaults, though there may be a combination that performs better on this model and dataset.

The input dimension is the size of the `TEXT` vocabulary. The output dimension is fixed at 4, one per tag produced by `find_subset_index` (0 to 3). The padding index is obtained using the vocabulary and the `Field` of the text.

In [23]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = 4
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = BiLSTMPOSTagger(INPUT_DIM, 
                        EMBEDDING_DIM, 
                        HIDDEN_DIM, 
                        OUTPUT_DIM, 
                        N_LAYERS, 
                        BIDIRECTIONAL, 
                        DROPOUT, 
                        PAD_IDX)

We initialize the weights from a simple Normal distribution. Again, there may be a better initialization scheme for this model and dataset.

In [24]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean = 0, std = 0.1)
        
model.apply(init_weights)

BiLSTMPOSTagger(
  (embedding): Embedding(21271, 100, padding_idx=1)
  (lstm): LSTM(100, 128, num_layers=2, dropout=0.25, bidirectional=True)
  (fc): Linear(in_features=256, out_features=4, bias=True)
  (dropout): Dropout(p=0.25, inplace=False)
)

Next, a small function to tell us how many parameters are in our model. Useful for comparing different models.

In [25]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,758,912 trainable parameters
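
The reported figure can be reproduced by hand from the layer sizes printed above. The arithmetic below assumes the standard PyTorch LSTM layout, where each direction of each layer has four gates with input weights, recurrent weights and two bias vectors. The helper name `lstm_direction_params` is made up for this sketch.

```python
def lstm_direction_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors.
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

vocab_size, emb_dim, hid_dim, n_tags = 21271, 100, 128, 4

embedding = vocab_size * emb_dim
layer1 = 2 * lstm_direction_params(emb_dim, hid_dim)       # forward + backward
layer2 = 2 * lstm_direction_params(2 * hid_dim, hid_dim)   # input is the 256-dim layer-1 output
linear = (2 * hid_dim) * n_tags + n_tags                   # weights + bias

total = embedding + layer1 + layer2 + linear
print(f'{total:,}')  # 2,758,912
```

About 77% of the parameters sit in the embedding table (2,127,100 of 2,758,912).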


We'll now initialize our model's embedding layer with the pre-trained embedding values we loaded earlier.

This is done by getting them from the vocab's `.vectors` attribute and then calling the in-place `.copy_` method to overwrite the embedding layer's current weights.

In [26]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([21271, 100])


In [27]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [ 0.4298,  0.8205, -1.4562,  ...,  1.4802,  0.2942,  1.3924],
        ...,
        [-1.8372, -0.6600,  0.7681,  ..., -0.8051,  0.6817,  0.0477],
        [-0.2065,  0.1973,  0.2503,  ...,  0.8105, -0.3320, -0.0935],
        [ 0.0168, -0.6128, -0.0734,  ...,  0.8139, -0.4705, -0.2133]])

It's common to initialize the embedding of the pad token to all zeros. This, along with setting the `padding_idx` in the model's embedding layer, means that the embedding should always output a tensor full of zeros when a pad token is input.

In [28]:
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.4298,  0.8205, -1.4562,  ...,  1.4802,  0.2942,  1.3924],
        ...,
        [-1.8372, -0.6600,  0.7681,  ..., -0.8051,  0.6817,  0.0477],
        [-0.2065,  0.1973,  0.2503,  ...,  0.8105, -0.3320, -0.0935],
        [ 0.0168, -0.6128, -0.0734,  ...,  0.8139, -0.4705, -0.2133]])
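
The effect of `padding_idx` can be seen on a tiny embedding layer. In this sketch (toy sizes, chosen for illustration) the pad row starts at zero and receives no gradient. Overwriting the weights afterwards, as `copy_` does above, replaces that row, which is why it is zeroed again by hand.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(5, 3, padding_idx=1)

# The pad row starts at zero ...
print(emb.weight[1])

# ... and receives no gradient, so an optimizer step leaves it at zero.
tokens = torch.tensor([0, 1, 1, 4])
emb(tokens).sum().backward()

print(emb.weight.grad[1])  # tensor([0., 0., 0.])
print(emb.weight.grad[0])  # tensor([1., 1., 1.])
```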


We then define our optimizer, used to update our parameters w.r.t. their gradients. We use Adam with the default learning rate.

In [29]:
optimizer = optim.Adam(model.parameters())

Next, we define our loss function, cross-entropy loss.

Our tags are the integers produced by `find_subset_index`, and tag 2 marks `<pad>` positions. Padding exists because all sentences within a batch need to be the same size. However, we don't want to calculate the loss when the target is a `<pad>` tag, as we aren't training our model to recognize padding tokens.

We handle this by setting the `ignore_index` in our loss function to the pad tag, 2.

In [30]:
TAG_PAD_IDX = 2

criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
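
To see what `ignore_index` does, the sketch below compares the masked loss with a loss computed only on the non-pad positions. The logits and tags are random toy values, not data from this notebook.

```python
import torch
import torch.nn as nn

TAG_PAD_IDX = 2
torch.manual_seed(0)

logits = torch.randn(6, 4)                # 6 tokens, 4 tag classes
tags = torch.tensor([0, 1, 2, 3, 2, 0])   # two positions are padding

# Loss with padding positions ignored by the criterion itself.
masked = nn.CrossEntropyLoss(ignore_index=TAG_PAD_IDX)(logits, tags)

# Same loss computed by dropping the padding positions by hand.
keep = tags != TAG_PAD_IDX
manual = nn.CrossEntropyLoss()(logits[keep], tags[keep])

print(torch.allclose(masked, manual))  # True
```

Ignored positions contribute neither to the sum nor to the count used for the mean, so padding does not dilute the loss.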

We then place our model and loss function on our GPU, if we have one.

In [31]:
model = model.to(device)
criterion = criterion.to(device)

We will be using the loss value between our predicted and actual tags to train the network, but ideally we'd like a more interpretable way to see how well our model is doing - accuracy.

The issue is that we don't want to calculate accuracy over the `<pad>` tokens as we aren't interested in predicting them.

The function below only calculates accuracy over non-padded tokens. `non_pad_elements` is a tensor containing the indices of the non-pad tokens within an input batch. We then compare the predictions of those elements with the labels to get a count of how many predictions were correct. We then divide this by the number of non-pad elements to get our accuracy value over the batch.

In [32]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # get the index of the max probability
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    # divide by a plain Python int so the result stays on the same device as `correct`
    return correct.sum().float() / y[non_pad_elements].shape[0]
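
The same idea in plain Python, as a sketch on toy tag lists: positions whose reference tag is the pad tag are dropped before counting. The helper `masked_accuracy` is hypothetical and only mirrors the logic of the function above.

```python
def masked_accuracy(pred_tags, gold_tags, pad_tag=2):
    """Accuracy over the positions whose reference tag is not padding."""
    pairs = [(p, g) for p, g in zip(pred_tags, gold_tags) if g != pad_tag]
    correct = sum(1 for p, g in pairs if p == g)
    return correct / len(pairs)

gold = [0, 1, 1, 0, 2, 2]   # the last two positions are padding
pred = [0, 1, 0, 0, 0, 0]

print(masked_accuracy(pred, gold))  # 0.75
```

Counting the two padded positions as errors would give 3/6 = 0.5 instead, which understates how well the real tokens were tagged.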

Next is the function that handles training our model.

We first set the model to `train` mode to turn on dropout/batch-norm/etc. (if used). Then we iterate over our iterator, which returns a batch of examples. 

For each batch: 
- we build the target tags from the tweet and its selected text with `return_output`
- we zero the gradients over the parameters from the last gradient calculation
- pass the batch of text through the model to get predictions
- since `nn.CrossEntropyLoss` expects the class dimension second, we flatten the predictions to `[sent len * batch size, output dim]` and the tags to `[sent len * batch size]`
- calculate the loss and accuracy between the predicted tags and actual tags
- call `backward` to calculate the gradients of the parameters w.r.t. the loss
- take an optimizer `step` to update the parameters
- add to the running total of loss and accuracy

In [33]:
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
    
        text = batch.text
        sel_text = batch.sel_text
        tags = return_output(text, sel_text)

        text, tags = text.to(device), tags.to(device)
                
        optimizer.zero_grad()
        
        #text = [sent len, batch size]
        
        predictions = model(text)
        
        #predictions = [sent len, batch size, output dim]
        #tags = [sent len, batch size]
        
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.contiguous().view(-1)
        
        #predictions = [sent len * batch size, output dim]
        #tags = [sent len * batch size]
        
        loss = criterion(predictions, tags)
                
        acc = categorical_accuracy(predictions, tags, tag_pad_idx)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [34]:
def evaluate(model, iterator, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text = batch.text
            sel_text = batch.sel_text
            tags = return_output(text, sel_text)

            text, tags = text.to(device), tags.to(device)
            
            predictions = model(text)
            
            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.contiguous().view(-1)
            
            loss = criterion(predictions, tags)
            
            acc = categorical_accuracy(predictions, tags, tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Next, we have a small function that tells us how long an epoch takes.

In [35]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we train our model!

After each epoch we check if our model has achieved the best validation loss so far. If it has then we save the parameters of this model and we will use these "best" parameters to calculate performance over our test set.

In [36]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01 | Epoch Time: 0m 12s
	Train Loss: 0.318 | Train Acc: 84.31%
	 Val. Loss: 0.510 |  Val. Acc: 85.29%
Epoch: 02 | Epoch Time: 0m 11s
	Train Loss: 0.270 | Train Acc: 87.12%
	 Val. Loss: 0.531 |  Val. Acc: 85.63%
Epoch: 03 | Epoch Time: 0m 11s
	Train Loss: 0.255 | Train Acc: 87.87%
	 Val. Loss: 0.533 |  Val. Acc: 85.81%
Epoch: 04 | Epoch Time: 0m 11s
	Train Loss: 0.242 | Train Acc: 88.48%
	 Val. Loss: 0.553 |  Val. Acc: 85.92%
Epoch: 05 | Epoch Time: 0m 11s
	Train Loss: 0.227 | Train Acc: 89.18%
	 Val. Loss: 0.572 |  Val. Acc: 85.67%
Epoch: 06 | Epoch Time: 0m 11s
	Train Loss: 0.210 | Train Acc: 90.09%
	 Val. Loss: 0.609 |  Val. Acc: 85.37%
Epoch: 07 | Epoch Time: 0m 11s
	Train Loss: 0.194 | Train Acc: 90.97%
	 Val. Loss: 0.617 |  Val. Acc: 84.76%
Epoch: 08 | Epoch Time: 0m 11s
	Train Loss: 0.178 | Train Acc: 91.85%
	 Val. Loss: 0.664 |  Val. Acc: 85.28%
Epoch: 09 | Epoch Time: 0m 11s
	Train Loss: 0.164 | Train Acc: 92.60%
	 Val. Loss: 0.696 |  Val. Acc: 84.51%
Epoch: 10 | Epoch T
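
Reading the log above, training loss keeps falling while validation loss rises after the first epoch, which is a common sign of overfitting. The sketch below replays the checkpoint rule on the logged validation losses for epochs 1 to 9 (the epoch 10 line is cut off in this export, so it is left out) to show which epochs would trigger a save.

```python
# Validation losses for epochs 1-9, copied from the log above.
valid_losses = [0.510, 0.531, 0.533, 0.553, 0.572, 0.609, 0.617, 0.664, 0.696]

best_valid_loss = float('inf')
saved_epochs = []
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)  # stands in for torch.save(...)

print(saved_epochs, best_valid_loss)  # [1] 0.51
```

Under this rule only the epoch 1 weights are ever written to `tut1-model.pt` for these nine epochs.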

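We then load our "best" parameters and evaluate them. Note that the cell below passes `valid_iterator`, so the numbers it prints under "Test" are validation metrics (they match epoch 1 in the log above), not results on the test set.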

In [37]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)

print(f'Test Loss: {test_loss:.3f} |  Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.510 |  Test Acc: 85.29%


## Inference

85% accuracy looks pretty good, but let's see our model tag some actual sentences.

In [38]:
#### Function to tag a single tweet (raw string or pre-tokenized list)
def tag_sentence(model, device, sentence, text_field):
    
    model.eval()
    
    if isinstance(sentence, str):
        # reuse the tokenizer defined earlier instead of reloading spaCy on every call
        tokens = tokenize_en(sentence)
    else:
        tokens = [token for token in sentence]

    if text_field.lower:
        tokens = [t.lower() for t in tokens]
        
    numericalized_tokens = [text_field.vocab.stoi[t] for t in tokens]

    unk_idx = text_field.vocab.stoi[text_field.unk_token]
    
    # tokens that are missing from the vocabulary
    unks = [t for t, n in zip(tokens, numericalized_tokens) if n == unk_idx]

    # wrap the tweet in <sos> ... <eos>, as the Field does during training
    numericalized_tokens = ([text_field.vocab.stoi[text_field.init_token]]
                            + numericalized_tokens
                            + [text_field.vocab.stoi[text_field.eos_token]])
    
    token_tensor = torch.LongTensor(numericalized_tokens)
    
    # add a batch dimension: [sent len] -> [sent len, 1]
    token_tensor = token_tensor.unsqueeze(-1).to(device)
    
    # no gradients are needed at inference time
    with torch.no_grad():
        predictions = model(token_tensor)
    
    top_predictions = predictions.argmax(-1)
    
    predicted_tags = top_predictions.cpu().numpy()
    
    return numericalized_tokens, predicted_tags.squeeze(1), unks

We'll get an already tokenized example from the validation set and test our model's performance.

In [39]:
example_index = 1

text = vars(valid_dataset.examples[example_index])['text']
actual_tags = vars(valid_dataset.examples[example_index])['sel_text']


print(text)
print (actual_tags)

['negative', '_', 'wx', 'do', '-pron-', 'see', 'the', 'color', 'of', 'the', 'sky', 'and', 'how', '-pron-', 'look', 'in', 'philly', '?', '?', '-pron-', 'be', 'yellowish', '/', 'orangeish', '/', 'brownish', 'look', 'scary', '!', '!', 'lol', 'negative']
['look', 'scary', '!', '!']


In [40]:
example_index = 1
text = vars(valid_dataset.examples[example_index])['text']
sel_text = vars(valid_dataset.examples[example_index])['sel_text']
idx = vars(valid_dataset.examples[example_index])['idx']

sel_text_token = [TEXT.vocab.stoi[token] for token in sel_text]




print(text)
print (sel_text)
print(idx)

['negative', '_', 'wx', 'do', '-pron-', 'see', 'the', 'color', 'of', 'the', 'sky', 'and', 'how', '-pron-', 'look', 'in', 'philly', '?', '?', '-pron-', 'be', 'yellowish', '/', 'orangeish', '/', 'brownish', 'look', 'scary', '!', '!', 'lol', 'negative']
['look', 'scary', '!', '!']
16071


In [41]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 500)
valid_data = pd.read_csv(data_folder + '/valid_transform.csv')
valid_data[valid_data.idx == idx]

| | text | selected_text | idx | sentiment |
|---|---|---|---|---|
| 1 | negative _wx do -PRON- see the color of the sky and how -PRON- look in philly ?? -PRON- be yellowish / orangeish / brownish look scary !! lol negative | look scary !! | 16071 | negative |


We can then use our `tag_sentence` function to get the predicted tags, and `find_subset_index` to build the reference tags for comparison. Notice that three tokens, 'yellowish', 'orangeish' and 'brownish', are not in the vocabulary and become `<unk>`.

In [42]:
tokens, pred_tags, unks = tag_sentence(model, 
                                       device, 
                                       text,
                                       TEXT)

actual_tags = find_subset_index(tokens,sel_text_token)

print(tokens)
print(actual_tags)
print(pred_tags)
print(unks)

[2, 12, 54, 8117, 30, 4, 56, 13, 1625, 25, 13, 1209, 17, 92, 4, 124, 21, 2045, 19, 19, 4, 9, 0, 135, 0, 135, 0, 124, 957, 6, 6, 65, 12, 3]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 3, 1, 1, 1, 1, 0, 0, 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0]
['yellowish', 'orangeish', 'brownish']


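We can then check how well it did, token by token. The model tags 'scary ! !' correctly but misses the preceding 'look'. The three `<unk>` tokens carry reference tag 3, which the model does not predict here, and the earlier 'look' is marked 1 in the reference only because `find_subset_index` matches tokens by membership rather than by position.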

In [43]:
print("Pred. Tag\tActual Tag\tCorrect?\tToken\n")

for token, pred_tag, actual_tag in zip(tokens, pred_tags, actual_tags):
    correct = '✔' if pred_tag == actual_tag else '✘'
    print(f"{pred_tag}\t\t{actual_tag}\t\t{correct}\t\t{TEXT.vocab.itos[token]}")

Pred. Tag	Actual Tag	Correct?	Token

0		0		✔		<sos>
0		0		✔		negative
0		0		✔		_
0		0		✔		wx
0		0		✔		do
0		0		✔		-pron-
0		0		✔		see
0		0		✔		the
0		0		✔		color
0		0		✔		of
0		0		✔		the
0		0		✔		sky
0		0		✔		and
0		0		✔		how
0		0		✔		-pron-
0		1		✘		look
0		0		✔		in
0		0		✔		philly
0		0		✔		?
0		0		✔		?
0		0		✔		-pron-
0		0		✔		be
0		3		✘		<unk>
0		0		✔		/
0		3		✘		<unk>
0		0		✔		/
0		3		✘		<unk>
0		1		✘		look
1		1		✔		scary
1		1		✔		!
1		1		✔		!
0		0		✔		lol
0		0		✔		negative
0		0		✔		<eos>


Not bad. Almost gets it.

### Jaccard testing

Since the Kaggle competition scores submissions with the word-level Jaccard score, let's see how the model performs on that metric.

In [44]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
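
A few worked values make the metric concrete. The first two calls below use prediction and reference pairs that appear in the validation table further down (rows 1365 and 1366). The `safe_jaccard` wrapper is a hypothetical addition, not part of the notebook, shown because the plain function divides by zero when both strings are empty.

```python
def jaccard(str1, str2):
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def safe_jaccard(str1, str2):
    """Hypothetical wrapper: treat two empty strings as a perfect match."""
    if not str1.split() and not str2.split():
        return 1.0
    return jaccard(str1, str2)

print(jaccard('awesome !', '-pron- friend be awesome !'))  # 0.4
print(round(jaccard('too far away', 'far'), 6))            # 0.333333
print(jaccard('', 'eating maccie'))                        # 0.0
print(safe_jaccard('', ''))                                # 1.0
```

Because the score works on sets of words, repeated tokens such as `! !` count only once.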

In [45]:
valid_df = []

for i in range(0, len(valid_dataset)):


  output_dict = {}
  valid_idx = vars(valid_dataset.examples[i])['idx']
  output_dict["idx"] = valid_idx

  src = vars(valid_dataset.examples[i])['text']
  output_dict["text"] = " ".join(src)
  #print(output_dict["text"])

  trg = vars(valid_dataset.examples[i])['sel_text']
  output_dict["selected_text"] = " ".join(trg)
  #print(output_dict["selected_text"])

  snt = vars(valid_dataset.examples[i])['label']
  output_dict["sentiment"] = snt

  tokens, predicted_tags, unks = tag_sentence(model, 
                                       device, 
                                       src,
                                       TEXT)

  
  

  pred_indx = [indx for indx, x in enumerate(predicted_tags[:-1]) if x == 1]
  output_dict["predicted_text"] = " ".join([TEXT.vocab.itos[tokens[p]] for p in pred_indx])
  #print(pred_indx)


  output_dict["baseline_score"] = jaccard(output_dict["text"], output_dict["selected_text"])
  output_dict["j_score"] = jaccard(output_dict["predicted_text"], output_dict["selected_text"])



  

  valid_df.append(output_dict)
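
The decoding step in the loop above, turning predicted tags back into a phrase and scoring it, can be sketched on a shortened version of the earlier example tweet. The helper `tags_to_text` is hypothetical, and the token and tag lists are abbreviated for illustration.

```python
def jaccard(str1, str2):
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def tags_to_text(tokens, tags, selected_tag=1):
    """Join the tokens whose predicted tag equals `selected_tag`."""
    return ' '.join(tok for tok, tag in zip(tokens, tags) if tag == selected_tag)

tokens = ['<sos>', 'negative', 'look', 'scary', '!', '!', 'lol', 'negative', '<eos>']
tags = [0, 0, 0, 1, 1, 1, 0, 0, 0]

predicted_text = tags_to_text(tokens, tags)
print(predicted_text)                                       # scary ! !
print(round(jaccard(predicted_text, 'look scary ! !'), 3))  # 0.667
```

Missing a single word of a short phrase costs a third of the score here, which is why token accuracy near 85% can still translate into a modest mean Jaccard.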

In [46]:
valid_df = pd.DataFrame(valid_df)
valid_df.tail(10)

| | idx | text | selected_text | sentiment | predicted_text | baseline_score | j_score |
|---|---|---|---|---|---|---|---|
| 1364 | 14443 | neutral crawling back into bed ... because -pron- can neutral | crawling back into bed ... because -pron- can | neutral | crawling back into bed ... because -pron- can | 0.888889 | 1.0 |
| 1365 | 14636 | positive -pron- friend be awesome ! -- and the non twitt one here right now too ! ! positive | -pron- friend be awesome ! | positive | awesome ! | 0.3125 | 0.4 |
| 1366 | 3329 | negative - l ` would come if u could , but australia be just too far away negative | far | negative | too far away | 0.058824 | 0.333333 |
| 1367 | 23117 | neutral just sittin here listenin to music . follow -pron- ? neutral | just sittin here listenin to music . follow -pron- ? | neutral | just sittin here listenin to music . follow -pron- ? | 0.909091 | 1.0 |
| 1368 | 25902 | positive infamous on the ps3 = awesome . -pron- eye be so sore now though positive | awesome . | positive | awesome . | 0.133333 | 1.0 |
| 1369 | 7493 | positive finally , -pron- get -pron- teaching load confusion clear . -pron- will teach 3 third year section but with going to catch . positive | finally , -pron- get -pron- teaching load confusion clear . | positive | | 0.428571 | 0.0 |
| 1370 | 26744 | neutral -pron- do the same thing in nola neutral | -pron- do the same thing in nola | neutral | -pron- do the same thing in nola | 0.875 | 1.0 |
| 1371 | 693 | positive thank for the greeting positive | thank | positive | thank | 0.2 | 1.0 |
| 1372 | 22833 | neutral eating maccie neutral | eating maccie | neutral | eating `<unk>` | 0.666667 | 0.333333 |
| 1373 | 22906 | neutral do with the packing and everything else ... leaving in 3 hour ... neutral | do with the packing and everything else ... leaving in 3 hour ... | neutral | do with the packing and everything else ... leaving in 3 hour ... | 0.923077 | 1.0 |


In [47]:
round(valid_df.j_score.mean(),2) ### j-score using algorithm

0.61

In [48]:
round(valid_df.baseline_score.mean(),2) ### j-score using baseline

0.56

#### Conclusion

The model beats the baseline of selecting the whole tweet text (mean Jaccard of 0.61 versus 0.56 on the validation set), but it needs to improve considerably to compete on the leaderboard.