# Long Text Sentiment - Text Token Split Mean

So far, we have restricted the length of the text being fed into our models. BERT in particular is restricted to consuming 512 tokens per sample. For many use-cases, this is most likely not a problem - but in some cases it can be.

Take the example of customer feedback on e-commerce sites, which often consists of what customers think about the products. On these longer pieces of text, the actual sentiment from the customer may not be clear from the first 512 tokens. We need to consider the full post.

Before working through the logic that allows us to consider the full post, let's import and define everything we need to make a prediction on a single chunk of text (using much of what we covered in the last section).

In [31]:
# this text is taken from kaggle Amazon Reviews Dataset
# https://www.kaggle.com/bittlingmayer/amazonreviews
text = """
       One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game,
       the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my 
       favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, 
       as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) 
       has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the 
       songs, which I find distracting. But even if those weren't included I would still consider the collection worth it. Not an "ultimate guide": Firstly,I enjoyed the format and tone of the book (how the author addressed the reader). 
       However, I did not feel that she imparted any insider secrets that the book promised to reveal. 
       If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help. 
       If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books. 
       For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation. 
       Yet, for those new to the entire affair, this book can definitely clarify the requirements for you. Don't Take the Chance - Get the SE Branded Cable: If you purchase this data cable, you need to know that you will receive no real directions or information regarding what to check if nothing works. As directed, I downloaded all of the files from the SE site (70MB on dial up!), and then downloaded all of the user guides.
       Everything seemed to install ok, but nothing would make my phone be recognized. After that I scoured the SE site for troubleshooting info on their branded cable-in the hope that something would help me figure out the problem. 
       After 2 full days of beating my head against the wall, I finally threw the cable and the useless CD that came with it in the trash.If I had used my brain I would have paid the extra $$ for a SE branded cable and software (and the support that comes along with that). 
       I now have the real deal (SE data cable and software), and guess what? Yep, installation was a breeze and it works beautifully. You really do get what you pay for. great IMO: First of all, I saw the review by "Tyley Mike "Relite"" and thought he was grossly overcritical of EVERYTHING and every instrument played... 
       so I'd like to hear Tyley Mike's album, since he thinks he can do better :) --seriously! I think some people don't understand that things sound the way they were MEANT to sound, if they sound poppy, they made it that way, why the hell should they stick to the norm? They want to do something different and in my opinion it sounds great.
       I can't write a good enough review for this album, all their albums actually, as they are all a masterpiece of their own while still being different enough to keep it interesting. It bugs me when a group doesn't evolve or try new things and stays exactly the same as they ever were, all the time, so I was glad to see them progress and "grow".
       There's too much to say to describe this album, but frankly I don't think I could write a good enough review to do it justice, so I'll just give it my 5 stars :) .  It Rises above the "Fluff" Books: The first thing that struck me was that it was easy to read. 
       The print was readable and the illustrations were helpful. I did also find some grammatical errors as an earlier review said. But mostly it was very specific and practical. 
       The chapters most helpful were on "emotional states" and music. It's hard to find a book on this subject that's across the board, dealing with many different issues and this one addresses nearly every brain-related research issue from nutrition to memory. 
       As a scientist who also works with high school students, I found his translation of brain research into the classroom to be thoughtful, if not enthusiastic. It's a tough subject to translate, but I did get more than I thought I would out of the book.
       Mostly it helped me get past the hype and get into the real practical meat of the material. The book's far from perfect, but it's the best I've seen so far on this topic. 
       """

Now let's get to how we apply sentiment to longer pieces of text. This approach splits the tokenized text into fixed-size chunks, predicts sentiment probabilities for each chunk, and takes the mean of those probabilities.

In [32]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

In [33]:
# initialize our model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('bertweet-base-sentiment-analysis')
model = AutoModelForSequenceClassification.from_pretrained('bertweet-base-sentiment-analysis')
labels = ["negative", "neutral", "positive"]

In [34]:
def get_sentiment(text):
    # get tokens
    inputs = tokenizer.encode_plus(text,return_tensors='pt')
    # get output logits from the model
    output = model(**inputs)
    # convert to probabilities
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    # we will return the probability tensor (we will not need argmax until later)
    return probs

In [35]:
# check the text token length
tokens = tokenizer.encode_plus(text, add_special_tokens=False)

len(tokens['input_ids'])

Token indices sequence length is longer than the specified maximum sequence length for this model (1046 > 128). Running this sequence through the model will result in indexing errors


1046

If we tokenize this longer piece of text we get a total of **1046** tokens, far too many to fit into the bertweet-base-sentiment-analysis model, which has a maximum limit of 128 tokens. We will need to split this text into chunks of 128 tokens at a time, and calculate our sentiment probabilities for each chunk separately.
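
As a rough check on how many chunks to expect, the arithmetic can be sketched in plain Python. The two numbers come from the output above; the variable names are illustrative only, and the second calculation assumes that two positions per chunk are reserved for special tokens.

```python
import math

total_tokens = 1046  # token count reported above
max_len = 128        # maximum sequence length from the tokenizer warning

# If every chunk could hold max_len text tokens
chunks_plain = math.ceil(total_tokens / max_len)

# If two positions per chunk are reserved for special tokens (assumption)
usable = max_len - 2
chunks_with_special = math.ceil(total_tokens / usable)

# Text tokens left over for the final, partially filled chunk
last_chunk_tokens = total_tokens - (chunks_with_special - 1) * usable

print(chunks_plain, chunks_with_special, last_chunk_tokens)
```

Either way the text needs nine chunks, and only the last one is short enough to require padding.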

Because we are taking this slightly different approach, we have encoded our tokens using a different set of parameters to what we have used before. This time, we:

* Set `add_special_tokens=False`, because otherwise the *[CLS]* and *[SEP]* tokens would be added only to the start and end of the full tokenized sequence of length **1046**. We will instead add them manually to each chunk later.

* Did not specify `max_length`, `truncation`, or `padding` parameters (as we do not use any of them here).

* Returned standard Python *lists* rather than tensors by not specifying `return_tensors` (it returns lists by default). This will make the following logic steps easier to follow - but we will rewrite them using PyTorch code in the next section.

First, we break our tokenized dictionary into `input_ids` and `attention_mask` variables.

In [36]:
input_ids = tokens['input_ids']
attention_mask = tokens['attention_mask']

We can now access slices of these lists like so:

In [37]:
input_ids[16:32]

[6227, 6, 733, 25, 8, 36, 121, 834, 11, 915, 10705, 15, 6, 20905, 6458, 3]

In [38]:
# define our starting position (0) and window size (number of tokens in each chunk)
start = 0
window_size = 128

# get the total length of our tokens
total_len = len(input_ids)

# initialize condition for our while loop to run
loop = True

# loop through and print out start/end positions
while loop:
    # the end position is simply the start + window_size
    end = start + window_size
    # if the end position is greater than the total length, make this our final iteration
    if end >= total_len:
        loop = False
        # and change our endpoint to the final token position
        end = total_len
    print(f"start={start}\nend={end}")
    # we need to move the window on to the next chunk of `window_size` (128) tokens
    start = end

start=0
end=128
start=128
end=256
start=256
end=384
start=384
end=512
start=512
end=640
start=640
end=768
start=768
end=896
start=896
end=1024
start=1024
end=1046
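
The same start/end positions can also be produced with `range`, which some readers may find easier to follow than the `while` loop. This is only an alternative sketch; `chunk_bounds` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def chunk_bounds(total_len, window_size):
    """Yield (start, end) index pairs that cover total_len items in order."""
    for start in range(0, total_len, window_size):
        # clamp the end position so the final chunk stops at the last token
        yield start, min(start + window_size, total_len)


bounds = list(chunk_bounds(1046, 128))
print(bounds[0], bounds[-1], len(bounds))
```

One difference to be aware of: for an empty input this generator yields nothing, whereas the `while` loop above runs once with `start=0` and `end=0`.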


This logic works for shifting our window across the full length of input IDs, so now we can modify it to iteratively predict sentiment for each window. There will be a few added steps for us to get this to work:

1. Extract the window from `input_ids` and `attention_mask`.

2. Add the start of sequence token `[CLS]`/`101` and separator token `[SEP]`/`102`.

3. Add padding (only applicable to the final chunk).

4. Format into dictionary containing PyTorch tensors.

5. Make logits predictions with the model.

6. Calculate softmax and append softmax vector to a list `probs_list`.
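
Steps 1 to 3 can be sketched on their own with plain Python lists before the model is involved. `build_chunk` is a hypothetical helper, and the toy token IDs are made up. The special-token IDs are passed in as parameters because they depend on the tokenizer's vocabulary; Hugging Face tokenizers expose them as `cls_token_id`, `sep_token_id`, and `pad_token_id`, and the values `101`, `102`, and `0` used here are assumed for illustration.

```python
def build_chunk(ids, mask, start, end, window_size, cls_id, sep_id, pad_id):
    """Build one fixed-length chunk of window_size + 2 positions."""
    # (1) extract the window
    chunk_ids = ids[start:end]
    chunk_mask = mask[start:end]
    # (2) add the start-of-sequence and separator tokens
    chunk_ids = [cls_id] + chunk_ids + [sep_id]
    chunk_mask = [1] + chunk_mask + [1]
    # (3) pad up to window_size + 2; padded positions get attention 0
    pad_len = window_size + 2 - len(chunk_ids)
    chunk_ids += [pad_id] * pad_len
    chunk_mask += [0] * pad_len
    return chunk_ids, chunk_mask


# toy example: 7 made-up token IDs, window of 4, final (short) chunk
toy_ids = list(range(10, 17))
toy_mask = [1] * len(toy_ids)
ids_chunk, mask_chunk = build_chunk(
    toy_ids, toy_mask, start=4, end=7, window_size=4,
    cls_id=101, sep_id=102, pad_id=0,
)
print(ids_chunk)   # [101, 14, 15, 16, 102, 0]
print(mask_chunk)  # [1, 1, 1, 1, 1, 0]
```

Because the attention mask is `0` at padded positions, the model should ignore the padding regardless of which ID fills those slots.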

In [39]:
# initialize probabilities list
probs_list = []

start = 0
window_size = 126  # we take 2 off here so that we can fit in our [CLS] and [SEP] tokens

loop = True

while loop:
    end = start + window_size
    if end >= total_len:
        loop = False
        end = total_len
    # (1) extract window from input_ids and attention_mask
    input_ids_chunk = input_ids[start:end]
    attention_mask_chunk = attention_mask[start:end]
    # (2) add [CLS] and [SEP]
    # note: 101 and 102 are the [CLS]/[SEP] ids in BERT's vocabulary; other tokenizers
    # may use different ids (see tokenizer.cls_token_id and tokenizer.sep_token_id)
    input_ids_chunk = [101] + input_ids_chunk + [102]
    attention_mask_chunk = [1] + attention_mask_chunk + [1]
    # (3) add padding up to window_size + 2 (128) tokens
    input_ids_chunk += [0] * (window_size - len(input_ids_chunk) + 2)
    attention_mask_chunk += [0] * (window_size - len(attention_mask_chunk) + 2)
    # (4) format into PyTorch tensors dictionary (batch of one)
    input_dict = {
        'input_ids': torch.tensor([input_ids_chunk], dtype=torch.long),
        'attention_mask': torch.tensor([attention_mask_chunk], dtype=torch.int)
    }
    print(len(input_ids_chunk))
    print(len(attention_mask_chunk))
    # (5) make logits prediction
    outputs = model(**input_dict)
    # (6) calculate softmax and append to list
    probs = torch.nn.functional.softmax(outputs[0], dim=-1)
    probs_list.append(probs)

    start = end
    
# let's view the probabilities given
probs_list

128
128
128
128
128
128
128
128
128
128
128
128
128
128
128
128
128
128


[tensor([[0.0010, 0.0098, 0.9892]], grad_fn=<SoftmaxBackward>),
 tensor([[0.4240, 0.5489, 0.0272]], grad_fn=<SoftmaxBackward>),
 tensor([[0.0010, 0.0867, 0.9123]], grad_fn=<SoftmaxBackward>),
 tensor([[0.9362, 0.0607, 0.0031]], grad_fn=<SoftmaxBackward>),
 tensor([[0.7842, 0.1430, 0.0729]], grad_fn=<SoftmaxBackward>),
 tensor([[0.5739, 0.4029, 0.0232]], grad_fn=<SoftmaxBackward>),
 tensor([[0.0079, 0.3190, 0.6731]], grad_fn=<SoftmaxBackward>),
 tensor([[0.0027, 0.4297, 0.5676]], grad_fn=<SoftmaxBackward>),
 tensor([[0.0030, 0.1675, 0.8295]], grad_fn=<SoftmaxBackward>)]

Each chunk has been assigned a different level of sentiment. To calculate the average sentiment across the full text, we first merge these tensors using the `stack` method.

We then calculate the mean score of each column (negative, neutral, and positive sentiment respectively, matching the order of `labels`) using `mean(dim=0)`. But before we do that we must reshape our tensor into a *9x3* shape - it is currently *9x1x3*:

In [41]:
# take mean of probabilities
with torch.no_grad():
    # the stack operation goes in here too, so the result does not track gradients
    stacks = torch.stack(probs_list)
    print(stacks.shape)
    # we keep dimensions `0` and `2` of the current shape and drop the singleton dimension `1`
    print(stacks.shape[0], stacks.shape[2])
    # now reshape (`reshape` returns a new view rather than resizing in place like `resize_`)
    stacks = stacks.reshape(stacks.shape[0], stacks.shape[2])
    # finally, we can calculate the mean value for each sentiment class
    mean = stacks.mean(dim=0)
    scores = mean.detach().numpy()

torch.Size([9, 1, 3])
9 3


In [42]:
scores

array([0.30374646, 0.24089938, 0.4553542 ], dtype=float32)
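
As a sanity check, the mean can be recomputed by hand from the rounded per-chunk probabilities printed earlier. This sketch uses plain Python lists, and `chunk_probs` and `mean_probs` are illustrative names. Because the inputs are rounded to four decimal places, the result should only be expected to agree with `scores` to roughly three decimal places.

```python
# per-chunk probabilities copied from the printed `probs_list`
# (columns: negative, neutral, positive)
chunk_probs = [
    [0.0010, 0.0098, 0.9892],
    [0.4240, 0.5489, 0.0272],
    [0.0010, 0.0867, 0.9123],
    [0.9362, 0.0607, 0.0031],
    [0.7842, 0.1430, 0.0729],
    [0.5739, 0.4029, 0.0232],
    [0.0079, 0.3190, 0.6731],
    [0.0027, 0.4297, 0.5676],
    [0.0030, 0.1675, 0.8295],
]

n_chunks = len(chunk_probs)
# zip(*rows) transposes the rows into columns, one per sentiment class
mean_probs = [sum(column) / n_chunks for column in zip(*chunk_probs)]

print([round(p, 4) for p in mean_probs])
```

Note that this is an unweighted mean: the final, mostly padded chunk counts as much as the full-length chunks.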

In [43]:
# rank the sentiment classes from most to least probable
ranking = np.argsort(scores)[::-1]
for rank, idx in enumerate(ranking, start=1):
    label = labels[idx]
    score = scores[idx]
    print(f"{rank}) {label} {np.round(float(score), 4)}")

1) positive 0.4554
2) negative 0.3037
3) neutral 0.2409
