# Long Text Sentiment - Text Summary Sentiment

So far, we have restricted the length of the text being fed into our models. BERT in particular is restricted to consuming 512 tokens per sample. For many use cases this is not a problem, but in some cases it can be.

Take customer feedback on e-commerce sites, which often consists of long descriptions of what customers think about a product. On these longer pieces of text, the actual sentiment of the customer may not be clear from the first 512 tokens. We need to consider the full post.
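
To make the problem concrete, here is a minimal sketch of what truncation does to a long review. It assumes whitespace splitting as a stand-in for a real subword tokenizer, and `truncate_tokens` is a hypothetical helper, not part of any library.

```python
def truncate_tokens(tokens, max_length=512):
    """Keep the first max_length tokens and report how many were dropped."""
    kept = tokens[:max_length]
    return kept, len(tokens) - len(kept)


# Made-up review: a long positive opening followed by a late complaint.
# Whitespace splitting only approximates a real subword tokenizer.
review = ("great sound " * 300) + "but it broke after a week"
tokens = review.split()

kept, dropped = truncate_tokens(tokens)
print(len(tokens), len(kept), dropped)
print("broke" in kept)  # the complaint sits past the cut-off
```

In this toy case the only negative remark falls after the limit, so a model that sees just the kept tokens has no way to account for it.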

Before working through the logic that allows us to consider the full post, let's import and define everything we need to make a prediction on a single chunk of text (using much of what we covered in the last section).

In [2]:
# this text is taken from kaggle Amazon Reviews Dataset
# https://www.kaggle.com/bittlingmayer/amazonreviews
text = """
       One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game,
       the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my 
       favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, 
       as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) 
       has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the 
       songs, which I find distracting. But even if those weren't included I would still consider the collection worth it. Not an "ultimate guide": Firstly,I enjoyed the format and tone of the book (how the author addressed the reader). 
       However, I did not feel that she imparted any insider secrets that the book promised to reveal. 
       If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help. 
       If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books. 
       For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation. 
       Yet, for those new to the entire affair, this book can definitely clarify the requirements for you. Don't Take the Chance - Get the SE Branded Cable: If you purchase this data cable, you need to know that you will receive no real directions or information regarding what to check if nothing works. As directed, I downloaded all of the files from the SE site (70MB on dial up!), and then downloaded all of the user guides.
       Everything seemed to install ok, but nothing would make my phone be recognized. After that I scoured the SE site for troubleshooting info on their branded cable-in the hope that something would help me figure out the problem. 
       After 2 full days of beating my head against the wall, I finally threw the cable and the useless CD that came with it in the trash.If I had used my brain I would have paid the extra $$ for a SE branded cable and software (and the support that comes along with that). 
       I now have the real deal (SE data cable and software), and guess what? Yep, installation was a breeze and it works beautifully. You really do get what you pay for. great IMO: First of all, I saw the review by "Tyley Mike "Relite"" and thought he was grossly overcritical of EVERYTHING and every instrument played... 
       so I'd like to hear Tyley Mike's album, since he thinks he can do better :) --seriously! I think some people don't understand that things sound the way they were MEANT to sound, if they sound poppy, they made it that way, why the hell should they stick to the norm? They want to do something different and in my opinion it sounds great.
       I can't write a good enough review for this album, all their albums actually, as they are all a masterpiece of their own while still being different enough to keep it interesting. It bugs me when a group doesn't evolve or try new things and stays exactly the same as they ever were, all the time, so I was glad to see them progress and "grow".
       There's too much to say to describe this album, but frankly I don't think I could write a good enough review to do it justice, so I'll just give it my 5 stars :) .  It Rises above the "Fluff" Books: The first thing that struck me was that it was easy to read. 
       The print was readable and the illustrations were helpful. I did also find some grammatical errors as an earlier review said. But mostly it was very specific and practical. 
       The chapters most helpful were on "emotional states" and music. It's hard to find a book on this subject that's across the board, dealing with many different issues and this one addresses nearly every brain-related research issue from nutrition to memory. 
       As a scientist who also works with high school students, I found his translation of brain research into the classroom to be thoughtful, if not enthusiastic. It's a tough subject to translate, but I did get more than I thought I would out of the book.
       Mostly it helped me get past the hype and get into the real practical meat of the material. The book's far from perfect, but it's the best I've seen so far on this topic. 
       """

Now let's get to how we apply sentiment to longer pieces of text. This approach first condenses the text with a summarization model and then runs the sentiment model on the much shorter summary.
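
The code in this section summarizes the text and then classifies the summary. As a rough sketch of that flow, the function below takes the two steps as plain callables. The names `summary_sentiment`, `toy_summarize` and `toy_classify` are hypothetical, and the toy functions are crude stand-ins so the example runs without downloading any model.

```python
from typing import Callable, Sequence, Tuple


def summary_sentiment(
    text: str,
    summarize: Callable[[str], str],
    classify: Callable[[str], Sequence[float]],
    labels: Sequence[str] = ("negative", "neutral", "positive"),
) -> Tuple[str, float]:
    """Summarize `text`, classify the summary, return (label, probability)."""
    summary = summarize(text)
    probs = list(classify(summary))
    # index of the highest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def toy_summarize(text: str) -> str:
    # stand-in summarizer: keep only the first sentence
    return text.split(".")[0].strip() + "."


def toy_classify(text: str) -> Sequence[float]:
    # stand-in classifier: keyword rule, returns [negative, neutral, positive]
    return [0.1, 0.1, 0.8] if "great" in text.lower() else [0.3, 0.4, 0.3]


label, prob = summary_sentiment(
    "The soundtrack is great. The case arrived cracked.",
    toy_summarize,
    toy_classify,
)
print(label, prob)
```

Note that the result depends entirely on what the summarizer keeps: here the second sentence never reaches the classifier.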

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TFAutoModelForSeq2SeqLM
import torch
import numpy as np

In [12]:
# get the Hugging Face text summarization model
sum_model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
sum_tokenizer = AutoTokenizer.from_pretrained("t5-base")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [13]:
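# get summarized text (the "summarize: " prefix tells T5 which task to run)
# note: max_length=128 with truncation=True keeps only the first 128 input tokens,
# so only the beginning of the text is actually summarized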
inputs = sum_tokenizer("summarize: " + text, return_tensors="tf", max_length=128, truncation=True)
outputs = sum_model.generate(inputs["input_ids"], max_length=128, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
print(sum_tokenizer.decode(outputs[0]))

<pad> despite the fact that i have only played a small portion of the game, the music I heard led me to purchase the soundtrack, and it remains one of my favorite albums. there is an incredible mix of fun, epic, and emotional songs.


In [16]:
# decode again, this time dropping special tokens such as <pad>
text = sum_tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
print(text)

despite the fact that i have only played a small portion of the game, the music I heard led me to purchase the soundtrack, and it remains one of my favorite albums. there is an incredible mix of fun, epic, and emotional songs.


In [17]:
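# initialize our sentiment model and tokenizer
# (the bare name assumes a local copy of the checkpoint; on the Hugging Face Hub
# it is published as 'finiteautomata/bertweet-base-sentiment-analysis')
tokenizer = AutoTokenizer.from_pretrained('bertweet-base-sentiment-analysis')
model = AutoModelForSequenceClassification.from_pretrained('bertweet-base-sentiment-analysis')
# label names in the order of the model's output indices
labels = ["negative", "neutral", "positive"]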

In [18]:
def get_sentiment(text):
    # get tokens
    inputs = tokenizer.encode_plus(text,return_tensors='pt')
    # get output logits from the model
    output = model(**inputs)
    # convert to probabilities
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    # we will return the probability tensor (we will not need argmax until later)
    return probs
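
For readers unfamiliar with the softmax step in `get_sentiment`, here is a small pure-Python sketch of the same conversion from logits to probabilities. The logit values are made up for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # subtracting the maximum avoids overflow in exp() without changing the result
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-2.0, 0.0, 3.0]  # hypothetical [negative, neutral, positive] logits
probs = softmax(example_logits)
print(probs)
```

The ordering of the logits is preserved, so the largest logit always ends up with the largest probability.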

In [19]:
# check the text token length
tokens = tokenizer.encode_plus(text, add_special_tokens=False)

len(tokens['input_ids'])

52

In [28]:
scores = get_sentiment(text)
scores = scores.detach().numpy()[0]

In [29]:
scores

array([0.00118135, 0.00705657, 0.99176204], dtype=float32)

In [30]:
# rank the labels from most to least probable
ranking = np.argsort(scores)[::-1]
for i, idx in enumerate(ranking):
    label = labels[idx]
    score = scores[idx]
    print(f"{i + 1}) {label} {np.round(float(score), 4)}")

1) positive 0.9918
2) neutral 0.0071
3) negative 0.0012
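
If the ranking is needed elsewhere, it may be convenient to return it instead of printing it. The helper below is one possible sketch. `rank_labels` is a hypothetical name, and the scores are the ones printed above, copied in by hand so the example runs without the model.

```python
def rank_labels(scores, labels):
    """Return (label, score) pairs sorted from most to least probable."""
    pairs = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [(label, round(float(score), 4)) for label, score in pairs]


# scores copied from the output above
ranked = rank_labels(
    [0.00118135, 0.00705657, 0.99176204],
    ["negative", "neutral", "positive"],
)
for position, (label, score) in enumerate(ranked, start=1):
    print(f"{position}) {label} {score}")
```

The first element of `ranked` is the predicted sentiment, which here is the positive label.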
