# Compute NLTK and Transformers sentiment scores on labelled dataset

### References
- Transformers: https://huggingface.co/transformers/quicktour.html

### Datasets
- UCI: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

### Notes
- Transformers raises an error (`index out of range in self`) on very long texts that exceed the model's 512-token limit, such as ` The structure of this film is easily the most tightly constructed in the history of cinema.  \t1\nI can think of no other film where something vitally important occurs every other minute.  \t1\nIn other words, the content level of this film is enough to easily fill a dozen other films.  \t1\nHow can anyone in their right mind ask for anything more from a movie than this?  \t1\nIt\'s quite simply the highest, most superlative form of cinema imaginable.  \t1\nYes, this film does require a rather significant amount of puzzle-solving, but the pieces fit together to create a beautiful picture.  \t1\nThis short film certainly pulls no punches.  \t0\nGraphics is far from the best part of the game.  \t0\nThis is the number one best TH game in the series.  \t1\nIt deserves strong love.  \t1\nIt is an insane game.  \t1\nThere are massive levels, massive unlockable characters... it\'s just `
- These texts are ignored in the comparison for now



In [1]:
import pandas as pd

### Read data

In [2]:
# Read datasets
amazon = pd.read_csv('../data/raw/uci-sentiment/amazon_cells_labelled.txt', sep='\t', names=['Text', 'GT'])
imdb = pd.read_csv('../data/raw/uci-sentiment/imdb_labelled.txt', sep='\t', names=['Text', 'GT'])
yelp = pd.read_csv('../data/raw/uci-sentiment/yelp_labelled.txt', sep='\t', names=['Text', 'GT'])
df = pd.concat([amazon, imdb, yelp])
display(df.shape)
df.head(3)

(2748, 2)

Unnamed: 0,Text,GT
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
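
The combined frame has 2748 rows, and the long texts that fail later contain several `text\tlabel` lines merged into one record. A plausible cause, stated here as an assumption rather than something the notebook verifies, is an unbalanced double quote in the raw files: with default quoting, a field that opens with `"` swallows tabs and newlines until the next quote. The standard-library `csv` module shows the mechanism on a made-up sample (the three lines below are hypothetical, not taken from the dataset):

```python
import csv
import io

# Hypothetical sample in the same "text<TAB>label" layout.
# The first line opens a double quote that is only closed on the third line.
sample = (
    '"Great film.  \t1\n'
    'Bad plot.  \t0\n'
    'He said "wow.  \t1\n'
)

# Default quoting: the opening quote starts a quoted field spanning three lines.
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: quote characters are treated as ordinary text, one row per line.
raw_rows = list(csv.reader(io.StringIO(sample), delimiter='\t',
                           quoting=csv.QUOTE_NONE))

print(len(default_rows))  # 1
print(len(raw_rows))      # 3
```

`pd.read_csv` takes a `quoting` argument that accepts the same `csv.QUOTE_NONE` constant, so passing it when reading the three files may be worth trying if the row count looks too low.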


### NLTK sentiment analysis

In [3]:
%%time
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

CPU times: user 725 ms, sys: 62.9 ms, total: 788 ms
Wall time: 793 ms


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/srimal/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [4]:
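%time
# Note: a bare `%time` line times an empty statement, hence the microsecond timings
# in the output; `%%time` as the first line of the cell would time the whole cell.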

sia = SentimentIntensityAnalyzer()

display(sia.polarity_scores("Wow, NLTK is really powerful!"))
display(sia.polarity_scores("absolutely really bad"))

df['NLTK'] = df['Text'].apply(lambda x: sia.polarity_scores(str(x))['compound'])
df.head(3)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.72 µs


{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}

{'neg': 0.671, 'neu': 0.329, 'pos': 0.0, 'compound': -0.6214}

Unnamed: 0,Text,GT,NLTK
0,So there is no way for me to plug it in here i...,0,-0.3535
1,"Good case, Excellent value.",1,0.8402
2,Great for the jawbone.,1,0.6249
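
The `NLTK` column holds VADER's compound score, which lies between -1 and 1, whereas `GT` is a 0/1 label, so the score has to be mapped to a label before the two can be compared. A minimal sketch, assuming a cut-off of 0; the threshold and the helper names `score_to_label` and `accuracy` are illustrative and not part of the notebook:

```python
def score_to_label(score, threshold=0.0):
    """Map a signed sentiment score to 1 (positive) or 0 (negative)."""
    return 1 if score > threshold else 0


def accuracy(scores, labels, threshold=0.0):
    """Fraction of scores whose mapped label matches the ground truth."""
    predicted = [score_to_label(s, threshold) for s in scores]
    hits = sum(p == g for p, g in zip(predicted, labels))
    return hits / len(labels)


# Compound scores and GT labels of the three rows shown above
scores = [-0.3535, 0.8402, 0.6249]
labels = [0, 1, 1]
print(accuracy(scores, labels))  # 1.0
```

The same mapping works for any signed score, so it is not specific to VADER.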


### Transformers

In [5]:
from transformers import pipeline
import numpy as np

In [6]:
%%time

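def tfScore(text, classifier):
    """Return a signed score: negative for NEGATIVE labels, positive otherwise."""
    try:
        r = classifier(text)
        if r[0]['label'] == 'NEGATIVE':
            return -1.0 * r[0]['score']
        else:
            return r[0]['score']
    except Exception as e:
        # Report the failing text and return NaN so that apply() can continue
        display('ERROR:')
        display(f'- Text:  "{text}"')
        display(f'- Exception:  "{e}"')
        return np.nan


def getTFScore(text, classifier):
    # Thin wrapper around tfScore
    score = tfScore(text, classifier)
    return score

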
tfSentiment = pipeline('sentiment-analysis')

display(getTFScore('I feel horrible', tfSentiment))
display(getTFScore('I feel awesome', tfSentiment))
display(getTFScore('1', tfSentiment))


-0.9996577501296997

0.9998730421066284

0.9854032397270203

CPU times: user 1.19 s, sys: 295 ms, total: 1.49 s
Wall time: 11.9 s


In [7]:
%%time
df['Transformers'] = df['Text'].apply(lambda x: getTFScore(str(x), tfSentiment))


Token indices sequence length is longer than the specified maximum sequence length for this model (1103 > 512). Running this sequence through the model will result in indexing errors


'ERROR:'

'- Text:  " The structure of this film is easily the most tightly constructed in the history of cinema.  \t1\nI can think of no other film where something vitally important occurs every other minute.  \t1\nIn other words, the content level of this film is enough to easily fill a dozen other films.  \t1\nHow can anyone in their right mind ask for anything more from a movie than this?  \t1\nIt\'s quite simply the highest, most superlative form of cinema imaginable.  \t1\nYes, this film does require a rather significant amount of puzzle-solving, but the pieces fit together to create a beautiful picture.  \t1\nThis short film certainly pulls no punches.  \t0\nGraphics is far from the best part of the game.  \t0\nThis is the number one best TH game in the series.  \t1\nIt deserves strong love.  \t1\nIt is an insane game.  \t1\nThere are massive levels, massive unlockable characters... it\'s just a massive game.  \t1\nWaste your money on this game.  \t1\nThis is the kind of money that is was

'- Exception:  "index out of range in self"'

'ERROR:'

'- Text:  " In fact, it\'s hard to remember that the part of Ray Charles is being acted, and not played by the man himself.  \t1\nRay Charles is legendary.  \t1\nRay Charles\' life provided excellent biographical material for the film, which goes well beyond being just another movie about a musician.  \t1\nHitchcock is a great director.  \t1\nIronically I mostly find his films a total waste of time to watch.  \t0\nSecondly, Hitchcock pretty much perfected the thriller and chase movie.  \t1\nIt\'s this pandering to the audience that sabotages most of his films.  \t0\nHence the whole story lacks a certain energy.  \t0\nThe plot simply rumbles on like a machine, desperately depending on the addition of new scenes.  \t0\nThere are the usual Hitchcock logic flaws.  \t0\nMishima is extremely uninteresting.  \t0\nThis is a chilly, unremarkable movie about an author living/working in a chilly abstruse culture.  \t0\nThe flat reenactments don\'t hold your attention because they are emotionally 

'- Exception:  "index out of range in self"'

'ERROR:'

'- Text:  " With great sound effects, and impressive special effects, I can\'t recommend this movie enough.  \t1\nCall me a nut, but I think this is one of the best movies ever.  \t1\nGreat character actors Telly Savalas and Peter Boyle.  \t1\n1 hour 54 minutes of sheer tedium, melodrama and horrible acting, a mess of a script, and a sinking feeling of GOOD LORD, WHAT WERE THEY THINKING?  \t0\nLots of holes in the script.  \t0\nIt\'s like a bad two hour TV movie.  \t0\nNow imagine that every single one of those decisions was made wrong.  \t0\nThe dialogue is atrocious.  \t0\nThe acting is beyond abysmal.  \t0\nEverything stinks.  \t0\nTrouble is, the writing and directing make it impossible to establish those things that make a movie watchable, like character, story, theme and so on.  \t0\nWorse, there\'s an incredibly weak sub-plot thrown in that follows a little band of latter-day Mansonites as they go after a reporter who\'s working on a story on the anniversary of the killings.  \t

'- Exception:  "index out of range in self"'

CPU times: user 3min 16s, sys: 1.57 s, total: 3min 17s
Wall time: 1min 38s
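
The three failures above come from inputs longer than the model's 512-token limit (1103 tokens in the warning). One way to avoid the exception is to let the tokenizer truncate long inputs. The sketch below forwards keyword arguments to the classifier. Recent `transformers` releases are expected to accept `truncation=True` when a text-classification pipeline is called, but that is an assumption to check against the installed version. `fake_classifier` is a hypothetical stand-in so the example runs without downloading a model, and `signed_score` is an illustrative name:

```python
def signed_score(text, classifier, **kwargs):
    """Signed score in [-1, 1]; extra kwargs are forwarded to the classifier."""
    result = classifier(text, **kwargs)[0]
    sign = -1.0 if result['label'] == 'NEGATIVE' else 1.0
    return sign * result['score']


def fake_classifier(text, truncation=False, max_tokens=512):
    """Hypothetical stand-in for a pipeline: fails on long input unless truncating."""
    tokens = text.split()
    if len(tokens) > max_tokens:
        if not truncation:
            raise IndexError('index out of range in self')
        tokens = tokens[:max_tokens]  # keep only the first max_tokens tokens
    label = 'NEGATIVE' if 'bad' in tokens else 'POSITIVE'
    return [{'label': label, 'score': 0.9}]


long_text = 'good ' * 1103  # 1103 whitespace-separated tokens, above the limit
print(signed_score(long_text, fake_classifier, truncation=True))  # 0.9
```

With the notebook's pipeline the call would read `signed_score(text, tfSentiment, truncation=True)`. Truncation scores only the beginning of each text, so it hides the merged-record problem rather than fixing it.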


### Inspect text length

In [10]:
df['TextLength'] = df['Text'].str.len()  # number of characters per text

In [15]:
df.TextLength.sort_values()

590       7
64        7
101       8
295       9
924      11
       ... 
135    1053
149    1562
646    4487
19     4778
136    7944
Name: TextLength, Length: 2748, dtype: int64

In [20]:
print(df.loc[136].Text)

136                       Very good stuff for the price.
136     In fact, it's hard to remember that the part ...
136              I had a seriously solid breakfast here.
Name: Text, dtype: object
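
`df.loc[136]` returns three rows because `pd.concat` keeps each source frame's own index by default, so label 136 occurs once per dataset. A small sketch with made-up frames (`a` and `b` are hypothetical) shows that `ignore_index=True` yields unique labels:

```python
import pandas as pd

# Two hypothetical frames, each with its own default index 0, 1
a = pd.DataFrame({'Text': ['a0', 'a1'], 'GT': [0, 1]})
b = pd.DataFrame({'Text': ['b0', 'b1'], 'GT': [1, 0]})

stacked = pd.concat([a, b])                         # index: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)   # index: 0, 1, 2, 3

print(len(stacked.loc[1]))         # 2, two rows share label 1
print(renumbered.loc[3, 'Text'])   # b1, a single value
```

Calling `df.reset_index(drop=True)` after the concatenation gives the same result as `ignore_index=True`.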


### Save output

In [23]:
df.to_csv('../output/1-nltk-transformers.csv', index=False)