# Lesson 4
## Sentiment using NLTK

In [2]:
import nltk


**Elements to download:**

names: A list of common English names compiled by Mark Kantrowitz

stopwords: A list of really common words, like articles, pronouns, prepositions, and conjunctions

state_union: A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens

twitter_samples: A list of social media phrases posted to Twitter

movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee

averaged_perceptron_tagger: A data model that NLTK uses to categorize words into their part of speech

vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert

punkt: A data model created by Jan Strunk that NLTK uses to split full texts into sentences, which `nltk.word_tokenize` relies on before splitting them into word lists


In [3]:
nltk.download([
     "names",
     "stopwords",
     "state_union",
     "twitter_samples",
     "movie_reviews",
     "averaged_perceptron_tagger",
     "vader_lexicon",
     "punkt",
])

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is alre

True

In [4]:
# load the State of the Union corpus, keeping alphabetic tokens only
# (isalpha() drops punctuation and numbers)

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

# alternative: tokenize the raw text yourself with nltk.word_tokenize()
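
To see what the `isalpha()` filter keeps, here is a minimal sketch on a hand-made token list. The tokens are hypothetical and are not read from the corpus. Note that numbers are dropped along with punctuation.

```python
# Hypothetical tokens, similar in shape to what a corpus reader returns.
sample_tokens = ["April", "16", ",", "1945", "Mr", ".", "Speaker", ":"]

# str.isalpha() is True only when every character is a letter,
# so digit tokens and punctuation tokens are both filtered out.
alpha_tokens = [t for t in sample_tokens if t.isalpha()]

print(alpha_tokens)
```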

In [5]:
words[:50]

['PRESIDENT',
 'HARRY',
 'S',
 'TRUMAN',
 'S',
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 'April',
 'Mr',
 'Speaker',
 'Mr',
 'President',
 'Members',
 'of',
 'the',
 'Congress',
 'It',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'I',
 'stand',
 'before',
 'you',
 'my',
 'friends',
 'and',
 'colleagues',
 'in',
 'the',
 'Congress',
 'of',
 'the',
 'United',
 'States',
 'Only',
 'yesterday',
 'we',
 'laid',
 'to',
 'rest']

In [6]:
# remove stopwords (a set makes each membership test fast)

stopwords = set(nltk.corpus.stopwords.words("english"))

words = [w for w in words if w.lower() not in stopwords]
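
The `.lower()` call matters because the stopword list is lowercase while many corpus tokens are not. The sketch below illustrates this with the first tokens from the output shown earlier and a small stand-in stopword set; `mini_stopwords` is a hypothetical name and holds only a few entries, not the full NLTK list.

```python
# Tokens copied from the start of the corpus output shown earlier.
tokens = ["PRESIDENT", "HARRY", "S", "TRUMAN", "S", "ADDRESS", "BEFORE",
          "A", "JOINT", "SESSION", "OF", "THE", "CONGRESS"]

# Hypothetical mini stopword set (lowercase), standing in for NLTK's list.
mini_stopwords = {"s", "before", "a", "of", "the"}

# Without .lower(), uppercase tokens such as "THE" never match the set.
kept_case_sensitive = [t for t in tokens if t not in mini_stopwords]

# Lowercasing each token before the lookup removes them as intended.
kept_case_insensitive = [t for t in tokens if t.lower() not in mini_stopwords]

print(len(tokens), len(kept_case_sensitive))
print(kept_case_insensitive)
```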

In [7]:
words[:50]

['PRESIDENT',
 'HARRY',
 'TRUMAN',
 'ADDRESS',
 'JOINT',
 'SESSION',
 'CONGRESS',
 'April',
 'Mr',
 'Speaker',
 'Mr',
 'President',
 'Members',
 'Congress',
 'heavy',
 'heart',
 'stand',
 'friends',
 'colleagues',
 'Congress',
 'United',
 'States',
 'yesterday',
 'laid',
 'rest',
 'mortal',
 'remains',
 'beloved',
 'President',
 'Franklin',
 'Delano',
 'Roosevelt',
 'time',
 'like',
 'words',
 'inadequate',
 'eloquent',
 'tribute',
 'would',
 'reverent',
 'silence',
 'Yet',
 'decisive',
 'hour',
 'world',
 'events',
 'moving',
 'rapidly',
 'silence',
 'might']

In [8]:
# word frequency distribution

fd = nltk.FreqDist(words)

In [9]:
# most common words

fd.most_common(10)

[('must', 1568),
 ('people', 1291),
 ('world', 1128),
 ('year', 1097),
 ('America', 1076),
 ('us', 1049),
 ('new', 1049),
 ('Congress', 1014),
 ('years', 827),
 ('American', 784)]

In [None]:
fd.tabulate(10)

In [None]:
fd["America"]

In [None]:
fd["america"]

In [None]:
fd["AMERICA"]

In [None]:
# lowercase word frequencies

lower_words = [w.lower() for w in words]
lower_fd = nltk.FreqDist(lower_words)

In [None]:
lower_fd["america"]
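
`nltk.FreqDist` is a subclass of `collections.Counter`, so the case-sensitivity seen above can be reproduced with the standard library alone. The sketch below uses a hypothetical token list; the counts are illustrative and are not corpus figures.

```python
from collections import Counter

# Hypothetical tokens; the counts are illustrative, not corpus figures.
tokens = ["America", "america", "AMERICA", "America", "world"]

# Each spelling is a separate key, exactly as with fd["America"] etc.
counts = Counter(tokens)
print(counts["America"], counts["america"], counts["AMERICA"])

# Lowercasing first merges the three spellings into one key.
lower_counts = Counter(t.lower() for t in tokens)
print(lower_counts["america"])

# A key that never occurred returns 0 instead of raising KeyError.
print(lower_counts["congress"])
```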

In the context of NLP, a **concordance** is a collection of word locations along with their context. You can use concordances to find:

- How many times a word appears
- Where each occurrence appears
- What words surround each occurrence
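
As a rough illustration of those three points, here is a toy concordance written with plain Python. It is a sketch of the idea only, not NLTK's implementation; `simple_concordance` and its `width` parameter are hypothetical names.

```python
def simple_concordance(tokens, target, width=3):
    """Return (position, context) pairs for each case-insensitive match.

    A toy stand-in for the idea of a concordance, not NLTK's implementation.
    """
    target = target.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Take up to `width` tokens on each side of the match.
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((i, " ".join(left + [tok] + right)))
    return hits


# Hypothetical token list for illustration.
tokens = "Here in America we work hard . That is what america will do".split()
hits = simple_concordance(tokens, "America", width=2)

print(len(hits))  # how many times the word appears
for position, context in hits:
    print(position, context)  # where it appears and what surrounds it
```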

In [10]:
text1 = nltk.Text(nltk.corpus.state_union.words())

In [11]:
text1.concordance("america", lines=5)   # .concordance() already ignores case

Displaying 5 of 1079 matches:
 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
 to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to 


In [12]:
concordance_list = text1.concordance_list("america", lines=3)

for entry in concordance_list:
    print(entry.line)

 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom


**Collocations** are series of words that frequently appear together in a text. They can be made up of two or more words, and NLTK provides classes to handle several types of collocations:

- Bigrams: Frequent two-word combinations
- Trigrams: Frequent three-word combinations
- Quadgrams: Frequent four-word combinations
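
The underlying idea is to slide a window of n tokens over the text and count each window. The sketch below does this with the standard library on a hypothetical token list; the helper name `ngrams` is an assumption made for this example, and it counts raw frequencies only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples."""
    # zip stops at the shortest slice, so only full windows are produced.
    return zip(*(tokens[i:] for i in range(n)))


# Hypothetical tokens for illustration.
tokens = "the United States and the United Nations".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(1))
print(len(trigram_counts))
```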

In [13]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

In [14]:
finder.ngram_fd.most_common(5)

[(('the', 'United', 'States'), 294),
 (('the', 'American', 'people'), 185),
 (('of', 'the', 'world'), 154),
 (('of', 'the', 'United'), 145),
 (('to', 'the', 'Congress'), 139)]

### VADER

NLTK ships with a built-in, pretrained sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).

VADER scores raw strings, not token lists, so pass it the original text.

In [17]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [18]:
sia.polarity_scores("Wow, NLTK is really powerful!")

{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}
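
In this output, `neg`, `neu`, and `pos` are proportions that sum to roughly 1, while `compound` is a normalized overall score between -1 and 1. A minimal sketch of turning such a dictionary into a label follows. The function name `label_from_scores` is hypothetical, and the 0.05 cutoff is an assumed convention, not something fixed by NLTK.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'.

    The 0.05 cutoff is an assumed convention, not something fixed by NLTK.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores copied from the output above.
scores = {"neg": 0.0, "neu": 0.295, "pos": 0.705, "compound": 0.8012}

print(label_from_scores(scores))

# The three proportions add up to (approximately) 1.
print(round(scores["neg"] + scores["neu"] + scores["pos"], 3))
```

Raising `threshold` makes the classifier more conservative: scores close to zero are then reported as neutral.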

In [22]:
# new dataset: raw tweets as a list of strings
# "://" is replaced with "//" to disable the URLs, so that no unwanted webpages can be opened

tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]
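
The effect of the `replace` call on a single string can be checked in isolation. The tweet text below is hypothetical and uses placeholder domains.

```python
# Hypothetical tweet text containing two URLs.
raw = "Read this https://example.com/page and http://example.org too"

# Removing the colon after the scheme leaves the text readable
# but no longer a well-formed URL.
defused = raw.replace("://", "//")

print(defused)
```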


In [20]:
tweets[:10]

['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(',
 "oh god, my babies' faces :( https//t.co/9fcwGvaki0",
 '@RileyMcDonough make me smile :((',
 '@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on search :( http//t.co/XvmTUikWln',
 'why?:("@tahuodyy: sialan:( https//t.co/Hv1i0xcrL2"',
 'Athabasca glacier was there in #1948 :-( #athabasca #glacier #jasper #jaspernationalpark #alberta #explorealberta #… http//t.co/dZZdqmf7Cz']

In [24]:
from random import shuffle

def is_positive(tweet: str) -> bool:
    """True if the tweet's compound score is above 0.1, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0.1

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)

> False I haven't gotten any sleep and I have to be up in 3 1/2 hours :)))))))))))))))
> True RT @Tommy_Colc: Financial Times come out in support of Tories claiming Miliband is "preoccupied w/ inequality". The man who wrote it http:/…
> False RT @HumzaYousaf: That sound you hear is the final nail hammered into New Labour coffin as Ed Miliband says he'd rather let Tories in than w…
> False RT @natalieben: #bbcqt Miliband: "I'm the 1st Labour leader going into an election saying spending in key areas is going to fall" #austeria…
> True @smartcookiesam @Confarreo I played dominoes in a pub once. It all got quite heated! Enjoyed it though :D
> True RT @amiablecynic: I'm a #GreenParty member but I'll vote #Labour because I believe Ed has the best chance of ending #Tory rule #dontvotegre…
> False RT @LucioFulciFan: @ChristinaSNP Maybe now even the thickest Labourites in Scotland can see Labour would rather side with the Tories than s…
> False RT @Plaid_Cymru: Miliband confirms he would rathe

In [None]:
# the end