# Lesson 4
## Sentiment using nltk

In [2]:
import nltk


**Resources to download:**

names: A list of common English names compiled by Mark Kantrowitz

stopwords: A list of really common words, like articles, pronouns, prepositions, and conjunctions

state_union: A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens

twitter_samples: A list of social media phrases posted to Twitter

movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee

averaged_perceptron_tagger: A data model that NLTK uses to categorize words into their part of speech

vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert

punkt: A data model created by Jan Strunk that NLTK uses to split full texts into sentences, a step its word tokenizer relies on


In [3]:
nltk.download([
     "names",
     "stopwords",
     "state_union",
     "twitter_samples",
     "movie_reviews",
     "averaged_perceptron_tagger",
     "vader_lexicon",
     "punkt",
])

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is alre

True

In [None]:
# load the State of the Union corpus

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]  # keeps alphabetic tokens only, dropping punctuation and numbers

# Alternative: tokenize a raw string with nltk.word_tokenize(text)

In [None]:
words[:50]

In [None]:
# remove stopwords

stopwords = set(nltk.corpus.stopwords.words("english"))  # a set gives fast membership tests

words = [w for w in words if w.lower() not in stopwords]

In [None]:
words[:50]

In [None]:
# word frequency distribution

fd = nltk.FreqDist(words)

In [None]:
# most common words

fd.most_common(10)

In [None]:
fd.tabulate(10)

In [None]:
fd["America"]

In [None]:
fd["america"]

In [None]:
fd["AMERICA"]

In [None]:
# lowercase word frequencies

lower_words = [w.lower() for w in words]
lower_fd = nltk.FreqDist(lower_words)

In [None]:
lower_fd["america"]
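The lookups on `fd` above return different counts because the keys of a frequency distribution are case-sensitive. `nltk.FreqDist` is a subclass of `collections.Counter`, so the effect can be sketched with the standard library alone. The token list below is made up for illustration and is not taken from the corpus.

```python
from collections import Counter

# Hypothetical tokens standing in for the State of the Union words.
tokens = ["America", "america", "AMERICA", "America", "people"]

# Case-sensitive counts: each spelling is a separate key.
counts = Counter(tokens)

# Case-insensitive counts: normalize every token before counting.
lower_counts = Counter(t.lower() for t in tokens)

print(counts["America"], counts["america"], counts["AMERICA"])
print(lower_counts["america"])
```

Looking up a key that never occurred returns 0 rather than raising an error, which is why `fd["AMERICA"]` is safe to call even if that spelling is absent.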

In the context of NLP, a **concordance** is a collection of word locations along with their context. You can use concordances to find:

- How many times a word appears
- Where each occurrence appears
- What words surround each occurrence
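To make the three points above concrete, here is a minimal plain-Python sketch of the idea behind a concordance. It is not NLTK's implementation: the function name `concordance`, its `width` parameter, and the sample sentence are all hypothetical, chosen only for illustration.

```python
def concordance(tokens, target, width=3):
    """Return (position, left context, right context) for each match of target."""
    target = target.lower()  # match case-insensitively
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]    # up to `width` tokens before the match
            right = tokens[i + 1:i + 1 + width]   # up to `width` tokens after the match
            hits.append((i, left, right))
    return hits


# Hypothetical sample text, split on whitespace for simplicity.
tokens = "We love America and America loves its people".split()
hits = concordance(tokens, "america", width=2)

print("occurrences:", len(hits))  # how many times the word appears
for position, left, right in hits:
    # where each occurrence appears, and what surrounds it
    print(position, " ".join(left), "[" + tokens[position] + "]", " ".join(right))
```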

In [None]:
text1 = nltk.Text(nltk.corpus.state_union.words())

In [None]:
text1.concordance("america", lines=5)   # .concordance() already ignores case

In [None]:
concordance_list = text1.concordance_list("america", lines=3)

for entry in concordance_list:
    print(entry.line)

**Collocations** are series of words that frequently appear together in a text. They can be made up of two or more words, and NLTK provides classes to handle several types of collocations:

- Bigrams: Frequent two-word combinations
- Trigrams: Frequent three-word combinations
- Quadgrams: Frequent four-word combinations
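Each type in the list above comes down to counting runs of n consecutive tokens. The following is a rough standard-library sketch of that counting step, assuming a simple whitespace-split sample sentence. The helper name `ngram_counts` is hypothetical, and NLTK's collocation finders offer more than this, such as association scoring and filtering.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # Zipping n staggered views of the list yields consecutive n-tuples.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text, not taken from the corpus.
tokens = "the united states of america and the united states congress".split()

bigrams = ngram_counts(tokens, 2)   # two-word combinations
trigrams = ngram_counts(tokens, 3)  # three-word combinations

print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

The result maps word tuples to counts, which is the same general shape as the frequency distribution a finder exposes through `ngram_fd`.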

In [None]:
# reload the alphabetic tokens (stopwords included) and count trigrams
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

In [None]:
finder.ngram_fd.most_common(5)

### VADER

NLTK has a built-in, pretrained sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).

VADER scores raw strings, not token lists, so pass it untokenized text.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores("Wow, NLTK is really powerful!")

In [None]:
# new dataset

tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]  # list of raw tweets as strings

In [None]:
tweets[:10]

In [None]:
from random import shuffle

def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)
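`is_positive` treats any compound score above 0 as positive, so scores very close to zero are counted as positive too. A common convention, suggested by VADER's authors, is to treat compound scores within ±0.05 as neutral. The sketch below applies that convention to a bare compound score. The helper name `label_sentiment` and its `threshold` parameter are hypothetical, and the scores in the loop are made-up values, not VADER output.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a three-way label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"  # scores close to zero carry little signal


# Made-up compound scores for illustration only.
for score in (0.8, 0.01, -0.6):
    print(score, label_sentiment(score))
```

With the analyzer defined earlier, this could be called as `label_sentiment(sia.polarity_scores(tweet)["compound"])`.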

In [None]:
# the end