## Sentiment Analysis Experimentation with Starbucks Reviews

I want to see whether I can train on a global dataset and apply the result to my location and to other locations in the district.

In [1]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
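# Import nltk and ssl so that we can download the NLTK data packages used in this tutorial.
# The try/except below is a workaround for SSL certificate errors during the download:
# it turns off HTTPS certificate verification, so only use it on a network you trust.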
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download([
    "names",
    "stopwords",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt",
])

[nltk_data] Downloading package names to
[nltk_data]     /Users/tonymoceri/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tonymoceri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     /Users/tonymoceri/nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/tonymoceri/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/tonymoceri/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tonymoceri/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to


True

# Compiling Data

In [3]:
# Load the State of the Union corpus we just downloaded and split it into words.
# str.isalpha() keeps only the tokens made up entirely of letters, which drops punctuation and numbers.
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

In [4]:
# Common words such as "a", "the", "of", etc. are referred to as "stopwords". NLTK ships lists of them in its stopwords corpus, which we can use to filter these words out.
# Because the corpus contains stopwords in multiple languages, we pass the "english" argument here
stopwords = nltk.corpus.stopwords.words("english")

# Now, we can remove the stopwords from our words variable and limit the scope of the corpus a bit... 
words = [w for w in words if w.lower() not in stopwords]
words

['PRESIDENT',
 'HARRY',
 'TRUMAN',
 'ADDRESS',
 'JOINT',
 'SESSION',
 'CONGRESS',
 'April',
 'Mr',
 'Speaker',
 'Mr',
 'President',
 'Members',
 'Congress',
 'heavy',
 'heart',
 'stand',
 'friends',
 'colleagues',
 'Congress',
 'United',
 'States',
 'yesterday',
 'laid',
 'rest',
 'mortal',
 'remains',
 'beloved',
 'President',
 'Franklin',
 'Delano',
 'Roosevelt',
 'time',
 'like',
 'words',
 'inadequate',
 'eloquent',
 'tribute',
 'would',
 'reverent',
 'silence',
 'Yet',
 'decisive',
 'hour',
 'world',
 'events',
 'moving',
 'rapidly',
 'silence',
 'might',
 'misunderstood',
 'might',
 'give',
 'comfort',
 'enemies',
 'infinite',
 'wisdom',
 'Almighty',
 'God',
 'seen',
 'fit',
 'take',
 'us',
 'great',
 'man',
 'loved',
 'beloved',
 'humanity',
 'man',
 'could',
 'possibly',
 'fill',
 'tremendous',
 'void',
 'left',
 'passing',
 'noble',
 'soul',
 'words',
 'ease',
 'aching',
 'hearts',
 'untold',
 'millions',
 'every',
 'race',
 'creed',
 'color',
 'world',
 'knows',
 'lost',
 'he

Since all words in the stopwords list are lowercase, and those in the original list may not be, you can use str.lower() to account for any discrepancies. Otherwise, you may end up with mixed-case or capitalized stopwords still in your list.
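
To make the effect of the case mismatch concrete, here is a minimal sketch that needs no NLTK data. The tiny `sample_stopwords` set and the token list are made up for illustration and stand in for NLTK's much longer English list:

```python
# Hypothetical miniature stopword list; NLTK's real list is much longer.
sample_stopwords = {"the", "of", "a"}
tokens = ["The", "State", "of", "the", "Union"]

# Case-sensitive comparison: "The" does not match "the", so it slips through.
case_sensitive = [w for w in tokens if w not in sample_stopwords]

# Lowercasing each token before the comparison removes every stopword.
case_insensitive = [w for w in tokens if w.lower() not in sample_stopwords]

print(case_sensitive)    # ['The', 'State', 'Union']
print(case_insensitive)  # ['State', 'Union']
```

Note that only the comparison is lowercased. The surviving tokens keep their original capitalization, which is why the filtered list above still contains words like 'PRESIDENT'.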

It is possible to build your own text corpora from any source. Building a corpus can be as simple as loading some plain text or as complex as labeling and categorizing each sentence. NLTK's documentation describes how to work with corpus readers.
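
As a rough idea of the "loading some plain text" end of that range, the sketch below reads a text file into a word list using only the standard library. The file name and its contents are invented for the example, and the regular expression is a simple stand-in for a real tokenizer, not something NLTK provides:

```python
import re
import tempfile
from pathlib import Path

# Write a small text file to stand in for your own source material.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "reviews.txt"  # hypothetical file name
    sample_path.write_text("Great coffee. Slow service, though!", encoding="utf-8")

    raw_text = sample_path.read_text(encoding="utf-8")

# Keep only runs of letters, similar in spirit to the str.isalpha() filter used earlier.
custom_words = re.findall(r"[A-Za-z]+", raw_text)
print(custom_words)  # ['Great', 'coffee', 'Slow', 'service', 'though']
```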

For this tutorial, we are just going to use the built-in corpora provided by NLTK. 

In [5]:
# NLTK provides nltk.word_tokenize(), a function that splits raw text into individual words. This will deliver simple word lists really well. 

from pprint import pprint

text = """
For some quick analysis, creating a corpus could be overkill.
If all you need is a word list,
there are simpler ways to achieve that goal."""

pprint(nltk.word_tokenize(text), width=79, compact=True)

['For', 'some', 'quick', 'analysis', ',', 'creating', 'a', 'corpus', 'could',
 'be', 'overkill', '.', 'If', 'all', 'you', 'need', 'is', 'a', 'word', 'list',
 ',', 'there', 'are', 'simpler', 'ways', 'to', 'achieve', 'that', 'goal', '.']


# Creating Frequency Distributions 

A frequency distribution is essentially a table that tells you how many times each word appears within a given text. In NLTK, frequency distributions are a specific object type implemented as a distinct class called "FreqDist". This class provides useful operations for word frequency analysis.
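
As a rough illustration of the idea, the same kind of table can be sketched with `collections.Counter` from the standard library, which is the class `FreqDist` is built on. The token list here is made up for the example:

```python
from collections import Counter

# Hypothetical token list, as a tokenizer might produce it.
tokens = ["a", "word", "list", ",", "a", "corpus", ",", "a", "goal", "."]

# Counting occurrences gives a mapping from token to frequency.
counts = Counter(tokens)

print(counts["a"])            # 3
print(counts.most_common(2))  # [('a', 3), (',', 2)]
```

Punctuation is counted like any other token, so filter it out first if you only care about words.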

In [6]:
words: list[str] = nltk.word_tokenize(text)
fd = nltk.FreqDist(words)

# This will create a frequency distribution object similar to a Python dictionary but with added features. 

In [7]:
# After building the object, we can use methods like .most_common() and .tabulate() to start visualizing information.
# .tabulate() prints the table itself and returns None, so it should not be wrapped in print():
print(fd.most_common(3))
fd.tabulate(3)

[(',', 2), ('a', 2), ('.', 2)]
, a . 
2 2 2 


# Extracting Concordance and Collocations 
In the context of NLP, a __concordance__ is a collection of word locations along with their context. You can use these to find: 
1. How many times a word appears 
2. Where each occurrence appears 
3. What words surround each occurrence 

In NLTK, you can do this by calling .concordance(). To use it, you need an instance of the nltk.Text class, which can also be constructed with a word list.
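
To show what a concordance contains, here is a simplified stand-in written in plain Python. The function `simple_concordance` and its `width` parameter are hypothetical names for this sketch and are not part of NLTK; it only records each match position with a few surrounding tokens:

```python
def simple_concordance(tokens, target, width=2):
    """Return (position, context) pairs for each case-insensitive match of target."""
    matches = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            # Slice a window of `width` tokens on each side of the match.
            context = tokens[max(0, i - width): i + width + 1]
            matches.append((i, " ".join(context)))
    return matches

tokens = "Here in America we work hard and America will lead".split()
for position, context in simple_concordance(tokens, "america"):
    print(position, context)
# 2 Here in America we work
# 7 hard and America will lead
```

The length of the returned list answers "how many times", the positions answer "where", and the context strings answer "what surrounds each occurrence".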

In [8]:
# Before invoking .concordance(), build a new word list from the original corpus text so that all the context, even stop words, will be there
text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)

Displaying 5 of 1079 matches:
 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
 to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to 


In [12]:
# To obtain a usable list that will also give you information about the location of each occurrence, use .concordance_list(): 

concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
    print(entry.line)

[]


In [13]:
# Revisiting nltk.word_tokenize(), check out how quickly you can create a custom nltk.Text instance and an accompanying frequency distribution: 
words: list[str] = nltk.word_tokenize(
    """Beautiful is better than ugly. 
    Explicit is better than implicit.
    Simple is better than complex."""
)
text = nltk.Text(words)
fd = text.vocab()
fd.tabulate(3)

    is better   than 
     3      3      3 


Another powerful feature of NLTK is its ability to quickly find __collocations__ with simple function calls. Collocations are series of words that frequently appear together in a given text. In the State of the Union corpus, for example, you'd expect to find the words _United_ and _States_ appearing next to each other very often. Those two words appearing together form a collocation.

Collocations can be made up of two or more words. NLTK provides classes to handle several types of collocations: 
- __Bigrams:__ Frequent two-word combinations 
- __Trigrams:__ Frequent three-word combinations
- __Quadgrams:__ Frequent four-word combinations
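
The counting behind these n-grams can be sketched in a few lines of standard-library Python. The helper `ngram_counts` is a hypothetical name for this example and only tallies raw frequencies; NLTK's finder classes do more than this, for instance ranking candidates with association measures:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # Zipping n staggered views of the list yields consecutive n-tuples.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)

tokens = "the United States and the United Nations".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(bigrams.most_common(1))  # [(('the', 'United'), 2)]
print(len(trigrams))           # 5
```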

In [14]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

One of the most useful tools these finder classes offer is the `ngram_fd` property. This property holds a frequency distribution that is built for each collocation rather than for individual words.

In [15]:
# Using ngram_fd, we can find the most common collocations in this supplied text:
finder.ngram_fd.most_common(2)
finder.ngram_fd.tabulate(2)

  ('the', 'United', 'States') ('the', 'American', 'people') 
                          294                           185 


# Using NLTK's Pre-Trained Sentiment Analyzer 

NLTK already has a built-in, pretrained sentiment analyzer called VADER (**V**alence **A**ware **D**ictionary and s**E**ntiment **R**easoner).

Since VADER is pretrained, you can get results more quickly than with many other analyzers. However, VADER is best suited for language used in social media, like short sentences with some slang and abbreviations. It's less accurate when rating longer, structured sentences, but it's often a good launching point.

In [16]:
# To use VADER, first create an instance of nltk.sentiment.SentimentIntensityAnalyzer, then use .polarity_scores() on a raw string:

from nltk.sentiment import SentimentIntensityAnalyzer 
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")


{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}

This returns a dictionary of different scores. The negative, neutral, and positive scores are related: they all add up to 1 and cannot be negative. The compound score is calculated differently. It is not just an average, and it can range from -1 to 1.

In [17]:
# Let's load the twitter_samples corpus into a list of strings, making a replacement to render URLs inactive to avoid accidental clicks:
tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

In [18]:
# Now, let's use the .polarity_scores() function of your SentimentIntensityAnalyzer instance to classify tweets:

from random import shuffle
def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)

> False RT @Number10cat: In the usual #bbcqt slot the BBC is showing 30 minutes of just Nigel Farage; not sure I'll be able to tell the difference.…
> False RT @DVATW: Off to the #foodbank to fight off the hunger caused by evil Tory cuts... http//t.co/zjfy2Cq6zg
> False RT @serialsockthief: When Labour decided to side with Tories in September, they hurt Scotland. This time they'll hurt the whole of the UK. …
> False RT @andy2heart: watching nigel farage who speaks a lot nore sense than many other politicians he knows his facts and speaks a lot of sense
> False @Ed_Miliband What do you don if you're say 40 seats short of a majority and SNP can offer those seats?
> True RT @BurpTv: Now a vote for UKIP IS NOT A WASTED  VOTE!!!!
> True RT @MaggieBakesBuns: @Ed_Miliband you've won me over after a prolonged period of uncertainty. Please get rid of this Tory government.
> False RT @UKIP: #UKIP Leader @Nigel_Farage on UKIP's pledge for £3bn more for our #NHS #AskNigelFarage #bbcqt http//t.co/e

In this case, is_positive() uses only the positivity of the compound score to make the call. You can choose any combination of VADER scores to tweak the classification to your needs. 
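
One possible tweak, sketched below, is to leave a neutral band around zero instead of treating every score above 0 as positive. The function name `label_from_scores` and the `threshold` parameter are hypothetical, and the 0.05 cutoff is an assumption (a commonly used convention for VADER) that you would want to tune against your own data:

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dictionary to 'pos', 'neg', or 'neu'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    # Scores close to zero are treated as neutral rather than forced into a side.
    return "neu"

# Score dictionary copied from the polarity_scores() output shown earlier.
example = {"neg": 0.0, "neu": 0.295, "pos": 0.705, "compound": 0.8012}
print(label_from_scores(example))              # pos
print(label_from_scores({"compound": 0.01}))   # neu
print(label_from_scores({"compound": -0.4}))   # neg
```

The function takes the score dictionary rather than the raw text, so it can be tried out without loading the analyzer.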

Now take a look at the second corpus, movie_reviews. The special thing about this corpus is that it's already been classified. Therefore, you can use it to judge the accuracy of the algorithms you choose when rating similar texts. 

Keep in mind that VADER is likely better at rating tweets than it is at rating long movie reviews. To get better results, you'll set up VADER to rate individual sentences within the review rather than the entire text. 

Since VADER needs raw strings for its rating, you can't use .words() like you did earlier. Instead, make a list of the file IDs that the corpus uses, which you can use later to reference individual reviews: 

In [20]:
positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

In [21]:
from statistics import mean
def is_positive(review_id: str) -> bool: 
    """True if the average of all sentence compound scores is positive."""
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [
        sia.polarity_scores(sentence)["compound"]
        for sentence in nltk.sent_tokenize(text)
    ]
    return mean(scores) > 0

In [22]:
shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
    if is_positive(review_id):
        if review_id in positive_review_ids:
            correct += 1
    else:
        if review_id in negative_review_ids:
            correct += 1
print(f"{correct / len(all_review_ids):.2%} correct")

64.00% correct
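
A single accuracy figure hides which kind of mistake the classifier makes. The sketch below breaks the same comparison into four counts. The function `accuracy_report`, the review IDs, and the predictions are all invented for illustration and are not results from the movie_reviews corpus:

```python
def accuracy_report(predictions, positive_ids):
    """Compare predicted labels with known positive IDs.

    predictions maps an ID to True (predicted positive) or False.
    positive_ids is a set, so each membership test is fast.
    """
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for item_id, predicted_positive in predictions.items():
        actually_positive = item_id in positive_ids
        if predicted_positive and actually_positive:
            counts["tp"] += 1  # correctly rated positive
        elif predicted_positive:
            counts["fp"] += 1  # rated positive, actually negative
        elif actually_positive:
            counts["fn"] += 1  # rated negative, actually positive
        else:
            counts["tn"] += 1  # correctly rated negative
    accuracy = (counts["tp"] + counts["tn"]) / len(predictions)
    return accuracy, counts

# Hypothetical predictions for four made-up review IDs.
predictions = {"r1": True, "r2": True, "r3": False, "r4": False}
positive_ids = {"r1", "r3"}
accuracy, counts = accuracy_report(predictions, positive_ids)
print(f"{accuracy:.2%} correct", counts)
# 50.00% correct {'tp': 1, 'tn': 1, 'fp': 1, 'fn': 1}
```

If most errors turn out to be false positives, for example, that suggests raising the bar for calling a review positive.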


# Customizing NLTK's Sentiment Analysis
NLTK offers a few built-in classifiers that are suitable for various types of analyses, including sentiment analysis. The trick is to figure out which properties of your dataset are useful in classifying each piece of data into your desired categories.

In the world of machine learning, these data properties are known as __features__, which you must reveal and select as you work with your data.
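
As a minimal sketch of what a feature can look like in practice, the function below turns a piece of text into a small dictionary of numbers. The name `extract_features`, the chosen features, and the two word sets are all assumptions made for this example, not a fixed recipe:

```python
def extract_features(text, positive_words, negative_words):
    """Turn a text into a small dictionary of numeric features."""
    # Lowercase alphabetic tokens only; a real tokenizer would handle punctuation better.
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    return {
        "word_count": len(tokens),
        "positive_hits": sum(1 for w in tokens if w in positive_words),
        "negative_hits": sum(1 for w in tokens if w in negative_words),
    }

# Hypothetical word sets; in practice these would come from your data.
positive_words = {"friendly", "fast"}
negative_words = {"slow", "cold"}

features = extract_features(
    "Friendly staff but slow line and cold coffee",
    positive_words,
    negative_words,
)
print(features)
# {'word_count': 8, 'positive_hits': 1, 'negative_hits': 2}
```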

## Selecting Useful Features 
By using the predefined categories in the movie_reviews corpus, you can create sets of positive and negative words, then determine which ones occur most frequently across each set. Begin by excluding unwanted words and building the initial category groups:

In [24]:
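# Build the exclusion collection: English stopwords plus lowercased names.
# A set keeps the membership test in skip_unwanted() fast.
unwanted = set(nltk.corpus.stopwords.words("english"))
unwanted.update(w.lower() for w in nltk.corpus.names.words())

def skip_unwanted(pos_tuple):
    """Return False for tokens to drop: non-alphabetic, unwanted, or nouns."""
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    # Part-of-speech tags starting with "NN" mark nouns.
    if tag.startswith("NN"):
        return False
    return True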

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

In [25]:
positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
    del positive_fd[word]
    del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}
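
The cell above removes every word that occurs in both categories before picking the top words. The same steps can be tried on a toy example with `collections.Counter`. The words and counts below are made up and do not come from the movie_reviews corpus:

```python
from collections import Counter

# Hypothetical word counts for each category.
positive_counts = Counter({"great": 5, "film": 9, "moving": 3, "bad": 1})
negative_counts = Counter({"bad": 6, "film": 8, "boring": 4})

# Words seen in both categories are treated as carrying no signal...
shared = set(positive_counts) & set(negative_counts)

# ...so keep only the words unique to each category.
positive_only = Counter({w: c for w, c in positive_counts.items() if w not in shared})
negative_only = Counter({w: c for w, c in negative_counts.items() if w not in shared})

top_positive = {word for word, count in positive_only.most_common(2)}
top_negative = {word for word, count in negative_only.most_common(2)}
print(sorted(top_positive))  # ['great', 'moving']
print(sorted(top_negative))  # ['boring']
```

This filter is strict: "bad" is dropped from both sides even though it appears six times in the negative counts and only once in the positive ones. Comparing relative frequencies instead of requiring zero overlap would be one way to keep such words.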

In [28]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

positive_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
    if w.isalpha() and w not in unwanted
])
negative_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
    if w.isalpha() and w not in unwanted 
])