<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/29_Word_Sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

In its most basic form, sentiment analysis tries to measure the overall positivity and negativity from the tone of a text, with the idea that you are able to capture the opinion and/or mood of the author(s). You might also see this referred to as *valence* or *polarity*. The classic examples of sentiment analysis usually discuss how one can use sentiment analysis to detect overall negative versus positive tone in movie or product reviews.

A lot of sentiment analysis libraries are rule-based, which means that the creator of the resources has spent time trying to program the best set of rules to analyse language for these features.

## Where does sentiment come from?

Where does one obtain measures of positivity and negativity? In many cases the sentiment ratings are obtained from human perceptions of how positive or negative individual words are in isolation. These lexicons or wordlists will include high frequency content words (in particular adjectives, which should make sense), and store them in a manner where each word has a "score" indicating how positive or negative it is. These scores differ in how they are collected. For instance, someone could rate a word from 0 to 1, with 0 being negative and 1 being positive. Or one could use a Likert scale from 1 (negative) to 7 (positive), or one could gather votes for whether a word is negative or positive from a variety of people and then let the label with the most votes "win." The point is, there are a few different approaches to capturing these perceptions.
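
To make these rating schemes a bit more concrete, here is a minimal sketch of how ratings collected on different scales could be brought onto one common range, and how votes could be tallied. The function names `rescale` and `majority_vote` and the example ratings are made up for illustration; they are not taken from any particular lexicon.

```python
from collections import Counter

def rescale(value, low, high):
    """Map a rating from the range [low, high] onto [-1, 1]."""
    return 2 * (value - low) / (high - low) - 1

def majority_vote(votes):
    """Return the label with the most votes (ties go to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]

# hypothetical ratings for one word, collected three different ways
print(rescale(0.5, 0, 1))  # midpoint of a 0-to-1 scale
print(rescale(7, 1, 7))    # top of a 1-to-7 Likert scale
print(majority_vote(['positive', 'negative', 'positive']))
```

Once everything sits on the same range, ratings from different sources can at least be compared side by side, although the people and instructions behind each scale still differ.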

## Who is giving these ratings?

Crowdsourcing is one method, where researchers can hire people on platforms such as Amazon Mechanical Turk to provide ratings for words. This allows for rapid and cheap data annotation (in fact Amazon Mechanical Turk has been called "artificial artificial intelligence" because so many annotation tasks were originally farmed out to workers on that platform). In this way, we can view the most simple form of sentiment analysis as a lexical resource, similar to the names corpus, WordNet, and a variety of other things we have already explored.

# AFINN Sentiment Lexicon

Many sentiment lexicons are literally lists of words with scores. For example, I downloaded the [AFINN Sentiment Lexicon](http://www2.imm.dtu.dk/pubdb/pubs/6010-full.html). Inside the folder are three files: an older version of the list, a newer version of the list, and a README file explaining the resource. Here is the description from the README file:


*AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words have been manually labeled by Finn Årup Nielsen in 2009-2011. The file is tab-separated. There are two versions.*

*AFINN-111: Newest version with 2477 words and phrases.*

*AFINN-96: 1468 unique words and phrases on 1480 lines. Note that there are 1480 lines, as some words are listed twice. The word list in not entirely in alphabetic ordering.*

The readme even provides the code for how to import the resource into Python, cool :)

Let's play with the AFINN lexicon. I've uploaded the file `AFINN-111.txt` to the course GitHub repository so it can be downloaded directly from there. I'll use the `requests` library to ask for the file via a URL.

In [None]:
import requests

# save the URL as a string to a variable
afinn_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical-resources/AFINN-111.txt'

# call the url and ask for the .text
AFINN = requests.get(afinn_url).text

Look at the AFINN file - it's literally a single string containing a word then a rating, but it also has those funky `\t` and `\n` characters in there, which mean TAB and NEWLINE, respectively. Ugh, we have to figure out a way to load this text in (let's pretend the authors of AFINN didn't give us a solution. I'm also not going to use their solution, although there is nothing wrong with it).


In [None]:
# just reading in AFINN gives us this, explore the first 500 characters
AFINN[:500]


We know that we can use `.split()` on a string to split a string wherever we like, with the default being whitespace. So we can use `.split('\n')` to split the string on newlines. Let's do that with our initial call to the resource.

In [None]:
AFINN = requests.get(afinn_url).text.split('\n')

In [None]:
# AFINN but now split on the newline characters.
AFINN[:10]

Now the resource is a list where each element is a string in the form of `Word\tRating`. We can exploit this structure to loop through each string and use `.split('\t')` to separate the word from its rating.

Let's follow the advice of the AFINN author and turn this into a dictionary. We want to create a dictionary, and then add each word as a key and the rating as a value. One way, and perhaps an easy way to do this, is to create a list comprehension first and then run `dict()` on the list comprehension. I'll break down the steps for you here, but you could probably figure out a way to do this in one line.



In [None]:
# Use a list comprehension to split each string into a [word, rating] pair
afinn_list = [pair.split('\t') for pair in AFINN]

In [None]:
# Check it out - each element is now the word and the rating!
afinn_list[:10]

A remaining issue is that our numbers are represented by strings, but we want the numbers to be numbers, so we can calculate averages, etc. Let's change `afinn_list` so that each rating is converted using `int()`. (We could have done this in one go above, but I'm splitting it here to break down the process.)

In [None]:
# convert the rating to int() using another list comprehension
afinn_list = [(word, int(rating)) for word, rating in afinn_list]

Now let's put the list into a dictionary so that we can access word entries and their ratings.

In [None]:
# chuck that list into a dictionary
afinn_dict = dict(afinn_list)

In [None]:
# it works!
afinn_dict['happy']
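
For comparison, the split, convert, and `dict()` steps above can be condensed into a single dictionary comprehension. This is only a sketch on a tiny made-up sample string in the same `Word\tRating` format, not the real file; the `if line` check skips any blank lines so the unpacking into `word, rating` does not fail.

```python
# a tiny stand-in for the downloaded file: word, TAB, rating, NEWLINE
sample = 'excellent\t3\nawful\t-3\nsuperb\t5\n'

# split into lines, skip any empty line, split on the TAB, convert the rating
sample_dict = {
    word: int(rating)
    for word, rating in (line.split('\t') for line in sample.splitlines() if line)
}

print(sample_dict)  # {'excellent': 3, 'awful': -3, 'superb': 5}
```

The step-by-step version is easier to debug, which is why it is worth writing it out first.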

We can now start thinking about the values associated with the words. Remember, the author of AFINN has told us that words range from -5 (negative) to +5 (positive). Look at the examples I've chosen, do their ratings make sense to you? Also, the word "banana" is not in the lexicon – think about it for a second – does that make sense? What polarity would *you* assign to the word banana?

In [None]:
# we can now look up words in the dictionary to get their ratings
test_words = ['ironic', 'tired', 'happy', 'banana', 'alive', 'hurt']

# loop through words, check if in dict, if so, print word and rating
for word in test_words:
  if word in afinn_dict.keys():
    print(word, afinn_dict[word])

If you were to loop through the dictionary and search for words with a rating of negative 5, you will find some words which are offensive and vulgar. Of course, we have to remain somewhat dispassionate here, because we would expect these words to exist in natural language and thus attempt to account for them in any analysis of natural language. I'll leave it to you to write and execute such a loop, but fair warning you will see some words which might offend you.

```
# print words with -5 or 5 as a rating
for word in afinn_dict.keys():
  if afinn_dict[word] == 5 or afinn_dict[word] == -5:
    print(word, afinn_dict[word])
```

Instead, I'll pick out some of the less offensive words (at least, I think so!) and show you what counts as fully negative or fully positive. There are actually fewer than 20 words with such extreme ratings.

In [None]:
# these words have a -5 or +5 rating
extreme_words = ['bastard', 'twat', 'superb', 'thrilled']

for word in extreme_words:
  print(word, afinn_dict[word])
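
If you want to see how many words sit at each rating without printing the words themselves, one option is to tally the dictionary's values. The sketch below uses a small toy dictionary (`toy_lexicon` is a made-up name) so that it runs on its own; passing `afinn_dict.values()` instead would tally the real lexicon.

```python
from collections import Counter

# toy lexicon for illustration; swap in afinn_dict to tally the real ratings
toy_lexicon = {'superb': 5, 'thrilled': 5, 'excellent': 3, 'awful': -3}

# count how many words carry each rating
rating_counts = Counter(toy_lexicon.values())

# print the ratings from most negative to most positive
for rating in sorted(rating_counts):
    print(rating, rating_counts[rating])
```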

## Using the AFINN sentiment scores

Let's see how one could use this resource, and whether or not the output makes sense.

Let's start with two extremely different sentences which we would expect to be very negative or very positive.

In [None]:
positive_sentence = 'That meal was excellent.'
negative_sentence = 'That meal was awful.'

Now let's write a function which will calculate the AFINN polarity for each word in the sentence, and then average that value. This should give us the average sentiment for each sentence.

In [None]:
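# we'll need nltk for word_tokenize (which itself needs punkt)
import nltk
nltk.download('punkt')
# assumption: newer NLTK releases look for 'punkt_tab' instead, so fetch it too
nltk.download('punkt_tab')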

In [None]:
def afinn_average(sentence, afinn):
  """calculates average afinn sentiment for a given string"""

  # tokenise a lower-cased version of our sentence
  tokens = nltk.word_tokenize(sentence.lower())

  # initialise empty output to store sentiment ratings
  output = []

  for token in tokens:
    # first check if the token is even in the dictionary
    if token in afinn.keys():
      #print(token) # for seeing which words are actually counting
      # add the rating to the output if so
      output.append(afinn[token])

  # we calculate the average sentiment of the text from all the values in our output
  if output:
    avg_sentiment = sum(output)/len(output)
    # let's inform the user how many words from the sentence were actually used in the calculation
    # this is a measure of "coverage"
    print(f'sentiment is {avg_sentiment},\ncalculation used {round(len(output)/len(tokens)*100, 2)}% of words,\nwhich is {len(output)} words total')
  else:
    print('Nothing in this text was in the sentiment dictionary')

In [None]:
# calculate the average of the positive sentence
afinn_average(positive_sentence, afinn_dict)

In [None]:
# calculate the average of the negative sentence
afinn_average(negative_sentence, afinn_dict)

Seems to be working, however, you probably have realised that these "average" values are actually only taking sentiment from one word in each sentence. The function skips the words `that meal was` because none of those words were in the `afinn_dict`. This is an issue related to **coverage** – the performance of a parser or lexical resource is largely dependent upon how much of the text can actually be analysed. In the two examples above, our resource only provided coverage over 20% of each text.

Keep in mind, punctuation is still counted in those tokens, so our length variable will also be slightly different depending upon how we will deal with punctuation.
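
To see how much the punctuation decision matters, here is a rough sketch of a coverage calculation with and without punctuation tokens. It uses a simple regular expression as a stand-in for `nltk.word_tokenize`, and `coverage` and `toy_lexicon` are hypothetical names for this illustration only.

```python
import re

def coverage(text, lexicon, count_punctuation=True):
    """Return the share of tokens that are found in the lexicon."""
    # rough tokeniser: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    if not count_punctuation:
        # keep only tokens that contain at least one letter or digit
        tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    if not tokens:
        return 0.0
    matched = [t for t in tokens if t in lexicon]
    return len(matched) / len(tokens)

toy_lexicon = {'excellent': 3, 'awful': -3}
print(coverage('That meal was excellent.', toy_lexicon))                           # 0.2
print(coverage('That meal was excellent.', toy_lexicon, count_punctuation=False))  # 0.25
```

The same sentence goes from 20% to 25% coverage just by dropping the full stop, so it is worth reporting which choice was made.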

Let's see what happens if we combine our sentences. If we join our sentences (with a space in between), we see that the calculation is still based on the same 2 words, which is still 20% of the text, but our text's sentiment is now 0. This is because the -3 rating of `awful` and the +3 rating of `excellent` sum to zero.

So, in this sense, a "neutral" text is simply a text which does not have a strong tendency towards positive or negative, but we can clearly see that there exists both positive and negative sentiment in the text!

In [None]:
# the positive and negative words cancel each other out
afinn_average(positive_sentence + ' ' +  negative_sentence, afinn_dict)
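
One way to avoid hiding this cancellation is to report the positive and negative totals separately alongside the average. The following is a sketch under assumptions: `polarity_breakdown` and `toy_lexicon` are hypothetical names, and a simple regular expression stands in for `nltk.word_tokenize`.

```python
import re

def polarity_breakdown(text, lexicon):
    """Return the positive total, negative total, and average of matched ratings."""
    # rough tokeniser standing in for nltk.word_tokenize
    tokens = re.findall(r"[a-z']+", text.lower())
    ratings = [lexicon[t] for t in tokens if t in lexicon]
    return {
        'positive': sum(r for r in ratings if r > 0),
        'negative': sum(r for r in ratings if r < 0),
        'average': sum(ratings) / len(ratings) if ratings else None,
    }

toy_lexicon = {'excellent': 3, 'awful': -3}
print(polarity_breakdown('That meal was excellent. That meal was awful.', toy_lexicon))
# {'positive': 3, 'negative': -3, 'average': 0.0}
```

An average of 0 with non-zero positive and negative totals tells a different story from an average of 0 with no sentiment words at all.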

Let's look at a larger text to see if adding words helps here.

This text is a satirical story from *The Civilian*, a New Zealand satirical newspaper which seems to no longer be around :(

In [None]:
xmas_threat = """While Auckland prepares to move to Alert Level 3 on Wednesday,
New Zealanders in the fun part of the country have been wondering when they might see a drop down to Level 1,
given the persistent lack of Covid cases since the outbreak.
But the Government isn't currently considering any such plans,
and Prime Minister Jacinda Ardern poured cold water on the idea at today's post-Cabinet press conference,
reiterating that Level 2 may last for “some time”
and that a move to Level 1 won't be considered for at least as long as there remains
a threat that organisers decide to go ahead with annual kids' concert Christmas in the Park.
"""

In [None]:
afinn_average(xmas_threat,afinn_dict)

Ok, so our results are saying that, on average, this text is more negative than positive (because a neutral text would have a rating of 0). At the same time, the entire sentiment of the text is based on only 5 words — we can check which words those are:



In [None]:
# let's scrutinize the exact words being used for the calculations here.
for word in nltk.word_tokenize(xmas_threat):
  if word.lower() in afinn_dict.keys(): # need to lower to check properly
    print(word, afinn_dict[word.lower()])

### Identifying the limitations of a lexicon-based approach

Ok, so we see that the overall sentiment in that text is dictated by the average sentiment of `alert, fun, drop, lack, threat`. We see the same problem here that occurred above: the sentiment of the text is being driven by a very small number of words.

While it does make sense that specific words, namely **content** words such as adjectives, adverbs, verbs, and nouns are doing the heavy lifting to colour the overall sentiment of a text, it's a bit frustrating to see so few words being included in the lookup lexicon. This is again a matter of coverage and a limitation of any lexicon-based approach.

These results also identify a second problem with a lexicon-based approach. Each word is being treated in isolation, without consideration of the words that come before or after. As we have seen with POS tagging and WordNet, a single word can take on possible different meanings and senses, and knowing which meaning/sense/part of speech is intended relies on knowing how the word is *used* - how it patterns with *other* words. Our sentiment lexicon does not capture this.

For example, look at that sentence which contains the word `lack`:

`given the persistent lack of Covid cases since the outbreak.`

Is this sentence negative? One would think not, a *lack* of COVID cases is probably an overall *good* thing (unless someone doubts the authenticity of such a report, but let's just keep it simple for now). So, while the decontextualized meaning of `lack` is more or less negative (although even that is debatable), we see that it is referring to the lack of a bad thing – which we might want to reasonably argue as a good thing. This, hopefully, is starting to show you that word-based lexicons must be carefully used and interpreted, because words do not create meaning on their own...their meaning is also dependent upon the ways they are used with other words.

I hope that this shows you the limitations of using a lookup approach like this. That does not mean that this is a terrible approach to use, but rather we must be careful with such an approach. Certain texts will be more suitable for this type of analysis, and parsers with better rules will also help.

To finalise this point a bit more, also consider the topic of negation in English. Saying that something is *not* good means that it is bad. And saying something is "not bad" usually means that it is at least good, right? The AFINN dictionary cannot take this information into account:

In [None]:
# something that is not terrible should be good?
afinn_average('This is not terrible', afinn_dict)

In [None]:
# something that is not superb should be bad?
afinn_average("this is not superb", afinn_dict)

How could we resolve this? We could write a new set of rules which looks for negation words before our sentiment words. This sort of approach and the limitations of existing sentiment libraries such as AFINN and many others is what prompted CJ Hutto to make a better rule-based sentiment library named VADER.
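
As a rough idea of what such a rule could look like, the sketch below flips the sign of a rating when the word directly before it is a negator. This is a deliberately crude illustration and not how VADER does it: the `NEGATORS` set, the function name, and the toy lexicon are all assumptions made for this example.

```python
import re

NEGATORS = {'not', 'no', 'never'}  # hypothetical, far from complete

def negation_aware_ratings(text, lexicon, negators=NEGATORS):
    """Look up each token, flipping the sign if the previous token is a negator."""
    # rough tokeniser standing in for nltk.word_tokenize
    tokens = re.findall(r"[a-z']+", text.lower())
    ratings = []
    for position, token in enumerate(tokens):
        if token in lexicon:
            rating = lexicon[token]
            if position > 0 and tokens[position - 1] in negators:
                rating = -rating
            ratings.append(rating)
    return ratings

toy_lexicon = {'superb': 5, 'awful': -3}
print(negation_aware_ratings('this is superb', toy_lexicon))      # [5]
print(negation_aware_ratings('this is not superb', toy_lexicon))  # [-5]
print(negation_aware_ratings('this is not awful', toy_lexicon))   # [3]
```

Even this small rule has obvious gaps: it misses "not really superb", and treating "not superb" as exactly as negative as "superb" is positive is probably too strong.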

# Sentiment Analysis with VADER

VADER (Valence Aware Dictionary for sEntiment Reasoning) was created to address the shortcomings with some of the lexicon approaches to sentiment analysis. If you read or even skimmed the paper and other links associated with VADER, you can see the author provides a rationale for why VADER is "better," even though VADER is still rule-based!

The author of VADER incorporated information from a variety of prior sentiment dictionaries, and then also thought about features such as capitalization, emoticons, non-standard words, negation, and a variety of other ways that people actually *use* language.  

And, VADER is relatively simple to use in NLTK :)

Let's get VADER into our notebook. First we import `nltk` and then download the resource, as usual. The resource we need is `vader_lexicon`.

In [None]:
# import nltk
import nltk
# download the vader lexicon
nltk.download('vader_lexicon')

Now we want to create a sentiment analyser object. To do so, we import the analyzer from VADER and save it to a variable name.

In [None]:
# Import the vader sentiment analyzer and save to the variable `sid`
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

We are interested primarily in two things from the `sid` object:

- `lexicon`
- `polarity_scores`

Let's peek at the `lexicon` part of this. It's another dictionary with valence ratings for individual words.

In [None]:
# the lexicon is a dictionary!
type(sid.lexicon)

In [None]:
# words have valence measures just like the AFINN dictionary.

sid.lexicon['irony']

So, we could simply use this lexicon in the same way that we did for AFINN above. In fact, when I made `afinn_average`, I added a second argument to allow any dictionary to be used. Let's redo the function now but make it clear it can take any dictionary. This is identical to the program above except I have replaced `afinn` with `sentiment_dict` and changed the function name to `sentiment_lookup`

In [None]:
def sentiment_lookup(sentence, sentiment_dict):
  """calculates average sentiment for a given sentence"""

  # tokenise a lower-cased version of our sentence
  tokens = nltk.word_tokenize(sentence.lower())

  # initialise empty output to store sentiment ratings
  output = []

  for token in tokens:
    # first check if the token is even in the dictionary
    if token in sentiment_dict.keys():
      # add the rating to the output
      output.append(sentiment_dict[token])

  # we calculate the average sentiment of the text from all the values in our output
  if output:
    avg_sentiment = sum(output)/len(output)
    # let's inform the user how many words from the sentence were actually used in the calculation
    print(f'sentiment is {avg_sentiment},\ncalculation used {round(len(output)/len(tokens)*100, 2)}% of words,\nwhich is {len(output)} words total')
  # we need to prevent a division error if there are no words in the dictionary.
  else:
    print('Nothing in this text was in the sentiment dictionary')

Let's check out how well this VADER dictionary does on our xmas threat text, and compare it to AFINN.

In [None]:
# reminder of what the text is
xmas_threat

In [None]:
# using AFINN lexicon
sentiment_lookup(xmas_threat, afinn_dict)

In [None]:
# using VADER lexicon
sentiment_lookup(xmas_threat, sid.lexicon)

Interesting! Both dictionaries only used five words, but we get different scores (although both are negative). Let's again scrutinize what words are contributing here.

In [None]:
# afinn
for word in nltk.word_tokenize(xmas_threat):
  if word.lower() in afinn_dict.keys():
    print(word, afinn_dict[word.lower()])

In [None]:
# vader
for word in nltk.word_tokenize(xmas_threat):
  if word.lower() in sid.lexicon.keys():
    print(word, sid.lexicon[word.lower()])

Welp, looks like they are *the exact same words*, but the values are different. This is because VADER was initially created by looking up the previous sentiment dictionaries and tweaking the values based on an assessment of prior dictionaries and lexical resources as well as human annotation. So the hard-coded sentiments in VADER are "better" in some regard. But that's not the only improvement VADER has made. VADER also applies a number of additional rules that take into account how words are being used when calculating their sentiment. For instance, VADER can better account for negated uses of words.

## VADER is more than a lookup dictionary

Above I used VADER as a lookup dictionary the exact same way as AFINN. But VADER is more than a lookup dictionary. VADER also has a bunch of other rules which are part of its scoring algorithm that take into account the contexts words appear in. You can scan the [rules here](https://www.nltk.org/_modules/nltk/sentiment/vader.html). You'll see certain rules, such as declaring certain words as intensifiers and also taking negation into account.

To get the *real* polarity scores, you use the main VADER function, `polarity_scores`, to obtain sentiment scores that have taken these additional rules into account. The function will return a set of values: `neg`, `neu`, `pos`, and `compound`.

VADER takes into account various punctuation and capitalization features, which means we actually want to pass raw strings to VADER.  

> `sid.polarity_scores('string input')`

In [None]:
sid.polarity_scores("You can't beat Wellington on a good day.")

Consider how two additions change the ratings (an intensifier and an exclamation point):

In [None]:
sid.polarity_scores("You can't friggen beat Wellington on a good day!")

## Understanding VADER output

The output in the examples above has a "negative" score, a "neutral" score, and a "positive" score. For both texts, the negative score is 0, the neutral score is the highest, and the positive score is about .3. This suggests both versions of the statement are not negative, but more neutral or positive. However, the value we *really* care about is the "compound" score, which takes into account a variety of extra rules the author of VADER has incorporated into the program. You can find the scoring system [here](https://github.com/cjhutto/vaderSentiment#about-the-scoring), which says that:

>>> "The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate."

The following guidelines apply to the compound score:

* positive sentiment: compound score >= 0.05
* neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
* negative sentiment: compound score <= -0.05

So, the above sentiment is best seen in the compound scores for the sentences of .44/.49, suggesting they are both overall positive. Do you agree? :)
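
The threshold guidelines listed above are easy to turn into a small helper. This is just a sketch; `label_compound` is a hypothetical name and not part of VADER or NLTK.

```python
def label_compound(compound):
    """Turn a compound score into a label using the guideline cut-offs."""
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

# a few example values on either side of the cut-offs
for score in [0.44, 0.05, 0.0, -0.05, -0.6]:
    print(score, label_compound(score))
```

With VADER loaded, you could pass in the `compound` value from the dictionary that `sid.polarity_scores()` returns.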

Let's see what VADER says about our *Christmas in the Park* text.

## VADER and negation

We saw above that the inclusion of the word `friggen` and the use of an exclamation mark enhanced the positivity of the sentence. Another thing VADER takes into account is negation:

In [None]:
# happy is being negated, so it should be negative
sid.polarity_scores('I am not happy')

In [None]:
# no negation, so happy should be positive.
sid.polarity_scores('I am happy')

Just to remind you, look how poorly the AFINN dictionary performs using basic look up and averaging. It says this sentence is *positive* because it is naively looking at the word `happy` and not considering the surrounding words. Clearly, VADER is providing a better understanding of the sentiment in the sentence.

In [None]:
# compare the "negated" sentence with AFINN
sentiment_lookup('I am not happy', afinn_dict)

## VADER and emoticons

VADER also has some support for parsing the sentiment of emoticons. Look how the frowny face makes the word "Thanks" neutral, whereas the happy face makes "Thanks" more positive than just "Thanks" alone.

In [None]:
# explore how emoticons are parsed by VADER
thanks = ['Thanks', 'Thanks :(',  'Thanks :)']

for thank in thanks:
    print(f'Score for {thank} is: {sid.polarity_scores(thank)}')

In [None]:
# basically, the emoticons are just more words with their own valence
emoticons = [':)', ':(', ':D']
for emoticon in emoticons:
  print(emoticon, sid.lexicon[emoticon])

## Adding new words to VADER

Because VADER's lexicon is a Python dictionary, we can update it if we want to add new words which don't exist in the lexicon. For example:

In [None]:
# find a word not in the vader lexicon
'blarged' in sid.lexicon.keys()

In [None]:
# there is effectively nothing to measure here, 'blarged' was probably identified as neutral because it is not in the dictionary.
sid.polarity_scores('I really blarged that one up.')

In [None]:
# let's pretend 'blarged' is negative
sid.lexicon['blarged'] = -4

In [None]:
# our new word is in there with a very neg rating :)
sid.polarity_scores('I really blarged that one up.')

You could even overwrite the sentiment values for existing words in the dictionary. Maybe we think the word `happy` is actually negative:

In [None]:
# happy is positive
sid.polarity_scores('happy')

In [None]:
# let's make it negative!
sid.lexicon['happy'] = -2

In [None]:
# ultimate power over our words!!
sid.polarity_scores('happy')

# **The point**

Sentiment analysis can be pretty bad when used naively. Averaging the valence/polarity of single words is probably not a good method, but with larger texts some of that noise might wash out. Fortunately, we have VADER, which is a much smarter sentiment tool, even if it is rule-based.

Importantly, VADER will perform its own tokenization, so we have the luxury of passing raw strings to VADER, rather than needing to preprocess text.

You might want to load in some of your own texts now and check the sentiment. You could compare the AFINN/VADER dictionaries, or think about looking up and comparing genres in the Brown corpus, or anything else that seems interesting.

This could also show you how using sentiment or any other look-up resource might work in your final project, should you choose to incorporate it!

## **A final word of caution**

VADER was intended to be used on short texts, such as social media posts and other similar types of data. Running VADER on very long texts usually results in poor performance. What might be some ways to address this?
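
One possible direction, offered only as a sketch: break the long text into sentences, score each sentence on its own, and then summarise the scores. The function below takes any scoring function as an argument, so it runs here with a made-up `toy_scorer` instead of VADER; the regular expression is a rough sentence splitter, and NLTK's `sent_tokenize` would be a more careful choice.

```python
import re
from statistics import mean

def mean_sentence_score(text, scorer):
    """Score each sentence separately and return the mean of the scores."""
    # rough sentence splitter: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    if not sentences:
        return None
    return mean(scorer(sentence) for sentence in sentences)

# stand-in scorer so the sketch runs without NLTK
def toy_scorer(sentence):
    return 1.0 if 'good' in sentence.lower() else -1.0

print(mean_sentence_score('This is good. This is bad. Still good!', toy_scorer))

# with VADER loaded as `sid`, the scorer could be:
# mean_sentence_score(long_text, lambda s: sid.polarity_scores(s)['compound'])
```

Whether the mean is the right summary is its own question. A long text with strongly positive and strongly negative sentences can still average out to something that looks neutral, just like the two-sentence example earlier.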