<a href="https://colab.research.google.com/github/scskalicky/SNAP-CL/blob/main/04_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis**

Sentiment analysis is one of those topics which I think can spark a lot of interest, and you may have heard of it before. In its most basic form, sentiment analysis tries to measure how positive or negative the tone of a text is, with the idea that this captures the opinion and/or mood of the author(s). You might also see this referred to as valence or polarity. Classic examples use sentiment analysis to detect an overall negative versus positive tone in movie or product reviews.

A lot of sentiment analysis libraries are rule-based, which means that the creator of the resources has spent time trying to program the best set of rules to analyse language for these features.

Where does sentiment come from? In many cases the sentiment ratings are obtained from human perceptions of how positive or negative individual words are in isolation. These lexicons or wordlists include high-frequency content words (in particular adjectives, which should make sense) and store a "score" for each word indicating how positive or negative it is. The scores are collected in different ways: for instance, someone could rate a word from 0 to 1, with 0 being the most negative and 1 being the most positive. Or one could use a Likert scale from 1 to 7, or gather votes from a variety of people on whether a word is negative or positive and then let the label with the most votes "win." The point is, there have been a lot of approaches to capturing these perceptions.
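
To make the idea of a scored wordlist concrete, here is a minimal sketch of a lexicon-based scorer. The words, the scores, and the function name `toy_sentiment` are all made up for illustration; this is not how any particular library works, just the bare idea of looking words up and averaging their scores.

```python
# A tiny, hypothetical lexicon: scores run from -1 (negative) to +1 (positive).
toy_lexicon = {"good": 0.7, "great": 0.9, "bad": -0.7, "terrible": -0.9}


def toy_sentiment(text, lexicon=toy_lexicon):
    """Average the scores of the words found in the lexicon (0.0 if none match)."""
    # Lowercase each word and strip surrounding punctuation before the lookup.
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_sentiment("A good movie."))
print(toy_sentiment("The food was great but the service was terrible."))
```

Because each word is scored in isolation, the second sentence averages out to zero even though a human reader would notice both a compliment and a complaint. Shortcomings like this are part of what motivates more careful, rule-based tools.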




## **Sentiment Analysis with VADER**
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a really cool library which was created to address shortcomings with early lexicon-based approaches to sentiment analysis. The author of VADER incorporated information from a variety of prior sentiment dictionaries, and then also thought about features such as capitalization, emoticons, non-standard words, and a variety of other ways that people actually use language.

The end result is a very simple to use and much more accurate sentiment analysis library which has been incorporated into NLTK.

Let's get VADER into our notebook. First we import nltk, then we download the necessary resources.

In [None]:
# import nltk
import nltk

# download the vader lexicon
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

To use VADER, we create an instance of a built-in class and save it to a variable. You can do so by running the cell below.


In [None]:
# Import the vader sentiment analyzer and save to the variable `sid`
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

VADER takes into account various punctuation and capitalization features, which means we actually want to pass raw strings to VADER rather than text that has been lowercased or stripped of punctuation.

To calculate the sentiment of a string, use the `.polarity_scores()` method of our variable, like this:


> `sid.polarity_scores('string input')`

For example:

In [None]:
sid.polarity_scores("You can't beat Wellington on a good day.")

{'neg': 0.0, 'neu': 0.674, 'pos': 0.326, 'compound': 0.4404}

## **Understanding VADER output**

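The output in the example above has a "negative" score of 0, a "neutral" score of .674, and a "positive" score of .326. These three values are proportions of the text that fall into each category (they sum to 1), so this text has no negative content and is a mix of neutral and positive.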

The value we care most about is the "compound" score, which takes into account a variety of extra rules the author of VADER has incorporated into the program. You can find the scoring system [here](https://github.com/cjhutto/vaderSentiment#about-the-scoring), which says that:

>>> "The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate."
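
The normalization step mentioned in the quote can be sketched in a few lines. This assumes the formula I understand the reference VADER code to use, `x / sqrt(x² + alpha)` with `alpha = 15`; the function name and the example inputs are only for illustration, and the sketch does not compute the rule-adjusted valence sum itself.

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open range (-1, 1)."""
    # alpha=15 is assumed from the reference implementation; larger values
    # of alpha pull the result closer to 0 for the same valence sum.
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


# Try a few hypothetical valence sums, positive and negative.
for total in (1.9, -1.9, 10.0):
    print(total, round(normalize(total), 4))
```

Here `normalize(1.9)` rounds to 0.4404, which matches the compound score in the Wellington example if the adjusted valence sum for that sentence is 1.9 (an assumption on my part). Notice that the result never quite reaches -1 or +1, no matter how extreme the sum gets.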

The following typical thresholds are suggested for interpreting the compound score:

* positive sentiment: compound score >= 0.05
* neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
* negative sentiment: compound score <= -0.05
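
These thresholds are easy to turn into a small helper. The function below is a sketch; the name `label_compound` and the `threshold` parameter are my own and not part of VADER or NLTK.

```python
def label_compound(compound, threshold=0.05):
    """Map a compound score to 'positive', 'neutral', or 'negative'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    # Anything strictly between -threshold and +threshold is neutral.
    return "neutral"


# The compound score from the Wellington example above.
print(label_compound(0.4404))
```

In practice you could pass it the `'compound'` value from the dictionary that `sid.polarity_scores()` returns. Keeping the cutoff as a parameter makes it easy to try a stricter or looser definition of "neutral" on your own data.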

One of the cooler features of VADER is that it understands negation and some emoticons. 


In [None]:
# this should be positive
sid.polarity_scores("I am happy!")

{'neg': 0.0, 'neu': 0.2, 'pos': 0.8, 'compound': 0.6114}

In [None]:
# this should be negative
sid.polarity_scores("I am not happy!")

{'neg': 0.622, 'neu': 0.378, 'pos': 0.0, 'compound': -0.509}

In [None]:
# emoticons matter: without one, this text is completely neutral
sid.polarity_scores('hi there')

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [None]:
# the same text with a smiley added
sid.polarity_scores('hi there :)')

{'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.4588}

## **Adding new words to VADER**

You can check the sentiment scores for individual words in the VADER lexicon by looking up a word using the dictionary format (the same way we asked the frequency distribution to tell us the frequency of any one particular word). To do so, use this syntax:

> `sid.lexicon['word']`

For example:


In [None]:
# Look at the values of some words in VADER
target_words = ['sad', 'happy', 'tired', 'stupid', 'smart', 'sassafrass']

for word in target_words:
  if word in sid.lexicon:
    print(f'The word {word} has a sentiment of {sid.lexicon[word]}')
  else:
    print(f'The word {word} is not in the VADER dictionary.')

The word sad has a sentiment of -2.1
The word happy has a sentiment of 2.7
The word tired has a sentiment of -1.9
The word stupid has a sentiment of -2.4
The word smart has a sentiment of 1.7
The word sassafrass is not in the VADER dictionary.


You probably saw that one of the words was missing. Fortunately, we can update VADER with any word that we like: simply add the word and its sentiment score to the `.lexicon` dictionary VADER uses. You should follow the advice given by VADER's author when doing this, and keep the scores between -4 (most negative) and 4 (most positive). This means VADER is a really nice resource you can customise for your own specific purposes :)

For example:

In [None]:
# find a word not in the vader lexicon
'blarged' in sid.lexicon.keys()

False

In [None]:
# there is effectively nothing to measure here, 'blarged' was probably identified as neutral because it is not in the dictionary.
sid.polarity_scores('I really blarged that one up.')

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [None]:
# let's pretend 'blarged' is very negative
sid.lexicon['blarged'] = -4

In [None]:
# our new word is in there with a very neg rating :)
sid.polarity_scores('I really blarged that one up.')

{'neg': 0.57, 'neu': 0.43, 'pos': 0.0, 'compound': -0.7425}
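
Since the lexicon is a plain dictionary, nothing stops you from accidentally adding a score outside the -4 to 4 range. A small guard can help. The helper below is a sketch with a made-up name (`add_to_lexicon`); it works on any dictionary, so it is demonstrated on an ordinary dict here, but you could pass `sid.lexicon` instead.

```python
def add_to_lexicon(lexicon, word, score, low=-4.0, high=4.0):
    """Add or overwrite a word's score, rejecting values outside [low, high]."""
    if not low <= score <= high:
        raise ValueError(f"score {score} for {word!r} is outside {low} to {high}")
    lexicon[word] = float(score)


# Demonstrated on a plain dict; pass sid.lexicon to update VADER itself.
my_lexicon = {}
add_to_lexicon(my_lexicon, "blarged", -4)
print(my_lexicon)

try:
    add_to_lexicon(my_lexicon, "blarged", -7)
except ValueError as err:
    print(err)
```

The second call is rejected and the earlier score is left untouched.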

You could even overwrite the sentiment values for existing words in the dictionary. Maybe we think the word `happy` is actually negative:



In [None]:
# happy is positive
sid.polarity_scores('happy')

{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.5719}

In [None]:
# let's make it negative!
sid.lexicon['happy'] = -2

In [None]:
# ultimate power over our words!!
sid.polarity_scores('happy')

{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.4588}

**Your Turn**

Try out the sentiment dictionary on some of your own texts.