# Lesson: Introduction to Sentiment Analysis
by Jay Brodeur and Devon Mordell for [DMDS 2023-2024](https://scds.github.io/dmds-22-23/ComputationalText.html)

## 0. Introduction
In this exercise we're going to explore Sentiment Analysis, a natural language processing technique to ascribe a sentiment valence to language. This type of analysis is useful in a variety of contexts, including:
- Assessing perception of and affinity for products and performances from user reviews.
- Characterizing discourse around subjects and individuals from dialogue in correspondence text, social media data, etc.
- Assisting in rhetorical analyses of written text or spoken communications.  

And before we begin, a shout out to the excellent tutorials that have inspired various aspects of this exercise:
- Real Python's [Sentiment Analysis: First Steps With Python's NLTK Library](https://realpython.com/python-nltk-sentiment-analysis/)
- The [VADER documentation](https://github.com/cjhutto/vaderSentiment)
- Towards Data Science's [Social Media Sentiment Analysis In Python With VADER](https://towardsdatascience.com/social-media-sentiment-analysis-in-python-with-vader-no-training-required-4bc6a21e87b8)
- Constellate's [Sentiment analysis with vader](https://github.com/ithaka/constellate-notebooks/blob/master/sentiment-analysis-with-vader.ipynb) tutorial.

### Rule-based vs. machine learning approaches
Sentiment analysis methods are generally of two types: rule-based algorithms and machine learning models.  

From Nathan Kelber and JSTOR Labs' [Constellate tutorial on sentiment analysis](https://github.com/ithaka/constellate-notebooks/blob/master/sentiment-analysis-with-vader.ipynb):

> **Rule-based algorithms** assign sentiment scores to particular words or multi-word constructions. Simple algorithms may simply assess each word individually in a feedback document and add up an overall score. More complex algorithms may assess multi-word (or n-gram) constructions and have special rules for addressing issues such as negation, emojis, and emoticons. They can detect the difference between "bad", "not bad", and "bad ass". Some algorithms also support emojis and emoticons, such as "=)" and "😁".

> **Machine learning models** rely on feedback data that has already been assessed by humans to have a particular sentiment. Each piece of feedback is **labeled** by a human reader who may place the feedback into a particular category. The categories could be as simple as positive, negative, or neutral. As long as there exists **labeled** data, a machine learning model can often identify complex concepts. For example, a car manufacturer may desire to classify the sentiment of feedback from past buyers as: "budget-conscious", "eco-conscious", "tech-enthusiastic", "luxury-driven", "performance-driven", etc. Assuming there is an adequately labeled **training data** for each of these categories, a machine learning model could assign a score for each category. This could help analysts understand the brand better, answering questions about what consumers do or do not like about a particular vehicle.
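
To make the rule-based idea concrete, here is a minimal sketch of a toy scorer. The mini-lexicon, its valence values, and the single negation rule are all invented for illustration; this is not how VADER itself is implemented, only a simplified instance of the "lexicon plus rules" approach described above.

```python
# Toy rule-based scorer: a hypothetical mini-lexicon plus one negation rule.
# The words and valence values below are invented for illustration only.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "awful": -3.0}
NEGATIONS = {"not", "never", "no"}

def toy_score(text: str) -> float:
    """Sum word valences, flipping the sign of a word that follows a negation."""
    total = 0.0
    negate_next = False
    for raw in text.lower().split():
        word = raw.strip(".,!?;:")
        if word in NEGATIONS:
            negate_next = True
            continue
        valence = TOY_LEXICON.get(word, 0.0)
        total += -valence if negate_next else valence
        negate_next = False
    return total

print(toy_score("The food was bad"))      # -2.0
print(toy_score("The food was not bad"))  # 2.0
```

Even this tiny example shows why the rules matter: without the negation rule, both sentences would receive the same score.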

### VADER
For this exercise, we will use a tool called [VADER](https://github.com/cjhutto/vaderSentiment) (**V**alence **A**ware **D**ictionary and s**E**ntiment **R**easoner) -- a lexicon and rule-based sentiment analyzer that is specifically targeted towards social media data, but also useful for some other forms of text and media.

VADER is built into the [Natural Language Toolkit](https://www.nltk.org/) (NLTK), a comprehensive, popular, and powerful Python-based platform for natural language processing. VADER is ready to use, so you can get started quickly with sentiment analysis. That convenience also means you should be critical of the results, especially if you don't know how the analyzer works, so reading the [documentation](https://github.com/cjhutto/vaderSentiment/tree/master) is a good idea!

A notable benefit of using a rule-based algorithm like VADER is that the methods are transparent and interrogable. In addition to reading the documentation and the journal article that describes the algorithm, a lot can be gleaned about VADER by inspecting the files in the [`vaderSentiment`](https://github.com/cjhutto/vaderSentiment/tree/master/vaderSentiment) directory of its GitHub repository:
- `emoji_utf8_lexicon.txt`: a list of emojis with corresponding descriptive text, to allow assignment of sentiment to emojis.
- `vaderSentiment.py`: the main Python script that performs the analyses.
- `vader_lexicon.txt`: the VADER lexicon, a list of tokens (text-based emojis, abbreviations, and words) and their corresponding assigned valence values.

The lexicon contains an extensive list of tokens, each with the mean and standard deviation of its valence scores, as well as the raw rater values used to compute the mean. A snippet is shown below:

| Token      | Mean score | Std dev | Raw rater scores                         |
|------------|------------|---------|------------------------------------------|
| ]-:        | -2.1       | 0.53852 | [-2, -3, -3, -2, -2, -2, -1, -2, -2, -2] |
| ]:         | -1.6       | 0.66332 | [-1, -2, -1, -2, -3, -2, -1, -1, -1, -2] |
| ]:<        | -2.5       | 0.80623 | [-2, -2, -2, -3, -4, -2, -2, -2, -2, -4] |
| ^<_<       | 1.4        | 1.11355 | [3, 1, 3, 2, 1, 1, 1, -1, 2, 1]          |
| ^urs       | -2.8       | 0.6     | [-2, -3, -3, -2, -3, -3, -2, -3, -4, -3] |
| abandon    | -1.9       | 0.53852 | [-1, -2, -2, -2, -2, -3, -2, -2, -1, -2] |
| abandoned  | -2.0       | 1.09545 | [-1, -1, -3, -2, -1, -4, -1, -3, -3, -1] |
| abandoner  | -1.9       | 0.83066 | [-1, -1, -3, -2, -1, -3, -1, -2, -3, -2] |
| abandoners | -1.9       | 0.83066 | [-2, -3, -2, -3, -2, -1, -2, -2, 0, -2]  |
| abandoning | -1.6       | 0.8     | [-3, -2, -3, -2, -1, -1, -1, -1, -1, -1] |

Because the VADER GitHub repository is open-sourced under an [MIT License](https://opensource.org/license/mit/), you can download the library and modify it to suit your analysis needs.
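
As a quick check on how the lexicon columns relate to each other, the sketch below recomputes the mean and standard deviation for one row of the snippet shown above. It assumes the row is tab-separated (worth confirming against `vader_lexicon.txt` itself) and uses the population standard deviation, which is what reproduces the listed value for this row.

```python
import ast
from statistics import mean, pstdev

# One row copied from the snippet above, written in tab-separated form
# (an assumption about the file layout; check vader_lexicon.txt to confirm).
line = "abandon\t-1.9\t0.53852\t[-1, -2, -2, -2, -2, -3, -2, -2, -1, -2]"

token, listed_mean, listed_sd, raw = line.split("\t")
ratings = ast.literal_eval(raw)  # safely parse the list of rater scores

# Recompute the summary columns from the ten raw ratings
print(token, round(mean(ratings), 1), round(pstdev(ratings), 5))
# abandon -1.9 0.53852
```

A large standard deviation (as for `^<_<` in the table) signals that the human raters disagreed about a token, which is useful to know before trusting its score.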

## 1. Install required packages | Prepare VADER
In this exercise, we'll use a number of Python packages to perform our sentiment analysis tasks. Some of these packages are built in, but others need to be installed first. For this exercise, we'll use:
- NLTK for NLP
- matplotlib for plotting
- numpy for analysis

Let's install those packages, download the built-in NLTK resources we need (learn more [here](https://realpython.com/python-nltk-sentiment-analysis/#installing-and-importing)), and create an instance of VADER:


In [None]:
!pip install nltk matplotlib numpy
import nltk
# Download the lexicon and a variety of other resources
nltk.download(["stopwords","state_union","twitter_samples","movie_reviews","averaged_perceptron_tagger","vader_lexicon","punkt"])
# Import the VADER sentiment analyzer class
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create an instance of SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
print("VADER is ready!")

## 2. Give it a try
Let's run some tests with our sentiment analyzer by creating some sentences of varying polarity and seeing what sentiment score is returned:




In [None]:
sent = "I am very excited about this tutorial!"
print("Polarity scores for sentence 1: ")
print(sia.polarity_scores(sent))

Running the phrase "I am very excited about this tutorial!" through our sentiment analyzer returns the following output:
```
{'neg': 0.0, 'neu': 0.626, 'pos': 0.374, 'compound': 0.4561}
```
These scores provide information as follows:
- ```neg```: proportion of text with a negative sentiment
- ```neu```: proportion of text with a neutral sentiment
- ```pos```: proportion of text with a positive sentiment
- ```compound```: a "normalized, weighted composite score" across all words in the input string, normalized between -1 (most negative) and +1 (most positive). Positive strings have a compound score >= 0.05; negative strings have a score <= -0.05; neutral strings fall in between.
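
The thresholds above can be turned into a small labelling helper. This is a minimal sketch: the function name `label_compound` is hypothetical, and the 0.05 cut-off is simply the conventional value quoted above, which you may want to adjust for your own material.

```python
def label_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a label using the conventional +/-0.05 cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# The scores returned for "I am very excited about this tutorial!"
example = {'neg': 0.0, 'neu': 0.626, 'pos': 0.374, 'compound': 0.4561}
print(label_compound(example["compound"]))  # positive
```

In the notebook, the same helper could be applied to live output with `label_compound(sia.polarity_scores(sent)["compound"])`.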


Edit the code below to experiment with your own examples:

In [None]:
sent2 = "Replace this with your own sentence"
print("Polarity scores for sentence 2: " + sent2)
print(sia.polarity_scores(sent2))

sent3 = "Can you make an overwhelmingly positive sentence?"
print("Polarity scores for sentence 3: " + sent3)
print(sia.polarity_scores(sent3))

sent4 = "Try a negative sentence and include some CAPITALIZED letters"
print("Polarity scores for sentence 4: " + sent4)
print(sia.polarity_scores(sent4))


Did your results turn out as you thought? Were there any interesting outcomes? Were you able to infer anything about how VADER operates on a string of text?

This is probably a good time to [read more](https://github.com/cjhutto/vaderSentiment), so that you can better understand how the algorithm works and the judgments that have been imposed by its developers.

## 3. Analyzing social media data
Let's use VADER for what it purports to do best, which is analyzing social media data. For this, we'll use the ```twitter_samples``` corpus that is bundled with NLTK.  

Let's load the tweet corpus and display the first 15:

In [None]:

tweets = nltk.corpus.twitter_samples.strings()
for sample_tweet in tweets[:15]:
  print(sample_tweet)
  print("##")


Next, let's turn VADER loose on a randomly-selected subset of these tweets and investigate the ```compound score``` for each:

In [None]:
from random import shuffle

# Create a small function that returns only the compound score from VADER
def return_compound(tweet_in: str) -> float:
  return sia.polarity_scores(tweet_in)["compound"]

# Randomize order
shuffle(tweets)

# Iterate through the first 15 tweets, return the tweet text and the compound score:
for sample_tweet in tweets[:15]:
  print(sample_tweet, "| compound score = ",return_compound(sample_tweet))
  print("###")

Do the scores align with your expectations? Are there similarities between tweets with very high sentiment scores? Between very low ones? Between neutral ones?

You can rerun this section many times to explore the results for more tweets.
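
Note that `shuffle` reorders the list in place, so each rerun gives a different, unrepeatable selection. If you would like to be able to revisit a particular selection, one option is to draw a seeded sample instead. The sketch below uses placeholder strings rather than the real corpus, and the helper name `sample_tweets` is hypothetical.

```python
import random

# Placeholder strings standing in for the `tweets` list loaded earlier.
example_tweets = [f"tweet number {n}" for n in range(100)]

def sample_tweets(items, k=15, seed=None):
    """Return k randomly chosen items without changing the original list."""
    rng = random.Random(seed)
    return rng.sample(items, k)

first = sample_tweets(example_tweets, seed=42)
second = sample_tweets(example_tweets, seed=42)
print(first == second)  # True
```

Leaving `seed=None` gives a fresh selection on each call, while passing the same seed reproduces the same 15 tweets.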

## 4. Analyzing a larger text

Let's try VADER on something a bit longer. As source material for this exercise, we will use the [Speeches from the Throne](https://www.canada.ca/en/privy-council/campaigns/speech-throne/info-speech-from-throne.html) delivered to open the first sessions of the 43rd and 44th Parliaments of Canada, by Julie Payette, Governor General of Canada, on December 5, 2019 and Mary Simon, Governor General of Canada, on November 23, 2021, respectively.

### Prepare your materials
In the interest of time, plain text versions of the Hansards (transcripts) have already been created for you and shared in the workshop [data directory](https://u.mcmaster.ca/dmds-text-2324). If interested in creating these (or similar) transcripts in the future, follow the steps below:

> - Navigate to the Hansard (transcript) for the [43rd session](https://www.ourcommons.ca/DocumentViewer/en/43-1/house/sitting-1/hansard#Int-10724437).
>   - Copy the text corresponding to the speech from the throne to a new text document. Save the document as ```ca-throne-speech-43.txt```.
> - Navigate to the Hansard for the [44th session](https://www.ourcommons.ca/DocumentViewer/en/44-1/house/sitting-2/hansard#Int-11426929).
>   - Copy the text corresponding to the speech from the throne to a new text document. Save the document as ```ca-throne-speech-44.txt```.

### Load and inspect
First, let's load the files into our Colab notebook and preview their contents.

Before running the code below, upload the ```ca-throne-speech-44.txt``` and ```ca-throne-speech-43.txt``` files that you downloaded from the workshop [data directory](https://u.mcmaster.ca/dmds-text-2324) in preparation for the previous exercise. Make sure that the files are uploaded into ***the same directory as*** `sample_data`, NOT into the `sample_data` directory itself.

Use the top-left upload button, or drag the files into the **Files** area.

When done properly, your Files area should look like this:

![image of a file directory with two files uploaded to the working directory](https://github.com/scds/dmds-22-23/blob/main/assets/img/throne-uploaded.png?raw=true)
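
If you want to confirm the upload from code rather than by eye, a short check like the one below can help. It is only a sketch: the helper name `missing_files` is hypothetical, and it assumes the notebook's working directory is the folder you uploaded into.

```python
from pathlib import Path

EXPECTED_FILES = ["ca-throne-speech-43.txt", "ca-throne-speech-44.txt"]

def missing_files(names, folder="."):
    """Return the names that are not present in the given folder."""
    return [name for name in names if not (Path(folder) / name).is_file()]

missing = missing_files(EXPECTED_FILES)
if missing:
    print("Not found in the working directory:", ", ".join(missing))
else:
    print("Both transcripts are in place.")
```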

### Read the text file and preview
Let's begin with the 44th throne speech. Load the file and inspect:

In [None]:
# import the tokenize module from nltk
from nltk import tokenize

# Read our text file of the 2021 speech from the throne (44th Parliament)
with open('ca-throne-speech-44.txt') as f:
    text_44 = f.read()

# Preview the first 500 characters to the screen
print(text_44[:500])

### Tokenize and analyze
Now, we will split this text into sentences and run the Sentiment Analyzer on each sentence. We'll print the sentence and its ```compound score``` beside it.

At the end of the output text, an average compound sentiment score is provided for the entire document.  

In [None]:
# Use the tokenizer to split the text into sentences
sentences = tokenize.sent_tokenize(text_44)
paragraphSentiments = 0.0
vs_compound = {}
# Iterate through each sentence and analyze its sentiment. Print it to screen
for i in range(len(sentences)):
    vs = sia.polarity_scores(sentences[i])
    vs_compound[i] = float(vs["compound"])
    print("S",i,": ","{:-<69} {}".format(sentences[i], str(vs["compound"])))
    paragraphSentiments += vs["compound"]
print("AVERAGE SENTIMENT FOR THRONE SPEECH 44: \t" + str(round(paragraphSentiments / len(sentences), 4)))


Take a look through the output and assess how well you think VADER worked. Are the sentences more positive than negative? Is everything neutral?
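
Scrolling through every sentence can be tedious, so it may help to pull out only the extremes. The sketch below is one way to do that; the helper name `most_extreme` is hypothetical, and the demo sentences and scores are made up to stand in for the `sentences` list and `vs_compound` dictionary built above.

```python
def most_extreme(scores: dict, sentences: list, n: int = 3):
    """Return the n lowest- and n highest-scoring (score, sentence) pairs."""
    ranked = sorted((score, sentences[i]) for i, score in scores.items())
    return ranked[:n], ranked[-n:][::-1]

# Made-up sentences and scores, standing in for `sentences` and `vs_compound`.
demo_sentences = ["A fine day.", "A terrible loss.", "The House met.", "A great success."]
demo_scores = {0: 0.2, 1: -0.7, 2: 0.0, 3: 0.8}

lowest, highest = most_extreme(demo_scores, demo_sentences, n=1)
print(lowest)   # [(-0.7, 'A terrible loss.')]
print(highest)  # [(0.8, 'A great success.')]
```

In the notebook, `most_extreme(vs_compound, sentences, n=5)` would return the five most negative and five most positive sentences of the speech.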

### Analyze the results
Here, we will create a few simple charts to better understand how sentiment is distributed throughout the document. This will provide us with more insight than the aggregate score of 0.1947.

Plot a histogram of all sentence ```compound scores```:

In [None]:
# Import matplotlib and numpy for analyses
import matplotlib.pyplot as plt
import numpy as np
# Convert vs_compound from a dictionary to a more usable array
vs_values = np.array(list(vs_compound.values())).astype(float)
plt.hist(vs_values)
plt.xlabel("Sentiment score")
plt.ylabel("Count")


Now we can see that the text has a relatively large proportion of very positive sentences, a sprinkling of very negative ones, and a lot of neutral to slightly negative ones. Another question you may ask is: does sentiment change throughout the document, with the ebbs and flows of rhetoric?

We can plot the sentence scores against sentence number to investigate:

In [None]:
vs_values = np.array(list(vs_compound.values())).astype(float)
# 5-period moving average
vs_values_movavg = np.convolve(vs_values, np.ones(5)/5, mode='valid')

plt.plot(vs_values,label="raw") # Plot values as a time series according to their position in the text
plt.plot(vs_values_movavg,label="movavg") # Plot the moving average
plt.xlabel("Sentence number")
plt.ylabel("Sentiment score")
plt.legend(loc = "best")

Looking at the results, sentiment appears to be clearly autocorrelated (neighbouring sentences tend to have similar scores): the heart of the speech is predominantly positive, with a clear turn towards more negative sentiment near the end, followed by a final positive finish.
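
When reading the smoothed line, it helps to know what the moving average actually computes. The dependency-free sketch below mirrors the 5-period average taken with `np.convolve(..., mode='valid')` in the cell above for a series at least as long as the window; the function name `moving_average` is hypothetical and the input numbers are made up.

```python
def moving_average(values, window=5):
    """Average each run of `window` consecutive values (no padding at the ends)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

scores = [1, 2, 3, 4, 5, 6]
smoothed = moving_average(scores)
print(smoothed)                    # [3.0, 4.0]
print(len(scores), len(smoothed))  # 6 2
```

Two consequences follow: the smoothed series is four points shorter than the raw one, and its value at position k averages sentences k through k+4. When both are plotted against the same axis, the smoothed line therefore sits about two sentences to the left of the sentences it summarizes.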

Is this a standard pattern for all throne speeches? Let's investigate by comparing sentiments between the speeches of the 44th and [43rd sessions](https://www.ourcommons.ca/DocumentViewer/en/43-1/house/sitting-1/hansard).

### Load and preview the 43rd session throne speech

In [None]:
# Read our text file of the 2019 speech from the throne (43rd Parliament)
with open('ca-throne-speech-43.txt') as f:
    text_43 = f.read()

sentences_43 = tokenize.sent_tokenize(text_43)
paragraphSentiments = 0.0
vs_compound_43 = {}
# Iterate through each sentence and analyze its sentiment. Print it to screen
for i in range(len(sentences_43)):
    vs = sia.polarity_scores(sentences_43[i])
    vs_compound_43[i] = float(vs["compound"])
    print("S",i,": ", "{:-<69} {}".format(sentences_43[i], str(vs["compound"])))
    paragraphSentiments += vs["compound"]
print("AVERAGE SENTIMENT FOR THRONE SPEECH 43: \t" + str(round(paragraphSentiments / len(sentences_43), 4)))

# Convert vs_compound from a dictionary to a more usable array
vs_values_43 = np.array(list(vs_compound_43.values())).astype(float)
# 5-period moving average
vs_values_movavg_43 = np.convolve(vs_values_43, np.ones(5)/5, mode='valid')



### Plot the results for both speeches

In [None]:
plt.plot(vs_values_movavg, label="TS 44") # Plot the moving average for the 44th session speech
plt.plot(vs_values_movavg_43, label="TS 43") # Plot the moving average for the 43rd session speech
plt.xlabel("Sentence number")
plt.ylabel("Sentiment score")
plt.legend(loc = "best")

What can you say about the patterns observed in each speech?
What happened around the 100th sentence of the 43rd session speech? Go back up to the output above and inspect the sentences around 100 to find out. Is the extreme negative sentiment reasonable?
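
Rather than scrolling back through the printed output, you can pull out the sentences around a position of interest. The sketch below uses made-up data in place of `sentences_43` and `vs_compound_43`, and the helper name `sentences_near` is hypothetical.

```python
def sentences_near(sentences, scores, center, radius=2):
    """Return (index, score, sentence) for sentences within `radius` of `center`."""
    start = max(0, center - radius)
    stop = min(len(sentences), center + radius + 1)
    return [(i, scores[i], sentences[i]) for i in range(start, stop)]

# Made-up data standing in for `sentences_43` and `vs_compound_43`.
demo_sentences = ["One.", "Two.", "Three.", "Four.", "Five."]
demo_scores = {0: 0.1, 1: 0.0, 2: -0.9, 3: 0.3, 4: 0.5}

# Locate the lowest-scoring sentence, then show it with its neighbours.
lowest_index = min(demo_scores, key=demo_scores.get)
for i, score, sentence in sentences_near(demo_sentences, demo_scores, lowest_index, radius=1):
    print(i, score, sentence)
```

In the notebook, `sentences_near(sentences_43, vs_compound_43, 100, radius=3)` would list the seven sentences centred on sentence 100, which makes it easier to judge whether the dip reflects the content or a quirk of the lexicon.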

## 5. Learn more with Constellate
If you are interested in engaging more with sentiment analysis and natural language processing in general, JSTOR Labs' [Constellate](https://constellate.org/) is a useful platform for building custom datasets, learning about text analysis, and carrying out analyses, all in a single place.

All users have basic access; members of subscribing academic institutions have access to the full features, including a personal data storage and analysis environment. Check with your local library for information about upgraded access.

McMaster members can access the full suite by [creating an account](https://constellate.org/register) and accessing Constellate through the University Library's [Off Campus Access service](https://u.mcmaster.ca/constellate).  

All Constellate tutorials are also available via [GitHub](https://github.com/ithaka/constellate-notebooks/tree/master) under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/). For example, their [sentiment analysis](https://github.com/ithaka/constellate-notebooks/blob/master/sentiment-analysis-with-vader.ipynb) tutorial demonstrates how to use both VADER and a machine learning classifier for this purpose.