# Introduction to Sentiment Analysis

_R Version_
<br>
Authors: Anneke Dresselhuis,

![Cover Art Image](media/sentiment_analysis_cover_art.png "title")

### Prerequisites
1. Introduction to Jupyter
2. Introduction to R

### Learning outcomes
After completing this notebook, you will be able to: <br>
1. Understand and apply the principles of “tidy text” data to clean a textual dataset
2. Perform basic sentiment analysis using ...

### Outline
_To be finalized when notebook is complete_

## What is Sentiment Analysis?

“Sentiment analysis is the practice of applying natural language processing and text analysis techniques to identify and extract subjective information from text” (Hussein, 2018). As this definition suggests, sentiment analysis is a part of natural language processing (NLP), a field at the intersection of human language and computation. Because humans are complex, emotional beings, the language we use is often shaped by our affective (emotional) dispositions. Sentiment analysis, sometimes referred to as “opinion mining”, is one way that researchers can systematically study the emotional intentions expressed in a textual dataset.

> **🔎 Let’s think critically**
>
> 🟠 At the heart of sentiment analysis is the assumption that language reveals interior, affective states, and that these states can be codified and generalized to broader populations. In her book [Atlas of AI](https://katecrawford.net/atlas), the artificial intelligence scholar Kate Crawford explores how many assumptions found in contemporary sentiment research (e.g., that there are 7 universal emotions) are largely unsubstantiated notions that emerged from mid-20th-century research funded by the US Department of Defense. Rather than maintaining that emotions can be universally categorized, her work invites researchers to think about how emotional expression is highly contextualized by social and cultural factors and the distinct subject positions of content makers.
>
> 🟠 Consider the research question for your sentiment analysis project. How might the text you are working with be shaped by the distinct communities that have generated it?
>
> 🟠 Are there steps you can take to educate yourself around the unique language uses of your dataset (for example, directly speaking with someone from that group or learning from a qualified expert on the subject)?
>
> 🟠 If you’re interested, you can learn more about data justice in community research in a [guide](https://genderplusresearchcollective.sites.olt.ubc.ca/files/2022/09/2022-Gender-Guide-1.pdf) created by UBC’s Office for Regional and International Community Engagement. 

The rise of [Web 2.0](https://en.wikipedia.org/wiki/Web_2.0) has produced vast volumes of user-generated content (UGC) on the internet, as people use a variety of social platforms and forums to share opinions and ideas and to express themselves. Maybe you are interested in understanding how people feel about a particular political candidate by examining tweets around election time, or you wonder what people think about a particular bus route on Reddit. UGC is often unstructured data, meaning that it isn’t organized in a predefined way.
<br>

**Structured data** for a microwave product review might look something like this:

| <span style="color: #CC7A00">Pro</span> | <span style="color: #CC7A00">Con</span> | <span style="color: #CC7A00">Neutral</span> |
| :---| :----------- | :-- |
| <span style="color: #CC7A00">Interface is visually appealing</span> | <span style="color: #CC7A00">Hard to change the time</span> | <span style="color: #CC7A00">Purchased from store #553</span> |
| <span style="color: #CC7A00">Heats up food perfectly</span> | <span style="color: #CC7A00">Plug cord length is too short</span> | <span style="color: #CC7A00">Product weighed 23lbs</span> |

**Unstructured data** for a microwave product review might look something like this:

> <span style="color: #CC7A00">I bought the WAV0 X5K microwave last week. When i got home I was tryign to set it up and needed to go out and buy an extension cord because the one on the thing was too short. Took me 20 mins to figure out how to change the time, but teh interface was visually appealing. When I finally got working, it heated up my leftover take-out dinner perfectly.</span>
<br>
    
In the structured data example above, the reviewer defines which parts of the feedback are positive, negative or neutral. In the unstructured example, on the other hand, there are many typos, and a single sentence might include both a positive and a negative comment as well as more nuanced contextual information (i.e., that the person had to buy an additional product to make the microwave work). While messy, this contextual information often carries insights that can be very useful for researchers.
<br>
The task of sentiment analysis is to make sense of these kinds of nuanced textual data - often for the purpose of understanding people, predicting human behaviour, or even in some cases, manipulating human behaviour.
<br>
    
**Language is complex and always changing.**
<br>
    
In the English language, for example, the word “present” has multiple meanings (a gift, the current moment, to show something), which could carry positive, negative or neutral connotations. Further, a contemporary sentiment lexicon might code the word “miss” as being associated with negative or sad emotional experiences such as longing; if such a lexicon were applied to a 19th-century novel that uses the word “miss” as a title for unmarried women, it might assign negative sentiment where none exists. While sentiment analysis can be a useful tool, it demands ongoing criticality and reflexivity from a researcher (you!). Throughout your analysis, be sure to continually ask yourself whether a particular sentiment lexicon is appropriate for your project.
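
One way to act on this advice is to look up how a lexicon codes a word before relying on it. The sketch below assumes the `dplyr` and `tidytext` packages are installed and uses the `bing` lexicon accessed through `tidytext`; the vector `words_to_check` is a hypothetical name chosen for illustration.

```r
library(dplyr)
library(tidytext)

# Words whose coding we want to inspect (a hypothetical list; edit as needed)
words_to_check <- c("miss", "present")

# Keep only the lexicon rows for those words;
# a word that does not appear in the output is not in the lexicon at all
lexicon_check <- get_sentiments("bing") %>%
    filter(word %in% words_to_check)

lexicon_check
```

If a word is coded in a way that does not match how your corpus uses it, you may want to exclude it before scoring or consider a different lexicon.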


## Working with Textual Data

### Loading Packages

In [None]:
# Installing these packages may take a while; uncomment the lines below for any that are not yet installed
#install.packages("quanteda")
#install.packages("tidytext")
#install.packages("SentimentAnalysis")
#install.packages("textdata")
library(tidytext)
library(readr)
library(tidyverse)
library(quanteda)
library(janeaustenr)
library(dplyr)
library(tidyr)
library(stringr)

In [None]:
# Construct dataframe
username <- c("@potus", "@abject.ron", "@tess888", "@ayden99", "@curious_reggie", 
                    "@peter.the.third", "@xavier_w", "@humble.pacifist", 
                    "@krz4377", "@not.nat")
policy_text <- c("Today we changed prehistoric policies held our great country back from progress.", 
            "@potus this policy change is an abomination of everything America stands for", 
            "I have completely lost trust in the government", 
            "I am hopeful things will get better after this valuable change", 
            "Navigating the past is always a challenge, but one that can be overcome through hard work.",
            "Can our country recover from this?",
            "@ayden99 - Progress wins. A victory for America today.",
            "Poor call @potus - old rules kept us from making the mistakes of the past...",
            "I'm sick of aristocracy stamping out the people's power",
            "Definitely some mixed feelings about today's decision. Some wins, some losses, but hey - that's democracy.")

policy_df <- tibble(username = username, text = policy_text) %>%
    group_by(username) # group rows by user so later steps keep track of who wrote each word

policy_df 

In [None]:
policy_token <- policy_df  %>%
    unnest_tokens(output = word, 
                  input = text,
                  token = "words", # this specifies that we want a token to be 1 word
                  to_lower = TRUE) # converts all text to uniform lowercase
            
head(policy_token)

In the above code, try changing the argument `token = "words"` to `token = "characters"` or `token = "sentences"`. <br>
<br>
What do you see? <br>
<br>
If we were interested in running our sentiment analysis at a higher level, for example by treating sentences as tokens, we could do that too. If you are interested, you can read the documentation for the `unnest_tokens` function [here](https://www.rdocumentation.org/packages/tidytext/versions/0.4.2/topics/unnest_tokens) or by typing `?unnest_tokens` into a code cell. For the purpose of this analysis, we will be working at the word level; be sure to return the above argument to `token = "words"` when you are ready to continue the analysis.
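
As a minimal sketch of sentence-level tokenization, the example below splits a single tweet into sentences. It assumes `dplyr`, `tibble` and `tidytext` are installed; `example_df` and `example_sentences` are hypothetical names used only here, so the objects built earlier in the notebook are left untouched.

```r
library(dplyr)
library(tibble)
library(tidytext)

# A one-row example built from one of the tweets above
example_df <- tibble(
    username = "@not.nat",
    text = "Definitely some mixed feelings about today's decision. Some wins, some losses, but hey - that's democracy."
)

# Split the text into sentences instead of words
example_sentences <- example_df %>%
    unnest_tokens(output = sentence,
                  input = text,
                  token = "sentences")

example_sentences
```

Each row of the result holds one sentence alongside the username it came from, which is the same one-token-per-row shape used for words.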


#### Negative Sentiment
If we are only interested in identifying the words in our corpus of tweets that carry negative (as opposed to positive) sentiment, we can use the `bing` lexicon, created by [Bing Liu and collaborators](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).

In [None]:
head(get_sentiments("bing"))

This lexicon contains a list of 6,786 words, each pre-classified as being associated with either negative or positive sentiment.
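
To see how the lexicon is split between the two labels, a short check like the following can help. It is a sketch that assumes `dplyr` and `tidytext` are installed; `bing_summary` is a hypothetical name.

```r
library(dplyr)
library(tidytext)

# Count how many lexicon entries carry each sentiment label
bing_summary <- get_sentiments("bing") %>%
    count(sentiment)

bing_summary
```

The two counts should add up to the total number of entries in the lexicon.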

In [None]:
negative_sentiments <- get_sentiments("bing") %>% 
    filter(sentiment == "negative") # select only the negative words

negative_policy <- policy_token %>%
    inner_join(negative_sentiments, by = "word") %>% # keep only tokens found in the negative word list
    count(word, sort = TRUE) # count occurrences of each negative word

head(negative_policy)
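
The join step is what does the filtering here: `inner_join()` keeps only the rows whose word appears in both tables. The toy example below illustrates this with a made-up three-word lexicon (not the real `bing` lexicon); all object names are hypothetical, and only `dplyr` and `tibble` are assumed.

```r
library(dplyr)
library(tibble)

# Tokens from a made-up tweet, one word per row
toy_tokens <- tibble(word = c("i", "have", "lost", "trust", "lost"))

# A hypothetical three-word lexicon (not the real bing lexicon)
toy_lexicon <- tibble(
    word = c("lost", "trust", "happy"),
    sentiment = c("negative", "positive", "positive")
)

# inner_join() drops tokens that are absent from the lexicon ("i", "have")
# and lexicon entries that never occur in the text ("happy")
toy_matches <- toy_tokens %>%
    inner_join(toy_lexicon, by = "word") %>%
    count(word, sentiment, sort = TRUE)

toy_matches
```

Words that the lexicon does not cover silently disappear from the result, which is worth keeping in mind when interpreting the counts.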

#### Negative and Positive Sentiment

In [None]:
sentiment_policy <- policy_token %>%
    inner_join(get_sentiments("bing"), by = "word") %>% # adds a column with the positive/negative label from the bing lexicon
    count(word, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
head(sentiment_policy)

#### Summarizing Sentence-Level Sentiment

In [None]:
sentiment_policy2 <- policy_token %>%
    inner_join(get_sentiments("bing"), by = "word") %>% # adds a column with the positive/negative label from the bing lexicon
    count(username, word, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(sentiment = positive - negative)
sentiment_policy2

In [None]:
policy_token 

In [None]:
bing_word_counts <- policy_token %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts


We can also look beyond individual words at the summed score for a given user's tweet. <br>
For example, if we ran the code `with(sentiment_policy2, sum(sentiment[username == "@curious_reggie"]))`, we would get a value of `-1` because `(-1) + (-1) + (+1) = -1`.
<br>
Try out a few different usernames by replacing `"@curious_reggie"` in the code below:
* "@potus"
* "@abject.ron" 
* "@tess888" 
* "@ayden99" 
* "@curious_reggie" 
* "@peter.the.third"
* "@xavier_w"
* "@humble.pacifist"
* "@krz4377"
* "@not.nat"


In [None]:
with(sentiment_policy2, sum(sentiment[username == "@curious_reggie"]))
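
Rather than checking one username at a time, the same sum can be computed for every user in one step. The sketch below uses a small hypothetical table, `toy_scores`, laid out like `sentiment_policy2` (one row per user and word, with a numeric `sentiment` column); the usernames and scores are invented for illustration, and only `dplyr` and `tibble` are assumed.

```r
library(dplyr)
library(tibble)

# Hypothetical per-word scores in the same shape as sentiment_policy2
toy_scores <- tibble(
    username = c("@user_a", "@user_a", "@user_a", "@user_b"),
    word = c("hard", "challenge", "work", "hopeful"),
    sentiment = c(-1, -1, 1, 1)
)

# Sum the word-level scores within each user
user_totals <- toy_scores %>%
    group_by(username) %>%
    summarise(total_sentiment = sum(sentiment), .groups = "drop")

user_totals
```

Applying the same `group_by()` and `summarise()` steps to `sentiment_policy2` should give one total per username in the dataset.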

## Analysis

## References

* Air Force Institute of Technology. (n.d.). Text Mining: Sentiment Analysis · AFIT Data Science Lab R Programming Guide. Retrieved May 31, 2024, from https://afit-r.github.io/sentiment_analysis
* Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774
* Hicks, S. (2022, October 13). Tidytext and sentiment analysis: Introduction to tidytext and sentiment analysis. https://www.stephaniehicks.com/jhustatcomputing2022/posts/2022-10-13-working-with-text-sentiment-analysis/
* Hussein, D. M. E.-D. M. (2018). A survey on sentiment analysis challenges. Journal of King Saud University - Engineering Sciences, 30(4), 330–338. https://doi.org/10.1016/j.jksues.2016.04.002
* Liu, B. (2011). Sentiment Analysis and Opinion Mining. Department of Computer Science, University of Illinois at Chicago. https://www.cs.uic.edu/~liub/FBS/Sentiment-Analysis-tutorial-AAAI-2011.pdf
* Robinson, D. (2016, July 21). Does sentiment analysis work? A tidy analysis of Yelp reviews. Variance Explained. http://varianceexplained.org/r/yelp-sentiment/
* Silge, J., & Hvitfeldt, E. (2022). Supervised Machine Learning for Text Analysis in R. https://smltar.com/
* Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. https://www.tidytextmining.com/


# Seeing the stop words in our analysis

In [None]:
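policy_df %>%
    ungroup() %>% # drop the username grouping so words are counted across all tweets
    unnest_tokens(output = word, input = text) %>%
    count(word, sort = TRUE) %>% # most frequent words first
    head()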
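
Very common words such as "the" tend to dominate raw word counts like these. One common way to set them aside is to remove every word that appears in a stop word list. The sketch below uses the `stop_words` dataset that ships with `tidytext` together with `anti_join()`; `toy_df` and `toy_content_words` are hypothetical names, and whether stop word removal suits your project is a judgment call.

```r
library(dplyr)
library(tibble)
library(tidytext)

# A one-row example built from one of the tweets above
toy_df <- tibble(text = "I have completely lost trust in the government")

# Tokenize, then drop every word that appears in the stop word list
toy_content_words <- toy_df %>%
    unnest_tokens(output = word, input = text) %>%
    anti_join(stop_words, by = "word") %>%
    count(word, sort = TRUE)

toy_content_words
```

Because `anti_join()` keeps only rows without a match, it works as the mirror image of the `inner_join()` used with the sentiment lexicon.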

# Playing around

In [None]:
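# keep only the inaugural addresses delivered between 1940 and 1944
data <- corpus_subset(data_corpus_inaugural, Year > 1939 & Year < 1945)

# tokenize the corpus and build a document-feature matrix (dfm)
data_dfm <- data %>%
    tokens() %>%
    dfm()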


In [None]:
#data <- corpus_subset(data_corpus_amicus)

# make a dfm
data_dfm <- data %>%
    tokens() %>%
    dfm()
print(data_dfm)