## Tidy Sentiment Analysis
This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

This type of sentiment analysis considers the text as a combination of its individual words, and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This is the approach described in the book [Text Mining with R](http://tidytextmining.com) by [Julia Silge](http://juliasilge.com) and [David Robinson](http://varianceexplained.org).

Dictionary-based methods like this one find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text. Because they work on unigrams (single words) only, they do not take into account qualifiers before a word, as in “no good” or “not true”.
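To make the unigram limitation concrete, here is a minimal sketch with a tiny hand-made lexicon. The words and scores in `toy_lexicon` are invented for illustration and are not taken from AFINN, bing, or nrc. Both sentences should end up with the same total, because "not" has no entry in the lexicon and is never treated as a qualifier.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy lexicon; the scores are made up for illustration
toy_lexicon <- tibble(word  = c("good", "bad", "true"),
                      value = c(2, -2, 1))

toy_texts <- tibble(id   = c("plain", "negated"),
                    text = c("The food was good", "The food was not good"))

toy_scores <- toy_texts %>%
  unnest_tokens(word, text) %>%             # one row per word (unigram)
  inner_join(toy_lexicon, by = "word") %>%  # keep only words found in the lexicon
  group_by(id) %>%
  summarise(score = sum(value))             # text score = sum of word scores

toy_scores
```

Handling negation would require looking at bigrams or longer sequences, which is outside what this notebook does.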

The size of the chunk of text that we use to add up unigram sentiment scores can have an effect on the analysis. Over a text spanning many paragraphs, positive and negative words often cancel each other out to a total near zero, so sentence-sized or paragraph-sized chunks often work better.
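The cancelling-out effect can be sketched with two invented sentences and a hypothetical two-word lexicon (none of this comes from the real lexicons or from the notebook's data). Scored as one chunk, the text sums to zero; scored per sentence, the two opposite sentiments stay visible.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy lexicon; the scores are made up for illustration
toy_lexicon <- tibble(word = c("wonderful", "terrible"), value = c(3, -3))

toy_text <- tibble(sentence = 1:2,
                   text = c("The morning was wonderful",
                            "The evening was terrible"))

scored <- toy_text %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word")

# Whole text treated as a single chunk
whole_text_score <- sum(scored$value)
whole_text_score

# One chunk per sentence
per_sentence <- scored %>%
  group_by(sentence) %>%
  summarise(score = sum(value))
per_sentence
```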

##### Sentiment lexicons
Sentiment analysis of this kind requires a sentiment lexicon, a dictionary of words coded by the sentiment they represent. The `tidytext` package gives access to three general-purpose sentiment lexicons: [AFINN](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), [bing](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), and [nrc](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm). Depending on the installed version of `tidytext`, `afinn` and `nrc` may be downloaded on first use through the `textdata` package rather than shipped with `tidytext` itself.

In [None]:
library(tidytext)
library(tidyverse)

In [None]:
get_sentiments("afinn")

In [None]:
get_sentiments("bing")

In [None]:
get_sentiments("nrc")
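The three lexicons code sentiment differently: AFINN assigns each word a numeric score, bing labels each word as positive or negative, and nrc tags words with categories such as joy. A quick way to inspect how a lexicon is coded is to count its categories. The sketch below does this for bing only, on the assumption that it is available without a separate download in the installed version of `tidytext`.

```r
library(dplyr)
library(tidytext)

# Count how many words bing places in each sentiment category
bing_counts <- get_sentiments("bing") %>%
  count(sentiment)
bing_counts
```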

##### Read documents
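The code below is based on what was done in [another notebook](https://github.com/simonlindgren/Tidy-Text-first-steps/blob/master/Tidy%2Btext%2Bfirst%2Bsteps.ipynb). It reads a semicolon-delimited `csv` file (hence `read_csv2`), splits the text into one word per row, and removes both the standard English stop words and a custom list of stop words.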

In [None]:
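# Read the semicolon-delimited file
documents <- read_csv2("tidyraw2.csv")
# Split the text column into one word per row
tidy_documents <- documents %>%
    unnest_tokens(word, text)
    # Alternative for bigrams: unnest_tokens(ngram, text, token = "ngrams", n = 2)
# Remove the English stop words that ship with tidytext
data(stop_words)
tidy_documents <- anti_join(tidy_documents, stop_words, by = "word")
# Remove custom stop words (the file needs a column named `word`)
my_stop_words <- read_csv2("swestop.csv")
tidy_documents <- anti_join(tidy_documents, my_stop_words, by = "word")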
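The stop-word removal relies on `anti_join`, which keeps only the rows of the first table that have no match in the second. A minimal sketch on invented data (`toy_words` and `toy_stop` are hypothetical and stand in for the tokenized documents and a stop list):

```r
library(dplyr)

toy_words <- tibble(word = c("the", "film", "was", "a", "delight"))
toy_stop  <- tibble(word = c("the", "was", "a"))  # hypothetical stop list

# Keep only the words that do NOT appear in the stop list
kept <- anti_join(toy_words, toy_stop, by = "word")
kept
```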

In [None]:
# View dataframe
tidy_documents

##### Finding sentiments
The `nrc` lexicon codes words into several different sentiment categories. Let's choose 'joy' and keep just those words in a dataframe.

In [None]:
nrcjoy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")
nrcjoy

We will perform the actual sentiment analysis by using the `inner_join` function in `dplyr`, which keeps only the words that appear both in the tidy documents and in the lexicon. For more about different `join` functions, you can have a look [here](http://www.simonlindgren.com/stuff/2017/4/18/dplyr-joins). The code below asks: What are the most common joy words in posts by Joe?

In [None]:
tidy_documents %>%
  filter(blogger == "joe") %>%
  inner_join(nrcjoy, by = "word") %>%
  count(word, sort = TRUE)
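To see what the filter, join, and count steps do, here is the same pipeline on a few invented rows. The data in `toy_tokens` and `toy_joy` is hypothetical; only the column names `blogger`, `word`, and `sentiment` follow the notebook.

```r
library(dplyr)

toy_tokens <- tibble(blogger = c("joe", "joe", "joe", "joe", "ann"),
                     word    = c("happy", "table", "happy", "love", "happy"))

# Hypothetical stand-in for the joy subset of a lexicon
toy_joy <- tibble(word = c("happy", "love"), sentiment = "joy")

joe_joy <- toy_tokens %>%
  filter(blogger == "joe") %>%          # keep Joe's words only
  inner_join(toy_joy, by = "word") %>%  # drops "table", which is not a joy word
  count(word, sort = TRUE)              # most frequent joy words first
joe_joy
```

Here "table" disappears in the join and Ann's "happy" is removed by the filter, so only Joe's joy words are counted.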

##### Map sentiments throughout texts
Now, let's see how sentiment changes throughout texts.

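For this, we need to sort the dataframe by blogger and date, and to add a running number per blogger that reflects the chronological sequence. Note that `tidy_documents` has one row per word, so `linenumber` counts the remaining words rather than lines of the original text.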

In [None]:
# Make linenumbers
tidy_documents <- tidy_documents %>% 
    arrange(blogger, date) %>% 
    group_by(blogger) %>% 
    mutate(linenumber = row_number()) %>%
    ungroup()
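The numbering step can be checked on a small invented table (the bloggers, dates, and words below are hypothetical). After sorting, `row_number()` restarts at 1 for each blogger.

```r
library(dplyr)

toy <- tibble(
  blogger = c("joe", "ann", "joe", "ann", "joe"),
  date    = as.Date(c("2017-03-02", "2017-03-05", "2017-03-01",
                      "2017-03-01", "2017-03-03")),
  word    = c("b", "z", "a", "y", "c")
)

numbered <- toy %>%
  arrange(blogger, date) %>%            # chronological order within each blogger
  group_by(blogger) %>%
  mutate(linenumber = row_number()) %>% # numbering restarts for each blogger
  ungroup()
numbered
```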

- First, we use the `bing` lexicon and `inner_join` to attach a positive or negative label to each word that is found in the lexicon.

In [None]:
sentiment_scores <- tidy_documents %>%
    inner_join(get_sentiments("bing"))
sentiment_scores

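- Second, we count how many positive and negative words there are in defined sections of the text. We define an `index` here, using integer division (`%/%`), which keeps track of which 20-row section of text we are counting sentiments for. Since each row is a word, a section spans roughly 20 consecutive words.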

In [None]:
sentiment_scores <- tidy_documents %>%
    inner_join(get_sentiments("bing")) %>%  
    count(blogger, index = linenumber %/% 20, sentiment)
sentiment_scores

- Third, we use the `spread()` function from the `tidyr` package to get the negative and positive sentiments in separate columns. (In newer versions of `tidyr`, `spread()` still works but is superseded by `pivot_wider()`.)

In [None]:
sentiment_scores <- tidy_documents %>%
    inner_join(get_sentiments("bing")) %>%  
    count(blogger, index = linenumber %/% 20, sentiment) %>%
    spread(sentiment, n, fill = 0)
sentiment_scores

- Fourth and finally, we calculate a net sentiment (positive minus negative).

In [None]:
sentiment_scores <- tidy_documents %>%
    inner_join(get_sentiments("bing")) %>%  
    count(blogger, index = linenumber %/% 20, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative)
sentiment_scores
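The four steps can be traced by hand on a small invented table. This is a sketch under assumptions: the rows in `toy` are hypothetical, the chunk size is 3 instead of 20 to keep the table short, and `pivot_wider()` is used in place of `spread()` as the newer `tidyr` equivalent.

```r
library(dplyr)
library(tidyr)

# Hypothetical words that have already been matched to a lexicon
toy <- tibble(
  blogger    = "joe",
  linenumber = 1:6,
  sentiment  = c("positive", "negative", "positive",
                 "positive", "negative", "negative")
)

net <- toy %>%
  # Integer division assigns rows 1-2 to chunk 0, rows 3-5 to chunk 1, row 6 to chunk 2
  count(blogger, index = linenumber %/% 3, sentiment) %>%
  # One column per sentiment; chunks without a given sentiment get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
net
```

Because `linenumber` starts at 1, the first chunk holds one row fewer than the others. Using `(linenumber - 1) %/% 3` would give chunks of equal size, if that matters for the analysis.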

Now we can plot the sentiment scores across the "plot trajectory" of each set of documents.

In [None]:
# ggplot2 is already loaded as part of the tidyverse
ggplot(sentiment_scores, aes(index, sentiment, fill = blogger)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~blogger, ncol = 2, scales = "free_x")

##### Sentiment wordclouds

In [None]:
library(wordcloud)

In [None]:
tidy_documents %>%
    count(word) %>%
    with(wordcloud(word, n, max.words = 100))

Below, we get sentiments from `bing` and tag positive and negative words in `tidy_documents` by doing an `inner_join`. We then find the most common positive and negative words (through `count`).

Then, to use the `comparison.cloud()` function in the `wordcloud` package, the dataframe must be turned into a matrix. We do this using the `acast()` function from the `reshape2` package.

In [None]:
library(reshape2)

tidy_documents %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)
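To see what `acast()` produces before it is handed to `comparison.cloud()`, here is a minimal sketch on invented counts (the words and numbers in `toy_counts` are hypothetical). The result is a matrix with one row per word and one column per sentiment, with 0 where a word does not occur under a sentiment.

```r
library(reshape2)

toy_counts <- data.frame(
  word      = c("happy", "sad", "love", "hate"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(3, 2, 5, 1)
)

# Rows = words, columns = sentiments, cells = counts (0 where missing)
m <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)
m
```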