## Tidy text comparisons
This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

The code in this notebook compares two corpora of texts against a third. I call the third the 'base' corpus, and the other two 'radicals' and 'conservatives'.

When creating this notebook, I drew heavily on [the book](http://tidytextmining.com/) by the creators of the `tidytext` package, and on [Julia Silge](https://twitter.com/juliasilge)'s amazing blog post on [The Life-Changing Magic of Tidying Text](http://juliasilge.com/blog/Life-Changing-Magic/).

#### Preparing the corpora

First, we read and tidy the three corpora. Make sure that the CSV files are prepared correctly: at a minimum, each file must contain one column with `text` as its header. Because the files are read with `read_csv2`, which treats `;` as the column delimiter, stray `;` and `"` characters in messy social media text often cause problems, so remove them. For all other details about these initial steps, see this [other notebook](https://github.com/simonlindgren/Tidy-Text-first-steps/blob/master/Tidy%2Btext%2Bfirst%2Bsteps.ipynb).
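As a minimal sketch of that preparation, the following base R snippet strips `;` and `"` from some made-up posts, writes them to a semicolon-delimited file with a single `text` column, and reads the file back to check the header. The posts and the temporary file path are illustrative assumptions, not part of the original data.

```r
# Hypothetical raw posts containing the two problematic characters
raw_posts <- c('I said "hello"; then left', 'plain text')

# Replace every ; and " with a space
clean_posts <- gsub('[;"]', " ", raw_posts)

# One column, with `text` as the header
corpus <- data.frame(text = clean_posts, stringsAsFactors = FALSE)

# write.csv2() uses ; as the delimiter, which is what read_csv2() expects
csv_path <- tempfile(fileext = ".csv")
write.csv2(corpus, csv_path, row.names = FALSE)

# Read the file back and confirm that the header survived
check <- read.csv2(csv_path, stringsAsFactors = FALSE)
names(check)
```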

In [None]:
library(tidyverse)
library(tidytext)

In [None]:
# read a custom stop word list (the file needs a column named `word` for the joins below)
my_stop_words <- read_csv2("swestop.csv")
#my_stop_words <- read_csv2("swestop_custom.csv")

#### Importing documents
##### First corpus

In [None]:
# Our base corpus (that we want to compare the others to)
# read it, tidy it
base <- read_csv2("base.csv")
tidy_base <- base %>%
    unnest_tokens(word,text)
    #unnest_tokens(ngram, text, token = "ngrams", n = 2)
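The commented-out line switches from single words to bigrams. A small sketch of what that option produces, using a made-up sentence (the `toy` names are illustrative only):

```r
library(dplyr)
library(tidytext)

# A made-up one-row corpus
toy <- tibble(text = "the cat sat")

# Tokenise into overlapping two-word sequences instead of single words
toy_bigrams <- toy %>%
    unnest_tokens(ngram, text, token = "ngrams", n = 2)
toy_bigrams
```

Note that the output column is then called `ngram`, so the later joins with `by = "word"` would need to be adapted if you go down this route.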

In [None]:
# remove the built-in English stop words that ship with tidytext
data(stop_words)
tidy_base <- anti_join(tidy_base, stop_words, by="word")

In [None]:
# remove custom stop words
tidy_base <- anti_join(tidy_base, my_stop_words, by="word")
tidy_base
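To see what tokenising and the two `anti_join` steps do, here is a small sketch on made-up data. The sentences and the tiny stop word list are assumptions for illustration; in the notebook itself the lists come from `stop_words` and `my_stop_words`.

```r
library(dplyr)
library(tidytext)

# Two made-up documents and a tiny stand-in stop word list
toy <- tibble(text = c("The cat sat on the mat", "A dog barked"))
toy_stops <- tibble(word = c("the", "on", "a"))

tidy_toy <- toy %>%
    unnest_tokens(word, text) %>%      # one lowercased word per row
    anti_join(toy_stops, by = "word")  # keep only rows whose word is NOT in the list
tidy_toy
```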

##### Second corpus

In [None]:
radicals <- read_csv2("radicals.csv")
tidy_radicals <- radicals %>%
    unnest_tokens(word,text)
    #unnest_tokens(ngram, text, token = "ngrams", n = 2)

data(stop_words)
tidy_radicals <- anti_join(tidy_radicals, stop_words, by="word")
tidy_radicals <- anti_join(tidy_radicals, my_stop_words, by="word")
tidy_radicals

##### Third corpus

In [None]:
conservatives <- read_csv2("conservatives.csv")
tidy_conservatives <- conservatives %>%
    unnest_tokens(word,text)
    #unnest_tokens(ngram, text, token = "ngrams", n = 2)

data(stop_words)
tidy_conservatives <- anti_join(tidy_conservatives, stop_words, by="word")
tidy_conservatives <- anti_join(tidy_conservatives, my_stop_words, by="word")
tidy_conservatives
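All three corpora go through the same steps, so they could be wrapped in a small function. The sketch below is one possible way to do that; `tidy_corpus()` is a hypothetical helper that does not exist in `tidytext`, and the toy data stands in for the real files.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: tokenise a data frame that has a `text` column,
# then drop every word found in any of the supplied stop word tables
tidy_corpus <- function(corpus, ...) {
    tidy <- unnest_tokens(corpus, word, text)
    for (stops in list(...)) {
        tidy <- anti_join(tidy, stops, by = "word")
    }
    tidy
}

# Toy stand-ins for a corpus and for the two stop word lists
toy_corpus <- tibble(text = "The cat sat on the mat")
english_stops <- tibble(word = c("the", "on"))
custom_stops <- tibble(word = "mat")

tidy_toy <- tidy_corpus(toy_corpus, english_stops, custom_stops)
tidy_toy
```

With the real data this would be called as, for example, `tidy_corpus(radicals, stop_words, my_stop_words)`.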

##### Check counts

In [None]:
tidy_base %>%
    count(word, sort = TRUE)

In [None]:
# check counts
tidy_radicals %>%
  count(word, sort = TRUE)

In [None]:
tidy_conservatives %>%
  count(word, sort = TRUE)

##### Bind corpus two and three together

In [None]:
library(stringr)

tidy_both <- bind_rows(
        mutate(tidy_radicals, author="Radicals"),
        mutate(tidy_conservatives, author="Conservatives"))
tidy_both

##### Calculate frequencies for all three
The next two chunks of code are _definitely_ courtesy of [Julia Silge](http://juliasilge.com/blog/Life-Changing-Magic/)!

In [None]:
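frequency <- tidy_both %>%
    # keep the first run of a-z letters; this drops digits and non-ASCII letters (e.g. å, ä, ö)
    mutate(word = str_extract(word, "[a-z]+")) %>%
    count(author, word) %>%
    rename(other = n) %>%
    # join against the base corpus (not tidy_both) so that Base holds the base counts
    inner_join(count(tidy_base, word), by = "word") %>%
    rename(Base = n) %>%
    # current dplyr returns ungrouped counts, so group explicitly to get per-author proportions
    group_by(author) %>%
    mutate(other = other / sum(other), Base = Base / sum(Base)) %>%
    ungroup()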
frequency
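A toy version of this calculation may make the mechanics easier to follow. The sketch assumes the intended reading is "share of words within each author's joined vocabulary, compared with the base corpus"; all data below is made up.

```r
library(dplyr)

# Made-up tokens: "meme" occurs only among the radicals, never in the base corpus
toy_both <- tibble(
    author = c("Radicals", "Radicals", "Radicals", "Radicals",
               "Conservatives", "Conservatives"),
    word   = c("tax", "tax", "vote", "meme", "tax", "law"))
toy_base <- tibble(word = c("tax", "vote", "vote", "law"))

toy_frequency <- toy_both %>%
    count(author, word, name = "other") %>%
    # the inner join drops words that are missing from the base corpus ("meme")
    inner_join(count(toy_base, word, name = "Base"), by = "word") %>%
    group_by(author) %>%
    mutate(other = other / sum(other), Base = Base / sum(Base)) %>%
    ungroup()
toy_frequency
```

Within each author, both `other` and `Base` sum to 1, which is why the same base word can get a different percentage in each panel.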

Now plot it!

In [None]:
library(scales)

ggplot(frequency, aes(x = other, y = Base, color = abs(Base - other))) +
        geom_abline(color = "gray40") +
        geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
        geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
        scale_x_log10(labels = percent_format()) +
        scale_y_log10(labels = percent_format()) +
        scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
        facet_wrap(~author, ncol = 2) +
        theme_minimal(base_size = 14) +
        theme(legend.position="none") +
        labs(title = "Title of graph",
             subtitle = "Subtitle of graph",
             y = "Base", x = NULL)

##### Interpreting the plots
- Words that are close to the line in these plots have similar frequencies in both sets of texts.
- Words that are far from the line are words that are found more in one set of texts than another. 

(The percent frequency of an individual word can differ between the two panels because of the inner join: not every word is found in all three sets of texts, so each panel's percentages are computed over a different set of words.)
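If you want a list of the words that sit farthest from the line rather than reading them off the plot, one possible approach is to rank them by the log ratio of the two frequencies. This is an addition to the original workflow, shown here on made-up numbers; the column name `log_ratio` is an assumption.

```r
# Made-up frequencies for one author compared with the base corpus
toy_frequency <- data.frame(
    word  = c("tax", "vote", "law"),
    other = c(0.5, 0.1, 0.4),
    Base  = c(0.5, 0.4, 0.1))

# 0 means the word is on the line; positive means more frequent in `other`,
# negative means more frequent in the base corpus
toy_frequency$log_ratio <- log2(toy_frequency$other / toy_frequency$Base)

# Sort by distance from the line, farthest first
toy_frequency[order(-abs(toy_frequency$log_ratio)), ]
```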

##### Correlation test
Let’s quantify how similar and different these sets of word frequencies are using a correlation test.

In [None]:
cor.test(data = frequency[frequency$author == "Conservatives",], ~ other + Base)

In [None]:
cor.test(data = frequency[frequency$author == "Radicals",], ~ other + Base)
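`cor.test()` returns a list, so the individual numbers can be pulled out for reporting. A small sketch on made-up frequencies (the values are illustrative only):

```r
# Made-up word frequencies for four words
toy <- data.frame(
    other = c(0.10, 0.20, 0.30, 0.40),
    Base  = c(0.15, 0.18, 0.33, 0.34))

# Pearson correlation by default, same formula interface as above
result <- cor.test(~ other + Base, data = toy)

result$estimate   # the correlation coefficient
result$p.value    # p-value for the null hypothesis of zero correlation
result$conf.int   # 95% confidence interval for the coefficient
```

A coefficient closer to 1 means the corpus uses the shared words in proportions more similar to the base corpus.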