### Tidytext first steps

This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

Data scientists Julia Silge and David Robinson have developed the R package [tidytext](http://joss.theoj.org/papers/10.21105/joss.00037), the aim of which is to make text mining workflows more efficient by treating text as data frames of individual words.

Storing words as rows in dataframes is different to how text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an [n-gram](https://en.wikipedia.org/wiki/N-gram), sentence, or paragraph.

Tidytext is written to work with the [_tidyverse_](http://tidyverse.org), a collection of R packages that share common principles and are designed to fit together seamlessly.

When creating this notebook, I drew heavily on [the book](http://tidytextmining.com/) by the creators of the package.


Install tidyverse by opening an R prompt in Terminal and entering `install.packages("tidyverse")`, then install tidytext with `install.packages("tidytext")`.

In [None]:
library(tidyverse)
library(tidytext)

#### Importing documents

We use the `readr` package, which is part of tidyverse, to read documents. The function `read_csv2` reads csv files with ';' as column separator and ',' as decimal mark.

`readr` is great in many ways. For example, it often successfully guesses the data type of each column. See its documentation [here](https://github.com/tidyverse/readr).

We read the csv into a dataframe:

In [None]:
documents <- read_csv2("tidyraw.csv")
documents
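
To make the separator behaviour concrete, here is a minimal sketch that does not depend on `tidyraw.csv`. The file contents and the column names (`id`, `score`, `text`) are invented for illustration; the sketch only assumes that `readr` is installed.

```r
library(readr)

# Hypothetical sample data: ';' separates columns and ',' is the decimal mark
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("id;score;text",
    "1;0,5;First document",
    "2;1,25;Second document"),
  tmp
)

# read_csv2 splits on ';' and parses "0,5" as the number 0.5
sample_docs <- read_csv2(tmp)
sample_docs
```

If your own file uses ',' as column separator instead, `read_csv` is the corresponding function.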

#### Tidyfy the documents
To work with our documents as a tidy dataset, let's restructure the dataframe in the one-token-per-row format using the `unnest_tokens` function in `tidytext`. The arguments, for example `(word, text)`, give the name of the column to be written, followed by the name of the column to read.

This makes use of the [`tokenizers`](https://github.com/ropensci/tokenizers) package to separate the text from the initial csv into tokens. The default is to tokenize by words. The code below also includes alternatives for ngrams and...

The procedure retains any other columns, strips any punctuation from the text, and converts the tokens to lowercase by default.

In [None]:
tidy_documents <- documents %>%
    unnest_tokens(word, text)
    # alternative: tokenize into bigrams instead of single words
    # unnest_tokens(ngram, text, token = "ngrams", n = 2)
tidy_documents
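
The same step can be tried on a tiny inline dataset. This is a sketch under assumptions: the tibble `toy` and its columns `id` and `text` are made up for illustration and are not part of the notebook's data.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus
toy <- tibble(
  id   = 1:2,
  text = c("Hello, tidy world!", "Text mining is fun.")
)

# One word per row: punctuation is stripped, words are lowercased,
# and the id column is carried along
toy_words <- toy %>%
  unnest_tokens(word, text)

# One bigram per row: overlapping pairs of adjacent words within each document
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_words
toy_bigrams
```

Note that the input column (`text`) is dropped from the result by default, while `id` is kept so that each token can still be traced back to its document.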

#### Filter rows
Filter away some of the rows, using the `filter()` function in `dplyr`. In this case, we drop the rows where the content of the column 'word' contains at least one digit.

In [None]:
tidy_documents <- tidy_documents %>%
    filter(!grepl("[0-9]", word))
tidy_documents
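
The pattern `"[0-9]"` matches any token that contains a digit anywhere, so mixed tokens are dropped too. The sketch below, on a made-up tibble `toy_tokens`, compares it with a stricter pattern that only drops tokens consisting entirely of digits; which one you want depends on your corpus.

```r
library(dplyr)

# Hypothetical tokens: a plain word, a pure number, a mixed token, another word
toy_tokens <- tibble(word = c("data", "2017", "b2b", "text"))

# Pattern used in this notebook: drop any token containing a digit
no_digits <- toy_tokens %>%
  filter(!grepl("[0-9]", word))

# Stricter alternative: drop only tokens made up entirely of digits
no_pure_numbers <- toy_tokens %>%
  filter(!grepl("^[0-9]+$", word))

no_digits        # keeps "data" and "text"
no_pure_numbers  # keeps "data", "b2b" and "text"
```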

#### Remove stop words
The code below can be used to remove either the standard English stop words, or a list of custom stop words from a file on your computer. Because the join is done on the column `word`, the custom list must have a column with that name.

In [None]:
# remove built in English stop words
data(stop_words)
tidy_documents <- anti_join(tidy_documents, stop_words, by="word")

In [None]:
# create custom stop word list
my_stop_words <- read_csv2("swestop.csv")
#my_stop_words <- read_csv2("swestop_custom.csv")

my_stop_words

In [None]:
# remove custom stop words
tidy_documents <- anti_join(tidy_documents, my_stop_words, by="word")
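
How `anti_join` removes stop words can be seen on a small example. The tibbles `toy_corpus` and `toy_stop` below are invented for illustration; the only requirement is that both have a column named `word`.

```r
library(dplyr)

# Hypothetical tidy corpus and custom stop word list
toy_corpus <- tibble(
  id   = c(1, 1, 1, 2),
  word = c("det", "och", "roligt", "text")
)
toy_stop <- tibble(word = c("det", "och"))

# Keep only the rows of toy_corpus whose word has no match in toy_stop
toy_clean <- anti_join(toy_corpus, toy_stop, by = "word")
toy_clean
```

Other columns in the corpus, such as `id` here, are left untouched by the join.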

#### See the most common words
As we now have our corpus stored as a tidy dataframe, we can make use of the manipulation grammar provided by the [`dplyr` package](https://github.com/tidyverse/dplyr) to select, filter, and arrange the data.

There is also plenty of useful material on what can be done with data in this format in Grolemund & Wickham's book [R for Data Science](http://r4ds.had.co.nz).

Now, let's see which words are the most frequent ones in our dataset:

In [None]:
tidy_documents %>%
    count(word, sort = TRUE)
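
As a quick illustration of what `count()` returns, here is a sketch on a made-up token list (`toy_freq_tokens` is hypothetical). The result has one row per distinct word and a new column `n` holding the frequency.

```r
library(dplyr)

# Hypothetical tokens with repeated words
toy_freq_tokens <- tibble(
  word = c("text", "data", "text", "tidy", "text", "data")
)

# One row per distinct word, sorted by descending frequency in column n
toy_counts <- toy_freq_tokens %>%
  count(word, sort = TRUE)

toy_counts
```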

Also, because we use tidy tools, we can pipe our data directly to the [`ggplot2` package](http://ggplot2.tidyverse.org) to visualise things. The code below is one example. Note that the `filter` sets the threshold for how many times a word must occur to be shown in the graph.

In [None]:
tidy_documents %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()