# n-gram analysis

This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

This notebook shows what the `tidytext` package can do when we tokenize by consecutive sequences of words ([n-grams](https://en.wikipedia.org/wiki/N-gram)) rather than by single words.

The code below is based on the book [Text Mining with R](http://tidytextmining.com) by [Julia Silge](http://juliasilge.com) and [David Robinson](http://varianceexplained.org).

In [None]:
library(tidyverse)
library(tidytext)

###### Read documents
The code below reads a `csv` file into a tidy dataset. We use `unnest_tokens()` with `token = "ngrams"` to get n-grams of `n` consecutive words.

In [None]:
documents <- read_csv2("tidyraw2.csv")
tidy_documents <- documents %>%
    # unnest_tokens(word, text)  # alternative: tokenize by single words
    unnest_tokens(bigram, text, token = "ngrams", n = 2)
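To see what this tokenization produces, here is a minimal sketch on a made-up one-sentence tibble. The `toy` object and its text are illustrative assumptions, not part of the dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row input; the notebook itself reads "tidyraw2.csv" instead
toy <- tibble(text = "The quick brown fox jumps")

# Each row of the result is one pair of consecutive words,
# lowercased by default
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
```

The bigrams overlap: a five-word sentence yields four of them, each sharing a word with its neighbour.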

In [None]:
# Inspect the dataframe
tidy_documents

In [None]:
# The most common bigrams
tidy_documents %>%
  count(bigram, sort = TRUE)

Many of those bigrams consist of stopwords. We use `tidyr`'s `separate()` function to split the `bigram` column into `word1` and `word2`, using the space between the words as the separator.

In [None]:
bigrams_separated <- tidy_documents %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_separated

Now remove every row in which either word is one of `tidytext`'s stopwords.

In [None]:
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigrams_filtered

# This can also be done with your own custom list of stopwords.
# Read your own stops: my_stop_words <- read_csv2("swestop.csv")
# Then filter against that list instead of stop_words
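As a rough sketch of the custom-stopword variant, the example below builds a small list inline rather than reading it from a file. The names `my_stop_words`, `toy_separated` and `toy_filtered`, the Swedish example words, and the assumption that the list has a `word` column are all illustrative.

```r
library(dplyr)

# Hypothetical custom stopword list (Swedish: "and", "to", "it");
# assumed here to have a column named `word`
my_stop_words <- tibble(word = c("och", "att", "det"))

# Hypothetical separated bigrams standing in for bigrams_separated
toy_separated <- tibble(
  word1 = c("och", "stora", "det"),
  word2 = c("huset", "huset", "regnar")
)

# Keep only rows where neither word is in the custom list
toy_filtered <- toy_separated %>%
  filter(!word1 %in% my_stop_words$word,
         !word2 %in% my_stop_words$word)

toy_filtered
```

Only the row without any listed word in either position survives the filter.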

In [None]:
# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

After stopword removal, we might want to recombine the bigrams into one column again:

In [None]:
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united

In [None]:
# The same process, but with trigrams
documents <- read_csv2("tidyraw2.csv")
trigram_documents <- documents %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
    separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
    filter(!word1 %in% stop_words$word,
           !word2 %in% stop_words$word,
           !word3 %in% stop_words$word) %>%
    count(word1, word2, word3, sort = TRUE)

In [None]:
trigram_documents

In [None]:
# Recombine the trigrams into one column again
trigrams_united <- trigram_documents %>%
  unite(trigram, word1, word2, word3, sep = " ")

trigrams_united

###### Exploratory analysis of bigrams

In [None]:
# Back to the bigrams
# Which words most often appear next to a given word (as word1 or word2)?
# We use the separated (non-united) columns for that

bigrams_filtered %>%
  filter(word2 == "awoke") %>%
  count(blogger, word1, sort = TRUE)

In [None]:
# We can do the same kinds of analysis as with single words,
# for example look at the tf-idf of bigrams across the documents.
# We use the united column for that

bigram_tf_idf <- bigrams_united %>%
  count(blogger, bigram) %>%
  bind_tf_idf(bigram, blogger, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
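To make the tf-idf weighting concrete, here is a small worked sketch on invented counts. The bloggers `"a"` and `"b"`, the bigrams and the counts in `toy_counts` are hypothetical and not taken from the dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts: two bloggers, three distinct bigrams
toy_counts <- tibble(
  blogger = c("a", "a", "b", "b"),
  bigram  = c("good morning", "red car", "good morning", "blue sky"),
  n       = c(1, 1, 1, 3)
)

# tf is a bigram's share of its blogger's total count;
# idf is the natural log of (number of bloggers / bloggers using the bigram)
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(bigram, blogger, n) %>%
  arrange(desc(tf_idf))

toy_tf_idf
```

A bigram used by every blogger, such as "good morning" here, gets an idf of zero and therefore a tf-idf of zero, while bigrams that are frequent for one blogger and absent elsewhere rise to the top.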

Now, let's visualise the high tf-idf bigrams.

In [None]:
my_plot <- bigram_tf_idf %>%
    arrange(desc(tf_idf)) %>%
    mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) 

my_plot %>%
    top_n(2, tf_idf) %>%
    ggplot(aes(bigram, tf_idf, fill = blogger)) +
    geom_col() +
    labs(x = NULL, y = "tf-idf") +
    coord_flip()

Now, we look at blogs individually:

In [None]:
my_plot %>% 
  group_by(blogger) %>% 
  top_n(5, tf_idf) %>% 
  ungroup() %>%
  ggplot(aes(bigram, tf_idf, fill = blogger)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~blogger, ncol = 2, scales = "free") +
  coord_flip()

###### Visualising bigrams in a graph
A network graph can be constructed from a tidy data frame that has three variables:

- source: the node an edge is coming from
- target: the node an edge is going towards
- weight: a numeric value associated with each edge

We use the `igraph` package and its function `graph_from_data_frame()`. Our dataframe `bigram_counts` from earlier has columns corresponding to 'source', 'target', and 'edge weight' (in this case: `n`).

You may run into problems installing `igraph` via CRAN, but if you use Anaconda you can do: `conda install r-igraph`.
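The mapping from columns to graph elements can be sketched on a tiny, made-up edge list. `toy_edges` and its values are assumptions for illustration; only the column layout mirrors `bigram_counts`.

```r
library(igraph)

# Hypothetical bigram counts with the same column layout as bigram_counts
toy_edges <- data.frame(
  word1 = c("social", "social", "media"),
  word2 = c("media", "network", "platform"),
  n     = c(5, 2, 3)
)

# The first two columns become source and target; any remaining columns
# (here n) are stored as edge attributes
toy_graph <- graph_from_data_frame(toy_edges)

vcount(toy_graph)   # number of distinct words (nodes)
ecount(toy_graph)   # number of bigrams (edges)
E(toy_graph)$n      # counts carried over from the n column
```

Note that the count is kept as an edge attribute called `n`, not `weight`, which is why the plotting code refers to `n` when styling edges.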

In [None]:
library(igraph)
bigram_counts # dataframe from before

Let's filter for the most common pairs (edges), and create the network graph.

In [None]:
bigram_graph <- bigram_counts %>%
  filter(n > 1) %>% # keep only edges (bigrams) that occur more than once
  graph_from_data_frame()

bigram_graph

The `ggraph` package is better than `igraph` at the visualisation bit, so let's use that one to draw the graph. Installing `ggraph` can be tricky, but [this](https://stackoverflow.com/questions/42315364/how-to-install-ggraph-package-to-the-latest-r-v-3-3-2) may help.

In [None]:
library(ggraph)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

Now, tweak the graph.

- add a theme that removes the axes, which carry no meaning in a network layout: `theme_void()`
- tinker with `geom_node_point` to make the nodes light blue and larger
- add directionality to `geom_edge_link` with an arrow, constructed using `grid::arrow()`, and set the `end_cap` option so that the arrow ends before touching the node
- add the `edge_alpha` aesthetic to the link layer to make links transparent based on how common or rare the bigram is

In [None]:
# create the arrows
a <- grid::arrow(type = "closed", length = unit(.10, "inches"))

# the graph
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 6) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()