## Co-word analysis

This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

This notebook examines pairwise *correlation* among words, which is a form of [co-occurrence](https://en.wikipedia.org/wiki/Co-occurrence_networks), or [co-word](http://journals.sagepub.com/doi/abs/10.1177/053901883022002003), analysis.

The code below draws on the book [Text Mining with R](http://tidytextmining.com) by [Julia Silge](http://juliasilge.com) and [David Robinson](http://varianceexplained.org).

In [None]:
library(tidyverse)
library(tidytext)

First, we read the text line by line using the `readLines()` function and convert it into a data frame with one row per line.

In [None]:
text_df <- readLines("canterville_ghost.txt") %>%
    tibble(text = .) %>% # tibble() replaces the deprecated data_frame()
    mutate(line = row_number()) # add line numbers (not necessary)
text_df

Then, we divide the text into 10-line sections.

In [None]:
text_section_words <- text_df %>%
    mutate(section = row_number() %/% 10)
text_section_words
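
The integer division used above can be checked on a small vector. This is only a sketch with 25 made-up row numbers; it shows that `row_number() %/% 10` puts rows 1 to 9 in section 0, so the first section is one line shorter than the rest. The alternative `(rows - 1) %/% 10` is a suggestion, not part of the original notebook, for anyone who wants every full section to hold exactly 10 lines.

```r
# Hypothetical row numbers standing in for row_number()
rows <- 1:25

# Same arithmetic as in the notebook: rows 1-9 -> 0, 10-19 -> 1, 20-25 -> 2
sections <- rows %/% 10
table(sections)

# Alternative (an assumption, not the notebook's method):
# rows 1-10 -> 0, 11-20 -> 1, 21-25 -> 2
sections_alt <- (rows - 1) %/% 10
table(sections_alt)
```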

Next, we tokenise the text to see which *words* (tokens) appear within which section.

In [None]:
data(stop_words)

text_section_words <- text_section_words %>% 
    unnest_tokens(word, text) %>%
    filter(!word %in% stop_words$word) # remove stopwords!
text_section_words

We use the `pairwise_count()` function from the [`widyr`](https://github.com/dgrtwo/widyr) package. The `pairwise_` prefix means that the result has one row for each pair of words in the `word` variable. This lets us count how many sections each pair of words appears in together.

In [None]:
library(widyr)

In [None]:
# count words co-occurring within sections
word_pairs <- text_section_words %>%
    pairwise_count(word, section, sort = TRUE)
word_pairs
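
To see what `pairwise_count()` returns, it can help to run it on a tiny hand-made table. The data below is invented for illustration (three sections, three words) and is not taken from the analysed text. With the default arguments, each pair should appear in both orders (as `item1`/`item2` and the reverse), with `n` giving the number of sections the two words share.

```r
library(tibble)
library(widyr)

# Hypothetical toy data: one row per (section, word) occurrence
toy <- tibble(
  section = c(1, 1, 1, 2, 2, 3, 3),
  word    = c("ghost", "lord", "castle", "ghost", "lord", "ghost", "castle")
)

# "ghost" and "lord" share sections 1 and 2, "ghost" and "castle" share
# sections 1 and 3, while "lord" and "castle" share only section 1
toy_pairs <- pairwise_count(toy, word, section, sort = TRUE)
toy_pairs
```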

We can ask questions such as: Which word occurs most often with word X? 

In [None]:
word_pairs %>%
  filter(item1 == "ghost")

###### Pairwise correlations 
We can now use the `pairwise_cor()` function in `widyr` to find the [phi coefficient](https://en.wikipedia.org/wiki/Phi_coefficient) between words based on how often they appear in the same section.
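
For reference, the phi coefficient of two words $X$ and $Y$ can be written in terms of a 2×2 table of section counts. The notation is introduced here for illustration only: $n_{11}$ is the number of sections containing both words, $n_{00}$ the number containing neither, $n_{10}$ and $n_{01}$ the numbers containing one word without the other, and $n_{1\cdot}$, $n_{0\cdot}$, $n_{\cdot 1}$, $n_{\cdot 0}$ the corresponding row and column totals.

$$
\phi = \frac{n_{11}\, n_{00} - n_{10}\, n_{01}}{\sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 0}\, n_{\cdot 1}}}
$$

A value near 1 means the two words tend to appear in the same sections (or be absent together), a value near 0 means their appearances are unrelated, and a negative value means one word tends to appear where the other does not.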

In [None]:
word_cors <- text_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>% # keep only words that occur at least 20 times
  pairwise_cor(word, section, sort = TRUE)

word_cors
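
As a sanity check on what the `correlation` column means, the phi coefficient of two presence/absence indicators is the same as their ordinary Pearson correlation. The sketch below uses two invented indicator vectors over eight hypothetical sections (not data from the analysed text) and only base R.

```r
# Hypothetical presence (1) / absence (0) of two words across 8 sections
x <- c(1, 1, 1, 0, 0, 1, 0, 0)
y <- c(1, 1, 0, 0, 0, 1, 1, 0)

# Cells of the 2x2 table
n11 <- sum(x == 1 & y == 1) # both words present
n00 <- sum(x == 0 & y == 0) # neither word present
n10 <- sum(x == 1 & y == 0) # only the first word present
n01 <- sum(x == 0 & y == 1) # only the second word present

# Phi coefficient from the table counts and marginal totals
phi <- (n11 * n00 - n10 * n01) /
  sqrt(sum(x == 1) * sum(x == 0) * sum(y == 1) * sum(y == 0))

phi       # 0.5
cor(x, y) # Pearson correlation of the indicators gives the same value
```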

In [None]:
# Which words are the most correlated with word X?

word_cors %>%
  filter(item1 == "lord")

This lets us pick particular interesting words and find the other words most associated with them.

In [None]:
word_cors %>%
  filter(item1 %in% c("lord", "ghost", "canterville", "family")) %>%
  group_by(item1) %>%
  slice_max(correlation, n = 6) %>% # six strongest correlations per word
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_col() +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()

######  Co-word network graph
We can use `igraph` and `ggraph` to visualise the correlations found by the `widyr` package.

In [None]:
library(ggraph)
library(igraph)

In [None]:
word_cors %>%
  filter(correlation > .01) %>% # adjust filter level
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

# Pairs of words in the analysed text that show at least a .01 
# correlation of appearing within the same 10-line section