# Tidy Topic Modelling

This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

It covers [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) topic modelling with the R package [`topicmodels`](https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf), which provides an interface to the C code for LDA and CTM models by [David Blei](https://en.wikipedia.org/wiki/David_Blei) and co-authors, and to the C++ code for fitting LDA models using [Gibbs sampling](https://en.wikipedia.org/wiki/Gibbs_sampling) by [Phan and Nguyen](http://gibbslda.sourceforge.net).

The code below draws on the book [Text Mining with R](http://tidytextmining.com) by [Julia Silge](http://juliasilge.com) and [David Robinson](http://varianceexplained.org).

In [None]:
library(tidyverse)
library(tidytext)
library(tm)

To fit the topic model, we need a document-term matrix (DTM) as input. Here we use the `DocumentTermMatrix` class from the R package `tm`.

So first, we use `tm` to create a DTM. Step one is to create a corpus from a directory of text files. In this example, the text files are in the directory `data`, relative to the working directory.

In [None]:
myCorpus <- Corpus(DirSource("data"))
summary(myCorpus) # Check what went in

We apply some cleaning of the documents.

In [None]:
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) # wrapper keeps the corpus structure intact
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
myCorpus <- tm_map(myCorpus, stemDocument, language = "english") # stemming needs the SnowballC package
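
To see what these transformations do, here is a minimal sketch that applies them to a small in-memory corpus. The two sentences are made up for illustration, and `VCorpus(VectorSource(...))` stands in for the directory source used above. Stemming is left out so that the sketch does not depend on `SnowballC`.

```r
library(tm)

# Two made-up documents, used only for illustration
texts <- c("The 3 Cats are sleeping, again!", "Dogs   and cats: 2 of each.")
toyCorpus <- VCorpus(VectorSource(texts))

toyCorpus <- tm_map(toyCorpus, removeNumbers)
toyCorpus <- tm_map(toyCorpus, removePunctuation)
toyCorpus <- tm_map(toyCorpus, content_transformer(tolower))
toyCorpus <- tm_map(toyCorpus, removeWords, stopwords("english"))
toyCorpus <- tm_map(toyCorpus, stripWhitespace)

# Pull out the cleaned text of each document as a character vector
cleaned <- vapply(toyCorpus, function(d) trimws(content(d)), character(1))
cleaned
```

Printing `cleaned` after each step is a simple way to check that a transformation does what you expect before running it on the full corpus.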

Then, we create the DTM.

In [None]:
myDTM <- DocumentTermMatrix(myCorpus)
myDTM <- removeSparseTerms(myDTM, 0.75) # drop terms absent from more than 75% of documents; lower values give a smaller DTM
myDTM
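
One thing to watch for: after `removeSparseTerms()`, a document can end up with no terms at all, and `LDA()` stops with an error if any row of the DTM is empty. The following is a sketch of one way to drop such rows, assuming `slam::row_sums()` for the row totals (`slam` is installed as a dependency of `tm`). The three toy documents are made up.

```r
library(tm)

# Three made-up documents; "zebra" shares no term with the others
toyCorpus <- VCorpus(VectorSource(c("apple banana apple", "banana cherry", "zebra")))
toyDTM <- DocumentTermMatrix(toyCorpus)

# With three documents, a threshold of 0.6 keeps only terms
# that occur in at least two of them
toyDTM <- removeSparseTerms(toyDTM, 0.6)

# Count the remaining terms per document and drop the empty documents
docTotals <- slam::row_sums(toyDTM)
toyDTM <- toyDTM[docTotals > 0, ]
toyDTM
```

The same two lines can be applied to `myDTM` before fitting the model, at the cost of losing the dropped documents from the analysis.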

(We don't need it for this particular analysis, but `tidytext` can also be used to tidy the DTM, that is, to turn it into a data frame with one-token-per-document-per-row. The tidied version includes only the non-zero counts.)

In [None]:
tidyDTM <- tidy(myDTM)
tidyDTM
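
The point about non-zero values can be checked on a tiny example. This is a sketch with two made-up documents, not the notebook's data: the DTM has six cells, but only the cells with a positive count become rows in the tidy version.

```r
library(tm)
library(tidytext)

# Two made-up documents with three distinct terms between them
toyCorpus <- VCorpus(VectorSource(c("apple banana apple", "banana cherry")))
toyDTM <- DocumentTermMatrix(toyCorpus)

toyTidy <- tidy(toyDTM)
toyTidy

# Number of cells in the DTM versus number of rows in the tidy version
prod(dim(toyDTM))
nrow(toyTidy)
```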

### LDA Topic Modelling
Moving on, we use the `LDA()` function from the `topicmodels` package to fit an LDA model with `k` topics.

In [None]:
library(topicmodels)

In [None]:
dtm_lda <- LDA(myDTM, k = 8)

# Or: also set a seed so that the result is reproducible
# dtm_lda <- LDA(myDTM, k = 8, control = list(seed = 1234))

dtm_lda
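
The number of topics `k` has to be chosen by the analyst. One common heuristic, sketched below, is to fit models for a few candidate values and compare their perplexity on held-out documents (lower is better). This is an illustration rather than part of the original workflow: it assumes the `AssociatedPress` DTM bundled with `topicmodels` as stand-in data, and the split sizes and candidate values are arbitrary.

```r
library(topicmodels)

# Stand-in data: a small slice of the DTM that ships with topicmodels
data("AssociatedPress", package = "topicmodels")
train <- AssociatedPress[1:40, ]
heldout <- AssociatedPress[41:50, ]

# Fit one model per candidate k and score it on the held-out documents
candidate_k <- c(2, 4, 8)
perp <- sapply(candidate_k, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = heldout)
})

data.frame(k = candidate_k, perplexity = perp)
```

Perplexity is only a rough guide. Whether the resulting topics are interpretable still has to be judged by reading the top terms.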

Now, we use the `tidy()` method to extract the per-topic-per-word probabilities ("beta") from the model.

We get a dataframe in a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic.

In other words, each row says that a given `term` has probability `beta` of being generated from a given `topic`.

In [None]:
topics <- tidy(dtm_lda, matrix = "beta")
topics
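
Because beta is a probability distribution over terms, the beta values within each topic should add up to 1. The sketch below checks this on a small model, again assuming the `AssociatedPress` DTM bundled with `topicmodels` as stand-in data. The names `toy_lda` and `beta_totals` are illustrative.

```r
library(topicmodels)
library(tidytext)
library(dplyr)

# Small stand-in model fitted on 20 documents
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1234))

# Sum the per-term probabilities within each topic
beta_totals <- tidy(toy_lda, matrix = "beta") %>%
  group_by(topic) %>%
  summarise(total_beta = sum(beta))
beta_totals
```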

Based on this, we can use the `top_n()` function in the `dplyr` package to find the `n` terms with the highest beta within each topic.

In [None]:
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms
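
In `dplyr` 1.0.0 and later, `top_n()` is superseded by `slice_max()`, which does the same job and returns the rows already sorted within each group. A minimal sketch on a made-up beta table (the terms and values are invented):

```r
library(dplyr)

# Made-up per-topic term probabilities
toy_topics <- tibble(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("tax", "vote", "law", "tax", "goal", "team"),
  beta  = c(0.5, 0.3, 0.2, 0.1, 0.6, 0.3)
)

# Keep the two highest-beta terms in each topic
toy_top <- toy_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 2) %>%
  ungroup()
toy_top
```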

And, because this is a tidy data frame, it can be easily visualised through `ggplot2`.

In [None]:
# Visualise the terms most common within each topic
# = word-topic probabilities
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
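
When the same term appears among the top terms of several topics, `reorder()` gives it a single position across all facets, so the bars in some panels can end up out of order. The `tidytext` helpers `reorder_within()` and `scale_x_reordered()` are meant for this case. A sketch on made-up values, where "tax" appears in both topics:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Made-up top terms; "tax" is shared by both topics
toy_top <- tibble(
  topic = c(1, 1, 2, 2),
  term  = c("tax", "vote", "tax", "goal"),
  beta  = c(0.5, 0.3, 0.1, 0.6)
)

# Order terms by beta separately within each topic
toy_plot_data <- toy_top %>%
  mutate(term = reorder_within(term, beta, topic))

toy_plot <- ggplot(toy_plot_data, aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_reordered() + # strips the within-topic suffix from the axis labels
  coord_flip()
toy_plot
```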

#### Topic differences
We can also consider the terms that have the greatest difference in β between two chosen topics. We first reshape the dataframe with `mutate()` and `spread()` so that each topic has its own column of β values. Then we `filter()` for relatively common words (those with a beta greater than .001 in at least one of the two topics). Finally we `mutate()` a `log_ratio` column, log2(β2 / β1). This is a symmetrical measure: β2 being twice as large as β1 gives a log ratio of 1, while β1 being twice as large gives -1.

Below, we do this for topic1 and topic2.

In [None]:
beta_spread <- topics %>%
  mutate(topic = paste0("topic", topic)) %>% # prefix the topic number with 'topic'
  spread(topic, beta) %>% # spread to one column per topic
  filter(topic1 > .001 | topic2 > .001) %>% # keep terms that are common in either of the 2 compared topics
  mutate(log_ratio = log2(topic2 / topic1)) # log ratio of the 2 compared topics
beta_spread
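
The symmetry of the log ratio can be checked with a few made-up beta values. In this small sketch, the first term is twice as probable in topic 2, the second is twice as probable in topic 1, and the third is equally probable in both.

```r
# Hypothetical beta values for three terms in two topics
beta_topic1 <- c(0.001, 0.002, 0.004)
beta_topic2 <- c(0.002, 0.001, 0.004)

# Same formula as in the cell above
log_ratio <- log2(beta_topic2 / beta_topic1)
log_ratio
```

A doubling in either direction moves the ratio by the same distance from zero, which is why the log scale is preferred over the raw ratio here.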

We then grab the top 10 and the bottom 10 log_ratios to get the words with the greatest differences (in both directions) between the two compared topics.

In [None]:
beta_spread <- beta_spread %>%
    arrange(desc(log_ratio)) # sort the dataframe by log ratio

top10 <- beta_spread %>% # the 10 largest log ratios
    top_n(10, log_ratio)

bottom10 <- beta_spread %>% # the 10 smallest log ratios
    top_n(-10, log_ratio)

top_bottom <- full_join(top10, bottom10) # combine the top and bottom rows

In [None]:
top_bottom # keeps all topic columns, but log_ratio is based only on the two topics compared above

In [None]:
# Plot it
top_bottom %>%
    mutate(term = reorder(term, log_ratio)) %>% # reorder terms by log_ratio to get the correct bar order
    ggplot(aes(term, log_ratio, fill = log_ratio)) + # plot terms by log ratio
    geom_col() + # bar chart
    coord_flip() # flip the axes

#### Document-topic probabilities
Now, we calculate gamma (document-topic probabilities) instead of beta (word-topic probabilities). 

In the resulting dataframe, the gamma values reflect the estimated proportion of words from a given document that are generated from a given topic.

In [None]:
topics <- tidy(dtm_lda, matrix = "gamma") %>% # note: this overwrites the beta table stored in `topics`
    mutate(proportion = round(gamma, 8)) %>%
    arrange(desc(document), desc(proportion)) # sort by document, then proportion
topics
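
Two quick checks are often useful at this point: the gamma values of each document should sum to 1, and it can be handy to list the single most probable topic per document. The sketch below does both on a small stand-in model, assuming the `AssociatedPress` DTM bundled with `topicmodels`. The names `toy_gamma`, `gamma_totals` and `dominant` are illustrative.

```r
library(topicmodels)
library(tidytext)
library(dplyr)

# Small stand-in model fitted on 20 documents
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1234))
toy_gamma <- tidy(toy_lda, matrix = "gamma")

# Gamma values sum to 1 within each document
gamma_totals <- toy_gamma %>%
  group_by(document) %>%
  summarise(total_gamma = sum(gamma))

# The single most probable topic for each document
dominant <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()
dominant
```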

Visualise document topic proportions.

In [None]:
topics %>%
    ggplot(aes(document, proportion, fill = factor(topic))) + #variables
    ggtitle("Topic proportions in documents")+ # plot title
    geom_col() + # plot type    
    ylab("Topic proportions")+ # y-axis title
    xlab("Documents")+ # x-axis title
    scale_fill_discrete(name = "Topics") +# legend title
    theme(legend.position = "right")+
    coord_flip()
  


In [None]:
# Try a different visualisation
topics %>%
  mutate(document = reorder(document, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_col() +
  facet_wrap(~ document)

If we find that a single topic accounts for a very large proportion of a given document, we may want to look at the most frequent terms in that document.

In [None]:
tidy(myDTM) %>%
    filter(document == "Jonna.txt") %>%
    arrange(desc(count))