
LDA Topic Modeling

(C) 2014 Wouter van Atteveldt, license: [CC-BY-SA]

Latent Dirichlet Allocation is a topic modeling algorithm that automatically clusters words that form a cohesive pattern of co-occurrence. LDA assumes a 'generative model', where a text is generated by selecting one or more topics, and then drawing words from each of those topics. Thus, each document has multiple topics and each word can occur in multiple topics.
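The generative model can be sketched in a few lines of base R: each topic is a distribution over the vocabulary, each document draws a topic mixture, and each word is drawn by first picking a topic and then a word from that topic. The vocabulary, number of topics, and all distributions below are made up purely for illustration:

```r
set.seed(1)
# draw from a Dirichlet distribution via normalized gamma draws (base R)
rdirichlet = function(alpha) { x = rgamma(length(alpha), alpha); x / sum(x) }

vocab = c("claim", "premium", "service", "letter", "phone", "website")
k = 2  # two hypothetical topics

# each topic is a probability distribution over the vocabulary
topics = rbind(rdirichlet(rep(1, length(vocab))),
               rdirichlet(rep(1, length(vocab))))

# generate one document of 20 words:
theta = rdirichlet(rep(.5, k))                # per-document topic mixture
z = sample(k, 20, replace=TRUE, prob=theta)   # a topic for each word position
doc = sapply(z, function(t) sample(vocab, 1, prob=topics[t, ]))
table(doc)
```

Fitting LDA is the inverse problem: given only the documents, recover the topic distributions and per-document mixtures.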

Creating a topic model

Topic models are constructed directly from a term-document matrix using the topicmodels package. As before, we use the create_matrix function from the RTextTools package to create the term-document matrix from a set of customer reviews. Note that we need to remove empty rows and columns (e.g. empty reviews); the row_sums and col_sums functions used for this come from the slam package. The achmea.csv file can be downloaded from github.

library(RTextTools)
library(topicmodels)
library(slam)   # provides row_sums / col_sums

d = read.csv("achmea.csv")
m = create_matrix(d$CONTENT, language="dutch", removeStopwords=T)
m = m[row_sums(m) > 0, col_sums(m) > 0]

Now, we can fit the topic model, say with k=10 topics and alpha=.5. (A smaller alpha means that topics are more 'concentrated' in the documents)

fit = LDA(m, k=10, method="Gibbs", control=list(iter=500, alpha=.5))
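The effect of alpha can be illustrated by drawing topic mixtures from a Dirichlet distribution directly (sampled here via normalized gamma draws in base R; the alpha values are arbitrary):

```r
set.seed(1)
rdirichlet = function(alpha) { x = rgamma(length(alpha), alpha); x / sum(x) }

# size of the largest topic share in a single draw
peak = function(alpha, k=10) max(rdirichlet(rep(alpha, k)))

mean(replicate(500, peak(.1)))  # low alpha: most mass on one or two topics
mean(replicate(500, peak(5)))   # high alpha: mass spread over all topics
```

With a low alpha, the largest topic share is close to 1, so each document is dominated by few topics; with a high alpha, the shares are pushed towards the uniform 1/k.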

We can visually inspect the words per topic using the terms function:

terms(fit, 10)

And let's make a word cloud of the first topic:

library(wordcloud)
library(RColorBrewer)

x = posterior(fit)$terms[1,]
x = sort(x, decreasing=T)[1:100]
x = x[!is.na(x)]
pal <- brewer.pal(6,"YlGnBu")
wordcloud(names(x), x, scale=c(6,.5), min.freq=1, max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal)

As in the 'Corpus Analysis' howto, we can define a function to compute the term statistics to filter on informative words:

term.statistics <- function(dtm) {
    dtm = dtm[row_sums(dtm) > 0, col_sums(dtm) > 0]    # get rid of empty rows/columns
    vocabulary = colnames(dtm)
    data.frame(term = vocabulary,
               characters = nchar(vocabulary),          # term length
               number = grepl("[0-9]", vocabulary),     # contains a digit
               nonalpha = grepl("\\W", vocabulary),     # contains a non-word character
               termfreq = col_sums(dtm),                # total occurrences
               docfreq = col_sums(dtm > 0),             # number of documents containing the term
               reldocfreq = col_sums(dtm > 0) / nDocs(dtm),
               tfidf = tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm > 0)))
}

terms = term.statistics(m)
words = terms$term[order(-terms$tfidf)[1:10000]]
m_filtered = m[, colnames(m) %in% words]
m_filtered = m_filtered[row_sums(m_filtered) > 0,col_sums(m_filtered) > 0]

fit = LDA(m_filtered, k=10, method="Gibbs", control=list(iter=500, alpha=.5))
terms(fit, 10)
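The tf-idf statistic used for filtering rewards terms that are relatively frequent in the documents where they occur but rare across the corpus. The same computation can be illustrated on a tiny made-up document-term matrix in base R (the documents and counts below are invented):

```r
# three toy documents over three terms (hypothetical counts)
m_toy = matrix(c(2, 1, 0,
                 1, 1, 1,
                 0, 2, 3), nrow=3, byrow=TRUE,
               dimnames=list(paste0("doc", 1:3), c("claim", "the", "refund")))

# mean relative frequency in the documents that contain the term
tf_mean = apply(m_toy, 2, function(col) mean((col / rowSums(m_toy))[col > 0]))
# inverse document frequency
idf = log2(nrow(m_toy) / colSums(m_toy > 0))
tfidf = tf_mean * idf
round(tfidf, 3)
```

Note that "the", which occurs in every document, gets a tf-idf of exactly 0 (log2(3/3) = 0), which is why filtering on tf-idf removes uninformative function words.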

Creating a topic model per sentiment category

We can also make a topic model of a subset of the data, for example of all the negative reviews (assuming that neg contains the texts of the negative reviews):

m_neg = create_matrix(neg, removeStopwords=T, language="dutch")
m_neg = m_neg[row_sums(m_neg) > 0,col_sums(m_neg) > 0]
fit = LDA(m_neg, k=10, method="Gibbs", control=list(iter=500, alpha=.5))
terms(fit, 10)

Extracting the topics per document

If you want to e.g. correlate topics with sentiment or add the topics as features for machine learning, it is useful to extract which topics occur in which documents. The fit object contains the needed information, which can be cast into a documents-by-topics matrix of word counts:

library(reshape2)   # provides acast

assignments = data.frame(i=fit@wordassignments$i, j=fit@wordassignments$j, v=fit@wordassignments$v)
docsums = acast(assignments, i ~ v, value.var='j', fun.aggregate=length)
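Each row of the resulting matrix counts how many words in that document were assigned to each topic, so normalizing the rows gives per-document topic proportions. A toy matrix with made-up counts stands in for docsums here:

```r
# hypothetical documents-by-topics word counts (2 documents, 3 topics)
docsums_toy = matrix(c(8, 1, 1,
                       2, 2, 6), nrow=2, byrow=TRUE)

props = docsums_toy / rowSums(docsums_toy)  # topic proportions per document
round(props, 2)
apply(props, 1, which.max)                  # dominant topic per document
```

These proportions can then be joined back to the review metadata, e.g. to compare topic prevalence between positive and negative reviews.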