
LDA Topic Modeling

(C) 2014 Wouter van Atteveldt, license: [CC-BY-SA]

Latent Dirichlet Allocation is a topic modeling algorithm that automatically clusters words that form a cohesive pattern of co-occurrence. LDA assumes a 'generative model', where a text is generated by selecting one or more topics, and then drawing words from each of those topics. Thus, each document has multiple topics and each word can occur in multiple topics.
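The generative model can be sketched in a few lines of base R: each topic is a distribution over the vocabulary, each document draws a topic mixture, and each word is drawn by first picking a topic and then a word from that topic. The vocabulary, number of topics, and all distributions below are made up purely for illustration:

```r
set.seed(1)
# draw from a Dirichlet distribution via normalized gamma draws (base R)
rdirichlet = function(alpha) { x = rgamma(length(alpha), alpha); x / sum(x) }

vocab = c("claim", "premium", "service", "letter", "phone", "website")
k = 2  # two hypothetical topics

# each topic is a probability distribution over the vocabulary
topics = rbind(rdirichlet(rep(1, length(vocab))),
               rdirichlet(rep(1, length(vocab))))

# generate one document of 20 words:
theta = rdirichlet(rep(.5, k))                # per-document topic mixture
z = sample(k, 20, replace=TRUE, prob=theta)   # a topic for each word position
doc = sapply(z, function(t) sample(vocab, 1, prob=topics[t, ]))
table(doc)
```

Fitting LDA is the inverse problem: given only the documents, recover the topic distributions and per-document mixtures.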

Creating a topic model

Topic models are constructed directly from a term-document matrix using the topicmodels package. As before, we use the create_matrix function from the RTextTools package to create the term-document matrix from a set of customer reviews. Note that we need to remove empty rows and columns (e.g. empty reviews); the row_sums and col_sums functions used for this come from the slam package. The achmea.csv file can be downloaded from github.

library(RTextTools)
library(topicmodels)
library(slam)   # provides row_sums / col_sums

d = read.csv("achmea.csv")
m = create_matrix(d$CONTENT, language="dutch", removeStopwords=T)
m = m[row_sums(m) > 0, col_sums(m) > 0]

Now, we can fit the topic model, say with k=10 topics and alpha=.5. (A smaller alpha means that topics are more 'concentrated' in the documents)

fit = LDA(m, k=10, method="Gibbs", control=list(iter=500, alpha=.5))
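The effect of alpha can be illustrated by drawing topic mixtures from a Dirichlet distribution directly (sampled here via normalized gamma draws in base R; the alpha values are arbitrary):

```r
set.seed(1)
rdirichlet = function(alpha) { x = rgamma(length(alpha), alpha); x / sum(x) }

# size of the largest topic share in a single draw
peak = function(alpha, k=10) max(rdirichlet(rep(alpha, k)))

mean(replicate(500, peak(.1)))  # low alpha: most mass on one or two topics
mean(replicate(500, peak(5)))   # high alpha: mass spread over all topics
```

With a low alpha, the largest topic share is close to 1, so each document is dominated by few topics; with a high alpha, the shares are pushed towards the uniform 1/k.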

We can visually inspect the words per topic using the terms function:

terms(fit, 10)

And let's make a word cloud of the first topic:

library(wordcloud)
library(RColorBrewer)

x = posterior(fit)$terms[1,]
x = sort(x, decreasing=T)[1:100]
x = x[!is.na(x)]
pal <- brewer.pal(6,"YlGnBu")
wordcloud(names(x), x, scale=c(6,.5), min.freq=1, max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal)

As in the 'Corpus Analysis' howto, we can define a function to compute the term statistics to filter on informative words:

term.statistics <- function(dtm) {
    dtm = dtm[row_sums(dtm) > 0, col_sums(dtm) > 0]    # get rid of empty rows/columns
    vocabulary = colnames(dtm)
    data.frame(term = vocabulary,
               characters = nchar(vocabulary),          # term length
               number = grepl("[0-9]", vocabulary),     # contains a digit
               nonalpha = grepl("\\W", vocabulary),     # contains a non-word character
               termfreq = col_sums(dtm),                # total occurrences
               docfreq = col_sums(dtm > 0),             # number of documents containing the term
               reldocfreq = col_sums(dtm > 0) / nDocs(dtm),
               tfidf = tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm > 0)))
}

terms = term.statistics(m)
words = terms$term[order(-terms$tfidf)[1:10000]]
m_filtered = m[, colnames(m) %in% words]
m_filtered = m_filtered[row_sums(m_filtered) > 0,col_sums(m_filtered) > 0]

fit = LDA(m_filtered, k=10, method="Gibbs", control=list(iter=500, alpha=.5))
terms(fit, 10)
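The tf-idf statistic used for filtering rewards terms that are relatively frequent in the documents where they occur but rare across the corpus. The same computation can be illustrated on a tiny made-up document-term matrix in base R (the documents and counts below are invented):

```r
# three toy documents over three terms (hypothetical counts)
m_toy = matrix(c(2, 1, 0,
                 1, 1, 1,
                 0, 2, 3), nrow=3, byrow=TRUE,
               dimnames=list(paste0("doc", 1:3), c("claim", "the", "refund")))

# mean relative frequency in the documents that contain the term
tf_mean = apply(m_toy, 2, function(col) mean((col / rowSums(m_toy))[col > 0]))
# inverse document frequency
idf = log2(nrow(m_toy) / colSums(m_toy > 0))
tfidf = tf_mean * idf
round(tfidf, 3)
```

Note that "the", which occurs in every document, gets a tf-idf of exactly 0 (log2(3/3) = 0), which is why filtering on tf-idf removes uninformative function words.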

Creating a topic model per sentiment category

We can also make a topic model of a subset of the data, for example of all the negative reviews (assuming that neg contains the texts of the negative reviews):

m_neg = create_matrix(neg, removeStopwords=T, language="dutch")
m_neg = m_neg[row_sums(m_neg) > 0,col_sums(m_neg) > 0]
fit = LDA(m_neg, k=10, method="Gibbs", control=list(iter=500, alpha=.5))
terms(fit, 10)

Extracting the topics per document

If you want to e.g. correlate topics with sentiment or add the topics as features for machine learning, it is useful to extract which topics occur in which documents. The fit object contains the needed information, which can be cast into a documents-by-topics matrix of word counts:

library(reshape2)   # provides acast

assignments = data.frame(i=fit@wordassignments$i, j=fit@wordassignments$j, v=fit@wordassignments$v)
docsums = acast(assignments, i ~ v, value.var='j', fun.aggregate=length)
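Each row of the resulting matrix counts how many words in that document were assigned to each topic, so normalizing the rows gives per-document topic proportions. A toy matrix with made-up counts stands in for docsums here:

```r
# hypothetical documents-by-topics word counts (2 documents, 3 topics)
docsums_toy = matrix(c(8, 1, 1,
                       2, 2, 6), nrow=2, byrow=TRUE)

props = docsums_toy / rowSums(docsums_toy)  # topic proportions per document
round(props, 2)
apply(props, 1, which.max)                  # dominant topic per document
```

These proportions can then be joined back to the review metadata, e.g. to compare topic prevalence between positive and negative reviews.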