
Corpus analysis: the document-term matrix

(C) 2014 Wouter van Atteveldt, license: [CC-BY-SA]

The most important object in frequency-based text analysis is the document-term matrix. This matrix has one row per document and one column per term (word), and each cell holds the frequency of that term in that document.
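To make the structure concrete, here is a minimal base-R sketch that builds such a matrix by hand for two toy sentences (the same ones used in the RTextTools example below). The tokenisation (lower-casing and splitting on whitespace) is a simplifying assumption for illustration, not what tm or RTextTools do internally.

```r
# Two toy documents
docs <- c("Chickens are birds", "The bird eats")

# Lower-case and split on whitespace: one token vector per document
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of unique tokens
vocabulary <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document:
# rows are documents, columns are terms
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocabulary))))
rownames(dtm) <- paste0("doc", seq_along(docs))

print(dtm)
print(dim(dtm))
```

The result has 2 rows and 6 columns, and most of its cells are zero even for this tiny example.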

In R, these matrices are provided by the tm (text mining) package. Although this package provides many functions for loading and manipulating these matrices, using them directly is relatively complicated.

Fortunately, the RTextTools package provides an easy function to create a document-term matrix from a data frame. To create a document-term matrix from a simple data frame with a 'text' column, use the create_matrix function:

library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=F)

We can inspect the resulting matrix using the regular R functions:

class(m)
dim(m)


So, m is a DocumentTermMatrix, which is derived from a simple_triplet_matrix as provided by the slam package. Internally, document-term matrices are stored as sparse matrices: with real data we can easily have hundreds of thousands of rows and columns, while the vast majority of cells will be zero (most words don't occur in most documents). Storing this as a regular matrix would waste a lot of memory. In a sparse matrix, only the non-zero entries are stored, as 'simple triplets' of (document, term, frequency).
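The triplet idea can be sketched in base R without any packages. The matrix and the column names i, j and v below are illustrative assumptions (the names mirror the fields that the term.statistics function further down reads from the matrix); this is not slam's actual implementation.

```r
# A small dense document-term matrix in which most cells are zero
dense <- matrix(c(1, 0, 0, 2,
                  0, 0, 3, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"), c("a", "b", "c", "d")))

# Keep only the non-zero cells as (document, term, frequency) triplets
idx <- which(dense != 0, arr.ind = TRUE)
triplets <- data.frame(i = unname(idx[, "row"]),   # document index
                       j = unname(idx[, "col"]),   # term index
                       v = dense[idx])             # frequency
print(triplets)

# Together with the dimensions, the triplets are enough to rebuild the matrix
rebuilt <- matrix(0, nrow = nrow(dense), ncol = ncol(dense),
                  dimnames = dimnames(dense))
rebuilt[cbind(triplets$i, triplets$j)] <- triplets$v
print(identical(rebuilt, dense))
```

Here 3 triplets stand in for 8 cells. The saving grows with the share of zero cells, which is very high for real corpora.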

As seen in the output of dim, our matrix has only 2 rows (documents) and 6 columns (unique words). Since this is a rather small matrix, we can visualize it using as.matrix, which converts the 'sparse' matrix into a regular matrix:

as.matrix(m)


Stemming and stop word removal

So, we can see that each word is kept as is. We can reduce the size of the matrix by dropping stop words and stemming (see the create_matrix documentation for the full range of options):

m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english')
as.matrix(m)

As you can see, the stop words (the and are) are removed, while the singular and plural forms of bird are joined together.

In RTextTools, the language for stemming and stop words can be given as a parameter, and the default is English. Note that stemming works relatively well for English, but is less useful for more highly inflected languages such as Dutch or German. An easy way to see the effects of the preprocessing is by looking at the colSums of a matrix, which gives the total frequency of each term:

colSums(as.matrix(m))


For Dutch, the result is less promising:

text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch")
colSums(as.matrix(m))

As you can see, de and hebben are correctly recognized as stop words, but gegeten and kippen have a different stem than eet and kip.

Loading and analysing a larger dataset

Let's have a look at a more serious example. The file achmea.csv contains 22 thousand customer reviews, of which around 5 thousand have been manually coded with sentiment. This file can be downloaded from GitHub.

d = read.csv("achmea.csv")

For this example, we will only be using the CONTENT and SENTIMENT columns. We will load it, without stemming but with stopword removal, using create_matrix:

m = create_matrix(d$CONTENT, removeStopwords=T, language="dutch")

Corpus analysis: word frequency

What are the most frequent words in the corpus? As shown above, we could use the built-in colSums function, but this requires first casting the sparse matrix to a regular matrix, which we want to avoid (even our relatively small dataset would have 400 million entries!). So, we use the col_sums function from the slam package, which provides the same functionality for sparse matrices:

library(slam)
freq = col_sums(m)
# sort the terms by decreasing frequency using the built-in order function:
freq = freq[order(-freq)]
head(freq, n=10)

As can be seen, the most frequent terms are all related to Achmea (unsurprisingly). It can be useful to compute different metrics per term, such as term frequency, document frequency (the number of documents in which it occurs), and tf.idf (term frequency * inverse document frequency, which gives low weight to terms that are rare within documents as well as to terms that occur in most documents).
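As a worked illustration of these metrics, the following base-R sketch computes them for a tiny hand-made matrix. The counts and term names are made up, and the tf.idf variant used here (mean relative term frequency in the documents where the term occurs, times the base-2 log of the inverse document frequency) is one common choice among several.

```r
# Hypothetical counts: 4 documents, 3 terms
dtm <- matrix(c(2, 1, 0,
                1, 1, 0,
                3, 0, 1,
                2, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("common", "medium", "rare")))

n_docs   <- nrow(dtm)
termfreq <- colSums(dtm)        # total occurrences of each term
docfreq  <- colSums(dtm > 0)    # number of documents containing each term

# Relative frequency of each term within each document
rel <- dtm / rowSums(dtm)

# Mean relative frequency over the documents in which the term occurs
mean_tf <- colSums(rel) / docfreq

# Inverse document frequency and the resulting tf.idf score
idf   <- log2(n_docs / docfreq)
tfidf <- mean_tf * idf

print(data.frame(termfreq, docfreq, idf, tfidf))
```

The term that occurs in every document gets an idf of 0 and therefore a tf.idf of 0, however frequent it is.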

To make this easy, let's define a function term.statistics to compute this information from a document-term matrix (also available from the corpustools package):

library(tm)    # for nDocs
library(slam)  # for row_sums and col_sums

term.statistics <- function(dtm) {
    dtm = dtm[row_sums(dtm) > 0,col_sums(dtm) > 0]    # get rid of empty rows/columns
    vocabulary = colnames(dtm)
    data.frame(term = vocabulary,
               characters = nchar(vocabulary),
               number = grepl("[0-9]", vocabulary),
               nonalpha = grepl("\\W", vocabulary),
               termfreq = col_sums(dtm),
               docfreq = col_sums(dtm > 0),
               reldocfreq = col_sums(dtm > 0) / nDocs(dtm),
               tfidf = tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm > 0)))
}

terms = term.statistics(m)

So, we can remove all words containing numbers and non-alphanumeric characters, and sort by term frequency:

terms = terms[!terms$number & !terms$nonalpha, ]
terms = terms[order(-terms$termfreq), ]

This is still not a very useful list, as the top terms occur in too many documents to be informative. So, let's remove all words that occur in 10% or more of the documents, and let's also remove all words that occur in 10 documents or fewer:

terms = terms[terms$reldocfreq < .1 & terms$docfreq > 10, ]

This seems more useful. We now have 2316 terms left of the original 20 thousand. To create a new document-term matrix with only these terms, index on the right columns:

m_filtered = m[, colnames(m) %in% terms$term]

As a bonus, using the wordcloud package, we can visualize the top words as a word cloud:

library(wordcloud) # also attaches RColorBrewer, which provides brewer.pal
pal <- brewer.pal(6,"YlGnBu") # color model
wordcloud(terms$term[1:100], terms$termfreq[1:100], 
          scale=c(6,.5), min.freq=1, max.words=Inf, random.order=FALSE, 
          rot.per=.15, colors=pal)

Comparing corpora

If we have two different corpora, we can see which words are more frequent in each corpus. Let's create two document-term matrices, one containing all positive comments, and one containing all negative comments.

# pos and neg are assumed to be character vectors holding the texts of the
# positive and negative reviews (selected from d$CONTENT using d$SENTIMENT)
m_pos = create_matrix(pos, removeStopwords=T, language="dutch")
m_neg = create_matrix(neg, removeStopwords=T, language="dutch")

So, which words are used in positive reviews? Let's make a function to speed it up:

wordfreqs = function(m) {freq = col_sums(m); freq[order(-freq)]}
head(wordfreqs(m_pos), n=10)

And what words are used in negative reviews?

head(wordfreqs(m_neg), n=10)


For the positive reviews, the words make sense (goed, snel). The negative reviews contain more general terms, and the term fbto actually occurs in both.

Can we check which words are more frequent in the negative reviews than in the positive? We can define a function compare.corpora that makes this comparison by normalizing the term frequencies by dividing by corpus size, and then computing the 'overrepresentation' and the chi-squared statistic (also available from the corpustools package).

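# chi-squared statistic for the 2x2 table with rows (a, b) and (c, d)
chi2 <- function(a,b,c,d) {
  # contribution of one cell: (observed - expected)^2 / expected
  ooe <- function(o, e) {(o-e)*(o-e) / e}
  tot = 0.0 + a+b+c+d
  a = as.numeric(a)
  b = as.numeric(b)
  c = as.numeric(c)
  d = as.numeric(d)
  (ooe(a, (a+c)*(a+b)/tot)
   +  ooe(b, (b+d)*(a+b)/tot)
   +  ooe(c, (a+c)*(c+d)/tot)
   +  ooe(d, (d+b)*(c+d)/tot))
}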

compare.corpora <- function(dtm.x, dtm.y, smooth=.001) {
  freqs = term.statistics(dtm.x)[, c("term", "termfreq")]
  freqs.rel = term.statistics(dtm.y)[, c("term", "termfreq")]
  f = merge(freqs, freqs.rel, all=T, by="term")    
  f[is.na(f)] = 0    # terms missing from one corpus have frequency 0 there
  f$relfreq.x = f$termfreq.x / sum(freqs$termfreq)
  f$relfreq.y = f$termfreq.y / sum(freqs.rel$termfreq)
  f$over = (f$relfreq.x + smooth) / (f$relfreq.y + smooth)
  f$chi = chi2(f$termfreq.x, f$termfreq.y, sum(f$termfreq.x) - f$termfreq.x, sum(f$termfreq.y) - f$termfreq.y)
  f
}

cmp = compare.corpora(m_pos, m_neg)

As you can see, for each term the absolute and relative frequencies are given for both corpora. In this case, x is positive and y is negative. The 'over' column shows the amount of overrepresentation: a high number indicates that it is relatively more frequent in the x (positive) corpus. 'Chi' is a measure of how unexpected this overrepresentation is: a high number means that it is a very typical term for that corpus.
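To see how the two measures are computed for a single term, here is a small self-contained sketch. The counts are hypothetical, and the cross-check against R's built-in chisq.test (without continuity correction) is added only for illustration.

```r
# Hypothetical counts: occurrences of one term and total tokens per corpus
term_x <- 30; total_x <- 1000   # x corpus (e.g. positive reviews)
term_y <- 10; total_y <- 1000   # y corpus (e.g. negative reviews)
smooth <- 0.001                 # same default smoothing as compare.corpora

# Overrepresentation: ratio of the smoothed relative frequencies
over <- (term_x / total_x + smooth) / (term_y / total_y + smooth)

# 2x2 contingency table: this term versus all other tokens, per corpus
tab <- matrix(c(term_x,           term_y,
                total_x - term_x, total_y - term_y),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("term", "other"), c("x", "y")))

# Expected counts under independence, then the chi-squared statistic
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
chi <- sum((tab - expected)^2 / expected)

# Cross-check with the built-in test, without continuity correction
chi_builtin <- unname(chisq.test(tab, correct = FALSE)$statistic)

print(c(over = over, chi = chi, chi_builtin = chi_builtin))
```

With these numbers the term is about 2.8 times overrepresented in x, and both chi-squared computations give about 10.2.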

Let's sort by overrepresentation:

cmp = cmp[order(cmp$over), ]

So, the most overrepresented words in the negative corpus are words like risico, beter, and maanden. Note that beter is somewhat surprising: a sentiment word list would probably treat it as a positive word.

We can also sort by chi-squared, taking only the underrepresented (negative) words:

neg = cmp[cmp$over < 1, ]
neg = neg[order(-neg$chi), ]

As you can see, the list is very comparable, but more frequent terms are generally favoured in the chi-squared approach since the chance of 'accidental' overrepresentation is smaller.
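The effect of frequency on chi-squared can be illustrated with two hypothetical terms that have the same 3:1 ratio between the corpora but very different absolute counts. The helper chi2_2x2 is introduced here only for this sketch; it computes the ordinary chi-squared statistic for a term-versus-rest table.

```r
# Chi-squared for one term, given its count and the total tokens per corpus
chi2_2x2 <- function(term_x, term_y, total_x, total_y) {
  tab <- matrix(c(term_x,           term_y,
                  total_x - term_x, total_y - term_y),
                nrow = 2, byrow = TRUE)
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Same 3:1 ratio in both cases, in two corpora of 10000 tokens each
rare_chi     <- chi2_2x2(3,   1,   10000, 10000)
frequent_chi <- chi2_2x2(300, 100, 10000, 10000)

print(c(rare = rare_chi, frequent = frequent_chi))
```

The rare term scores about 1 and the frequent term about 102, so sorting by chi-squared pushes the frequent term far higher even though the unsmoothed ratio is identical.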

Let's make a word cloud of the most typical negative terms:

pal <- brewer.pal(6,"YlGnBu") # color model
wordcloud(neg$term[1:100], neg$chi[1:100], 
          scale=c(6,.5), min.freq=1, max.words=Inf, random.order=FALSE, 
          rot.per=.15, colors=pal)

And the positive terms:

pos = cmp[cmp$over > 1, ]
pos = pos[order(-pos$chi), ]
wordcloud(pos$term[1:100], pos$chi[1:100]^.5, 
          scale=c(6,.5), min.freq=1, max.words=Inf, random.order=FALSE, 
          rot.per=.15, colors=pal)