Corpus analysis: the document-term matrix
Wouter van Atteveldt
head = function(...) knitr::kable(utils::head(...))
The most important object in frequency-based text analysis is the document term matrix. This matrix contains the documents in the rows and terms (words) in the columns, and each cell is the frequency of that term in that document.
In R, these matrices are provided by the
tm (text mining) package.
Although this package provides many functions for loading and manipulating these matrices,
using them directly is relatively complicated.
RTextTools package provides an easy function to create a document-term matrix from a data frame.
To create a term document matrix from a simple data frame with a 'text' column, use the
create_matrix function (with removeStopwords=F to make sure all words are kept):
library(RTextTools) input = data.frame(text=c("Chickens are birds", "The bird eats")) m = create_matrix(input$text, removeStopwords=F)
We can inspect the resulting matrix m using the regular R functions to get e.g. the type of object and the dimensionality:
class(m) dim(m) m
m is a
DocumentTermMatrix, which is derived from a
simple_triplet_matrix as provided by the
Internally, document-term matrices are stored as a sparse matrix:
if we do use real data, we can easily have hundreds of thousands of rows and columns, while the vast majority of cells will be zero (most words don't occur in most documents).
Storing this as a regular matrix would waste a lot of memory.
In a sparse matrix, only the non-zero entries are stored, as 'simple triplets' of (document, term, frequency).
As seen in the output of
dim, Our matrix has only 2 rows (documents) and 6 columns (unqiue words).
Since this is a fairly small matrix, we can visualize it using
as.matrix, which converts the 'sparse' matrix into a regular matrix:
Stemming and stop word removal
So, we can see that each word is kept as is. We can reduce the size of the matrix by dropping stop words and stemming (changing a word like 'chickens' to its base form or stem 'chicken'): (see the create_matrix documentation for the full range of options)
m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english') dim(m) as.matrix(m)
As you can see, the stop words (the and are) are removed, while the two verb forms of to eat are joined together.
In RTextTools, the language for stemming and stop words can be given as a parameter, and the default is English. Note that stemming works relatively well for English, but is less useful for more highly inflected languages such as Dutch or German. An easy way to see the effects of the preprocessing is by looking at the colSums of a matrix, which gives the total frequency of each term:
For more richly inflected languages like Dutch, the result is less promising:
text = c("De kip eet", "De kippen hebben gegeten") m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch") colSums(as.matrix(m))
As you can see, de and hebben are correctly recognized as stop words, but gegeten (eaten) and kippen (chickens) have a different stem than eet (eat) and kip (chicken). German gets similarly bad results.
Loading and analysing a larger dataset from AmCAT
If we want to move beyond stemming, one option is to use AmCAT to parse articles. Before we can proceed, we need to save our AmCAT password (only needed once, please don't save the password in your file!) and log onto amcat:
library(amcatr) amcat.save.password("https://amcat.nl", "username", "password") conn = amcat.connect("https://amcat.nl")
library(amcatr) conn = amcat.connect("https://amcat.nl")
Now, we can upload articles from R using the
amcat.upload.articles function, which we now demonstrate with a single article but which can also be used to upload many articles at once:
articles = data.frame(text = "John is a great fan of chickens, and so is Mary", date="2001-01-01T00:00", headline="test") aset = amcat.upload.articles(conn, project = 1, articleset="Test", medium="test", text=articles$text, date=articles$date, headline=articles$headline)
And we can then lemmatize this article and download the results directly to R
tokens = amcat.gettokens(conn, project=1, articleset = aset, module = "corenlp_lemmatize") head(tokens)
And we can see that e.g. for "is" the lemma "be" is given.
For a more serious application, we will use an existing article set: set 16017 in project 559, which contains the state of the Union speeches by Bush and Obama (each document is a single paragraph)
This data is available directly from the corpustools package:
library(corpustools) data(sotu) nrow(sotu.tokens) head(sotu.tokens, n=20)
As you can see, the result is similar to the ad-hoc lemmatized tokens, but now we have around 100 thousand tokens rather than 6.
We can create a document-term matrix using the dtm.create command from
dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma, filter=sotu.tokens$pos1 %in% c("M","N")) dtm
So, we now have a "sparse" matrix of almost 7,000 documents by more than 70,000 terms.
Sparse here means that only the non-zero entries are kept in memory,
because otherwise it would have to keep all 70 million cells in memory (and this is a relatively small data set).
Thus, it might not be a good idea to use functions like
colSums on such a matrix,
since these functions convert the sparse matrix into a regular matrix.
The next section investigates a number of useful functions to deal with (sparse) document-term matrices.
Corpus analysis: word frequency
What are the most frequent words in the corpus?
As shown above, we could use the built-in
but this requires first casting the sparse matrix to a regular matrix,
which we want to avoid (even our relatively small dataset would have 400 million entries!).
However, we can use the
col_sums function from the
slam package, which provides the same functionality for sparse matrices:
library(slam) freq = col_sums(dtm) # sort the list by reverse frequency using built-in order function: freq = freq[order(-freq)] head(freq, n=10)
As can be seen, the most frequent terms are America and recurring issues like jobs and taxes.
It can be useful to compute different metrics per term, such as term frequency, document frequency (how many documents does it occur), and td.idf (term frequency * inverse document frequency, which removes both rare and overly frequent terms).
term.statistics from the
corpus-tools package provides this functionality:
terms = term.statistics(dtm) terms = terms[order(-terms$termfreq), ] head(terms, 10)
As you can see, for each word the total frequency and the relative document frequency is listed,
as well as some basic information on the number of characters and the occurrence of numerals or non-alphanumeric characters.
This allows us to create a 'common sense' filter to reduce the amount of terms, for example removing all words containing a letter or punctuation mark, and all short (
characters<=2) infrequent (
termfreq<25) and overly frequent (
subset = terms[!terms$number & !terms$nonalpha & terms$characters>2 & terms$termfreq>=25 & terms$reldocfreq<.25, ] nrow(subset) head(subset, n=10)
This seems more to be a relatively useful set of words. We now have about 8 thousand terms left of the original 72 thousand. To create a new document-term matrix with only these terms, we can use normal matrix indexing on the columns (which contain the words):
dtm_filtered = dtm.filter(dtm, terms=subset$term) dim(dtm_filtered)
Which yields a much more managable dtm.
As a bonus, we can use the
dtm.wordcloud function in corpustools (which is a thin wrapper around the
to visualize the top words as a word cloud: