
Using Lemmatization and POS tagging

(C) 2014 Wouter van Atteveldt, license: [CC-BY-SA]

For Dutch, rule-based stemming does not work very well. Consider:

library(RTextTools)
library(slam)
text = "De verzekeringen zijn betaald, ik betaal de verzekering"
# Stem the words (keeping stopwords) and sum the term frequencies per column
col_sums(create_matrix(text, language="dutch", stemWords=TRUE, removeStopwords=FALSE))

As you can see, verzekering and verzekeringen are both reduced to verzeker, but the verb forms betaald and betaal are kept separate. A proper lemmatizer handles such cases correctly. For example, the Frog lemmatizer accessible at http://ilk.uvt.nl/cgntagger/ produces the following output for that sentence (use the text box on the right):

sent. #  word #  word           tag                            lemma
0        0       De             LID(bep,stan,rest)             de
0        1       verzekeringen  N(soort,mv,basis)              verzekering
0        2       zijn           WW(pv,tgw,mv)                  zijn
0        3       betaald        WW(vd,vrij,zonder)             betalen
0        4       ,              LET()                          ,
0        5       ik             VNW(pers,pron,nomin,vol,1,ev)  ik
0        6       betaal         WW(pv,tgw,ev)                  betalen
0        7       de             LID(bep,stan,rest)             de
0        8       verzekering    N(soort,ev,basis,zijd,stan)    verzekering

The final column contains the lemma, and as you can see, the different forms of each verb and noun are now mapped to the same lemma. The 'tag' column lists the part of speech (POS), indicating what kind of word each token is. This is often very useful, for example to select only nouns, adjectives, and verbs.
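As a minimal sketch of how the lemma and tag columns can be used, the Frog output above can be typed into a data frame and filtered by part of speech. The data frame name frog and the column category are hypothetical names chosen for this illustration, and taking the text before the opening parenthesis as the main category is an assumption about the tag format shown in the table.

```r
# Tokens copied from the Frog output above (word, full tag, lemma)
frog <- data.frame(
  word  = c("De", "verzekeringen", "zijn", "betaald", ",", "ik",
            "betaal", "de", "verzekering"),
  tag   = c("LID(bep,stan,rest)", "N(soort,mv,basis)", "WW(pv,tgw,mv)",
            "WW(vd,vrij,zonder)", "LET()", "VNW(pers,pron,nomin,vol,1,ev)",
            "WW(pv,tgw,ev)", "LID(bep,stan,rest)", "N(soort,ev,basis,zijd,stan)"),
  lemma = c("de", "verzekering", "zijn", "betalen", ",", "ik",
            "betalen", "de", "verzekering"),
  stringsAsFactors = FALSE
)

# Main category = everything before the opening parenthesis
# (hypothetical helper column; an assumption about the tag format)
frog$category <- sub("\\(.*$", "", frog$tag)

# Keep only nouns (N) and verbs (WW), then count how often each lemma occurs
content <- frog[frog$category %in% c("N", "WW"), ]
lemma_counts <- table(content$lemma)
lemma_counts
```

Counting lemmas rather than words puts betaald and betaal in the same cell, and the same holds for verzekeringen and verzekering.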

Lemmatizing text

Performing the actual lemmatization is beyond the scope of this document. Frog can be installed using apt-get install frog frogdata ucto on a modern Debian or Ubuntu system (see also http://ilk.uvt.nl/frog/), and free lemmatizers are available for most common languages.

For this workshop, I've lemmatized the 'achmea.csv' documents using the Frog lemmatizer built into AmCAT and exported the tokens as a data frame. This file can be downloaded from GitHub.

load("tokens.rdata")
head(tokens)

As you can see, this data frame contains the pos, word, and lemma columns from the table above. The pos1 column contains an abbreviated POS tag, which is easier to use.

From tokens to document-term matrix

Although the tokens are in principle the same 'triplet' representation as a sparse matrix (document, term, frequency), a number of steps are needed to move from tokens to a document-term matrix. We define a function dtm.create here, which is also available in the corpustools package.

library(tm)
library(Matrix)

# Build a sparse matrix from (row, column, value) triplets
cast.sparse.matrix <- function(rows, columns, values) {
  d = data.frame(rows=rows, columns=columns, values=values)
  # sum the values of duplicate (row, column) pairs; rows containing NA are dropped
  d = aggregate(values ~ rows + columns, d, FUN='sum')
  unit_index = unique(d$rows)
  char_index = unique(d$columns)
  sm = spMatrix(nrow=length(unit_index), ncol=length(char_index),
                match(d$rows, unit_index), match(d$columns, char_index), d$values)
  rownames(sm) = as.character(unit_index)
  colnames(sm) = as.character(char_index)
  sm
}

# Build a tm DocumentTermMatrix from document ids, terms, and frequencies
dtm.create <- function(ids, terms, freqs) {
  d = data.frame(ids=ids, terms=terms, freqs=freqs)
  # sum the frequencies per (id, term) pair; the formula interface of aggregate()
  # drops rows with NA by default (na.action=na.omit), which removes NA terms
  d = aggregate(freqs ~ ids + terms, d, FUN='sum')
  id_index = unique(d$ids)
  term_index = unique(d$terms)
  sm = spMatrix(nrow=length(id_index), ncol=length(term_index),
                match(d$ids, id_index), match(d$terms, term_index), d$freqs)
  rownames(sm) = as.character(id_index)
  colnames(sm) = as.character(term_index)
  as.DocumentTermMatrix(sm, weighting=weightTf)
}
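To make the 'triplet' idea concrete, here is a small sketch on made-up data. It is not the code used above: instead of aggregate() followed by spMatrix(), it uses sparseMatrix() from the Matrix package, which adds up the values of duplicated (row, column) pairs. The toy data frame and its column names are hypothetical.

```r
library(Matrix)

# Hypothetical toy tokens: one row per (document, lemma, frequency) triplet
toy <- data.frame(
  doc   = c(1, 1, 2, 2, 2),
  lemma = c("verzekering", "betalen", "verzekering", "verzekering", "premie"),
  freq  = c(2, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# The unique documents and lemmas define the rows and columns
docs  <- unique(toy$doc)
terms <- unique(toy$lemma)

# sparseMatrix() sums the values of duplicated (i, j) pairs, so the two
# rows for document 2 and "verzekering" end up in a single cell
sm <- sparseMatrix(i = match(toy$doc, docs),
                   j = match(toy$lemma, terms),
                   x = toy$freq,
                   dims = c(length(docs), length(terms)),
                   dimnames = list(as.character(docs), terms))
sm
```

Each row of the toy data frame corresponds to one cell of the matrix, and cells that never occur in the data (such as document 1 and premie) are simply not stored.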

The following code creates a document-term matrix containing only the nouns:

nouns = tokens[tokens$pos1 == 'N', ]
m = dtm.create(nouns$achmea_id, nouns$lemma, nouns$freq)
class(m)
dim(m)

Since this is a 'normal' DocumentTermMatrix, it can be used for corpus analysis, topic modeling, or machine learning, just like the matrices created using the create_matrix function from RTextTools.
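As an illustration of such further use, the sketch below computes total term frequencies on a small, made-up DocumentTermMatrix. The object toy_dtm and its contents are hypothetical stand-ins for the matrix m, and the example assumes the tm and slam packages are installed.

```r
library(tm)
library(slam)

# Hypothetical toy matrix: 2 documents, 3 terms, given as (i, j, v) triplets
toy <- simple_triplet_matrix(
  i = c(1, 1, 2, 2), j = c(1, 2, 1, 3), v = c(2, 1, 4, 1),
  nrow = 2, ncol = 3,
  dimnames = list(Docs = c("1", "2"),
                  Terms = c("verzekering", "betalen", "premie"))
)
toy_dtm <- as.DocumentTermMatrix(toy, weighting = weightTf)

# Total frequency per term, most frequent first
term_totals <- sort(col_sums(toy_dtm), decreasing = TRUE)
term_totals

# Terms that occur at least twice in total
findFreqTerms(toy_dtm, lowfreq = 2)
```

The same calls should work on m, since col_sums and findFreqTerms only rely on the object being a DocumentTermMatrix.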