Machine Learning using RTextTools

(C) 2014 Wouter van Atteveldt, license: [CC-BY-SA]

Machine Learning, or automatic text classification, is a set of techniques for training a statistical model on a set of annotated (coded) training texts; that model can then be used to predict the category or class of new texts.

R has a number of different packages for various machine learning algorithms, such as maximum entropy modeling, neural networks, and support vector machines. RTextTools provides an easy way to access a large number of these algorithms.

In principle, like 'classical' statistical models, machine learning uses a number of (independent) variables called features to predict a target (dependent) category or class. In text mining, the independent variables are generally the term frequencies, so the input for the machine learning algorithm is the document-term matrix.
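To make this concrete, the sketch below builds a tiny document-term matrix from three made-up example texts, using the same create_matrix function that is used throughout this tutorial:

library(RTextTools)
# three made-up documents: each row of the resulting matrix is a document,
# each column a term, and each cell the frequency of that term in that document
texts = c("good service", "bad service", "good good product")
dtm = create_matrix(texts)
as.matrix(dtm)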

Training models using RTextTools

So, the first step is to create a document-term matrix. For now, we only want to use the documents for which the sentiment is known. As before, we use the reviews.csv file that can be downloaded from github.

library(RTextTools)
# read the reviews and keep only the documents with a known sentiment
d = read.csv("data/reviews.csv")
ds = d[!is.na(d$SENTIMENT), ]
# create the document-term matrix from the review texts
m = create_matrix(ds$CONTENT, language="dutch", stemWords=F)

The next step is to create the RTextTools container. The container holds both the document-term matrix and the manually coded classes, and specifies which parts of the data to use for training and which for testing.

To make sure that we get a random sample of documents for training and testing, we sample 80% of the set for training and use the remainder for testing. (Note that it is important to sort the indices, as otherwise GLMNET will fail.)

n = nrow(m)
# draw a random 80% sample of the row indices for training; the rest is for testing
train = sort(sample(1:n, n*.8))
test = sort(setdiff(1:n, train))
length(train)
length(test)

Now, we are ready to create the container:

# virgin=F indicates that the test documents have known labels, so performance can be evaluated
c = create_container(m, ds$SENTIMENT, trainSize=train, testSize=test, virgin=F)

Using this container, we can train different models:

SVM <- train_model(c,"SVM")
MAXENT <- train_model(c,"MAXENT")
GLMNET <- train_model(c,"GLMNET")

Testing model performance

Using the same container, we can classify the 'test' dataset:

SVM_CLASSIFY <- classify_model(c, SVM)
MAXENT_CLASSIFY <- classify_model(c, MAXENT)
GLMNET_CLASSIFY <- classify_model(c, GLMNET)

Let's have a look at what these classifications yield:

head(SVM_CLASSIFY)
nrow(SVM_CLASSIFY)

For each document in the test set, the predicted label and probability are given. We can compare these predictions to the correct classes manually:

# cross-tabulate the predicted labels (rows) against the actual labels (columns)
t = table(SVM_CLASSIFY$SVM_LABEL, as.character(ds$SENTIMENT[test]))
t

(Note that the as.character cast is necessary to align the rows and columns.) From this table, we can compute the accuracy:

sum(diag(t)) / sum(t)
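We can also compute precision and recall per class directly from this table. This is a minimal sketch that assumes the table is square, i.e. every class occurs at least once among both the predictions and the actual labels:

# precision: which proportion of the predictions for each class is correct (row-wise)
precision = diag(t) / rowSums(t)
# recall: which proportion of the documents of each class is found (column-wise)
recall = diag(t) / colSums(t)
precision
recall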

Analytics

To make it easier to compute the relevant metrics, RTextTools has a built-in analytics function:

analytics <- create_analytics(c, cbind(SVM_CLASSIFY, GLMNET_CLASSIFY, MAXENT_CLASSIFY))
names(attributes(analytics))

The algorithm_summary gives the performance of the various algorithms, with precision, recall, and f-score given per algorithm:

analytics@algorithm_summary

The label_summary gives the performance per label (class):

analytics@label_summary

Finally, the ensemble_summary gives an indication of how performance changes based on the number of classifiers that agree on the classification:

analytics@ensemble_summary

The last attribute, document_summary, contains the classifications of the various algorithms per document, and also lists how many algorithms agree and whether the consensus and the highest-probability classifier were correct:

head(analytics@document_summary)
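For example, we can use this to check how well the consensus does on the documents where all three algorithms agree. This sketch assumes the document_summary column names CONSENSUS_AGREE, CONSENSUS_CODE, and MANUAL_CODE; check colnames(analytics@document_summary) if your RTextTools version differs:

docs = analytics@document_summary
# keep the documents on which all three algorithms agree
agreed = docs[docs$CONSENSUS_AGREE == 3, ]
# accuracy of the consensus label on this subset
mean(as.character(agreed$CONSENSUS_CODE) == as.character(agreed$MANUAL_CODE))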

Classifying new material

New material (called 'virgin data' in RTextTools) can be coded by placing the old and new material in a single container. We now use all documents with a known sentiment as training material, and specify virgin=T to indicate that we don't have coded classes for the remaining material:

# create a matrix over all documents, coded and uncoded
m_full = create_matrix(d$CONTENT, language="dutch", stemWords=F)
coded = which(!is.na(d$SENTIMENT))
# virgin=T: the documents outside trainSize have no known classes
c_full = create_container(m_full, d$SENTIMENT, trainSize=coded, virgin=T)

We can now train the models and classify the new material as before:

SVM <- train_model(c_full,"SVM")
MAXENT <- train_model(c_full,"MAXENT")
GLMNET <- train_model(c_full,"GLMNET")
SVM_CLASSIFY <- classify_model(c_full, SVM)
MAXENT_CLASSIFY <- classify_model(c_full, MAXENT)
GLMNET_CLASSIFY <- classify_model(c_full, GLMNET)
analytics <- create_analytics(c_full, cbind(SVM_CLASSIFY, GLMNET_CLASSIFY, MAXENT_CLASSIFY))
names(attributes(analytics))

As you can see, the analytics object now only has the label_summary and document_summary:

analytics@label_summary
head(analytics@document_summary)

The label summary now only contains an overview of how many documents were coded using consensus and probability. The document_summary lists the output of all algorithms, and the consensus and probability codes, which can be used as the final classification of the new documents, as sketched below.
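As a final step, we could copy the consensus codes back into the original data frame. This is a hypothetical sketch (the SENTIMENT_PRED column is our own invention) that assumes the rows of document_summary correspond, in order, to the uncoded documents; the stopifnot check guards that assumption:

docs = analytics@document_summary
uncoded = which(is.na(d$SENTIMENT))
# sanity check: one summary row per uncoded document
stopifnot(nrow(docs) == length(uncoded))
# store the consensus code as the predicted sentiment for the new documents
d$SENTIMENT_PRED = NA
d$SENTIMENT_PRED[uncoded] = as.character(docs$CONSENSUS_CODE)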