Classification Tutorial

jasonbaldridge edited this page Apr 26, 2013 · 3 revisions

The tutorial below assumes that you are already familiar with supervised classification using linear models.

The starting point that one will usually have with classification is access to a collection of labeled examples in some native format, and this data needs to be read in and turned into features. A typical process is to have code that transforms the data into an indexed format that can be fed directly into a system for learning the parameters of the model. With Nak, you can define a pipeline that directly transforms the data from raw format into features and then into indexed features for training.
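To make the idea concrete, here is a minimal sketch of such a pipeline in plain Scala. The names (`Example`, `featurize`, `index`) are hypothetical and are not Nak's actual API; the point is only to show the shape of the raw-format → features → indexed-features transformation.

```scala
// Hypothetical sketch of a featurization/indexing pipeline (not Nak's API).
case class Example(label: String, features: String)

// Turn raw text into simple bag-of-words feature strings.
def featurize(raw: String): Seq[String] =
  raw.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

// Index labels and features so the learner sees integer ids, and
// convert each example to (labelId, featureIds).
def index(examples: Seq[Example])
    : (Map[String, Int], Map[String, Int], Seq[(Int, Seq[Int])]) = {
  val labelIndex = examples.map(_.label).distinct.zipWithIndex.toMap
  val featIndex =
    examples.flatMap(e => featurize(e.features)).distinct.zipWithIndex.toMap
  val indexed = examples.map { e =>
    (labelIndex(e.label), featurize(e.features).map(featIndex))
  }
  (labelIndex, featIndex, indexed)
}
```

The indexed output is what a parameter-learning routine would consume; with Nak, the equivalent steps are composed into a single pipeline so you never handle the intermediate representations yourself.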

I'll flesh this out in more detail later, but for now, there are two example classification objects with lots of helpful comments.

Here's a quick explanation of how a trained classifier is used to analyze new test instances; see the 20 news groups example code.

Notice the line that says:

val comparisons = for (ex <- fromLabeledDirs(evalDir).toList) yield
      (ex.label, maxLabelNews(classifier.evalRaw(ex.features)), ex.features)

The part that outputs the label distribution is classifier.evalRaw(ex.features), where ex.features is just the raw text of a document (or instance, however you have defined it). The best label is then pulled out with the maxLabelNews function. There is a bit of indirection there, so some explanation will probably help. NakContext.maxLabel is a function that, given a set of labels and a score for each label, picks out the best label. It is curried, so we can supply this classifier's labels to maxLabel to create the new function maxLabelNews:

val maxLabelNews = maxLabel(classifier.labels) _

This function is ready to give the best label for any output from the classifier, so the expression maxLabelNews(classifier.evalRaw(ex.features)) extracts the best label.
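As a standalone illustration of the currying involved, here is a hypothetical maxLabel with the same shape as NakContext.maxLabel (the label set and scores are made up for the example):

```scala
// Hypothetical curried maxLabel, mirroring the shape of NakContext.maxLabel:
// fix the label set first, then apply the resulting function to scores.
def maxLabel(labels: Seq[String])(scores: Seq[Double]): String =
  labels(scores.indexWhere(_ == scores.max))

// Partially apply the labels to get a reusable best-label function,
// just as the tutorial does with classifier.labels.
val newsLabels = Seq("politics", "sports", "tech")
val maxLabelNews = maxLabel(newsLabels) _

// The label with the highest score wins: here "sports".
println(maxLabelNews(Seq(0.1, 0.7, 0.2)))
```

The trailing underscore is Scala's eta-expansion syntax: it turns the partially applied method into a function value that can be stored and passed around.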
