Classification Tutorial

jasonbaldridge edited this page Apr 26, 2013 · 3 revisions

The tutorial below assumes that you are already familiar with supervised classification using linear models.

The starting point that one will usually have with classification is access to a collection of labeled examples in some native format, and this data needs to be read in and turned into features. A typical process is to have code that transforms the data into an indexed format that can be fed directly into a system for learning the parameters of the model. With Nak, you can define a pipeline that directly transforms the data from raw format into features and then into indexed features for training.
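To make the idea concrete, here is a minimal sketch of such a pipeline in plain Scala. The names (`Example`, `featurize`, `index`) are hypothetical and are not Nak's actual API; the point is only to show the shape of the raw-format → features → indexed-features transformation.

```scala
// Hypothetical sketch of a featurization/indexing pipeline (not Nak's API).
case class Example(label: String, features: String)

// Turn raw text into simple bag-of-words feature strings.
def featurize(raw: String): Seq[String] =
  raw.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

// Index labels and features so the learner sees integer ids, and
// convert each example to (labelId, featureIds).
def index(examples: Seq[Example])
    : (Map[String, Int], Map[String, Int], Seq[(Int, Seq[Int])]) = {
  val labelIndex = examples.map(_.label).distinct.zipWithIndex.toMap
  val featIndex =
    examples.flatMap(e => featurize(e.features)).distinct.zipWithIndex.toMap
  val indexed = examples.map { e =>
    (labelIndex(e.label), featurize(e.features).map(featIndex))
  }
  (labelIndex, featIndex, indexed)
}
```

The indexed output is what a parameter-learning routine would consume; with Nak, the equivalent steps are composed into a single pipeline so you never handle the intermediate representations yourself.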

I'll flesh this out in more detail later, but for now, there are two example classification objects with lots of helpful comments.

Here's a quick explanation of how a trained classifier is used to analyze new test instances; see the 20 news groups example code.

Notice the line that says:

val comparisons = for (ex <- fromLabeledDirs(evalDir).toList) yield
      (ex.label, maxLabelNews(classifier.evalRaw(ex.features)), ex.features)

The part that outputs the label distribution is classifier.evalRaw(ex.features), where ex.features is just the raw text of a document (or instance, however you have defined it). The best label is then pulled out with the maxLabelNews function. There is a bit of indirection there, so some explanation will probably help. NakContext.maxLabel is a function that, given a set of labels and a score for each label, picks out the best label. It is curried, so we can supply this classifier's labels to maxLabel to create the new function maxLabelNews:

val maxLabelNews = maxLabel(classifier.labels) _

This function is ready to give the best label for any output from the classifier, so the expression maxLabelNews(classifier.evalRaw(ex.features)) extracts the best label.
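As a standalone illustration of the currying involved, here is a hypothetical maxLabel with the same shape as NakContext.maxLabel (the label set and scores are made up for the example):

```scala
// Hypothetical curried maxLabel, mirroring the shape of NakContext.maxLabel:
// fix the label set first, then apply the resulting function to scores.
def maxLabel(labels: Seq[String])(scores: Seq[Double]): String =
  labels(scores.indexWhere(_ == scores.max))

// Partially apply the labels to get a reusable best-label function,
// just as the tutorial does with classifier.labels.
val newsLabels = Seq("politics", "sports", "tech")
val maxLabelNews = maxLabel(newsLabels) _

// The label with the highest score wins: here "sports".
println(maxLabelNews(Seq(0.1, 0.7, 0.2)))
```

The trailing underscore is Scala's eta-expansion syntax: it turns the partially applied method into a function value that can be stored and passed around.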
