starting on user guide
amueller committed Apr 22, 2015
1 parent 0c41140 commit 1550032
Showing 2 changed files with 132 additions and 0 deletions.
123 changes: 123 additions & 0 deletions doc/user_guide.rst
@@ -0,0 +1,123 @@
Multi-class SVM
===============
A precursor to structured SVMs is the multi-class SVM by Crammer and Singer.
While in practice it is often faster to use a One-vs-Rest approach with an optimized binary SVM,
the multi-class SVM is a good hello-world example for structured prediction and for using PyStruct.
In the case of multi-class SVMs, in contrast to more structured models, the
label set Y is just the set of classes, so inference can be performed by
simply enumerating Y.


Let's say we want to classify the classical iris dataset. There are three classes and four features:
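
A minimal sketch of loading the data, using scikit-learn's copy of the dataset:

.. code-block:: python

    from sklearn.datasets import load_iris

    iris = load_iris()
    X, y = iris.data, iris.target  # X has shape (150, 4), y contains labels 0, 1, 2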


The Crammer-Singer model, implemented in ..., uses the joint feature function:
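
A sketch of that construction (assuming the standard Crammer-Singer formulation): the feature
vector :math:`x` is placed in the block of the joint feature vector that corresponds to the
label :math:`y`, and all other blocks are zero:

.. math::

    \Psi(x, y) = e_y \otimes x = (\underbrace{0, \dots, 0}_{\text{classes} < y}, \; x, \; \underbrace{0, \dots, 0}_{\text{classes} > y})

Here :math:`e_y` is the indicator vector of class :math:`y`.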


and inference is just the argmax over the three responses (one for each class):
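
In pure Python, that might look like this (a sketch; PyStruct's actual implementation is organized differently):

.. code-block:: python

    import numpy as np

    def inference(x, w, n_classes=3):
        # one weight vector per class; the response of a class is the
        # inner product of its weight vector with the features
        responses = np.dot(w.reshape(n_classes, -1), x)
        return np.argmax(responses)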

To perform max-margin learning, we also need the loss-augmented inference. PyStruct has an optimized version,
but a pure Python version would look like this:
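
A sketch of such a pure Python version (not PyStruct's optimized code):

.. code-block:: python

    import numpy as np

    def loss_augmented_inference(x, y, w, n_classes=3):
        # compute the per-class responses as in plain inference
        responses = np.dot(w.reshape(n_classes, -1), x)
        # add the 0-1 loss: every label except the true one gets +1
        responses[np.arange(n_classes) != y] += 1
        return np.argmax(responses)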


Essentially, the response (score / energy) of each wrong label is increased by 1, the loss of making an incorrect prediction.
We could also implement a custom loss function that assigns a different loss to predicting, say, ... as ....

For training, we pick the n-slack SSVM, which works well with few samples and requires little tuning:
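
A sketch of what training might look like (assuming the multi-class model is exposed as
``MultiClassClf`` and the learner as ``NSlackSSVM``):

.. code-block:: python

    from pystruct.models import MultiClassClf
    from pystruct.learners import NSlackSSVM

    model = MultiClassClf(n_features=4, n_classes=3)
    learner = NSlackSSVM(model, C=0.1)

    learner.fit(X, y)          # X, y as loaded from the iris dataset above
    predictions = learner.predict(X)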


Multi-label SVM
===============
A multi-label classification task is one where each sample can be labeled with any number of classes.
In other words, there are n_classes many binary labels, each indicating whether a sample belongs
to a given class or not. This could be treated as n_classes many independent binary classification
problems, as the scikit-learn OneVsRest classifier does.
However, it might be beneficial to exploit correlations between labels to achieve better generalization.

In the scene classification dataset, each sample is a picture of an outdoor scene,
represented using a simple color aggregation. The labels characterize the kind of scene, which can be
"beach", "sunset", "fall foliage", "field", "mountain" or "urban". Each image can belong to multiple classes,
such as "fall foliage" and "field". Clearly some combinations are more likely than others.

We could try to model all possible combinations, which would result in a 2 ** 6
= 64 class multi-class classification problem. This would allow us to explicitly model all correlations between labels,
but it would prevent us from predicting combinations that don't appear in the training set.
Even if a combination did appear in the training set, the number of samples in each class would be very small.
A compromise between modeling all correlations and modeling no correlations is to model only pairwise correlations,
which is the approach implemented in :class:`models.MultiLabelClf`.
It creates a graph over ``n_classes`` binary nodes, together with edges between each pair of classes.
Each binary node represents one class, and therefore gets its own column
in the weight vector, similar to the Crammer-Singer multi-class classification.

In addition, there is a pairwise weight between each pair of labels.
This leads to a feature function of this form:
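
Schematically (a simplified sketch with binary labels :math:`y_i \in \{0, 1\}` and a single
pairwise term per edge; the actual model keeps a weight for each joint state of an edge):

.. math::

    \Psi(x, y) = \bigl(y_1 x, \dots, y_k x, \; \{ y_i y_j \}_{(i, j) \in E}\bigr)

where :math:`k` is the number of labels and :math:`E` is the set of edges between labels.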

If our graph has only 6 nodes, we can actually enumerate all states.
Unfortunately, inference in a fully connected binary graph is in
general NP-hard, so we might need to rely on approximate inference, like loopy belief propagation or AD3.
#FIXME do enumeration! benchmark!!
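
A brute-force enumeration for the six-label case could look like this (a sketch with scalar
potentials for brevity):

.. code-block:: python

    import itertools
    import numpy as np

    def enumerate_inference(unary, pairwise, edges):
        # score all 2 ** n_labels joint states and keep the best one
        n_labels = len(unary)
        best_score, best_y = -np.inf, None
        for y in itertools.product([0, 1], repeat=n_labels):
            score = sum(unary[i] * y[i] for i in range(n_labels))
            score += sum(pairwise[k] * y[i] * y[j]
                         for k, (i, j) in enumerate(edges))
            if score > best_score:
                best_score, best_y = score, y
        return np.array(best_y)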

An alternative to using approximate inference for larger numbers of labels is to not create a fully connected graph,
but to restrict ourselves to pairwise interactions on a tree over the labels. In the above example of outdoor scenes,
some labels might be informative about others, maybe a beach picture is likely to be of a sunset, while
an urban scene might have as many sunset as non-sunset samples. The optimal tree structure for such a problem
can easily be found as the Chow-Liu tree, which is simply the maximum weight spanning tree over the graph, where
edge weights are given by the mutual information between labels on the training set.
You can use the Chow-Liu tree method simply by specifying ``edges="chow_liu"``.
This allows us to use efficient and exact max-product message passing for
inference.

#FIXME sample
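
A minimal sketch of fitting the tree-structured model (assuming the scene data loaded above,
and the ``NSlackSSVM`` learner):

.. code-block:: python

    from pystruct.models import MultiLabelClf
    from pystruct.learners import NSlackSSVM

    model = MultiLabelClf(edges="chow_liu")  # tree structure from mutual information
    learner = NSlackSSVM(model, C=0.1)

    learner.fit(X_train, y_train)
    predictions = learner.predict(X_test)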

The implementation of inference for this model creates a graph with unary
potentials (given by the inner product of features and weights) and pairwise
potentials given by the pairwise weights. This graph is then passed to the
general graph inference, which runs the selected algorithm.


Conditional-Random-Field-like graph models
==========================================
The following models are all pairwise models over nodes, that is, they model a labeling of a graph,
using features at the nodes and relations between neighboring nodes.
The main assumption in these models in PyStruct is that nodes are homogeneous, that is, they all
have the same meaning: each node has the same number of classes, and these classes
mean the same thing. In practice this means that weights are shared across all nodes and edges,
and the model adapts to the data via features.
This is in contrast to the :class:`MultiLabelClf`, which builds a binary graph
where nodes mean different things (each node represents a different class), so they do not share weights.

.. note::

    I call these models Conditional Random Fields (CRFs), but this is a slight abuse of terminology,
    as PyStruct actually implements perceptron and max-margin learning, not maximum likelihood learning.
    So these models might better be called Maximum Margin Random Fields. However, in the computer vision
    community, it seems most pairwise models are called CRFs, independent of the method of training.

Chain CRF
----------
One of the most common use-cases for structured prediction is chain-structured
outputs. These occur naturally in sequence labeling tasks, such as
Part-of-Speech tagging or named entity recognition in natural language
processing, or segmentation and phoneme recognition in speech processing.

.. note::

    While PyStruct is able to work with chain CRFs, it is not explicitly built with them in mind,
    and there are libraries that are much more optimized for this special case, such as seqlearn and CRF++.
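
A sketch of setting up such a model (assuming ``ChainCRF`` and the ``FrankWolfeSSVM`` learner;
here ``X_train`` is assumed to be a list of arrays of shape ``(n_nodes, n_features)`` and
``y_train`` a list of integer label sequences):

.. code-block:: python

    from pystruct.models import ChainCRF
    from pystruct.learners import FrankWolfeSSVM

    model = ChainCRF()
    learner = FrankWolfeSSVM(model, C=0.1, max_iter=10)

    learner.fit(X_train, y_train)   # e.g. sequences of handwritten characters
    predictions = learner.predict(X_test)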

models
-------
how to use

multi-class
multi-label
chain-crf
graph-crf
edge-feature graph crf


how to write your own model
----------------------------


tips on solvers

tips on inference algorithms
9 changes: 9 additions & 0 deletions pystruct/datasets/dataset_loaders.py
@@ -39,6 +39,15 @@ def load_scene():
"""Load the scene multi-label dataset.
This is a benchmark multilabel dataset.
n_classes = 6
n_features = 294
n_samples_test = 1196
n_samples_train = 1211
References
----------
Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown.
Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
"""
module_path = dirname(__file__)
return _safe_unpickle(join(module_path, 'scene.pickle'))
