WIP: Semisupervised Naive Bayes using Expectation Maximization #430

Closed
wants to merge 20 commits into
from
+446 −11
@@ -694,6 +694,7 @@ Pairwise metrics
naive_bayes.GaussianNB
naive_bayes.MultinomialNB
naive_bayes.BernoulliNB
+ naive_bayes.SemisupervisedNB
.. _neighbors_ref:
@@ -176,3 +176,53 @@ It is advisable to evaluate both models, if time permits.
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.5542>`_
3rd Conf. on Email and Anti-Spam (CEAS).
+
+.. _semisupervised_naive_bayes:
+
+Semisupervised training with EM
+-------------------------------
+
+The class :class:`SemisupervisedNB` implements the expectation maximization
+(EM) algorithm for semisupervised training of Naive Bayes models,
+where some of the training samples are unlabeled.
+Unlabeled data are indicated by a ``-1`` value in the label vector.
+
+This EM algorithm fits an initial model, then iteratively
+
+ * uses the current model to predict fractional class memberships;
+ * fits a new model on its own predictions
amueller (scikit-learn member) commented on Dec 20, 2011:

I think it should somehow say that it is related to self trained learning and link to wikipedia.

+
+until convergence.
+Convergence is determined by measuring the difference
+between subsequent models' parameter vectors.
+Note that this differs from the typical treatment of
+EM for Naive Bayes in the literature,
amueller (scikit-learn member) commented on Dec 20, 2011:

I think it should explicitly be "Semi-supervised Naive Bayes".

+where convergence is usually checked by computing
+the log-likelihood of the model given the training samples.
+The resulting algorithm is similar to the more general technique of
+self-training (see Zhu 2008).
+
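+Schematically, the fitting procedure looks roughly as follows.
+This is an illustrative sketch only, not the actual implementation:
+it assumes a dense feature matrix and an underlying estimator whose
+``fit`` method accepts a ``sample_weight`` argument::
+
+    import numpy as np
+    from sklearn.naive_bayes import MultinomialNB
+
+    def em_fit(X, y, n_iter=10, tol=1e-3):
+        # samples marked with -1 in y are unlabeled
+        labeled = y != -1
+        clf = MultinomialNB()
+        clf.fit(X[labeled], y[labeled])        # initial model on labeled data
+        classes = clf.classes_
+        old = np.hstack([clf.feature_log_prob_.ravel(),
+                         clf.class_log_prior_])
+        for _ in range(n_iter):
+            # E-step: fractional class memberships for every sample;
+            # labeled samples keep their hard labels
+            resp = clf.predict_proba(X)
+            resp[labeled] = 0.
+            resp[labeled, np.searchsorted(classes, y[labeled])] = 1.
+            # M-step: refit on all samples, duplicating each sample once
+            # per class and weighting it by its membership probability
+            X_rep = np.repeat(X, len(classes), axis=0)
+            y_rep = np.tile(classes, X.shape[0])
+            clf.fit(X_rep, y_rep, sample_weight=resp.ravel())
+            # stop as soon as the parameter vector barely changes
+            new = np.hstack([clf.feature_log_prob_.ravel(),
+                             clf.class_log_prior_])
+            if np.abs(new - old).max() < tol:
+                break
+            old = new
+        return clf
+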
+:class:`SemisupervisedNB` is a meta-estimator that builds upon
+a regular Naive Bayes estimator.
+To use this class, construct it with an ordinary Naive Bayes model as follows::
+
+    >>> from sklearn.naive_bayes import MultinomialNB, SemisupervisedNB
+    >>> clf = SemisupervisedNB(MultinomialNB())
+    >>> clf
+    SemisupervisedNB(estimator=MultinomialNB(alpha=1.0, fit_prior=True),
+                     n_iter=10, relabel_all=True, tol=0.001, verbose=False)
+
+Then use ``clf.fit`` as usual.
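+
+For example, with a toy label vector in which the last two samples carry
+no label (an illustrative snippet; any nonnegative integer feature matrix
+will do for the multinomial model)::
+
+    import numpy as np
+    X = np.array([[2, 1, 0], [0, 3, 1], [1, 1, 2], [3, 0, 1]])
+    y = np.array([0, 1, -1, -1])   # -1 marks the unlabeled samples
+    clf.fit(X, y)
+    clf.predict(X)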
+
+.. note::
amueller (scikit-learn member) commented on Dec 20, 2011:

Add a reference to the example.

+
+    EM is not currently supported for Gaussian Naive Bayes estimators.
+
+.. topic:: References:
+
+    * K. Nigam, A. K. McCallum, S. Thrun and T. Mitchell (2000).
+      Text classification from labeled and unlabeled documents using EM.
+      Machine Learning 39(2):103–134.
+    * X. Zhu (2008). `"Semi-supervised learning literature survey"
+      <http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf>`_.
+      CS TR 1530, U. Wisconsin-Madison.
@@ -0,0 +1,201 @@
+"""
+===============================================
+Semisupervised classification of text documents
+===============================================
+
+This variation on the document classification theme (see
+document_classification_20newsgroups.py) showcases semisupervised learning:
amueller (scikit-learn member) commented on Dec 20, 2011:

This should be a link.

+classification with training on partially unlabeled data.
+
+The dataset used in this example is the 20 newsgroups dataset, which will be
+automatically downloaded and then cached. The set is fully labeled, but the
+labels of a random part of the training documents are removed.
amueller (scikit-learn member) commented on Dec 20, 2011:

I think it should be made explicit that the fully supervised versions are trained only on the labeled subset of the data, while the semi-supervised ones can also use the additional unlabeled data.

amueller (scikit-learn member) commented on Dec 20, 2011:

It would be good to have a link to the narrative documentation in the docstring.

+
+"""
+
+# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
+# Olivier Grisel <olivier.grisel@ensta.org>
+# Mathieu Blondel <mathieu@mblondel.org>
+# Lars Buitinck <L.J.Buitinck@uva.nl>
+# License: Simplified BSD
+
+import logging
+import numpy as np
+from operator import itemgetter
+from optparse import OptionParser
+import sys
+from time import time
+
+from sklearn.cross_validation import StratifiedKFold
+from sklearn.datasets import fetch_20newsgroups
+from sklearn.feature_extraction.text import Vectorizer
+from sklearn.naive_bayes import BernoulliNB, SemisupervisedNB, MultinomialNB
+from sklearn import metrics
+
+
+# Display progress logs on stdout
+logging.basicConfig(level=logging.INFO,
+                    format='%(asctime)s %(levelname)s %(message)s')
+
+
+# parse commandline arguments
+op = OptionParser()
+op.add_option("--confusion_matrix",
+ action="store_true", dest="print_cm",
+ help="Print the confusion matrix.")
+op.add_option("--labeled",
+ action="store", type="float", dest="labeled_fraction",
+ help="Fraction of labels to retain (roughly).")
+op.add_option("--report",
+ action="store_true", dest="print_report",
+ help="Print a detailed classification report.")
+op.add_option("--top10",
+ action="store_true", dest="print_top10",
+ help="Print ten most discriminative terms per class"
+ " for every classifier.")
+
+(opts, args) = op.parse_args()
+if len(args) > 0:
+    op.error("this script takes no arguments.")
+    sys.exit(1)
+
+print __doc__
+op.print_help()
+print
+
+
+def split_indices(y, fraction):
+ """Random stratified split of indices into y
+
amueller (scikit-learn member) commented on Dec 20, 2011:

pep8: Whitespace on blank line.

+    Returns (unlabeled, labeled)
+    """
+    k = int(round(1 / fraction))
+    folds = list(StratifiedKFold(y, k))
+    return folds[rng.randint(k)]
+
+
+def trim(s):
+ """Trim string to fit on terminal (assuming 80-column display)"""
+ return s if len(s) <= 80 else s[:77] + "..."
+
+
+###############################################################################
+# Load some categories from the training set
+categories = [
+    'alt.atheism',
+    'talk.religion.misc',
+    'comp.graphics',
+    'sci.space',
+]
+# Uncomment the following to do the analysis on all the categories
+#categories = None
+
+print "Loading 20 newsgroups dataset for categories:"
+print categories if categories else "all"
+
+rng = np.random.RandomState(42)
+
+data_train = fetch_20newsgroups(subset='train', categories=categories,
+                                shuffle=True, random_state=rng)
+
+data_test = fetch_20newsgroups(subset='test', categories=categories,
+                               shuffle=True, random_state=rng)
+print 'data loaded'
+
+categories = data_train.target_names # for case categories == None
+
+print "%d documents (training set)" % len(data_train.data)
+print "%d documents (testing set)" % len(data_test.data)
+print "%d categories" % len(categories)
+print
+
+# split a training set and a test set
+y_train, y_test = data_train.target, data_test.target
+
+if opts.labeled_fraction is None:
+    fraction = .1
+else:
+    fraction = opts.labeled_fraction
+    if fraction <= 0. or fraction > 1.:
+        print "Invalid fraction %.2f" % fraction
+        sys.exit(1)
+
+print "Extracting features from the training dataset using a sparse vectorizer"
+t0 = time()
+vectorizer = Vectorizer()
+X_train = vectorizer.fit_transform(data_train.data)
+print "done in %fs" % (time() - t0)
+print "n_samples: %d, n_features: %d" % X_train.shape
+print
+
+print "Extracting features from the test dataset using the same vectorizer"
+t0 = time()
+X_test = vectorizer.transform(data_test.data)
+print "done in %fs" % (time() - t0)
+print "n_samples: %d, n_features: %d" % X_test.shape
+print
+
+unlabeled, labeled = split_indices(y_train, fraction)
+print "Removing labels of %d random training documents" % len(unlabeled)
+print
+X_labeled = X_train[labeled]
+y_labeled = y_train[labeled]
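+# Mark the remaining training documents as unlabeled; SemisupervisedNB
+# recognizes the label -1 as "no label available".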
+y_train[unlabeled] = -1
+
+vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary.iteritems(),
+                                            key=itemgetter(1))])
+
+
+###############################################################################
+# Benchmark classifiers
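+# The supervised baselines are fit on the labeled subset only, while the
+# semisupervised models are fit on the full training set, in which the
+# unlabeled documents carry the label -1.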
+def benchmark(clf, supervised=False):
+    print 80 * '_'
+    print "Training: "
+    print clf
+    t0 = time()
+    if supervised:
+        clf.fit(X_labeled, y_labeled)
+    else:
+        clf.fit(X_train, y_train)
+    train_time = time() - t0
+    print "train time: %0.3fs" % train_time
+
+    t0 = time()
+    pred = clf.predict(X_test)
+    test_time = time() - t0
+    print "test time: %0.3fs" % test_time
+
+    score = metrics.f1_score(y_test, pred)
+    print "f1-score: %0.3f" % score
+
+    if hasattr(clf, 'coef_'):
+        print "dimensionality: %d" % clf.coef_.shape[1]
+
+        if opts.print_top10:
+            print "top 10 keywords per class:"
+            for i, category in enumerate(categories):
+                top10 = np.argsort(clf.coef_[i, :])[-10:]
+                print trim("%s: %s" % (category, " ".join(vocabulary[top10])))
+        print
+
+    if opts.print_report:
+        print "classification report:"
+        print metrics.classification_report(y_test, pred,
+                                            target_names=categories)
+
+    if opts.print_cm:
+        print "confusion matrix:"
+        print metrics.confusion_matrix(y_test, pred)
+
+    print
+    return score, train_time, test_time
+
+print 80 * '='
+print "Baseline: fully supervised Naive Bayes"
+benchmark(MultinomialNB(alpha=.01), supervised=True)
amueller (scikit-learn member) commented on Dec 20, 2011:

From the output it is not clear to me what the difference between multinomial nb and binary nb is. How are the features used in these two cases? Or is there something in the narrative docs about this use case?

larsmans (scikit-learn member) commented on Dec 20, 2011:

(For the record,) the predict algorithm (posterior computation) is different for the multinomial and Bernoulli event models. This is described in the narrative docs, with references: http://scikit-learn.org/dev/modules/naive_bayes.html

amueller (scikit-learn member) commented on Dec 20, 2011:

Ok. Sorry should have looked before complaining.

+benchmark(BernoulliNB(alpha=.01), supervised=True)
+
+print 80 * '='
+print "Naive Bayes trained with Expectation Maximization"
+benchmark(SemisupervisedNB(MultinomialNB(alpha=.01)))
+benchmark(SemisupervisedNB(BernoulliNB(alpha=.01)))