
WIP: Semisupervised Naive Bayes using Expectation Maximization #430

Closed
wants to merge 20 commits into from

5 participants

@larsmans
Owner

Here's the EM algorithm for semisupervised Naive Bayes. The implementation checks for convergence based on the coefficients, following the advice of Bishop (2006), so it could be used more generally for linear-classifier self-training; however, I only implemented the necessary machinery (fitting on a 1-of-K matrix) on the discrete NB estimators. We might want to switch to log-likelihood-based convergence checking later.

Narrative documentation follows if there's interest in this pull request. I've adapted the document classification example script into a new, semisupervised example. We might also merge these scripts.
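For readers unfamiliar with the approach, the EM loop with the coefficient-based convergence check can be sketched roughly like this. This is a toy illustration only, not the PR's implementation; `TinyNB`, `fit_soft`, and `em_fit` are invented stand-in names:

```python
import numpy as np

class TinyNB:
    """Toy stand-in for a discrete NB estimator: fits per-class feature
    log-probabilities from a soft 1-of-K membership matrix Y."""

    def fit_soft(self, X, Y):
        # M-step: class-weighted feature counts -> smoothed log-probabilities
        counts = Y.T.dot(X) + 1.0                       # Laplace smoothing
        self.coef_ = np.log(counts / counts.sum(axis=1, keepdims=True))
        prior = Y.sum(axis=0) + 1.0
        self.intercept_ = np.log(prior / prior.sum())
        return self

    def predict_proba(self, X):
        # E-step: class posteriors under a multinomial event model
        joint = X.dot(self.coef_.T) + self.intercept_
        joint -= joint.max(axis=1, keepdims=True)       # numerical stability
        p = np.exp(joint)
        return p / p.sum(axis=1, keepdims=True)

def em_fit(clf, X, Y, n_iter=10, tol=1e-3):
    """EM loop that stops once the L1 change in (coef_, intercept_)
    drops below a per-coefficient tolerance, instead of checking the
    log-likelihood."""
    clf.fit_soft(X, Y)
    old_coef, old_int = clf.coef_.copy(), clf.intercept_.copy()
    for _ in range(n_iter):
        Y = clf.predict_proba(X)                        # E-step (relabel all)
        clf.fit_soft(X, Y)                              # M-step
        d = (np.abs(clf.coef_ - old_coef).sum()
             + np.abs(clf.intercept_ - old_int).sum())
        if d < tol * clf.coef_.size:                    # per-coefficient tol
            break
        old_coef, old_int = clf.coef_.copy(), clf.intercept_.copy()
    return clf

# Unlabeled samples get uniform class memberships in the initial Y.
X = np.array([[4., 3., 1.], [5., 2., 1.], [0., 1., 7.], [0., 1., 6.]])
Y = np.array([[1.0, 0.0],    # labeled: class 0
              [0.5, 0.5],    # unlabeled
              [0.5, 0.5],    # unlabeled
              [0.0, 1.0]])   # labeled: class 1
clf = em_fit(TinyNB(), X, Y)
pred = clf.predict_proba(np.array([[5., 0., 0.], [1., 1., 4.]])).argmax(axis=1)
```

Since the convergence check only touches `coef_` and `intercept_`, the same loop would work for any linear classifier exposing those attributes, which is the generalization mentioned above.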

@ogrisel
Owner

Looks very interesting. For the lazy, here is the outcome of the document classification example on 4 of the 20 newsgroups categories, with only 10% of the training samples labeled (around 200 labeled samples and 1800 unlabeled):

================================================================================
Baseline: fully supervised Naive Bayes
________________________________________________________________________________
Training: 
MultinomialNB(alpha=0.01, fit_prior=True)
train time: 0.009s
test time:  0.003s
f1-score:   0.809
dimensionality: 32101


________________________________________________________________________________
Training: 
BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True)
train time: 0.010s
test time:  0.022s
f1-score:   0.818
dimensionality: 32101


================================================================================
Naive Bayes trained with Expectation Maximization
________________________________________________________________________________
Training: 
EMNB(estimator=MultinomialNB(alpha=0.01, fit_prior=True),
   estimator__alpha=0.01, estimator__fit_prior=True, n_iter=10, tol=0.001,
   verbose=False)
train time: 0.197s
test time:  0.003s
f1-score:   0.883
dimensionality: 32101


________________________________________________________________________________
Training: 
EMNB(estimator=BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True),
   estimator__alpha=0.01, estimator__binarize=0.0,
   estimator__fit_prior=True, n_iter=10, tol=0.001, verbose=False)
train time: 0.416s
test time:  0.021s
f1-score:   0.864
dimensionality: 32101

Semi-supervised learning is practically important in my opinion because of the cost of annotating data with a supervision signal. It can help annotators bootstrap the process by annotating a small proportion of a dataset. Then we can do some sort of active learning: ask the fitted model for the samples with the least confidence (according to predict_proba, or the decision function being close to the threshold) and have the human annotator label those examples first.
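The least-confidence query step described above might look like the following sketch (`least_confident` is an invented helper; `proba` stands in for the output of a fitted model's predict_proba):

```python
import numpy as np

def least_confident(proba, k=3):
    """Rank samples for annotation by prediction confidence: the samples
    whose highest class probability is lowest are queried first."""
    confidence = proba.max(axis=1)        # confidence of the predicted class
    return np.argsort(confidence)[:k]     # indices of k least confident rows

proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],   # near the decision threshold
                  [0.20, 0.80],
                  [0.51, 0.49]])  # nearest the threshold

print(least_confident(proba, k=2))  # -> [3 1]
```

The returned indices would be handed to the human annotator, and the newly labeled samples fed back into the semi-supervised fit.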

Do you have an idea whether the CPU time of semi-supervised over supervised is always ~10x, or is there some asymptotic complexity that makes it not scalable to larger problems (e.g. 100k samples)?

This work is related to the LabelPropagation pull request (which needs a final review, i.e. profiling it and checking whether the eigenproblem can be sped up). I would be curious to see whether the label propagation branch could handle sparse input, so as to compare both methods on the 20 newsgroups dataset.

sklearn/naive_bayes.py
((75 lines not shown))
+                print "Naive Bayes EM, iteration %d," % i,
+
+            clf._fit1ofK(X, Y, sample_weight, class_prior)
+
+            d = (np.abs(old_coef - clf.coef_).sum()
+                 + np.abs(old_intercept - clf.intercept_).sum())
+            if self.verbose:
+                print "diff = %.3g" % d
+            if d < tol:
+                if self.verbose:
+                    print "Naive Bayes EM converged"
+                break
+
+            old_coef = np.copy(clf.coef_)
+            old_intercept = np.copy(clf.intercept_)
+            Y = clf.predict_proba(X)
@mblondel Owner
mblondel added a note

Instead of relabeling all the examples, have you considered relabeling only the unlabeled ones? On the one hand, relabeling the examples for which we know the human-assigned label seems like a waste. On the other hand, it can be more robust if many labels are actually incorrect.
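A sketch of what relabeling only the unlabeled samples would look like (illustrative only; `proba` stands in for clf.predict_proba(X)):

```python
import numpy as np

y = np.array([0, -1, 1, -1])        # -1 marks unlabeled samples
Y = np.array([[1.0, 0.0],           # human label, kept fixed
              [0.5, 0.5],           # unlabeled: uniform membership
              [0.0, 1.0],           # human label, kept fixed
              [0.5, 0.5]])          # unlabeled
proba = np.array([[0.6, 0.4],       # stand-in for clf.predict_proba(X)
                  [0.9, 0.1],
                  [0.3, 0.7],
                  [0.2, 0.8]])

unlabeled = np.where(y == -1)[0]
Y[unlabeled] = proba[unlabeled]     # E-step touches only the unlabeled rows
```

The labeled rows of `Y` stay exactly as the annotator set them, while the unlabeled rows track the model's current posteriors.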

@larsmans Owner
larsmans added a note

Good point. That is in fact what Nigam et al. do.

@mblondel Owner
mblondel added a note

I recently read a paper on label propagation in which the authors advocated relabeling everything. So we may want to add an option to let the user choose?

@larsmans Owner
@mblondel: added this in 1f7c45823bb88318296c0db49b17013d6e8cadc5.

Sorry, hadn't seen your comment.

@larsmans
Owner

@ogrisel: the time required should simply be n_iter times the time it takes to fit a single NB classifier, plus some overhead for the convergence checking.

sklearn/naive_bayes.py
((64 lines not shown))
+        Y = clf._label_1ofK(y)
+
+        unlabeled = np.where(y == -1)[0]
+
+        n_features = X.shape[1]
+        n_classes = Y.shape[1]
+        tol = self.tol * n_features
+
+        old_coef = np.zeros((n_classes, n_features))
+        old_intercept = np.zeros(n_classes)
+
+        for i in xrange(self.n_iter):
+            if self.verbose:
+                print "Naive Bayes EM, iteration %d," % i,
+
+            clf._fit1ofK(X, Y, sample_weight, class_prior)
@mblondel Owner
mblondel added a note

If I understand the code correctly, on the very first call to this line (first iteration of the loop), Y will include -1 as labels. Does that affect the underlying classifier, or does the classifier handle the -1 differently?

@larsmans Owner
larsmans added a note

I've hacked the LabelBinarizer to produce uniform probabilities where it encounters an unlabeled sample. I just changed the EMNB code to start from a supervised model, though, so this no longer happens.

@mblondel
Owner
  • I'm wondering whether the view (slicing?) X[y == -1] has an impact on performance. Before we merge this branch and the label propagation one, I would really like to clear that up. Adopting the convention that unlabeled examples always come at the end of the dataset may help.

  • It seems to me that we may want not to tie this EM class to Naive Bayes: we should be able to apply the same kind of EM algorithm to any classifier, with soft labeling using predict_proba if available and hard labeling with predict otherwise. We do need to establish a way to know what the model parameters are, though (for the convergence check).

  • A plot of accuracy vs. percentage of labeled data would be nicer than text-based output :)

@larsmans
Owner

I guess that if we check convergence based on the classifier output (training set accuracy), then this becomes a very general self-training algorithm.

Turns out the slicing is indeed incredibly expensive. I just cached it and training time dropped by a factor of five!

@ogrisel
Owner

This is array masking (which indeed causes memory allocation) rather than slicing, which is fast because it creates a cheap view without memory allocation.
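The distinction can be checked directly in NumPy for dense arrays (scipy.sparse CSR matrices, as used in this PR, copy on row selection either way, as far as I know):

```python
import numpy as np

# Basic slicing returns a view (no copy); boolean masking / fancy
# indexing allocates a new array and copies the selected rows.
X = np.arange(12.0).reshape(4, 3)
y = np.array([1, -1, -1, 2])

view = X[1:3]        # slice: cheap view sharing X's memory
mask = X[y == -1]    # boolean mask: fresh allocation + copy

print(np.may_share_memory(X, view))  # -> True
print(np.may_share_memory(X, mask))  # -> False
```

This is why caching the masked submatrix once, instead of re-masking on every EM iteration, pays off.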

@larsmans
Owner

@ogrisel: I might not be familiar enough with NumPy/SciPy's masking and slicing yet, but I just picked the option that works with sparse matrices, since text classification is the intended use case. We can add switching on type for smarter labeled/unlabeled selection later; currently, EMNB wants a BaseDiscreteNB anyway, since I don't have a use case for Gaussian NB. Please try the demo on the full 20newsgroups set and decide whether the performance is ok.

@mblondel: as regards extending this to other linear classifiers, I suggest we keep EMNB and maybe later introduce a new, generalized self-training meta-estimator for other classifiers. Naive Bayes EM has plenty of interesting routes for expansion, including optimizing for likelihood and the new SFE algorithm, so any duplication is not necessarily harmful.

I'm willing to write docs and fix bugs, but not to put very much more time into this PR for now.

@fannix

Strange, the following is what I got. EM made the accuracy drop.

================================================================================
Baseline: fully supervised Naive Bayes
________________________________________________________________________________
Training: 
MultinomialNB(alpha=0.01, fit_prior=True)
train time: 0.008s
test time:  0.004s
f1-score:   0.799
dimensionality: 32101


________________________________________________________________________________
Training: 
BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True)
train time: 0.007s
test time:  0.021s
f1-score:   0.799
dimensionality: 32101


================================================================================
Naive Bayes trained with Expectation Maximization
________________________________________________________________________________
Training: 
EMNB(estimator=MultinomialNB(alpha=0.01, fit_prior=True),
   estimator__alpha=0.01, estimator__fit_prior=True, n_iter=10, tol=0.001,
   verbose=False)
train time: 0.057s
test time:  0.004s
f1-score:   0.761
dimensionality: 32101


________________________________________________________________________________
Training: 
EMNB(estimator=BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True),
   estimator__alpha=0.01, estimator__binarize=0.0,
   estimator__fit_prior=True, n_iter=10, tol=0.001, verbose=False)
train time: 0.101s
test time:  0.020s
f1-score:   0.790
dimensionality: 32101
@larsmans
Owner

@fannix: how much labeled and unlabeled data did you use?

@fannix

@larsmans: It was just the demo examples.

@fannix
@fannix
@larsmans
Owner

Yes, this is the intended result: when no label is given, assume a uniform probability. I'll look into the performance next week.

@fannix
@larsmans
Owner

You're right about that.

@larsmans
Owner

@fannix: I've reproduced the bad performance you saw. This seems to be due to commit b5bc737, i.e. only re-labeling the unlabeled samples.

@fannix
@mblondel
Owner
@larsmans
Owner

@mblondel: I'm writing docs, care to review the code?

sklearn/naive_bayes.py
((17 lines not shown))
+        Maximum number of iterations.
+    relabel_all : bool, optional
+        Whether to re-estimate class memberships for labeled samples as well.
+        Disabling this may result in bad performance, but follows Nigam et
+        al. closely.
+    tol : float, optional
+        Tolerance, per coefficient, for the convergence criterion.
+        Convergence is determined based on the coefficients (log
+        probabilities) instead of the model log-likelihood.
+    verbose : boolean, optional
+        Whether to print progress information.
+    """
+
+    def __init__(self, estimator, n_iter=10, relabel_all=True, tol=1e-3,
+                 verbose=False):
+        self.estimator = estimator
@mblondel Owner

Shall we raise an error if estimator is not a BaseDiscreteNB?

@larsmans Owner

Good point. I want to extend it to handle BaseNB first.

@mblondel
Owner

The estimator and the modifications you made to LabelBinarizer look good to me.

Regarding the example, it would be nice to modify the 20 newsgroups dataset loader so as to return ready-to-use features. That way, we wouldn't have to call Vectorizer (the purpose of the example is to illustrate semi-supervised learning, not feature extraction). If you feel like implementing it, a plot with the proportion of labeled data on the x-axis and accuracy on the y-axis would be nice, but this PR can be merged without it IMO.

@ogrisel
Owner

@mblondel I agree that this code gets duplicated in many examples.

I think we should have a load_vectorized_20newsgroups utility in this module that uses joblib memoizer to cache the results of the vectorization in the data_home folder.
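A stdlib-only sketch of the idea (joblib's Memory.cache does this more robustly; the function name, cache layout, and stand-in loader here are invented for illustration):

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()   # stands in for the data_home folder

def disk_memoize(func):
    """Minimal stand-in for joblib's memoizer: pickle the result of an
    expensive call to disk, keyed on the function name and arguments."""
    def wrapper(*args):
        key = hashlib.sha1(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)      # cache hit: no recomputation
        result = func(*args)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper

calls = []

@disk_memoize
def load_vectorized_20newsgroups(subset):
    # Stand-in for the expensive download + vectorization step.
    calls.append(subset)
    return {"subset": subset}

load_vectorized_20newsgroups("train")
load_vectorized_20newsgroups("train")   # second call hits the disk cache
print(len(calls))  # -> 1
```

Because the cache lives on disk, the vectorization would survive across example scripts and interpreter sessions, which is the point of keying it under data_home.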

@larsmans
Owner

Gone from WIP to MRG. I think I'm done for the moment.

@amueller
Owner

Doctest failure

File "scikit-learn/sklearn/preprocessing/__init__.py", line 494, in sklearn.preprocessing.LabelBinarizer
Failed example:
    clf.transform([1, 6])
Expected:
    array([[ 1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.]])
Got:
    array([[ 0.25,  0.25,  0.25,  0.25],
           [ 0.  ,  0.  ,  0.  ,  1.  ]])

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance>.getvalue()))


@amueller
Owner

Please rebase onto master

doc/modules/naive_bayes.rst
@@ -173,3 +173,48 @@ It is advisable to evaluate both models, if time permits.
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.5542>`_
3rd Conf. on Email and Anti-Spam (CEAS).
+
+.. _semisupervised_naive_bayes:
+
+Semisupervised training with EM
+-------------------------------
+
+The class ``SemisupervisedNB`` implements the expectation maximization (EM)
@amueller Owner

:class:`SemisupervisedNB`

doc/modules/naive_bayes.rst
@@ -173,3 +173,48 @@ It is advisable to evaluate both models, if time permits.
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.5542>`_
3rd Conf. on Email and Anti-Spam (CEAS).
+
+.. _semisupervised_naive_bayes:
+
+Semisupervised training with EM
+-------------------------------
+
+The class ``SemisupervisedNB`` implements the expectation maximization (EM)
+algorithm for semisupervised training of Naive Bayes models,
+where a part of the training samples are unlabeled.
+Unlabeled data are indicated by a ``-1`` value in the label vector.
@amueller Owner

Why "-1"? That makes it harder to use with binary classification. I would prefer unlabeled data to have the label "nan".

@fannix
fannix added a note
@larsmans Owner

nan is hard to handle since nan != nan. We've been avoiding it everywhere so far, while we have been using -1 for outliers in both DBSCAN and OneClassSVM.
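The two objections play out like this:

```python
import math

import numpy as np

nan = float("nan")
print(nan == nan)        # -> False: equality tests can't find unlabeled rows
print(math.isnan(nan))   # -> True: every check needs a special-case function

y_int = np.array([0, 1, -1])             # -1 fits in an integer label vector
y_nan = np.array([0, 1, float("nan")])   # nan silently upcasts to float
print(y_int.dtype.kind, y_nan.dtype.kind)  # -> i f
```

So marking unlabeled samples with nan would force float label vectors everywhere and nan-aware comparisons in every code path, while -1 stays representable in plain integer labels.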

@amueller Owner

Agreed.

@amueller amueller commented on the diff
doc/modules/naive_bayes.rst
((9 lines not shown))
+
+The class ``SemisupervisedNB`` implements the expectation maximization (EM)
+algorithm for semisupervised training of Naive Bayes models,
+where a part of the training samples are unlabeled.
+Unlabeled data are indicated by a ``-1`` value in the label vector.
+
+This EM algorithm fits an initial model, then iteratively
+
+ * uses the current model to predict fractional class memberships;
+ * fits a new model on its own predictions
+
+until convergence.
+Convergence is determined by measuring the difference
+between subsequent models' parameter vectors.
+Note that this differs from the typical treatment of
+EM for Naive Bayes in the literature,
@amueller Owner

I think it should explicitly be "Semi-supervised Naive Bayes".

doc/modules/naive_bayes.rst
((13 lines not shown))
+Unlabeled data are indicated by a ``-1`` value in the label vector.
+
+This EM algorithm fits an initial model, then iteratively
+
+ * uses the current model to predict fractional class memberships;
+ * fits a new model on its own predictions
+
+until convergence.
+Convergence is determined by measuring the difference
+between subsequent models' parameter vectors.
+Note that this differs from the typical treatment of
+EM for Naive Bayes in the literature,
+where convergence is usually checked by computing
+the log-likelihood of the model given the training samples.
+
+``SemisupervisedNB`` is a meta-estimator that builds upon
@amueller Owner

This needs to be changed to reflect our discussion just now.

@amueller amueller commented on the diff
doc/modules/naive_bayes.rst
((25 lines not shown))
+where convergence is usually checked by computing
+the log-likelihood of the model given the training samples.
+
+``SemisupervisedNB`` is a meta-estimator that builds upon
+a regular Naive Bayes estimator.
+To use this class, construct it with an ordinary Naive Bayes model as follows::
+
+    >>> from sklearn.naive_bayes import MultinomialNB, SemisupervisedNB
+    >>> clf = SemisupervisedNB(MultinomialNB())
+    >>> clf
+    SemisupervisedNB(estimator=MultinomialNB(alpha=1.0, fit_prior=True),
+              n_iter=10, relabel_all=True, tol=0.001, verbose=False)
+
+Then use ``clf.fit`` as usual.
+
+.. note::
@amueller Owner

Add a reference to the example.

@amueller amueller commented on the diff
examples/semisupervised_document_classification.py
((180 lines not shown))
+
+    if opts.print_report:
+        print "classification report:"
+        print metrics.classification_report(y_test, pred,
+                                            target_names=categories)
+
+    if opts.print_cm:
+        print "confusion matrix:"
+        print metrics.confusion_matrix(y_test, pred)
+
+    print
+    return score, train_time, test_time
+
+print 80 * '='
+print "Baseline: fully supervised Naive Bayes"
+benchmark(MultinomialNB(alpha=.01), supervised=True)
@amueller Owner

From the output it is not clear to me what the difference between multinomial NB and Bernoulli NB is. How are the features used in the two cases? Or is there something in the narrative docs about this use case?

@larsmans Owner

(For the record,) the predict algorithm (posterior computation) is different for the multinomial and Bernoulli event models. This is described in the narrative docs, with references: http://scikit-learn.org/dev/modules/naive_bayes.html

@amueller Owner

Ok. Sorry should have looked before complaining.

@amueller amueller commented on the diff
examples/semisupervised_document_classification.py
@@ -0,0 +1,201 @@
+"""
+===============================================
+Semisupervised classification of text documents
+===============================================
+
+This variation on the document classification theme (see
+document_classification_20newsgroups.py) showcases semisupervised learning:
+classification with training on partially unlabeled data.
+
+The dataset used in this example is the 20 newsgroups dataset which will be
+automatically downloaded and then cached; this set is labeled, but the
+labels from a random part will be removed.
@amueller Owner

I think it should be explicit that the fully supervised versions are trained only on the labeled subset of the data, while the semi-supervised ones can also use the additional unlabeled data.

@amueller Owner

It would be good to have a link to the narrative documentation in the docstring.

@amueller amueller commented on the diff
examples/semisupervised_document_classification.py
@@ -0,0 +1,201 @@
+"""
+===============================================
+Semisupervised classification of text documents
+===============================================
+
+This variation on the document classification theme (see
+document_classification_20newsgroups.py) showcases semisupervised learning:
@amueller Owner

This should be a link.

@amueller amueller commented on the diff
examples/semisupervised_document_classification.py
((54 lines not shown))
+              help="Print ten most discriminative terms per class"
+                   " for every classifier.")
+
+(opts, args) = op.parse_args()
+if len(args) > 0:
+    op.error("this script takes no arguments.")
+    sys.exit(1)
+
+print __doc__
+op.print_help()
+print
+
+
+def split_indices(y, fraction):
+    """Random stratified split of indices into y
+
@amueller Owner

pep8: Whitespace on blank line.

@amueller amueller commented on the diff
doc/modules/naive_bayes.rst
@@ -173,3 +173,48 @@ It is advisable to evaluate both models, if time permits.
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.5542>`_
3rd Conf. on Email and Anti-Spam (CEAS).
+
+.. _semisupervised_naive_bayes:
+
+Semisupervised training with EM
+-------------------------------
+
+The class ``SemisupervisedNB`` implements the expectation maximization (EM)
+algorithm for semisupervised training of Naive Bayes models,
+where a part of the training samples are unlabeled.
+Unlabeled data are indicated by a ``-1`` value in the label vector.
+
+This EM algorithm fits an initial model, then iteratively
+
+ * uses the current model to predict fractional class memberships;
+ * fits a new model on its own predictions
@amueller Owner

I think it should somehow say that it is related to self trained learning and link to wikipedia.

@larsmans
Owner

I still don't fully agree with the idea of "flattening" SemisupervisedNB so that it is no longer a meta-estimator. One problem with this is that the class will have to duplicate the parameters of the underlying model to get a comprehensive repr:

In [2]: SemisupervisedNB("bernoulli", alpha=1, binarize=2.)
Out[2]: 
SemisupervisedNB(event_model='bernoulli', n_iter=10, relabel_all=True,
          tol=1e-05, verbose=False)

alpha and binarize aren't printed. In the meta-estimator case, we got this for free.

So, I suggest keeping the meta-estimator design after all. I know that "Flat is better than nested", but "There should be one--and preferably only one--obvious way to do it" and "that way may not be obvious at first unless you're Dutch." ;)

@amueller
Owner

Thanks :)

@ogrisel
Owner

@larsmans the Dutch argument is cheating :)

I will try to find some time to read the code / examples to make my own French / Canadian opinion on how the meta-estimator-style API feels in practice.

@fannix

I think one thing is missing here: the document length normalization, which is used in the original paper.

@ogrisel
Owner

@fannix the Vectorizer class always takes care of that by default.

@larsmans
Owner

I gave alternative representations of "unlabeled" some more thought, but there seems to be no value more "natural" than -1 that is representable in all relevant types. Notably, None translates to nan as an np.float, but is not representable in np.int, which is one of the obvious candidates for class label types.

@amueller
Owner

@larsmans As I said in Malaga, I agree with you. It wasn't obvious to me why -1 is the best choice but we shouldn't over-complicate it ;)

@larsmans
Owner

@amueller, yes, I just wanted it on record here for @fannix and others :)

@ogrisel
Owner
======================================================================
FAIL: Doctest: sklearn.preprocessing.LabelBinarizer
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2166, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.preprocessing.LabelBinarizer
  File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 458, in LabelBinarizer

----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 495, in sklearn.preprocessing.LabelBinarizer
Failed example:
    clf.fit([1, 2, 6, 4, 2])
Expected:
    LabelBinarizer()
Got:
    LabelBinarizer(unlabeled=-1)
----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 499, in sklearn.preprocessing.LabelBinarizer
Failed example:
    clf.transform([1, 6])
Expected:
    array([[ 1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.]])
Got:
    array([[ 0.25,  0.25,  0.25,  0.25],
           [ 0.  ,  0.  ,  0.  ,  1.  ]])

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x46b2e60>.getvalue()))
@ogrisel ogrisel commented on the diff
sklearn/tests/test_naive_bayes.py
@@ -117,3 +118,15 @@ def test_sample_weight():
                       sample_weight=[1, 1, 4])
    assert_array_equal(clf.predict([1, 0]), [1])
    assert_array_almost_equal(np.exp(clf.intercept_), [1 / 3., 2 / 3.])
+
+
+def test_semisupervised():
+    X = scipy.sparse.csr_matrix([[4, 3, 1],
+                                 [5, 2, 1],
+                                 [0, 1, 7],
+                                 [0, 1, 6]])
+    y = np.array([1, -1, -1, 2])
+    for clf in (BernoulliNB(), MultinomialNB()):
+        semi_clf = SemisupervisedNB(clf, n_iter=20, tol=1e6)
+        semi_clf.fit(X, y)
+        assert_array_equal(semi_clf.predict([[5, 0, 0], [1, 1, 4]]), [1, 2])
@ogrisel Owner
ogrisel added a note

The coverage report shows that this test is a bit too easy, as convergence is reached at the first iteration (old_coef and old_intercept are never updated). I would also update this test to check that a dense array X and its CSR variant yield the same outcome (in terms of predicted probabilities, for instance).

@ogrisel ogrisel commented on the diff
sklearn/naive_bayes.py
((65 lines not shown))
+
+        Returns
+        -------
+        self : object
+            Returns self.
+        """
+
+        clf = self.estimator
+        X = atleast2d_or_csr(X)
+        Y = clf._label_1ofK(y)
+
+        labeled = np.where(y != -1)[0]
+        if self.relabel_all:
+            unlabeled = np.where(y == -1)[0]
+            X_unlabeled = X[unlabeled, :]
+            Y_unlabeled = Y[unlabeled, :]
@ogrisel Owner
ogrisel added a note

Y_unlabeled is not defined if relabel_all is False. This case needs a test.

@ogrisel ogrisel commented on the diff
sklearn/naive_bayes.py
((84 lines not shown))
+
+        clf._fit1ofK(X[labeled, :], Y[labeled, :],
+                     sample_weight[labeled, :] if sample_weight else None,
+                     class_prior)
+        old_coef = clf.coef_.copy()
+        old_intercept = clf.intercept_.copy()
+
+        for i in xrange(self.n_iter):
+            if self.verbose:
+                print "Naive Bayes EM, iteration %d," % i,
+
+            # E
+            if self.relabel_all:
+                Y = clf.predict_proba(X)
+            else:
+                Y_unlabeled[:] = clf.predict_proba(X_unlabeled)
@ogrisel Owner
ogrisel added a note

Y_unlabeled seems to never be used in this loop. I guess the M-step should also test whether self.relabel_all is true or false to know which Y to use for fitting the new model.

@ogrisel ogrisel commented on the diff
sklearn/naive_bayes.py
((68 lines not shown))
+        self : object
+            Returns self.
+        """
+
+        clf = self.estimator
+        X = atleast2d_or_csr(X)
+        Y = clf._label_1ofK(y)
+
+        labeled = np.where(y != -1)[0]
+        if self.relabel_all:
+            unlabeled = np.where(y == -1)[0]
+            X_unlabeled = X[unlabeled, :]
+            Y_unlabeled = Y[unlabeled, :]
+
+        n_features = X.shape[1]
+        tol = self.tol * n_features
@ogrisel Owner
ogrisel added a note

I think the tol should also be multiplied by the mean standard deviation of the feature values (or their maximum absolute value) so as to make the tolerance criterion insensitive to feature rescaling.

@ogrisel
Owner

About the nested-constructor issue: I am still not sold on the current API. Here is another proposal: what about making SemisupervisedNB a mixin class and generating two concrete classes, SemisupervisedBernoulliNB and SemisupervisedMultinomialNB? The current implementation uses lambda expressions to simulate a kind of inheritance for three read-only properties and one method, so using real inheritance through mixins might make more sense here; furthermore, we would get flat constructors for free, plus the ability for advanced users to extend through inheritance rather than composition. WDYT?
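A rough sketch of the mixin shape being proposed (all names hypothetical; `_MultinomialNBStub` stands in for the real MultinomialNB, and the fit body is elided):

```python
class _MultinomialNBStub:
    """Stand-in for sklearn's MultinomialNB base estimator."""
    def __init__(self, alpha=1.0, fit_prior=True):
        self.alpha = alpha
        self.fit_prior = fit_prior

class SemisupervisedMixin:
    """EM-specific behaviour shared by the concrete classes."""
    def fit(self, X, y):
        for _ in range(self.n_iter):
            pass  # E-step / M-step would go here
        return self

class SemisupervisedMultinomialNB(SemisupervisedMixin, _MultinomialNBStub):
    # Flat constructor: NB and EM parameters sit side by side, so
    # repr/get_params can show all of them without meta-estimator nesting.
    def __init__(self, alpha=1.0, fit_prior=True, n_iter=10, tol=1e-3):
        super().__init__(alpha=alpha, fit_prior=fit_prior)
        self.n_iter = n_iter
        self.tol = tol

clf = SemisupervisedMultinomialNB(alpha=0.01, n_iter=5)
print(clf.alpha, clf.n_iter)  # -> 0.01 5
```

Compared with the meta-estimator, the trade-off is one concrete class per NB variant versus a single wrapper whose repr only shows the wrapper's own parameters.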

@fannix

This sounds good. I think it might be better to have a Semisupervised mixin class which can admit unlabeled data.

@amueller
Owner

As the estimator can only be BernoulliNB or MultinomialNB, and the only one likely to be added in the future is GaussianNB, I am +1 for keyword arguments or what @ogrisel proposed above.

@larsmans
Owner

If I ever try this again I'll just rewrite the code, so I'm closing this PR.

@larsmans larsmans closed this
Commits on Dec 20, 2011
  1. @larsmans
  2. @larsmans

    ENH semisupervised text classification demo

    larsmans authored
    Based heavily on ordinary text classification example
  3. @larsmans

    ENH only re-estimate unlabeled samples in EMNB

    larsmans authored
    As per Nigam et al. 2000.
  4. @larsmans
  5. @larsmans
  6. @larsmans

    COSMIT rename EMNB to SemisupervisedNB

    larsmans authored
    Less cryptic than a four-letter acronym
  7. @larsmans
  8. @larsmans
  9. @larsmans
  10. @larsmans
  11. @larsmans
  12. @larsmans
  13. @larsmans
  14. @larsmans
Commits on Dec 21, 2011
  1. @larsmans
  2. @larsmans
  3. @larsmans

    Merge branch 'master' into emnb

    larsmans authored
    Conflicts:
    	sklearn/preprocessing/__init__.py
  4. @larsmans
  5. @larsmans

    DOC link SemisupervisedNB

    larsmans authored
Commits on Dec 22, 2011
  1. @larsmans

    COSMIT pep8

    larsmans authored
1  doc/modules/classes.rst
@@ -694,6 +694,7 @@ Pairwise metrics
    naive_bayes.GaussianNB
    naive_bayes.MultinomialNB
    naive_bayes.BernoulliNB
+   naive_bayes.SemisupervisedNB

.. _neighbors_ref:
50 doc/modules/naive_bayes.rst
@@ -176,3 +176,53 @@ It is advisable to evaluate both models, if time permits.
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.5542>`_
3rd Conf. on Email and Anti-Spam (CEAS).
+
+.. _semisupervised_naive_bayes:
+
+Semisupervised training with EM
+-------------------------------
+
+The class :class:`SemisupervisedNB` implements the expectation maximization
+(EM) algorithm for semisupervised training of Naive Bayes models,
+where a part of the training samples are unlabeled.
+Unlabeled data are indicated by a ``-1`` value in the label vector.
+
+This EM algorithm fits an initial model, then iteratively
+
+ * uses the current model to predict fractional class memberships;
+ * fits a new model on its own predictions
@amueller Owner

I think it should somehow say that it is related to self trained learning and link to wikipedia.

+
+until convergence.
+Convergence is determined by measuring the difference
+between subsequent models' parameter vectors.
+Note that this differs from the typical treatment of
+EM for Naive Bayes in the literature,
@amueller Owner

I think it should explicitly be "Semi-supervised Naive Bayes".

+where convergence is usually checked by computing
+the log-likelihood of the model given the training samples.
+The resulting algorithm is similar to the more general technique of
+self-training (see Zhu 2008).
+
+:class:`SemisupervisedNB` is a meta-estimator that builds upon
+a regular Naive Bayes estimator.
+To use this class, construct it with an ordinary Naive Bayes model as follows::
+
+    >>> from sklearn.naive_bayes import MultinomialNB, SemisupervisedNB
+    >>> clf = SemisupervisedNB(MultinomialNB())
+    >>> clf
+    SemisupervisedNB(estimator=MultinomialNB(alpha=1.0, fit_prior=True),
+              n_iter=10, relabel_all=True, tol=0.001, verbose=False)
+
+Then use ``clf.fit`` as usual.
+
+.. note::
@amueller Owner

Add a reference to the example.

+
+ EM is not currently supported for Gaussian Naive Bayes estimators.
+
+.. topic:: References:
+
+    * K. Nigam, A.K. McCallum, S. Thrun and T. Mitchell (2000).
+      Text classification from labeled and unlabeled documents using EM.
+      Machine Learning 39(2):103–134.
+    * X. Zhu (2008). `"Semi-supervised learning literature survey"
+      <http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf>`_.
+      CS TR 1530, U. Wisconsin-Madison.
201 examples/semisupervised_document_classification.py
@@ -0,0 +1,201 @@
+"""
+===============================================
+Semisupervised classification of text documents
+===============================================
+
+This variation on the document classification theme (see
+document_classification_20newsgroups.py) showcases semisupervised learning:
@amueller Owner

This should be a link.

+classification with training on partially unlabeled data.
+
+The dataset used in this example is the 20 newsgroups dataset which will be
+automatically downloaded and then cached; this set is labeled, but the
+labels from a random part will be removed.
@amueller Owner

I think it should be explicit that the fully supervised versions are trained only on the labeled subset of the data, while the semi-supervised ones can also use the additional unlabeled data.

@amueller Owner

It would be good to have a link to the narrative documentation in the docstring.

+
+"""
+
+# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
+# Olivier Grisel <olivier.grisel@ensta.org>
+# Mathieu Blondel <mathieu@mblondel.org>
+# Lars Buitinck <L.J.Buitinck@uva.nl>
+# License: Simplified BSD
+
+import logging
+import numpy as np
+from operator import itemgetter
+from optparse import OptionParser
+import sys
+from time import time
+
+from sklearn.cross_validation import StratifiedKFold
+from sklearn.datasets import fetch_20newsgroups
+from sklearn.feature_extraction.text import Vectorizer
+from sklearn.naive_bayes import BernoulliNB, SemisupervisedNB, MultinomialNB
+from sklearn import metrics
+
+
+# Display progress logs on stdout
+logging.basicConfig(level=logging.INFO,
+ format='%(asctime)s %(levelname)s %(message)s')
+
+
+# parse commandline arguments
+op = OptionParser()
+op.add_option("--confusion_matrix",
+ action="store_true", dest="print_cm",
+ help="Print the confusion matrix.")
+op.add_option("--labeled",
+ action="store", type="float", dest="labeled_fraction",
+ help="Fraction of labels to retain (roughly).")
+op.add_option("--report",
+ action="store_true", dest="print_report",
+ help="Print a detailed classification report.")
+op.add_option("--top10",
+ action="store_true", dest="print_top10",
+ help="Print ten most discriminative terms per class"
+ " for every classifier.")
+
+(opts, args) = op.parse_args()
+if len(args) > 0:
+ op.error("this script takes no arguments.")
+ sys.exit(1)
+
+print __doc__
+op.print_help()
+print
+
+
+def split_indices(y, fraction):
+ """Random stratified split of indices into y
+
@amueller Owner

pep8: Whitespace on blank line.

+ Returns (unlabeled, labeled)
+ """
+ k = int(round(1 / fraction))
+ folds = list(StratifiedKFold(y, k))
+ return folds[rng.randint(k)]
+
+
+def trim(s):
+ """Trim string to fit on terminal (assuming 80-column display)"""
+ return s if len(s) <= 80 else s[:77] + "..."
+
+
+###############################################################################
+# Load some categories from the training set
+categories = [
+ 'alt.atheism',
+ 'talk.religion.misc',
+ 'comp.graphics',
+ 'sci.space',
+]
+# Uncomment the following to do the analysis on all the categories
+#categories = None
+
+print "Loading 20 newsgroups dataset for categories:"
+print categories if categories else "all"
+
+rng = np.random.RandomState(42)
+
+data_train = fetch_20newsgroups(subset='train', categories=categories,
+ shuffle=True, random_state=rng)
+
+data_test = fetch_20newsgroups(subset='test', categories=categories,
+ shuffle=True, random_state=rng)
+print 'data loaded'
+
+categories = data_train.target_names # for case categories == None
+
+print "%d documents (training set)" % len(data_train.data)
+print "%d documents (testing set)" % len(data_test.data)
+print "%d categories" % len(categories)
+print
+
+# split a training set and a test set
+y_train, y_test = data_train.target, data_test.target
+
+if opts.labeled_fraction is None:
+ fraction = .1
+else:
+ fraction = opts.labeled_fraction
+ if fraction <= 0. or fraction > 1.:
+ print "Invalid fraction %.2f" % fraction
+ sys.exit(1)
+
+print "Extracting features from the training dataset using a sparse vectorizer"
+t0 = time()
+vectorizer = Vectorizer()
+X_train = vectorizer.fit_transform(data_train.data)
+print "done in %fs" % (time() - t0)
+print "n_samples: %d, n_features: %d" % X_train.shape
+print
+
+print "Extracting features from the test dataset using the same vectorizer"
+t0 = time()
+X_test = vectorizer.transform(data_test.data)
+print "done in %fs" % (time() - t0)
+print "n_samples: %d, n_features: %d" % X_test.shape
+print
+
+unlabeled, labeled = split_indices(y_train, fraction)
+print "Removing labels of %d random training documents" % len(unlabeled)
+print
+X_labeled = X_train[labeled]
+y_labeled = y_train[labeled]
+y_train[unlabeled] = -1
+
+vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary.iteritems(),
+ key=itemgetter(1))])
+
+
+###############################################################################
+# Benchmark classifiers
+def benchmark(clf, supervised=False):
+ print 80 * '_'
+ print "Training: "
+ print clf
+ t0 = time()
+ if supervised:
+ clf.fit(X_labeled, y_labeled)
+ else:
+ clf.fit(X_train, y_train)
+ train_time = time() - t0
+ print "train time: %0.3fs" % train_time
+
+ t0 = time()
+ pred = clf.predict(X_test)
+ test_time = time() - t0
+ print "test time: %0.3fs" % test_time
+
+ score = metrics.f1_score(y_test, pred)
+ print "f1-score: %0.3f" % score
+
+ if hasattr(clf, 'coef_'):
+ print "dimensionality: %d" % clf.coef_.shape[1]
+
+ if opts.print_top10:
+ print "top 10 keywords per class:"
+ for i, category in enumerate(categories):
+ top10 = np.argsort(clf.coef_[i, :])[-10:]
+ print trim("%s: %s" % (category, " ".join(vocabulary[top10])))
+ print
+
+ if opts.print_report:
+ print "classification report:"
+ print metrics.classification_report(y_test, pred,
+ target_names=categories)
+
+ if opts.print_cm:
+ print "confusion matrix:"
+ print metrics.confusion_matrix(y_test, pred)
+
+ print
+ return score, train_time, test_time
+
+print 80 * '='
+print "Baseline: fully supervised Naive Bayes"
+benchmark(MultinomialNB(alpha=.01), supervised=True)
@amueller Owner

From the output it is not clear to me what the difference between multinomial nb and binary nb is. How are the features used in these two cases? Or is there something in the narrative docs about this use case?

@larsmans Owner

(For the record,) the predict algorithm (posterior computation) is different for the multinomial and Bernoulli event models. This is described in the narrative docs, with references: http://scikit-learn.org/dev/modules/naive_bayes.html

@amueller Owner

Ok. Sorry should have looked before complaining.

+benchmark(BernoulliNB(alpha=.01), supervised=True)
+
+print 80 * '='
+print "Naive Bayes trained with Expectation Maximization"
+benchmark(SemisupervisedNB(MultinomialNB(alpha=.01)))
+benchmark(SemisupervisedNB(BernoulliNB(alpha=.01)))
155 sklearn/naive_bayes.py
@@ -18,6 +18,7 @@
from abc import ABCMeta, abstractmethod
import numpy as np
+from scipy.linalg import norm
from scipy.sparse import issparse
from .base import BaseEstimator, ClassifierMixin
@@ -227,13 +228,7 @@ def fit(self, X, y, sample_weight=None, class_prior=None):
Returns self.
"""
X = atleast2d_or_csr(X)
-
- labelbin = LabelBinarizer()
- Y = labelbin.fit_transform(y)
- self._classes = labelbin.classes_
- n_classes = len(self._classes)
- if Y.shape[1] == 1:
- Y = np.concatenate((1 - Y, Y), axis=1)
+ Y = self._label_1ofK(y)
if X.shape[0] != Y.shape[0]:
msg = "X and y have incompatible shapes."
@@ -242,6 +237,14 @@ def fit(self, X, y, sample_weight=None, class_prior=None):
masks (use `indices=True` in CV)."
raise ValueError(msg)
+ self._fit1ofK(X, Y, sample_weight, class_prior)
+ return self
+
+ def _fit1ofK(self, X, Y, sample_weight, class_prior):
+ """Guts of the fit method; takes labels in 1-of-K encoding Y"""
+
+ n_classes = Y.shape[1]
+
if sample_weight is not None:
Y *= array2d(sample_weight).T
@@ -262,8 +265,6 @@ def fit(self, X, y, sample_weight=None, class_prior=None):
- np.log(N_c.reshape(-1, 1)
+ self.alpha * X.shape[1]))
- return self
-
@staticmethod
def _count(X, Y):
"""Count feature occurrences.
@@ -277,6 +278,21 @@ def _count(X, Y):
return N_c, N_c_i
+ def _label_1ofK(self, y):
+ """Convert label vector to 1-of-K and set self._classes"""
+
+ y = np.asarray(y)
+ if y.ndim == 1:
+ labelbin = LabelBinarizer()
+ Y = labelbin.fit_transform(y)
+ self._classes = labelbin.classes_
+ if Y.shape[1] == 1:
+ Y = np.concatenate((1 - Y, Y), axis=1)
+ else:
+ Y = np.copy(y)
+
+ return Y
+
intercept_ = property(lambda self: self.class_log_prior_)
coef_ = property(lambda self: self.feature_log_prob_)
@@ -453,3 +469,124 @@ def _joint_log_likelihood(self, X):
jll = safe_sparse_dot(X, self.coef_.T) + X_neg_prob
return jll + self.intercept_
+
+
+class SemisupervisedNB(BaseNB):
+ """Semisupervised Naive Bayes using expectation-maximization (EM)
+
+ This meta-estimator can be used to train a Naive Bayes model in
+ semisupervised mode, i.e. with a mix of labeled and unlabeled samples.
+
+ Parameters
+ ----------
+ estimator : {BernoulliNB, MultinomialNB}
+ Underlying Naive Bayes estimator. `GaussianNB` is not supported at
+ this moment.
+ n_iter : int, optional
+ Maximum number of iterations.
+ relabel_all : bool, optional
+ Whether to re-estimate class memberships for labeled samples as well.
+ Disabling this may result in bad performance, but follows Nigam et al.
+ closely.
+ tol : float, optional
+ Tolerance, per coefficient, for the convergence criterion.
+ Convergence is determined based on the coefficients (log probabilities)
+ instead of the model log likelihood.
+ verbose : boolean, optional
+ Whether to print progress information.
+ """
+
+ def __init__(self, estimator, n_iter=10, relabel_all=True, tol=1e-5,
+ verbose=False):
+ if not isinstance(estimator, BaseDiscreteNB):
+ raise TypeError("%r is not a supported Naive Bayes classifier"
+ % (estimator,))
+ self.estimator = estimator
+ self.n_iter = n_iter
+ self.relabel_all = relabel_all
+ self.tol = tol
+ self.verbose = verbose
+
+ def fit(self, X, y, sample_weight=None, class_prior=None):
+ """Fit Naive Bayes estimator using EM
+
+ This fits the underlying estimator at most n_iter times until its
+ parameter vector converges. After every iteration, the posterior label
+ probabilities (as returned by predict_proba) are used to fit in the
+ next iteration.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape = [n_samples, n_features]
+ Training vectors, where n_samples is the number of samples and
+ n_features is the number of features.
+
+ y : array-like, shape = [n_samples]
+ Target values. Unlabeled samples should have a target value of -1.
+
+ sample_weight : array-like, shape = [n_samples], optional
+ Weights applied to individual samples (1. for unweighted).
+
+ class_prior : array, shape [n_classes]
+ Custom prior probability per class.
+ Overrides the fit_prior parameter.
+
+ Returns
+ -------
+ self : object
+ Returns self.
+ """
+
+ clf = self.estimator
+ X = atleast2d_or_csr(X)
+ Y = clf._label_1ofK(y)
+
+ labeled = np.where(y != -1)[0]
+ if self.relabel_all:
+ unlabeled = np.where(y == -1)[0]
+ X_unlabeled = X[unlabeled, :]
+ Y_unlabeled = Y[unlabeled, :]
@ogrisel Owner

Y_unlabeled is not defined if relabel_all is False. This case needs a test.

+
+ n_features = X.shape[1]
+ tol = self.tol * n_features
@ogrisel Owner

I think the tol should also be multiplied by the mean std of the feature values (or their max absolute value) so as to make the tolerance criterion insensitive to feature re-scaling.

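One way to realize ogrisel's suggestion would be a small helper along these lines (a sketch under the stated assumption; `scaled_tol` is hypothetical and not part of the PR):

```python
import numpy as np

def scaled_tol(X, tol_per_coef):
    # Scale the per-coefficient tolerance by the number of features and by
    # the mean absolute feature value, so that rescaling X rescales the
    # convergence threshold accordingly.
    n_features = X.shape[1]
    scale = np.abs(X).mean()
    return tol_per_coef * n_features * max(scale, 1e-12)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
t1 = scaled_tol(X, 1e-3)        # baseline tolerance
t2 = scaled_tol(10 * X, 1e-3)   # same data rescaled by 10
```

Whether the mean absolute value or the mean standard deviation is the right scale factor for log-probability coefficients is open to discussion.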
+
+ clf._fit1ofK(X[labeled, :], Y[labeled, :],
+ sample_weight[labeled] if sample_weight is not None else None,
+ class_prior)
+ old_coef = clf.coef_.copy()
+ old_intercept = clf.intercept_.copy()
+
+ for i in xrange(self.n_iter):
+ if self.verbose:
+ print "Naive Bayes EM, iteration %d," % i,
+
+ # E
+ if self.relabel_all:
+ Y = clf.predict_proba(X)
+ else:
+ Y_unlabeled[:] = clf.predict_proba(X_unlabeled)
@ogrisel Owner

Y_unlabeled seems to never be used in this loop. I guess the M-step should also test whether self.relabel_all is true or false to know which Y to use for fitting the new model.

+
+ # M
+ clf._fit1ofK(X, Y, sample_weight, class_prior)
+
+ d = (norm(old_coef - clf.coef_, 1)
+ + norm(old_intercept - clf.intercept_, 1))
+ if self.verbose:
+ print "diff = %.3g" % d
+ if d < tol:
+ if self.verbose:
+ print "Naive Bayes EM converged"
+ break
+
+ old_coef[:] = clf.coef_
+ old_intercept[:] = clf.intercept_
+
+ return self
+
+ # we "inherit" the crucial parts of an NB estimator from the underlying
+ # one, so we can inherit the predict* methods from BaseNB with docstrings
+ coef_ = property(lambda self: self.estimator.coef_)
+ intercept_ = property(lambda self: self.estimator.intercept_)
+ _classes = property(lambda self: self.estimator._classes)
+ _joint_log_likelihood = \
+ property(lambda self: self.estimator._joint_log_likelihood)
18 sklearn/preprocessing/__init__.py
@@ -449,6 +449,7 @@ def transform(self, X, y=None, copy=None):
def _is_label_indicator_matrix(y):
return hasattr(y, "shape") and len(y.shape) == 2
+
def _is_multilabel(y):
return isinstance(y[0], tuple) or \
isinstance(y[0], list) or \
@@ -473,6 +474,16 @@ class LabelBinarizer(BaseEstimator, TransformerMixin):
model gave the greatest confidence. LabelBinarizer makes this easy
with the inverse_transform method.
+ This class can handle target vectors with missing classes (for
+ semisupervised learning) via the unlabeled parameter. The output for
+ unlabeled samples is uniform fractional (probabilistic) membership of
+ all classes.
+
+ Parameters
+ ----------
+ unlabeled : optional, default = -1
+ Pseudo-label for unlabeled samples.
+
Attributes
----------
classes_ : array of shape [n_class]
@@ -497,6 +508,9 @@ class LabelBinarizer(BaseEstimator, TransformerMixin):
array([1, 2, 3])
"""
+ def __init__(self, unlabeled=-1):
+ self.unlabeled = unlabeled
+
def _check_fitted(self):
if not hasattr(self, "classes_"):
raise ValueError("LabelBinarizer was not fitted yet.")
@@ -522,7 +536,7 @@ def fit(self, y):
else:
self.classes_ = np.array(sorted(set.union(*map(set, y))))
else:
- self.classes_ = np.unique(y)
+ self.classes_ = np.array(sorted(set(y) - {self.unlabeled}))
return self
def transform(self, y):
@@ -574,11 +588,13 @@ def transform(self, y):
elif len(self.classes_) == 2:
Y[y == self.classes_[1], 0] = 1
+ Y[y == self.unlabeled, 0] = .5
return Y
elif len(self.classes_) >= 2:
for i, k in enumerate(self.classes_):
Y[y == k, i] = 1
+ Y[y == self.unlabeled, :] = 1. / len(self.classes_)
return Y
else:
17 sklearn/preprocessing/tests/test_preprocessing.py
@@ -296,6 +296,23 @@ def test_label_binarizer():
assert_array_equal(expected, got)
assert_array_equal(lb.inverse_transform(got), inp)
+ # two-class case with unlabeled samples
+ inp = np.array([0, 1, -1, 1, 0])
+ expected = np.array([[0, 1, .5, 1, 0]]).T
+ got = lb.fit_transform(inp)
+ assert_array_equal(expected, got)
+
+ # multi-class case with unlabeled samples
+ inp = np.array([2, -1, 1, -1, 2, 0])
+ expected = np.array([[0, 0, 1],
+ [1/3., 1/3., 1/3.],
+ [0, 1, 0],
+ [1/3., 1/3., 1/3.],
+ [0, 0, 1],
+ [1, 0, 0]])
+ got = lb.fit_transform(inp)
+ assert_array_equal(expected, got)
+
def test_label_binarizer_multilabel():
lb = LabelBinarizer()
15 sklearn/tests/test_naive_bayes.py
@@ -5,7 +5,8 @@
from numpy.testing import (assert_almost_equal, assert_array_equal,
assert_array_almost_equal, assert_equal)
-from ..naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
+from ..naive_bayes import (GaussianNB, BernoulliNB, MultinomialNB,
+ SemisupervisedNB)
# Data is just 6 separable points in the plane
X = np.array([[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]])
@@ -117,3 +118,15 @@ def test_sample_weight():
sample_weight=[1, 1, 4])
assert_array_equal(clf.predict([1, 0]), [1])
assert_array_almost_equal(np.exp(clf.intercept_), [1 / 3., 2 / 3.])
+
+
+def test_semisupervised():
+ X = scipy.sparse.csr_matrix([[4, 3, 1],
+ [5, 2, 1],
+ [0, 1, 7],
+ [0, 1, 6]])
+ y = np.array([1, -1, -1, 2])
+ for clf in (BernoulliNB(), MultinomialNB()):
+ semi_clf = SemisupervisedNB(clf, n_iter=20, tol=1e6)
+ semi_clf.fit(X, y)
+ assert_array_equal(semi_clf.predict([[5, 0, 0], [1, 1, 4]]), [1, 2])
@ogrisel Owner

The coverage report shows that this test is a bit too easy, as convergence is reached at the first iteration (old_coef and old_intercept are never updated). I would also update this test to check that a dense array X and its CSR variant yield the same outcome (in terms of predicted probabilities, for instance).
