Expectation Maximization for Supervised Naive Bayes Classifiers #1310

pemistahl opened this Issue Nov 1, 2012 · 35 comments



I know this isn't really a code issue, but hopefully it fits here anyway. I'd like to know whether there is a reasonably simple way of adding the expectation maximization algorithm to the supervised naive Bayes classifiers MultinomialNB and BernoulliNB in sklearn.naive_bayes.

I've found a pull request with an implementation of this in a semi-supervised approach but it seems that it hasn't been merged so far to master. But I need it for a completely supervised setting.

Has anyone done something like that or can anyone tell me how to start implementing EM? I'm a bit lost in where to start.

Thank you!


amueller commented Nov 2, 2012

Hi @pemistahl.
First, I'm not an expert on naive Bayes, so I might have gotten this wrong.
To my understanding there is no point in doing EM in the supervised case. The problem allows for a closed form maximum likelihood (or MAP) solution.
I guess EM would make sense if you use hyper-priors. Is that what you are after?
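The closed-form solution mentioned above is worth making concrete: fitting a supervised multinomial naive Bayes model is just smoothed counting, with no iteration involved. A minimal sketch in plain NumPy (the function name and variable names are my own illustration, not scikit-learn API):

```python
import numpy as np

def multinomial_nb_mle(X, y, alpha=1.0):
    """Closed-form MAP fit for multinomial naive Bayes with
    Laplace/Lidstone smoothing: pure counting, no EM needed."""
    classes = np.unique(y)
    # Class priors: relative frequency of each class label.
    class_log_prior = np.log(
        np.array([(y == c).sum() for c in classes]) / len(y))
    feature_log_prob = []
    for c in classes:
        # Smoothed word counts per class, normalized to probabilities.
        counts = X[y == c].sum(axis=0) + alpha
        feature_log_prob.append(np.log(counts / counts.sum()))
    return class_log_prior, np.array(feature_log_prob)
```

This is exactly why EM adds nothing in the fully supervised case: the maximum is already available in one pass over the data.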


mblondel commented Nov 2, 2012

It could perhaps make sense if you have missing values (features) but I agree with @amueller in the general case.

Hi @amueller and @mblondel, thank you both for your quick replies. Let me explain my issue in a bit more detail:

My task is to implement several algorithms for partially supervised webpage classification for which you learn a classifier from positive (P) and unlabeled (U) examples only, i.e., you don't have negative examples. These algorithms usually work in a 2-step process. The first step is to determine a set of so-called reliable negative (RN) examples by several different methods such as similarity measures. The second step is to use these reliable negative and the positive examples to classify the rest of the unlabeled examples.

For this second step, there exists an algorithm that combines a Naive Bayes classifier with the expectation maximization algorithm. In the first iteration, the classifier is trained using the positive set P and the reliable negative set RN. At the end of this first iteration, each unlabeled document in the set Q = U - RN is assigned a probabilistic class label Pr(P|document) for the document being classified as positive. In the subsequent iterations, this set Q participates in training the classifier and applying it again on Q. At some point, the algorithm will converge and stops.

For further information, please go to this website, download this paper and refer to section 4.2 on page 4. There you have pseudocode and more explanations.

This pseudocode is what I'm trying to implement and for which I need your help. How can I best extend the classes sklearn.naive_bayes.MultinomialNB and sklearn.naive_bayes.BernoulliNB to support accepting my own prior probabilities for a specific document set and my own feature probabilities, instead of having initial word counts only which is the only thing that is accepted by your implementation so far?

I hope my explanations were understandable. Looking forward to your help. Thanks a lot. :-)


amueller commented Nov 3, 2012

Contrary to what you said in the beginning, this is a semi-supervised setting with an additional stage to obtain RNs.

Yes, you are probably right. It seems I was confusing the terminology, sorry. So, do you think the pull request containing a semi-supervised version of the classifiers can get me further? It's a bit difficult for me to get my head around this.

By the way, is there any reason why this semi-supervised version has not been integrated into the master branch so far? As far as I could see, the implementation seemed to be quite mature.


amueller commented Nov 3, 2012

Yes, I'm pretty sure it will help you.
You are right, #430 seems quite mature, but apparently @larsmans didn't give it the last finishing touch. There was still some controversy about the API. Feel free to restart the discussion, or fix the remaining issues.

It is a shame that the code has been lying around so long :-$

amueller closed this Nov 3, 2012


larsmans commented Nov 3, 2012

I'd love to merge in that code some day, but I'm completely swamped. It's indeed the API that prevented it from being merged in. I also have a partial implementation of PU learning algorithms, but that's far from finished; it's somewhere among my branches.

@larsmans: The API doesn't need to be very elegant. Can you tell me whether it works correctly? Then I will try to use it for my purposes.


larsmans commented Nov 3, 2012

The vanilla semisupervised version works quite nicely. As for the PU version, I implemented only a method for finding the reliable negatives, and I had a hard time getting it to work for my data. https://github.com/larsmans/scikit-learn/tree/pu-learning

@larsmans I have already implemented all methods for finding the reliable negatives myself and I think all of them are working correctly. But they are designed as custom management commands for a Django project of mine at the moment. Maybe I can adapt them to fit into scikit-learn and contribute them later. I will try your SemisupervisedNB class now and let you know whether it fits my needs.

@larsmans I have the following code at the moment:

vectorizer = CountVectorizer(min_df=1)

X_train = vectorizer.fit_transform(
    chain(positive_data, reliable_negative_data, unlabeled_data)) # data is provided as generators

y_train = np.concatenate([  # np.append can't join three arrays in one call
    np.array(number_positive_docs * [1]),
    np.array(number_reliable_negative_docs * [0]),
    np.array(number_unlabeled_docs * [-1])])

multinomial_clf = MultinomialNB(alpha=0.1)
semisupervised_clf = SemisupervisedNB(estimator=multinomial_clf)

semisupervised_clf.fit(X=X_train, y=y_train)

How do I get the predictions for the unlabeled data now? Do I have to create a separate matrix for the unlabeled data and give it to semisupervised_clf.predict() as an argument? Or can I simply extract the predicted labels from the last iteration of the EM algorithm from the original matrix X_train? The docstrings in your code are a bit unclear about that. Thank you!


larsmans commented Nov 6, 2012

You just feed the samples to predict. I know that may seem wasteful, but predict is dirt cheap.
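Concretely, the unlabeled rows can be sliced back out of the training matrix and passed to predict. A small sketch, using a plain MultinomialNB as a stand-in for SemisupervisedNB (which exists only on the emnb branch, so everything here is illustrative):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Stand-in data: 4 labeled rows followed by 2 unlabeled rows,
# mirroring the chain(positive, reliable_negative, unlabeled) layout.
X_train = np.array([[3, 0], [2, 1], [0, 3], [1, 2], [2, 0], [0, 2]])
y_train = np.array([1, 1, 0, 0, -1, -1])

labeled = y_train != -1
clf = MultinomialNB(alpha=0.1).fit(X_train[labeled], y_train[labeled])

# The unlabeled rows are just the tail of X_train; predicting on them
# is cheap even though the semi-supervised fit already saw them.
X_unlabeled = X_train[~labeled]
predictions = clf.predict(X_unlabeled)
```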


ogrisel commented Nov 6, 2012

@larsmans: The API doesn't need to be very elegant. Can you tell me whether it works correctly? Then I will try to use it for my purposes.

The API needs to be elegant before merging to master, because we will have to maintain it over the years. If we change the API later, we will get questions on the mailing list and false bug reports (with missing version information) from users who do not read the change log. Hence we need to get the API as right as possible from the start.

Otherwise we are swamped by maintenance issues / question answering and it's no longer fun to contribute to the project as we cannot find the time to do any more coding.


mblondel commented Nov 6, 2012

@ogrisel The way I understood it, he was talking about using it for his own purposes, not about merging it in scikit-learn.


mblondel commented Nov 6, 2012

And regarding the API, if @larsmans 's code is updated to adhere to the same API as LabelPropagation, it could be merged, IMO.


larsmans commented Nov 6, 2012

The problem wasn't compat with LabelPropagation -- it's the fact that SemisupervisedNB is a meta-estimator instead of the functionality being merged into BaseDiscreteNB. If we can make a decision about this, I'd be happy to pick up that code and merge it into master over the christmas holiday.


mblondel commented Nov 6, 2012

Oh, I remember this issue now. Yes, out of respect for all the work you put into this, we should really make a decision so that you can merge your code.


larsmans commented Nov 6, 2012

You shouldn't merge anything out of respect for the effort -- I've put a lot of effort into things that didn't work :)


amueller commented Nov 6, 2012

we shouldn't merge for that reason but decide what to do ;)
see my comment there.

@ogrisel: I was saying that the API doesn't need to be elegant for me personally in order to use it in my project. But, of course, I understand that this is necessary to merge the code into the master branch.

@larsmans: Do you know offhand which parts of this file have to be put into the respective file in the master branch so that the class SemisupervisedNB can be used without any errors? I added the entire class SemisupervisedNB and the private methods BaseDiscreteNB._fit1ofK() and BaseDiscreteNB._label_1ofK(), but now I get the following error message:

File "/[...]/sklearn/naive_bayes.py", line 560, in fit
old_coef = clf.coef_.copy()
File "/[...]/sklearn/naive_bayes.py", line 319, in _get_coef
return self.feature_log_prob_[1] if len(self.classes_) == 2 \
AttributeError: 'MultinomialNB' object has no attribute 'classes_'

It is quite important for me to get this code running, so I would appreciate any help from you very much.


larsmans commented Nov 6, 2012

It would take quite some hackery to fully merge this code with current master. You might want to just use the file from that branch, I don't think the actual Naive Bayes algorithms changed a lot.

@larsmans: First of all thank you for your quick answers. Very much appreciated. :-)

I've already tried to replace the file naive_bayes.py in the master branch with the file from your emnb branch. When I call SemisupervisedNB.fit() I get the following error:

File "/[...]/sklearn/naive_bayes.py", line 573, in fit
+ norm(old_intercept - clf.intercept_, 1))
File "/[...]/scipy/linalg/misc.py", line 12, in norm
a = np.asarray_chkfinite(a)
File "/[...]/numpy/lib/function_base.py", line 590, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

I use the latest stable versions of SciPy and NumPy. Any idea what's wrong here?


larsmans commented Nov 6, 2012

It looks like there's an inf or nan in the coefficient vector (the log-probabilities of the features). That shouldn't happen. Can you find out where the infinity occurs and add some debugging code to see what values it takes on prior to going to infinity? It might also be worthwhile to find out which feature in the input that is...

(There also seems to be a bug in line 573: the + should be a -, I think. If you're in a hurry, you could turn off the convergence checking and just run EM for a fixed number of iterations.)

By the way, I also get the following warnings:

/[...]/sklearn/naive_bayes.py:258: RuntimeWarning: divide by zero encountered in log
self.class_log_prior_ = np.log(y_freq) - np.log(y_freq.sum())
/[...]/sklearn/naive_bayes.py:573: RuntimeWarning: invalid value encountered in subtract
+ norm(old_intercept - clf.intercept_, 1))

The same warnings and errors appear when I replace the + in line 573 by a -.

How exactly do I turn off convergence checking? If I set the attribute tol to None I get another error:

File "/[...]/sklearn/naive_bayes.py", line 551, in fit
tol = self.tol * n_features
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

larsmans commented Nov 6, 2012

You can turn off convergence checking by just commenting out the whole norm-based logic around line 573. But I think something else is amiss here. What does y look like? In particular, what is np.unique(y) at the beginning of the EM loop and just before it crashes?

y looks like this:

y = np.concatenate([
    np.array(number_positive_docs * [1]),
    np.array(number_reliable_negative_docs * [0]),
    np.array(number_unlabeled_docs * [-1])])

np.unique(y) gives array([-1, 0, 1]). As far as I can see from the code, y doesn't change during the iterations, i.e., in the for loop in line 559. Or do you mean Y instead?


larsmans commented Nov 6, 2012

Sorry, that's Y. Or more interestingly, y_freq. (Sorry, I'm trying to recall, in between video lectures, what the code I wrote a year ago does. :)

When I disable convergence checking and run the algorithm for 10 iterations, below is the output of Y and y_freq. If I enable convergence checking again, the program fails right after the line Naive Bayes EM, iteration 0, y_freq: [ 0. 8378.9333942 17822.0666058] has been printed.

y_freq: [    0  7290 17613]
/[...]/sklearn/naive_bayes.py:259: RuntimeWarning: divide by zero encountered in log
self.class_log_prior_ = np.log(y_freq) - np.log(y_freq.sum())
Y: array([[0, 0, 1],
   [0, 0, 1],
   [0, 0, 1],
   [1, 0, 0],
   [1, 0, 0],
   [1, 0, 0]])
Naive Bayes EM, iteration 0, y_freq: [     0.          8378.9333942  17822.0666058]

Y: array([[  0.00000000e+00,   3.46519606e-03,   9.96534804e-01],
   [  0.00000000e+00,   2.17572298e-10,   1.00000000e+00],
   [  0.00000000e+00,   8.08888129e-11,   1.00000000e+00],
   [  0.00000000e+00,   1.00000000e+00,   1.40789732e-98],
   [  0.00000000e+00,   1.00000000e+00,   1.95952026e-99],
   [  0.00000000e+00,   1.00000000e+00,   1.09213922e-85]])
Naive Bayes EM, iteration 1, y_freq: [     0.           8415.53031742  17785.46968258]

Y: array([[  0.00000000e+000,   3.78921026e-003,   9.96210790e-001],
   [  0.00000000e+000,   2.13413175e-010,   1.00000000e+000],
   [  0.00000000e+000,   7.66849136e-011,   1.00000000e+000],
   [  0.00000000e+000,   1.00000000e+000,   3.65956451e-100],
   [  0.00000000e+000,   1.00000000e+000,   1.49041595e-100],
   [  0.00000000e+000,   1.00000000e+000,   6.33163722e-111]])
Naive Bayes EM, iteration 2, y_freq: [     0.           8423.40377029  17777.59622971]

Y: array([[  0.00000000e+000,   3.84123829e-003,   9.96158762e-001],
   [  0.00000000e+000,   2.21290143e-010,   1.00000000e+000],
   [  0.00000000e+000,   7.99408677e-011,   1.00000000e+000],
   [  0.00000000e+000,   1.00000000e+000,   7.61149802e-097],
   [  0.00000000e+000,   1.00000000e+000,   2.81529459e-097],
   [  0.00000000e+000,   1.00000000e+000,   6.98453916e-108]])
Naive Bayes EM, iteration 3, y_freq: [     0.           8426.82680896  17774.17319104]

Y: array([[  0.00000000e+000,   3.84835828e-003,   9.96151642e-001],
   [  0.00000000e+000,   2.22097326e-010,   1.00000000e+000],
   [  0.00000000e+000,   7.98978877e-011,   1.00000000e+000],
   [  0.00000000e+000,   1.00000000e+000,   1.58293495e-098],
   [  0.00000000e+000,   1.00000000e+000,   5.80088961e-099],
   [  0.00000000e+000,   1.00000000e+000,   1.56020781e-109]])
Naive Bayes EM, iteration 4, y_freq: [     0.           8428.20883001  17772.79116999]

Y: array([[  0.00000000e+000,   3.85059503e-003,   9.96149405e-001],
   [  0.00000000e+000,   2.22228897e-010,   1.00000000e+000],
   [  0.00000000e+000,   7.88360548e-011,   1.00000000e+000],
   [  0.00000000e+000,   1.00000000e+000,   5.93829598e-099],
   [  0.00000000e+000,   1.00000000e+000,   2.18034206e-099],
   [  0.00000000e+000,   1.00000000e+000,   5.60937508e-110]])
Naive Bayes EM, iteration 5, y_freq: [     0.           8429.26678124  17771.73321876]

Y: array([[  0.00000000e+000,   3.85187799e-003,   9.96148122e-001],
   [  0.00000000e+000,   2.22367683e-010,   1.00000000e+000],
   [  0.00000000e+000,   7.73897691e-011,   1.00000000e+000],
   [  0.00000000e+000,   1.00000000e+000,   5.10176608e-099],
   [  0.00000000e+000,   1.00000000e+000,   1.88370075e-099],
   [  0.00000000e+000,   1.00000000e+000,   4.67052518e-110]])
Naive Bayes EM, iteration 6, y_freq: [     0.           8429.59775586  17771.40224414]

Y: array([[  0.00000000e+000,   3.85248690e-003,   9.96147513e-001],
   [  0.00000000e+000,   2.22386839e-010,   1.00000000e+000],
   [  0.00000000e+000,   7.72765263e-011,   1.00000000e+000],
   [  0.00000000e+000,   1.00000000e+000,   4.66505361e-099],
   [  0.00000000e+000,   1.00000000e+000,   1.72411031e-099],
   [  0.00000000e+000,   1.00000000e+000,   4.20129643e-110]])
Naive Bayes EM, iteration 7, y_freq: [     0.           8429.29960942  17771.70039058]

Y: array([[  0.00000000e+000,   3.85310960e-003,   9.96146890e-001],
   [  0.00000000e+000,   2.22492909e-010,   1.00000000e+000],
   [  0.00000000e+000,   7.73191700e-011,   1.00000000e+000],
   [  0.00000000e+000,   1.00000000e+000,   4.41773189e-099],
   [  0.00000000e+000,   1.00000000e+000,   1.63114819e-099],
   [  0.00000000e+000,   1.00000000e+000,   3.90354517e-110]])
Naive Bayes EM, iteration 8, y_freq: [     0.           8429.52790536  17771.47209464]

Y: array([[  0.00000000e+000,   3.85394219e-003,   9.96146058e-001],
   [  0.00000000e+000,   2.22721606e-010,   1.00000000e+000],
   [  0.00000000e+000,   7.74184984e-011,   1.00000000e+000],
   [  0.00000000e+000,   1.00000000e+000,   4.22522102e-099],
   [  0.00000000e+000,   1.00000000e+000,   1.55620262e-099],
   [  0.00000000e+000,   1.00000000e+000,   3.64041923e-110]])
Naive Bayes EM, iteration 9, y_freq: [     0.           8429.66724278  17771.33275722]

larsmans commented Nov 6, 2012

Ah, now I know what's wrong. Just plugging in that file isn't enough -- you also need the accompanying changes to LabelBinarizer so that it knows how to handle the -1 = unlabeled rule.
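The effect of that change is visible in the later output: rows for -1 (unlabeled) samples start out as a uniform distribution instead of an all-zero row, so the class frequencies never hit log(0). A rough sketch of the convention (my own illustration, not larsmans' actual LabelBinarizer code):

```python
import numpy as np

def binarize_with_unlabeled(y, classes):
    """Hypothetical sketch: one-hot encode labels, mapping the
    -1 = 'unlabeled' marker to a uniform row so those samples
    contribute equally to every class in the first M-step."""
    Y = np.zeros((len(y), len(classes)))
    for i, label in enumerate(y):
        if label == -1:
            Y[i, :] = 1.0 / len(classes)  # uniform soft label
        else:
            Y[i, list(classes).index(label)] = 1.0
    return Y
```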

What do you mean exactly? Do I have to replace the file sklearn/preprocessing/__init__.py (where the class LabelBinarizer is defined) in the master branch by the respective file in your emnb branch? Or do I have to additionally change anything in the code? Please be a bit more specific. Thank you. :)


I've just seen that in the current stable version 0.12 the folder structure has changed, compared to your emnb branch. In the master branch, we have the file sklearn/preprocessing.py instead. Do I just have to replace the class LabelBinarizer and put it into the respective location?


larsmans commented Nov 6, 2012

You'll need to copy-paste the definition of LabelBinarizer -- too much has changed in the rest of that file, as you already noticed. No guarantees that that will work out of the box, though.

SemisupervisedNB hasn't been maintained, I haven't looked at it for a year and I can't really give proper support for it ATM. You could try just installing scikit-learn from that old branch if you don't need any of the changes made since then...

I've replaced the class LabelBinarizer and run my program again with convergence checking enabled. Now the program finishes without any errors and converges after 8 iterations. Looks good so far. :) Can you please take a look at the following output and tell me whether this is what you expected?

y_freq: [  7290.  17613.]

Y: array([[ 0. ,  1. ],
   [ 0. ,  1. ],
   [ 0. ,  1. ],
   [ 0.5,  0.5],
   [ 0.5,  0.5],
   [ 0.5,  0.5]])
Naive Bayes EM, iteration 0, y_freq: [  8378.9333942  17822.0666058]
diff = 6.97

Y: array([[  3.46519606e-03,   9.96534804e-01],
   [  2.17572298e-10,   1.00000000e+00],
   [  8.08888129e-11,   1.00000000e+00],
   [  1.00000000e+00,   1.40789732e-98],
   [  1.00000000e+00,   1.95952026e-99],
   [  1.00000000e+00,   1.09213922e-85]])
Naive Bayes EM, iteration 1, y_freq: [  8415.53031742  17785.46968258]
diff = 5.95

Y: array([[  3.78921026e-003,   9.96210790e-001],
   [  2.13413175e-010,   1.00000000e+000],
   [  7.66849136e-011,   1.00000000e+000],
   [  1.00000000e+000,   3.65956451e-100],
   [  1.00000000e+000,   1.49041595e-100],
   [  1.00000000e+000,   6.33163722e-111]])
Naive Bayes EM, iteration 2, y_freq: [  8423.40377029  17777.59622971]
diff = 4.79

Y: array([[  3.84123829e-003,   9.96158762e-001],
   [  2.21290143e-010,   1.00000000e+000],
   [  7.99408677e-011,   1.00000000e+000],
   [  1.00000000e+000,   7.61149802e-097],
   [  1.00000000e+000,   2.81529459e-097],
   [  1.00000000e+000,   6.98453916e-108]])
Naive Bayes EM, iteration 3, y_freq: [  8426.82680896  17774.17319104]
diff = 1.07

Y: array([[  3.84835828e-003,   9.96151642e-001],
   [  2.22097326e-010,   1.00000000e+000],
   [  7.98978877e-011,   1.00000000e+000],
   [  1.00000000e+000,   1.58293495e-098],
   [  1.00000000e+000,   5.80088961e-099],
   [  1.00000000e+000,   1.56020781e-109]])
Naive Bayes EM, iteration 4, y_freq: [  8428.20883001  17772.79116999]
diff = 0.876

Y: array([[  3.85059503e-003,   9.96149405e-001],
   [  2.22228897e-010,   1.00000000e+000],
   [  7.88360548e-011,   1.00000000e+000],
   [  1.00000000e+000,   5.93829598e-099],
   [  1.00000000e+000,   2.18034206e-099],
   [  1.00000000e+000,   5.60937508e-110]])
Naive Bayes EM, iteration 5, y_freq: [  8429.26678124  17771.73321876]
diff = 1.29

Y: array([[  3.85187799e-003,   9.96148122e-001],
   [  2.22367683e-010,   1.00000000e+000],
   [  7.73897691e-011,   1.00000000e+000],
   [  1.00000000e+000,   5.10176608e-099],
   [  1.00000000e+000,   1.88370075e-099],
   [  1.00000000e+000,   4.67052518e-110]])
Naive Bayes EM, iteration 6, y_freq: [  8429.59775586  17771.40224414]
diff = 1.57

Y: array([[  3.85248690e-003,   9.96147513e-001],
   [  2.22386839e-010,   1.00000000e+000],
   [  7.72765263e-011,   1.00000000e+000],
   [  1.00000000e+000,   4.66505361e-099],
   [  1.00000000e+000,   1.72411031e-099],
   [  1.00000000e+000,   4.20129643e-110]])
Naive Bayes EM, iteration 7, y_freq: [  8429.29960942  17771.70039058]
diff = 1.41

Y: array([[  3.85310960e-003,   9.96146890e-001],
   [  2.22492909e-010,   1.00000000e+000],
   [  7.73191700e-011,   1.00000000e+000],
   [  1.00000000e+000,   4.41773189e-099],
   [  1.00000000e+000,   1.63114819e-099],
   [  1.00000000e+000,   3.90354517e-110]])
Naive Bayes EM, iteration 8, y_freq: [  8429.52790536  17771.47209464]
diff = 0.571
Naive Bayes EM converged

larsmans commented Nov 6, 2012

Looks absolutely fine to me! Of course, the proof of the pudding is in the eating, so be sure to evaluate this :)

(The easy way of doing an evaluation is to take a labeled set and randomly throw away part of the labels -- there's an example script in the PR that does that.)
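The evaluation recipe described above (take a labeled set, hide a fraction of the labels, score on the hidden part) can be sketched like this. The data and the plain MultinomialNB stand-in are my own toy setup, not the script from the PR:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)

# Fully labeled toy corpus: two well-separated multinomial classes.
X = np.vstack([rng.multinomial(20, [0.8, 0.2], size=50),
               rng.multinomial(20, [0.2, 0.8], size=50)])
y_true = np.array([1] * 50 + [0] * 50)

# Throw away ~70% of the labels; -1 marks "unlabeled".
y_masked = y_true.copy()
hidden = rng.rand(len(y_true)) < 0.7
y_masked[hidden] = -1

# Train on the visible labels only (a semi-supervised classifier
# would additionally exploit X[hidden]) and score on the hidden part.
clf = MultinomialNB(alpha=0.1).fit(X[~hidden], y_masked[~hidden])
accuracy = (clf.predict(X[hidden]) == y_true[hidden]).mean()
```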

Sure, I will evaluate this and let you know. That information might help you merge the code into master some day. I think I'm now in a position where I can finally continue; I wouldn't have accomplished that without your help. Thank you very much! :) Scikit-learn has a great community.


larsmans commented Nov 6, 2012

Great! I'm really interested in what comes out of this.

Zintinio referenced this issue in Zintinio/Naive-Bayes Oct 12, 2016


Implement EM Semi-Supervised learning #2
