# Expectation Maximization for Supervised Naive Bayes Classifiers #1310

opened this Issue Nov 1, 2012 · 35 comments


### pemistahl commented Nov 1, 2012

I know this isn't really a code issue, but hopefully it fits here anyway. I'd like to know whether there is a more or less simple way of adding the expectation maximization algorithm to the supervised Naive Bayes classifiers `MultinomialNB` and `BernoulliNB` in `sklearn.naive_bayes`. I've found a pull request with an implementation of this in a semi-supervised approach, but it seems it hasn't been merged to master so far. However, I need it for a completely supervised setting. Has anyone done something like that, or can anyone tell me how to start implementing EM? I'm a bit lost as to where to start. Thank you!

### amueller commented Nov 2, 2012

Hi @pemistahl. First, I'm not an expert on naive Bayes, so I might have gotten this wrong. To my understanding, there is no point in doing EM in the supervised case: the problem allows for a closed-form maximum likelihood (or MAP) solution. I guess EM would make sense if you use hyper-priors. Is that what you are after?

### mblondel commented Nov 2, 2012

 It could perhaps make sense if you have missing values (features) but I agree with @amueller in the general case.

### pemistahl commented Nov 3, 2012

Hi @amueller and @mblondel, thank you both for your quick replies. Let me explain my issue in a bit more detail: my task is to implement several algorithms for partially supervised webpage classification, where you learn a classifier from positive (P) and unlabeled (U) examples only, i.e., you don't have negative examples.

These algorithms usually work in a two-step process. The first step is to determine a set of so-called reliable negative (RN) examples by several different methods, such as similarity measures. The second step is to use these reliable negative and the positive examples to classify the rest of the unlabeled examples. For this second step, there exists an algorithm that combines a Naive Bayes classifier with the expectation maximization algorithm. In the first iteration, the classifier is trained using the positive set P and the reliable negative set RN. At the end of this first iteration, each unlabeled document in the set Q = U - RN is assigned a probabilistic class label `Pr(P|document)` for the document being classified as positive. In the subsequent iterations, this set Q participates in training the classifier, which is then applied to Q again. At some point, the algorithm converges and stops. For further information, please go to this website, download this paper and refer to section 4.2 on page 4. There you have pseudocode and more explanations.

This pseudocode is what I'm trying to implement and for which I need your help. How can I best extend the classes `sklearn.naive_bayes.MultinomialNB` and `sklearn.naive_bayes.BernoulliNB` to accept my own prior probabilities for a specific document set and my own feature probabilities, instead of only the initial word counts, which is all your implementation accepts so far? I hope my explanations were understandable. Looking forward to your help. Thanks a lot. :-)
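[For readers following along: the EM step described above can be sketched roughly like this. `em_nb_pu` is a hypothetical helper, not the paper's or the PR's implementation; it uses plain `MultinomialNB` with fractional `sample_weight`s to represent the soft labels.]

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_nb_pu(X_pos, X_rn, X_q, n_iter=10, alpha=0.1):
    """Sketch of NB-EM for PU learning: train on P and RN, then
    repeatedly re-estimate soft labels for Q and retrain."""
    X_labeled = np.vstack([X_pos, X_rn])
    y_labeled = np.array([1] * len(X_pos) + [0] * len(X_rn))
    clf = MultinomialNB(alpha=alpha).fit(X_labeled, y_labeled)
    for _ in range(n_iter):
        # E-step: probabilistic class labels for the unlabeled docs in Q
        proba = clf.predict_proba(X_q)  # columns follow clf.classes_ = [0, 1]
        # M-step: retrain, counting each Q document fractionally in both
        # classes via sample_weight (a common trick for soft labels)
        X_all = np.vstack([X_labeled, X_q, X_q])
        y_all = np.concatenate([y_labeled,
                                np.ones(len(X_q), dtype=int),
                                np.zeros(len(X_q), dtype=int)])
        w_all = np.concatenate([np.ones(len(y_labeled)),
                                proba[:, 1], proba[:, 0]])
        clf = MultinomialNB(alpha=alpha).fit(X_all, y_all,
                                             sample_weight=w_all)
    return clf
```

[A real implementation would also check convergence of the class probabilities instead of running a fixed number of iterations.]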

### amueller commented Nov 3, 2012

 Contrary to what you said in the beginning, this is a semi-supervised setting with an additional stage to obtain RNs.

### pemistahl commented Nov 3, 2012

Yes, you are probably right. I was confusing the terminology, it seems, sorry. So, do you think that the pull request containing a semi-supervised version of the classifiers can get me further? It's a bit difficult for me to get my head around this. By the way, is there any reason why this semi-supervised version has not been integrated into the master branch so far? As far as I could see, the implementation seemed quite mature.

### amueller commented Nov 3, 2012

Yes, I'm pretty sure it will help you. You are right, #430 seems quite mature, but apparently @larsmans didn't give it the last finishing touch. There was still some controversy about the API. Feel free to restart the discussion or fix the remaining issues. It's a shame that the code has been lying around for so long :-$


### larsmans commented Nov 3, 2012

I'd love to merge in that code some day, but I'm completely swamped. It's indeed the API that prevented it from being merged. I also have a partial implementation of PU learning algorithms, but that's far from finished; it's somewhere among my branches.

### pemistahl commented Nov 3, 2012

 @larsmans: The API doesn't need to be very elegant. Can you tell me whether it works correctly? Then I will try to use it for my purposes.

### larsmans commented Nov 3, 2012

 The vanilla semisupervised version works quite nicely. As for the PU version, I implemented only a method for finding the reliable negatives, and I had a hard time getting it to work for my data. https://github.com/larsmans/scikit-learn/tree/pu-learning

### pemistahl commented Nov 3, 2012

 @larsmans I have already implemented all methods for finding the reliable negatives myself and I think all of them are working correctly. But they are designed as custom management commands for a Django project of mine at the moment. Maybe I can adapt them to fit into scikit-learn and contribute them later. I will try your `SemisupervisedNB` class now and let you know whether it fits my needs.

### pemistahl commented Nov 6, 2012

@larsmans I have the following code at the moment:

```python
vectorizer = CountVectorizer(min_df=1)
# data is provided as generators
X_train = vectorizer.fit_transform(
    chain(positive_data, reliable_negative_data, unlabeled_data))
y_train = np.concatenate([
    np.array(number_positive_docs * [1]),
    np.array(number_reliable_negative_docs * [0]),
    np.array(number_unlabeled_docs * [-1])])

multinomial_clf = MultinomialNB(alpha=0.1)
semisupervised_clf = SemisupervisedNB(estimator=multinomial_clf)
semisupervised_clf.fit(X=X_train, y=y_train)
```

How do I get the predictions for the unlabeled data now? Do I have to create a separate matrix for the unlabeled data and pass it to `semisupervised_clf.predict()` as an argument? Or can I simply extract the predicted labels from the last iteration of the EM algorithm from the original matrix `X_train`? The docstrings in your code are a bit unclear about that. Thank you!

### larsmans commented Nov 6, 2012

 You just feed the samples to `predict`. I know that may seem wasteful, but `predict` is dirt cheap.
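[In other words, the unlabeled documents are vectorized with the already-fitted vectorizer and handed to `predict`. A minimal sketch with the stock `MultinomialNB` (the toy documents are made up; `SemisupervisedNB` only exists on the PR branch, but the call pattern is the same):]

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["good great fine", "bad awful poor"]
y_train = np.array([1, 0])

vectorizer = CountVectorizer(min_df=1)
X_train = vectorizer.fit_transform(train_docs)
clf = MultinomialNB(alpha=0.1).fit(X_train, y_train)

# Transform the unlabeled documents with the *already fitted*
# vectorizer (transform, not fit_transform), then predict.
X_unlabeled = vectorizer.transform(["great good", "awful bad"])
labels = clf.predict(X_unlabeled)
```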

### ogrisel commented Nov 6, 2012

> The API doesn't need to be very elegant. Can you tell me whether it works correctly? Then I will try to use it for my purposes.

The API needs to be elegant before merging to master, because we will have to maintain it over the years. If we have to change the API later, we will get questions on the mailing list and false bug reports with missing version information from users who do not read the change log. Hence we need to get the API as right as possible from the start. Otherwise we are swamped by maintenance issues and question answering, and it's no longer fun to contribute to the project because we cannot find the time to do any more coding.

### mblondel commented Nov 6, 2012

 @ogrisel The way I understood it, he was talking about using it for his own purposes, not about merging it in scikit-learn.

### mblondel commented Nov 6, 2012

And regarding the API: if @larsmans's code is updated to adhere to the same API as `LabelPropagation`, it could be merged IMO.

### larsmans commented Nov 6, 2012

The problem wasn't compatibility with `LabelPropagation` -- it's the fact that `SemisupervisedNB` is a meta-estimator instead of the functionality being merged into `BaseDiscreteNB`. If we can make a decision about this, I'd be happy to pick up that code and merge it into master over the Christmas holiday.

### mblondel commented Nov 6, 2012

 Oh, I remember this issue now. Yes, out of respect for all the work you put into this, we should really make a decision so that you can merge your code.

### larsmans commented Nov 6, 2012

 You shouldn't merge anything out of respect for the effort -- I've put a lot of effort into things that didn't work :)

### amueller commented Nov 6, 2012

We shouldn't merge for that reason, but we should decide what to do ;) See my comment there.

### pemistahl commented Nov 6, 2012

@ogrisel: I was saying that the API doesn't need to be elegant for me personally in order to use it in my project. But, of course, I understand that this is necessary for merging the code into the master branch.

@larsmans: Do you know offhand which parts of this file have to be put into the respective file in the master branch so that the class `SemisupervisedNB` can be used without any errors? I added the entire class `SemisupervisedNB` and the private methods `BaseDiscreteNB._fit1ofK()` and `BaseDiscreteNB._label_1ofK()`, but now I get the following error message:

```
File "/[...]/sklearn/naive_bayes.py", line 560, in fit
  old_coef = clf.coef_.copy()
File "/[...]/sklearn/naive_bayes.py", line 319, in _get_coef
  return self.feature_log_prob_[1] if len(self.classes_) == 2 \
AttributeError: 'MultinomialNB' object has no attribute 'classes_'
```

It is quite important for me to get this code running, so I would appreciate any help from you very much.

### larsmans commented Nov 6, 2012

 It would take quite some hackery to fully merge this code with current `master`. You might want to just use the file from that branch, I don't think the actual Naive Bayes algorithms changed a lot.

### pemistahl commented Nov 6, 2012

@larsmans: First of all, thank you for your quick answers. Very much appreciated. :-) I've already tried replacing the file `naive_bayes.py` in the master branch with the file from your emnb branch. When I call `SemisupervisedNB.fit()` I get the following error:

```
File "/[...]/sklearn/naive_bayes.py", line 573, in fit
  + norm(old_intercept - clf.intercept_, 1))
File "/[...]/scipy/linalg/misc.py", line 12, in norm
  a = np.asarray_chkfinite(a)
File "/[...]/numpy/lib/function_base.py", line 590, in asarray_chkfinite
  "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
```

I use the latest stable versions of SciPy and NumPy. Any idea what's wrong here?

### larsmans commented Nov 6, 2012

It looks like there's an `inf` or `nan` in the coefficient vector (the log-probabilities of features). That shouldn't happen. Can you find out where the infinity occurs and add some debugging code to see what values it takes on prior to going to infinity? It might also be worthwhile to find out which feature in the input that is... (There also seems to be a bug in line 573: the `+` should be a `-`, I think. If you're in a hurry, you could turn off the convergence checking and just run EM for a fixed number of iterations.)
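[The debugging larsmans suggests could look roughly like this (a hypothetical aid with a made-up stand-in array, not code from the branch): check the coefficient vector after each EM iteration and report which feature indices are non-finite.]

```python
import numpy as np

coef = np.array([-0.7, -2.3, -np.inf, -1.1])  # stand-in for clf.coef_
bad = ~np.isfinite(coef)
if bad.any():
    # These feature indices can then be traced back to vocabulary terms
    print("non-finite log-probabilities at feature indices:",
          np.flatnonzero(bad))
```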

### pemistahl commented Nov 6, 2012

By the way, I also get the following warnings:

```
/[...]/sklearn/naive_bayes.py:258: RuntimeWarning: divide by zero encountered in log
  self.class_log_prior_ = np.log(y_freq) - np.log(y_freq.sum())
/[...]/sklearn/naive_bayes.py:573: RuntimeWarning: invalid value encountered in subtract
  + norm(old_intercept - clf.intercept_, 1))
```

The same warnings and errors appear when I replace the `+` in line 573 with a `-`. How exactly do I turn off convergence checking? If I set the attribute `tol` to `None`, I get another error:

```
File "/[...]/sklearn/naive_bayes.py", line 551, in fit
  tol = self.tol * n_features
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
```

### larsmans commented Nov 6, 2012

You can turn off convergence checking by just commenting out the whole `norm`-based logic around line 573. But I think something else is amiss here. What does `y` look like? In particular, what is `np.unique(y)` at the beginning of the EM loop and just before it crashes?

### pemistahl commented Nov 6, 2012

`y` looks like this:

```python
y = np.concatenate([
    np.array(number_positive_docs * [1]),
    np.array(number_reliable_negative_docs * [0]),
    np.array(number_unlabeled_docs * [-1])])
```

`np.unique(y)` gives `array([-1, 0, 1])`. As far as I can see from the code, `y` doesn't change during the iterations, i.e., in the `for` loop in line 559. Or do you mean `Y` instead?

### larsmans commented Nov 6, 2012

Sorry, that's `Y`. Or, more interestingly, `y_freq`. (Sorry, I'm trying to recollect, in between video lectures, what the code I wrote a year ago does :)

### pemistahl commented Nov 6, 2012

When I disable convergence checking and run the algorithm with 10 iterations, below is the output of `Y` and `y_freq`. If I enable convergence checking again, the program fails right after the line `Naive Bayes EM, iteration 0, y_freq: [ 0. 8378.9333942 17822.0666058]` has been displayed.

```
y_freq: [ 0 7290 17613]
/[...]/sklearn/naive_bayes.py:259: RuntimeWarning: divide by zero encountered in log
  self.class_log_prior_ = np.log(y_freq) - np.log(y_freq.sum())
Y: array([[0, 0, 1], [0, 0, 1], [0, 0, 1], ..., [1, 0, 0], [1, 0, 0], [1, 0, 0]])
Naive Bayes EM, iteration 0, y_freq: [ 0. 8378.9333942 17822.0666058]
Y: array([[ 0.00000000e+00, 3.46519606e-03, 9.96534804e-01], [ 0.00000000e+00, 2.17572298e-10, 1.00000000e+00], [ 0.00000000e+00, 8.08888129e-11, 1.00000000e+00], ..., [ 0.00000000e+00, 1.00000000e+00, 1.40789732e-98], [ 0.00000000e+00, 1.00000000e+00, 1.95952026e-99], [ 0.00000000e+00, 1.00000000e+00, 1.09213922e-85]])
Naive Bayes EM, iteration 1, y_freq: [ 0. 8415.53031742 17785.46968258]
Y: array([[ 0.00000000e+000, 3.78921026e-003, 9.96210790e-001], [ 0.00000000e+000, 2.13413175e-010, 1.00000000e+000], [ 0.00000000e+000, 7.66849136e-011, 1.00000000e+000], ..., [ 0.00000000e+000, 1.00000000e+000, 3.65956451e-100], [ 0.00000000e+000, 1.00000000e+000, 1.49041595e-100], [ 0.00000000e+000, 1.00000000e+000, 6.33163722e-111]])
Naive Bayes EM, iteration 2, y_freq: [ 0. 8423.40377029 17777.59622971]
Y: array([[ 0.00000000e+000, 3.84123829e-003, 9.96158762e-001], [ 0.00000000e+000, 2.21290143e-010, 1.00000000e+000], [ 0.00000000e+000, 7.99408677e-011, 1.00000000e+000], ..., [ 0.00000000e+000, 1.00000000e+000, 7.61149802e-097], [ 0.00000000e+000, 1.00000000e+000, 2.81529459e-097], [ 0.00000000e+000, 1.00000000e+000, 6.98453916e-108]])
Naive Bayes EM, iteration 3, y_freq: [ 0. 8426.82680896 17774.17319104]
Y: array([[ 0.00000000e+000, 3.84835828e-003, 9.96151642e-001], [ 0.00000000e+000, 2.22097326e-010, 1.00000000e+000], [ 0.00000000e+000, 7.98978877e-011, 1.00000000e+000], ..., [ 0.00000000e+000, 1.00000000e+000, 1.58293495e-098], [ 0.00000000e+000, 1.00000000e+000, 5.80088961e-099], [ 0.00000000e+000, 1.00000000e+000, 1.56020781e-109]])
Naive Bayes EM, iteration 4, y_freq: [ 0. 8428.20883001 17772.79116999]
Y: array([[ 0.00000000e+000, 3.85059503e-003, 9.96149405e-001], [ 0.00000000e+000, 2.22228897e-010, 1.00000000e+000], [ 0.00000000e+000, 7.88360548e-011, 1.00000000e+000], ..., [ 0.00000000e+000, 1.00000000e+000, 5.93829598e-099], [ 0.00000000e+000, 1.00000000e+000, 2.18034206e-099], [ 0.00000000e+000, 1.00000000e+000, 5.60937508e-110]])
Naive Bayes EM, iteration 5, y_freq: [ 0. 8429.26678124 17771.73321876]
Y: array([[ 0.00000000e+000, 3.85187799e-003, 9.96148122e-001], [ 0.00000000e+000, 2.22367683e-010, 1.00000000e+000], [ 0.00000000e+000, 7.73897691e-011, 1.00000000e+000], ..., [ 0.00000000e+000, 1.00000000e+000, 5.10176608e-099], [ 0.00000000e+000, 1.00000000e+000, 1.88370075e-099], [ 0.00000000e+000, 1.00000000e+000, 4.67052518e-110]])
Naive Bayes EM, iteration 6, y_freq: [ 0. 8429.59775586 17771.40224414]
Y: array([[ 0.00000000e+000, 3.85248690e-003, 9.96147513e-001], [ 0.00000000e+000, 2.22386839e-010, 1.00000000e+000], [ 0.00000000e+000, 7.72765263e-011, 1.00000000e+000], ..., [ 0.00000000e+000, 1.00000000e+000, 4.66505361e-099], [ 0.00000000e+000, 1.00000000e+000, 1.72411031e-099], [ 0.00000000e+000, 1.00000000e+000, 4.20129643e-110]])
Naive Bayes EM, iteration 7, y_freq: [ 0. 8429.29960942 17771.70039058]
Y: array([[ 0.00000000e+000, 3.85310960e-003, 9.96146890e-001], [ 0.00000000e+000, 2.22492909e-010, 1.00000000e+000], [ 0.00000000e+000, 7.73191700e-011, 1.00000000e+000], ..., [ 0.00000000e+000, 1.00000000e+000, 4.41773189e-099], [ 0.00000000e+000, 1.00000000e+000, 1.63114819e-099], [ 0.00000000e+000, 1.00000000e+000, 3.90354517e-110]])
Naive Bayes EM, iteration 8, y_freq: [ 0. 8429.52790536 17771.47209464]
Y: array([[ 0.00000000e+000, 3.85394219e-003, 9.96146058e-001], [ 0.00000000e+000, 2.22721606e-010, 1.00000000e+000], [ 0.00000000e+000, 7.74184984e-011, 1.00000000e+000], ..., [ 0.00000000e+000, 1.00000000e+000, 4.22522102e-099], [ 0.00000000e+000, 1.00000000e+000, 1.55620262e-099], [ 0.00000000e+000, 1.00000000e+000, 3.64041923e-110]])
Naive Bayes EM, iteration 9, y_freq: [ 0. 8429.66724278 17771.33275722]
```

### larsmans commented Nov 6, 2012

 Ah, now I know what's wrong. Just plugging in that file isn't enough -- you also need the accompanying changes to `LabelBinarizer` so that it knows how to handle the `-1` = unlabeled rule.
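[For illustration, the "`-1` = unlabeled" convention could be reimplemented roughly like this (a hypothetical helper, not the actual `LabelBinarizer` changes from the emnb branch): labeled samples get a 1-of-K row, while unlabeled samples get a uniform row.]

```python
import numpy as np

def binarize_with_unlabeled(y, classes=(0, 1)):
    """Map labels to 1-of-K rows; -1 (unlabeled) becomes a uniform row."""
    classes = np.asarray(classes)
    # Start with a uniform distribution over classes for every sample
    Y = np.full((len(y), len(classes)), 1.0 / len(classes))
    for i, label in enumerate(y):
        if label != -1:
            # Labeled sample: indicator row for its class
            Y[i] = (classes == label).astype(float)
    return Y
```

[This also explains why the unlabeled documents start out with equal class probabilities before the first EM iteration.]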

### pemistahl commented Nov 6, 2012

What do you mean exactly? Do I have to replace the file `sklearn/preprocessing/__init__.py` (where the class `LabelBinarizer` is defined) in the master branch with the respective file from your emnb branch? Or do I additionally have to change anything in the code? Please be a bit more specific. Thank you. :)

EDIT: I've just seen that in the current stable version 0.12 the folder structure has changed compared to your emnb branch. In the master branch, we have the file `sklearn/preprocessing.py` instead. Do I just have to replace the class `LabelBinarizer` in the respective location?

### larsmans commented Nov 6, 2012

You'll need to copy-paste the definition of `LabelBinarizer` -- too much has changed in the rest of that file, as you already noticed. No guarantees that it will work out of the box, though. `SemisupervisedNB` hasn't been maintained, I haven't looked at it for a year, and I can't really give proper support for it at the moment. You could try just installing scikit-learn from that old branch if you don't need any of the changes made since then...

### pemistahl commented Nov 6, 2012

I've replaced the class `LabelBinarizer` and run my program again with convergence checking enabled. Now the program finishes without any errors and converges after 8 iterations. Looks good so far. :) Can you please take a look at the following output and tell me whether this is what you expected?

```
y_freq: [ 7290. 17613.]
Y: array([[ 0. , 1. ], [ 0. , 1. ], [ 0. , 1. ], ..., [ 0.5, 0.5], [ 0.5, 0.5], [ 0.5, 0.5]])
Naive Bayes EM, iteration 0, y_freq: [ 8378.9333942 17822.0666058] diff = 6.97
Y: array([[ 3.46519606e-03, 9.96534804e-01], [ 2.17572298e-10, 1.00000000e+00], [ 8.08888129e-11, 1.00000000e+00], ..., [ 1.00000000e+00, 1.40789732e-98], [ 1.00000000e+00, 1.95952026e-99], [ 1.00000000e+00, 1.09213922e-85]])
Naive Bayes EM, iteration 1, y_freq: [ 8415.53031742 17785.46968258] diff = 5.95
Y: array([[ 3.78921026e-003, 9.96210790e-001], [ 2.13413175e-010, 1.00000000e+000], [ 7.66849136e-011, 1.00000000e+000], ..., [ 1.00000000e+000, 3.65956451e-100], [ 1.00000000e+000, 1.49041595e-100], [ 1.00000000e+000, 6.33163722e-111]])
Naive Bayes EM, iteration 2, y_freq: [ 8423.40377029 17777.59622971] diff = 4.79
Y: array([[ 3.84123829e-003, 9.96158762e-001], [ 2.21290143e-010, 1.00000000e+000], [ 7.99408677e-011, 1.00000000e+000], ..., [ 1.00000000e+000, 7.61149802e-097], [ 1.00000000e+000, 2.81529459e-097], [ 1.00000000e+000, 6.98453916e-108]])
Naive Bayes EM, iteration 3, y_freq: [ 8426.82680896 17774.17319104] diff = 1.07
Y: array([[ 3.84835828e-003, 9.96151642e-001], [ 2.22097326e-010, 1.00000000e+000], [ 7.98978877e-011, 1.00000000e+000], ..., [ 1.00000000e+000, 1.58293495e-098], [ 1.00000000e+000, 5.80088961e-099], [ 1.00000000e+000, 1.56020781e-109]])
Naive Bayes EM, iteration 4, y_freq: [ 8428.20883001 17772.79116999] diff = 0.876
Y: array([[ 3.85059503e-003, 9.96149405e-001], [ 2.22228897e-010, 1.00000000e+000], [ 7.88360548e-011, 1.00000000e+000], ..., [ 1.00000000e+000, 5.93829598e-099], [ 1.00000000e+000, 2.18034206e-099], [ 1.00000000e+000, 5.60937508e-110]])
Naive Bayes EM, iteration 5, y_freq: [ 8429.26678124 17771.73321876] diff = 1.29
Y: array([[ 3.85187799e-003, 9.96148122e-001], [ 2.22367683e-010, 1.00000000e+000], [ 7.73897691e-011, 1.00000000e+000], ..., [ 1.00000000e+000, 5.10176608e-099], [ 1.00000000e+000, 1.88370075e-099], [ 1.00000000e+000, 4.67052518e-110]])
Naive Bayes EM, iteration 6, y_freq: [ 8429.59775586 17771.40224414] diff = 1.57
Y: array([[ 3.85248690e-003, 9.96147513e-001], [ 2.22386839e-010, 1.00000000e+000], [ 7.72765263e-011, 1.00000000e+000], ..., [ 1.00000000e+000, 4.66505361e-099], [ 1.00000000e+000, 1.72411031e-099], [ 1.00000000e+000, 4.20129643e-110]])
Naive Bayes EM, iteration 7, y_freq: [ 8429.29960942 17771.70039058] diff = 1.41
Y: array([[ 3.85310960e-003, 9.96146890e-001], [ 2.22492909e-010, 1.00000000e+000], [ 7.73191700e-011, 1.00000000e+000], ..., [ 1.00000000e+000, 4.41773189e-099], [ 1.00000000e+000, 1.63114819e-099], [ 1.00000000e+000, 3.90354517e-110]])
Naive Bayes EM, iteration 8, y_freq: [ 8429.52790536 17771.47209464] diff = 0.571
Naive Bayes EM converged
```

### larsmans commented Nov 6, 2012

Looks absolutely fine to me! Of course, the proof of the pudding is in the eating, so be sure to evaluate this :) (The easy way of doing an evaluation is to take a labeled set and randomly throw away part of the labels -- there's an example script in the PR that does that.)
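[The evaluation recipe larsmans describes can be sketched like this (numpy only; the variable names are made up, and the training/accuracy steps are left as comments):]

```python
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=20)  # a fully labeled toy set

# Randomly hide about half of the labels, marking them unlabeled with -1
y_semi = y_true.copy()
mask = rng.rand(len(y_semi)) < 0.5
y_semi[mask] = -1

# Train the semi-supervised model on (X, y_semi), then measure the
# accuracy of its predictions against y_true on the hidden portion:
#   clf.fit(X, y_semi)
#   accuracy = np.mean(clf.predict(X[mask]) == y_true[mask])
```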

### pemistahl commented Nov 6, 2012

Sure, I will evaluate this and let you know. That information might then help you merge the code into master some day. But I think I'm now in a position where I can finally continue. I wouldn't have accomplished that without your help. Thank you very much! :) Scikit-learn has a great community.

### larsmans commented Nov 6, 2012

 Great! I'm really interested in what comes out of this.
