ENH add Naive Bayes Metaestimator ColumnwiseNB (aka "GeneralNB") #22574

Open · wants to merge 218 commits into base: main
Conversation

@avm19 (Contributor) commented Feb 22, 2022

This PR is ready for review.
Dear reviewer, please don't be discouraged by the amount of text below! The enumerated comments are optional and simply document certain considerations that I had while implementing this PR.

Reference Issues/PRs
Fixes #12957 and #15077 (issues)
Supersedes #16281 (stalled PR)
Somewhat related to #10856
Incorporates changes from PR #22565, where I add abstract methods in _BaseDiscreteNB

What does this implement/fix? Explain your changes.

This implements a meta-estimator that combines existing naive Bayes classifiers: GaussianNB, BernoulliNB, and others. The interface is inspired by ColumnTransformer, as was previously suggested by @timbicker, @jnothman, and others:

>>> import numpy as np
>>> rng = np.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100))
>>> y = np.array([0, 0, 1, 1, 2, 2])
>>> from sklearn.naive_bayes import MultinomialNB, GaussianNB, ColumnwiseNB
>>> clf = ColumnwiseNB(nb_estimators=[('mnb1', MultinomialNB(), [0, 1]),
...                                   ('mnb2', MultinomialNB(), [3, 4]),
...                                   ('gnb1', GaussianNB(), [5])])
>>> clf.fit(X, y)
ColumnwiseNB(nb_estimators=[('mnb1', MultinomialNB(), [0, 1]),
                            ('mnb2', MultinomialNB(), [3, 4]),
                            ('gnb1', GaussianNB(), [5])])
>>> print(clf.predict(X))
[0 0 1 0 2 2]

The meta-estimator combines multiple naive Bayes estimators by expressing the overall joint probability P(x,y) through the subestimators' joint probabilities P(x_i,y):

Log P(x,y) = Log P(x_1,y) + ... + Log P(x_N,y) - (N - 1) Log P(y),

where N is the number of estimators (3 in the example). The joint probabilities P(x_i,y) are exposed through the private method _joint_log_likelihood(self, X) of the _BaseNB class, which is a parent of all naive Bayes classifiers implemented so far, including the proposed meta-estimator.
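
To make the combination rule concrete, here is a minimal sketch (my illustration, not the PR's actual code), assuming fitted sub-estimators and their column index lists:

    import numpy as np

    def combined_joint_log_likelihood(estimators, column_subsets, class_log_prior, X):
        # Each sub-estimator's _joint_log_likelihood returns an array of shape
        # (n_samples, n_classes) holding Log P(x_i, y).
        jlls = [
            est._joint_log_likelihood(X[:, cols])
            for est, cols in zip(estimators, column_subsets)
        ]
        # Log P(x, y) = sum_i Log P(x_i, y) - (N - 1) Log P(y)
        return np.sum(jlls, axis=0) - (len(estimators) - 1) * class_log_prior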

See further equations and a detailed explanation inside the collapsible section in this comment.

Any other comments?

  1. About the name. This meta-estimator was referred to as GeneralNB in the past, cf. the stalled PR [WIP] Implement general naive Bayes #16281. I think terms like "general" or "generalized" are too ambiguous. For example, Bernoulli and Gaussian distributions are both members of the general exponential family of distributions: is this the generalisation we refer to? Another generalisation is having a different distribution family for each class, e.g., P(x|y=0)~N and P(x|y=1)~Mult, or working with any user-provided distribution. In addition, I don't recall seeing "the general naive Bayes" in the literature: all models were just naive Bayes with further assumptions of various degrees of generality.
    In contrast, ColumnwiseNB is specific: it tells the user that something is going on with columns and forms a very appropriate association with ColumnTransformer. I think this is a much better name, and maybe there are even better options. I'll be fine with any decision, but I think ColumnwiseNB would be a good choice.
  2. Docstring and tests are included.
  3. Use case, examples and guide. I don't know of a good use case or have a good example at hand. I'd be grateful for a suggestion, or if someone wrote an example using my implementation while testing it at the same time. I guess a paragraph or two should be added to the Guide as well. Shall we leave all this for another PR?
    [Updated on 2022-04-10] A section has been added to the naive Bayes guide. An auto-example (Titanic dataset) has been created for the gallery.
  4. Parallel jobs. I implemented fit and partial_fit following _BaseDiscreteNB and took ColumnTransformer's implementation as a blueprint for parallelism. I am not very familiar with the inner workings of joblib, so I ask reviewers to pay extra attention here.
    [Related comment by jnothman: "... I suspect that fitting also will not benefit much from parallelism, but this may be more of an empirical question."]
  5. MetaEstimator. _BaseComposition is a parent class and get/set methods for parameters have been implemented and tested. Subestimators are cloned prior to fitting.
  6. Callable columns are supported.
  7. Pandas DataFrame and string indexing of columns is supported.
  8. There is no remainder parameter, unlike in ColumnTransformer; supporting one would complicate the logic a little. Do you think it would be really nice to have? Maybe leave it until the next PR then.
  9. Repeated columns are allowed and have the same effect as if those features were duplicated in the input data.
  10. GaussianNB and _BaseDiscreteNB use different conventions. For example, priors vs class_prior as hyperparameters, and class_log_prior_ vs class_prior_ as attributes. I do not know which convention is preferable. Just a remark.
  11. The meta-estimator duplicates some functionality of its subestimators, such as checking feature names, sample weights, classes, and the supplied prior, and counting and calculating priors as relative frequencies. I haven't found a way to avoid this.
  12. (I forgot it)
  13. Theoretical issue. The equation Log P(x,y) = Log P(x_1,y) + ... + Log P(x_N,y) - (N - 1) Log P(y) implicitly assumes that the class log priors are finite and agree between the meta-estimator and the subestimators: -inf < Log P(y) = Log P(y|1) = ... = Log P(y|N). The meta-estimator does not check this condition, and it may produce meaningless results, including NaN, if the class priors differ or contain a zero probability. Usually this is not a problem, because all children of _BaseNB calculate class priors directly as relative frequencies; unseen classes from partial_fit are on the user, not on us. But in general, class priors can be estimated differently, or provided by the user, in the subestimators and the meta-estimator independently. We can distinguish two issues. First, what if the subestimators' class priors do not agree with the meta-estimator's prior? Then we could "marry" the subestimators not "at joint probabilities" but "at conditional probabilities", just as naive Bayes prescribes: Log P(x|y) = Log P(x_1|y) + ... + Log P(x_N|y), or equivalently Log P(x,y) = [Log P(x_1,y) - Log P(y|1)] + ... + [Log P(x_N,y) - Log P(y|N)] + Log P(y), where Log P(y|i) is the class prior from the ith subestimator (see the sketch after this list). Second, what if some prior probabilities are zero? A problem arises only when the zero occurs in the meta-estimator's class prior. This is something to think about. I wonder if someone has come across a paper or a textbook that discusses and resolves this issue. [Update: see most recent comments 1 and 2]
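
The conditional combination from item 13 could look roughly like this (a hypothetical sketch, not part of the PR):

    import numpy as np

    def conditional_combination(jlls, sub_class_log_priors, meta_class_log_prior):
        # jlls: per-subestimator arrays of shape (n_samples, n_classes), Log P(x_i, y);
        # sub_class_log_priors: each subestimator's class log prior, Log P(y|i);
        # meta_class_log_prior: the meta-estimator's class log prior, Log P(y).
        centered = [jll - lp for jll, lp in zip(jlls, sub_class_log_priors)]
        return np.sum(centered, axis=0) + meta_class_log_prior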

Finally, I would appreciate any comments and suggestions, especially those addressing mistakes or aimed at efficient programming and "good code". I am glad to be learning from this community.

Updated on Wednesday, February 23, 2022 05:57:39 UTC:

  1. estimatorNBs vs estimators. Initially, the parameter's name was "estimators", by analogy with "transformers". It turns out that many common tests failed simply because they duck-type models based on the presence of estimators in the parameter list. As a result, the tests assume that they are dealing with something like VotingClassifier and feed it LogisticRegression, something that ColumnwiseNB is not supposed to accept. I took the path of least resistance and renamed estimators to estimatorNBs. [Further update. Following jnothman's comment, the parameter was later renamed to nb_estimators. glemaitre thinks the name should be estimators. My comment summarises my thoughts and links to the duck-typing issue.]

  2. The only change I made to pre-existing test files is adding ColumnwiseNB to VALIDATE_ESTIMATOR_INIT, a list that exempts ColumnTransformer and other estimators from a certain test. [Updated on 2022-05-29] The problematic test now passes by catching the error when the _estimators setter in ColumnwiseNB fails to unpack the parameters passed by the test. See FIX Remove validation from __init__ and set_params for ColumnTransformer #22537 for a similar change in ColumnTransformer.

Updated on Wednesday, February 23, 2022 09:14:13 UTC:

  1. pytest, joblib, and output. I don't understand how pytest interacts with joblib, or why I can't capture any stdout produced within a parallel loop when running pytest. As a result, I cannot assert on such output in a test. Could someone please explain this unexpected behaviour or suggest a solution?

Updated on Sunday, February 27, 2022 09:00:13 UTC:

  1. n_features_in_. Advice needed. First, I do not fully understand the purpose of this attribute and whether it is needed given that we use feature_names_in_. Second, setting it via BaseEstimator._check_n_features can be problematic, since this method expects a "converted array", and I currently avoid converting X in the meta-estimator.
    [Updated on 2022-10-04]: Although BaseEstimator._check_n_features is typically passed converted arrays (e.g., from BaseEstimator._validate_data or ColumnTransformer.fit_transform), it seems to work fine with pandas dataframes too. All examples work and all tests pass.

Updated on Thursday, April 7, 2022 17:14:08 UTC:

  1. There could be an advantage in taking VotingClassifier as a prototype instead of ColumnTransformer. The difference is that we would pass a list of tuples (str, estimator) instead of (str, estimator, columns). In that case, the estimator would have to be a base naive Bayes estimator wrapped together with a column selector, e.g. a GaussianNB plus a selector of float columns. In fact, we could redefine all of _BaseNB to add this new parameter for column selection, whose default value selects all columns of X. The advantage is that the subestimators' column subsets would then be treated as separate hyperparameters for the purposes of grid search (see the usage sketch below). We could write
    param_grid = {"clf__gnb1__columns": [[5], [5, 6]]}
    instead of the much more verbose
    param_grid = {"clf__estimators": [
        [('mnb1', MultinomialNB(), [0, 1]),
         ('mnb2', MultinomialNB(), [3, 4]),
         ('gnb1', GaussianNB(), [5])],
        [('mnb1', MultinomialNB(), [0, 1]),
         ('mnb2', MultinomialNB(), [3, 4]),
         ('gnb1', GaussianNB(), [5, 6])],
    ]}
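
    Either grid can then be used in the usual way (a hedged sketch; `pipe`, `X`, and `y` are assumptions, with `pipe` a Pipeline whose final step, named "clf", is the estimator targeted by the "clf__..." keys above):

    from sklearn.model_selection import GridSearchCV

    # Hypothetical usage, not PR code: search over the column assignments.
    search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_)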
    

Updated on Monday, April 11, 2022 05:30:23 UTC:

  1. HTML representation. ColumnwiseNB displays subestimators in parallel, just like ColumnTransformer.

Updated on Thursday, July 7, 2022 23:24:45 UTC:

  1. _validate_params, aka common parameter validation. The changes necessary for Make all estimators use _validate_params #23462 are made. A custom replacement for test_common.py::test_check_param_validation is implemented in test_naive_bayes.py::test_cwnb_check_param_validation. It is needed because utils.estimator_checks._construct_instance does not know how to create an instance of ColumnwiseNB, which leads to test_common.py silently skipping this test for ColumnwiseNB (a message is displayed only when the test is called directly). ColumnTransformer, Pipeline, and a handful of other estimators suffer from this problem too.

Updated on Friday, October 7, 2022 16:26:00 UTC:

  1. Why is ColumnTransformer not enough? This question was asked at a Discord voice meeting. First, ColumnTransformer is a transformer, not a classifier. Second, ColumnTransformer simply concatenates the transformed columns, whereas naive Bayes requires additional calculations. Importantly, these additional calculations are done not on the subestimators' standard output (predict, predict_proba, predict_log_proba) but on intermediate quantities (predict_joint_log_proba, formerly known as _joint_log_likelihood); see this discussion. This is why ColumnTransformer cannot be used, even with additional logic wrapped on top of it.
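
For illustration only, a sketch (mine, not library code) of the combination for N = 2 column blocks, written with the public predict_joint_log_proba; an hstack of the blocks' outputs cannot express this:

    def two_block_joint_log_proba(est_a, cols_a, est_b, cols_b, class_log_prior, X):
        # Sum the blocks' joint log-probabilities and subtract (N - 1) = 1
        # copy of the class log prior, per the equation above.
        return (
            est_a.predict_joint_log_proba(X[:, cols_a])
            + est_b.predict_joint_log_proba(X[:, cols_b])
            - class_log_prior
        )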

@avm19 marked this pull request as draft February 22, 2022
@avm19 marked this pull request as ready for review February 23, 2022
@avm19 (Contributor, Author) commented Feb 27, 2022

This PR needs attention.

@avm19 (Contributor, Author) commented Mar 1, 2022

@jnothman Would you be able to review this PR, or advise on how to proceed to get it merged in the foreseeable future? I am asking you because you gave it a thumbs-up. Thanks!

@avm19 (Contributor, Author) commented Aug 22, 2023

@glemaitre:

Doing so, I am not convinced about the class prior in terms of the theory here. Since a user can pass different estimators, they can choose different priors. Thus, the term (M - 1) log P(y) is potentially wrong. Instead, we should subtract each class_prior or class_log_prior of the underlying naive Bayes estimators and add the prior estimated by the meta-estimator (at least if I get what the equation means). @avm19 does this remark make sense?

@steppi has just responded to this, and I agree. The remark is appropriate, but I am afraid the suggestion would address a non-existent use case. This issue was discussed in item 13, previously by us, and later by @steppi. So far, there is no example (no matter how contrived) of using sub-estimators with different priors.

So we should detect if someone tries to pass different priors and raise.

I don't know, should we? If yes, then we check that the sub-estimators' priors are all exactly equal or all None (not passed). If the user defines their own sub-estimator, then there are two options. Either we insist that they implement passing priors as UserNB(priors=...), in which case we check [e.priors for e in estimators]. Or we allow the user not to implement passing priors, in which case we check [e.get_params().get('priors', None) for e in estimators]. The first option requires the user to write two more lines of code when they implement their own sub-estimator. The second option leaves open the possibility that a user-defined sub-estimator accepts priors under a different name. You probably had the first option in mind, right?

So the workflow is as follows. If the user is okay with empirical priors, they do not pass anything to any sub-estimator. Otherwise, the user must compute the priors before initialisation and pass them to every sub-estimator and to the meta-estimator.

    # `compute_priors` and `UserNB` are placeholders from the discussion above,
    # not scikit-learn API.
    priors = compute_priors(y)
    clf1 = ColumnwiseNB(
        estimators=[
            ("g1", GaussianNB(priors=priors), [0]),
            ("u1", UserNB(priors=priors), [1]),
        ],
        priors=priors,
    )

I wonder if this can be improved.
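
For concreteness, the first option's check could look roughly like this (a hypothetical helper, assuming every sub-estimator exposes priors via get_params):

    import numpy as np

    def _check_consistent_priors(estimators, meta_priors):
        # Hypothetical: raise unless each sub-estimator's priors are unset (None)
        # or exactly equal to the meta-estimator's priors.
        for name, est, _ in estimators:
            sub_priors = est.get_params().get("priors", None)
            if sub_priors is not None and not np.array_equal(sub_priors, meta_priors):
                raise ValueError(
                    f"Sub-estimator {name!r} was given priors that differ "
                    "from the meta-estimator's priors."
                )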

@avm19 closed this Aug 22, 2023
@avm19 reopened this Aug 22, 2023
@glemaitre (Member) commented:

Either we insist that they implement passing priors

Indeed, I would force users to have priors as a parameter and have a consistent API. I opened #27135 to resolve the issue in scikit-learn.

Regarding the snippet, I would forbid passing priors to the underlying estimators. Internally, it means that when validating the estimators (converting from predictor to estimator), we also call set_params(priors=self.priors). Detecting that priors are not at their default will be relatively easy, and we can raise an informative error telling the user to set priors on the meta-estimator only.
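
If I read the proposal correctly, the internal handling could look roughly like this (a hypothetical sketch, assuming the sub-estimators accept a priors parameter as per #27135):

    from sklearn.base import clone

    def _prepare_sub_estimator(est, meta_priors):
        # Hypothetical: reject user-set priors on sub-estimators, then propagate
        # the meta-estimator's priors into the clone used for fitting.
        if est.get_params().get("priors", None) is not None:
            raise ValueError(
                "Set priors on the meta-estimator only, not on the sub-estimators."
            )
        return clone(est).set_params(priors=meta_priors)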

@avm19 (Contributor, Author) commented Sep 13, 2023

@glemaitre

This is an unfinished implementation but gives a good perspective

  • it does not handle partial_fit (not implemented by ColumnTransformer)
    ...

The inability to handle partial_fit alone sounds to me like a good reason to set this idea aside for now. I share your appreciation of clever ways of doing things, but I think that in this case the straightforward solution is simpler and easier to maintain.

The bits we are trying to re-use are _validate_column_callables(), _log_message, _iter(), and the Parallel()(...) statement. Everything else stays, plus _PredictorToTransformer and any complexity that may arise from this trick. Note that some things are simpler in ColumnwiseNB than in ColumnTransformer, e.g. _iter(). In my opinion, the 10 or 20 simple lines of code that we could potentially save will not offset the additional complexity and room for unexpected behaviour.

I suggest merging this PR as it is now. It is a perfectly working contribution that does the job a user expects of it. Any refactoring and re-use of ColumnTransformer won't affect the public API and can be done in subsequent PRs. Metadata routing can also follow later. What do you think, @glemaitre?

@glemaitre (Member) commented:

I suggest merging this PR as it is now. It is a perfectly working contribution that does the job a user expects of it. Any refactoring and re-use of ColumnTransformer won't affect the public API and can be done in subsequent PRs.

It does not affect the public API, but it will affect maintenance. I am personally uneasy about merging something in which we already foresee internal refactoring.

@glemaitre modified the milestones: 1.4, 1.5 Dec 7, 2023
@avm19 (Contributor, Author) commented Dec 7, 2023

I suggest merging this PR as it is now. It is a perfectly working contribution that does the job a user expects of it. Any refactoring and re-use of ColumnTransformer won't affect the public API and can be done in subsequent PRs.

It does not affect the public API, but it will affect maintenance. I am personally uneasy about merging something in which we already foresee internal refactoring.

I don't know whether a refactoring will be necessary or even likely. I understand that you prefer your proposal of using ColumnTransformer inside ColumnwiseNB by implementing _PredictorToTransformer, but I gave my reasons in the previous message for why I believe that keeping two similar pieces of code in parallel would be better and easier to maintain.

@adrinjalali (Member) commented:

I think I agree with @glemaitre here that having ColumnTransformer-like code in this estimator is much harder to maintain.

@avm19 (Contributor, Author) commented Jan 4, 2024

I think I agree with @glemaitre here that having ColumnTransformer-like code in this estimator is much harder to maintain.

Do you think that maintaining a _PredictorToTransformer would be easier? Is there a place in the sklearn codebase where such a trick has been done before, so that I could see how much maintenance effort it required?

To this calculation one should also add the tricks and changes necessary for partial_fit.
