
Gaussian Naive Bayes (GaussianNB) #51

Merged: 15 commits into rust-ml:master on Dec 31, 2020

Conversation

VasanthakumarV (Contributor)

Port of GaussianNB

@bytesnake (Member)

Will look into this over the weekend :)

@bytesnake (Member) left a comment

For me it makes more sense to structure it as ModelParameter -> FittedModel rather than ProbablyFittedModel -> FittedModel. I added a comment showing how the API would look; what do you think?

Review comments on: linfa-bayes/README.md, linfa-bayes/examples/winequality.rs, linfa-bayes/src/error.rs, linfa-bayes/src/gaussian_nb.rs
@bytesnake (Member)

Could you take a look at https://github.com/rust-ml/linfa/blob/fdf241127fe9058ae98c67a603a2bbb8eccc561c/CONTRIBUTE.md? I'm currently writing a small how-to-contribute document for the upcoming interface changes. If you find anything unclear, please let me know so I can update the document.

@bytesnake (Member) left a comment

I updated the review with regard to merging #55.
If you have any questions about how to implement the new traits, please let me know!

Review comments on: linfa-bayes/Cargo.toml, linfa-bayes/examples/winequality.rs, linfa-bayes/src/error.rs, linfa-bayes/src/gaussian_nb.rs, linfa-bayes/src/lib.rs
@VasanthakumarV (Contributor, Author)

@bytesnake Thank you for the detailed comments; I will try to make the changes by next week.

VasanthakumarV marked this pull request as draft on November 9, 2020.
@VasanthakumarV (Contributor, Author)

@bytesnake For incremental learning the algorithm needs the definite set of classes supplied by the user. Currently I supply them as an argument to the fit_with method, which prevents me from implementing the linfa::traits::IncrementalFit trait.

@bytesnake (Member)

I think you can use fn labels(&self) -> Vec<T::Elem>, which returns a vector of the unique classes. For this you have to restrict your targets like this:

```rust
impl<F: Float, R: Records<Elem = F>, T: Labels<Elem = usize>> IncrementalFit<R, T> for GaussianNbParams<F> {
}
```

In the future the trait could also be extended to accept any countable type:

```rust
impl<F: Float, L: Label, R: Records<Elem = F>, T: Labels<Elem = L>> IncrementalFit<R, T> for GaussianNbParams<F, L> {
}
```

@bytesnake (Member)

Oh, and you could estimate the priors from the samples as well; there is the Dataset::frequencies_with_mask function. It was introduced for decision trees, where you have to mask the data at each split point. You could either pass an all-true slice or introduce a function just called frequencies. It also takes the sample weights into consideration.
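
For illustration, a rough sketch of how the priors could follow from such a frequency map; the signature of frequencies_with_mask is assumed here (boolean mask over the samples in, label-to-frequency map out):

```rust
use std::collections::HashMap;

// A minimal sketch, assuming `frequencies_with_mask` yields a map from label
// to (weighted) frequency: the priors are just the normalized frequencies.
fn priors_from_frequencies(freqs: HashMap<usize, f32>) -> HashMap<usize, f32> {
    let total: f32 = freqs.values().sum();
    freqs
        .into_iter()
        .map(|(label, freq)| (label, freq / total))
        .collect()
}
```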

@VasanthakumarV (Contributor, Author)

@bytesnake I have started using the Labels trait for the dataset targets, and I now use the labels method for accessing the unique entries.

But the issue with incremental learning in naive Bayes is that we need to know the complete list of labels while training on batches of data; the first batch fed to the model will almost never contain all the labels, which is why I have an extra argument in the function signature.

Also, I see the Iter data structure; do you think we will be able to create chunks of Datasets using it? That would be very useful for the incremental learning API.

@VasanthakumarV (Contributor, Author)

> Oh, and you could estimate the priors from the samples as well; there is the Dataset::frequencies_with_mask function. It was introduced for decision trees, where you have to mask the data at each split point. You could either pass an all-true slice or introduce a function just called frequencies. It also takes the sample weights into consideration.

@bytesnake Currently I calculate the priors along with the mean and variance for each class, which requires masked indexing per class.
So rather than Dataset::frequencies_with_mask, a convenience function for masked or regular indexing would be more useful; my implementation is a little ugly, but I can work on a more generic one if you think it would be useful. A sketch of what I mean follows below.
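
For illustration, a sketch of that kind of helper; rows_for_class is a hypothetical name, not an existing linfa function:

```rust
use ndarray::{Array1, Array2, Axis};

// Hypothetical helper for masked indexing by class: collect the row indices
// whose label matches, then select those rows from the records.
fn rows_for_class(x: &Array2<f64>, y: &Array1<usize>, class: usize) -> Array2<f64> {
    let indices: Vec<usize> = y
        .iter()
        .enumerate()
        .filter(|(_, &label)| label == class)
        .map(|(i, _)| i)
        .collect();
    x.select(Axis(0), &indices)
}
```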

@bytesnake (Member)

@VasanthakumarV I think that an incremental model should collect its labels incrementally as well. Would it be possible to form the union of the previously seen labels with newly encountered ones and update the model correspondingly? Otherwise you could also pass the labels as a hyper-parameter and return an error when an unknown label is encountered.
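
A minimal sketch of the first option, assuming plain usize labels:

```rust
use std::collections::HashSet;

// Collect labels incrementally: union the labels seen so far with those
// encountered in the new batch before updating the model.
fn update_label_set(seen: &mut HashSet<usize>, batch_labels: &[usize]) {
    seen.extend(batch_labels.iter().copied());
}
```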

@bytesnake (Member)

> So rather than Dataset::frequencies_with_mask, a convenience function for masked or regular indexing would be more useful; my implementation is a little ugly, but I can work on a more generic one if you think it would be useful.

The API around the dataset is very sparse right now, so any update is welcome, and a function which can select samples based on the class should definitely be in there somewhere 👍 It would be even more useful if you could somehow produce a view of the sub-selected samples, but I don't think that is possible right now.
I want to start a PR renaming Dataset -> DatasetBase and implementing Dataset with Records = Array2<F> and Targets = Array1<L> to simplify the implementation details.

@codecov-io commented Dec 15, 2020

Codecov Report

Merging #51 (2f324f9) into master (1c19f3f) will increase coverage by 0.25%.
The diff coverage is 54.88%.


```diff
@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   49.07%   49.33%   +0.25%
==========================================
  Files          55       57       +2
  Lines        3629     3762     +133
==========================================
+ Hits         1781     1856      +75
- Misses       1848     1906      +58
```
| Impacted Files | Coverage Δ |
| --- | --- |
| linfa-bayes/src/error.rs | 0.00% <0.00%> (ø) |
| linfa-bayes/src/gaussian_nb.rs | 57.03% <57.03%> (ø) |
| linfa-linear/src/glm.rs | 49.54% <0.00%> (ø) |
| src/dataset/impl_targets.rs | 33.33% <0.00%> (ø) |
| linfa-hierarchical/src/lib.rs | 44.92% <0.00%> (ø) |
| linfa-svm/src/classification.rs | 62.32% <0.00%> (ø) |
| linfa-trees/src/decision_trees/algorithm.rs | 52.12% <0.00%> (+0.38%) ⬆️ |
| src/metrics_classification.rs | 50.00% <0.00%> (+0.52%) ⬆️ |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c19f3f...2f324f9.

@VasanthakumarV (Contributor, Author) commented Dec 16, 2020

> @VasanthakumarV I think that an incremental model should collect its labels incrementally as well. Would it be possible to form the union of the previously seen labels with newly encountered ones and update the model correspondingly? Otherwise you could also pass the labels as a hyper-parameter and return an error when an unknown label is encountered.

@bytesnake I went with the first approach from a usability standpoint.
To achieve this I replaced the ndarray matrix with a hash table, which greatly reduced the complexity of the training step and introduced a little complexity in the prediction step.

To implement the incremental learning API, I had to change the IncrementalFit API (link). The big problem right now is that IncrementalFit looks different from the Fit trait. I wasn't able to make Option and Result work together in the old incremental trait, and I am still thinking about ways to achieve what I want with a trait that fits nicely with everything else.
The doc example shows what it takes to train a model incrementally; I hope this matches the expectations we had for it.
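
To illustrate the shape of that change (the names below are hypothetical, not the actual fields in this PR):

```rust
use std::collections::HashMap;

// Hypothetical shape of the per-class state after the switch from an ndarray
// matrix to a hash table: each class maps to its own running statistics,
// which makes per-class incremental updates straightforward.
struct ClassStats {
    prior: f64,         // estimated class prior
    mean: Vec<f64>,     // per-feature mean
    variance: Vec<f64>, // per-feature variance
    count: usize,       // samples seen so far, used for the running update
}

type ClassInfo = HashMap<usize, ClassStats>;
```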

@bytesnake (Member) left a comment

Thanks for all the work, will merge after addressing these small nitpicks 👍

Review comments on: linfa-bayes/src/gaussian_nb.rs
@bytesnake (Member)

ready to merge?

@VasanthakumarV (Contributor, Author) commented Dec 17, 2020

> ready to merge?

I really don't like the changes I made to the IncrementalFit trait, especially hardcoding Result on the output. I might take another day to make it better.

@VasanthakumarV (Contributor, Author)

@bytesnake What do you think about the following trait for IncrementalFit? Two associated types capture the asymmetry.

```rust
pub trait IncrementalFit<'a, R: Records, T: Targets> {
    type Input: 'a;
    type Output: 'a;

    fn fit_with(&self, model: Self::Input, dataset: &'a Dataset<R, T>) -> Self::Output;
}

// Implementation
impl<A, L> IncrementalFit<'_, ArrayView2<'_, A>, L> for GaussianNbParams
where
    A: Float,
    L: Labels<Elem = usize>,
{
    type Input = Option<GaussianNb<A>>;
    type Output = Result<Option<GaussianNb<A>>>;

    fn fit_with(&self, model_in: Self::Input, dataset: &Dataset<ArrayView2<A>, L>) -> Self::Output {
        ...
    }
    ...
}
```

The problem here is the flexibility: no rule is enforced, but the user has the freedom to return either an Option or an Option wrapped in a Result, should there be a need.

@bytesnake (Member)

Didn't see this yesterday; this definitely gives you more flexibility. I had another approach in mind, which I pushed here: VasanthakumarV#1.
Shouldn't the output be `type Output = Option<Result<GaussianNb<A>>>;`?

@bytesnake (Member) commented Dec 31, 2020

I'm back from the Christmas break, and after it I think your idea is good, especially in combination with try_fold. We should revise the decision for Self::Output once we have Result<T, anyhow::Error>.
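
For illustration, a minimal sketch of how try_fold could thread the optional model through a sequence of batches; the generic signature below is a stand-in, not the actual linfa API:

```rust
// Batch-wise training with `try_fold`: each step feeds the previous
// (optional) model back in and may fail with an error. `fit_with` is a
// hypothetical stand-in closure, not the actual trait method.
fn train_incrementally<P, B, M, E>(
    params: &P,
    batches: &[B],
    fit_with: impl Fn(&P, Option<M>, &B) -> Result<Option<M>, E>,
) -> Result<Option<M>, E> {
    batches
        .iter()
        .try_fold(None, |model, batch| fit_with(params, model, batch))
}
```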

@VasanthakumarV (Contributor, Author)

@bytesnake Do you want to rename the associated types from Input and Output to ObjectIn and ObjectOut, or something else?

@bytesnake (Member)

@VasanthakumarV No, Input and Output are fine and transparent; otherwise this could be InputObj, ObjectIn, or other combinations 😅

@VasanthakumarV (Contributor, Author)

> @VasanthakumarV No, Input and Output are fine and transparent; otherwise this could be InputObj, ObjectIn, or other combinations 😅

I made them ObjectIn and ObjectOut for consistency with the Fit trait. Let me know if you want any changes made.

VasanthakumarV marked this pull request as ready for review on December 31, 2020.
bytesnake merged commit 8bbd641 into rust-ml:master on Dec 31, 2020.
bytesnake mentioned this pull request on Dec 31, 2020.