
Gaussian Naive Bayes (GaussianNB) #51

Merged: 15 commits into rust-ml:master on Dec 31, 2020

Conversation

VasanthakumarV (Contributor)

Port of GaussianNB

@bytesnake (Member)

Will look into this over the weekend :)

@bytesnake (Member) left a comment

For me it makes more sense to structure it as ModelParameter -> FittedModel rather than ProbablyFittedModel -> FittedModel. I added a comment showing how the API would look; what do you think?

Review comments on: linfa-bayes/README.md, linfa-bayes/examples/winequality.rs, linfa-bayes/src/error.rs, linfa-bayes/src/gaussian_nb.rs
@bytesnake (Member)

Could you take a look at https://github.com/rust-ml/linfa/blob/fdf241127fe9058ae98c67a603a2bbb8eccc561c/CONTRIBUTE.md? I'm currently writing a small how-to-contribute document for the upcoming interface changes. If you find anything unclear, please let me know so I can update the document.

@bytesnake (Member) left a comment

I updated the review with regard to merging #55.
If you have any questions about how to implement the new traits, please let me know!

Review comments on: linfa-bayes/Cargo.toml, linfa-bayes/examples/winequality.rs, linfa-bayes/src/error.rs, linfa-bayes/src/gaussian_nb.rs, linfa-bayes/src/lib.rs
@VasanthakumarV (Contributor, Author)

@bytesnake Thank you for the detailed comments; I will try to make the changes by next week.

VasanthakumarV marked this pull request as draft on November 9, 2020.
@VasanthakumarV (Contributor, Author)

@bytesnake For incremental learning the algorithm needs the definite set of classes supplied by the user. Currently I supply them as an argument to the fit_with method, which prevents me from implementing the linfa::traits::IncrementalFit trait.

@bytesnake (Member)

I think you can use fn labels(&self) -> Vec<T::Elem>, which returns a vector of the unique classes. For this you have to restrict your targets like this:

```rust
impl<F: Float, R: Records<Elem = F>, T: Labels<Elem = usize>> IncrementalFit<R, T> for GaussianNbParams<F> {
}
```

In the future the trait could also be extended to accept any countable type:

```rust
impl<F: Float, L: Label, R: Records<Elem = F>, T: Labels<Elem = L>> IncrementalFit<R, T> for GaussianNbParams<F, L> {
}
```

@bytesnake (Member)

Oh, and you could estimate the priors from the samples as well; there is the Dataset::frequencies_with_mask function. It was introduced for decision trees, where you have to mask the data at each split point. You could either pass an all-true slice or introduce a function just called frequencies. It also takes the sample weights into consideration.
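
For illustration, a rough sketch of how the priors could follow from such a frequency map; the signature of frequencies_with_mask is assumed here (boolean mask over the samples in, label-to-frequency map out):

```rust
use std::collections::HashMap;

// A minimal sketch, assuming `frequencies_with_mask` yields a map from label
// to (weighted) frequency: the priors are just the normalized frequencies.
fn priors_from_frequencies(freqs: HashMap<usize, f32>) -> HashMap<usize, f32> {
    let total: f32 = freqs.values().sum();
    freqs
        .into_iter()
        .map(|(label, freq)| (label, freq / total))
        .collect()
}
```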

@VasanthakumarV (Contributor, Author)

@bytesnake I have started using the Labels trait for the dataset targets, and I now use the labels method for accessing the unique entries.

But the issue with incremental learning in naive Bayes is that we need to know the complete list of labels while training on batches of data; the first batch fed to the model will almost never contain all the labels, which is why I have an extra argument in the function signature.

Also, I see the Iter data structure; do you think we will be able to create chunks of Datasets using it? That would be very useful for the incremental learning API.

@VasanthakumarV (Contributor, Author)

> Oh, and you could estimate the priors from the samples as well; there is the Dataset::frequencies_with_mask function. It was introduced for decision trees, where you have to mask the data at each split point. You could either pass an all-true slice or introduce a function just called frequencies. It also takes the sample weights into consideration.

@bytesnake Currently I calculate the priors along with the mean and variance for each class, which requires masked indexing per class.
So rather than Dataset::frequencies_with_mask, a convenience function for masked or regular indexing would be more useful; my implementation is a little ugly, but I can work on a more generic one if you think it would be useful. A sketch of what I mean follows below.
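
For illustration, a sketch of that kind of helper; rows_for_class is a hypothetical name, not an existing linfa function:

```rust
use ndarray::{Array1, Array2, Axis};

// Hypothetical helper for masked indexing by class: collect the row indices
// whose label matches, then select those rows from the records.
fn rows_for_class(x: &Array2<f64>, y: &Array1<usize>, class: usize) -> Array2<f64> {
    let indices: Vec<usize> = y
        .iter()
        .enumerate()
        .filter(|(_, &label)| label == class)
        .map(|(i, _)| i)
        .collect();
    x.select(Axis(0), &indices)
}
```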

@bytesnake (Member)

@VasanthakumarV I think that an incremental model should collect its labels incrementally as well. Would it be possible to form the union of the previously seen labels with newly encountered ones and update the model correspondingly? Otherwise you could also pass the labels as a hyper-parameter and return an error when an unknown label is encountered.
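
A minimal sketch of the first option, assuming plain usize labels:

```rust
use std::collections::HashSet;

// Collect labels incrementally: union the labels seen so far with those
// encountered in the new batch before updating the model.
fn update_label_set(seen: &mut HashSet<usize>, batch_labels: &[usize]) {
    seen.extend(batch_labels.iter().copied());
}
```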

@bytesnake (Member)

> So rather than Dataset::frequencies_with_mask, a convenience function for masked or regular indexing would be more useful; my implementation is a little ugly, but I can work on a more generic one if you think it would be useful.

The API around the dataset is very sparse right now, so any update is welcome, and a function which can select samples based on the class should definitely be in there somewhere 👍 It would be even more useful if you could somehow produce a view of the sub-selected samples, but I don't think that is possible right now.
I want to start a PR renaming Dataset -> DatasetBase and implementing Dataset with Records = Array2<F> and Targets = Array1<L> to simplify the implementation details.

@codecov-io commented Dec 15, 2020

Codecov Report

Merging #51 (2f324f9) into master (1c19f3f) will increase coverage by 0.25%.
The diff coverage is 54.88%.


```diff
@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   49.07%   49.33%   +0.25%
==========================================
  Files          55       57       +2
  Lines        3629     3762     +133
==========================================
+ Hits         1781     1856      +75
- Misses       1848     1906      +58
```
| Impacted Files | Coverage Δ |
| --- | --- |
| linfa-bayes/src/error.rs | 0.00% <0.00%> (ø) |
| linfa-bayes/src/gaussian_nb.rs | 57.03% <57.03%> (ø) |
| linfa-linear/src/glm.rs | 49.54% <0.00%> (ø) |
| src/dataset/impl_targets.rs | 33.33% <0.00%> (ø) |
| linfa-hierarchical/src/lib.rs | 44.92% <0.00%> (ø) |
| linfa-svm/src/classification.rs | 62.32% <0.00%> (ø) |
| linfa-trees/src/decision_trees/algorithm.rs | 52.12% <0.00%> (+0.38%) ⬆️ |
| src/metrics_classification.rs | 50.00% <0.00%> (+0.52%) ⬆️ |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c19f3f...2f324f9.

@VasanthakumarV (Contributor, Author) commented Dec 16, 2020

> @VasanthakumarV I think that an incremental model should collect its labels incrementally as well. Would it be possible to form the union of the previously seen labels with newly encountered ones and update the model correspondingly? Otherwise you could also pass the labels as a hyper-parameter and return an error when an unknown label is encountered.

@bytesnake I went with the first approach from a usability standpoint.
To achieve this I replaced the ndarray matrix with a hash table, which greatly reduced the complexity of the training step and introduced a little complexity in the prediction step.

To implement the incremental learning API, I had to change the IncrementalFit API (link). The big problem right now is that IncrementalFit looks different from the Fit trait. I wasn't able to make Option and Result work together in the old incremental trait, and I am still thinking about ways to achieve what I want with a trait that fits nicely with everything else.
The doc example shows what it takes to train a model incrementally; I hope this matches the expectations we had for it.
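
To illustrate the shape of that change (the names below are hypothetical, not the actual fields in this PR):

```rust
use std::collections::HashMap;

// Hypothetical shape of the per-class state after the switch from an ndarray
// matrix to a hash table: each class maps to its own running statistics,
// which makes per-class incremental updates straightforward.
struct ClassStats {
    prior: f64,         // estimated class prior
    mean: Vec<f64>,     // per-feature mean
    variance: Vec<f64>, // per-feature variance
    count: usize,       // samples seen so far, used for the running update
}

type ClassInfo = HashMap<usize, ClassStats>;
```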

@bytesnake (Member) left a comment

Thanks for all the work, will merge after addressing these small nitpicks 👍

Review comments on: linfa-bayes/src/gaussian_nb.rs
@bytesnake (Member)

ready to merge?

@VasanthakumarV (Contributor, Author) commented Dec 17, 2020

> ready to merge?

I really don't like the changes I made to the IncrementalFit trait, especially hardcoding Result on the output. I might take another day to make it better.

@VasanthakumarV (Contributor, Author)

@bytesnake What do you think about the following trait for IncrementalFit? Two associated types capture the asymmetry.

```rust
pub trait IncrementalFit<'a, R: Records, T: Targets> {
    type Input: 'a;
    type Output: 'a;

    fn fit_with(&self, model: Self::Input, dataset: &'a Dataset<R, T>) -> Self::Output;
}

// Implementation
impl<A, L> IncrementalFit<'_, ArrayView2<'_, A>, L> for GaussianNbParams
where
    A: Float,
    L: Labels<Elem = usize>,
{
    type Input = Option<GaussianNb<A>>;
    type Output = Result<Option<GaussianNb<A>>>;

    fn fit_with(&self, model_in: Self::Input, dataset: &Dataset<ArrayView2<A>, L>) -> Self::Output {
        ...
    }
    ...
}
```

The problem here is the flexibility: no rule is enforced, but the user has the freedom to return either an Option or an Option wrapped in a Result, should there be a need.

@bytesnake (Member)

Didn't see this yesterday; this definitely gives you more flexibility. I had another approach in mind, which I pushed here: VasanthakumarV#1.
Shouldn't the output be `type Output = Option<Result<GaussianNb<A>>>;`?

@bytesnake (Member) commented Dec 31, 2020

I'm back from the Christmas break, and after it I think your idea is good, especially in combination with try_fold. We should revise the decision for Self::Output once we have Result<T, anyhow::Error>.
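
For illustration, a minimal sketch of how try_fold could thread the optional model through a sequence of batches; the generic signature below is a stand-in, not the actual linfa API:

```rust
// Batch-wise training with `try_fold`: each step feeds the previous
// (optional) model back in and may fail with an error. `fit_with` is a
// hypothetical stand-in closure, not the actual trait method.
fn train_incrementally<P, B, M, E>(
    params: &P,
    batches: &[B],
    fit_with: impl Fn(&P, Option<M>, &B) -> Result<Option<M>, E>,
) -> Result<Option<M>, E> {
    batches
        .iter()
        .try_fold(None, |model, batch| fit_with(params, model, batch))
}
```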

@VasanthakumarV (Contributor, Author)

@bytesnake Do you want to rename the associated types from Input and Output to ObjectIn and ObjectOut, or something else?

@bytesnake (Member)

@VasanthakumarV No, Input and Output are fine and transparent; otherwise this could be InputObj, ObjectIn, or other combinations 😅

@VasanthakumarV (Contributor, Author)

> @VasanthakumarV No, Input and Output are fine and transparent; otherwise this could be InputObj, ObjectIn, or other combinations 😅

I made them ObjectIn and ObjectOut for consistency with the Fit trait. Let me know if you want any changes made.

VasanthakumarV marked this pull request as ready for review on December 31, 2020.
bytesnake merged commit 8bbd641 into rust-ml:master on Dec 31, 2020.
bytesnake mentioned this pull request on Dec 31, 2020.