Clarify conventions for public interfaces #8

Open
bplevin36 opened this issue Dec 8, 2019 · 10 comments

@bplevin36

It would be useful, especially to potential contributors, to have a unified description of how public interfaces should be structured. At first glance, I assumed we would be attempting to stay as close to sklearn's conventions as is reasonable given Rust's conventions and syntax. The KMeans implementation, however, departs significantly from sklearn's parameter naming scheme and introduces several additional types for hyperparameter management.

I think it's important in the long run for the public interfaces to be both intuitive and consistent. With that in mind, I think we should:

  1. Start a discussion about design choices for the public interfaces. I personally am not sure that the utility of introducing HyperParams structs and accompanying factory functions justifies the additional API complexity. But I could absolutely be wrong about that. I'd just like to see some rationale.

  2. Write up the conclusions of that discussion in a design doc. This doesn't need to be that complex or in-depth, just a basic statement of conventions and design philosophy to make it easier for contributors.

@LukeMathWalker
Contributor

LukeMathWalker commented Dec 8, 2019

I agree in principle @bplevin36, but I disagree in practice - if nothing else because I tried to do this already in rust-ml/classical-ml-discussion#2, and I quickly realised that it's difficult to have design discussions when there are not enough concrete examples.
Such discussions are hard to follow, and it's hard to tell whether you are going in the right direction or not.
It's extremely valuable to be able to quickly benchmark, against existing functionality, what it would look like to go for one trait definition instead of another.

For the current stage of the project, I would focus on algorithm implementation without spending too much time on traits/interfaces until we have at least three or four algorithms in each relevant class (supervised learning, unsupervised learning, pre-processing functions, etc.).
Refactoring to a common interface takes a fraction of the effort of the initial heavy lifting required to write, test and document a new model implementation.

That said, I do understand that a minimal set of directives/guidelines is required to avoid having contributors wander off in incompatible directions.
At the moment, I would probably suggest the following as contributor guidelines:

  • when you need to reach for a data-structure, favor re-use over adding a new dependency (e.g. ndarray for dense matrices);
  • leverage the strength of the type system to ensure compile-time correctness;
  • don't abuse the type system - it's ok if some checks have to be performed at runtime (no typenum 😅 )

I would add that imitating the API of scikit-learn is not a goal for linfa: scikit-learn is a great library, but its first release came out in 2007 and the authors themselves would probably make different choices if they had a chance to redesign the API today.
Rust is not Python - we should leverage its own strengths.
We should also consider other efforts/API experiments currently taking place in other languages/ecosystems to inform our choices - see for example MLJ in Julia.

It's a different story for linfa's Python bindings - there I would strive to ensure API compatibility, to make sure that linfa can be used with the whole ecosystem surrounding scikit-learn and built on top of its Estimator class.

@LukeMathWalker
Contributor

My current heuristics (which might or might not survive the test of time as we start adding more models to linfa):

  • fit should be used as the constructor for the model type; this ensures that predict can be called without having to check whether the model has been trained (otherwise predict would have to return a Result/Option or panic);
  • most models have several hyperparameters, and in most cases the user doesn't want or need to provide a value for every single one of them; the builder pattern can be used to provide sensible defaults (see the sketch after this list);
  • given a trained model, it should be possible to know which hyperparameters were used to train it (how does this work when a model can be partially/incrementally trained in multiple steps? No idea yet).
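
A minimal sketch of what these heuristics could look like in practice (KMeansHyperParams, KMeans, and the default values are purely illustrative, not the actual linfa-clustering API):

#[derive(Clone, Debug)]
pub struct KMeansHyperParams {
    n_clusters: usize,
    max_iterations: u64,
    tolerance: f64,
}

impl KMeansHyperParams {
    /// Only the hyperparameters without a sensible default are required up front.
    pub fn new(n_clusters: usize) -> Self {
        KMeansHyperParams { n_clusters, max_iterations: 300, tolerance: 1e-4 }
    }

    /// Builder-style setter: override one default, keep everything else.
    pub fn max_iterations(mut self, max_iterations: u64) -> Self {
        self.max_iterations = max_iterations;
        self
    }

    /// `fit` is the only way to obtain a `KMeans`, so a trained model can never
    /// exist in an "unfitted" state and `predict` never needs to fail.
    pub fn fit(self, observations: &[Vec<f64>]) -> KMeans {
        let _ = observations; // ...training loop elided...
        KMeans { hyper_params: self, centroids: Vec::new() }
    }
}

pub struct KMeans {
    hyper_params: KMeansHyperParams,
    centroids: Vec<Vec<f64>>,
}

impl KMeans {
    /// The hyperparameters used for training remain inspectable afterwards.
    pub fn hyper_params(&self) -> &KMeansHyperParams {
        &self.hyper_params
    }
}

Usage would then read let model = KMeansHyperParams::new(8).max_iterations(500).fit(&observations);, with the hyperparameters still accessible afterwards through model.hyper_params().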

What are your thoughts on this @bplevin36?

@mstallmo

mstallmo commented Dec 9, 2019

Just to chip in my two cents here: I think it's a good idea to get a couple of community-contributed examples into the project so we can evaluate what is and isn't working with the initial structure and API for algorithm implementations.

I just started on the implementation of NearestNeighbors and am currently roughly following the structure that linfa-clustering has laid out. My plan is to continue forward with roughly the same structure and API until I hit a point where it doesn't make sense.

I am going to open a work-in-progress PR as soon as I'm happy with the scaffolding so other members of the community can have a look and comment. This seemed like the most reasonable approach to me, since the project is pretty new and a handful of people will be doing parallel implementations, most likely hitting similar problems related to project structure and API design.

@bplevin36
Author

What are your thoughts on this @bplevin36?

These two comments alone help a lot in terms of a unifying strategy. The only thing I would caution against is postponing finalization of the public interface structure for too long. If there's one constant I've observed in the Rust ecosystem, it's that large teams have overlooked Rust for projects it would be really good for because of API instability, both in the core language and in popular projects.

I'll continue working on the PR I have in mind for linfa and we'll see where the discussion chains go :)

@Nimpruda

If what we seek is adoption, following sklearn's API is probably the best idea.

@Ten0

Ten0 commented Dec 26, 2019

I personally am not sure that the utility of introducing HyperParams structs and accompanying factory functions justifies the additional API complexity. But I could absolutely be wrong about that. I'd just like to see some rationale.

If what we seek is adoption, following sklearn's API is probably the best idea.

I think the terminology, function names, etc. should be kept to some extent. But Rust gives us the opportunity to, for instance, guarantee at compile time that a model has been fitted before .predict can be called on it - something that would otherwise have to be represented as Options with unwraps everywhere, which is more verbose and makes bugs and crashes easier to create, both in production systems and when exploring data. So I strongly agree with @LukeMathWalker and jblondin that we're much better off leveraging the type system by having separate types for the hyperparameters.

This would also allow for some nice abstractions - for instance, easily creating model pipelines as tuples of models that would themselves implement Model, where the pipeline's hyperparameters would be the tuple of the hyperparameters of each model.
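
For concreteness, a hypothetical sketch of that idea, reusing the Model/HyperParameters traits defined further down in this comment (Pipeline is a made-up name):

// Tuples of hyperparameters are themselves valid hyperparameters
// (std already provides Clone + Default for tuples of Clone + Default types).
impl<X: HyperParameters, Y: HyperParameters> HyperParameters for (X, Y) {}

// A two-stage pipeline is itself a Model whose hyperparameters are the tuple
// of its stages' hyperparameters.
// (Storing the tuple alongside the stages duplicates data; it's just a sketch.)
struct Pipeline<A: Model, B: Model> {
    first: A,
    second: B,
    hyper_parameters: (A::HyperParameters, B::HyperParameters),
}

impl<A: Model, B: Model> Model for Pipeline<A, B> {
    type HyperParameters = (A::HyperParameters, B::HyperParameters);
    fn hyper_parameters(&self) -> &Self::HyperParameters {
        &self.hyper_parameters
    }
}

fit for such a pipeline would chain the stages (fit the first, transform the data, fit the second), but the details depend on how the transform/predict traits end up being defined.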

This would imply deviating from the

clf = Something(alpha = 0.1)
clf.fit(X, y)
y2 = clf.predict(X2)

of sklearn, into something like:

let hp = Something { alpha: 0.1, ..Default::default() };
let clf = hp.fit(X, y);
let y2 = clf.predict(X2);

given a trained model, it should be possible to know which hyperparameters were used to train it (how does this work when a model can be partially/incrementally trained in multiple steps? No idea yet).

Given that the set of hyperparameters is typically rather small, and that copying it takes a negligible amount of time compared to the fitting itself, I would simply store a copy of the hyperparameters in the model - it's a very good compromise in terms of simplicity/flexibility/performance compared to storing references/Rcs/Arcs/generics-on-std::borrow::Borrow. (fit would take ownership of the HyperParameters; if there are few enough of them, the specific HyperParameters struct would implement Copy, and of course we would always have HyperParameters: Clone.)
Models whose set of hyperparameters is too complex for that could still use the same interface, but have the HyperParameters for their model be a reference, or even be generic on any Borrow<SomeModelHyperParameters> with impl<B: Borrow<SomeModelHyperParameters>> Model for SomeModel<B> for full flexibility (a sketch of this variant follows the interface below).

The interface I have in mind would look something like this:

pub trait HyperParameters: Clone + Default {}

pub trait Model {
    type HyperParameters: HyperParameters;
    fn hyper_parameters(&self) -> &Self::HyperParameters;
}

#[derive(Clone, Debug, Default)]
struct SomeModelHyperParameters {}
struct SomeModel {
    hyper_parameters: SomeModelHyperParameters,
    /// When fitted, this and other similar variables get filled:
    parameters: (),
}

// These three could be part of derives
impl HyperParameters for SomeModelHyperParameters {}

impl Model for SomeModel {
    type HyperParameters = SomeModelHyperParameters;
    fn hyper_parameters(&self) -> &Self::HyperParameters {
        &self.hyper_parameters
    }
}

// For ease of access to parameters from rest of code, incl. `fit` & the like
impl std::ops::Deref for SomeModel {
    type Target = <Self as Model>::HyperParameters;
    fn deref(&self) -> &Self::Target {
        self.hyper_parameters()
    }
}

// This is how you would typically call fit after setting your parameters:

pub trait Fit<FD>: Model + Sized {
    fn fit(hyper_parameters: <Self as Model>::HyperParameters, fit_data: FD) -> Self;
}

pub trait TargetedHyperParameters: HyperParameters {
    type Model: Model<HyperParameters = Self>;
    fn fit<FD>(self, fit_data: FD) -> Self::Model
    where
        Self::Model: Fit<FD>,
    {
        <Self::Model as Fit<FD>>::fit(self, fit_data)
    }
}
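
And a hypothetical sketch of the "generic on Borrow" variant mentioned earlier, for models whose hyperparameter set is too heavy to store by value (SomeBigModel is a made-up name):

use std::borrow::Borrow;

struct SomeBigModel<B: Borrow<SomeModelHyperParameters>> {
    // Can be owned (B = SomeModelHyperParameters), borrowed (B = &SomeModelHyperParameters),
    // shared (B = Arc<SomeModelHyperParameters>), etc.
    hyper_parameters: B,
    parameters: (),
}

impl<B: Borrow<SomeModelHyperParameters>> Model for SomeBigModel<B> {
    type HyperParameters = SomeModelHyperParameters;
    fn hyper_parameters(&self) -> &Self::HyperParameters {
        self.hyper_parameters.borrow()
    }
}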

Now, given that, the model holds its hyperparameters, and it seems easy to factorize fit using incremental_fit (where usually fit would be responsible for setting up the base data and incremental_fit for the learning process).

otherwise predict would have to return a Result/Option or panic

I can see a few scenarios where fitting could fail: invalid hyperparameters, invalid data, a distributed back-end service that disconnects... We could have a TryFit trait (a mix between Fit and TryInto) to handle these scenarios, and add the corresponding try_fit function to the TargetedHyperParameters trait.
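
Something along these lines (only a sketch; the names are up for debate):

pub trait TryFit<FD>: Model + Sized {
    type Error;
    fn try_fit(
        hyper_parameters: <Self as Model>::HyperParameters,
        fit_data: FD,
    ) -> Result<Self, Self::Error>;
}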

What do you think?

@LukeMathWalker
Contributor

LukeMathWalker commented Dec 27, 2019

given a trained model, it should be possible to know which hyperparameters were used to train it (how does this work when a model can be partially/incrementally trained in multiple steps? No idea yet).

Given that the set of hyperparameters is typically rather small, and that copying it takes a negligible amount of time compared to the fitting itself, I would simply store a copy of the hyperparameters in the model - it's a very good compromise in terms of simplicity/flexibility/performance compared to storing references/Rcs/Arcs/generics-on-std::borrow::Borrow. (fit would take ownership of the HyperParameters; if there are few enough of them, the specific HyperParameters struct would implement Copy, and of course we would always have HyperParameters: Clone.)
Models whose set of hyperparameters is too complex for that could still use the same interface, but have the HyperParameters for their model be a reference, or even be generic on any Borrow<SomeModelHyperParameters> with impl<B: Borrow<SomeModelHyperParameters>> Model for SomeModel<B> for full flexibility.

I was referring to a different problem - subsequent calls to partial_fit might use different hyperparameters (e.g. a different learning rate if we are talking about neural networks [which are not in scope for linfa]). How do we preserve the full history of hyperparameters?

The sketch of those traits looks quite interesting - not so convinced by the Deref blanket implementation for Model, but as you said it's about ergonomics, which usually become clearer with usage.
A couple of practical questions, coming from the two models we have now in linfa (KMeans and DBSCAN):

  • How do we handle randomness? Do we consider the random number generator one of the hyperparameters? Do we accept it as an additional parameter on fit (but then, not all models need one - e.g. DBSCAN)?
  • How do we handle models that do not really "store" a learned representation of the data for subsequent use - e.g. DBSCAN? Do we treat them as "transformers" - a.k.a. they only implement fit_and_predict, nothing else?

@Ten0

Ten0 commented Dec 28, 2019

These are very complex questions! :) Let's see...

Subsequent calls to partial_fit might use different hyperparameters. How do we preserve the full history of hyperparameters?

Saving the full history seems to me like a pretty niche use case, so I wouldn't necessarily always store it in the model (it would require a Vec, hence more verbosity and memory allocations that would not necessarily be useful...).
I guess there could be a trait for updating the hyperparameters (EditableHyperParameters: Model providing hyper_parameters_mut), implemented mostly on models with incremental fitting (IncrementalFit: Model providing incremental_fit) where applicable.
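
A minimal sketch of those two hypothetical traits, following the Model trait above:

pub trait EditableHyperParameters: Model {
    fn hyper_parameters_mut(&mut self) -> &mut <Self as Model>::HyperParameters;
}

pub trait IncrementalFit<FD>: Model {
    /// Update an already-fitted model with an additional batch of data,
    /// using whatever the current hyperparameters are.
    fn incremental_fit(&mut self, fit_data: FD);
}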

Then one could create a wrapper struct for any model that has EditableHyperParameters, which also saves the history in a Vec along the way and reimplements all the other traits, forwarding them to the underlying struct:

struct ModelWithHPHistory<M: Model> {
    model: M,
    previous_hyper_parameters: Vec<M::HyperParameters>,
}
impl<M: Model> Model for ModelWithHPHistory<M> {
    type HyperParameters = M::HyperParameters;
    fn hyper_parameters(&self) -> &Self::HyperParameters {
        self.model.hyper_parameters()
    }
}
impl<M: Model, FD> Fit<FD> for ModelWithHPHistory<M>
where
    M: Fit<FD>,
{
    fn fit(hyper_parameters: <Self as Model>::HyperParameters, fit_data: FD) -> Self {
        ModelWithHPHistory {
            model: M::fit(hyper_parameters, fit_data),
            previous_hyper_parameters: Vec::new(),
        }
    }
}
// etc.

But if it's niche enough, users who need it could just store it manually. I don't think I've seen that feature anywhere in SKLearn.

How do we handle models that do not really "store" a learned representation of the data for subsequent use - e.g. DBSCAN? Do we treat them as "transformers" - a.k.a. they only implement fit_and_predict, nothing else?

We should note that in the SKLearn terminology (which I like around transform and predict):

  • a transformer denotes the presence of transform or fit_transform, and not fit_predict (e.g. PCA is a transformer), so I suppose you were talking about what would correspond to the ClusterMixin in SKLearn?
  • ClusterMixins (trait FitPredict? Or something more explicit with regards to their purpose?) are not necessarily purely transductive (that is, some of them may be able to handle unseen data through a predict function, e.g. KMeans).

I think we'd want both KMeans and DBSCAN to have the same trait related to their ability to fit_predict, while KMeans could also provide implementations for fit (impl Fit) and predict (impl Predict?) separately.
Since we should still be able to call predict on a k_means model (e.g. k_means = KMeansHyperParameters::default().fit(X)) even when it was generated through fit_predict, I feel that fit_predict should output both the model and the predicted classes, so I think we should go for something like this:

pub trait Predict<Input, Output>: Model {
    fn predict(&self, x: Input) -> Output;
}

pub trait FitPredict<FD, Output>: Model + Sized {
    fn fit_predict(
        hyper_parameters: <Self as Model>::HyperParameters,
        fit_data: FD,
    ) -> (Self, Output);
}

This would imply that even DBSCAN would output something, be it a nearly-empty struct containing just the hyperparameters, but I think that's preferable: later in development we might want it to gain other post-fit capabilities, like details about how the fit went, or even an actual Predict implementation that classifies a point as noise if it's too far from its neighbours, and as the cluster of its closest point otherwise...

Once specialization is stabilized, we could provide a default impl of FitPredict for models that implement Fit and Predict for the same data (but until then I think I wouldn't, because the conflicting impls would prevent providing a more performant implementation than calling one after the other).

About Output being a generic parameter instead of an associated type in the code above, I think it's better for two reasons:

  1. It corresponds more to what it really is: the ability to produce a prediction in some output format from an input in some format. We could imagine a single model being able to predict, from the same input, both owned and non-owned versions of a prediction (see the sketch after this list).
  2. If the predict trait for a given input is only implemented for a single output, the compiler will be able to infer it without requiring a type annotation, just as if it were an associated type, so this formulation does not by itself imply more verbosity than an associated type.
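
To illustrate point 1, the same model could implement Predict for several output types from the same input. This reuses SomeModel from the earlier sketch; the input/output types (including ndarray's Array1, used here only as an example) are made up:

use ndarray::Array1;

// Owned predictions as a Vec: one label per observation.
impl<'a> Predict<&'a [Vec<f64>], Vec<usize>> for SomeModel {
    fn predict(&self, x: &'a [Vec<f64>]) -> Vec<usize> {
        x.iter().map(|_observation| 0).collect() // ...real lookup elided...
    }
}

// Same input, a different output representation - made possible by Output
// being a generic parameter rather than an associated type.
impl<'a> Predict<&'a [Vec<f64>], Array1<usize>> for SomeModel {
    fn predict(&self, x: &'a [Vec<f64>]) -> Array1<usize> {
        x.iter().map(|_observation| 0).collect()
    }
}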

How do we handle randomness? Do we consider the random number generator one of the hyperparameters? Do we accept it as an additional parameter on fit (but then, not all model need one - e.g. DBSCAN)?

I agree we probably need to be able to customize RNGs, in particular seeding them to get reproducible results.

I've started analyzing what rand provides us and giving it some thought but for now I still need to give it some more... Still, here are my current thoughts:

  • Given that, as you pointed out, some models don't need it, and that a lot of the time we don't need to customize it, I don't think it should be a parameter on fit.
  • If it lived on another trait providing fit_with_rng, we'd also have to duplicate fit_predict and fit_transform, and all of these may be used in a lot of places whose code would then have to be duplicated as well...
  • If it were a generic RNG: rand::Rng in the hyperparameters, we'd have to make the model and hyperparameters generic over any RNG: rand::Rng for every impl, which is awfully verbose.
  • If it were an enum in the hyperparameters, we'd have to implement said enum, and calls to it would have to resolve the enum variant, so I would imagine a possible performance loss depending on how well the LLVM optimizer handles that particular piece of code.
  • If it were a global, it would raise some questions with regards to parallel algorithms (questions that may still arise in the other cases, but perhaps more manageably) and with workflows where you want one model's training to behave reproducibly no matter how many calls to the RNG were made before training that specific model started.

@LukeMathWalker
Contributor

But if it's niche enough, users who need it could just store it manually. I don't think I've seen that feature anywhere in SKLearn.

Let's give our minds permission to wander outside what has already been explored a little bit 😁
I think the cost of cloning/allocating a vector of hyperparameters is negligible (for the same reasons we were discussing a couple of posts ago), hence I'd generally lean towards providing best practice out of the box (full history of training for a model) with an opt-out escape hatch, rather than the opposite.
For industry usage, especially in regulated contexts, reproducibility is indeed quite a key feature.

How do we handle models that do not really "store" a learned representation of the data for subsequent use - e.g. DBSCAN? Do we treat them as "transformers" - a.k.a. they only implement fit_and_predict, nothing else?

The FitPredict sketch seems like a good idea.

About Output being a generic parameter instead of an associated type in the code above, I think it's better for two reasons:

1. It corresponds more to what it really is: the ability to produce a prediction in some output format from an input in some format. We could imagine a single model being able to predict, from the same input, both owned and non-owned versions of a prediction.

2. If the predict trait for a given input is only implemented for a single output, the compiler will be able to infer it without requiring a type annotation, just as if it were an associated type, so this formulation does not by itself imply more verbosity than an associated type.

I think we will have to evaluate the ergonomics of Output as an associated type rather than a generic parameter in practice depending on how many (input, output) combinations we end up supporting.

I agree we probably need to be able to customize RNGs, in particular seeding them to get reproducible results.

I've started analyzing what rand provides us and giving it some thought but for now I still need to give it some more... Still, here are my current thoughts:

* Given that, as you pointed out, some models don't need it, and that a lot of the time we don't need to customize it, I don't think it should be a parameter on `fit`.

* If it lived on another trait providing `fit_with_rng`, we'd also have to duplicate `fit_predict` and `fit_transform`, and all of these may be used in a lot of places whose code would then have to be duplicated as well...

* If it were a generic `RNG: rand::Rng` in the hyperparameters, we'd have to make the model and hyperparameters generic over any `RNG: rand::Rng` for every impl, which is awfully verbose.

* If it were an `enum` in the hyperparameters, we'd have to implement said enum, and calls to it would have to resolve the `enum` variant, so I would imagine a possible performance loss depending on how well the LLVM optimizer handles that particular piece of code.

* If it were a global, it would raise some questions with regards to parallel algorithms (questions that may still arise in the other cases, but perhaps more manageably) and with workflows where you want one model's training to behave reproducibly no matter how many calls to the RNG were made before training that specific model started.

There is also another option: make seed a required hyperparameter of all models which rely on randomness and let the choice of the specific SeedableRng be an implementation detail of linfa.
This reduces flexibility, but keeps reproducibility while probably simplifying the code using the library (no need to mess around with rand, just provide a seed).
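
A hypothetical sketch of that option (KMeansHyperParams is again just illustrative; the point is that the public surface only exposes a plain seed, while the concrete SeedableRng stays internal):

use rand::rngs::StdRng;
use rand::SeedableRng;

#[derive(Clone, Debug)]
pub struct KMeansHyperParams {
    n_clusters: usize,
    /// Required for models relying on randomness: same seed, same result.
    seed: u64,
}

impl KMeansHyperParams {
    pub fn fit(self, observations: &[Vec<f64>]) -> KMeans {
        // The concrete RNG is an implementation detail of linfa and can be
        // swapped without touching the public interface.
        let mut rng = StdRng::seed_from_u64(self.seed);
        let _ = (&mut rng, observations); // ...randomised centroid initialisation elided...
        KMeans { hyper_params: self }
    }
}

pub struct KMeans {
    hyper_params: KMeansHyperParams,
}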

@Ten0

Ten0 commented Dec 28, 2019

There is also another option: make seed a required hyperparameter of all models which rely on randomness and let the choice of the specific SeedableRng be an implementation detail of linfa.
This reduces flexibility, but keeps reproducibility while probably simplifying the code using the library (no need to mess around with rand, just provide a seed).

I agree, this option looks better than all the other ones! :)

Let's give our minds permission to wander outside what has already been explored a little bit 😁
I think the cost of cloning/allocating a vector of hyperparameters is negligible (for the same reasons we were discussing a couple of posts ago), hence I'd generally lean towards providing best practice out of the box (full history of training for a model) with an opt-out escape hatch, rather than the opposite.
For industry usage, especially in regulated contexts, reproducibility is indeed quite a key feature.

I agree that it wouldn't be bad to provide that feature, and that the Vec wouldn't cost much. However, I don't necessarily agree that embedding it in every model provides the best interface: what about the verbosity point? It would mean that all these models have to store their hyperparameter history themselves and handle the complexity of that additional vector (and of saving the history - duplicate code, or trait interfaces that shouldn't be public but would have to be...) instead of having a single field for their hyperparameter set. That would make it harder for both model implementors and readers to understand what's going on.
Considering how little use I would have had for this feature, I would imagine it wouldn't be used a lot (I may be wrong here), and I would estimate that this added complexity in all these models is not worth sparing the user from typing let clf = ModelWithHPHistory::<SomeModel>::fit(hp, data); whenever the feature is needed.

Also, I guess saving the HP history would not be all you need to reproduce: you would also need the incremental data history synchronized with the HP history, which you probably don't want to store in the model without an opt-in because it would be too large/complex, so you would still have some more code to write, in the middle of which ModelWithHPHistory::<SomeModel> starts to look really small.

I also feel like it's conceptually closer to precisely what we need: sometimes, for models where the hyperparameters are updatable, we want to record every change made to them - and "being conceptually close" is probably my favorite way of designing things so that they scale well with new features.

I think we will have to evaluate the ergonomics of Output as an associated type rather than a generic parameter in practice depending on how many (input, output) combinations we end up supporting.

You're right, supporting some rarely useful edge cases could hurt ergonomics in the most common case here. (Oh, that looks a lot like the previous problem! ^^')

I think in order to solve this we could have the "ability for output" and the "typical output" be separate traits (adding a TargetedFit or something along those lines), in a similar fashion to what I did with TargetedHyperParameters and Fit - that way we'd get both low verbosity and the flexibility to provide the most performant implementations. (We just have to find some nice naming, because if both methods are called fit, Rust would require the explicit syntax, which wouldn't be very ergonomic. ^^)
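
Something like this, perhaps (names are placeholders; the generic FitPredict keeps the full flexibility, while this one pins down the typical output so the common case needs no type annotation):

pub trait TargetedFitPredict<FD>: Model + Sized {
    /// The "typical" prediction type for this model and this kind of fit data.
    type Output;
    fn fit_predict_targeted(
        hyper_parameters: <Self as Model>::HyperParameters,
        fit_data: FD,
    ) -> (Self, Self::Output);
}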
