This repository has been archived by the owner on Aug 12, 2020. It is now read-only.

Design exploration #2

Open · wants to merge 35 commits into master

Conversation

LukeMathWalker
Contributor

I have started to play around with some traits to explore how we could structure the different concepts in a ML workflow.

For now I have kept it very simple:

  • a Model trait (should it be renamed to Transformer?);
  • a Blueprint trait (serving as initializer for Model types, it holds the model configuration);
  • an Optimizer trait (encoding the training step).

For the same Model type we could potentially have multiple Blueprints, each one providing a different parametrization of the space of possible models, as well as multiple Optimizers.

Model, as defined here, could potentially be used to represent any kind of transformation (e.g. preprocessing steps).
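
As a rough illustration of how these three concepts might relate (names follow the description above, but the signatures are assumptions rather than the actual code in this PR):

use std::error;

/// A (possibly trained) transformation from inputs to outputs.
pub trait Model {
    type Input;
    type Output;

    fn predict(&self, input: &Self::Input) -> Self::Output;
}

/// Holds the configuration needed to build a `Model`;
/// initialization may be data-dependent.
pub trait Blueprint<M: Model> {
    type Error: error::Error;

    fn initialize(&self, inputs: &M::Input, targets: &M::Output) -> Result<M, Self::Error>;
}

/// Encodes the training step; it never has to initialize the model itself.
pub trait Optimizer<M: Model> {
    type Error: error::Error;

    /// Consume a model and return a (hopefully improved) one.
    fn train(&self, inputs: &M::Input, targets: &M::Output, model: M) -> Result<M, Self::Error>;
}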

I am now trying to come up with something to encode the concept of a pipeline or network of transformations, but I have not nailed it down yet.

@LukeMathWalker
Contributor Author

I have added another trait, called BlueprintGenerator, to mark the possible parametrizations of a Blueprint - mostly with the final aim of performing some kind of hyperparameter optimization routine (grid search, random search, bayesian fancy search, etc.).

@jblondin (Collaborator) left a comment

Looking good!

I think I need to see a few example workflows using this trait set to really consider the ergonomics. If I have some time over the next couple days, I'll take a crack at adding some example code to this pull request (most likely with no-op estimators).

src/lib.rs Outdated
/// In the same way, it has no notion of loss or "correct" predictions.
/// Those concepts are embedded elsewhere.
pub trait Model {
type Input;
Collaborator

I'm trying to think whether Input and Output should be associated types or struct generics (Model<Input, Output>). It's definitely possible that a trained model could be implemented to provide predictions over multiple types of input / output. For instance, we could have a model defined over ndarray input, or dataframe input, or even a Vec<T>.

I could also see a case for Model<Input> with Output being an associated type -- given a particular input, the output could only be a specific type.
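
For concreteness, the two options might look roughly like this (illustrative names and signatures only; ModelOver is just a placeholder to distinguish the variants):

// Option A: fully generic -- the same model type could implement the trait
// for several input/output combinations (ndarray, dataframe, Vec<T>, ...).
pub trait Model<Input, Output> {
    fn predict(&self, input: &Input) -> Output;
}

// Option B: generic over the input only, with the output fixed per input type.
pub trait ModelOver<Input> {
    type Output;

    fn predict(&self, input: &Input) -> Self::Output;
}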

src/lib.rs Outdated
/// This means that there is no difference between one-shot training and incremental training.
/// Furthermore, the optimizer doesn't have to "own" the model or know anything about its hyperparameters,
/// because it never has to initialize it.
pub trait Optimizer<M>
Collaborator

Wording: Optimizer or something like Estimator? Optimizer might be confusing given that some algorithms are actually optimization algorithms, but others aren't.

src/lib.rs Outdated
/// Each of these strategies can take different (hyper)parameters, even though they return an
/// instance of the same model type in the end.
///
/// The initialization procedure could be data-dependent, hence the signature of `initialize`.
Collaborator

I'm a bit concerned about potential user confusion about what should be put in the Blueprint's initialize method vs Optimizer's train method, given the similarities in method signatures (they both take input and targets, they both return models).

What would be an example of a workflow with a data-dependent initialization? Are there any other options for handling that initialization?

@LukeMathWalker
Contributor Author

I have been trying to sketch out some usage and I ran into some of the issues you have identified.
So let me take a step back - what do I want to achieve?
My goals are:

  • use the type system to avoid incorrect or inconsistent usage of models. From a design perspective, I'd like to have models as state machines, e.g. they can only be in a finite set of states and there are clear transition paths between states. Nonsensical transitions are not allowed (e.g. making predictions with an untrained model);
  • for the same model class, it should be possible to specify configuration APIs at different levels of granularity/control, without having all the extensions in the main library (another crate could provide an alternative constructor for our SVMs with fewer hyperparams because some interesting/smart defaults have been found out to work very well);
  • for the same model class, it should be possible to use different training methodologies, without having to duplicate the model code itself. For example, you might want to do the first training pass of your recommendation engine on a lot of public data using a fairly standard linear-algebra method, while you might want to resort to gradient descent to fine-tune it on your much smaller proprietary dataset. In the same spirit, some training algorithms could be more or less high-level, thus allowing the re-use of this crate as the engine of a more off-the-shelf solution (see PyTorch vs Fast.ai or Keras vs Tensorflow);
  • for the same model class, we should be able to provide different training routines depending on the data coming in (Is it batched? Is it just a huge table?);
  • the space of primitive "concepts" should be small. For example, it should be possible to express pipelines and pipeline optimization using the same set of traits used to express single models and their optimization routines. They should be, in a certain sense, recursive.

Looking back at what I have written, and at your comments @jblondin, I can see how this draft fails to accommodate some of these requirements. I'll try to put down a revised sketch tonight, ideally with some example code using no-op estimators or very simple estimators (computing the mean).

@jblondin
Collaborator

Some brief thoughts on the goals:

  • use the type system to avoid incorrect or inconsistent usage of models. From a design perspective, I'd like to have models as state machines, e.g. they can only be in a finite set of states and there are clear transition paths between states. Nonsensical transitions are not allowed (e.g. making predictions with an untrained model);

Agreed. I also would prefer a mutation-free workflow (which you already have with the optimizer consuming the model and creating a new one). In other words, nothing like this:

// would NOT prefer this style
let mut model = Model::from(blueprint);
model.train(train_data, targets);
let predictions = model.predict(test_data);
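
By contrast, a mutation-free flow along those lines (hypothetical names, mirroring the snippet above) would read:

// Preferred: the optimizer consumes the model and returns a new, trained one.
let model = blueprint.initialize(&train_data, &targets)?;
let model = optimizer.train(&train_data, &targets, model)?;
let predictions = model.predict(&test_data);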
  • for the same model class, it should be possible to specify configuration APIs at different levels of granularity/control, without having all the extensions in the main library (another crate could provide an alternative constructor for our SVMs with fewer hyperparams because some interesting/smart defaults have been found out to work very well);

So, a trait-based Blueprint concept, like you have? A downstream crate could just create a new struct that implements Blueprint that can be used to create a new model?

  • for the same model class, it should be possible to use different training methodologies, without having to duplicate the model code itself. For example, you might want to do the first training pass of your recommendation engine on a lot of public data using a fairly standard linear-algebra method, while you might want to resort to gradient descent to fine-tune it on your much smaller proprietary dataset. In the same spirit, some training algorithms could be more or less high-level, thus allowing the re-use of this crate as the engine of a more off-the-shelf solution (see PyTorch vs Fast.ai or Keras vs Tensorflow);

This seems like it would be useful for transfer learning tasks -- taking a model trained with one algorithm / data set, and then updating it (or a subset of it) with another algorithm. The model could even support different components that are trained differently. In the deep learning / CNN use case, the convolutional layers are usually transferred, and the fully-connected neural network at the 'end' of the network is retrained for the new learning problem.

  • the space of primitive "concepts" should be small. For example, it should be possible to express pipelines and pipeline optimization using the same set of traits used to express single models and their optimization routines. They should be, in a certain sense, recursive.

Agreed. A pipeline could have a pipeline component. I would love to be able to just define an SVM pipeline, then use that as a component in a Bayesian optimization pipeline for model selection without needing new 'concepts'.

@LukeMathWalker
Contributor Author

LukeMathWalker commented May 15, 2019

Some brief thoughts on the goals:

  • use the type system to avoid incorrect or inconsistent usage of models. From a design perspective, I'd like to have models as state machines, e.g. they can only be in a finite set of states and there are clear transition paths between states. Nonsensical transitions are not allowed (e.g. making predictions with an untrained model);

Agreed. I also would prefer a mutation-free workflow (which you already have with the optimizer consuming the model and creating a new one). In other words, nothing like this:

// would NOT prefer this style
let mut model = Model::from(blueprint);
model.train(train_data, targets);
let predictions = model.predict(test_data);

Same feeling - Rust gives us move semantics, which we can use to have optimized routines (using mutation inside the method) while still providing a more side-effect-free API to consumers.
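
Concretely, a training routine can take the model by value, mutate it internally for performance, and hand ownership back; a minimal sketch with placeholder types:

// Mutation stays inside the method; the public API stays move-based.
fn train(&self, inputs: &Inputs, targets: &Targets, mut model: Model) -> Result<Model, Error> {
    model.update_weights(inputs, targets); // in-place update, invisible to the caller
    Ok(model)                              // the caller only ever sees the returned model
}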

  • for the same model class, it should be possible to use different training methodologies, without having to duplicate the model code itself. For example, you might want to do the first training pass of your recommendation engine on a lot of public data using a fairly standard linear-algebra method, while you might want to resort to gradient descent to fine-tune it on your much smaller proprietary dataset. In the same spirit, some training algorithms could be more or less high-level, thus allowing the re-use of this crate as the engine of a more off-the-shelf solution (see PyTorch vs Fast.ai or Keras vs Tensorflow);

This seems like it would be useful for transfer learning tasks -- taking a model trained with one algorithm / data set, and then updating it (or a subset of it) with another algorithm. The model could even support different components that are trained differently. In the deep learning / CNN use case, the convolutional layers are usually transferred, and the fully-connected neural network at the 'end' of the network is retrained for the new learning problem.

This was exactly one of my driving examples.

I have done another iteration; unfortunately I didn't manage to find the time to provide a code example, but I'd still appreciate your feedback @jblondin.
What I have changed (a rough sketch of the resulting traits follows the list):

  • I have changed Model to Transformer, given that they could either be proper estimators or preprocessing steps;
  • I have changed Optimizer to Fit, as you suggested that optimization could be seen as a too narrow concept;
  • I have introduced IncrementalFit, to distinguish between an additional training round on the same transformer and the initial training round from a Blueprint;
  • Transformer is now an associated type of Blueprint: I couldn't find any meaningful example where I would reuse the same configuration for different model types. If we find one, it's not difficult to change it into a generic parameter again;
  • Transformer is now generic over both input and output. The reason for being generic over input is quite clear; when it comes to output, my main concern was predicting classes/single-value vs returning a probability distribution. We could potentially support both in a single trait with a generic output type.
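
A rough sketch of how the revised pieces might fit together (illustrative signatures only, not the exact code in src/lib.rs):

use std::error;

/// Any transformation from an input type to an output type
/// (an estimator or a preprocessing step).
pub trait Transformer<I, O> {
    fn transform(&self, inputs: &I) -> O;
}

/// Configuration from which a specific transformer type can be built.
pub trait Blueprint<I, O> {
    type Transformer: Transformer<I, O>;
}

/// First training round: from a blueprint to a fitted transformer.
pub trait Fit<B, I, O>
where
    B: Blueprint<I, O>,
{
    type Error: error::Error;

    fn fit(&self, inputs: &I, targets: &O, blueprint: B) -> Result<B::Transformer, Self::Error>;
}

/// Additional training rounds on an already-fitted transformer.
pub trait IncrementalFit<T, I, O>
where
    T: Transformer<I, O>,
{
    type Error: error::Error;

    fn incremental_fit(&self, inputs: &I, targets: &O, transformer: T) -> Result<T, Self::Error>;
}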

@LukeMathWalker
Contributor Author

I have added a first, very simple example: standard scaler, supporting one-off and incremental computation of both mean and standard deviation.
Let me know how it feels @jblondin.

The main issue I experienced is around optimizers: I had to modify both Fit and IncrementalFit to take self as a mutable reference; otherwise there was no way for me to record the number of samples that had been seen so far. How should we work around this? Returning a tuple, Result<(T: Transformer, F: Fit), Self::Error>?
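
For reference, the tuple-returning alternative mentioned above might look like this (a hypothetical signature, assuming the Blueprint<I, O> trait discussed earlier):

pub trait Fit<B, I, O>
where
    B: Blueprint<I, O>,
{
    type Error: std::error::Error;

    /// Consumes the optimizer state and returns it alongside the fitted
    /// transformer, instead of taking `&mut self`.
    fn fit(self, inputs: &I, targets: &O, blueprint: B)
        -> Result<(B::Transformer, Self), Self::Error>
    where
        Self: Sized;
}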

@jblondin
Collaborator

Sorry for the delay in getting to this! I've been a bit backed up the past week or so. I'm going to have to give this some thought, but here's a few quick comments...

The main issue I experienced is around optimizers: I had to modify both Fit and IncrementalFit to take self as a mutable reference; otherwise there was no way for me to record the number of samples that had been seen so far. How should we work around this? Returning a tuple, Result<(T: Transformer, F: Fit), Self::Error>?

My initial reaction is that the number of samples should actually be an update to the config (Blueprint) instead of the Fit object itself. But this would require adding a Blueprint parameter to the IncrementalFit method. I feel like this should be the case, though, since you're basically using the Transformer object to carry through the ddof config setting to IncrementalFit -- the ddof isn't actually used in the prediction (transformation) step at all.

Of course, even if you do pass a configuration to incremental_fit, this configuration would need to either be mutated or returned by the initial fit, which gets us back to your initial issue, which I don't have a good solution for.

I feel like this demonstrates that this Fit -> IncrementalFit methodology may be a bit unwieldy. I don't have concrete suggestions to resolve this at the moment; I've started playing with some of my own examples on this framework locally to see if I can come up with something. I'll hopefully have some more thoughts in the next couple days as I work with it!

@jblondin
Collaborator

jblondin commented May 29, 2019

One more thought - I like using the generic name Transformer for something that transforms an input set to an output set, but it's less clear what it does when functioning as a predictive model. Some of this can be resolved with good documentation.

We could also have Model be a separate trait with Transformer as a supertrait -- I'm secretly hoping we find some functionality that Model has that Transformer doesn't need that would necessitate this separation of traits 😆

@jblondin
Collaborator

I think I prefer the original workflow, without the separate IncrementalFit. I'm not sure what the advantage is of moving the model initialization code outside the Blueprint -- the blueprint seems like a good place to represent the initial model state.

pub trait Blueprint<I, O> {
    type Transformer: Transformer<I, O>;
    fn initialize(&self) -> Self::Transformer;
}

pub trait Fit<T, I, O> 
where
    T: Transformer<I, O>
{
    type Error: error::Error;
    fn fit(&self, inputs: &I, targets: &O, transformer: T) -> Result<T, Self::Error>;
}

with an example workflow

let blueprint = SomeConfig::new();
let model = my_algorithm.fit(&train, &targets, blueprint.initialize())?;
let preds = model.transform(&test)?;
// generate new batch of input
let model = my_algorithm.fit(&new_train, &new_targets, model)?;
let better_preds = model.transform(&test)?;

@jblondin
Collaborator

Sorry, should've added a couple more thoughts to my last comment.

This would require a bit more 'weight' to the Transformer object -- it would have to carry any relevant configuration information that would be needed by the optimizer. In the case of your StandardScaler, it would need to include the ddof (which you already have in there) as well as the n_samples seen so far.

On the plus side, this avoids any modification to the Fit object or the Blueprint object (which, in hindsight, seems like a bad idea on my part) and keeps the method signature cleaner (no returning of tuples).

I'm sure there are workflow quirks we're not considering at this point -- I feel like we're close to the point where we should prototype something and start iterating as we implement different models, algorithms, and data science workflows.
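
For the standard scaler, that slightly heavier transformer might carry something like this (a sketch; field names are assumptions):

// Hypothetical shape: the transformer carries the state the optimizer needs
// to resume fitting, not only what `transform` itself needs.
pub struct StandardScaler {
    mean: f64,
    standard_deviation: f64,
    ddof: u8,       // used only while fitting, not when transforming
    n_samples: u64, // number of samples seen so far
}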

let (x, y) = generate_batch(n_samples);

let mut optimizer = OnlineOptimizer::default();
let standard_scaler = optimizer.fit(&x, &y, Config::default())?;

Passing the config at fit time might make it difficult to compose estimators. Say the estimator is a pipeline of estimators: we wouldn't want to pass all the config in a single fit call. Having two steps, a) building the pipeline and b) fitting it, is more natural IMO.

Contributor Author

The two things are not mutually exclusive I'd say. You could compose the configuration of all steps in the pipeline and then pass that in when you want to fit it; it shouldn't look very different.
But I have yet to actually prototype it, so take it with a grain of salt.

check(&standard_scaler, &x)?;

let (x2, y2) = generate_batch(n_samples);
let standard_scaler = optimizer.incremental_fit(&x2, &y2, standard_scaler)?;

I find this conceptually difficult to follow. If anything I would have expected,

standard_scaler.incremental_fit(&x2, &y2, &optimizer)

not the other way around.

Contributor Author

Yeah, that's a good point. It should be easy enough to flip it.

&mut self,
inputs: &Input<S>,
_targets: &Output,
blueprint: Config,

why not call this config or params? Is there any other ML library that uses the "blueprint" vocabulary?

params as in scikit-learn would be more accurate IMO -- we are not providing model configuration but model parameters.

Contributor Author

Params is kind of an overloaded term: in this case, I'd say that we are passing hyperparameters (e.g. number of convolutional layers in a CNN), not parameters (e.g. the network weights).
I think it's quite natural to call the set of model hyperparameters model configuration.
We can safely discard the blueprint terminology, but I'd try to stick to terms that are not ambiguous.

@Ten0 commented Dec 26, 2019

Why do you prefer "model configuration" to "hyperparameter" for that purpose?
(Is it the sole fact that Config is smaller?)

#[macro_use]
extern crate derive_more;

use crate::standard_scaler::{Config, OnlineOptimizer, ScalingError, StandardScaler};

So if we have several models, this means we would need to use the full paths, e.g.

use crate::standard_scaler;
use crate::linear_model::logistic_regression;

let mut standard_scaler_optimizer = standard_scaler::OnlineOptimizer::default();
let standard_scaler = standard_scaler_optimizer.fit(&x, &y, standard_scaler::Config::default())?;
let (x_tr, y_tr) = standard_scaler.transform(&x, &y);

let mut logregr_optimizer = logistic_regression::OnlineOptimizer::default();
let log_regr = logregr_optimizer.fit(&x, &y, logistic_regression::Config::default())?;

which might become somewhat difficult to manage?

Also, purely from the user experience and readability standpoint (I understand this has other advantages), I find the builder pattern in rustlearn somewhat simpler because one doesn't have to deal with the optimizer.

@LukeMathWalker
Contributor Author

I think I prefer the original workflow, without the separate IncrementalFit. I'm not sure what the advantage is of moving the model initialization code outside the Blueprint -- the blueprint seems like a good place to represent the initial model state.

The main issue is correctness: after you have called blueprint.initialize(), can that Transformer be used to do predictions? If you don't go and actually check that a fit call has happened, you risk proceeding with something that returns nonsensical predictions.
Ideally, the type system should tell you that something has never been fit, so it shouldn't be used to predict.
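
As a toy illustration of that point (all names hypothetical), separate types make the nonsensical transition a compile-time error rather than a runtime check:

// An untrained model simply has no way to predict; only training produces
// a type that exposes `transform`.
struct UntrainedScaler;

struct TrainedScaler {
    mean: f64,
    standard_deviation: f64,
}

impl UntrainedScaler {
    /// Consume the untrained state and return a fitted one.
    fn fit(self, data: &[f64]) -> TrainedScaler {
        let n = data.len() as f64;
        let mean = data.iter().sum::<f64>() / n;
        let variance = data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
        TrainedScaler {
            mean,
            standard_deviation: variance.sqrt(),
        }
    }
}

impl TrainedScaler {
    fn transform(&self, x: f64) -> f64 {
        (x - self.mean) / self.standard_deviation
    }
}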

I'm sure there are workflow quirks we're not considering at this point -- I feel like we're close to the point where we should prototype something and start iterating as we implement different models, algorithms, and data science workflows.

I do agree 100%. I have been a little bit busy lately with a bunch of side projects, but now I should be able to get focused on it again. Should we put together a list of models to start with, ideally one giving us a sufficiently diverse range of quirks to validate the design, @jblondin?

@jblondin
Collaborator

The main issue is correctness: after you have called blueprint.initialize(), can that Transformer be used to do predictions? If you don't go and actually check that a fit call has happened, you risk proceeding with something that returns nonsensical predictions.

If initialize returns the base non-optimized model, that model could indeed be used to do predictions (although likely poor ones since you haven't done any training yet). Those predictions may even be useful in tracking the effectiveness of your optimizer by giving a baseline for your prior.

Conceptually, I don't think the model and optimizer should be so interwoven that the optimizer is absolutely required to help build the initial model (which would demand a fit call on the optimizer to produce a valid model). I would prefer to keep the concepts separate.

Can you give me an example of a violation of 'correctness' in this context? I feel like you could still effectively apply local reasoning in my example workflow.

Should we put together a list of models to start with, ideally one giving us a sufficiently diverse range of quirks to validate the design, @jblondin?

Sounds good. I'll start giving it some thought!

@LukeMathWalker
Contributor Author

I have found a couple of repos that should allow me to get a sizeable collection of algorithms up and running in a short amount of time:

They use just NumPy and vanilla Python, so it should be quite straightforward to port them to Rust using ndarray 👍
It should give us a sense of which abstractions will work across a sufficiently vast array of algorithms @jblondin @rth.

@LukeMathWalker
Contributor Author

I have started with rust-ndarray/ndarray-linalg#166 👀
