
Machine readable definition of parameters #13031

Open · amueller opened this issue Jan 22, 2019 · 10 comments

amueller (Member) commented Jan 22, 2019

Related to #5004.
For AutoML purposes, and for using scikit-learn programmatically in general, it would be helpful to specify the allowed parameters in a machine-readable format. Right now they are only available via the documentation, which doesn't allow us to actually test them and might be hard to parse.

One consideration is how to specify hyper-parameters. We could use the Python type system, but I think we would need to define custom types for strings with a fixed set of options.

It might also make sense to have possible ranges for integer and float parameters.

Finally, it might be interesting to annotate whether a parameter influences algorithm results, algorithm speed and/or memory consumption.

Not all options influence the result, and I've seen people grid-search optimization options while evaluating only on accuracy. There might be fuzzy cases, though: different solvers might yield slightly different results even when the choice is mostly a time/memory trade-off.

Another question is whether doing this inside scikit-learn is in scope at all, or whether it should live outside; an external definition would be hard to keep in sync, though.

Being able to test the different options (not in standard CI, but maybe in a cron job?) might be a good motivation. However, only some combinations of parameters are valid, and actually specifying the valid combinations (which would be required for testing) is trickier than just specifying valid values.

jnothman (Member) commented Jan 22, 2019 via email

mitar (Contributor) commented Jan 23, 2019

In the d3m project we have a language for describing parameters using Python classes. We decided not to use Python typing because of its focus on the Type[Arg] syntax: everything is a class rather than an instance, so you end up programming with metaclasses and so on, which just becomes another tricky thing to manage. Instead, we use regular Python instances.
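
For illustration, here is a minimal sketch of what instance-based hyper-parameter descriptions can look like (the class names are hypothetical, chosen for this example, and are not d3m's actual API):

class Hyperparameter:
    # Every description is an ordinary instance that can be inspected at
    # runtime, with no metaclass machinery involved.
    def __init__(self, default, description=""):
        self.default = default
        self.description = description

class Bounded(Hyperparameter):
    def __init__(self, lower, upper, default, description=""):
        super().__init__(default, description)
        self.lower = lower  # None means unbounded on that side
        self.upper = upper

class Enumeration(Hyperparameter):
    def __init__(self, values, default, description=""):
        super().__init__(default, description)
        self.values = values

# Descriptions for two random forest hyper-parameters:
n_estimators = Bounded(lower=1, upper=None, default=10,
                       description="The number of trees in the forest.")
criterion = Enumeration(values=["gini", "entropy"], default="gini")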

We have already parsed the docstrings of many sklearn functions and also hand-curated the results to fix some things.

We differentiate between "params" (the state of an estimator, which changes during fitting) and "hyper-parameters" (parameters which control what the estimator does). For the former we just record the type (using Python typing); for the latter we define the whole search space and the context in which each one could be used (tuning, resource use, controlling the logic itself). I think sklearn already has something like this distinction in its convention of separating attributes with a trailing underscore from those without, as in the example below. One tricky thing is that it is hard to know the full set of "params" (state) in sklearn, because some attributes are set outside of the constructor. One thing sklearn could also improve is to make sure all instance attributes are set in the constructor and not outside of it.
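
To make the trailing-underscore convention concrete, here is a small example using plain scikit-learn:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# "C" is a hyper-parameter: it controls what the estimator does and is
# set in __init__.
est = LogisticRegression(C=1.0)

# "coef_" is state in the sense above: it only exists after fit() and is
# marked by the trailing underscore.
est.fit(X, y)
print(est.coef_.shape)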

See an example of our description of "params" and "hyper-parameters" for random forest here.

I think it would be great to have such a machine-readable format as the source from which the docstrings are then generated. (You could use a Python metaclass to do that automatically; a sketch follows.)
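
A minimal sketch of that metaclass idea (hypothetical names; it assumes the spec is a class-level dict mapping parameter names to human-readable descriptions):

class DocFromParams(type):
    # Appends a numpydoc-style "Parameters" section to the class docstring,
    # generated from the class-level spec. Purely illustrative.
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        spec = namespace.get("param_spec", {})
        if spec:
            lines = ["", "Parameters", "----------"]
            for pname, description in spec.items():
                lines.append(f"{pname} : {description}")
            cls.__doc__ = (cls.__doc__ or "") + "\n".join(lines)
        return cls

class Example(metaclass=DocFromParams):
    """An example estimator."""
    param_spec = {"alpha": "float in (0, inf), default=1.0"}

print(Example.__doc__)  # includes the generated "Parameters" section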

jnothman (Member) commented Jan 23, 2019 via email

mitar (Contributor) commented Jan 23, 2019

So how would one automatically inspect an estimator? It seems the current idea is that you create an instance of the estimator, provide hyper-parameters, and then read off the params/state ... but how? If flags are dynamic and depend on hyper-parameters, then the params are as well. And I am not sure I want to fit the estimator just to figure out what state it might have in order to populate the AutoML database.

jnothman (Member) commented Jan 23, 2019 via email

mitar (Contributor) commented Jan 23, 2019

The idea is that you can have more insight into what the state is and be able to inspect how, what, and how much changes during fitting. It is an ongoing research area. But I agree that what we call "params" (state) is less important to have in machine-readable format than "hyper-parameters".

jnothman (Member) commented Jan 23, 2019 via email

thomasjpfan (Member) commented

Here is an API proposal for defining machine-readable parameters:

class LogisticRegression(..., BaseEstimator):
    valid_params = {
        "penalty": Enum("l2", "l1", "elasticnet", "none"),
        "dual": TypeOf(bool),
        "tol": Interval(float, 0, None, lower_inclusive=False),
        "C": Interval(float, 0),
        "fit_intercept": TypeOf(bool),
        "class_weight": Union(TypeOf(dict), Enum('balanced', None)),
        "random_state": TypeOf(int, RandomState, type(None)),
        "solver": Enum("newton-cg", "lbfgs", "liblinear", "sag", "saga"),
        "max_iter": Interval(int, 0),
        "multi_class": Enum("ovr", "multinomial", "auto"),
        "verbose": Interval(int, 0, tags=["control"]),
        "warm_start": TypeOf(bool, tags=["control"]),
        "n_jobs": Union(Interval(int, 1, None), Enum(-1, None), tags=["resource"]),
        "l1_ratio": Union(Interval(float, 0, 1), Enum(None))
    }

    def fit(self, X, ...):
        self._validate_params()

class BaseEstimator:
    def _validate_params(self):
        ...

This introduces four parameter types:

  1. Enum - Collection of options
  2. TypeOf - Specifying types
  3. Interval - Numerical interval
  4. Union - Union of above parameter types

There is also a tags parameter that adds metadata for each hyperparameter.

There will be estimator tests to make sure valid_params and the parameters in __init__ match up.
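
For concreteness, one possible sketch of the four constraint classes and the shared validation hook (an illustration of the proposal above, not an agreed-upon implementation):

class Enum:
    def __init__(self, *options, tags=()):
        self.options, self.tags = options, tags
    def is_satisfied_by(self, value):
        return value in self.options

class TypeOf:
    def __init__(self, *types, tags=()):
        self.types, self.tags = types, tags
    def is_satisfied_by(self, value):
        return isinstance(value, self.types)

class Interval:
    def __init__(self, dtype, low, high=None, lower_inclusive=True, tags=()):
        self.dtype, self.low, self.high = dtype, low, high
        self.lower_inclusive, self.tags = lower_inclusive, tags
    def is_satisfied_by(self, value):
        if not isinstance(value, self.dtype):
            return False
        if self.low is not None:
            if value < self.low or (value == self.low and not self.lower_inclusive):
                return False
        return self.high is None or value <= self.high

class Union:
    def __init__(self, *constraints, tags=()):
        self.constraints, self.tags = constraints, tags
    def is_satisfied_by(self, value):
        return any(c.is_satisfied_by(value) for c in self.constraints)

class BaseEstimator:
    def _validate_params(self):
        # Compare each constructor argument against its declared constraint.
        for name, constraint in self.valid_params.items():
            value = getattr(self, name)
            if not constraint.is_satisfied_by(value):
                raise ValueError(
                    f"Invalid value {value!r} for parameter {name!r} "
                    f"of {type(self).__name__}."
                )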

mitar (Contributor) commented Jun 20, 2019

@thomasjpfan, I like it. I like tags, too.

One thing I would suggest is to also list the default value.

I am confused by TypeOf, especially "random_state": TypeOf(int, RandomState, type(None)); shouldn't that be a Union of independent parameter types?

Also, it is unclear how this information can serve you. For example, what does TypeOf(RandomState) tell you? Not really how to make an instance of it. Should we go deeper and require that any class used inside TypeOf has its constructor arguments also described with parameter types?

How are you planning to describe parameters which are ndarrays, like priors or class weights?

shinnar (Contributor) commented Jul 22, 2020

TL;DR: We believe that JSON Schema is a good choice for a machine-readable representation of hyper-parameter specifications. We have explored this in our open source project Lale [1].

JSON Schema is a widely used specification language for JSON [2]. As part of the open source Lale project (https://github.com/ibm/lale) we use JSON Schema for our specifications, and it seems to work well. In particular, we have wrappers for many scikit-learn classes that provide schemas for the hyper-parameters (and input/output), and we would be excited to integrate them more directly with scikit-learn.

In Lale, we gain a number of benefits from using these schemas:

  • Automation: given a pipeline of operators that come with schemas we automatically generate search spaces for a number of AutoML tools, including GridSearchCV, Hyperopt, and SMAC.
  • Documentation: We auto-generate documentation about the hyper-parameters based on the schemas. See for example: https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html
    A goal could be to enhance the text descriptions sufficiently so that the original scikit-learn documentation for the operators is auto-generated and thus always in sync with the machine-readable description.
  • Error checking: We check that the hyperparameters passed to an operator conform to the schema (this is just JSON Schema validation).

In addition to simple schema constraints (such as enumerations and ranges), we also use JSON Schema's rich logical connectives to support side-conditions. For example, our schema for logistic regression encodes 'The newton-cg, sag, and lbfgs solvers support only l2 penalties.' as a constraint [3], which is validated automatically. Additionally, our search space generator takes these constraints into account when generating a search space for an AutoML tool.

We support additional fields in a JSON Schema, allowing the schema writer to specify a distribution for numeric ranges, as well as some other enhancements oriented towards AutoML tools. For example, some fields can be marked as not relevant for optimization [4]. Other fields can have one variant marked as relevant for the optimizer while still allowing others: for example, specifying class_weight as a dictionary is valid, but not necessarily the best choice for an AutoML tool [5].
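
For a flavor of what such a side-condition looks like, here is a simplified, illustrative version of the solver/penalty constraint (not Lale's exact schema), checked with the jsonschema package:

from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "solver": {"enum": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]},
        "penalty": {"enum": ["l1", "l2", "elasticnet", "none"]},
        "C": {"type": "number", "exclusiveMinimum": 0},
    },
    # "The newton-cg, sag, and lbfgs solvers support only l2 penalties":
    # either the solver is not one of those three, or the penalty is l2.
    "anyOf": [
        {"properties": {"solver": {"not": {"enum": ["newton-cg", "sag", "lbfgs"]}}}},
        {"properties": {"penalty": {"enum": ["l2"]}}},
    ],
}

validate({"solver": "lbfgs", "penalty": "l2", "C": 1.0}, schema)  # passes
try:
    validate({"solver": "lbfgs", "penalty": "l1", "C": 1.0}, schema)
except ValidationError as e:
    print("rejected:", e.message)  # violates the side-condition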

[1] https://github.com/ibm/lale
[2] https://json-schema.org/understanding-json-schema/
[3] https://github.com/IBM/lale/blob/v0.3.20/lale/lib/sklearn/logistic_regression.py#L246-L251
[4] https://github.com/IBM/lale/blob/v0.3.20/lale/lib/sklearn/logistic_regression.py#L108-L110
[5] https://github.com/IBM/lale/blob/v0.3.20/lale/lib/sklearn/logistic_regression.py#L190-L193
