
Machine readable definition of parameters #13031

Open · amueller opened this issue Jan 22, 2019 · 10 comments

amueller (Member) commented Jan 22, 2019

Related to #5004.
For AutoML purposes, and for using scikit-learn programmatically in general, it would be helpful to specify the allowed parameters in a machine-readable format. Right now they are only available via the documentation, which doesn't allow us to actually test them and might be hard to parse.

One consideration is how to specify hyper-parameters. We could use the Python type system, but I think we would need to define custom types for strings with a fixed set of options.

It might also make sense to have possible ranges for integer and float parameters.

Finally, it might be interesting to annotate whether a parameter influences algorithm results, algorithm speed and/or memory consumption.

Not all options influence the result, and I've seen people grid-search optimization options while evaluating only on accuracy. There might be fuzzy cases, though: different solvers might yield slightly different results even when the choice is mostly a time/memory trade-off.

Another question is whether doing this inside scikit-learn is in scope at all, or whether it should live outside; an external definition would be hard to keep in sync, though.

Being able to test the different options (not in standard CI, but maybe in a cron job?) might be a good motivation. However, only some combinations of parameters are valid, and actually specifying the valid combinations (which would be required for testing) is trickier than just specifying valid values.

jnothman (Member) commented Jan 22, 2019 via email

mitar (Contributor) commented Jan 23, 2019

In the d3m project we have a language for describing parameters using Python classes. We decided not to use Python typing because of its focus on the Type[Arg] syntax: everything is a class rather than an instance, so you end up programming with metaclasses and so on, which just becomes another tricky thing to manage. Instead, we use regular Python instances.
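
For illustration, here is a minimal sketch of what instance-based hyper-parameter descriptions can look like (the class names are hypothetical, chosen for this example, and are not d3m's actual API):

class Hyperparameter:
    # Every description is an ordinary instance that can be inspected at
    # runtime, with no metaclass machinery involved.
    def __init__(self, default, description=""):
        self.default = default
        self.description = description

class Bounded(Hyperparameter):
    def __init__(self, lower, upper, default, description=""):
        super().__init__(default, description)
        self.lower = lower  # None means unbounded on that side
        self.upper = upper

class Enumeration(Hyperparameter):
    def __init__(self, values, default, description=""):
        super().__init__(default, description)
        self.values = values

# Descriptions for two random forest hyper-parameters:
n_estimators = Bounded(lower=1, upper=None, default=10,
                       description="The number of trees in the forest.")
criterion = Enumeration(values=["gini", "entropy"], default="gini")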

We have already parsed the docstrings of many sklearn functions and also hand-curated the results to fix some things.

We differentiate between "params" (the state of an estimator, which changes during fitting) and "hyper-parameters" (parameters which control what the estimator does). For the former we just record the type (using Python typing); for the latter we define the whole search space and the context in which each one could be used (tuning, resource use, controlling the logic itself). I think sklearn already has something like this distinction in its convention of separating attributes with a trailing underscore from those without, as in the example below. One tricky thing is that it is hard to know the full set of "params" (state) in sklearn, because some attributes are set outside of the constructor. One thing sklearn could also improve is to make sure all instance attributes are set in the constructor and not outside of it.
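
To make the trailing-underscore convention concrete, here is a small example using plain scikit-learn:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# "C" is a hyper-parameter: it controls what the estimator does and is
# set in __init__.
est = LogisticRegression(C=1.0)

# "coef_" is state in the sense above: it only exists after fit() and is
# marked by the trailing underscore.
est.fit(X, y)
print(est.coef_.shape)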

See an example of our description of "params" and "hyper-parameters" for random forest here.

I think it would be great to have such a machine-readable format as the source from which the docstrings are then generated. (You could use a Python metaclass to do that automatically; a sketch follows.)
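
A minimal sketch of that metaclass idea (hypothetical names; it assumes the spec is a class-level dict mapping parameter names to human-readable descriptions):

class DocFromParams(type):
    # Appends a numpydoc-style "Parameters" section to the class docstring,
    # generated from the class-level spec. Purely illustrative.
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        spec = namespace.get("param_spec", {})
        if spec:
            lines = ["", "Parameters", "----------"]
            for pname, description in spec.items():
                lines.append(f"{pname} : {description}")
            cls.__doc__ = (cls.__doc__ or "") + "\n".join(lines)
        return cls

class Example(metaclass=DocFromParams):
    """An example estimator."""
    param_spec = {"alpha": "float in (0, inf), default=1.0"}

print(Example.__doc__)  # includes the generated "Parameters" section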

jnothman (Member) commented Jan 23, 2019 via email

mitar (Contributor) commented Jan 23, 2019

So how would one automatically inspect an estimator? It seems the current idea is that you create an instance of the estimator, provide hyper-parameters, and then read off the params/state ... but how? If flags are dynamic and depend on hyper-parameters, then the params are as well. And I am not sure I want to fit the estimator just to figure out what state it might have in order to populate the AutoML database.

jnothman (Member) commented Jan 23, 2019 via email

mitar (Contributor) commented Jan 23, 2019

The idea is that you can have more insight into what the state is and be able to inspect how, what, and how much changes during fitting. It is an ongoing research area. But I agree that what we call "params" (state) is less important to have in machine-readable format than "hyper-parameters".

jnothman (Member) commented Jan 23, 2019 via email

thomasjpfan (Member) commented

Here is an API proposal for defining machine-readable parameters:

class LogisticRegression(..., BaseEstimator):
    valid_params = {
        "penalty": Enum("l2", "l1", "elasticnet", "none"),
        "dual": TypeOf(bool),
        "tol": Interval(float, 0, None, lower_inclusive=False),
        "C": Interval(float, 0),
        "fit_intercept": TypeOf(bool),
        "class_weight": Union(TypeOf(dict), Enum('balanced', None)),
        "random_state": TypeOf(int, RandomState, type(None)),
        "solver": Enum("newton-cg", "lbfgs", "liblinear", "sag", "saga"),
        "max_iter": Interval(int, 0),
        "multi_class": Enum("ovr", "multinomial", "auto"),
        "verbose": Interval(int, 0, tags=["control"]),
        "warm_start": TypeOf(bool, tags=["control"]),
        "n_jobs": Union(Interval(int, 1, None), Enum(-1, None), tags=["resource"]),
        "l1_ratio": Union(Interval(float, 0, 1), Enum(None))
    }

    def fit(self, X, ...):
        self._validate_params()

class BaseEstimator:
    def _validate_params(self):
        ...

This introduces four parameter types:

  1. Enum - Collection of options
  2. TypeOf - Specifying types
  3. Interval - Numerical interval
  4. Union - Union of above parameter types

There is also a tags parameter that adds metadata for each hyperparameter.

There will be estimator tests to make sure valid_params and the parameters in __init__ match up.
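
For concreteness, one possible sketch of the four constraint classes and the shared validation hook (an illustration of the proposal above, not an agreed-upon implementation):

class Enum:
    def __init__(self, *options, tags=()):
        self.options, self.tags = options, tags
    def is_satisfied_by(self, value):
        return value in self.options

class TypeOf:
    def __init__(self, *types, tags=()):
        self.types, self.tags = types, tags
    def is_satisfied_by(self, value):
        return isinstance(value, self.types)

class Interval:
    def __init__(self, dtype, low, high=None, lower_inclusive=True, tags=()):
        self.dtype, self.low, self.high = dtype, low, high
        self.lower_inclusive, self.tags = lower_inclusive, tags
    def is_satisfied_by(self, value):
        if not isinstance(value, self.dtype):
            return False
        if self.low is not None:
            if value < self.low or (value == self.low and not self.lower_inclusive):
                return False
        return self.high is None or value <= self.high

class Union:
    def __init__(self, *constraints, tags=()):
        self.constraints, self.tags = constraints, tags
    def is_satisfied_by(self, value):
        return any(c.is_satisfied_by(value) for c in self.constraints)

class BaseEstimator:
    def _validate_params(self):
        # Compare each constructor argument against its declared constraint.
        for name, constraint in self.valid_params.items():
            value = getattr(self, name)
            if not constraint.is_satisfied_by(value):
                raise ValueError(
                    f"Invalid value {value!r} for parameter {name!r} "
                    f"of {type(self).__name__}."
                )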

mitar (Contributor) commented Jun 20, 2019

@thomasjpfan, I like it. I like tags, too.

One thing I would suggest is to also list the default value.

I am confused by TypeOf, especially "random_state": TypeOf(int, RandomState, type(None)); shouldn't that be a Union of independent parameter types?

Also, it is unclear how this information can serve you. For example, what does TypeOf(RandomState) tell you? Not really how to make an instance of it. Should we go deeper and require that any class used inside TypeOf has its constructor arguments also described with parameter types?

How are you planning to describe parameters which are ndarrays, like priors or class weights?

shinnar (Contributor) commented Jul 22, 2020

TL;DR: We believe that JSON Schema is a good choice for a machine-readable representation of hyper-parameter specifications. We have explored this in our open source project Lale [1].

JSON Schema is a widely used specification language for JSON [2]. As part of the open source Lale project (https://github.com/ibm/lale) we use JSON Schema for our specifications, and it seems to work well. In particular, we have wrappers for many scikit-learn classes that provide schemas for the hyper-parameters (and input/output), and we would be excited to integrate them more directly with scikit-learn.

In Lale, we gain a number of benefits from using these schemas:

  • Automation: given a pipeline of operators that come with schemas we automatically generate search spaces for a number of AutoML tools, including GridSearchCV, Hyperopt, and SMAC.
  • Documentation: We auto-generate documentation about the hyper-parameters based on the schemas. See for example: https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html
    A goal could be to enhance the text descriptions sufficiently so that the original scikit-learn documentation for the operators is auto-generated and thus always in sync with the machine-readable description.
  • Error checking: We check that the hyperparameters passed to an operator conform to the schema (this is just JSON Schema validation).

In addition to simple schema constraints (such as enumerations and ranges), we also use JSON Schema's rich logical connectives to support side-conditions. For example, our schema for logistic regression encodes 'The newton-cg, sag, and lbfgs solvers support only l2 penalties.' as a constraint [3], which is validated automatically. Additionally, our search space generator takes these constraints into account when generating a search space for an AutoML tool.

We support additional fields in a JSON Schema, allowing the schema writer to specify a distribution for numeric ranges, as well as some other enhancements oriented towards AutoML tools. For example, some fields can be marked as not relevant for optimization [4]. Other fields can have one variant marked as relevant for the optimizer while still allowing others: for example, specifying class_weight as a dictionary is valid, but not necessarily the best choice for an AutoML tool [5].
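
For a flavor of what such a side-condition looks like, here is a simplified, illustrative version of the solver/penalty constraint (not Lale's exact schema), checked with the jsonschema package:

from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "solver": {"enum": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]},
        "penalty": {"enum": ["l1", "l2", "elasticnet", "none"]},
        "C": {"type": "number", "exclusiveMinimum": 0},
    },
    # "The newton-cg, sag, and lbfgs solvers support only l2 penalties":
    # either the solver is not one of those three, or the penalty is l2.
    "anyOf": [
        {"properties": {"solver": {"not": {"enum": ["newton-cg", "sag", "lbfgs"]}}}},
        {"properties": {"penalty": {"enum": ["l2"]}}},
    ],
}

validate({"solver": "lbfgs", "penalty": "l2", "C": 1.0}, schema)  # passes
try:
    validate({"solver": "lbfgs", "penalty": "l1", "C": 1.0}, schema)
except ValidationError as e:
    print("rejected:", e.message)  # violates the side-condition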

[1] https://github.com/ibm/lale
[2] https://json-schema.org/understanding-json-schema/
[3] https://github.com/IBM/lale/blob/v0.3.20/lale/lib/sklearn/logistic_regression.py#L246-L251
[4] https://github.com/IBM/lale/blob/v0.3.20/lale/lib/sklearn/logistic_regression.py#L108-L110
[5] https://github.com/IBM/lale/blob/v0.3.20/lale/lib/sklearn/logistic_regression.py#L190-L193
