Machine readable definition of parameters #13031
This broad goal of enabling AutoML is not something we have on our roadmap iirc, and it probably should be. Another aspect of parameter use that might deserve machine-readable annotation is which parameters can be varied for warm_start (and in what order).
In the d3m project we have a language for describing parameters using Python classes. We decided not to use Python typing for this because of some of its design choices. We already parsed docstrings for many sklearn functions and also hand-curated them to fix some things.

We do differentiate between "params" (the state of an estimator which is changed during fitting) and "hyper-parameters" (parameters which control what the estimator is doing). For the former we just record their type (using Python typing); for the latter we define the whole search space and the context in which they could be used (tuning, resource use, controlling the logic itself). I think sklearn already has something like that with its differentiation between constructor params and fitted attributes ending in an underscore.

See the example of our description of "params" and "hyper-parameters" for random forest here. I think it would be great to have such a machine-readable format as the source from which docstrings are generated. (You could use a Python metaclass to do that automatically.)
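For readers unfamiliar with that style, here is a hypothetical sketch of a class-based description of hyper-parameters (search space plus usage context) and params (fitted state). The class names are illustrative only and are not d3m's actual API:

    # Illustrative only: class-based hyper-parameter descriptions with a search space
    # and a usage context, loosely in the spirit of the d3m approach described above.
    from dataclasses import dataclass
    from typing import Any, Sequence


    @dataclass(frozen=True)
    class UniformInt:
        """Integer hyper-parameter drawn uniformly from [lower, upper]."""
        lower: int
        upper: int
        default: int
        semantic_types: Sequence[str] = ("tuning",)


    @dataclass(frozen=True)
    class Enumeration:
        """Hyper-parameter restricted to a fixed set of values."""
        values: Sequence[Any]
        default: Any
        semantic_types: Sequence[str] = ("tuning",)


    # Hyper-parameters (configuration) for a random-forest-like estimator.
    random_forest_hyperparams = {
        "n_estimators": UniformInt(lower=1, upper=2000, default=100),
        "criterion": Enumeration(values=("gini", "entropy"), default="gini"),
        "n_jobs": UniformInt(lower=1, upper=64, default=1, semantic_types=("resource",)),
    }

    # Params (fitted state) only need a type, since they are not searched over.
    random_forest_params = {
        "estimators_": list,
        "n_outputs_": int,
    }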
Our convention is to set those "parameters" attributes during the fit
method. I don't see how it would be an improvement to change that rather
than, say, add the type annotations for them like those you have
constructed, and enforce them in tests.
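As a rough sketch of what annotating fitted attributes and enforcing the annotations in tests could look like (the annotation table and the check below are assumptions for illustration, not an existing scikit-learn mechanism):

    # Illustrative sketch: declare expected types of attributes set during fit and
    # check them after fitting in a common test.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical annotation table for attributes set during fit.
    FITTED_ATTRIBUTE_TYPES = {
        LogisticRegression: {
            "coef_": np.ndarray,
            "intercept_": np.ndarray,
            "n_iter_": np.ndarray,
            "classes_": np.ndarray,
        }
    }


    def check_fitted_attribute_types(estimator):
        """Fit on toy data and verify each annotated attribute has the declared type."""
        X = np.array([[0.0], [1.0], [2.0], [3.0]])
        y = np.array([0, 0, 1, 1])
        estimator.fit(X, y)
        expected = FITTED_ATTRIBUTE_TYPES[type(estimator)]
        for name, typ in expected.items():
            assert isinstance(getattr(estimator, name), typ), name


    check_fitted_attribute_types(LogisticRegression())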
So how would one automatically inspect an estimator? It seems that currently the idea is that you create an instance of the estimator, provide hyper-parameters, and then read off what the params/state are ... but how? If flags are dynamic and depend on hyper-parameters, then params are as well. But I am not sure I want to fit the estimator just to figure out what state it might have in order to populate the AutoML database.
What do you use the parameter annotations for?
The idea is that you can have more insight into what the state is and be able to inspect how, what, and how much changes during fitting. It is an ongoing research area. But I agree that those "params" in our view (state) are not as important as "hyper-parameters" in a machine-readable format.
Right. So it might allow your AutoML engine to learn which "hyper" parameters actually affect the fit. Interesting, although prediction should often tell you that too.
Here is an API proposal for defining machine-readable parameters:

    class LogisticRegression(..., BaseEstimator):
        valid_params = {
            "penalty": Enum("l2", "l1", "elasticnet", "none"),
            "dual": TypeOf(bool),
            "tol": Interval(float, 0, None, lower_inclusive=False),
            "C": Interval(float, 0),
            "fit_intercept": TypeOf(bool),
            "class_weight": Union(TypeOf(dict), Enum('balanced', None)),
            "random_state": TypeOf(int, RandomState, type(None)),
            "solver": Enum("newton-cg", "lbfgs", "liblinear", "sag", "saga"),
            "max_iter": Interval(int, 0),
            "multi_class": Enum("ovr", "multinomial", "auto"),
            "verbose": Interval(int, 0, tags=["control"]),
            "warm_start": TypeOf(bool, tags=["control"]),
            "n_jobs": Union(Interval(int, 1, None), Enum(-1, None), tags=["resource"]),
            "l1_ratio": Union(Interval(float, 0, 1), Enum(None))
        }

        def fit(self, X, ...):
            self._validate_params()


    class BaseEstimator:
        def _validate_params(self):
            ...

This introduces four parameter types: Enum, TypeOf, Interval, and Union. There is also a tags field for marking parameters that only affect control flow or resource use. There will be estimator tests to make sure valid_params stays consistent with the parameters each estimator actually accepts.
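Below is a rough sketch of how the four constraint types and _validate_params could be implemented. The class names and the tags/lower_inclusive keywords come from the proposal above, but the bodies are purely illustrative, not an agreed design:

    # Illustrative only: minimal versions of the constraint types from the proposal above.
    import numbers


    class Enum:
        """Parameter must be one of a fixed set of values."""

        def __init__(self, *values, tags=()):
            self.values, self.tags = values, tags

        def check(self, value):
            return value in self.values


    class TypeOf:
        """Parameter must be an instance of one of the given types."""

        def __init__(self, *types, tags=()):
            self.types, self.tags = types, tags

        def check(self, value):
            return isinstance(value, self.types)


    class Interval:
        """Numeric parameter constrained to a range, optionally open at the lower end."""

        def __init__(self, dtype, lower, upper=None, lower_inclusive=True, tags=()):
            self.dtype, self.lower, self.upper = dtype, lower, upper
            self.lower_inclusive, self.tags = lower_inclusive, tags

        def check(self, value):
            if not isinstance(value, numbers.Number):
                return False
            if self.lower is not None:
                if value < self.lower or (value == self.lower and not self.lower_inclusive):
                    return False
            return self.upper is None or value <= self.upper


    class Union:
        """Parameter must satisfy at least one of the given constraints."""

        def __init__(self, *constraints, tags=()):
            self.constraints, self.tags = constraints, tags

        def check(self, value):
            return any(c.check(value) for c in self.constraints)


    def _validate_params(estimator):
        """In the proposal this is a BaseEstimator method; written as a free
        function here so the sketch is self-contained."""
        for name, constraint in estimator.valid_params.items():
            value = getattr(estimator, name)
            if not constraint.check(value):
                raise ValueError(f"Invalid value {value!r} for parameter {name!r}")


    # A few checks against entries from the table above:
    assert Interval(float, 0).check(1.0)
    assert not Interval(float, 0, None, lower_inclusive=False).check(0.0)
    assert Union(Interval(int, 1, None), Enum(-1, None), tags=["resource"]).check(-1)
    assert not Enum("l2", "l1", "elasticnet", "none").check("l3")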
@thomasjpfan, I like it. One thing I would suggest is also to list the default value. Also, it is unclear how this information can serve you at run time. How are you planning to describe parameters which are ndarrays? Like priors? Or class weights?
TLDR: We believe that JSON Schema is a good choice for a machine-readable representation of hyper-parameter specifications. We have explored this in our open source project Lale [1]. JSON Schema is a widely used specification language for JSON [2]. As part of the open source Lale (https://github.com/ibm/lale) project we use JSON Schema for our specifications, and it seems to work well. In Lale, we gain a number of benefits from using these schemas.
In addition to simple schema constraints (such as enumerations and ranges), we also use JSON Schema's rich logical connectives to support side-conditions. For example, our schema for logistic regression encodes 'The newton-cg, sag, and lbfgs solvers support only l2 penalties.' as a constraint [3], which is validated automatically. Additionally, our search space generator takes these constraints into account when generating a search space for an AutoML tool.

We support additional fields in a JSON Schema allowing the schema writer to specify a distribution for numeric ranges, as well as some other enhancements oriented towards AutoML tools. For example, some fields can be marked as not relevant for optimization [4]. Other fields can have one variant marked as relevant for the optimizer, while still allowing others. For example, specifying the class_weight as a dictionary is valid, but not necessarily the best choice for an AutoML tool [5].

[1] https://github.com/ibm/lale
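As a minimal sketch (not Lale's actual schema), such a side-condition could be written in JSON Schema and checked with the jsonschema package roughly like this; the schema content only mirrors the quoted solver/penalty constraint:

    # Sketch: a JSON Schema implication "solver in {newton-cg, sag, lbfgs} => penalty == l2",
    # validated with the jsonschema package. Not Lale's actual schema for LogisticRegression.
    from jsonschema import validate, ValidationError

    logreg_schema = {
        "type": "object",
        "properties": {
            "solver": {"enum": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]},
            "penalty": {"enum": ["l1", "l2", "elasticnet", "none"]},
        },
        # Implication encoded as (not A) or B, mirroring the quoted constraint above.
        "anyOf": [
            {"not": {"properties": {"solver": {"enum": ["newton-cg", "sag", "lbfgs"]}}}},
            {"properties": {"penalty": {"enum": ["l2"]}}},
        ],
    }

    validate({"solver": "liblinear", "penalty": "l1"}, logreg_schema)  # passes
    try:
        validate({"solver": "sag", "penalty": "l1"}, logreg_schema)    # violates the side-condition
    except ValidationError as err:
        print("rejected:", err.message)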
Related to #5004.
For AutoML purposes, and for using scikit-learn more programmatically, it would be helpful to be able to specify allowed parameters in a machine-readable format. Right now they are only available via the documentation, which doesn't allow us to actually test them and might be hard to parse.
One consideration is how to specify hyper-parameters. We could use the Python type system, but I think we would need to define custom types for strings with a fixed set of options.
It might also make sense to have possible ranges for integer and float parameters.
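As one hedged illustration of the typing route mentioned above: typing.Literal covers strings with a fixed set of options, and typing.Annotated can carry range metadata, although the Range marker below is a hypothetical custom class, not something the standard library provides:

    # Sketch only: expressing option sets and ranges with standard typing machinery.
    # Literal and Annotated are standard; the Range marker is a hypothetical custom class.
    from dataclasses import dataclass
    from typing import Annotated, Literal, Optional, get_args, get_type_hints


    @dataclass(frozen=True)
    class Range:
        """Hypothetical marker carrying a numeric range for Annotated hints."""
        lower: float
        upper: Optional[float] = None


    class LogisticRegressionParams:
        # String parameters with a fixed set of options.
        penalty: Literal["l1", "l2", "elasticnet", "none"] = "l2"
        solver: Literal["newton-cg", "lbfgs", "liblinear", "sag", "saga"] = "lbfgs"
        # Numeric parameters with ranges attached as metadata.
        C: Annotated[float, Range(lower=0.0)] = 1.0
        max_iter: Annotated[int, Range(lower=0)] = 100


    # A tool can recover the declared options and ranges via introspection:
    hints = get_type_hints(LogisticRegressionParams, include_extras=True)
    print(get_args(hints["penalty"]))   # ('l1', 'l2', 'elasticnet', 'none')
    print(hints["C"].__metadata__)      # (Range(lower=0.0, upper=None),)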
Finally, it might be interesting to annotate whether a parameter influences algorithm results, algorithm speed and/or memory consumption.
Not all options influence the result and I've seen people grid-search optimization options while evaluating only on accuracy. There might be fuzzy cases, though, like different solvers might yield slightly different results but it's mostly a time/memory issue.
Another question is whether it's in scope of doing this inside scikit-learn at all, or whether it should live outside - but it will be hard to keep in sync.
Being able to test for different options (not in standard CI but maybe in cron?) might be a good motivation. However, only some combinations of parameters are valid and actually specifying valid combinations (which would be required for testing) is more tricky than just specifying valid values.