
How to use get_params() #135

@koaning

I have a scikit-learn pipeline defined in the code below.

from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, Binarizer
from sklearn.impute import SimpleImputer
from skrub import SelectCols
from sklearn.ensemble import HistGradientBoostingClassifier

feat_pipe = make_union(
    make_pipeline(
        SelectCols(["pclass", "sex"]),
        OneHotEncoder(sparse_output=False)
    ),
    SelectCols(["fare", "age"])
)

pipe = make_pipeline(
    feat_pipe, 
    HistGradientBoostingClassifier()
)

When I ask for the params of this pipeline, I see a long list of names that I can refer to when doing hyperparameter tuning.

pipe.get_params()

The list is long, but that is because it is nice and elaborate.

{'memory': None,
 'steps': [('featureunion',
   FeatureUnion(transformer_list=[('pipeline',
                                   Pipeline(steps=[('selectcols',
                                                    SelectCols(cols=['pclass',
                                                                     'sex'])),
                                                   ('onehotencoder',
                                                    OneHotEncoder(sparse_output=False))])),
                                  ('selectcols',
                                   SelectCols(cols=['fare', 'age']))])),
  ('histgradientboostingclassifier', HistGradientBoostingClassifier())],
 'verbose': False,
 'featureunion': FeatureUnion(transformer_list=[('pipeline',
                                 Pipeline(steps=[('selectcols',
                                                  SelectCols(cols=['pclass',
                                                                   'sex'])),
                                                 ('onehotencoder',
                                                  OneHotEncoder(sparse_output=False))])),
                                ('selectcols',
                                 SelectCols(cols=['fare', 'age']))]),
 'histgradientboostingclassifier': HistGradientBoostingClassifier(),
 'featureunion__n_jobs': None,
 'featureunion__transformer_list': [('pipeline',
   Pipeline(steps=[('selectcols', SelectCols(cols=['pclass', 'sex'])),
                   ('onehotencoder', OneHotEncoder(sparse_output=False))])),
  ('selectcols', SelectCols(cols=['fare', 'age']))],
 'featureunion__transformer_weights': None,
 'featureunion__verbose': False,
 'featureunion__verbose_feature_names_out': True,
 'featureunion__pipeline': Pipeline(steps=[('selectcols', SelectCols(cols=['pclass', 'sex'])),
                 ('onehotencoder', OneHotEncoder(sparse_output=False))]),
 'featureunion__selectcols': SelectCols(cols=['fare', 'age']),
 'featureunion__pipeline__memory': None,
 'featureunion__pipeline__steps': [('selectcols',
   SelectCols(cols=['pclass', 'sex'])),
  ('onehotencoder', OneHotEncoder(sparse_output=False))],
 'featureunion__pipeline__verbose': False,
 'featureunion__pipeline__selectcols': SelectCols(cols=['pclass', 'sex']),
 'featureunion__pipeline__onehotencoder': OneHotEncoder(sparse_output=False),
 'featureunion__pipeline__selectcols__cols': ['pclass', 'sex'],
 'featureunion__pipeline__onehotencoder__categories': 'auto',
 'featureunion__pipeline__onehotencoder__drop': None,
 'featureunion__pipeline__onehotencoder__dtype': numpy.float64,
 'featureunion__pipeline__onehotencoder__feature_name_combiner': 'concat',
 'featureunion__pipeline__onehotencoder__handle_unknown': 'error',
 'featureunion__pipeline__onehotencoder__max_categories': None,
 'featureunion__pipeline__onehotencoder__min_frequency': None,
 'featureunion__pipeline__onehotencoder__sparse_output': False,
 'featureunion__selectcols__cols': ['fare', 'age'],
 'histgradientboostingclassifier__categorical_features': 'warn',
 'histgradientboostingclassifier__class_weight': None,
 'histgradientboostingclassifier__early_stopping': 'auto',
 'histgradientboostingclassifier__interaction_cst': None,
 'histgradientboostingclassifier__l2_regularization': 0.0,
 'histgradientboostingclassifier__learning_rate': 0.1,
 'histgradientboostingclassifier__loss': 'log_loss',
 'histgradientboostingclassifier__max_bins': 255,
 'histgradientboostingclassifier__max_depth': None,
 'histgradientboostingclassifier__max_features': 1.0,
 'histgradientboostingclassifier__max_iter': 100,
 'histgradientboostingclassifier__max_leaf_nodes': 31,
 'histgradientboostingclassifier__min_samples_leaf': 20,
 'histgradientboostingclassifier__monotonic_cst': None,
 'histgradientboostingclassifier__n_iter_no_change': 10,
 'histgradientboostingclassifier__random_state': None,
 'histgradientboostingclassifier__scoring': 'loss',
 'histgradientboostingclassifier__tol': 1e-07,
 'histgradientboostingclassifier__validation_fraction': 0.1,
 'histgradientboostingclassifier__verbose': 0,
 'histgradientboostingclassifier__warm_start': False}

The reason this is nice is that it allows me to be very specific: I can tune each input argument of every component, such as featureunion__pipeline__selectcols__cols or featureunion__pipeline__onehotencoder__sparse_output.
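The same keys also work with set_params(), so a nested component can be reconfigured in place. A small sketch, using two of the keys from the output above:

# Reconfigure nested components directly via their double-underscore keys.
pipe.set_params(
    featureunion__pipeline__onehotencoder__min_frequency=5,
    featureunion__pipeline__selectcols__cols=["pclass", "sex"],
)

This is very nice for grid search!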

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    pipe, 
    param_grid={
        "featureunion__pipeline__onehotencoder__min_frequency": [None, 1, 5, 10]
    }
)

grid.fit(X, y)
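X and y are not defined in the snippet above; any dataframe with pclass, sex, fare, and age columns will do. As a minimal sketch, the OpenML Titanic dataset matches the column names used here:

from sklearn.datasets import fetch_openml

# Illustrative data loading only: a frame containing the pclass, sex, fare,
# and age columns that the pipeline selects, plus a survival target.
titanic = fetch_openml("titanic", version=1, as_frame=True)
X, y = titanic.data, titanic.target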

The cool thing about this is that I am able to get a nice table as output too.

import pandas as pd

pd.DataFrame(grid.cv_results_).to_markdown()
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_featureunion__pipeline__onehotencoder__min_frequency | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.557284 | 0.0364319 | 0.0053968 | 0.00091813 | nan | {'featureunion__pipeline__onehotencoder__min_frequency': None} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 1 | 0.567849 | 0.0222483 | 0.00532556 | 0.000495336 | 1 | {'featureunion__pipeline__onehotencoder__min_frequency': 1} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 2 | 0.567496 | 0.00920872 | 0.00557318 | 0.000404766 | 5 | {'featureunion__pipeline__onehotencoder__min_frequency': 5} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 3 | 0.553523 | 0.023475 | 0.0052145 | 0.000855578 | 10 | {'featureunion__pipeline__onehotencoder__min_frequency': 10} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |

But when I look at IbisML I wonder if I am able to do the same thing.

import ibis_ml as iml

tfm = iml.Recipe(
    iml.ExpandDateTime(iml.date())
)

In IbisML it is the Recipe object that is scikit-learn-compatible, not the ExpandDateTime object. So let's inspect.

tfm.get_params()

This yields the following.

{'steps': (ExpandDateTime(date(),
                 components=['dow', 'month', 'year', 'hour', 'minute', 'second']),),
 'expanddatetime': ExpandDateTime(date(),
                components=['dow', 'month', 'year', 'hour', 'minute', 'second'])}

In fairness, this is not completely unlike what scikit-learn does natively. In a scikit-learn pipeline you also have access to the steps argument, and you could theoretically make all the changes there directly by passing in new sub-pipelines. But there is a reason why scikit-learn does not stop there! It goes deeper, into all the input arguments of all the estimators in the pipeline, which makes the resulting cv_results_ output a lot nicer. And this is where I wonder whether IbisML can do the same thing. It seems that I need to pass full objects instead of being able to pluck out the individual attributes that I care about.

In this particular case, what if I want to measure the effect of including/excluding dow or hour? Is that possible? Can I use an underscore-syntax string, just like in scikit-learn, to configure that? Or do I need to overwrite the steps object?
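For context, the only workaround I see right now is to swap in whole step objects rather than individual attributes. A rough sketch, assuming Recipe follows scikit-learn's usual set_params conventions (the step name "recipe" is just what make_pipeline would generate from the class name):

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical workaround: grid over complete ExpandDateTime objects instead of
# tuning a single attribute reached via an underscore path.
model = make_pipeline(tfm, HistGradientBoostingClassifier())
grid = GridSearchCV(
    model,
    param_grid={
        "recipe__steps": [
            (iml.ExpandDateTime(iml.date(), components=["dow", "month", "year"]),),
            (iml.ExpandDateTime(iml.date(), components=["dow", "month", "year", "hour"]),),
        ]
    },
)

Even if that works, the params column in cv_results_ would then contain whole object reprs rather than one clean value per row, which is exactly what the underscore syntax avoids.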
