In [1]:
%matplotlib inline


# Obtain run information

The following example shows how to obtain information from a finished
Auto-sklearn run. In particular, it shows:
* how to query which models were evaluated by Auto-sklearn
* how to query the models in the final ensemble
* how to get general statistics on what Auto-sklearn evaluated

Auto-sklearn is a wrapper around scikit-learn models. This example also
illustrates how to interact with the underlying sklearn components
directly, in this case a PCA preprocessor.


In [2]:
from pprint import pprint

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

import autosklearn.classification



## Data Loading



In [3]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

## Build and fit the classifier



In [4]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    per_run_time_limit=10,
    disable_evaluator_output=False,
    memory_limit=16384,
    # To simplify querying the models in the final ensemble, we
    # restrict auto-sklearn to use only pca as a preprocessor
    include={"feature_preprocessor": ["pca"]},
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")



AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      include={'feature_preprocessor': ['pca']},
                      memory_limit=16384, per_run_time_limit=10,
                      time_left_for_this_task=30)

## Predict using the model



In [5]:
predictions = automl.predict(X_test)
print("Accuracy score:{}".format(sklearn.metrics.accuracy_score(y_test, predictions)))

Accuracy score:0.9440559440559441


## Report the models found by Auto-sklearn

Auto-sklearn uses
[Ensemble Selection](https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf)
to construct ensembles in a post-hoc fashion. The ensemble is a linear
weighting of all models constructed during the hyperparameter optimization.
The cell below prints the final ensemble. It is a dictionary that maps the
``model_id`` of each model to a dictionary containing information about
that model. A model's dict contains its ``'model_id'``, ``'rank'``,
``'cost'``, ``'ensemble_weight'``, and the model itself. The model is
given by the ``'data_preprocessor'``, ``'feature_preprocessor'``,
``'regressor'/'classifier'`` and ``'sklearn_regressor'/'sklearn_classifier'``
entries. For the ``'cv'`` resampling strategy, however, these entries are
stored once per cv model in the ``'estimators'`` list in the dict, along
with the ``'voting_model'``.
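Because the result is a plain dictionary of dictionaries, it can be summarized with ordinary Python. The sketch below is an illustration, not part of the Auto-sklearn API: `summarize_ensemble` is a hypothetical helper, and `example_models` only mimics the documented keys, reusing the scalar values of models 2 and 3 from this run. With a fitted estimator you would pass `automl.show_models()` instead.

```python
def summarize_ensemble(models):
    """Return (model_id, rank, ensemble_weight, cost) rows sorted by rank.

    `models` is assumed to have the shape returned by show_models():
    a dict mapping model_id to a dict with at least the keys
    'rank', 'ensemble_weight' and 'cost'.
    """
    rows = [
        (model_id, info["rank"], info["ensemble_weight"], info["cost"])
        for model_id, info in models.items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical stand-in for automl.show_models(), reduced to scalar entries.
example_models = {
    3: {"model_id": 3, "rank": 2, "cost": 0.07092198581560283, "ensemble_weight": 0.04},
    2: {"model_id": 2, "rank": 1, "cost": 0.07801418439716312, "ensemble_weight": 0.02},
}

summary = summarize_ensemble(example_models)
for model_id, rank, weight, cost in summary:
    print(f"model {model_id}: rank={rank} weight={weight:.2f} cost={cost:.4f}")
```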



In [6]:
pprint(automl.show_models(), indent=4)

{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7e0a452cc1f0>,
           'cost': 0.07801418439716312,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7e0a45465fd0>,
           'ensemble_weight': 0.02,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a452cc610>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
    3: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7e0a45472b80>,
           'cost': 0.07092198581560283,
           'data_preprocessor': <autosklearn.pipeline.components.

In [7]:
selected_models = automl.show_models()
print(selected_models)

{2: {'model_id': 2, 'rank': 1, 'cost': 0.07801418439716312, 'ensemble_weight': 0.02, 'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7e0a45465fd0>, 'balancing': Balancing(random_state=1), 'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a452cc610>, 'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7e0a452cc1f0>, 'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)}, 3: {'model_id': 3, 'rank': 2, 'cost': 0.07092198581560283, 'ensemble_weight': 0.04, 'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7e0a451cef10>, 'balancing': Balancing(random_state=1), 'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0

## Report statistics about the search

Print statistics about the auto-sklearn run, such as the number of
iterations and the number of models that failed with a timeout.



In [8]:
print(automl.sprint_statistics())

auto-sklearn results:
  Dataset name: breast_cancer
  Metric: accuracy
  Best validation score: 0.978723
  Number of target algorithm runs: 26
  Number of successful target algorithm runs: 26
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
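`sprint_statistics()` returns a formatted string, so the values have to be parsed if they are needed programmatically. The following is a minimal sketch that assumes the `name: value` layout shown above; `parse_statistics` is a hypothetical helper, and `report` is a copy of the printed output rather than a live call.

```python
def parse_statistics(text):
    """Parse 'name: value' lines into a dict, skipping lines without a value."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        if sep:  # the title line ends with ':' and has no value
            stats[name] = value
    return stats


# Copy of the output printed above; with a fitted estimator this would be
# the string returned by automl.sprint_statistics().
report = """auto-sklearn results:
  Dataset name: breast_cancer
  Metric: accuracy
  Best validation score: 0.978723
  Number of target algorithm runs: 26
  Number of successful target algorithm runs: 26
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
"""

stats = parse_statistics(report)
total_runs = int(stats["Number of target algorithm runs"])
successful_runs = int(stats["Number of successful target algorithm runs"])
print("Success rate:", successful_runs / total_runs)
```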



## Detailed statistics about the search - part 1

Auto-sklearn also keeps detailed statistics of the hyperparameter
optimization procedure, which are stored in a so-called
[run history](https://automl.github.io/SMAC3/main/api/smac.runhistory.runhistory.html#smac.runhistory.runhistory.RunHistory).



In [9]:
print(automl.automl_.runhistory_)

<smac.runhistory.runhistory.RunHistory object at 0x7e0adb661fa0>


Runs are stored inside an ``OrderedDict`` called ``data``:



In [10]:
print(len(automl.automl_.runhistory_.data))

27


Let's iterate over all entries



In [11]:
for run_key in automl.automl_.runhistory_.data:
    print("#########")
    print(run_key)
    print(automl.automl_.runhistory_.data[run_key])

#########
RunKey(config_id=1, instance_id='{"task_id": "breast_cancer"}', seed=0, budget=0.0)
RunValue(cost=0.07801418439716312, time=0.5278530120849609, status=<StatusType.SUCCESS: 1>, starttime=1727145388.6151183, endtime=1727145389.152537, additional_info={'duration': 0.49729061126708984, 'num_run': 2, 'train_loss': 0.0, 'configuration_origin': 'Initial design'})
#########
RunKey(config_id=2, instance_id='{"task_id": "breast_cancer"}', seed=0, budget=0.0)
RunValue(cost=0.07092198581560283, time=0.4174056053161621, status=<StatusType.SUCCESS: 1>, starttime=1727145389.1710904, endtime=1727145389.5982425, additional_info={'duration': 0.38295412063598633, 'num_run': 3, 'train_loss': 0.06315789473684208, 'configuration_origin': 'Initial design'})
#########
RunKey(config_id=3, instance_id='{"task_id": "breast_cancer"}', seed=0, budget=0.0)
RunValue(cost=0.028368794326241176, time=0.5013651847839355, status=<StatusType.SUCCESS: 1>, starttime=1727145389.62697, endtime=1727145390.1388834, ad

and have a detailed look at one entry:



In [12]:
run_key = list(automl.automl_.runhistory_.data.keys())[0]
run_value = automl.automl_.runhistory_.data[run_key]

The ``run_key`` contains all information describing a run:



In [13]:
print("Configuration ID:", run_key.config_id)
print("Instance:", run_key.instance_id)
print("Seed:", run_key.seed)
print("Budget:", run_key.budget)

Configuration ID: 1
Instance: {"task_id": "breast_cancer"}
Seed: 0
Budget: 0.0


and the configuration can be looked up in the run history as well:



In [14]:
print(automl.automl_.runhistory_.ids_config[run_key.config_id])

Configuration(values={
  'balancing:strategy': 'none',
  'classifier:__choice__': 'random_forest',
  'classifier:random_forest:bootstrap': 'True',
  'classifier:random_forest:criterion': 'gini',
  'classifier:random_forest:max_depth': 'None',
  'classifier:random_forest:max_features': 0.5,
  'classifier:random_forest:max_leaf_nodes': 'None',
  'classifier:random_forest:min_impurity_decrease': 0.0,
  'classifier:random_forest:min_samples_leaf': 1,
  'classifier:random_forest:min_samples_split': 2,
  'classifier:random_forest:min_weight_fraction_leaf': 0.0,
  'data_preprocessor:__choice__': 'feature_type',
  'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean',
  'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
  'feature_preprocessor:__choice__': 'pca',
  'feature_preprocessor:pca:keep_variance': 0.9999,
  'feature_preprocessor:pca:whiten': 'False',
})



The only other important entry is the budget, which matters when you use
auto-sklearn with successive halving (see the example
`sphx_glr_examples_60_search_example_successive_halving.py`).
The remaining parts of the key can be ignored for auto-sklearn and are
only there because the underlying optimizer, SMAC, can handle more general
problems, too.



The ``run_value`` contains all output from running the configuration:



In [15]:
print("Cost:", run_value.cost)
print("Time:", run_value.time)
print("Status:", run_value.status)
print("Additional information:", run_value.additional_info)
print("Start time:", run_value.starttime)
print("End time", run_value.endtime)

Cost: 0.07801418439716312
Time: 0.5278530120849609
Status: StatusType.SUCCESS
Additional information: {'duration': 0.49729061126708984, 'num_run': 2, 'train_loss': 0.0, 'configuration_origin': 'Initial design'}
Start time: 1727145388.6151183
End time 1727145389.152537


Cost is essentially a loss. If the metric being optimized should be
maximized, it is internally transformed into a quantity to minimize.
The status type indicates whether the run was successful, and the most
interesting entry of the additional information is the internal training
loss. Detailed information on the runtime is available as well.
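For the accuracy metric used in this run, the recorded numbers are consistent with the cost being one minus the validation score: the first run has a cost of 0.07801418439716312, and the statistics printed earlier report a best validation score of 0.978723. The short check below assumes this `1 - score` relationship, which is an inference from the numbers on this page rather than a statement about every metric; both helper names are hypothetical.

```python
def cost_to_accuracy(cost):
    """Convert a cost back to an accuracy, assuming cost = 1 - accuracy."""
    return 1.0 - cost


def accuracy_to_cost(accuracy):
    """Convert an accuracy to a cost under the same assumption."""
    return 1.0 - accuracy


first_run_cost = 0.07801418439716312  # cost of the first run shown above
best_validation_score = 0.978723      # from the sprint_statistics() output

print("Accuracy of the first run:", round(cost_to_accuracy(first_run_cost), 8))
print("Cost of the best run:", round(accuracy_to_cost(best_validation_score), 6))
```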



As an example, let's find the best configuration evaluated. As
Auto-sklearn solves a minimization problem internally, we need to look
for the entry with the lowest loss:



In [16]:
losses_and_configurations = [
    (run_value.cost, run_key.config_id)
    for run_key, run_value in automl.automl_.runhistory_.data.items()
]
losses_and_configurations.sort()
print("Lowest loss:", losses_and_configurations[0][0])
print(
    "Best configuration:",
    automl.automl_.runhistory_.ids_config[losses_and_configurations[0][1]],
)

Lowest loss: 0.021276595744680882
Best configuration: Configuration(values={
  'balancing:strategy': 'none',
  'classifier:__choice__': 'passive_aggressive',
  'classifier:passive_aggressive:C': 1.1756330265225057e-05,
  'classifier:passive_aggressive:average': 'True',
  'classifier:passive_aggressive:fit_intercept': 'True',
  'classifier:passive_aggressive:loss': 'squared_hinge',
  'classifier:passive_aggressive:tol': 2.2819710870848476e-05,
  'data_preprocessor:__choice__': 'feature_type',
  'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean',
  'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
  'feature_preprocessor:__choice__': 'pca',
  'feature_preprocessor:pca:keep_variance': 0.6667419144226332,
  'feature_preprocessor:pca:whiten': 'False',
})
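Sorting the full list works, but `min` with a key function finds the same entry in one pass and makes it easy to restrict the search to successful runs. The sketch below is self-contained: `RunKey` and `RunValue` are simplified, hypothetical stand-ins (named tuples with only the fields needed here), the status strings are placeholders for the real `StatusType` values, and the three entries reuse the costs of the first three runs printed earlier.

```python
from collections import namedtuple

# Simplified, hypothetical stand-ins for SMAC's RunKey and RunValue.
RunKey = namedtuple("RunKey", ["config_id"])
RunValue = namedtuple("RunValue", ["cost", "status"])


def best_run(data, is_success):
    """Return the (run_key, run_value) pair with the lowest cost among the
    runs accepted by `is_success`, or None if no run qualifies."""
    successful = [(key, value) for key, value in data.items() if is_success(value)]
    if not successful:
        return None
    return min(successful, key=lambda item: item[1].cost)


# Costs of the first three runs from the run history output.
data = {
    RunKey(1): RunValue(0.07801418439716312, "SUCCESS"),
    RunKey(2): RunValue(0.07092198581560283, "SUCCESS"),
    RunKey(3): RunValue(0.028368794326241176, "SUCCESS"),
}

key, value = best_run(data, lambda run_value: run_value.status == "SUCCESS")
print("Lowest loss:", value.cost)
print("Best configuration ID:", key.config_id)
```

With a fitted estimator, the same function could be applied to `automl.automl_.runhistory_.data`, with a predicate that compares `run_value.status` against the success status.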



## Detailed statistics about the search - part 2

To maintain compatibility with scikit-learn, Auto-sklearn provides the
same data as
[cv_results_](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).



In [17]:
print(automl.cv_results_)

{'mean_test_score': array([0.92198582, 0.92907801, 0.97163121, 0.89361702, 0.88652482,
       0.89361702, 0.97163121, 0.91489362, 0.92198582, 0.89361702,
       0.94326241, 0.89361702, 0.96453901, 0.96453901, 0.85815603,
       0.88652482, 0.95744681, 0.89361702, 0.92907801, 0.97163121,
       0.94326241, 0.89361702, 0.95744681, 0.9787234 , 0.91489362,
       0.95035461]), 'rank_test_scores': array([14, 12,  2, 18, 24, 18,  2, 16, 14, 18, 10, 18,  5,  5, 26, 24,  7,
       18, 12,  2, 10, 18,  7,  1, 16,  9]), 'mean_fit_time': array([0.52785301, 0.41740561, 0.50136518, 0.24256563, 0.23035717,
       0.27443433, 0.23250961, 0.29995418, 0.43281007, 0.24171424,
       0.26088953, 0.26614261, 0.2220614 , 0.25150943, 0.23788166,
       0.2165668 , 0.25430059, 0.25377274, 0.24905372, 0.44074345,
       0.2528522 , 0.223629  , 0.26293254, 0.21079683, 0.43792057,
       0.26933074]), 'params': [{'balancing:strategy': 'none', 'classifier:__choice__': 'random_forest', 'data_preprocessor:__choice
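`cv_results_` is a dictionary of parallel arrays, so entry `i` of every array describes the same run. The sketch below uses plain lists holding the first five `mean_test_score`, `rank_test_scores` and `mean_fit_time` values from the output above to show how the arrays line up; `rows_from_cv_results` is a hypothetical helper, and the real values are NumPy arrays, which can be iterated in the same way.

```python
def rows_from_cv_results(cv_results, keys):
    """Turn a dict of equally long sequences into one dict per run."""
    columns = [cv_results[key] for key in keys]
    return [dict(zip(keys, values)) for values in zip(*columns)]


# First five entries of the arrays printed above.
cv_results_excerpt = {
    "mean_test_score": [0.92198582, 0.92907801, 0.97163121, 0.89361702, 0.88652482],
    "rank_test_scores": [14, 12, 2, 18, 24],
    "mean_fit_time": [0.52785301, 0.41740561, 0.50136518, 0.24256563, 0.23035717],
}

rows = rows_from_cv_results(
    cv_results_excerpt, ["mean_test_score", "rank_test_scores", "mean_fit_time"]
)
# Order the runs from best to worst rank.
rows.sort(key=lambda row: row["rank_test_scores"])
for row in rows:
    print(row)
```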

## Inspect the components of the best model

Iterate over the components of each model in the ensemble and print
the explained variance ratio of its PCA stage.



In [18]:
# Each pipeline is a different model in the ensemble
for i, (weight, pipeline) in enumerate(automl.get_models_with_weights()):
    for stage_name, component in pipeline.named_steps.items():
        print("The {}th pipeline has a component {}".format(i, stage_name))
        if "feature_preprocessor" in stage_name:
            # The component is an instance of AutoSklearnChoice.
            # Access the sklearn object via the choice attribute.
            pca = component.choice.preprocessor
            print(
                "The {} of {}th pipeline has an explained variance of {}".format(
                    stage_name,
                    i,
                    # Explained variance ratio of each principal component
                    pca.explained_variance_ratio_,
                )
            )
            print(
                "The {} of {}th pipeline has {} components".format(
                    stage_name,
                    i,
                    pca.n_components_,
                )
            )

The 0th pipeline has a component data_preprocessor
The 0th pipeline has a component balancing
The 0th pipeline has a component feature_preprocessor
The feature_preprocessor of 0th pipeline has an explained variance of [0.4595393  0.18012072 0.09809101 0.06332899 0.0587162 ]
The feature_preprocessor of 0th pipeline has 5 components
The 0th pipeline has a component classifier
The 1th pipeline has a component data_preprocessor
The 1th pipeline has a component balancing
The 1th pipeline has a component feature_preprocessor
The feature_preprocessor of 1th pipeline has an explained variance of [0.46038401 0.16124884 0.09747816 0.06923404 0.06142479 0.03312917
 0.03182802 0.01555463 0.01348582 0.00965531 0.00870982 0.007397
 0.00547082 0.00443245 0.00396559 0.00313575 0.0022883  0.00195796
 0.00156348]
The feature_preprocessor of 1th pipeline has 19 components
The 1th pipeline has a component classifier
The 2th pipeline has a component data_preprocessor
The 2th pipeline has a component balancing


In [19]:
component.choice.preprocessor.components_

AttributeError: 'LibSVM_SVC' object has no attribute 'preprocessor'
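The `AttributeError` is consistent with `component` being the loop variable left over from the previous cell: after the loop it refers to the last step of the last pipeline, the classifier, whose choice (`LibSVM_SVC` here) has no `preprocessor`. Looking up the `feature_preprocessor` step by name avoids this. The helper below is a sketch: `pca_components` is a hypothetical name, it assumes every pipeline has a `feature_preprocessor` step whose choice wraps a fitted PCA (which holds in this example because the search was restricted to PCA), and the stand-in objects built with `SimpleNamespace` only imitate that attribute layout so that the snippet runs without Auto-sklearn.

```python
from types import SimpleNamespace


def pca_components(models_with_weights):
    """Return a list of (weight, components_) pairs, one per pipeline."""
    results = []
    for weight, pipeline in models_with_weights:
        step = pipeline.named_steps["feature_preprocessor"]
        results.append((weight, step.choice.preprocessor.components_))
    return results


def fake_pipeline(components):
    """Build a stand-in with the same attribute layout as a pipeline."""
    pca = SimpleNamespace(components_=components)
    step = SimpleNamespace(choice=SimpleNamespace(preprocessor=pca))
    return SimpleNamespace(named_steps={"feature_preprocessor": step})


# Made-up weights and component matrices, for illustration only.
fake_models = [
    (0.6, fake_pipeline([[1.0, 0.0], [0.0, 1.0]])),
    (0.4, fake_pipeline([[0.6, 0.8]])),
]

for weight, components in pca_components(fake_models):
    print(weight, len(components), "components")
```

With the fitted estimator from this notebook, the call would be `pca_components(automl.get_models_with_weights())`.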

In [20]:
for weight, model in automl.get_models_with_weights():
    print(tuple(model.steps[-2]))
    print(tuple(model.steps[-2]))
    print(tuple(model.steps[-2])[-1].choice.preprocessor.n_components_)
    # print(tuple(model.steps[-2])[-1].choice.preprocessor.scores_)
    # print(tuple(model.steps[-2])[-1].choice.preprocessor.percentile)

('feature_preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a3c80ea30>)
('feature_preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a3c80ea30>)
5
('feature_preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a452056a0>)
('feature_preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a452056a0>)
19
('feature_preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a3c513b50>)
('feature_preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a3c513b50>)
3
('feature_preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7e0a3c85d4f0>)
('feature_preprocessor', <autosklearn.pipeline.components.feat