ENH Adds HTML visualizations for estimators #14180

thomasjpfan · 2019-06-25T02:18:38Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

You can demo the visualization here: https://thomasjpfan.github.io/sklearn_viz_html/index.html

This PR implements an HTML visualization for estimators, with a focus on displaying it in a Jupyter notebook or JupyterLab. The implementation is pure HTML and CSS (no JavaScript or external dependencies).

We can hover over elements to see an estimator's parameters (print_changed_only=True is the default for export_html).

All the labels in bold can be hovered over to get more information.
_type_of_html_estimator returns how to lay out metaestimators (ColumnTransformer and FeatureUnion are "parallel", while Pipeline is "serial"). To support any other metaestimator, we just need to add it to _type_of_html_estimator.
There is a hidden div sk-final-spacer as a hack to provide enough space for the information displayed while hovering over elements.
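
The serial/parallel distinction described above can be sketched as a small lookup. This is not the PR's _type_of_html_estimator: the function name `layout_kind`, the name-based table, and the "single" fallback are hypothetical, and the stand-in classes only mimic the scikit-learn class names so that the sketch runs without scikit-learn installed.

```python
# Hypothetical sketch: map a meta-estimator to a layout kind by class name.
_LAYOUTS = {
    "Pipeline": "serial",
    "ColumnTransformer": "parallel",
    "FeatureUnion": "parallel",
}


def layout_kind(estimator):
    """Return 'serial', 'parallel', or 'single' for anything unlisted."""
    # Walk the MRO so that subclasses of a listed class are also matched.
    for cls in type(estimator).__mro__:
        kind = _LAYOUTS.get(cls.__name__)
        if kind is not None:
            return kind
    return "single"


# Stand-ins named after the real classes (assumption: illustration only).
class Pipeline:
    pass


class FeatureUnion:
    pass


class LogisticRegression:
    pass


print(layout_kind(Pipeline()))            # serial
print(layout_kind(FeatureUnion()))        # parallel
print(layout_kind(LogisticRegression()))  # single
```

With a table like this, supporting another meta-estimator is a one-line change, which matches the intent stated above.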

Code to Create HTML (In jupyterlab or a notebook)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import FeatureUnion
from sklearn.feature_selection import SelectPercentile
from sklearn.inspection import display_estimator

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])

feat_u2 = FeatureUnion([("pca", PCA(n_components=1)),
                      ("svd", Pipeline([('tsvd1', TruncatedSVD(n_components=2)), 
                                        ('select', SelectPercentile())]))])

numeric_transformer2 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler(with_std=False)),
    ('feats', feat_u2)
])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', missing_values="missing")),
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop='first'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num1', numeric_transformer, numeric_features),
        ('num2', numeric_transformer2, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

feat_u = FeatureUnion([("pca", PCA(n_components=1, whiten=True, svd_solver='full')),
                      ("svd", TruncatedSVD(n_components=2, n_iter=10))])
clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial',
                         random_state=1, max_iter=200)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1, max_depth=8,
                              warm_start=True, n_jobs=3, oob_score=True)
clf3 = GaussianNB()
eclf1 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('feat_u', feat_u),
                      ('classifier', eclf1)])
display_estimator(clf)

jnothman · 2019-06-25T02:58:27Z

Awesome! I need to test out this parameter display thing. I want to know what happens when there's a lot to show (does it wrap?)

Is the monospace font necessary / helpful?

sklearn/inspection/_plot_estimators.py

thomasjpfan · 2019-06-25T13:30:53Z

PR updated with:

Adds Return to docstring (oops)
Checks jupyter context with get_ipython
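
One common way to do such a check is sketched below. It is an assumption about the approach, not the code in this PR: `in_jupyter_kernel` is a hypothetical helper, and matching on the shell class name `ZMQInteractiveShell` is a widely used heuristic rather than an official IPython API.

```python
def in_jupyter_kernel():
    """Best-effort check for a Jupyter notebook/lab kernel."""
    try:
        # IPython injects get_ipython into builtins; plain Python raises NameError.
        shell = get_ipython()  # noqa: F821
    except NameError:
        return False
    # Notebook and lab kernels use ZMQInteractiveShell; the terminal
    # IPython shell uses TerminalInteractiveShell.
    return type(shell).__name__ == "ZMQInteractiveShell"


print(in_jupyter_kernel())
```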

jnothman · 2019-06-25T13:33:17Z

One of my questions here is how we want to make this appear in the documentation...

As in, should these appear throughout the user guide or tutorial or example gallery.

thomasjpfan · 2019-06-25T14:00:26Z

Right now this generates a self-contained HTML file with the CSS and HTML it needs to render.
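
To illustrate what "self-contained" means here, the sketch below builds one HTML fragment that carries its own CSS and reveals parameters on hover without JavaScript. The function name `render_box`, the `demo-*` class names, and the styling are all hypothetical; the PR's actual markup and CSS differ.

```python
from html import escape

# Inline CSS: the parameter block is hidden until its box is hovered.
_STYLE = """<style>
.demo-box {display: inline-block; border: 1px solid gray; padding: 0.3em;}
.demo-box .demo-params {display: none; font-family: monospace;}
.demo-box:hover .demo-params {display: block;}
</style>"""


def render_box(name, params):
    """Return an HTML fragment with embedded CSS for one estimator."""
    text = ", ".join(f"{key}={value!r}" for key, value in params.items())
    return (
        f"{_STYLE}"
        f'<div class="demo-box"><b>{escape(name)}</b>'
        f'<div class="demo-params">{escape(text)}</div></div>'
    )


html = render_box("SimpleImputer", {"strategy": "median"})
print(html)
```

Because the style block travels with the markup, the fragment can be pasted into a notebook cell output or a static page without any external assets.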

If this were to be placed in the examples, it would output a bunch of HTML when one runs the example locally. (We can do some clever hacking to get the HTML to render nicely on the webpage)

As for documentation, we can use this internally to display pipelines and/or metaestimators. (Most likely needing some Sphinx wizardry)

sklearn/inspection/_plot_estimators.py

amueller · 2019-06-26T20:22:38Z

Hm so pipelines show names if the step is a meta-estimator, and it doesn't show the name otherwise. That seems a bit inconsistent?
Showing all the names might be a bit much, but are pipelines this complex actually common?

Similarly there's no box for pipelines containing simple estimators.

There doesn't seem to be a distinction between how a column transformer is visualized and how a voting classifier is visualized. I feel like they should look different in the graph, or they should have names somehow?

amueller · 2019-06-26T20:25:19Z

I feel like we need someone that knows about UX/UI design to work on this

sklearn/inspection/__init__.py

amueller · 2019-06-26T20:31:53Z

We can do this two ways: either we merge an MVP and iterate, or we try to "get it right". I think getting it right will result in lots of bikeshedding and I'm not sure if we have anyone that's good with UIs. So maybe iterative is better?

Then the main remaining points are rendering in sphinx and adding generic meta-estimators and deciding on a name?

I would only really use it in the user guide for the pipeline and feature union, I think.

jnothman · 2019-06-27T07:59:10Z

> I feel like we need someone that knows about UX/UI design to work on this

Agreed. Not to denigrate @thomasjpfan's illustrious skills of course. Is there someone we can call on?

jnothman · 2019-06-27T11:45:15Z

I'm okay to iterate too. I.e. to make a reasonable first effort visible then hope the right helper comes along.

amueller · 2019-06-27T14:41:27Z

@jnothman I'm trying to figure out if Columbia lets me make payments to upwork, which would increase the chances of the right helper to come along tremendously, I think ;)

I suggested to @thomasjpfan to implement support for generic meta-estimators and figure out what it takes to render this in sphinx and then we can try and merge and iterate.

thomasjpfan · 2019-06-28T20:14:46Z

You can demo the visualization here: https://thomasjpfan.github.io/sklearn_viz_html/index.html

jnothman · 2019-07-09T02:51:10Z

But my main concern here is visibility of the feature. Do we want this to be the default _repr_html_ for estimators? Do we want a config to enable _repr_html_ and if so should it default on or off?

With _repr_html_ we should make a point of visualising pipelines in existing examples. But since they are HTML and not images, I'm not sure how their display will work with sphinx-gallery. Can we do this in SVG (even perhaps with a foreignObject)? Would that help??

ogrisel · 2019-07-09T07:54:06Z

> I'm not sure how their display will work with sphinx-gallery.

Maybe @lesteve knows?

lesteve · 2019-07-09T11:48:46Z

> Maybe @lesteve knows?

Full disclosure: I am not following sphinx-gallery very closely any more. I don't think there is support for capturing rich output from the notebook inside sphinx-gallery yet. There were related discussions: sphinx-gallery/sphinx-gallery#396 and sphinx-gallery/sphinx-gallery#421. According to sphinx-gallery/sphinx-gallery#421 (comment), it seems like image scrapers may help.

amueller · 2019-07-09T15:31:21Z

Even if sphinx-gallery adds support, sphinx doesn't and we'd need to create a new extension to show it inside the user guide (which we would want).
The current workaround that @thomasjpfan and I settled on is just directly embedding the raw HTML for the user guide.

I think this would be cool as _repr_html_ and also enable it by default.
Right now for single estimators it's just a box with the name, not showing any parameters; you have to mouse over to get the changed parameters, and there is no way to see the non-changed parameters. That makes sense for showing bigger pipelines but maybe not for a single estimator.

One option would be to make this only the default for meta-estimators, but not sure?

jnothman · 2019-07-09T22:32:55Z

If it was a graphic, it wouldn't need extra Sphinx support if we only used diagrams from examples, as we do with plots... For Sphinx support: We can easily generate the raw html in external files with a simple preprocessor, and could eventually make it into a directive. It sounds like the default repr discussion should happen separately where we can consider the proposals.

amueller · 2019-07-10T03:11:42Z

This directive might help: https://ipython.readthedocs.io/en/stable/sphinxext.html
And/or maybe @Carreau can tell us how to render html in our docs correctly, I'll ask him tomorrow.

GaelVaroquaux · 2020-04-29T08:22:48Z

> Some learning would also likely be involved. How would we feel if a library that uses cross-validation invented a new word for it because cross-validation is too technical?

That's different: it's about adapting to our audience. Our audience needs to know some ML/stats to use scikit-learn. A lot of statisticians use R partly because it's less "geeky", less computer jargon. We need to cater to this audience.

rth · 2020-04-29T08:26:29Z

> And you don't like having two API endpoints as in #14180 (comment) @rth?

For two endpoints, it might be a bit awkward to keep them in sync code-wise, I think (and a bit confusing to have two options with the same effect). But having the display config option that is {"text", "html"} and optionally accepts a list of MIME types for extensibility by advanced users, why not. This is all linked to the user extensibility discussion though.

adrinjalali · 2020-04-29T08:30:07Z

yeah, we could maybe call it display_type(?) and accept both kinds of input.

SylvainCorlay · 2020-04-29T08:35:27Z

Not chiming into the subjective appreciation on whether text/html is too much jargon. One way to have other names would be to list acceptable abbreviations for a mimetype.

For example, text/plain could be addressed as [text, plain] - although I liked the raw list of standard mime types, but I have my own biases.

rth · 2020-04-29T08:37:22Z

> yeah, we could maybe call it display_type(?) and accept both kinds of input.

Or display_style, as in xarray. I don't have a strong opinion though. display is fine too.

jnothman · 2020-04-29T10:50:22Z

I don't think this is a question of MIME type. This is a question of how to format estimators. We could have multiple representations in each MIME type and could effectively return a simple repr in HTML with some highlighting etc, and it would still be distinct from the visual diagram representation implemented here. Were we to have an ASCII implementation of this diagram it would still be visual but text/plain.

Further, were we to have implemented this in SVG (MIME image/svg+xml), changing the implementation to text/html should not concern the user.

The question here for the user is how we should summarise an estimator by default in a repl or cell-based framework that supports whatever MIME types we can offer it; the option to constrain MIME type is only pertinent if we have multiple MIME types with the same effective representation (but distinct presentation, I suppose).

ogrisel

Still +1 on display="diagram" or alternatively display_style="diagram".

examples/compose/plot_column_transformer_mixed_types.py

ogrisel · 2020-04-29T16:29:49Z

doc/modules/compose.rst

+`display` option in :func:`sklearn.set_config`::
+
+  >>> from sklearn import set_config
+  >>> set_config(display=True)   # doctest: +SKIP


Suggested change

-  >>> set_config(display=True)   # doctest: +SKIP
+  >>> set_config(display="diagram")   # doctest: +SKIP

GaelVaroquaux · 2020-04-29T17:05:23Z

> Still +1 on display="diagram" or alternatively display_style="diagram".

I like it

SylvainCorlay · 2020-04-29T17:09:29Z

> I don't think this is a question of MIME type. This is a question of how to format estimators. We could have multiple representations in each MIME type and could effectively return a simple repr in HTML with some highlighting etc, and it would still be distinct from the visual diagram representation implemented here. Were we to have an ASCII implementation of this diagram it would still be visual but text/plain.
>
> Further, were we to have implemented this in SVG (MIME image/svg+xml), changing the implementation to text/html should not concern the user.
>
> The question here for the user is how we should summarise an estimator by default in a repl or cell-based framework that supports whatever MIME types we can offer it; the option to constrain MIME type is only pertinent if we have multiple MIME types with the same effective representation (but distinct presentation, I suppose).

That is an excellent point. Diagram or another style is independent of the mime type.

thomasjpfan · 2020-04-29T20:51:52Z

To be clear, this PR adjusts the output of _repr_mimebundle_ depending on the configuration option. If display='diagram', then text/html will be added to the bundle; otherwise only text/plain will be provided in the bundle.
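
The mechanism described in this comment can be sketched without scikit-learn. `_repr_mimebundle_` is the hook that IPython and Jupyter call on displayed objects; everything else below (the `_config` dict, `set_display`, `ToyEstimator`, and `_diagram_html`) is a hypothetical stand-in for the real configuration and estimator classes.

```python
# Hypothetical stand-in for the global configuration.
_config = {"display": "text"}


def set_display(value):
    """Switch between the plain-text repr and the diagram."""
    if value not in ("text", "diagram"):
        raise ValueError(f"unknown display option: {value!r}")
    _config["display"] = value


class ToyEstimator:
    def __repr__(self):
        return "ToyEstimator()"

    def _diagram_html(self):
        # Placeholder for the real diagram markup.
        return "<div><b>ToyEstimator</b></div>"

    def _repr_mimebundle_(self, **kwargs):
        # text/plain is always present; text/html only in diagram mode.
        bundle = {"text/plain": repr(self)}
        if _config["display"] == "diagram":
            bundle["text/html"] = self._diagram_html()
        return bundle


est = ToyEstimator()
print(sorted(est._repr_mimebundle_()))  # ['text/plain']
set_display("diagram")
print(sorted(est._repr_mimebundle_()))  # ['text/html', 'text/plain']
```

The HTML helper is deliberately not named `_repr_html_` in this sketch: IPython also looks for that method on its own, so an unconditional `_repr_html_` would be rendered even in text mode.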

jnothman

I think we should merge and can consider name changes right up to release. Should we implement an ipython magic to make it easy to enable?

jnothman · 2020-04-29T22:26:27Z

(though I'm happy with this config name)

jnothman · 2020-04-29T22:32:12Z

Amazing work, @thomasjpfan

amueller · 2020-04-30T00:52:23Z

OMG I'm so happy this is merged!!!

It seems to me right now the option does both, allowing an additional mime type and selecting what the content of the representation is. If you'd added the hypothetical different html representation that's not a graph, how would you enable that?

I was thinking about this addition in terms of mime type, and the diagram just as the most natural way to express nested estimators in html.

I don't have a strong opinion on the naming though. 'diagram' doesn't seem like a very natural option but I'm happy as long as we have the feature. These days I always think in terms of teaching stuff. How would you explain having a sklearn.set_config(display="diagram") line on the top of each notebook / how easy is that to remember? But I guess people will just get used to copy & pasting it.

ogrisel · 2020-04-30T07:51:28Z

> But I guess people will just get used to copy & pasting it.

Yes. Also this might be temporary. Maybe in 6 months it will be enabled by default.

andreapi87 · 2020-12-16T12:35:27Z

Hi,
I don't understand where the visualizer is.
I updated scikit-learn to master with pip install git+git://github.com/scikit-learn/scikit-learn@master --upgrade
but with from sklearn.inspection import display_estimator I still get: cannot import name 'display_estimator' from 'sklearn.inspection'

TomDLT · 2020-12-16T19:01:31Z

Here is the documentation.
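
For readers who hit the same import error: the names changed during review, and display_estimator from the PR description is not what was merged. Assuming scikit-learn 0.23 or later, the diagram is enabled with `sklearn.set_config(display="diagram")`, and the HTML string itself is available from `sklearn.utils.estimator_html_repr`. A minimal sketch under that assumption (the two-step pipeline is only an illustration):

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])

# In a notebook, this makes `pipe` render as a diagram when displayed.
set_config(display="diagram")

# Outside a notebook, the same HTML is available as a plain string.
html = estimator_html_repr(pipe)
print(len(html))
```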

ENH Adds export_html

7e75b73

jnothman reviewed Jun 25, 2019

CLN Checks for jupyter context

2ce3552

amueller reviewed Jun 26, 2019

amueller reviewed Jun 26, 2019

thomasjpfan added 2 commits June 28, 2019 15:26

Merge remote-tracking branch 'upstream/master' into html_viz

f8d2681

ENH Updates style

8b015e1

thomasjpfan mentioned this pull request Jun 28, 2019

Add pipeline visualizer? #14061

Closed

thomasjpfan added 4 commits July 2, 2019 13:12

TST refactor test_numeric_stability (scikit-learn#14221)

893dbfc

Merge remote-tracking branch 'upstream/master' into html_viz

f7bfb0c

Merge remote-tracking branch 'upstream/master' into html_viz

597a99b

CLN Renames function

9ccdbd0

ENH Adds sphinx extension to visiualize

4df33f8

ogrisel approved these changes Apr 29, 2020

thomasjpfan added 3 commits April 29, 2020 13:37

Merge remote-tracking branch 'upstream/master' into html_viz

19f43ba

CLN Address comments

e688cb7

BUG Fix

b994664

CLN Address comments

57fea52

jnothman reviewed Apr 29, 2020

jnothman changed the title ~~[MRG] ENH Adds HTML visualizations for estimators~~ ENH Adds HTML visualizations for estimators Apr 29, 2020

jnothman merged commit ee2508c into scikit-learn:master Apr 29, 2020

jnothman mentioned this pull request Apr 29, 2020

Add commits to 0.23.X #17084

Merged

adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Apr 30, 2020

ENH Adds HTML visualizations for estimators (scikit-learn#14180)

01d6a7b

ogrisel mentioned this pull request Apr 30, 2020

DOC Fix and extend mixed dtype column transformer example #17088

Merged

adrinjalali pushed a commit that referenced this pull request Apr 30, 2020

ENH Adds HTML visualizations for estimators (#14180)

425564b

thomasjpfan mentioned this pull request Apr 30, 2020

BUG Adjusts html_repr based on configuration #17093

Merged

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020

ENH Adds HTML visualizations for estimators (scikit-learn#14180)

cbc3686

viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020

ENH Adds HTML visualizations for estimators (scikit-learn#14180)

9950845
