ENH Adds HTML visualizations for estimators #14180

Merged on Apr 29, 2020 (104 commits)

Member

@thomasjpfan thomasjpfan commented Jun 25, 2019

Reference Issues/PRs

Closes #14061

What does this implement/fix? Explain your changes.

You can demo the visualization here: https://thomasjpfan.github.io/sklearn_viz_html/index.html

This PR implements an HTML visualization for estimators, with a focus on displaying it in a Jupyter notebook or JupyterLab. The implementation is pure HTML and CSS (no JavaScript or external dependencies):

[Screenshot: Screen Shot 2019-06-28 at 4 16 20 PM]

  1. We can hover over elements to see an estimator's parameters (print_changed_only=True is the default for export_html):

[Screenshot: Screen Shot 2019-06-24 at 10 11 36 PM]

  2. All the labels in bold can be hovered over to get more information.
  3. _type_of_html_estimator returns how to lay out metaestimators (ColumnTransformer and FeatureUnion are "parallel", while Pipeline is "serial"). If there are any other metaestimators to add, we just need to add them to _type_of_html_estimator.
  4. There is a hidden div, sk-final-spacer, as a hack to provide enough space for the information displayed while hovering over elements.
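
The hover behaviour described in the list above needs no JavaScript. A minimal sketch of the idea, not the PR's actual stylesheet or markup: the class names `demo-box` and `demo-params` and the helpers `hover_box` and `render` are hypothetical, chosen only for illustration.

```python
from html import escape

# Hypothetical class names; this is not the stylesheet shipped in the PR.
CSS = """
.demo-box { display: inline-block; position: relative; padding: 0.3em 0.6em;
            border: 1px solid gray; font-weight: bold; }
.demo-params { display: none; position: absolute; top: 100%; left: 0;
               padding: 0.3em; border: 1px dotted gray; background: white;
               font-weight: normal; white-space: pre; }
.demo-box:hover .demo-params { display: block; }
"""


def hover_box(name, params):
    """Return an HTML fragment: a bold label whose parameters appear on hover."""
    lines = "\n".join(f"{key}={value!r}" for key, value in params.items())
    return (
        f'<div class="demo-box">{escape(name)}'
        f'<div class="demo-params">{escape(lines)}</div></div>'
    )


def render(boxes):
    """Join the fragments into one self-contained HTML string with inline CSS."""
    return f"<style>{CSS}</style>" + "".join(boxes)


page = render([hover_box("SimpleImputer", {"strategy": "median"})])
```

Because the hidden block is absolutely positioned, it does not reserve space in the page flow, which is presumably why a spacer element like the one in the last list item is needed.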

Code to Create HTML (in JupyterLab or a notebook)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import FeatureUnion
from sklearn.feature_selection import SelectPercentile
from sklearn.inspection import display_estimator

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])

feat_u2 = FeatureUnion([("pca", PCA(n_components=1)),
                      ("svd", Pipeline([('tsvd1', TruncatedSVD(n_components=2)), 
                                        ('select', SelectPercentile())]))])

numeric_transformer2 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler(with_std=False)),
    ('feats', feat_u2)
])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', missing_values="missing")),
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop='first'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num1', numeric_transformer, numeric_features),
        ('num2', numeric_transformer2, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

feat_u = FeatureUnion([("pca", PCA(n_components=1, whiten=True, svd_solver='full')),
                      ("svd", TruncatedSVD(n_components=2, n_iter=10))])
clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial',
                         random_state=1, max_iter=200)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1, max_depth=8, warm_start=True, n_jobs=3, oob_score=True)
clf3 = GaussianNB()
eclf1 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('feat_u', feat_u),
                      ('classifier', eclf1)])
display_estimator(clf)

Member

@jnothman jnothman commented Jun 25, 2019

Awesome! I need to test out this parameter display thing. I want to know what happens when there's a lot to show (does it wrap?)

Is the monospace font necessary / helpful?

Member Author

@thomasjpfan thomasjpfan commented Jun 25, 2019

PR updated with:

  1. Adds Return to docstring (oops)
  2. Checks jupyter context with get_ipython
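
A check of this kind is commonly written as below. This is a sketch of the general idiom, not the code added in the PR; the function name `in_ipython_kernel` is hypothetical, and the shell class name is an assumption about how IPython identifies notebook kernels.

```python
def in_ipython_kernel():
    """Best-effort check: True only when running under a Jupyter-style kernel."""
    try:
        # IPython injects get_ipython into builtins; plain Python lacks it.
        shell = get_ipython()  # noqa: F821
    except NameError:
        return False
    # Assumption: notebook and lab kernels use ZMQInteractiveShell, while
    # the terminal IPython shell uses TerminalInteractiveShell.
    return type(shell).__name__ == "ZMQInteractiveShell"


running_in_kernel = in_ipython_kernel()
```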

Member

@jnothman jnothman commented Jun 25, 2019

One of my questions here is how we want to make this appear in the documentation...

As in, should these appear throughout the user guide or tutorial or example gallery.

Member Author

@thomasjpfan thomasjpfan commented Jun 25, 2019

Right now this generates a self-contained HTML file with the CSS and HTML it needs to render.

If this were to be placed in the examples, it would output a bunch of HTML when one runs the example locally. (We can do some clever hacking to get the HTML to render nicely on the webpage)

As for documentation, we can use this internally to display pipelines and/or metaestimators. (Most likely needing some Sphinx wizardry)

Member

@amueller amueller commented Jun 26, 2019

Hm, so pipelines show the step name if the step is a meta-estimator, and don't show the name otherwise. That seems a bit inconsistent?
Showing all the names might be a bit much, but are pipelines this complex actually common?

Similarly there's no box for pipelines containing simple estimators.

There doesn't seem to be a distinction between how a column transformer is visualized and how a voting classifier is visualized. I feel like they should look different in the graph, or they should have names somehow?

Member

@amueller amueller commented Jun 26, 2019

I feel like we need someone that knows about UX/UI design to work on this

Member

@amueller amueller commented Jun 26, 2019

We can do this two ways: either we merge an MVP and iterate, or we try to "get it right". I think getting it right will result in lots of bikeshedding and I'm not sure if we have anyone that's good with UIs. So maybe iterative is better?

Then the main remaining points are rendering in sphinx and adding generic meta-estimators and deciding on a name?

I would only really use it in the user guide for the pipeline and feature union, I think.

Member

@jnothman jnothman commented Jun 27, 2019

> I feel like we need someone that knows about UX/UI design to work on this

Agreed. Not to denigrate @thomasjpfan's illustrious skills of course. Is there someone we can call on?

Member

@jnothman jnothman commented Jun 27, 2019

Member

@amueller amueller commented Jun 27, 2019

@jnothman I'm trying to figure out if Columbia lets me make payments to upwork, which would increase the chances of the right helper coming along tremendously, I think ;)

I suggested to @thomasjpfan to implement support for generic meta-estimators and figure out what it takes to render this in sphinx and then we can try and merge and iterate.

Member Author

@thomasjpfan thomasjpfan commented Jun 28, 2019

You can demo the visualization here: https://thomasjpfan.github.io/sklearn_viz_html/index.html

Member

@jnothman jnothman commented Jul 9, 2019

But my main concern here is visibility of the feature. Do we want this to be the default _repr_html_ for estimators? Do we want a config to enable _repr_html_ and if so should it default on or off?

With _repr_html_ we should make a point of visualising pipelines in existing examples. But since they are HTML and not images, I'm not sure how their display will work with sphinx-gallery. Can we do this in SVG (even perhaps with a foreignObject)? Would that help?

Member

@ogrisel ogrisel commented Jul 9, 2019

> I'm not sure how their display will work with sphinx-gallery.

Maybe @lesteve knows?

Member

@lesteve lesteve commented Jul 9, 2019

> Maybe @lesteve knows?

Full disclosure: I am not following sphinx-gallery very closely any more. I don't think there is support for capturing rich output from the notebook inside sphinx-gallery yet. There were related discussions: sphinx-gallery/sphinx-gallery#396 and sphinx-gallery/sphinx-gallery#421. According to sphinx-gallery/sphinx-gallery#421 (comment), it seems like image scrapers may help.

Member

@amueller amueller commented Jul 9, 2019

Even if sphinx-gallery adds support, Sphinx doesn't, and we'd need to create a new extension to show it inside the user guide (which we would want).
The current workaround that @thomasjpfan and I settled on is to embed the raw HTML directly in the user guide.

I think this would be cool as _repr_html_ and also enable it by default.
Right now for single estimators it's just a box with the name, not showing any parameters; you have to mouse over it to get the changed parameters, and there is no way to see the non-changed parameters. That makes sense for showing a bigger pipeline but maybe not for a single estimator.

One option would be to make this only the default for meta-estimators, but not sure?

Member

@jnothman jnothman commented Jul 9, 2019

Member

@amueller amueller commented Jul 10, 2019

This directive might help: https://ipython.readthedocs.io/en/stable/sphinxext.html
And/or maybe @Carreau can tell us how to render html in our docs correctly, I'll ask him tomorrow.

Member

@GaelVaroquaux GaelVaroquaux commented Apr 29, 2020

Member

@rth rth commented Apr 29, 2020

> And you don't like having two API endpoints as in #14180 (comment) @rth?

For two endpoints, it might be a bit awkward to keep them in sync code-wise, I think (and a bit confusing to have two options with the same effect). But having a display config option that is {"text", "html"} and optionally accepts a list of MIME types for extensibility by advanced users, why not. This is all linked to the user extensibility discussion though.

Member

@adrinjalali adrinjalali commented Apr 29, 2020

yeah, we could maybe call it display_type(?) and accept both kinds of input.


@SylvainCorlay SylvainCorlay commented Apr 29, 2020

Not chiming in on the subjective question of whether text/html is too much jargon. One way to have other names would be to list acceptable abbreviations for a MIME type.

For example, text/plain could be addressed as [text, plain] - although I liked the raw list of standard MIME types, but I have my own biases.
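
The abbreviation idea can be sketched as a small normalisation step. Everything here is hypothetical and only illustrates the proposal: the alias table and the function `normalize_display` are not scikit-learn API, and the option adopted later in this thread is display="diagram".

```python
# Hypothetical alias table illustrating the proposal; not scikit-learn API.
_ALIASES = {
    "text": "text/plain",
    "plain": "text/plain",
    "html": "text/html",
}


def normalize_display(value):
    """Map a string or a list of strings to MIME types, keeping first-seen order."""
    items = [value] if isinstance(value, str) else list(value)
    mime_types = []
    for item in items:
        # Known abbreviations are expanded; full MIME types pass through.
        mime = _ALIASES.get(item, item)
        if "/" not in mime:
            raise ValueError(f"unknown display value: {item!r}")
        if mime not in mime_types:
            mime_types.append(mime)
    return mime_types
```

With this sketch, `normalize_display("html")` and `normalize_display(["text/html"])` give the same result, so a short name and a raw MIME type can share one option.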

Member

@rth rth commented Apr 29, 2020

> yeah, we could maybe call it display_type(?) and accept both kinds of input.

Or display_style, as in xarray. I don't have a strong opinion though. display is fine too.

Member

@jnothman jnothman commented Apr 29, 2020

I don't think this is a question of MIME type. This is a question of how to format estimators. We could have multiple representations in each MIME type and could effectively return a simple repr in HTML with some highlighting etc, and it would still be distinct from the visual diagram representation implemented here. Were we to have an ASCII implementation of this diagram it would still be visual but text/plain.

Further, were we to have implemented this in SVG (MIME image/svg+xml), changing the implementation to text/html should not concern the user.

The question here for the user is how we should summarise an estimator by default in a repl or cell-based framework that supports whatever MIME types we can offer it; the option to constrain MIME type is only pertinent if we have multiple MIME types with the same effective representation (but distinct presentation, I suppose).

Member

@ogrisel ogrisel left a comment

Still +1 on display="diagram" or alternatively display_style="diagram".

`display` option in :func:`sklearn.set_config`::

>>> from sklearn import set_config
>>> set_config(display=True) # doctest: +SKIP
Member

@ogrisel ogrisel Apr 29, 2020

Suggested change
- >>> set_config(display=True) # doctest: +SKIP
+ >>> set_config(display="diagram") # doctest: +SKIP

Member

@GaelVaroquaux GaelVaroquaux commented Apr 29, 2020


@SylvainCorlay SylvainCorlay commented Apr 29, 2020

> I don't think this is a question of MIME type. This is a question of how to format estimators. We could have multiple representations in each MIME type and could effectively return a simple repr in HTML with some highlighting etc, and it would still be distinct from the visual diagram representation implemented here. Were we to have an ASCII implementation of this diagram it would still be visual but text/plain.
>
> Further, were we to have implemented this in SVG (MIME image/svg+xml), changing the implementation to text/html should not concern the user.
>
> The question here for the user is how we should summarise an estimator by default in a repl or cell-based framework that supports whatever MIME types we can offer it; the option to constrain MIME type is only pertinent if we have multiple MIME types with the same effective representation (but distinct presentation, I suppose).

That is an excellent point. Diagram or another style is independent of the mime type.

Member Author

@thomasjpfan thomasjpfan commented Apr 29, 2020

To be clear, this PR adjusts the output of _repr_mimebundle_ depending on the configuration option. If display='diagram', then text/html is added to the bundle; otherwise only text/plain is provided in the bundle.
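
The behaviour described in this comment can be sketched with a toy class. This is an illustration under stated assumptions rather than the PR's code: `_CONFIG` stands in for the library's configuration and `ToyEstimator` for a real estimator, and both names are hypothetical.

```python
# Toy stand-ins: _CONFIG and ToyEstimator are hypothetical, not scikit-learn code.
_CONFIG = {"display": "text"}


class ToyEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def __repr__(self):
        return f"ToyEstimator(alpha={self.alpha!r})"

    def _repr_mimebundle_(self, **kwargs):
        # text/plain is always present so plain frontends keep working.
        bundle = {"text/plain": repr(self)}
        # text/html is added only when the diagram display is requested.
        if _CONFIG["display"] == "diagram":
            bundle["text/html"] = f"<div><b>{type(self).__name__}</b></div>"
        return bundle


est = ToyEstimator()
text_only = est._repr_mimebundle_()

_CONFIG["display"] = "diagram"
with_html = est._repr_mimebundle_()
```

A frontend that understands HTML picks the text/html entry when it is present, and every other frontend falls back to text/plain.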

Member

@jnothman jnothman left a comment

I think we should merge and can consider name changes right up to release. Should we implement an ipython magic to make it easy to enable?

Member

@jnothman jnothman commented Apr 29, 2020

(though I'm happy with this config name)

@jnothman jnothman changed the title [MRG] ENH Adds HTML visualizations for estimators ENH Adds HTML visualizations for estimators Apr 29, 2020
@jnothman jnothman merged commit ee2508c into scikit-learn:master Apr 29, 2020
21 checks passed
Member

@jnothman jnothman commented Apr 29, 2020

Amazing work, @thomasjpfan

@jnothman jnothman mentioned this pull request Apr 29, 2020
Member

@amueller amueller commented Apr 30, 2020

OMG I'm so happy this is merged!!!

It seems to me that right now the option does both: it allows an additional MIME type and selects what the content of the representation is. If you added the hypothetical different HTML representation that's not a graph, how would you enable that?

I was thinking about this addition in terms of mime type, and the diagram just as the most natural way to express nested estimators in html.

I don't have a strong opinion on the naming though. 'diagram' doesn't seem like a very natural option but I'm happy as long as we have the feature. These days I always think in terms of teaching stuff. How would you explain having a sklearn.set_config(display="diagram") line on the top of each notebook / how easy is that to remember? But I guess people will just get used to copy & pasting it.

Member

@ogrisel ogrisel commented Apr 30, 2020

> But I guess people will just get used to copy & pasting it.

Yes. Also this might be temporary. Maybe in 6 months it will be enabled by default.


@andreapi87 andreapi87 commented Dec 16, 2020

Hi,
I don't understand where the visualizer is.
I updated scikit-learn to master with pip install git+git://github.com/scikit-learn/scikit-learn@master --upgrade
but with from sklearn.inspection import display_estimator I still get: cannot import name 'display_estimator' from 'sklearn.inspection'

Member

@TomDLT TomDLT commented Dec 16, 2020

Here is the documentation.
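
For readers hitting the same import error: display_estimator appears only in the early description of this PR, and the discussion above converged on a display option in set_config. A minimal sketch, assuming scikit-learn 0.23 or later. `sklearn.utils.estimator_html_repr` is not mentioned in this thread and is named here as an assumption about the released API.

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr  # assumed released helper

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("classifier", LogisticRegression())])

# With display="diagram", the rich bundle also carries a text/html entry,
# so a notebook shows the diagram when `pipe` is the last line of a cell.
set_config(display="diagram")
bundle_diagram = pipe._repr_mimebundle_()

# With display="text", only the plain repr is offered.
set_config(display="text")
bundle_text = pipe._repr_mimebundle_()

# The HTML can also be produced explicitly, e.g. to write it to a file.
html = estimator_html_repr(pipe)
```

set_config changes global state, so in a notebook the display="diagram" call would normally be made once at the top and left in place.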
