[MRG] ENH Adds HTML visualizations for estimators #14180

Open. @thomasjpfan wants to merge 21 commits into base: master.

Conversation

@thomasjpfan
Member

thomasjpfan commented Jun 25, 2019

Reference Issues/PRs

Closes #14061

What does this implement/fix? Explain your changes.

You can demo the visualization here: https://thomasjpfan.github.io/sklearn_viz_html/index.html

This PR implements an HTML visualization for estimators, with a focus on displaying it in a Jupyter notebook or JupyterLab. The implementation is pure HTML and CSS (no JavaScript or external dependencies):

[Screenshot: Screen Shot 2019-06-28 at 4 16 20 PM]

  1. We can hover over elements to see an estimator's parameters (print_changed_only=True is the default for export_html):

[Screenshot: Screen Shot 2019-06-24 at 10 11 36 PM]

  2. All the labels in bold can be hovered over to get more information.
  3. _type_of_html_estimator returns how to lay out metaestimators (ColumnTransformer and FeatureUnion are "parallel", while Pipeline is "serial"). If there are any other metaestimators to support, we just need to add them to _type_of_html_estimator.
  4. There is a hidden div, sk-final-spacer, as a hack to provide enough space for the information displayed while hovering over elements.
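
The layout rule for metaestimators described above can be sketched without importing scikit-learn, by looking at the attribute each metaestimator uses to hold its children. This is a hedged illustration and not the PR's `_type_of_html_estimator`: the function name `layout_kind`, the duck-typed attribute checks, and the `Fake*` stand-in classes are assumptions made for the example. Only the "serial" and "parallel" labels and the three estimator classes come from the description; the "single" fallback mirrors a label that appears in a later diff excerpt.

```python
def layout_kind(estimator):
    """Classify an estimator as 'serial', 'parallel', or 'single' (sketch)."""
    # Pipeline keeps its ordered children in ``steps``.
    if hasattr(estimator, "steps"):
        return "serial"
    # FeatureUnion uses ``transformer_list``; ColumnTransformer uses
    # ``transformers``. Both apply their children side by side.
    if hasattr(estimator, "transformer_list") or hasattr(estimator, "transformers"):
        return "parallel"
    # Anything else is drawn as a single box.
    return "single"


# Hypothetical stand-ins so the sketch runs without scikit-learn installed.
class FakePipeline:
    steps = []


class FakeUnion:
    transformer_list = []


class FakeColumnTransformer:
    transformers = []


kinds = [layout_kind(obj) for obj in
         (FakePipeline(), FakeUnion(), FakeColumnTransformer(), object())]
print(kinds)
```

Supporting another metaestimator in this sketch would mean adding one more attribute check, which is the extension point the description refers to.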
Code to Create HTML (In jupyterlab or a notebook)
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectPercentile
from sklearn.impute import SimpleImputer
from sklearn.inspection import display_estimator  # proposed in this PR
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])

feat_u2 = FeatureUnion([
    ("pca", PCA(n_components=1)),
    ("svd", Pipeline([('tsvd1', TruncatedSVD(n_components=2)),
                      ('select', SelectPercentile())]))])

numeric_transformer2 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler(with_std=False)),
    ('feats', feat_u2)
])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', missing_values="missing")),
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop='first'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num1', numeric_transformer, numeric_features),
        ('num2', numeric_transformer2, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

feat_u = FeatureUnion([
    ("pca", PCA(n_components=1, whiten=True, svd_solver='full')),
    ("svd", TruncatedSVD(n_components=2, n_iter=10))])
clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial',
                          random_state=1, max_iter=200)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1, max_depth=8,
                              warm_start=True, n_jobs=3, oob_score=True)
clf3 = GaussianNB()
eclf1 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('feat_u', feat_u),
                      ('classifier', eclf1)])
display_estimator(clf)
@jnothman

Member

jnothman commented Jun 25, 2019

Awesome! I need to test out this parameter display thing. I want to know what happens when there's a lot to show (does it wrap?)

Is the monospace font necessary / helpful?

print_changed_only : bool, optional (default=True)
If True, only the parameters that were set to non-default
values will be printed when printing an estimator.
"""

@jnothman

jnothman Jun 25, 2019

Member

Returns?

html_output = out.getvalue()

try:
    from IPython.display import HTML

@jnothman

jnothman Jun 25, 2019

Member

Hmm... I think we should determine more specifically whether we are in an IPython evaluation context, or at a minimum, check if 'IPython' is in sys.modules

@thomasjpfan

Member Author

thomasjpfan commented Jun 25, 2019

PR updated with:

  1. Adds Return to docstring (oops)
  2. Checks jupyter context with get_ipython
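
For readers following along, the two checks discussed above (looking for 'IPython' in sys.modules and calling get_ipython) can be combined as in the sketch below. The helper name `in_ipython_session` is hypothetical and this is not the code from the PR; `IPython.get_ipython` is the real IPython function, which returns None when no interactive shell is active.

```python
import sys


def in_ipython_session():
    """Return True only when an IPython/Jupyter shell is currently active."""
    # If IPython was never imported, we cannot be inside an IPython shell,
    # and we avoid importing it just to find that out.
    if "IPython" not in sys.modules:
        return False
    from IPython import get_ipython
    # get_ipython() returns the running shell instance, or None outside one.
    return get_ipython() is not None


print(in_ipython_session())
```

The sys.modules guard keeps the check cheap in plain Python scripts, where IPython may be installed but unused.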
@jnothman

Member

jnothman commented Jun 25, 2019

One of my questions here is how we want to make this appear in the documentation...

As in, should these appear throughout the user guide, the tutorial, or the example gallery?

@thomasjpfan

Member Author

thomasjpfan commented Jun 25, 2019

Right now this generates a self-contained HTML file with the CSS and HTML it needs to render.

If this were to be placed in the examples, it would output a bunch of HTML when one runs the example locally. (We can do some clever hacking to get the HTML to render nicely on the webpage)

As for documentation, we can use this internally to display pipelines and/or metaestimators. (Most likely needing some Sphinx wizardry)
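
To make "self-contained HTML with hover information, no JavaScript" concrete, here is a minimal sketch of the general technique: a CSS `:hover` rule toggles a hidden child element. It is an assumption-laden illustration, not the markup this PR generates. The function names and the CSS class names (`sk-node`, `sk-tip`, `sk-container`) are invented for the example.

```python
from html import escape

# Inline stylesheet: the tooltip is hidden until its parent node is hovered.
_STYLE = """<style>
.sk-node { position: relative; display: inline-block; font-weight: bold; }
.sk-node .sk-tip { display: none; position: absolute; top: 100%; left: 0;
                   white-space: pre; font-family: monospace;
                   font-weight: normal; background: #fff;
                   border: 1px solid #999; padding: 0.3em; }
.sk-node:hover .sk-tip { display: block; }
</style>"""


def hover_node(label, details):
    """Render one bold label whose details appear on hover."""
    return '<div class="sk-node">{}<div class="sk-tip">{}</div></div>'.format(
        escape(label), escape(details))


def render(nodes):
    """Return a self-contained HTML fragment for (label, details) pairs."""
    body = "".join(hover_node(label, details) for label, details in nodes)
    return _STYLE + '<div class="sk-container">' + body + "</div>"


html_output = render([("PCA", "PCA(n_components=1)"),
                      ("TruncatedSVD", "TruncatedSVD(n_iter=10)")])
print(html_output)
```

Because the stylesheet travels with the fragment, the output can be pasted into a notebook cell or a raw HTML block in the docs without any external assets.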

estimators = [est[1] for est in estimator.estimators]
names = [est[0] for est in estimator.estimators]
name_tips = [_estimator_tool_tip(est) for est in estimators]
return _EstHTMLInfo('parallel', estimators, names, name_tips)

@amueller

amueller Jun 26, 2019

Member

I would check if it has a base_estimator_ or estimator_ and then iterate into it.
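
A rough sketch of that suggestion, for illustration only: follow whichever of the two attributes is present until no wrapped estimator remains. The generator name `iter_wrapped` and the `Fake*` classes are hypothetical; the attribute names `base_estimator_` and `estimator_` are the ones mentioned in the comment above.

```python
def iter_wrapped(estimator, attrs=("base_estimator_", "estimator_")):
    """Yield an estimator, then each estimator it wraps, outermost first."""
    seen = set()
    current = estimator
    # The ``seen`` set guards against accidental reference cycles.
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        yield current
        nxt = None
        for attr in attrs:
            candidate = getattr(current, attr, None)
            if candidate is not None:
                nxt = candidate
                break
        current = nxt


# Hypothetical stand-ins for a generic meta-estimator and its inner model.
class FakeInner:
    pass


class FakeWrapper:
    def __init__(self, inner):
        self.estimator_ = inner


chain = [type(est).__name__
         for est in iter_wrapped(FakeWrapper(FakeWrapper(FakeInner())))]
print(chain)
```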

@amueller

Member

amueller commented Jun 26, 2019

Hm, so a pipeline shows a step's name if the step is a meta-estimator, and doesn't show the name otherwise. That seems a bit inconsistent?
Showing all the names might be a bit much, but are pipelines this complex actually common?

Similarly there's no box for pipelines containing simple estimators.

There doesn't seem to be a distinction between how a column transformer is visualized and how a voting classifier is visualized. I feel like they should look different in the graph, or they should have names somehow?

@amueller

Member

amueller commented Jun 26, 2019

I feel like we need someone that knows about UX/UI design to work on this



__all__ = [
    'partial_dependence',
    'plot_partial_dependence',
    'export_html'

@amueller

amueller Jun 26, 2019

Member

display_estimator

@amueller

Member

amueller commented Jun 26, 2019

We can do this two ways: either we merge an MVP and iterate, or we try to "get it right". I think getting it right will result in lots of bikeshedding, and I'm not sure if we have anyone that's good with UIs. So maybe iterative is better?

Then the main remaining points are rendering in sphinx and adding generic meta-estimators and deciding on a name?

I would only really use it in the user guide for the pipeline and feature union, I think.

@jnothman

Member

jnothman commented Jun 27, 2019

> I feel like we need someone that knows about UX/UI design to work on this

Agreed. Not to denigrate @thomasjpfan's illustrious skills of course. Is there someone we can call on?

@jnothman

Member

jnothman commented Jun 27, 2019

@amueller

Member

amueller commented Jun 27, 2019

@jnothman I'm trying to figure out if Columbia lets me make payments to upwork, which would increase the chances of the right helper to come along tremendously, I think ;)

I suggested to @thomasjpfan to implement support for generic meta-estimators and figure out what it takes to render this in sphinx and then we can try and merge and iterate.

thomasjpfan added 2 commits Jun 28, 2019
@thomasjpfan

Member Author

thomasjpfan commented Jun 28, 2019

You can demo the visualization here: https://thomasjpfan.github.io/sklearn_viz_html/index.html

thomasjpfan added 4 commits Jun 30, 2019
@jnothman

Member

jnothman commented Jul 9, 2019

But my main concern here is visibility of the feature. Do we want this to be the default _repr_html_ for estimators? Do we want a config to enable _repr_html_ and if so should it default on or off?

With _repr_html_ we should make a point of visualising pipelines in existing examples. But since they are HTML and not images, I'm not sure how their display will work with sphinx-gallery. Can we do this in SVG (even perhaps with a foreignObject)? Would that help??

@ogrisel

Member

ogrisel commented Jul 9, 2019

> I'm not sure how their display will work with sphinx-gallery.

Maybe @lesteve knows?

@lesteve

Member

lesteve commented Jul 9, 2019

> Maybe @lesteve knows?

Full disclosure: I am not following sphinx-gallery very closely any more. I don't think there is support for capturing rich output from the notebook inside sphinx-gallery yet. There were related discussions: sphinx-gallery/sphinx-gallery#396 and sphinx-gallery/sphinx-gallery#421. According to sphinx-gallery/sphinx-gallery#421 (comment), it seems like image scrapers may help.

@amueller

Member

amueller commented Jul 9, 2019

Even if sphinx-gallery adds support, sphinx doesn't, and we'd need to create a new extension to show it inside the user guide (which we would want).
The current workaround that @thomasjpfan and I settled on is to directly embed the raw HTML in the user guide.

I think this would be cool as _repr_html_, and I would also enable it by default.
Right now, a single estimator is just a box with the name, not showing any parameters. You have to mouse over it to get the changed parameters, and there is no way to see the non-changed parameters. That makes sense for showing a bigger pipeline, but maybe not for a single estimator.

One option would be to make this only the default for meta-estimators, but not sure?

@jnothman

Member

jnothman commented Jul 9, 2019

@amueller

Member

amueller commented Jul 10, 2019

This directive might help: https://ipython.readthedocs.io/en/stable/sphinxext.html
And/or maybe @Carreau can tell us how to render html in our docs correctly, I'll ask him tomorrow.

thomasjpfan added 6 commits Jul 19, 2019
@thomasjpfan

Member Author

thomasjpfan commented Aug 1, 2019

Here is a notebook showing a workflow using display_estimator.

@amueller

Member

amueller commented Aug 1, 2019

@thomasjpfan nice :) Wanna make it work for dabl ;)

thomasjpfan added 2 commits Aug 1, 2019
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])
print(display_estimator(clf))

@jnothman

jnothman Aug 2, 2019

Member

We might need a specialised path here for TeX to say this only displays in HTML. Or put that in the directive.

exec(code)
sys.stdout, sys.stderr = orig_stdout, orig_stderr

return "".join(['<div style="font-size: 1.2em">',

@jnothman

jnothman Aug 2, 2019

Member

should this use a class instead?



class ExecuteHTML(Directive):

@jnothman

jnothman Aug 2, 2019

Member

Docstring along the lines of "Executes Python code and includes stdout as HTML"


clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])
print(display_estimator(clf))

@jnothman

jnothman Aug 2, 2019

Member

It's a bit awkward that we get a rendering from something that says 'print'... Not sure how to fix that, except by leaving this as display_estimator(clf) and appending print(_) to the end or equivalent...

elif estimator is None:
    return _EstHTMLInfo('single', estimator, 'None', 'None')

elif isinstance(estimator, Pipeline):

@jnothman

jnothman Aug 2, 2019

Member

Can we move this either into methods, or singledispatch, to avoid so much specialisation in the generic implementation?
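
As an illustration of the singledispatch option, the if/elif chain could become a generic function with one registered implementation per estimator type. Everything named below is hypothetical: `html_layout` and the `Fake*` classes stand in for the PR's helper and for Pipeline and FeatureUnion, so that the sketch runs without scikit-learn. `functools.singledispatch` is the standard-library decorator.

```python
from functools import singledispatch


# Hypothetical stand-ins for the real metaestimator classes.
class FakePipeline:
    pass


class FakeFeatureUnion:
    pass


@singledispatch
def html_layout(estimator):
    """Fallback for plain estimators (and None): draw a single box."""
    return "single"


@html_layout.register(FakePipeline)
def _(estimator):
    return "serial"


@html_layout.register(FakeFeatureUnion)
def _(estimator):
    return "parallel"


class CustomPipeline(FakePipeline):
    """Subclasses pick up the implementation registered for their parent."""


layouts = [html_layout(obj) for obj in
           (FakePipeline(), FakeFeatureUnion(), CustomPipeline(), None)]
print(layouts)
```

With this shape, the generic implementation stays free of type-specific branches, and support for a new metaestimator is added by registering one more function.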

thomasjpfan added 3 commits Aug 5, 2019
@jnothman

Member

jnothman commented Aug 6, 2019

I liked it better when the full repr appeared directly over the node rather than offset

@jnothman

Member

jnothman commented Aug 6, 2019

@thomasjpfan

Member Author

thomasjpfan commented Aug 7, 2019

> ahh did you already change that back?

Yup I still need to resolve some of your other comments. 😅

@jnothman

Member

jnothman commented Aug 25, 2019

What's the proposed path to merge here?

@thomasjpfan

Member Author

thomasjpfan commented Aug 25, 2019

@amueller was in contact with someone from Jupyter who may want to handle this.

On the technical side, I need to come up with a better way than to print display_estimator in the docs.

@thomasjpfan

Member Author

thomasjpfan commented Oct 10, 2019

I am picking this up again to see if I can refine it a little more. I do not like how we need to scroll to see the rest of the popup:

[Screenshot: Screen Shot 2019-10-10 at 4 02 34 PM]
