
Add support for batching NdOverlay plots #717

Merged 10 commits from batched_plots into master on Jun 16, 2016
Conversation

@philippjfr (Member) commented Jun 9, 2016

Following the proposal in #716 I've prototyped an initial implementation of batched plots for NdOverlays. The idea behind this is that creating a lot of individual plot instances for very similar plot elements is expensive and results in the creation of a huge number of artists and glyphs. Hooking into HoloViews to create plot classes that plot a batch of Elements at once is pretty straightforward and can provide considerable speed benefits. To test this I've prototyped a BatchedCurvePlot class, which can plot large NdOverlays of Curves very quickly in the bokeh backend.

I've tested this with this simple example:

hv.HoloMap({j: hv.NdOverlay({i: hv.Curve(np.random.rand(10,2)) for i in range(50)}) for j in range(10)})

In a quick test, the batched plot renders this in about 0.25 seconds, versus roughly 1.25 seconds for the regular implementation. More importantly, however, the generated bokeh output no longer freezes the browser temporarily, because it renders all the curves at once rather than as separate glyphs. Visually the two plots are identical, except that in bokeh the batched plot comes out a lot crisper for me for some reason (I think this has been fixed in bokeh master).
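
Roughly how such a timing comparison can be run (just a sketch: the renderer lookup and get_plot call are approximations of the public API and may differ between HoloViews versions, and I'm timing a single NdOverlay rather than the full HoloMap):

import time

import numpy as np
import holoviews as hv
import holoviews.plotting.bokeh  # assumption: importing this module registers the bokeh backend

overlay = hv.NdOverlay({i: hv.Curve(np.random.rand(10, 2)) for i in range(50)})
renderer = hv.Store.renderers['bokeh']

start = time.time()
renderer.get_plot(overlay)  # plot construction is where batching should pay off
print('Plot creation took %.2fs' % (time.time() - start))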

There are still a few fixes to make, and we need to decide on a general place for specialized plot registries, such as for the batched plots here and for adjoined plots in other cases.

@jlstevens (Contributor)

We've wanted this feature for ages!

Here are my overall comments regarding the structure of the code. As I don't know the details, you can explain why this won't work:

  1. Batched support seems like something we want everywhere if possible. Why not have the _batched class attribute on all element plots? Then the constructor might be able to do a bit of conditional logic to use batching if possible.
  2. Similarly, the one method on BatchedElementPlot is get_data. If there are reasons the get_data method can't always be batched, maybe it could simply be called batched_get_data and used when appropriate. I am imagining that an ElementPlot could have _batched=True in which case it needs a working batched_get_data method.
  3. This approach using a batched_get_data method would avoid overriding the existing get_data method: if there is a problem, we can simply set _batched=False to go back to the old (unbatched) behaviour.
  4. If it needs to be called get_data because of how other methods access it, you could probably do a dynamic method switch in the constructor, aliasing get_data with batched_get_data if appropriate.
  5. This would also help simplify plot registration.

Now you'll tell me all the reasons it can't be done like this. :-)

If it could be done this way, I would like it a lot. It would keep the class hierarchy as it is, and support batching just by implementing one method and setting one flag. Even if we needed two methods and one flag, this would be a better way of supporting batching (imho) than introducing new classes. If batching needs to be enabled/disabled, maybe the attribute _batched could be public, i.e. batched, so anyone is allowed to enable/disable it.
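
Roughly what I am imagining for point 4, with made-up names (just a sketch, not working HoloViews code):

class ElementPlot(object):
    # Hypothetical flag: a subclass sets this to True when it provides a
    # working batched_get_data implementation
    _batched = False

    def __init__(self):
        if self._batched:
            # Point 4: dynamically alias get_data to the batched version
            self.get_data = self.batched_get_data

    def get_data(self, element):
        # Unbatched default: process a single element
        return [element]


class BatchedCurvePlot(ElementPlot):
    _batched = True

    def batched_get_data(self, overlay):
        # Batched: process all elements of an NdOverlay at once
        return list(overlay)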

Anyway, it is good to see this prototyped and the speed-ups are nice. We definitely do want this feature!

@philippjfr (Member, Author) commented Jun 9, 2016

Batched support seems like something we want everywhere if possible. Why not have the _batched class attribute on all element plots? Then the constructor might be able to do a bit of conditional logic to use batching if possible.

I've already partially done this because I wasn't going to reimplement the entire ElementPlot, but you definitely have a point: all Element plots could optionally support batched plotting directly, without a separate plotting class. If I can implement this nicely, the OverlayPlot that would otherwise have created a bunch of individual subplots can handle dispatching the objects, which would avoid the additional handling I had to put on the Renderer.

Similarly, the one method on BatchedElementPlot is get_data. If there are reasons the get_data method can't always be batched, maybe it could simply be called batched_get_data and used when appropriate. I am imagining that an ElementPlot could have _batched=True in which case it needs a working batched_get_data method.

Right, I like this proposal and will prototype it.

This approach using a batched_get_data method would avoid overriding the existing get_data method: if there is a problem, we can simply set _batched=False to go back to the old (unbatched) behaviour.

If we go with this proposal we can simply make batched a parameter (and therefore a plot option) on the plots that support it, so you can control batching on a per-plot basis. While batched plots should have the same behavior as the non-batched case, there may be certain restrictions on batched plots, so I think this is a good solution. As an example of such a restriction, the bokeh batched Curve plot implementation uses a multi_line glyph, which currently does not support attaching picking/hover tools, so you'd have to go back to the non-batched version if you wanted to attach a picking callback.

If it needs to be called get_data because of how other methods access it, you could probably do a dynamic method switch in the constructor, aliasing get_data with batched_get_data if appropriate.

There are only two places where the method is used, so I can just call the correct one depending on the batched parameter.
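
Roughly something like this at those call sites (illustrative only; the names and arguments are assumptions, not the actual implementation):

def get_plot_data(plot, element, ranges):
    # Illustrative helper: call the batched or unbatched data method depending
    # on the plot's batched option
    if plot.batched:
        return plot.get_batched_data(element)
    return plot.get_data(element, ranges)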

This would also help simplify plot registration

Yes, with this proposal that problem goes away. We should still come up with a better solution for adjoined plots, though (in a different issue).

If batching needs to be enabled/disabled, maybe the attribute _batched could be public, i.e. batched, so anyone is allowed to enable/disable it.

Ah yes, good that we independently decided this might be a good idea.

Anyway, it is good to see this prototyped and the speed-ups are nice. We definitely do want this feature!

Yes, this has huge scope for speeding up plotting, and the implementation is shaping up to be a lot simpler than I had originally imagined. Particularly for bokeh, where the performance penalty is both in creating the plot and in rendering it in the browser, this should have huge benefits, but even for matplotlib it will make certain plot types feasible. Specifically, choropleth plots of US counties using Polygons were basically not possible before but should actually become usable now.

For an initial implementation I'd like to have batched plots for Points, Curves and Polygons in bokeh and matplotlib.

@philippjfr (Member, Author)

Okay, I've now implemented batching for Curves and Points in bokeh. Implementing batching in matplotlib will be a bit more difficult because it will also need new methods to update the artists. With a bit more cleanup the bokeh implementation will at least be fairly clean and usable. You can now control batching via a plot option on the OverlayPlot. After working on the implementation a bit more, there are a few more restrictions in batched mode; in some cases it will be impossible to provide legends. I don't think this is a major issue, since batching only becomes necessary when there are so many items that a legend doesn't make sense anyway. I'd argue the batched parameter should be False by default though.

For now I think merging this PR with a few implementations for batched bokeh plots will be sufficient, since it addresses the biggest issue of seriously slowing down the browser when plotting a lot of overlaid curves or points. I'll open a separate issue to implement the equivalent for the matplotlib backend.
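
For example, using the plot-option syntax shown in the comparisons below, batching can be toggled per object (illustrative; assumes an NdOverlay of a supported element type):

curves = hv.NdOverlay({i: hv.Curve(np.random.rand(10, 2)) for i in range(50)})
curves(plot=dict(batched=True))   # opt in to batched rendering
curves(plot=dict(batched=False))  # one glyph per Curve, as before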

@philippjfr (Member, Author) commented Jun 10, 2016

Here is a comparison between the non-batched (left) and batched (right) versions of the three plot types I've implemented now:

Polygons

polygons = hv.NdOverlay({i: hv.Polygons([np.random.rand(10,2)]) for i in range(10)})
polygons(plot=dict(batched=False)) + polygons

[screenshot: non-batched vs. batched Polygons]

Points

points = hv.NdOverlay({i: hv.Scatter(np.random.rand(10,2)) for i in range(50)})
points(plot=dict(batched=False)) + points

[screenshot: non-batched vs. batched Points]

Curve

curves = hv.NdOverlay({i: hv.Curve(np.random.rand(10,2)) for i in range(50)})
curves(plot=dict(batched=False)) + curves

[screenshot: non-batched vs. batched Curves]

The missing legend is the major difference; however, during batched plotting controlling the zorder of the individual plot Elements is often impossible too, because they are all rendered as one glyph, which has only one zorder. There is not much we can do about that, and as long as you're forced to explicitly enable batching I don't think it's an issue.

This notebook has some additional profiling information.

@jlstevens (Contributor) commented Jun 10, 2016

Great! I'm glad my proposal is reasonable.

My only potential objection is making batched a parameter. I always want to avoid swamping the user with plot options they don't really need to use. This is why I suggested making it a class attribute.

Here is what I propose: make batched a class attribute that defaults to True. Then, once we know the limitations of batched mode and have identified cases where users will want to use batched=False, we can make it a parameter. Our goal should be to always have batched=True if possible, so hopefully it won't ever need to be a parameter...

Edit 1: I just saw what you said about z-order. I understand that the entire batch has a single z-order but why does that mean the legend can't still be there on top of the whole batch?

Edit 2: Given that batched would only be an option for Overlay and NdOverlay, I suppose having it as a parameter is fine.

@philippjfr (Member, Author)

I just saw what you said about z-order. I understand that the entire batch has a single z-order but why does that mean the legend can't still be there on top of the whole batch?

Because usually you'd have one legend entry per artist, but in these batched plots I only create a single artist, which won't allow separate legend entries.

Given that batched would only be an option for Overlay and NdOverlay, I suppose having it as a parameter is fine.

That's correct, although it would be nicer if it weren't a plot option for Overlays, since they don't actually support batching; only NdOverlays do, because they are guaranteed to contain Elements of a consistent type.

@jlstevens (Contributor)

Because usually you'd have one legend entry per artist, but in these batched plots I only create a single artist, which won't allow separate legend entries.

That makes sense. Though maybe it would be possible to fake it? I.e. create a single invisible point somewhere in the plot for each entry you want in the legend? Not saying we should, but I think you can fake the old behavior in a simple and inexpensive way.

@philippjfr (Member, Author)

Not saying we should, but I think you can fake the old behavior in a simple and inexpensive way.

In matplotlib maybe; in bokeh, legends are currently not exposed at a low enough level to really make this happen, and adding fake glyphs would defeat the entire point of batching.

@jlstevens (Contributor)

.. adding fake glyphs would defeat the entire point of batching.

I'm not sure that is true. The point of batching in my mind is optimization; the fake glyphs you would need to add would be as small as possible and therefore inexpensive. Is it really not possible to add hidden fake glyphs (i.e. set visible to False) and get the legend out of that?

Note that I'm not saying this is necessarily a good idea, but we should try to consider some possibilities anyway.

@philippjfr (Member, Author) commented Jun 10, 2016

The point of batching in my mind is optimization; the fake glyphs you would need to add would be as small as possible and therefore inexpensive

The expensive part is precisely in creating lots of individual glyphs; how much data each one contains is secondary. The entire speedup we get from batch plotting comes from the fact that you're no longer creating a bunch of individual glyphs, since the overall amount of data plotted is the same in both cases.

I'd also still argue that once you have so many artists that batch plotting is needed, a legend will be pretty much useless to you anyway. Based on the current support for legends in bokeh, I don't think we can fake legends in any sensible way without losing most of the performance improvement that we're hoping to gain. We can revisit this at a later point though, when there is more general support for legends in bokeh, at which point we should be able to create a custom legend more easily.

Therefore I'd say for now batched is an option you specifically have to enable. We can revisit that decision once we get around to fully implementing the equivalent for matplotlib.

@jlstevens (Contributor) commented Jun 10, 2016

I'd also still argue that once you have so many artists that batch plotting is needed, a legend will be pretty much useless to you anyway.

To me, that is the answer. I can't imagine a legend being useful with 100 entries or even 50 entries. What I am talking about is batch mode being on by default, and used for cases where there are, say, only 10 entries in the legend, which is reasonable.

I would argue for batch mode being always on (so people automatically get the benefits), with support for legends when it makes sense (i.e. up to some limit on the number of entries in the legend). For instance, maybe show_legend could have an integer argument so that 0 means 'disabled', with a default of (say) 15. That would mean 'show a legend if there are 15 entries or fewer'. That way, we could use batched mode everywhere, and legends would automatically switch off if there are lots of entries (i.e. useless anyway) and batched mode makes sense.

To me that seems like a good tradeoff. Fast when there are lots of entries (legends disable themselves by default), fast when legends are off and only slow when there are lots of entries and the user explicitly asks for a huge legend (which is unlikely to be useful or desirable anyway).

Edit: Instead of overloading show_legend, how about a legend_limit to set the maximum number of legend entries? We could set it low, so that batched mode makes sense and the cost of the fake glyphs stays low...
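
As a rough sketch of what I mean (the names and the default of 15 are just the values floated above, not an implemented API):

def legend_enabled(num_entries, legend_limit=15, show_legend=True):
    # Show a legend only while it stays readable; beyond the limit it switches
    # itself off and batched mode can kick in
    return show_legend and num_entries <= legend_limit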

@philippjfr (Member, Author)

To me that seems like a good tradeoff. Fast when there are lots of entries (legends disable themselves by default), fast when legends are off and only slow when there are lots of entries and the user explicitly asks for a huge legend (which is unlikely to be useful or desirable anyway).

I'd agree that that's reasonable behavior. That said, I have not yet implemented a way to update legends in bokeh anyway; I think it should be possible, but I can't be absolutely sure. So if we agree to do this, it will be in a separate PR. Then there's the zorder issue to work out; we should file an issue with bokeh to see if we can get the patches glyph to render the polygons in the order they're supplied.

I'd suggest I do a final cleanup of this PR, setting batched to False for now, and then follow it up with another PR to work on your suggestion for legends, at which point the behavior should match and we can enable it by default.

@jlstevens (Contributor)

Ok great! At least we agree on what the behaviour should be if we can get it to work that way.

I'm happy to merge this PR once you've done the cleanup. I think the legend_limit parameter is a good idea regardless of batching, so that too could be a separate PR.

@jbednar (Member) commented Jun 10, 2016

You guys came to the conclusions that I was going to propose anyway. :-) Having the results differ due to lack of z-order control is indeed alarming, and it seems like Bokeh should be respecting the order. Alternatively, if the order is well defined in the batched case, one could consider reordering the non-batched case to match, but it sounds like the behavior in the non-batched case is correct already (respecting the original order).

Isn't the fact that the legend is appearing below the data a bug, unrelated to this PR? It's happening in the non-batched case, and seems like it should never be happening.

You could imagine having batched be a Boolean Dynamic parameter that has access to the number of items (so that the rest of the code doesn't have to worry about the limit parameter), but I'm not sure how the number of items would be provided to such an object. We do need to make Booleans and Strings dynamic, in any case.

@philippjfr (Member, Author) commented Jun 10, 2016

it seems like Bokeh should be respecting the order

Yes, I'll try to work out if there is any well defined order and then file an issue.

Isn't the fact that the legend is appearing below the data a bug, unrelated to this PR? It's happening in the non-batched case, and seems like it should never be happening.

Yes, that is also a bug. The legend is set to the 'annotation' level which I thought would be on top, but I can try setting it to 'overlay' to fix that. I'll do that in the separate legends PR.

You could imagine having batched be a Boolean Dynamic parameter that has access to the number of items (so that the rest of the code doesn't have to worry about the limit parameter), but I'm not sure how the number of items would be provided to such an object.

That might be nice, but since the OverlayPlot decides whether to batch, it happens in only one place anyway. However, even without that, instead of going through the work of faking a legend, which I believe would destroy most of the performance gains, we can simply offer control over the number of items a plot should render normally before batching is enabled and no legend is generated. That means we don't have to complicate the implementation further but still offer the same level of control. Batching is only enabled when you have too many items for a legend to be sensible, which is exactly when you need it. Plotting 30 glyphs is not really a problem; plotting more than 100 is, and that's well beyond the point where a legend is sensible.

@jbednar (Member) commented Jun 10, 2016

Sounds good, but there's a grey area between when legends start being not very usable and when they become completely ridiculous. I'd think legends are only typically useful up to about 12 or so (actually only 9 fit with the current spacing for Points legends, but 12 could presumably fit, especially if it's moved out of the plot), but they are still sensible (i.e., conceivable in some cases) up to 30 or so. I'd put the limit fairly conservatively, i.e. only show a legend when it is definitely useful, but then allow people to up the limit if they verify it makes sense in their case.

@philippjfr (Member, Author) commented Jun 10, 2016

The main issue is that control over legends is still limited in bokeh: a legend can't be placed outside the plot, for instance, and I'm dubious about being able to create hidden artists which only show up in the legend, since they will also be hidden automatically. I think that, along with things like colorbars, this is something the bokeh team will look at after the layout work is done. So in the meantime long legends are of limited usefulness in bokeh anyway, and I'd like to avoid implementing some custom hack to fake them with hidden artists. We can revisit that decision at any point by making batching independent of the legend_limit; for now, batched=True will only apply if two conditions are met:

  1. The number of items exceeds the legend_limit or show_legend is False.
  2. The plot type supports batched plotting.

We can probably drop the first requirement for an eventual matplotlib batched implementation and potentially also for bokeh if they expose more control over the legends.
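
Roughly, the decision would then look something like this (a sketch only; the actual attribute and option names in the code may differ):

def use_batching(batched, supports_batching, num_items, legend_limit, show_legend):
    # Condition 1: the legend is effectively off (too many items, or disabled)
    legend_off = num_items > legend_limit or not show_legend
    # Condition 2: the element's plot type provides a batched implementation
    return batched and legend_off and supports_batching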

@philippjfr (Member, Author)

Oddly enough in another example the zorder seems fine:

[screenshot: another example where the zorder appears correct]

Probably something to do with the randomly generated shapes.

@jbednar (Member) commented Jun 10, 2016

That sounds good. I wonder if Bokeh is sorting by x, then y, or something else bizarre like that.

@philippjfr (Member, Author)

After zooming in it's broken again; something iffy on the BokehJS side.

@philippjfr (Member, Author) commented Jun 13, 2016

I'm trying to figure out the need for the _batched attribute. Is it simply indicating whether get_batched_data is available for use? I only see the value assigned at the class level where it is declared...

The batched parameter is on the OverlayPlot class, which is user-controllable. The _batched attribute is declared on the ElementPlot classes that support batched plotting, i.e. support a get_batched_data method. I suppose it could use a hasattr to look for that method but I thought I'd be explicit.

The blurriness I think you already mentioned, but the green and blue lines extend more to the right in the ungrouped than the grouped, at the very edge of the plot (though presumably also elsewhere and just not visible). Is that a bug, e.g. by connecting up lines that shouldn't be connected, or vice versa? Or just a difference in how the same data is plotted under different smoothing algorithms?

That's also a question we'll have to ask the bokeh team. I've read the patches code and it simply separates paths with NaNs, which should be pretty solid. So it could be WebGL, multi_line in general, or something to do with smoothing as you suggest.

I was hoping batched could result in the same output but faster. Looking at that output, it does seem like we might have to make a quality/speed tradeoff.

Note that it seemingly looks better in the batched case. I think there are fixes to HiDPI WebGL rendering in bokeh 0.12 though, so these differences might not persist. I'll be checking out the latest 0.12 dev release tomorrow to check how we're doing on compatibility in general.

@jlstevens (Contributor) commented Jun 13, 2016

Note that it seemingly looks better in the batched case.

Ok great! I couldn't tell which way round Jim's screenshots were. I assumed the worst and that the batched version was lower quality. :-)

@jbednar (Member) commented Jun 13, 2016

Right; batched looks better, i.e. sharper and clearer, though I don't know which version actually matches the true extent of the line segment (as they can't both). The originals for my screenshot are towards the middle of this thread above; it's just a screenshot of a zoomed-in screenshot.

@philippjfr (Member, Author)

Okay, I'll get the tests in order now; there was one test with a plot with more than 20 legend items, so I've upped the limit slightly to 25. I've also added a comment to the _batched attribute definition. I'm not sure I want to fold the plot methods in together, particularly because _plot_method = ('scatter', 'scatter') seems a bit weird.

Not sure what to do about the remaining differences though. I can briefly try to determine which line drawing call is more correct and then check whether there are any changes/improvements in the latest dev version, but we may just have to come to terms with the differences for now.

@jlstevens (Contributor)

i.e. support a get_batched_data method. I suppose it could use a hasattr to look for that method but I thought I'd be explicit.

I would consider simply trying to call get_batched_data in a try/except block and catching NotImplementedError. This is cheap to do if it isn't implemented and avoids the need for the _batched attribute.

Generally, I think we should always assume a batched implementation is better and should always try to use get_batched_data if possible.

@jlstevens (Contributor)

As for _plot_method, maybe it could be a dictionary with keys 'single' and 'batched'?

Then you only need to specify the 'single' keyword in most cases:

_plot_methods = dict(single='scatter')

Or both if they differ:

_plot_methods = dict(single='line', batched='multi_line')

Then for getting the batched plot method, it can fall back to the 'single' one if not specified:

batch_plot_method = self._plot_methods.get('batched', self._plot_methods['single'])

Seems less redundant in many cases and avoids adding an attribute which I expect will be rarely needed and would mirror _plot_method in most cases.

@philippjfr (Member, Author) commented Jun 14, 2016

I would consider simply trying to call get_batched_data in a try/except block and catching NotImplementedError. This is cheap to do if it isn't implemented and avoids the need for the _batched attribute.

That doesn't really help since it's the OverlayPlot that has to check whether the plot supports batching or not and either create multiple subplots or a single one.

Generally, I think we should always assume a batched implementation is better and should always try to use get_batched_data if possible.

I don't think this is generally true. Yes, in most cases it's what you want when you have a lot of plots, but there are certain restrictions, such as the inability to use various hover and selection tools, which means we should be very clear about what batched mode does and does not support.

As for _plot_method, maybe it could be a dictionary with keys 'single' and 'batched'?

That sounds like a good idea.

Then for getting the batched plot method, it can fall back to the 'single' one if not specified:

Fallbacks like that won't work. As I said above, by this point the OverlayPlot will already have decided whether to assign the plot an NdOverlay of Elements or a single Element, and the plot will choose the appropriate method for the object it has been passed. It should never end up in a situation where a plot that doesn't support batching has been given an NdOverlay.

@jlstevens (Contributor) commented Jun 14, 2016

That doesn't really help since it's the OverlayPlot that has to check whether the plot supports batching or not and either create multiple subplots or a single one.

Ok, that makes sense. You could use hasattr, but only if the superclass doesn't implement the method, which is not something I would want to rely on.

I don't think this is generally true. Yes, in most cases it's what you want when you have a lot of plots, but there are certain restrictions, such as the inability to use various hover and selection tools, which means we should be very clear about what batched mode does and does not support.

I was not implying that we would ignore the batched parameter, which we agreed on earlier. I also agree we need to be careful to note what impact batched plots might have on the user.

How about this...instead of _batched attributes everywhere, the OverlayPlots could have an attribute to specify which plot classes should be batched or not (if the batched parameter is True). This would allow the user to enable/disable batching for some plot types and not others all in one place.

A list of plotting classes that support batching would do, assuming all the necessary plotting classes will be defined (which must be true if OverlayPlot is creating instances of these classes). This attribute could be called _batched_plots and it could even be a parameter if you like. That said, it would be a list a user would only ever want to remove elements from to disable batched plots...

One of the reasons I feel this is better (other than avoiding _batched everywhere) is that it is now clearer to me that the use of get_batched_data is something decided by the OverlayPlot, which is where the batched parameter lives.

@philippjfr (Member, Author)

How about this...instead of _batched attributes everywhere, the OverlayPlots could have an attribute to specify which plot classes should be batched or not (if the batched parameter is True). This would allow the user to enable/disable batching for some plot types and not others all in one place.

Not a huge fan of having another registry for a specific type of plot. We have a lot of them, and I'm not convinced that's actually any cleaner than declaring it on the plot itself. That said, I've decided to unify the matplotlib and bokeh backends a bit more by creating a general _plot_methods attribute. While it's not always used to do the actual plotting, it's a good way of declaring which plotting call actually gets used, and it lets the OverlayPlot check whether a particular ElementPlot supports a batched plotting call without needing another attribute.
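
Roughly, the check then reduces to inspecting the _plot_methods dictionary on the registered plot class, along these lines (a sketch; the actual lookup in the code may differ):

import holoviews as hv

def supports_batching(element, backend='bokeh'):
    # Illustrative: look up the plot class registered for this element type and
    # check whether it declares a batched plotting method in _plot_methods
    plot_class = hv.Store.registry[backend][type(element)]
    return 'batched' in getattr(plot_class, '_plot_methods', {})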

@philippjfr force-pushed the batched_plots branch 3 times, most recently from 37ac89d to 2582fa8 on June 16, 2016
@jlstevens (Contributor)

Ok, I'm very happy with this use of _plot_methods and I like how it gets rid of _batched as well.

I'll have a quick look again to see if I have any other comments, but right now I am happy to merge (once the tests pass).



def get_zorder(self, overlay, key, el):
spec = util.get_overlay_spec(overlay, key, el)

@jlstevens (Contributor) commented on the diff:

A docstring here would be helpful.

@philippjfr (Member, Author) replied:

Agreed, doing that now.

@jlstevens (Contributor) commented Jun 16, 2016

I've made one minor comment above and I am happy to merge once tests pass.

In general, I think _plot_methods is a great idea, as it declares the main link between the plotting class and the plotting library. It could be used in future for documentation, e.g.:

  • Automatically getting style options or maybe validating the ones we state.
  • Offering a chance to link to the relevant documentation of the plotting library. This could be helpful for developers...

I remember wanting something like this ages ago when I investigated getting style options using matplotlib's eccentric documentation system.

Edit: Linking to the underlying documentation of the plotting system would be useful for users as well, e.g. to look up what style options really mean...

@philippjfr (Member, Author)

The PR build passed; not sure why the push build didn't restart, but the branch is up to date anyway. Ready to merge.

@jlstevens (Contributor)

Great!

@jlstevens merged commit 3f03974 into master on Jun 16, 2016
@jlstevens deleted the batched_plots branch on July 12, 2016