Refactoring: rebuild geom like the layers in ggplot2 #221
Comments
Looks like this is meant to resolve the TODOs in Is this correct per your vision; a |
No, the geoms add a layer and |
I have had more time to think about this, skimmed the ggplot2 code and now have a clearer sense of how it would all come together. I will focus on the First the picture. The The key methods are something like. For the
For the
Other than in the declarations and in Based on this picture and if you have a similar view, then from the stuff I have done in #246 I can rework the Also within the As the data splitting now moves to the |
IMO this looks good! I've just one question: Why would you need a If this is for the different lines (like currently in case we have different colors for different series of a line plot -> the reason why Otherwise that's almost exactly as I understand the ggplot2 code :-) Thanks for your work! |
Yes it is the case of different series of a line, but my mistake, I had ruled out the I think I'm on the same track as to where the splitting would happen. Here is how
And
Currently, the problem with |
My idea was
|
Great, I am on the same page. Anything missing from this?
About the legends and limits. I may have seen that hiccup in Here is what I noted, uncertainty on my part and not a solution.
|
As I understood the Legends: currently the legends get set to everything which is in one (or more) of the I think no one minds where they live. They need to know about stuff in ggplot (colors and so on), but actually they are a result of the geoms (which know what they have to plot...) |
You are right about the There is also an issue of the translations, as you may need to modify how they happen. In #246 I did not follow through in converting potential conflict translations e.g. About the legends, somewhere in the ggplot2 documentation is a good but unimplemented idea of turning the |
Re doc: I think it should be possible to use a decorator to generate/append the documentation about the aes, defaults, stats and so on. Re translations: I don't know what you mean. :-) Currently that's well defined: the first color is the color aesthetic, the second color a variable name in the data frame. As such I would think that transform could be done by simply moving the Re legends: legends are currently in a state of "works but also needs lots of love". ggplot2 has a way to disable/enable legends ("guides") and merge them (color + shape set to the same variable -> we show two legends, ggplot2 only one), ggplot can move them around,... |
About translations, the potential conflict comes from the order in which they are carried out. Consider these two:
{'color': 'edgecolor', 'fill': 'color'}
{'fill': 'color', 'color': 'edgecolor'}
It is the same dictionary, but if renaming is done in the order of the second, the effective rename is {'fill': 'edgecolor'} |
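To make the ordering problem concrete, here is a minimal sketch. The helper names `rename_sequentially` and `rename_simultaneously` are hypothetical and not part of ggplot, and the sketch assumes Python 3.7+, where plain dicts keep insertion order.

```python
def rename_sequentially(params, translations):
    """Apply the renames one at a time, in the dict's iteration order."""
    out = dict(params)
    for old, new in translations.items():
        if old in out:
            # A later rename can pick up a key produced by an earlier one.
            out[new] = out.pop(old)
    return out


def rename_simultaneously(params, translations):
    """Apply all renames in one pass over the original keys."""
    return {translations.get(key, key): value for key, value in params.items()}


params = {'color': 'red', 'fill': 'blue'}
first = {'color': 'edgecolor', 'fill': 'color'}
second = {'fill': 'color', 'color': 'edgecolor'}

# Same mapping, different order: the sequential version disagrees with itself.
in_first_order = rename_sequentially(params, first)
in_second_order = rename_sequentially(params, second)

# The single-pass version does not depend on the order of the mapping.
single_pass = rename_simultaneously(params, second)
```

With the second ordering the value of `fill` ends up under `edgecolor` and the original `color` value is lost, which is the edge case described above. Renaming in a single pass over the original keys is one possible way to avoid depending on the order at all.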
I don't understand: if that's the aes (as in
This is without caching (if no data or aes is in the geom, we could reuse The next step would then be to |
Sorry for the confusion, I am referring to the translation declarations that a
from collections import OrderedDict
TRANSLATIONS = {'color': 'edgecolor', 'fill': 'color'}
# same as above but rearranged to show the edge case
# if renaming happens in this order.
TRANSLATIONS = {'fill': 'color', 'color': 'edgecolor'}
# solution would be an `OrderedDict` but the renaming function
# would have to be aware of this.
TRANSLATIONS = OrderedDict([('color', 'edgecolor'), ('fill', 'color')])
The Otherwise, the On the issue related to aesthetics and transforms, have you thought about computed aesthetics, i.e. aes(y='..density..') I will put up a gist for the structure of the |
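As a rough illustration of what handling computed aesthetics such as `'..density..'` might involve, the sketch below separates plain column mappings from names wrapped in double dots. The function `split_aes` and the regular expression are assumptions made for this example, not existing ggplot code.

```python
import re

# Matches names of the form '..name..' (assumed convention for computed aesthetics).
COMPUTED = re.compile(r'^\.\.(\w+)\.\.$')


def split_aes(aes):
    """Split an aes mapping into plain columns and statistic-computed variables."""
    plain, computed = {}, {}
    for name, value in aes.items():
        match = COMPUTED.match(value) if isinstance(value, str) else None
        if match:
            # The statistic would be asked to provide this variable.
            computed[name] = match.group(1)
        else:
            plain[name] = value
    return plain, computed


plain, computed = split_aes({'x': 'price', 'y': '..density..'})
```

Here `plain` holds the mappings that refer to columns in the data, and `computed` holds the variables that a statistic would have to supply.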
Ok, now I understand :-) I think that's one place which I would not abstract away. I think the "API" functions should take the data as produced from the aes/stats and the last step in that function should be to "translate" that into the mpl code. So splitting data into multiple series to plot different lines and renaming variables would go there, as well as constructing a dict from both the data and the other parameter (passed in via
|
BTW: this is the layer code in ggplot2: https://github.com/hadley/ggplot2/blob/master/R/layer.r and this is where it all comes together: https://github.com/hadley/ggplot2/blob/master/R/plot-build.r#L14 |
Makes sense, I like the idea of helper functions not being automatically applied, less magic stuff. Here is the "roadmap" I will be following to get the https://gist.github.com/has2k1/9637948
There is still the issue of |
geoms
I think I still think that the "translation" part should not become API or visible outside of the final I don't like the idea of a
is actually much more readable? One interesting thing is the handling of
Stats
I don't think
Layer
This is currently missing:
legends and limits
ggplot2 has a step between legends: see above, I think that should be done somewhere in the layer.compute(). Limits is again tricky, similar to "position": it also depends on position (if it is stacked you get other y limits than simply using max(y) :-/ ) Not sure yet where it is best handled... Maybe also keep that as is in this refactoring. It would be really interesting to check what happens if you use 'geom_point' and 'position_dodge' (or whatever...) in ggplot2 together... I will have to try that and see what happens :-) |
Great input. I made some modifications to that reflection and also corrected a few slip-ups. I am following the convention of all class variables as caps, though I did not realise that the About the I had imagined that both
def __radd__(self, gg):
    l = layer()
    # add stuff geom, stat, data, aes to layer
    gg.layers.append(l)
    # return the plot so that further `+` calls can be chained
    return gg
but if
def __radd__(self, gg):
    if self.DEFAULT_GEOM is None:
        raise Exception("no default geom associated with this statistics")
    # positional arguments must come before keyword arguments
    geom = self.DEFAULT_GEOM(self.aes, data=self.data, stat=self)
    return gg + geom
For the positions, I'm only doing them to get proper jittering for the
X = {'x', 'xmin', 'xmax', 'xend', 'xintercept'}
Y = {'y', 'ymin', 'ymax', 'yend', 'yintercept'}
So |
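The idea of a statistic adding itself through its default geom can be sketched as follows. All class names here (`Plot`, `Layer`, `Geom`, `Stat`, `GeomBar`, `StatBin`) are simplified stand-ins invented for this example; the real ggplot classes have different constructors and do much more.

```python
class Plot:
    """Stand-in for the ggplot object: it only collects layers."""

    def __init__(self):
        self.layers = []


class Layer:
    def __init__(self, geom, stat=None):
        self.geom = geom
        self.stat = stat


class Geom:
    def __init__(self, aes=None, data=None, stat=None):
        self.aes, self.data, self.stat = aes, data, stat

    def __radd__(self, gg):
        # `gg + geom` ends up here because Plot defines no __add__.
        gg.layers.append(Layer(geom=self, stat=self.stat))
        return gg


class Stat:
    DEFAULT_GEOM = None

    def __init__(self, aes=None, data=None):
        self.aes, self.data = aes, data

    def __radd__(self, gg):
        if self.DEFAULT_GEOM is None:
            raise TypeError("no default geom associated with this statistic")
        # Build the default geom, hand it this statistic, and add the geom.
        geom = self.DEFAULT_GEOM(self.aes, data=self.data, stat=self)
        return gg + geom


class GeomBar(Geom):
    pass


class StatBin(Stat):
    DEFAULT_GEOM = GeomBar


gg = Plot() + StatBin(aes={'x': 'price'})
```

After the last line, `gg.layers` holds a single layer whose geom is a `GeomBar` and whose stat is the `StatBin` instance, so the plot object only ever has to deal with one kind of addition.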
Ok, I didn't see the Re position: when is ggplot2 actually applying position stuff? Somehow it looks like it is in the geom? If so that should become a property of the geom, which would need to do some proper calling. Right now I can't think of anything else which is missing. I think we should go forward now: can you prepare some more commits? One easy one would be to change the whole thing to let geom.plot_layer() accept DataFrames and change the 'one call per line' into 'one call per geom, do the line thingy groupby in geom.plot_layer' (this should still let all unit tests pass). From there the next change would be the rest of the refactoring. |
The Comparing with ggplot2
The ggplot2 My picture of our current vision
layer.geom  # object with plot_layer(), uses self.params
layer.geom.params  # args to geom_xxx excluding data, aes, manual_aes
layer.stat  # object with compute(), uses self.params
layer.stat.params  # args to stat_xxx excluding data, aes, manual_aes
layer.geom.data  # from geom_xxx or stat_xxx
layer.geom.aes  # from geom_xxx or stat_xxx
layer.geom.manual_aes  # from geom_xxx or stat_xxx
layer.geom.params['position']  # from geom_xxx or stat_xxx
layer.params  # ??
Our Whereas I think mimicking this aspect of ggplot2 (within reason of course) would not On the way forward, that would work. |
I think we should go with this. The only other way I can think of is going with the ggplot2 way, with -> I think further discussion can only happen directly with the code. |
I have got I will adapt some of the changes from #246 and clean up the duplication. The tests should still pass after that. While I have made changes to How should we do this?
|
Just use your current PR #246 (or a branch which is based on that, if you want to play safe). In the end that can be merged. Yikes, isn't there any case where matplotlib accepts a rect + color instead of plotting each rect individually if it has a different color? Seems that I missed something here :-/ This code duplication is really awful, sorry... Seems that we should declare a grouping set, which lists all grouping variables, and then pull the groupby code into |
Yes the code duplication is sickening. For the grouping a single helper function should do. Each That way the A single grouping set as was done using I haven't thought about how the |
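A single grouping helper along these lines could look roughly like the sketch below. It works on a plain dict of lists to stay dependency-free; with a pandas DataFrame the same split would normally be a groupby. The name `split_by_groups` and the choice of grouping variables are assumptions for illustration only.

```python
from collections import defaultdict


def split_by_groups(data, grouping_vars):
    """Split a dict of equal-length lists into one dict of lists per group.

    Only grouping variables that are actually present in `data` are used.
    """
    present = [var for var in grouping_vars if var in data]
    n_rows = len(next(iter(data.values()))) if data else 0
    groups = defaultdict(lambda: {key: [] for key in data})
    for i in range(n_rows):
        group_key = tuple(data[var][i] for var in present)
        for key in data:
            groups[group_key][key].append(data[key][i])
    return dict(groups)


data = {
    'x': [1, 2, 3, 4],
    'y': [10, 20, 30, 40],
    'color': ['red', 'blue', 'red', 'blue'],
}
series = split_by_groups(data, grouping_vars=['color', 'linestyle'])
```

Each value in `series` is the data for one plotting call, so a geom could loop over the groups once instead of repeating the splitting logic in every `plot_layer` implementation.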
This is merged in. |
As the layer refactoring itself is not done, reopen. |
This is quite interesting: http://cpsievert.github.io/2014/06/visualizing-ggplot2-internals-with-shiny-and-d3/ |
Currently a layer is just data (all variables needed for the matplotlib plot functions in a dict of lists; see https://github.com/yhat/ggplot/blob/master/ggplot/ggplot.py#L455 and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html). In ggplot2 it is more (see https://github.com/hadley/ggplot2/blob/master/R/layer.r):
geom subclasses)
ggplot._get_layer() and the _transform() methods)
_transform() method)
The ggplot2 object contains a list of layers, like ggplot.geoms. So one way to refactor would be to rename geom to layer (or simply see the geom as a layer...) and add all the functionality. Adding it to a new layer object has the advantage that the layer can know about the ggplot object and the data in there, the geom must not have a reference to that (you must be able to add a geom to two different ggplot objects)
__radd__() code to add such an object and not the geom itself.
pandas.DataFrame and not as a dict of lists -> This would mean that geom_point.plot_layer() has to deal with different colors by doing a groupby instead of ggplot._get_layers(). As some geoms already construct a dataframe again from the dict of lists, I think this would make these geoms faster :-) [this step is independent of all other stuff here]
_apply_transforms needs to be called if statistics != "identity" and for each variable name like '..xxx..' the statistics needs to be asked to provide the values.
scale_* to the plot?
_get_layer code into the layer/geom and pass in the original (or transformed?) dataframe via a generic layer.plot(data, ax) (or refactor the current geom.plot_layer -> see geom: improvements to the base class #175)
geom.plot_layer() code path -> only print legends for what we use, not everything which is in aes(...)
Todo:
factor() needs to happen at first but it would be nice to not copy the whole dataframe just to apply a different x axis statistics -> just store the diff to the main dataframe in the layer and then combine them during plotting?
stat_bin and get a complete layer? -> the statistics need to have a way to add a layer too? Could be done by defining __radd__() to produce the default geom for the statistics, adding the statistics to this geom and then adding the geom to the ggplot object.