-
Notifications
You must be signed in to change notification settings - Fork 573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Refactor layers like ggplot2 [No. 2] #266
Conversation
status
Immediate direction
Desired objective
|
@@ -76,8 +86,16 @@ def plot_layer(self, data, ax): | |||
if 'linestyle' in data: | |||
data = data.rename(columns={'linestyle': 'linetype'}) | |||
|
|||
# should happen in the layer | |||
data = data[list(set(data.columns) & set(self.valid_aes))] | |||
# In ggplot2, aes() does map columns in data to parameters, not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment. I always thought that aes are "aestetic mappings", and parameter are either setting some mapping to a single value or are used for some other stuff (the geom_function stuff).
If In understand the code correctly, I would comment it as "only keep columns which are used in this layer"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a huge surprise to me too. In some places ggplot2 seems to blur the lines between aesthetics and parameters.
This is from geom_abline
# Slopes and intercepts as data
p <- ggplot(mtcars, aes(x = wt, y=mpg), . ~ cyl) + geom_point()
df <- data.frame(a=rnorm(10, 25), b=rnorm(10, 0))
p + geom_abline(aes(intercept=a, slope=b), data=df)
intercept
and slope
are parameters to stat_abline
.
It turns out you can map stuff to any parameters. So aesthetic mappings
should just be thought of as mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this simple mean that intercept and slope are mappings (and should go there) and are just usually used as settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that at first, but it is not the case. It does not seem to be an error in the documentation although there are a bunch. They aren't even listed among the all the aesthetics, in aes.r
They are just like any other parameter. However, for geom_vline
and geom_hline
the xintercept
and yintercept
are parameters but they are listed in aes.r
as part of all aesthetics.
So whatever is and is not listed as an aesthetic is immaterial, the code maps both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a R handy right now, but if using it as mapping produces multiple slope lines (or can be used in faceting to produce different lines per facet), then they should e mappings and we should just use them in that way, no matter what the doc in ggplot2 says?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense and that's what I did for the xintercept
and yintercept
, for the others I'll be checking some of the geoms
and stats
that we don't have yet to get a better perspective. Maybe even raise it up on their issues. Stuff is supposed to fall into neat little boxes.
Wow, this is one hell of a commit! I've given that a light code review, but can't really comment on the actual functionality. I do hope that this is ensured by the unittests (and new one :-)) In my opinion it would make this thingy much easier if the layer refactoring would be in place. There are several place where you comemtn on this and I also think the "each layer has it's own data" makes for less headache in not duplicating/overwrite column names. Do you want to do that? I probably won't get to code on ggplot until after Easter and feel that you are actually more familiar with the code base right now than I am. |
Sorry, you shouldn't have gone through it yet, I should have given you a heads up. All that ugliness is just part of a gradual the transition. |
No problem. It would be nice if such large PR could be split into muliple commits. That would make reviewing it easier :-) Please only add commits, not amend/change old ones: that way the next round can only look at the new ones. If you want to cleanup, just do it at the last step via |
I think only the first commit had to be huge -- only way to get stuff working all across. About the layer refactoring that should come at the end, given how mixed up things have been it would have been as messy. When all is done, I should certainly move some of the code into a Also I don't quite get the dataframe object, it seems unpredictable yet we will have to minimize copying and also ensure some stuff is never modified. Edit: |
Okay, I'll do so when it is merged. |
** Noteworthy activity ** - stats are now in a separate directory - each call to `geom_abc(stat='bin')` uses the specified `stat` this creates an object `geom._stat`. - `stat.__radd__` creates and adds a `geom` to the ggplot object - A proper implementation of `stat_bin2d` only constrained by the lack of `fill` mapping - A `geom_histogram` is an alias of `geom_bar` and the bin computations are done in `stat_bin` - `geom_vline` and `geom_hline` can plot multiple lines ** issues ** - `stats` are not computed on proper groups, so the plot may not be correct for some factorings. - Related to the non existent `stat` grouping, any coloring is broken. - Wrong test cases have been exposed but since the current results are also not purely correct the expected results have not been adjusted.
It avoids a mix up with the real groups created by the aesthetic mappings and the group parameter.
- Introduce `geom._extra_requires` to fill the gap between the required aesthetic and calculated stuff from the statistics. - Remove the 'label' parameter where it was used, that is the job of gg_title. i.e ggplot level and not geom or layer level.
ggplot2 line geoms seem to allow mapping to parameters! A clearer view for these "mappable parameters" is to treat them aesthetics in the geoms (users know they can map to them) and as parameters in the stats (users discover they can assign functions which compute on a specific column and yield the required result).
- hijacked ggplot/utils/utils.py, it is a more obvious place for many small functions that are not specific to the ggplot object
For some smoothers, the actual smoothing points are less than the input data points. eg. lowess. The smoothers need to return this information. For consistency all smoothing functions are made to return the x values.
- had to fix utils because of buggy behaviour in matplotlib
- Although this is cleaner than the previous implementation, it turns out the real performance bottleneck is the plotting function for geom_rect -- ax.bar(). It is not meant to draw many small rectangles. The solution will be to use ax.fill_between for geom_rect
In geom initialization, any argument that is both an aesthetic to to the geom and a parameter setting to the stat is taken by the stat if it is not an array-like. This is useful where the weight is an aestetic for a geom, yet it is used during stat computation. For this case, the problem was the API value for the weight is 1 (a scalar), yet it is also mappable. Although scalar weights are essentially meaningless they do manipulate the y-scaling of histograms. geoms abline, vline and hline can take functions to compute the slope and intercepts. The functions are used by the respective stats.
- After computing the stats, sort dataframe by x, geoms expect it that way, and it is what ggplot2 does - geom_bar, only set the xtick and xlabels for categorical data, prevents numbers with many decimals places - stat_bin, fix for correctly determining the bins with zero counts - stat_bin2d, add 'right' parameter - stat_function, get rid of mapped 'y' value - util.pop(), to pop dataframes and specify a default value - six.string_types, fixes Python 3 not understanding unicode
- categorical get spacing, num-likes don't
- Test that the right parameters are get to the stats for cases where those parameters are also geom aesthetics, - Test all exceptions
Update
|
@has2k1 ok, tests are passing, so at least no failing code in already tested parts. I will merge. Can you add issues about the two problems you identified in your last comment? |
Refactor layers like ggplot2
@has2k1 Thanks again for this refactorings and your patients with my comments! |
I'll add the issues and hopefully fixes don't come-in in the form of hacks, especially for |
This is a continuation of #252.
There is still some relevant discussion that happened at #221 that will guide the direction.
This PR is a work in progress. Some comments will include highlights of the status, the immediate direction and maybe a desired objective outcome.