WIP: Refactor layers like ggplot2 [No. 2] #266

has2k1 · 2014-04-08T02:13:53Z

This is a continuation of #252.

There is still some relevant discussion that happened at #221 that will guide the direction.

This PR is a work in progress. Some comments will include highlights of the status, the immediate direction and maybe a desired objective outcome.

has2k1 · 2014-04-08T02:16:19Z

status

The first commit is still rough in the way the geoms and stats are linked up in geom.py and stat.py, especially how the parameters between the two are handled.
stats are still computed on the wrong groups.
Failing test cases where the expected results are wrong and the current output is still not entirely correct.

Immediate direction

Make clear the internal API by which the geoms and stats know that they are compatible. i.e. if geom_abc requires aesthetics {'x', 'y'} and stat_uvw requires {'x'} and creates {'y', 'other_computation'}, this call geom_abc(aes(x='x'), stat='uvw') is valid. Also other geoms may require that 'other_computation' (which is not a recognised aesthetic or parameter name). That is the rationale for the internal API. ggplot2 does some re-parameterization step, it is obscure as it doesn't make it immediately clear what pieces are compatible.
Get the calculating function of the stats to accept a dataframe and not a dictionary like the plotting function of the geoms.
Proper grouping for the stats. The groups will be the aesthetics and parameters that are specified in the mappings. Therefore they will need to be sorted out somewhere in the ggplot or the layer classes.

Desired objective

With the user and internal APIs (dictionaries and sets), we should be able to automatically generate a chart or network-graph of which geoms and stats can be applied together.

jankatins · 2014-04-10T19:34:14Z

ggplot/geoms/geom.py

@@ -76,8 +86,16 @@ def plot_layer(self, data, ax):
        if 'linestyle' in data:
            data = data.rename(columns={'linestyle': 'linetype'})

-        # should happen in the layer
-        data = data[list(set(data.columns) & set(self.valid_aes))]
+        # In ggplot2, aes() does map columns in data to parameters, not


I don't understand this comment. I always thought that aes are "aestetic mappings", and parameter are either setting some mapping to a single value or are used for some other stuff (the geom_function stuff).

If In understand the code correctly, I would comment it as "only keep columns which are used in this layer"

That was a huge surprise to me too. In some places ggplot2 seems to blur the lines between aesthetics and parameters.

This is from geom_abline

# Slopes and intercepts as data p <- ggplot(mtcars, aes(x = wt, y=mpg), . ~ cyl) + geom_point() df <- data.frame(a=rnorm(10, 25), b=rnorm(10, 0)) p + geom_abline(aes(intercept=a, slope=b), data=df)

intercept and slope are parameters to stat_abline.

It turns out you can map stuff to any parameters. So aesthetic mappings should just be thought of as mappings.

Doesn't this simple mean that intercept and slope are mappings (and should go there) and are just usually used as settings?

I thought that at first, but it is not the case. It does not seem to be an error in the documentation although there are a bunch. They aren't even listed among the all the aesthetics, in aes.r

They are just like any other parameter. However, for geom_vline and geom_hline the xintercept and yintercept are parameters but they are listed in aes.r as part of all aesthetics.

So whatever is and is not listed as an aesthetic is immaterial, the code maps both.

I don't have a R handy right now, but if using it as mapping produces multiple slope lines (or can be used in faceting to produce different lines per facet), then they should e mappings and we should just use them in that way, no matter what the doc in ggplot2 says?

Makes sense and that's what I did for the xintercept and yintercept, for the others I'll be checking some of the geoms and stats that we don't have yet to get a better perspective. Maybe even raise it up on their issues. Stuff is supposed to fall into neat little boxes.

jankatins · 2014-04-10T20:38:08Z

Wow, this is one hell of a commit!

I've given that a light code review, but can't really comment on the actual functionality. I do hope that this is ensured by the unittests (and new one :-))

In my opinion it would make this thingy much easier if the layer refactoring would be in place. There are several place where you comemtn on this and I also think the "each layer has it's own data" makes for less headache in not duplicating/overwrite column names. Do you want to do that? I probably won't get to code on ggplot until after Easter and feel that you are actually more familiar with the code base right now than I am.

has2k1 · 2014-04-11T02:14:54Z

Sorry, you shouldn't have gone through it yet, I should have given you a heads up. All that ugliness is just part of a gradual the transition.

jankatins · 2014-04-11T06:09:54Z

No problem. It would be nice if such large PR could be split into muliple commits. That would make reviewing it easier :-) Please only add commits, not amend/change old ones: that way the next round can only look at the new ones. If you want to cleanup, just do it at the last step via rebase -i. Thanks! :-)

has2k1 · 2014-04-11T08:56:36Z

I think only the first commit had to be huge -- only way to get stuff working all across.

About the layer refactoring that should come at the end, given how mixed up things have been it would have been as messy. When all is done, I should certainly move some of the code into a layer.py, but I haven't looked into how that will affect the faceting, maybe I will by then.

Also I don't quite get the dataframe object, it seems unpredictable yet we will have to minimize copying and also ensure some stuff is never modified.

Edit:
I'll definitely need to rebase when #271 is merged.

jankatins · 2014-04-28T14:58:42Z

@has2k1 No, I'm waiting for at least on of the big ones to be merged in. If you have to rebase, I think it would be easiest to rebase on top of #222, which seems to be ready to go in.

has2k1 · 2014-04-28T15:13:51Z

Okay, I'll do so when it is merged.

jankatins · 2014-04-28T16:12:04Z

@has2k1 rebase away. #222 is merged :-)

** Noteworthy activity ** - stats are now in a separate directory - each call to `geom_abc(stat='bin')` uses the specified `stat` this creates an object `geom._stat`. - `stat.__radd__` creates and adds a `geom` to the ggplot object - A proper implementation of `stat_bin2d` only constrained by the lack of `fill` mapping - A `geom_histogram` is an alias of `geom_bar` and the bin computations are done in `stat_bin` - `geom_vline` and `geom_hline` can plot multiple lines ** issues ** - `stats` are not computed on proper groups, so the plot may not be correct for some factorings. - Related to the non existent `stat` grouping, any coloring is broken. - Wrong test cases have been exposed but since the current results are also not purely correct the expected results have not been adjusted.

It avoids a mix up with the real groups created by the aesthetic mappings and the group parameter.

- Introduce `geom._extra_requires` to fill the gap between the required aesthetic and calculated stuff from the statistics. - Remove the 'label' parameter where it was used, that is the job of gg_title. i.e ggplot level and not geom or layer level.

ggplot2 line geoms seem to allow mapping to parameters! A clearer view for these "mappable parameters" is to treat them aesthetics in the geoms (users know they can map to them) and as parameters in the stats (users discover they can assign functions which compute on a specific column and yield the required result).

- hijacked ggplot/utils/utils.py, it is a more obvious place for many small functions that are not specific to the ggplot object

For some smoothers, the actual smoothing points are less than the input data points. eg. lowess. The smoothers need to return this information. For consistency all smoothing functions are made to return the x values.

- had to fix utils because of buggy behaviour in matplotlib

- Although this is cleaner than the previous implementation, it turns out the real performance bottleneck is the plotting function for geom_rect -- ax.bar(). It is not meant to draw many small rectangles. The solution will be to use ax.fill_between for geom_rect

In geom initialization, any argument that is both an aesthetic to to the geom and a parameter setting to the stat is taken by the stat if it is not an array-like. This is useful where the weight is an aestetic for a geom, yet it is used during stat computation. For this case, the problem was the API value for the weight is 1 (a scalar), yet it is also mappable. Although scalar weights are essentially meaningless they do manipulate the y-scaling of histograms. geoms abline, vline and hline can take functions to compute the slope and intercepts. The functions are used by the respective stats.

- After computing the stats, sort dataframe by x, geoms expect it that way, and it is what ggplot2 does - geom_bar, only set the xtick and xlabels for categorical data, prevents numbers with many decimals places - stat_bin, fix for correctly determining the bins with zero counts - stat_bin2d, add 'right' parameter - stat_function, get rid of mapped 'y' value - util.pop(), to pop dataframes and specify a default value - six.string_types, fixes Python 3 not understanding unicode

- categorical get spacing, num-likes don't

- Test that the right parameters are get to the stats for cases where those parameters are also geom aesthetics, - Test all exceptions

- status warning for stat_bin2d - changed tolenrance for some theme related tests - swap in geom_line result images. Plots were different but correct though not failing.

has2k1 · 2014-04-29T22:22:16Z

Update

I think this is an okay merge point although the plotting pipeline is still not correct. Things should be neater when a layer class is eventually created. Before that should happen the scales should be refactored (Proper scales #283) and scale training done.
geom_tile and geom_step are still wrong(probably) and have not been touched. It'll be done later.
stat_bin2d is now mostly correct but has glitches due to a lack of scale training and a proper plotting pipeline (it requires that visual mapping is applied after the stats are computed). With that not in place the fill color has been set to a uniform gray.

jankatins · 2014-04-30T07:48:23Z

@has2k1 ok, tests are passing, so at least no failing code in already tested parts. I will merge.

Can you add issues about the two problems you identified in your last comment?

Refactor layers like ggplot2

jankatins · 2014-04-30T08:03:50Z

@has2k1 Thanks again for this refactorings and your patients with my comments!

has2k1 · 2014-04-30T08:04:32Z

I'll add the issues and hopefully fixes don't come-in in the form of hacks, especially for stat_bin2d.

This was referenced Apr 8, 2014

Select data from external data frame for aesthetics? #262

Closed

Refactor: Legends & fill #270

Closed

Implement geom_line with size mapping #161

Closed

jankatins reviewed Apr 10, 2014
View reviewed changes

has2k1 mentioned this pull request Apr 14, 2014

fix imports and namespaces in qplot #277

Merged

has2k1 mentioned this pull request Apr 23, 2014

Proper scales #283

Closed

has2k1 referenced this pull request Apr 28, 2014

bump version; support for lists in vline and hline

4e95d6c

has2k1 added 16 commits April 29, 2014 14:07

A few fixes from the review

c20fa38

Rename geom._groups to geom._units

6eef91e

It avoids a mix up with the real groups created by the aesthetic mappings and the group parameter.

Clean up API

47561ec

- Introduce `geom._extra_requires` to fill the gap between the required aesthetic and calculated stuff from the statistics. - Remove the 'label' parameter where it was used, that is the job of gg_title. i.e ggplot level and not geom or layer level.

Add utility functions

3551b29

- hijacked ggplot/utils/utils.py, it is a more obvious place for many small functions that are not specific to the ggplot object

Let smoothers return the x values

7c6ffd1

For some smoothers, the actual smoothing points are less than the input data points. eg. lowess. The smoothers need to return this information. For consistency all smoothing functions are made to return the x values.

Make stats accept a dataframe

e07e60f

- had to fix utils because of buggy behaviour in matplotlib

Use GgplotError instead of Exception

59a2106

Fix geom_bar separation distances

1bf786b

- categorical get spacing, num-likes don't

Fixes #286

bf8ebe7

Add tests for the stats

f5d8713

- Test that the right parameters are get to the stats for cases where those parameters are also geom aesthetics, - Test all exceptions

Fix tests

c5a3f24

- status warning for stat_bin2d - changed tolenrance for some theme related tests - swap in geom_line result images. Plots were different but correct though not failing.

has2k1 mentioned this pull request Apr 29, 2014

Add element_text with ability to add theme to axis.title.x #222

Merged

jankatins added a commit that referenced this pull request Apr 30, 2014

Merge pull request #266 from has2k1/refactor-layers-like-ggplot2

de99403

Refactor layers like ggplot2

jankatins merged commit de99403 into yhat:master Apr 30, 2014

has2k1 mentioned this pull request Apr 30, 2014

Refactor geom plot functions to be class/static methods. #292

Closed

has2k1 mentioned this pull request Jun 14, 2014

Reworked bar plot #325

Merged

has2k1 deleted the refactor-layers-like-ggplot2 branch July 29, 2014 12:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Refactor layers like ggplot2 [No. 2] #266

WIP: Refactor layers like ggplot2 [No. 2] #266

has2k1 commented Apr 8, 2014

has2k1 commented Apr 8, 2014

jankatins Apr 10, 2014

has2k1 Apr 11, 2014

jankatins Apr 11, 2014

has2k1 Apr 11, 2014

jankatins Apr 11, 2014

has2k1 Apr 11, 2014

jankatins commented Apr 10, 2014

has2k1 commented Apr 11, 2014

jankatins commented Apr 11, 2014

has2k1 commented Apr 11, 2014

jankatins commented Apr 28, 2014

has2k1 commented Apr 28, 2014

jankatins commented Apr 28, 2014

has2k1 commented Apr 29, 2014

jankatins commented Apr 30, 2014

jankatins commented Apr 30, 2014

has2k1 commented Apr 30, 2014

WIP: Refactor layers like ggplot2 [No. 2] #266

WIP: Refactor layers like ggplot2 [No. 2] #266

Conversation

has2k1 commented Apr 8, 2014

has2k1 commented Apr 8, 2014

status

Immediate direction

Desired objective

jankatins Apr 10, 2014

Choose a reason for hiding this comment

has2k1 Apr 11, 2014

Choose a reason for hiding this comment

jankatins Apr 11, 2014

Choose a reason for hiding this comment

has2k1 Apr 11, 2014

Choose a reason for hiding this comment

jankatins Apr 11, 2014

Choose a reason for hiding this comment

has2k1 Apr 11, 2014

Choose a reason for hiding this comment

jankatins commented Apr 10, 2014

has2k1 commented Apr 11, 2014

jankatins commented Apr 11, 2014

has2k1 commented Apr 11, 2014

jankatins commented Apr 28, 2014

has2k1 commented Apr 28, 2014

jankatins commented Apr 28, 2014

has2k1 commented Apr 29, 2014

jankatins commented Apr 30, 2014

jankatins commented Apr 30, 2014

has2k1 commented Apr 30, 2014