Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Refactor layers like ggplot2 [No. 2] #266

Merged
merged 16 commits into from
Apr 30, 2014
Merged

WIP: Refactor layers like ggplot2 [No. 2] #266

merged 16 commits into from
Apr 30, 2014

Conversation

has2k1
Copy link

@has2k1 has2k1 commented Apr 8, 2014

This is a continuation of #252.

There is still some relevant discussion that happened at #221 that will guide the direction.

This PR is a work in progress. Some comments will include highlights of the status, the immediate direction and maybe a desired objective outcome.

@has2k1
Copy link
Author

has2k1 commented Apr 8, 2014

status

  • The first commit is still rough in the way the geoms and stats are linked up in geom.py and stat.py, especially how the parameters between the two are handled.
  • stats are still computed on the wrong groups.
  • Failing test cases where the expected results are wrong and the current output is still not entirely correct.

Immediate direction

  • Make clear the internal API by which the geoms and stats know that they are compatible. i.e. if geom_abc requires aesthetics {'x', 'y'} and stat_uvw requires {'x'} and creates {'y', 'other_computation'}, this call geom_abc(aes(x='x'), stat='uvw') is valid. Also other geoms may require that 'other_computation' (which is not a recognised aesthetic or parameter name). That is the rationale for the internal API. ggplot2 does some re-parameterization step, it is obscure as it doesn't make it immediately clear what pieces are compatible.
  • Get the calculating function of the stats to accept a dataframe and not a dictionary like the plotting function of the geoms.
  • Proper grouping for the stats. The groups will be the aesthetics and parameters that are specified in the mappings. Therefore they will need to be sorted out somewhere in the ggplot or the layer classes.

Desired objective

  • With the user and internal APIs (dictionaries and sets), we should be able to automatically generate a chart or network-graph of which geoms and stats can be applied together.

@@ -76,8 +86,16 @@ def plot_layer(self, data, ax):
if 'linestyle' in data:
data = data.rename(columns={'linestyle': 'linetype'})

# should happen in the layer
data = data[list(set(data.columns) & set(self.valid_aes))]
# In ggplot2, aes() does map columns in data to parameters, not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. I always thought that aes are "aestetic mappings", and parameter are either setting some mapping to a single value or are used for some other stuff (the geom_function stuff).

If In understand the code correctly, I would comment it as "only keep columns which are used in this layer"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a huge surprise to me too. In some places ggplot2 seems to blur the lines between aesthetics and parameters.

This is from geom_abline

# Slopes and intercepts as data
p <- ggplot(mtcars, aes(x = wt, y=mpg), . ~ cyl) + geom_point()
df <- data.frame(a=rnorm(10, 25), b=rnorm(10, 0))
p + geom_abline(aes(intercept=a, slope=b), data=df)

intercept and slope are parameters to stat_abline.

It turns out you can map stuff to any parameters. So aesthetic mappings should just be thought of as mappings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this simple mean that intercept and slope are mappings (and should go there) and are just usually used as settings?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that at first, but it is not the case. It does not seem to be an error in the documentation although there are a bunch. They aren't even listed among the all the aesthetics, in aes.r

They are just like any other parameter. However, for geom_vline and geom_hline the xintercept and yintercept are parameters but they are listed in aes.r as part of all aesthetics.

So whatever is and is not listed as an aesthetic is immaterial, the code maps both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a R handy right now, but if using it as mapping produces multiple slope lines (or can be used in faceting to produce different lines per facet), then they should e mappings and we should just use them in that way, no matter what the doc in ggplot2 says?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense and that's what I did for the xintercept and yintercept, for the others I'll be checking some of the geoms and stats that we don't have yet to get a better perspective. Maybe even raise it up on their issues. Stuff is supposed to fall into neat little boxes.

@jankatins
Copy link
Contributor

Wow, this is one hell of a commit!

I've given that a light code review, but can't really comment on the actual functionality. I do hope that this is ensured by the unittests (and new one :-))

In my opinion it would make this thingy much easier if the layer refactoring would be in place. There are several place where you comemtn on this and I also think the "each layer has it's own data" makes for less headache in not duplicating/overwrite column names. Do you want to do that? I probably won't get to code on ggplot until after Easter and feel that you are actually more familiar with the code base right now than I am.

@has2k1
Copy link
Author

has2k1 commented Apr 11, 2014

Sorry, you shouldn't have gone through it yet, I should have given you a heads up. All that ugliness is just part of a gradual the transition.

@jankatins
Copy link
Contributor

No problem. It would be nice if such large PR could be split into muliple commits. That would make reviewing it easier :-) Please only add commits, not amend/change old ones: that way the next round can only look at the new ones. If you want to cleanup, just do it at the last step via rebase -i. Thanks! :-)

@has2k1
Copy link
Author

has2k1 commented Apr 11, 2014

I think only the first commit had to be huge -- only way to get stuff working all across.

About the layer refactoring that should come at the end, given how mixed up things have been it would have been as messy. When all is done, I should certainly move some of the code into a layer.py, but I haven't looked into how that will affect the faceting, maybe I will by then.

Also I don't quite get the dataframe object, it seems unpredictable yet we will have to minimize copying and also ensure some stuff is never modified.

Edit:
I'll definitely need to rebase when #271 is merged.

@has2k1 has2k1 mentioned this pull request Apr 23, 2014
@jankatins
Copy link
Contributor

@has2k1 No, I'm waiting for at least on of the big ones to be merged in. If you have to rebase, I think it would be easiest to rebase on top of #222, which seems to be ready to go in.

@has2k1
Copy link
Author

has2k1 commented Apr 28, 2014

Okay, I'll do so when it is merged.

@jankatins
Copy link
Contributor

@has2k1 rebase away. #222 is merged :-)

** Noteworthy activity **
- stats are now in a separate directory
- each call to `geom_abc(stat='bin')` uses the specified `stat`
  this creates an object `geom._stat`.
- `stat.__radd__` creates and adds a `geom` to the ggplot object
- A proper implementation of `stat_bin2d` only constrained
  by the lack of `fill` mapping
- A `geom_histogram` is an alias of `geom_bar` and the bin
  computations are done in `stat_bin`
- `geom_vline` and `geom_hline` can plot multiple lines

** issues **
- `stats` are not computed on proper groups, so the plot may not
  be correct for some factorings.
- Related to the non existent `stat` grouping, any coloring is broken.
- Wrong test cases have been exposed but since the current results are
  also not purely correct the expected results have not been adjusted.
It avoids a mix up with the real groups created
by the aesthetic mappings and the group parameter.
- Introduce `geom._extra_requires` to fill the gap between the
  required aesthetic and calculated stuff from the statistics.
- Remove the 'label' parameter where it was used, that is the
  job of gg_title. i.e ggplot level and not geom or layer level.
ggplot2 line geoms seem to allow mapping to parameters!
A clearer view for these "mappable parameters" is to treat them
aesthetics in the geoms (users know they can map to them) and
as parameters in the stats (users discover they can assign functions
which compute on a specific column and yield the required result).
- hijacked ggplot/utils/utils.py, it is a more obvious place for
  many small functions that are not specific to the ggplot object
For some smoothers, the actual smoothing points are less than
the input data points. eg. lowess. The smoothers need to return
this information.

For consistency all smoothing functions are made to return the
x values.
- had to fix utils because of buggy behaviour in matplotlib
- Although this is cleaner than the previous implementation, it turns
  out the real performance bottleneck is the plotting function for
  geom_rect -- ax.bar(). It is not meant to draw many small
  rectangles. The solution will be to use ax.fill_between for geom_rect
In geom initialization, any argument that is both an aesthetic to
to the geom and a parameter setting to the stat is taken by the
stat if it is not an array-like. This is useful where the weight is
an aestetic for a geom, yet it is used during stat computation.
For this case, the problem was the API value for the weight is 1
(a scalar), yet it is also mappable. Although scalar weights are
essentially meaningless they do manipulate the y-scaling of histograms.

geoms abline, vline and hline can take functions to compute the slope
and intercepts. The functions are used by the respective stats.
- After computing the stats, sort dataframe by x, geoms expect it
  that way, and it is what ggplot2 does
- geom_bar, only set the xtick and xlabels for categorical data,
  prevents numbers with many decimals places
- stat_bin, fix for correctly determining the bins with zero counts
- stat_bin2d, add 'right' parameter
- stat_function, get rid of mapped 'y' value
- util.pop(), to pop dataframes and specify a default value
- six.string_types, fixes Python 3 not understanding unicode
- categorical get spacing, num-likes don't
- Test that the right parameters are get to the stats for cases where
  those parameters are also geom aesthetics,
- Test all exceptions
- status warning for stat_bin2d
- changed tolenrance for some theme related tests
- swap in geom_line result images. Plots were different but correct
  though not failing.
@has2k1
Copy link
Author

has2k1 commented Apr 29, 2014

Update

  • I think this is an okay merge point although the plotting pipeline is still not correct. Things should be neater when a layer class is eventually created. Before that should happen the scales should be refactored (Proper scales #283) and scale training done.
  • geom_tile and geom_step are still wrong(probably) and have not been touched. It'll be done later.
  • stat_bin2d is now mostly correct but has glitches due to a lack of scale training and a proper plotting pipeline (it requires that visual mapping is applied after the stats are computed). With that not in place the fill color has been set to a uniform gray.

@jankatins
Copy link
Contributor

@has2k1 ok, tests are passing, so at least no failing code in already tested parts. I will merge.

Can you add issues about the two problems you identified in your last comment?

jankatins added a commit that referenced this pull request Apr 30, 2014
@jankatins jankatins merged commit de99403 into yhat:master Apr 30, 2014
@jankatins
Copy link
Contributor

@has2k1 Thanks again for this refactorings and your patients with my comments!

@has2k1
Copy link
Author

has2k1 commented Apr 30, 2014

I'll add the issues and hopefully fixes don't come-in in the form of hacks, especially for stat_bin2d.

@has2k1 has2k1 mentioned this pull request Jun 14, 2014
@has2k1 has2k1 deleted the refactor-layers-like-ggplot2 branch July 29, 2014 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants