Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reworked bar plot #325

Merged
merged 5 commits into from
Jun 26, 2014
Merged

Reworked bar plot #325

merged 5 commits into from
Jun 26, 2014

Conversation

arnfred
Copy link

@arnfred arnfred commented Jun 12, 2014

Rework of bar plot

I realize that @ericchiang just added another pull request for something similar here: #324 so in light of that I'm putting this up here so that we might merge the two. The differences as I can see between the two patches are:

  • Eric's add position="dodge" (mine doesn't)
  • Mine fix problem with bar plots and facets and add custom label order

It is now possible to:

In order to do this I had to make a few changes.

  • Most importantly I added a method '_calculate_global' to stats.py. It
    is not required to implement it in subclasses but for stat_bin it
    makes it possible to read in all labels ahead of time and then when we
    are processing separate groups fill the labels not represented with
    0's.
  • I added a new stat class 'stat_bar' which for simple x/y bar plots
    (not histograms) take advantage of '_calculate_global' to deliver
    consistent bar plots when using facets or when stacking colors
  • I changed geom_bar to use stat_bar per default and geom_histogram to
    use stat_bin. This differentiates the bar plot and the histogram in a
    way that makes intuitive sense (I would think at least)
  • I disabled sorting the x-axis by default to preserve user specified
    label orders. As far as I can see only geom_area needs the x-axis to
    be ordered. I've added code to sort the x-axis inside geom_area.

Examples
facet_colors

    gg = ggplot(diamonds, aes(x = 'clarity', fill = 'cut', color='cut')) +\
            stat_bin(binwidth=1200) + facet_wrap("color")

labels_manual

    df = pd.DataFrame({ "y" : [3.362, 1.2, 3.424, 2.574, 0.679],
                        "x" : ["BF","BF","Flann","FastMatch","FastMatch2"],
                        "c" : ["a", "b", "a", "a","a"]})
    p = ggplot(df, aes(x = 'x', y = 'y', fill="c"))
    gg2 = p + geom_bar(stat="bar", labels=["BF","Flann","FastMatch"])

faceting_grid_discrete

    df = pd.DataFrame({
        "x": np.arange(0, 10),
        "y": np.arange(0, 10),
        "z": np.arange(0, 10),
        "a": [1,1,1,1,1,2,2,2,3,3],
        "b": ["a","a","a","a","a","b","b","b","c","c"]
    })
    df['facets'] = np.where(df.x > 4, 'over', 'under')
    df['facets2'] = np.where((df.x % 2) == 0, 'even', 'uneven')
    df = _build_testing_df()
    gg = ggplot(aes(x='a', y='y', fill='y'), data=df)

(from #196 but modified to show color)

It is now possible to:
- Stack bars of different colors. In fact this is the default behaviour
  when a color of fill is specified (like R's ggplot)
- Reorder columns in a bar plot by adding a label parameter as so:
  geom_bar(labels=["blip","blop","dyt","båt"]) as discussed here:
  yhat#315
- Show only a select few of the columns (as above, just don't include
  all possible x-values)
- Use bar plots when faceting (although faceting overwrites the
  x-labels for some reason: yhat#319)
  This addresses yhat#196

In order to do this I had to make a few changes.
- Most importantly I added a method '_calculate_global' to stats.py. It
  is not required to implement it in subclasses but for stat_bin it
  makes it possible to read in all labels ahead of time and then when we
  are processing separate groups fill the labels not represented with
  0's.
- I added a new stat class 'stat_bar' which for simple x/y bar plots
  (not histograms) take advantage of '_calculate_global' to deliver
  consistent bar plots when using facets or when stacking colors
- I changed geom_bar to use stat_bar per default and geom_histogram to
  use stat_bin. This differentiates the bar plot and the histogram in a
  way that makes intuitive sense (I would think at least)
- I disabled sorting the x-axis by default to preserve user specified
  label orders. As far as I can see only geom_area needs the x-axis to
  be ordered. I've added code to sort the x-axis inside geom_area.
@ericchiang
Copy link

This looks really awesome! I'm inclined to merge this PR over my own then work in position="dodge" after since mine won't behave as well with facets (plus there's stat_bin).

Could you speak a little more on _calculate_globals()? I have a feeling that this could be a good fix for a few problems we have with other geoms.

@arnfred
Copy link
Author

arnfred commented Jun 12, 2014

That's great to hear. I was worried the work had been in vain.

The idea behind _calculate_global() is to make it possible for a stat module like stat_bin to draw information from a full dataframe instead of only working on dataframes on a group to group basis.

The motivation behind it was a problem I had with stat_bin. When we calculate bin-widths we basically do (max(data['x']) - min(data['x']) / number_of_bins. However when we pass in data groups the max and min values of particular groups will differ from group to group making the bin widths different. That meant that when I was stacking a barplot by colors, the columns would look like badly stacked jenga blocks.

Currently I pass the entire data column to _calculate_global() in the function calculate_stats in geom.py. Then in ``stat_bin` I have access to the entire range of x-values which makes it easy to create a list of x-values as an instance variable and use it across groups. This also means that when there are x-values that aren't present in a particular group, we know about it and can fill in the missing gaps with zeros.

Another benefit is that _calculate_global() makes it very simple to include only a subset or a custom ordering of columns in the plot.

It turns out that geom_line also expects x-values to be sorted. This
commit moves the function ```_sort_list_types_by_x``` to geom.py and
renames it to ```sort_by_x``` and adds a call to it in ```geom_line```
It turns out that geom_line also expects x-values to be sorted. This
commit moves the function ```_sort_list_types_by_x``` to geom.py and
renames it to ```sort_by_x``` and adds a call to it in ```geom_line```

(amended to no longer break build)
Conflicts:
	ggplot/geoms/geom.py
@has2k1
Copy link

has2k1 commented Jun 13, 2014

If we are to conform with ggplot2, geom_histogram is just an alias of geom_bar. This is the case because histogram inherently implies a statistic has been computed, i.e a binning statistic, yet the idea of the grammar of graphics is to separate the graphing and the statistics. As a result geom_bar should be the only geom, but geom_histogram as a name is there to meet user expectations.

The calculate_global function is a work around to some of the issues of not having scale trainning, (see #283). Ideally #283, #221 and #259 need to be addressed to avoid the multiple edge cases especially when it comes to facetting. It would also allow for a seamless integration of positioning across the board.

@ericchiang
Copy link

@arnfred In relation to @has2k1 concern about geom_histogram, can you do an assignment rather than a subclass? For example:

# geom_histogram.py
from .geom_bar import geom_bar
geom_histogram = geom_bar

After that I'd be happy to merge. We can add in position=dodge after.

@arnfred
Copy link
Author

arnfred commented Jun 14, 2014

@ericchiang, Sure I'll look in to that. Should I strike the DEFAULT_PARAMS = {'stat': 'bar', 'position': 'stack'} in geom_bar so that it will be completely analogous to geom_histogram? I can see the motivation about keeping stats and geoms completely separate but at the same time I remember originally finding geom_bar counter intuitive because I had to assign a separate stat to do a simple bar plot.

@has2k1 I agree with you that calculate_global could be replaced with proper scales if the scales provide a geom with a way to know the range of the data. I think it would be easy to replace calculate_global with this information once we have it.

I've looked in to the issues that you referenced and I realize that another pull request I've made (#327 - reworked legends) might also conflict with the refactoring that I see you and @JanSchulz discussing in #283. At the same time it corrects a lot of other small issues with legends that I think might be useful in any case.

@has2k1
Copy link

has2k1 commented Jun 14, 2014

The way to think about it is, geom_bar defaults to a histogram because the default stat is stat_bin and for a bar plot you use stat_identity on top of providing the y aesthetic. The confusion comes about when you start applying statistics other than stat_bin to geom_histogram.

Getting the scales sorted out would allow us to add scale training -- which really means find the ranges of the data. The stats, positions and the geoms need it otherwise you get edge cases of data falling off the limits and an inability to really predict the size of the margins around the plots. In fact it has to be done three times during plotting pipeline to ensure a perfect plot.

_calculate_global is very appealing, but I refrained from doing something eerily similar in #266 because then _position_global and _plot_global would soon pop up and I wasn't convinced that they were enough to get us not to tackle the scales issue head on. The ggplot2 logs showed a fair number of bugs on the issue.

I'll add a comment at #327.

@arnfred
Copy link
Author

arnfred commented Jun 14, 2014

@has2k1, thanks for your comment. The code in stat_bar is a workaround because of the same issue, and if there were scales they would make it possible to use stat_identity for a stacked bar plot without problems where as right now it isn't possible without breaking facets.

I wonder though, would it not make sense to have two geoms (geom_bar and geom_histogram) that defaults to two different stats? On a logical level the stats are still separate, and on an api level there aretwo geom's in all cases. That geom_bar currently defaults to stat_bin and not stat_identity seems like an arbitrary choice anyway. What I mean to ask is, what is the motivation for having both geom's default to the same stat?

@has2k1
Copy link

has2k1 commented Jun 14, 2014

That makes sense and I would prefer it that way, but it would be a major break in compatibility with ggplot2. One thing I thought of to stride between the two is be smart and do a bar plot if the y aesthetic is specified and a histogram if none. This is, instead displaying a warning about the y aesthetic and stat_bin.

@arnfred
Copy link
Author

arnfred commented Jun 14, 2014

I like that solution, but I don't think it's possible currently. geom_bar only learns of the data when we call _plot_unit, and by then the stats have already been calculated.

@has2k1
Copy link

has2k1 commented Jun 14, 2014

The distinction would be in stat_bin. Instead of a message to the effect of "we have ignored the y aesthetic", we leave the y aesthetic, skip any computations and have a message to the effect of "because of your y values we are not computing anything".

@ericchiang
Copy link

While there are issues to work out I'd still like to merge in the PR for the greater goal of having the ability to stack bar plots. Can we forego a little compatibility with ggplot2 for that in the meantime?

@has2k1
Copy link

has2k1 commented Jun 17, 2014

Although we get the to stack bar plots, it comes at a cost. This PR introduces an internal API that will ultimately need to be replaced if further progress is to be made. This API will most likely be used to fix other bugs. When I considered doing something similar, I counted about 4 bugs that it would solve. So, whenever it gets removed there will be complications and regressions.

In addition the geom_histogram is no longer a separate class
@arnfred
Copy link
Author

arnfred commented Jun 17, 2014

I've just pushed a commit with your suggested changes, @ericchiang. I've reverted the behaviour to use stat_bin on default and changed the error message to suggest using stat_bar.

@has2k1, I liked the idea of defaulting to stat_bin but switching in case we saw a y variable, however I was reluctant to implement this behaviour in stat_bin since I suspect stat bin could be used for some other purpose than a bar plot where this behaviour might be unwanted, and putting the behaviour anywhere else felt a little hack'ish.

I can also see the argument against adding an internal API just to superseed it with something else. One solution might be to make the introduced API less useful so that it solves just the problem of stacking bar plots in which case it should lead to less regressions.

glamp added a commit that referenced this pull request Jun 26, 2014
@glamp glamp merged commit bfe77f9 into yhat:master Jun 26, 2014
@glamp
Copy link
Contributor

glamp commented Jun 26, 2014

sorry for the delay

@glamp glamp mentioned this pull request Jun 26, 2014
@sixers
Copy link

sixers commented Jan 2, 2015

Any update on position="dodge" ?

@has2k1
Copy link

has2k1 commented Jan 4, 2015

@sixers, in PR #360(still very far from ready), I have ported all the position stuff to match ggplot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants