Reworked bar plot #325

arnfred · 2014-06-12T12:59:53Z

Rework of bar plot

I realize that @ericchiang just added another pull request for something similar here: #324 so in light of that I'm putting this up here so that we might merge the two. The differences as I can see between the two patches are:

Eric's add position="dodge" (mine doesn't)
Mine fix problem with bar plots and facets and add custom label order

It is now possible to:

Stack bars of different colors. In fact this is the default behaviour
when a color of fill is specified (like R's ggplot)
Reorder columns in a bar plot by adding a label parameter as so:
geom_bar(labels=["blip","blop","dyt","båt"]) as discussed here:
How should users reorder the x axis bins in a bar chart? #315
Show only a select few of the columns (as above, just don't include
all possible x-values)
Use bar plots when faceting (although faceting overwrites the
x-labels for some reason: facet_wrap() removes color legends #319)
This addresses facets with descrete values (e.g. geom_bar) does not work #196

In order to do this I had to make a few changes.

Most importantly I added a method '_calculate_global' to stats.py. It
is not required to implement it in subclasses but for stat_bin it
makes it possible to read in all labels ahead of time and then when we
are processing separate groups fill the labels not represented with
0's.
I added a new stat class 'stat_bar' which for simple x/y bar plots
(not histograms) take advantage of '_calculate_global' to deliver
consistent bar plots when using facets or when stacking colors
I changed geom_bar to use stat_bar per default and geom_histogram to
use stat_bin. This differentiates the bar plot and the histogram in a
way that makes intuitive sense (I would think at least)
I disabled sorting the x-axis by default to preserve user specified
label orders. As far as I can see only geom_area needs the x-axis to
be ordered. I've added code to sort the x-axis inside geom_area.

Examples

    gg = ggplot(diamonds, aes(x = 'clarity', fill = 'cut', color='cut')) +\
            stat_bin(binwidth=1200) + facet_wrap("color")

    df = pd.DataFrame({ "y" : [3.362, 1.2, 3.424, 2.574, 0.679],
                        "x" : ["BF","BF","Flann","FastMatch","FastMatch2"],
                        "c" : ["a", "b", "a", "a","a"]})
    p = ggplot(df, aes(x = 'x', y = 'y', fill="c"))
    gg2 = p + geom_bar(stat="bar", labels=["BF","Flann","FastMatch"])

    df = pd.DataFrame({
        "x": np.arange(0, 10),
        "y": np.arange(0, 10),
        "z": np.arange(0, 10),
        "a": [1,1,1,1,1,2,2,2,3,3],
        "b": ["a","a","a","a","a","b","b","b","c","c"]
    })
    df['facets'] = np.where(df.x > 4, 'over', 'under')
    df['facets2'] = np.where((df.x % 2) == 0, 'even', 'uneven')
    df = _build_testing_df()
    gg = ggplot(aes(x='a', y='y', fill='y'), data=df)

(from #196 but modified to show color)

It is now possible to: - Stack bars of different colors. In fact this is the default behaviour when a color of fill is specified (like R's ggplot) - Reorder columns in a bar plot by adding a label parameter as so: geom_bar(labels=["blip","blop","dyt","båt"]) as discussed here: yhat#315 - Show only a select few of the columns (as above, just don't include all possible x-values) - Use bar plots when faceting (although faceting overwrites the x-labels for some reason: yhat#319) This addresses yhat#196 In order to do this I had to make a few changes. - Most importantly I added a method '_calculate_global' to stats.py. It is not required to implement it in subclasses but for stat_bin it makes it possible to read in all labels ahead of time and then when we are processing separate groups fill the labels not represented with 0's. - I added a new stat class 'stat_bar' which for simple x/y bar plots (not histograms) take advantage of '_calculate_global' to deliver consistent bar plots when using facets or when stacking colors - I changed geom_bar to use stat_bar per default and geom_histogram to use stat_bin. This differentiates the bar plot and the histogram in a way that makes intuitive sense (I would think at least) - I disabled sorting the x-axis by default to preserve user specified label orders. As far as I can see only geom_area needs the x-axis to be ordered. I've added code to sort the x-axis inside geom_area.

ericchiang · 2014-06-12T13:24:17Z

This looks really awesome! I'm inclined to merge this PR over my own then work in position="dodge" after since mine won't behave as well with facets (plus there's stat_bin).

Could you speak a little more on _calculate_globals()? I have a feeling that this could be a good fix for a few problems we have with other geoms.

arnfred · 2014-06-12T14:22:30Z

That's great to hear. I was worried the work had been in vain.

The idea behind _calculate_global() is to make it possible for a stat module like stat_bin to draw information from a full dataframe instead of only working on dataframes on a group to group basis.

The motivation behind it was a problem I had with stat_bin. When we calculate bin-widths we basically do (max(data['x']) - min(data['x']) / number_of_bins. However when we pass in data groups the max and min values of particular groups will differ from group to group making the bin widths different. That meant that when I was stacking a barplot by colors, the columns would look like badly stacked jenga blocks.

Currently I pass the entire data column to _calculate_global() in the function calculate_stats in geom.py. Then in ``stat_bin` I have access to the entire range of x-values which makes it easy to create a list of x-values as an instance variable and use it across groups. This also means that when there are x-values that aren't present in a particular group, we know about it and can fill in the missing gaps with zeros.

Another benefit is that _calculate_global() makes it very simple to include only a subset or a custom ordering of columns in the plot.

It turns out that geom_line also expects x-values to be sorted. This commit moves the function ```_sort_list_types_by_x``` to geom.py and renames it to ```sort_by_x``` and adds a call to it in ```geom_line```

It turns out that geom_line also expects x-values to be sorted. This commit moves the function ```_sort_list_types_by_x``` to geom.py and renames it to ```sort_by_x``` and adds a call to it in ```geom_line``` (amended to no longer break build)

Conflicts: ggplot/geoms/geom.py

has2k1 · 2014-06-13T12:11:30Z

If we are to conform with ggplot2, geom_histogram is just an alias of geom_bar. This is the case because histogram inherently implies a statistic has been computed, i.e a binning statistic, yet the idea of the grammar of graphics is to separate the graphing and the statistics. As a result geom_bar should be the only geom, but geom_histogram as a name is there to meet user expectations.

The calculate_global function is a work around to some of the issues of not having scale trainning, (see #283). Ideally #283, #221 and #259 need to be addressed to avoid the multiple edge cases especially when it comes to facetting. It would also allow for a seamless integration of positioning across the board.

ericchiang · 2014-06-14T01:56:21Z

@arnfred In relation to @has2k1 concern about geom_histogram, can you do an assignment rather than a subclass? For example:

# geom_histogram.py
from .geom_bar import geom_bar
geom_histogram = geom_bar

After that I'd be happy to merge. We can add in position=dodge after.

arnfred · 2014-06-14T06:36:17Z

@ericchiang, Sure I'll look in to that. Should I strike the DEFAULT_PARAMS = {'stat': 'bar', 'position': 'stack'} in geom_bar so that it will be completely analogous to geom_histogram? I can see the motivation about keeping stats and geoms completely separate but at the same time I remember originally finding geom_bar counter intuitive because I had to assign a separate stat to do a simple bar plot.

@has2k1 I agree with you that calculate_global could be replaced with proper scales if the scales provide a geom with a way to know the range of the data. I think it would be easy to replace calculate_global with this information once we have it.

I've looked in to the issues that you referenced and I realize that another pull request I've made (#327 - reworked legends) might also conflict with the refactoring that I see you and @JanSchulz discussing in #283. At the same time it corrects a lot of other small issues with legends that I think might be useful in any case.

has2k1 · 2014-06-14T08:05:29Z

The way to think about it is, geom_bar defaults to a histogram because the default stat is stat_bin and for a bar plot you use stat_identity on top of providing the y aesthetic. The confusion comes about when you start applying statistics other than stat_bin to geom_histogram.

Getting the scales sorted out would allow us to add scale training -- which really means find the ranges of the data. The stats, positions and the geoms need it otherwise you get edge cases of data falling off the limits and an inability to really predict the size of the margins around the plots. In fact it has to be done three times during plotting pipeline to ensure a perfect plot.

_calculate_global is very appealing, but I refrained from doing something eerily similar in #266 because then _position_global and _plot_global would soon pop up and I wasn't convinced that they were enough to get us not to tackle the scales issue head on. The ggplot2 logs showed a fair number of bugs on the issue.

I'll add a comment at #327.

arnfred · 2014-06-14T08:26:22Z

@has2k1, thanks for your comment. The code in stat_bar is a workaround because of the same issue, and if there were scales they would make it possible to use stat_identity for a stacked bar plot without problems where as right now it isn't possible without breaking facets.

I wonder though, would it not make sense to have two geoms (geom_bar and geom_histogram) that defaults to two different stats? On a logical level the stats are still separate, and on an api level there aretwo geom's in all cases. That geom_bar currently defaults to stat_bin and not stat_identity seems like an arbitrary choice anyway. What I mean to ask is, what is the motivation for having both geom's default to the same stat?

has2k1 · 2014-06-14T08:45:09Z

That makes sense and I would prefer it that way, but it would be a major break in compatibility with ggplot2. One thing I thought of to stride between the two is be smart and do a bar plot if the y aesthetic is specified and a histogram if none. This is, instead displaying a warning about the y aesthetic and stat_bin.

arnfred · 2014-06-14T10:48:16Z

I like that solution, but I don't think it's possible currently. geom_bar only learns of the data when we call _plot_unit, and by then the stats have already been calculated.

has2k1 · 2014-06-14T11:16:34Z

The distinction would be in stat_bin. Instead of a message to the effect of "we have ignored the y aesthetic", we leave the y aesthetic, skip any computations and have a message to the effect of "because of your y values we are not computing anything".

ericchiang · 2014-06-16T23:07:19Z

While there are issues to work out I'd still like to merge in the PR for the greater goal of having the ability to stack bar plots. Can we forego a little compatibility with ggplot2 for that in the meantime?

has2k1 · 2014-06-17T05:48:07Z

Although we get the to stack bar plots, it comes at a cost. This PR introduces an internal API that will ultimately need to be replaced if further progress is to be made. This API will most likely be used to fix other bugs. When I considered doing something similar, I counted about 4 bugs that it would solve. So, whenever it gets removed there will be complications and regressions.

In addition the geom_histogram is no longer a separate class

arnfred · 2014-06-17T10:24:32Z

I've just pushed a commit with your suggested changes, @ericchiang. I've reverted the behaviour to use stat_bin on default and changed the error message to suggest using stat_bar.

@has2k1, I liked the idea of defaulting to stat_bin but switching in case we saw a y variable, however I was reluctant to implement this behaviour in stat_bin since I suspect stat bin could be used for some other purpose than a bar plot where this behaviour might be unwanted, and putting the behaviour anywhere else felt a little hack'ish.

I can also see the argument against adding an internal API just to superseed it with something else. One solution might be to make the introduced API less useful so that it solves just the problem of stacking bar plots in which case it should lead to less regressions.

Reworked bar plot

glamp · 2014-06-26T18:56:39Z

sorry for the delay

sixers · 2015-01-02T18:54:16Z

Any update on position="dodge" ?

has2k1 · 2015-01-04T08:55:10Z

@sixers, in PR #360(still very far from ready), I have ported all the position stuff to match ggplot.

arnfred added 3 commits June 13, 2014 05:52

Sort x-values in geom_line

1009348

It turns out that geom_line also expects x-values to be sorted. This commit moves the function ```_sort_list_types_by_x``` to geom.py and renames it to ```sort_by_x``` and adds a call to it in ```geom_line```

Sort x-values in geom_line

a8f3f36

It turns out that geom_line also expects x-values to be sorted. This commit moves the function ```_sort_list_types_by_x``` to geom.py and renames it to ```sort_by_x``` and adds a call to it in ```geom_line``` (amended to no longer break build)

Fixing my failed git rebase attempts

952ded7

Conflicts: ggplot/geoms/geom.py

Changed bar plot to use bin stat on default

7cf9048

In addition the geom_histogram is no longer a separate class

glamp added a commit that referenced this pull request Jun 26, 2014

Merge pull request #325 from arnfred/bar

bfe77f9

Reworked bar plot

glamp merged commit bfe77f9 into yhat:master Jun 26, 2014

glamp mentioned this pull request Jun 26, 2014

multi bar plots added #324

Closed

jrk mentioned this pull request Dec 30, 2015

Missing support for grouped bar charts (position='dodge') #464

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reworked bar plot #325

Reworked bar plot #325

arnfred commented Jun 12, 2014

ericchiang commented Jun 12, 2014

arnfred commented Jun 12, 2014

has2k1 commented Jun 13, 2014

ericchiang commented Jun 14, 2014

arnfred commented Jun 14, 2014

has2k1 commented Jun 14, 2014

arnfred commented Jun 14, 2014

has2k1 commented Jun 14, 2014

arnfred commented Jun 14, 2014

has2k1 commented Jun 14, 2014

ericchiang commented Jun 16, 2014

has2k1 commented Jun 17, 2014

arnfred commented Jun 17, 2014

glamp commented Jun 26, 2014

sixers commented Jan 2, 2015

has2k1 commented Jan 4, 2015

Reworked bar plot #325

Reworked bar plot #325

Conversation

arnfred commented Jun 12, 2014

Rework of bar plot

It is now possible to:

In order to do this I had to make a few changes.

ericchiang commented Jun 12, 2014

arnfred commented Jun 12, 2014

has2k1 commented Jun 13, 2014

ericchiang commented Jun 14, 2014

arnfred commented Jun 14, 2014

has2k1 commented Jun 14, 2014

arnfred commented Jun 14, 2014

has2k1 commented Jun 14, 2014

arnfred commented Jun 14, 2014

has2k1 commented Jun 14, 2014

ericchiang commented Jun 16, 2014

has2k1 commented Jun 17, 2014

arnfred commented Jun 17, 2014

glamp commented Jun 26, 2014

sixers commented Jan 2, 2015

has2k1 commented Jan 4, 2015