Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geom_histogram: wrong bins? #1651

Closed
collioud opened this issue Jun 16, 2016 · 18 comments
Closed

geom_histogram: wrong bins? #1651

collioud opened this issue Jun 16, 2016 · 18 comments
Labels
Milestone

Comments

@collioud
Copy link

@collioud collioud commented Jun 16, 2016

Hello,

I am using ggplot 2.1.0 to plot histograms, and I have an unexpected behaviour concerning the bins.
I put here an example with left-closed bins (i.e. [ 0, 0.1[ ) with a binwidth of 0.1.

mydf <- data.frame(myvar=c(-1,-0.5,-0.4,-0.1,-0.1,0.05,0.1,0.1,0.25,0.5,1))
myplot <- ggplot(mydf, aes(myvar)) + geom_histogram(aes(y=..count..),binwidth = 0.1, boundary=0.1,closed="left")
myplot
ggplot_build(myplot)$data[[1]]

On this example, one may expect the value -0.4 to be within the bin [-0.4, -0.3[, but it falls instead (mysteriously) in the bin [-0.5,-0.4[. Same thing for the value -0.1 which falls in [-0.2,-0.1[ instead of [-0.1,0[...etc.

Is there something here I do not fully understand (especially with the new "center" and "boundary" params)? Or is ggplot2 doing weird things there?

Thanks in advance,
Best regards,
Arnaud

PS: Also asked here: http://stackoverflow.com/questions/37876096/geom-histogram-wrong-bins

@ecoRoland
Copy link

@ecoRoland ecoRoland commented Jun 17, 2016

I suspect there is not much that can be done. This is probably an example of R-FAQ 7.31.

@collioud
Copy link
Author

@collioud collioud commented Jun 17, 2016

That would be...unfortunate.
However I do not have the same result with ggplot 2.1.0 (http://i66.tinypic.com/1sy2w4.jpg) and ggplot 0.9.3 (http://i65.tinypic.com/4rsvev.jpg), which is very disturbing.
By looking at both graphs, we can see that ggplot 0.9.3 behaves as expected.

@collioud
Copy link
Author

@collioud collioud commented Jun 23, 2016

Well...I suppose I have to stay on ggplot version 0.9.3.

@ptoche
Copy link

@ptoche ptoche commented Jun 23, 2016

It looks like a piece of code originally borrowed from base::hist, intended to deal with rounding errors, was removed in recent versions of ggplot2 (look for diddle). Not sure if that is the cause (Edit looks like the diddle and fuzz are still there under slightly different names, so issue appears to be elsewhere) (Edit 2 I believe the fuzzy parameters are correctly computed and saved in bins$fuzzy, but now I wonder if they are actually used).

Edit Setting the boundary to a different value seems to help. May require tweaking. See my stackoverflow answer. I have made further comments over at stackoverflow, where it is more convenient to post images, even if the discussion probably belonged here to begin with ...

@collioud
Copy link
Author

@collioud collioud commented Jun 24, 2016

Thanks!
But by looking into the file bin.R, I guess the param fuzz took the place of diddle. Could you please confirm this?

@collioud
Copy link
Author

@collioud collioud commented Jun 24, 2016

The boundary param solves indeed this problem if set slightly smaller than the binwidth. But this is not a 'clean' nor logical solution...

@ptoche
Copy link

@ptoche ptoche commented Jun 25, 2016

I've tried to debug the problem, but I'm not familiar at all with ggplot2 (I'm just an enthusiastic user). I have copy-pasted my preliminary investigations here: http://pastebin.com/cYbczwqp.

To sum up, I have attempted to track the fuzzy parameters via all the functions they are processed by. So far, they appear to be correctly computed and saved under bins$fuzzy. But I don't see where/if they are actually used. Consider the function bin_vector, copied below. If you replace bins$breaks with bins$fuzzy, at the location indicated below, you get the expected plot. So my current favoured hypothesis is that bin_vector is supposed to condition on something and return bins$breaks sometimes and bins$fuzzy at other times. See below the location I indicate. Expert opinion will be appreciated at this point.

    ggplot2:::bin_vector
function (x, bins, weight = NULL, pad = FALSE) 
{
    stopifnot(is_bins(bins))
    if (all(is.na(x))) {
        return(bin_out(length(x), NA, NA, xmin = NA, xmax = NA))
    }
    if (is.null(weight)) {
        weight <- rep(1, length(x))
    }
    else {
        weight[is.na(weight)] <- 0
    }

    bin_idx <- cut(x, bins$breaks, right = bins$right_closed, 
        include.lowest = TRUE) ## REPLACE bins$breaks with bins$fuzzy 
                                               ## TO GET EXPECTED OUTPUT
                                               ## CONDITIONAL MISSING HERE?

    bin_count <- as.numeric(tapply(weight, bin_idx, sum, na.rm = TRUE))
    bin_count[is.na(bin_count)] <- 0
    bin_x <- (bins$breaks[-length(bins$breaks)] + bins$breaks[-1])/2
    bin_widths <- diff(bins$breaks)
    if (pad) {
        bin_count <- c(0, bin_count, 0)
        width1 <- bin_widths[1]
        widthn <- bin_widths[length(bin_widths)]
        bin_widths <- c(width1, bin_widths, widthn)
        bin_x <- c(bin_x[1] - width1, bin_x, bin_x[length(bin_x)] + 
            widthn)
    }
    if (any(is.na(bins))) {
        bin_count <- c(bin_count, sum(is.na(bins)))
        bin_widths <- c(bin_widths, NA)
        bin_x <- c(bin_x, NA)
    }
    bin_out(bin_count, bin_x, bin_widths)
}
<environment: namespace:ggplot2>   

@ptoche
Copy link

@ptoche ptoche commented Jun 25, 2016

This is what I think is going on: in the older versions of ggplot2, the fuzzy breaks were assigned to breaks no questions asked:

fuzzybreaks <- sort(breaks) + fuzz

This is line 123 in the old bin function located in stat-bin.r

In the current implementation, the bins structure holds both the non-fuzzy breaks, named bins$breaks as well as the fuzzy breaks, named bins$fuzzy. However, correct me if I'm wrong it appears that nothing is done of bins$fuzzy and only the non-fuzzy bins$breaks are passed to the bin_vector function and ultimately to geom_histogram. I expected to find a condition upon which to set either bins$breaks or bins$fuzzy somewhere at the top of bin_vector

@collioud
Copy link
Author

@collioud collioud commented Jun 27, 2016

Indeed, bin$fuzzy does not seem to be used anywhere.
I hope Hadley (or anyone competent) will enlighten us on this topic...

@hadley
Copy link
Member

@hadley hadley commented Jul 28, 2016

More minimal reprex:

mydf <- data.frame(x = c(-1, -0.5, -0.4, -0.1, -0.1, 0.05, 0.1, 0.1, 0.25, 0.5, 1))
p <- ggplot(mydf, aes(x)) + 
  geom_histogram(binwidth = 0.1, boundary = 0.1, closed = "left")
layer_data(p) %>% subset(count > 0) %>% .[1:5]

Looks like a FP bug. @collioud would you mind trying to make the example even more minimal? (i.e. progressively remove the points in x until the problem no longer appears). That'll make it easier to find and fix

@ptoche
Copy link

@ptoche ptoche commented Jul 28, 2016

@hadley
Copy link
Member

@hadley hadley commented Jul 28, 2016

No, thanks - want to do a PR?

@ptoche
Copy link

@ptoche ptoche commented Jul 28, 2016

do you remember when you intended to use bins$breaks and when you intended to use bins$fuzzy?

Apart from having located the line where bins$fuzzy might have been intended, it's not much of a PR as is, I think (never done a PR before)...

@hadley
Copy link
Member

@hadley hadley commented Jul 28, 2016

I don't remember anything about that code :(

@ptoche
Copy link

@ptoche ptoche commented Jul 28, 2016

I'm not surprised, I'm struggling too and it was only one month ago I was reading it! ;-)

My understanding of the situation is this: there was a major refactoring of the code. In the earlier version, something like bins$fuzzy was used (under a different name); in the revised version, both a bins$breaks and a bins$fuzzy coexist, but only the former is used. I would guess that substituting bins$fuzzy for bins$breaks would be reverting to something close to the earlier implementation. But since you took the trouble of carrying both the non-fuzzy bins$breaks and their fuzzy counterpart bins$fuzzy, you probably intended to use both based on whether a particular condition was satisfied (hypothethical condition not currently in the code).

@thomasp85
Copy link
Member

@thomasp85 thomasp85 commented Aug 12, 2016

It seems this is a simple oversight during refactoring - exchanging bins$breaks for bin$fuzzy should solve this... Can't see any negative impact. (as @hadley can't remember I think it is safe to say that fixing by reverting to old behaviour is the best approach)

I can fix this in a PR, but if you want to get PR experience @ptoche this is a simple case that you can learn something from (and I will help you get it right)

@thomasp85 thomasp85 added this to the v2.2.0 milestone Aug 12, 2016
@ptoche
Copy link

@ptoche ptoche commented Aug 25, 2016

Thanks for the feedback Thomas, I was away from the internet (partly at least) during my summer break, just getting back to the 'real' world: yes I'd love to give PR a try. I'll read up on it first and get back to you.

@ptoche
Copy link

@ptoche ptoche commented Sep 16, 2016

Thanks Hadley, Thanks Thomas, I've had no time to follow up with a PR, so thanks!

@lock lock bot locked as resolved and limited conversation to collaborators Jun 19, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants