2d histogram within min/max limits has border rows/column that are all zero #2337

Open
abf7d opened this issue Feb 9, 2023 · 4 comments

@abf7d
Contributor

abf7d commented Feb 9, 2023

Description
I'm trying to compute a two-dimensional histogram using the df.count method. I want the histogram to be bounded by the min/max values of each axis; in other words, I want the histogram to stretch over the whole chart. I expect at least one non-zero bin in every edge row and column. The problem is that I get back histograms with multiple contiguous all-zero rows or columns along the border.

How do I generate a histogram of two columns where each edge contains the bounding min or max value for the row / column?

Here is an example of a histogram that I generated which is not bound by non-zero bins along the edges. The top, bottom, and right edges of this histogram have a lot of empty area:

image

The bin values match what is rendered in the chart.

In my code, I first get the limits:

    limits = df.limits(list(axes_val.values()), delay=True, selection=True)
    await df.execute_async()
    limits = await limits

then I get and return the bins:

    hist = df.count(
        binby=list(axes_val.values()),
        limits=limits,
        shape=num_bins,
        delay=True,
        selection=True,
    )

    await df.execute_async()
    hist = await hist

    # Handle the case where every bin is zero
    if sum(hist[hist > 0].shape) == 0:
        counts = [0, 0]

    else:
        counts = [hist[hist > 0].min(), hist.max()]
        counts = [0 if numpy.isinf(c) else c for c in counts]

        # Normalize the histogram counts
        hist = (hist - counts[0]) / (counts[1] - counts[0] + 0.001)
        hist = hist * 254 + 1
        hist[hist < 0] = 0
        hist = hist.astype("uint8")

    output = {"bins": hist.tolist(), "limits": limits, "counts": counts}

    return output

Software information

  • Vaex version vaex==4.14.0
  • Vaex was installed via: pip
  • OS: Windows

@maartenbreddels
Member

You probably have some outliers in your data. Also, in Vaex the histogram bins are half open, [min, max), so a value exactly equal to max is not counted. A dirty way to include the last value in the last bin is to use limits=[[xmin, xmax+eps], [ymin, ymax+eps], ...] where eps=1e-10, or ideally (1e-16/(xmin-xmax)). Does that make sense?
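To make the half-open behaviour concrete, here is a minimal NumPy sketch. It is not Vaex's implementation: `half_open_counts` is a hypothetical helper that only assumes equal-width bins over [lo, hi), and the data values are made up. It shows the maximum value being dropped with strict limits and landing in the last bin once the upper limit is padded by eps.

```python
import numpy as np

def half_open_counts(x, lo, hi, n_bins):
    """Count values into n_bins equal-width bins covering [lo, hi).

    Hypothetical helper for illustration; not Vaex code.
    """
    x = np.asarray(x, dtype=float)
    # Scale each value to a fractional bin position, then floor to an index.
    idx = np.floor((x - lo) / (hi - lo) * n_bins).astype(int)
    # Values outside [lo, hi), including x == hi, are dropped.
    inside = (idx >= 0) & (idx < n_bins)
    return np.bincount(idx[inside], minlength=n_bins)

data = np.array([0.0, 0.5, 1.5, 2.5, 4.0])  # made-up sample
xmin, xmax = data.min(), data.max()

# Strict limits: the value equal to xmax falls outside [xmin, xmax).
strict = half_open_counts(data, xmin, xmax, 4)

# Padded upper limit: xmax now falls inside the last bin.
eps = 1e-10
padded = half_open_counts(data, xmin, xmax + eps, 4)

print(strict.tolist())  # [2, 1, 1, 0]
print(padded.tolist())  # [2, 1, 1, 1]
```

Note that this only explains a single empty edge row or column. Several contiguous empty rows usually point to outliers stretching the limits.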

@abf7d
Contributor Author

abf7d commented Feb 13, 2023

I think I understand. Let me clarify: by half open, do you mean that, for the max value, the bins go up to but don't include the last point? So I should add eps to my max values to include the max point?

Also, should that value be (1e-16/(xmax - xmin)) or (1e-16/(xmin - xmax))?

@maartenbreddels
Member

Yes, and yes :) and yes!

@abf7d
Contributor Author

abf7d commented Feb 13, 2023

Thank you so much. I tried the formula provided and it looks like for one of my axes eps is too small. It gets rounded off. When I tried eps=1e-10 it works. Again, I appreciate you pointing me in the right direction and your quick response!
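For anyone hitting the same rounding problem, the sketch below shows why a very small eps can disappear: if eps is smaller than about half the floating-point spacing at xmax, then xmax + eps rounds back to xmax. The limits used here are hypothetical, and `survives` is an illustrative helper. `np.nextafter` and `np.spacing` are standard NumPy functions; using them to pad the limit is only a suggestion, and whether a single step is enough margin depends on how the binning code computes bin indices, so a slightly larger relative margin may be safer.

```python
import numpy as np

def survives(xmax, eps):
    """Return True if adding eps actually changes xmax in float64."""
    return (xmax + eps) > xmax

# Hypothetical small-range axis: the formula-based eps is rounded away,
# while the fixed eps=1e-10 is large enough to register.
small_max = 10.0
print(survives(small_max, 1e-16 / (small_max - 0.0)))  # False
print(survives(small_max, 1e-10))                      # True

# Hypothetical large-valued axis: even eps=1e-10 is rounded away,
# because the spacing between adjacent floats near 1e9 is about 1.2e-7.
large_max = 1.0e9
print(survives(large_max, 1e-10))                      # False
print(np.spacing(large_max))

# A magnitude-aware alternative: step to the next representable float.
upper = np.nextafter(large_max, np.inf)
print(upper > large_max)                               # True
```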
