Do I understand binning correctly? #59

Closed
domoritz opened this issue Dec 13, 2014 · 7 comments
Labels
RFC / Discussion 💬 For discussing proposed changes

Comments

@domoritz
Member

I'm just trying to confirm the semantics of binning.

To create a histogram, we need to use the same column for x and y (y could also be another quantitative field) and choose to bin x and sum y (in some cases another aggregation might make sense). Also, we need to set the type of the binned field to O to fix the labels. I believe at some point we will want labels that show each bin's range, and we will not want to hide groups with no values, but we can leave that for later.

Vegalite automatically chooses a good bin size that optimizes for 1) not too many bins and 2) nice bin widths (based on the base, usually 10). The only inputs to this binning function are the min and max.
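To make the two criteria concrete, here is a minimal sketch of how a "nice" step could be derived from the min and max alone. It is not Vegalite's actual implementation: the function name `nice_bins`, the `max_bins=20` default, and the choice of trying step/5 and then step/2 are assumptions for illustration.

```python
import math

def nice_bins(lo, hi, max_bins=20, base=10):
    """Pick a 'nice' bin step from the data extent alone (illustrative sketch)."""
    span = hi - lo
    if span <= 0:
        raise ValueError("hi must be greater than lo")
    # Find the smallest power of `base` that yields at most `max_bins` bins.
    step = base ** math.floor(math.log(span / max_bins, base))
    while math.ceil(span / step) > max_bins:
        step *= base
    # Prefer a finer step (step/5, then step/2) if it still respects max_bins.
    for divisor in (5, 2):
        candidate = step / divisor
        if span / candidate <= max_bins:
            step = candidate
            break
    # Align the extent to multiples of the step; this can add one extra bin.
    start = math.floor(lo / step) * step
    stop = math.ceil(hi / step) * step
    return start, stop, step
```

For an extent of 0 to 100 this picks a step of 5 (20 bins), and for 0 to 7 with `max_bins=10` it picks a step of 1.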

[Image: screen shot 2014-12-12 at 17 22 19]

sqlite> select cast((Cost__Repair-25000)/500000 as int)*500000, sum(Cost__Repair) from birdstrikes_json group by cast((Cost__Repair-25000)/500000 as int);
0|9227604
500000|5519778
1500000|2762200
1500000|1715077
3000000|6390178
3500000|3644483
7000000|7043545
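For readers without the database at hand, the same bin-and-sum step can be sketched in plain Python. The helper name `bin_and_sum` and the sample costs are made up for illustration and are not taken from the birdstrikes data; the width and offset mirror the query above.

```python
from collections import defaultdict

def bin_and_sum(values, width, offset=0):
    """Group values into fixed-width bins and sum each bin (illustrative sketch)."""
    totals = defaultdict(int)
    for v in values:
        # int() truncates toward zero, like CAST(... AS int) in SQLite.
        index = int((v - offset) / width)
        # Label each bin by index * width, as the query does.
        totals[index * width] += v
    return dict(sorted(totals.items()))

# Hypothetical repair costs, chosen only to show the behaviour.
costs = [30_000, 120_000, 600_000, 2_100_000]
print(bin_and_sum(costs, width=500_000, offset=25_000))
```

Note that bins with no rows simply do not appear in the result, just as with GROUP BY.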

Choosing two ordinals only makes sense if we also add alpha, size, or a color scale to show the count (without a field). As an alternative, we could use the sum, max, ... of the quantitative scale to map to the alpha, size, or color. But it will never make sense to use another field that we haven't used (?).

[Image: screen shot 2014-12-12 at 17 40 50]

[Image: screen shot 2014-12-12 at 17 47 30]

What I'm trying to understand is what the limitations are and what things we can propagate automatically (or disable in the interface). I'm not yet seeing the generalized rules but will think more about this.

@domoritz domoritz added the RFC / Discussion 💬 For discussing proposed changes label Dec 13, 2014
@kanitw
Member

kanitw commented Dec 13, 2014

Regarding your first point about binning:

After thinking about it, I think we should prevent casting binned Q to O, because doing so actually hides empty bins.

[Image: do_i_understand_binning_correctly__ issue__59 _uwdata_vegalite]
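A small sketch of what keeping empty bins could look like on the data side: every bin between the first and last non-empty one is emitted, with a zero where no rows fell. The helper `fill_empty_bins` and the sample totals are hypothetical, and the sketch assumes integer bin starts and an integer width.

```python
def fill_empty_bins(totals, width):
    """Return totals with explicit zero entries for empty bins (illustrative sketch)."""
    if not totals:
        return {}
    first, last = min(totals), max(totals)
    # Walk every bin start from the first to the last non-empty bin.
    return {start: totals.get(start, 0) for start in range(first, last + width, width)}

# Hypothetical aggregated totals keyed by bin start; two bins are missing.
totals = {0: 150_000, 500_000: 600_000, 2_000_000: 2_100_000}
filled = fill_empty_bins(totals, width=500_000)
```

With the gaps made explicit, a quantitative axis can show them as empty space instead of collapsing them the way an ordinal axis would.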

The reason we were confused is that, currently, bandsize is appropriate only for O but too big for binned Q.

[Image: vegalite_ui]

This makes much more sense as a histogram:

[Image: vega_live_editor__vegalite_mod_]

@kanitw
Member

kanitw commented Dec 13, 2014

I'm not sure about the second point. You mentioned having two O's, but your example showed one O with a binned Q cast as O, which I guess should be disallowed anyway. So the point might no longer be relevant.

If you really mean having two Os, having "other Q" can make sense.
For example, you can see some relationship between When: Time of Day and Wildlife Size and avg Cost Total in point.x-When__Time_of_day-O.y-Wildlife__Size-O.size-avg_Cost__Total_$-Q
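As a rough illustration of the aggregation behind that encoding, the sketch below groups rows by the two ordinal fields and averages the quantitative one, which is the value that would drive the point size. The helper `mean_by_pair` and the sample rows are hypothetical; only the field names come from the spec above.

```python
from collections import defaultdict

def mean_by_pair(rows, x_field, y_field, value_field):
    """Average value_field for each (x, y) combination (illustrative sketch)."""
    sums = defaultdict(lambda: [0.0, 0])  # (x, y) -> [running total, row count]
    for row in rows:
        acc = sums[(row[x_field], row[y_field])]
        acc[0] += row[value_field]
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Hypothetical rows; the values are made up for the example.
rows = [
    {"When__Time_of_day": "Day", "Wildlife__Size": "Small", "Cost__Total_$": 100},
    {"When__Time_of_day": "Day", "Wildlife__Size": "Small", "Cost__Total_$": 300},
    {"When__Time_of_day": "Night", "Wildlife__Size": "Large", "Cost__Total_$": 1000},
]
sizes = mean_by_pair(rows, "When__Time_of_day", "Wildlife__Size", "Cost__Total_$")
```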

[Image: vegalite_ui]

(Please correct me if I'm wrong.)

@domoritz
Member Author

I think the decision whether ordinal or quantitative depends on the number of bins.

@domoritz
Member Author

> If you really mean having two Os, having "other Q" can make sense.
> For example, you can see some relationship between When: Time of Day and Wildlife Size and avg Cost Total in point.x-When__Time_of_day-O.y-Wildlife__Size-O.size-avg_Cost__Total_$-Q

The point I made is only valid for binning (which makes it O), not for O in general.

@kanitw
Member

kanitw commented Dec 13, 2014

True, I think ordinal might make sense with some binning methods (e.g., binning ordinal/string values with some hashing function).

However, given that we currently only do uniform binning for quantitative fields with size ≤ 20, I think it only makes sense if we don't hide empty bins?

@domoritz
Member Author

> I think it only makes sense if we don't hide empty bins?

Agreed. What I meant is the way we show the ticks on the axis. If we only have 5 bins, we can label each with its range, such as 0-10, 10-20, 20-30 and 30-40. Otherwise, we just show ticks like we do for Q.
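A possible shape for that rule, as a sketch only: with few bins, emit one range label per bin; with more, fall back to plain tick values at the bin edges. The function name `bin_axis_labels` and the threshold parameter `max_range_labels` are assumptions, with the default of 5 taken from the example above.

```python
def bin_axis_labels(start, step, n_bins, max_range_labels=5):
    """Choose range labels for few bins, edge ticks otherwise (illustrative sketch)."""
    edges = [start + i * step for i in range(n_bins + 1)]
    if n_bins <= max_range_labels:
        # One label per bin, e.g. "0-10".
        return [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]
    # Too many bins for range labels: one tick per bin edge, as for Q.
    return [str(edge) for edge in edges]
```

For example, `bin_axis_labels(0, 10, 4)` gives the four range labels listed above, while `bin_axis_labels(0, 10, 6)` gives seven plain ticks from 0 to 60.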

@kanitw
Member

kanitw commented Jan 2, 2015

I guess we’re done with this question. Please reopen if you disagree.

@kanitw kanitw closed this as completed Jan 2, 2015