Support pre-binned data #2912

kanitw · 2017-08-22T20:47:41Z

From an earlier conversation with @domoritz, supporting pre-binned data would be useful for connecting to databases. I think this would be useful for @leibatt as well.

From the conversation, the tricky part is how to know the bin step.

I wonder if we should just let users input the step in such case?

kanitw · 2017-08-22T20:50:26Z

Relevant to tracking binned data -- #2862.

kanitw · 2017-08-22T23:27:57Z

It looks like we can make the prebin version of bin take the following parameters:

- prebin: always true for this mode
- step: step size
- endField: field name for the bin-end
- start: bin start (optional) -- we can infer this from the min value of the field
- end: bin end (optional) -- we can infer this from the max value of the field + step

Either step or endField must be specified so we can infer step size.
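A minimal Python sketch of the inference rules listed above, for illustration only. The function name `infer_bin_params` and its signature are hypothetical (not part of Vega-Lite), and the bins are assumed to be uniform:

```python
from typing import Optional, Sequence, Tuple


def infer_bin_params(
    starts: Sequence[float],
    ends: Optional[Sequence[float]] = None,
    step: Optional[float] = None,
) -> Tuple[float, float, float]:
    """Return (start, end, step) for uniformly pre-binned data."""
    if not starts:
        raise ValueError("need at least one bin")
    if step is None:
        if ends is None:
            raise ValueError("either step or the bin-end values must be given")
        # With an end field, the step is the width of any bin (assumed uniform).
        step = ends[0] - starts[0]
    start = min(starts)        # bin start: min value of the field
    end = max(starts) + step   # bin end: max value of the field + step
    return start, end, step
```

For example, `infer_bin_params([12, 10, 8], ends=[14, 12, 10])` infers a step of 2 from the first bin and an overall extent of 8 to 14.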

kanitw · 2017-09-13T22:23:33Z

For now, one workaround is to just use type: "ordinal".

Just be aware that you a) don’t get empty bins and b) don’t get the nice axis labels
(copied from @domoritz's comment on slack)

{
    "data": {
        "values": [
            {
                "bin": 0.0,
                "count": 28
            },
            {
                "bin": 0.1,
                "count": 55
            },
            {
                "bin": 0.2,
                "count": 43
            },
            {
                "bin": 0.3,
                "count": 91
            },
            {
                "bin": 0.4,
                "count": 81
            },
            {
                "bin": 0.5,
                "count": 53
            },
            {
                "bin": 0.6,
                "count": 19
            },
            {
                "bin": 0.7,
                "count": 87
            },
            {
                "bin": 0.8,
                "count": 52
            }
        ]
    },
    "description": "A simple bar chart with embedded data.",
    "mark": "bar",
    "encoding": {
        "x": {
            "field": "bin",
            "type": "ordinal"
        },
        "y": {
            "field": "count",
            "type": "quantitative"
        },
        "color": {
            "value": "#007AFF"
        }
    },
    "width": 600,
    "height": 450
}
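One way around caveat (a) is to fill in the missing bins before embedding the data in the spec. A minimal Python sketch, assuming uniformly spaced bins with a known step and the `bin` / `count` field names from the example above; `fill_empty_bins` is a hypothetical helper, not part of Vega-Lite:

```python
from typing import Dict, List


def fill_empty_bins(
    rows: List[Dict[str, float]], step: float, digits: int = 6
) -> List[Dict[str, float]]:
    """Insert zero-count rows for bins that are missing from the data."""
    # Round the keys so that float noise (e.g. 0.30000000000000004) still matches.
    counts = {round(r["bin"], digits): r["count"] for r in rows}
    first = min(counts)
    n_bins = round((max(counts) - first) / step) + 1
    filled = []
    for i in range(n_bins):
        b = round(first + i * step, digits)
        filled.append({"bin": b, "count": counts.get(b, 0)})
    return filled
```

The result can be used as `data.values` with the ordinal x encoding above, so empty bins still get a slot on the axis.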

kanitw · 2017-09-13T22:31:09Z

To make sure the scale includes missing bins, you can also use quantitative, but you need to manually set the scale domain to be a bit larger to make sure the bars fit within the plot:

{
    "data": {
        "values": [
            {
                "bin": 0.0,
                "count": 28
            },
            {
                "bin": 0.1,
                "count": 55
            },
            {
                "bin": 0.2,
                "count": 43
            },
            {
                "bin": 0.3,
                "count": 91
            },
            {
                "bin": 0.4,
                "count": 81
            },
            {
                "bin": 0.5,
                "count": 53
            },
            {
                "bin": 0.6,
                "count": 19
            },
            {
                "bin": 0.7,
                "count": 87
            },
            {
                "bin": 0.8,
                "count": 52
            }
        ]
    },
    "description": "A simple bar chart with embedded data.",
    "mark": "bar",
    "encoding": {
        "x": {
            "field": "bin",
            "type": "quantitative",
            "scale": {"domain": [-0.1, 0.9]}
        },
        "y": {
            "field": "count",
            "type": "quantitative"
        },
        "color": {
            "value": "#007AFF"
        },
        "size": {
            "value": 10
        }
    }
}

Without a manual scale domain, the bars may exceed the plotting area.
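The manual domain in the spec above ([-0.1, 0.9] for bins 0.0 to 0.8) is the data extent padded by one step on each side. A small Python sketch of that arithmetic; `padded_domain` is a hypothetical helper, and padding by exactly one step is an assumption taken from this example:

```python
from typing import List, Sequence


def padded_domain(
    bin_values: Sequence[float], step: float, digits: int = 10
) -> List[float]:
    """Pad the extent of the bin values by one step on each side."""
    lo = min(bin_values) - step
    hi = max(bin_values) + step
    # Round to hide float noise from the subtraction and addition.
    return [round(lo, digits), round(hi, digits)]
```

The returned pair can be placed in `scale.domain` of the quantitative x encoding.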

kanitw · 2017-10-11T16:56:04Z

Given pre-binned data, we may also want to think about supporting the case where the bins are non-uniform.

domoritz · 2017-10-11T17:48:40Z

There are two strategies: ask users to provide the start, end, and step, or ask users to give us the fields that hold the bin boundaries and compute the steps from those. The latter allows non-uniform bins.
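To illustrate the second strategy, here is a Python sketch that computes per-bin steps from (start, end) boundary pairs and checks whether they are uniform. The helper names `bin_widths` and `is_uniform` are hypothetical:

```python
from typing import List, Sequence, Tuple


def bin_widths(bins: Sequence[Tuple[float, float]]) -> List[float]:
    """Compute the step of each bin from its (start, end) boundaries."""
    widths = []
    for start, end in bins:
        if end <= start:
            raise ValueError(f"bin end {end} must be greater than start {start}")
        widths.append(end - start)
    return widths


def is_uniform(bins: Sequence[Tuple[float, float]], tol: float = 1e-9) -> bool:
    """True if all bins have the same width, within a tolerance."""
    widths = bin_widths(bins)
    return all(abs(w - widths[0]) <= tol for w in widths)
```

With boundaries available, a single step only needs to be reported when `is_uniform` holds; otherwise each bin keeps its own width.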

kanitw · 2018-01-17T01:03:36Z

We should also make format work with bin-range for pre-binned data (formerly "BinTransform does not support rangeFormula" #2368, but I think it's just a special case of this issue).

sirahd · 2018-05-07T23:16:42Z

I think this is the only required property to have in prebin (as a new property in each encoding):

endField

The reason is that if users have a step and a startField in their data, they can generate endField using CalculateNode. The startField should be whatever field is specified in the encoding.

start and end could be inferred from the min and max of the data.
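As a sketch of deriving the end field from a start field plus a fixed step: the first helper does it on the rows directly, the second builds a Vega-Lite calculate transform that does the same at render time. Both helper names and the default `bin_end` field name are hypothetical:

```python
import json
from typing import Any, Dict, List


def add_end_field(
    rows: List[Dict[str, Any]], start_field: str, step: float, end_field: str = "bin_end"
) -> List[Dict[str, Any]]:
    """Return copies of the rows with end_field = start_field + step."""
    return [{**row, end_field: row[start_field] + step} for row in rows]


def calculate_end_transform(
    start_field: str, step: float, end_field: str = "bin_end"
) -> Dict[str, str]:
    """Build a calculate transform that derives the end field in the spec."""
    # Bracket access keeps the expression valid for field names with special characters.
    expr = f"datum[{json.dumps(start_field)}] + {step}"
    return {"calculate": expr, "as": end_field}
```

The transform dictionary would go into the spec's top-level `transform` array.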

kanitw · 2018-05-08T02:07:40Z

I think we need to support two use cases without an additional calculate:
a) start field + end field
b) start field + fixed step

Thinking more, I would like to avoid adding an endField to the bin property if possible, as it makes the encoding object inconsistent. Basically, one key-value pair in the encoding represents one mapping between a visual channel and a data field. Adding endField to bin's prebin object breaks this core design assumption.

Alternatively, we could think about pre-binned data in two cases:

For position encoding (x and y):

If there is a start field and a fixed step size, I think the logic should be similar to the existing logic. (Thus, supporting bin: {prebin: true, step: ...} is probably sensible.)
If there are start and end fields, users can already use x and x2 to encode each field. We just need to understand what we do differently with the scale and axis for a binned field and see if we can infer / allow users to do similar customization.

For non-position encoding (e.g., color, size):

If there is a start field and a fixed step size, I think the logic should be similar to the existing logic -- we just need to make sure the legend labels appear reasonable. (Thus, supporting bin: {prebin: true, step: ...} or an equivalent syntax is probably sensible.)
If there are start and end fields, this won't be ideal, but users can still derive a new range field by concatenating the two fields and treat the new field as ordinal.
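A Python sketch of the concatenation workaround for non-position channels. The helper name `add_range_label`, the `bin_range` field name, and the " - " separator are all assumptions for illustration:

```python
from typing import Any, Dict, List


def add_range_label(
    rows: List[Dict[str, Any]],
    start_field: str = "bin_start",
    end_field: str = "bin_end",
    label_field: str = "bin_range",
) -> List[Dict[str, Any]]:
    """Add a 'start - end' label to each row, ordered by bin start."""
    ordered = sorted(rows, key=lambda r: r[start_field])
    return [
        {**r, label_field: f"{r[start_field]} - {r[end_field]}"} for r in ordered
    ]
```

Because the labels are strings, "10 - 12" sorts before "8 - 10" lexically, so the ordinal field likely needs an explicit sort order taken from the numeric bin starts.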

For the bin: {prebin: true, step: ...} syntax, I also wonder if this should instead be a special property of the corresponding scale.

@sirahd Anyway, it would help move our conversation forward if you can summarize what a binned field does differently for scales and axes of position channels as well as scales and legends of non-position channels.

sirahd · 2018-05-14T19:56:16Z

For position channels,

axis:
- tickCount defaults to no more than maxbins (I assume 20?) -- this won't apply to either of the pre-binned cases.
- grid defaults to false for any binned field -- should be the same for prebinned
- values uses the start and stop bin signals to generate explicit tick values on a quantitative binned field
scale:
- type defaults to linear for position channels
- nice defaults to false for any binned field
- domain uses the specified bin extent if possible; otherwise it uses the start and end fields from the transform

For non-position channels,

scale:
- type defaults to bin-ordinal for the color channel and bin-linear for other non-position channels
- nice defaults to false for any binned field
- domain uses the start and stop bin signals as the domain
legend:
- type defaults to symbol for a binned quantitative field

@kanitw Please feel free to add anything that I might've missed here!

kanitw · 2018-05-14T22:53:02Z

It seems like the only thing we need is a new property that lets the axis know the step size, and this property should affect the axis tickCount and values. I suggest that we name this tickStep.

Basically, tickStep would modify tickCount and values.

(Note that I cheat by using signals below -- they're not officially supported in VL.)

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {"values": [{"bin_start":12,"bin_end":14,"count_*":71},{"bin_start":10,"bin_end":12,"count_*":29},{"bin_start":8,"bin_end":10,"count_*":7},{"bin_start":16,"bin_end":18,"count_*":94},{"bin_start":14,"bin_end":16,"count_*":127},{"bin_start":20,"bin_end":22,"count_*":17},{"bin_start":18,"bin_end":20,"count_*":54},{"bin_start":22,"bin_end":24,"count_*":5},{"bin_start":24,"bin_end":26,"count_*":2}]},
  "mark": "rect",
  "encoding": {
    "x": {
      "field": "bin_start",
      "type": "quantitative",
      "scale": {
        "zero": false
      },
      "axis": {
        "grid": false,
        "tickCount": {
          "signal": "(domain('x')[1] - domain('x')[0]) / 2 + 1"
          
        },
        "values": {
          "signal": "sequence(domain('x')[0],domain('x')[1] + 2, 2)"
        }
      }
    },
    "x2": {
      "field": "bin_end",
      "type": "quantitative"
    },
    "y2": {
      "field": "count_*",
      "type": "quantitative"
    },
    "y": {
      "value": 0
    }
  }
}
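The two signal expressions in the spec above amount to placing one tick on every bin boundary. A Python sketch of the same arithmetic, assuming the x domain ends up as [8, 26] for this data and the step is 2; `bin_ticks` is a hypothetical helper:

```python
from typing import List, Sequence


def bin_ticks(domain: Sequence[float], step: float) -> List[float]:
    """Return one tick per bin boundary across the scale domain."""
    lo, hi = domain
    # Same count as the tickCount signal: (hi - lo) / step + 1.
    count = round((hi - lo) / step) + 1
    # Same values as sequence(lo, hi + step, step).
    return [lo + i * step for i in range(count)]
```

For the domain [8, 26] this yields ten ticks, from 8 to 26 in steps of 2.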

With this new tickStep thing, our "prebin" syntax could simply be bin: "prebin", which is basically a shortcut for four things:

scale.zero = false
scale.nice = false
axis.grid = false
apply x/yOffset to the bar/rect along the binned dimension based on config.bar.binSpacing
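A Python sketch of how such a shortcut might expand into explicit scale and axis properties. This only illustrates the proposal: `expand_prebin` and `BINNED_DEFAULTS` are hypothetical, explicit user settings are assumed to win over the defaults, and the mark offset is left out because it depends on config.bar.binSpacing:

```python
import copy
from typing import Any, Dict

# Defaults implied by the proposed bin: "prebin" shortcut (offset not modeled here).
BINNED_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "scale": {"zero": False, "nice": False},
    "axis": {"grid": False},
}


def expand_prebin(channel_def: Dict[str, Any]) -> Dict[str, Any]:
    """Replace bin: 'prebin' with the explicit scale/axis defaults it stands for."""
    if channel_def.get("bin") != "prebin":
        return copy.deepcopy(channel_def)
    out = {k: copy.deepcopy(v) for k, v in channel_def.items() if k != "bin"}
    for prop, defaults in BINNED_DEFAULTS.items():
        # Explicit user settings take precedence over the shortcut's defaults.
        out[prop] = {**defaults, **out.get(prop, {})}
    return out
```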

kanitw · 2018-05-15T18:24:51Z

@sirahd I corrected the comment above to include scale.nice = false

kanitw · 2018-05-20T19:28:26Z

@sirahd it might be better to have tickStep affect values only (and note in the docs that both tickStep and values are affected by tickCount)

The rationale is that we need to make tickStep behave like bin’s step on the tick in order to allow extraction of bin from encoding while preserving the same behavior.

sirahd · 2018-05-21T03:07:03Z

@kanitw I think we still need to override tickCount; otherwise vega-lite will default to width / 40, which may or may not match the number of bin steps.

kanitw · 2018-05-21T03:43:13Z

Yeah, maybe you're right. In any case, we should make sure that the extracting-bin-to-transform case can still produce 100% identical output. Thus, maybe we should merge any progress on this topic to a feature branch instead of master and then merge it all to master later. :)

sirahd · 2018-05-23T18:59:41Z

After a long discussion, we decided that we'll add binned to scale's type property. Here is the rationale:

- Since binned will only affect properties on scale, as well as axis and legend (both of which are visualizations of scale), it is more proper to put it under scale, rather than fieldDef's bin property, which implies data transformation.
- We already have bin-linear and bin-ordinal in scale's type, but bin-linear is only for non-position channels; binned fields on x/y still use linear. Adding binned suggests that the data is already binned prior to vega-lite, without complecting it with the existing scale type system.
- It is easier to implement binned in scale than in fieldDef, since all existing logic for bin in fieldDef assumes that bin is either a boolean or binParams.

kanitw · 2018-05-24T05:48:30Z

> binned will only affect properties on scale, as well as axis and legend

It will actually affect offset too, but we can argue that's also a part of how data gets converted to visual values (a scale = function from data domain to visual values).

domoritz · 2018-05-25T23:15:47Z

When we use a point mark, we only have one field. How would one create https://vega.github.io/editor/#/examples/vega-lite/circle_binned but with prebinned data?

kanitw · 2018-05-26T05:50:26Z

> When we use a point mark, we only have one field. How would one create https://vega.github.io/editor/#/examples/vega-lite/circle_binned but with prebinned data?

In this case we want to encode the point position to be bin_mid, but set the scale domain to combine bin_start and bin_end. Thus, I think we should extend scale.domain to support fields. I'm adding a new issue in #3818.
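A Python sketch of the data side of this idea: compute a bin_mid field for the point position and a domain that spans both boundary fields. The helper name `midpoints_and_domain` is hypothetical, and how the domain gets attached to the scale is left to #3818:

```python
from typing import Any, Dict, List, Tuple


def midpoints_and_domain(
    rows: List[Dict[str, Any]],
    start_field: str = "bin_start",
    end_field: str = "bin_end",
) -> Tuple[List[Dict[str, Any]], List[float]]:
    """Add bin_mid to each row and return the union domain of start and end."""
    mids = [{**r, "bin_mid": (r[start_field] + r[end_field]) / 2} for r in rows]
    domain = [
        min(r[start_field] for r in rows),
        max(r[end_field] for r in rows),
    ]
    return mids, domain
```

Until field-based domains are supported, the returned pair could be set manually as `scale.domain` on the channel that encodes bin_mid.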

domoritz · 2018-05-26T06:16:10Z

@sirahd @jakevdp @jheer @arvind Please vote for your favorite! We are trying to decide what the syntax should be if you already have binned data and want to render it but still have nice axes and legends. The dataset already has bin_start and bin_end in it.

In the specs below, replace ???

{
  "data": {"url": "binned_data.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      ???,
      "field": "bin_start",
      "type": "quantitative"
    },
    "x2": {
      "field": "bin_end",
      "type": "quantitative"
    },
    "y": {
      "aggregate": "count",
      "type": "quantitative"
    }
  }
}

Replace with:

1. "bin": "prebin/binned"
2. "scale": {"binned": true}
3. "binned/prebin": true

kanitw · 2018-05-26T06:29:30Z

I thought more about this. I think both 2. and 3. require adding one more property, and thus are less discoverable. (People already know about bin.)

"bin": "prebin" reads a bit confusing because bin is normally a transform (telling the system to bin it for users). However, in this case, we want an annotation to tell the system that this is previously binned so the system should not bin again, but still generate the scale/axis in a coherent fashion with data that we bin inside the system. The term "prebin" isn't very clear about this.

I think "bin": "binned" is more explicit that this is different than normal bin: true.

Thus my vote is bin: "binned".
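For concreteness, a Python sketch that builds the spec from the vote with this option filled in. It mirrors the x / x2 encoding of the spec above but reads y from a pre-aggregated count field, which is an assumption; `binned_bar_spec` and the `count` field name are hypothetical:

```python
import json
from typing import Any, Dict, List


def binned_bar_spec(
    values: List[Dict[str, Any]],
    start_field: str = "bin_start",
    end_field: str = "bin_end",
    count_field: str = "count",
) -> Dict[str, Any]:
    """Build a bar chart spec for pre-binned data using bin: "binned"."""
    return {
        "data": {"values": values},
        "mark": "bar",
        "encoding": {
            # bin: "binned" marks the field as already binned; no bin transform is requested.
            "x": {"bin": "binned", "field": start_field, "type": "quantitative"},
            "x2": {"field": end_field, "type": "quantitative"},
            "y": {"field": count_field, "type": "quantitative"},
        },
    }


if __name__ == "__main__":
    demo = binned_bar_spec([{"bin_start": 8, "bin_end": 10, "count": 7}])
    print(json.dumps(demo, indent=2))
```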

kanitw · 2018-07-03T16:49:19Z

Fixed in #3937

kanitw added this to the 2.1? Important Patches milestone Aug 22, 2017

kanitw added the help-wanted label Sep 13, 2017

kanitw mentioned this issue Sep 15, 2017

Pad continuous scales for bar #2988

Closed

domoritz mentioned this issue Sep 20, 2017

Track transformed fields (e.g. bin / timeUnit) #2862

Open

kanitw modified the milestones: 2.x? Important Patches, 2.x Data Transforms Sep 22, 2017

kanitw assigned sirahd Dec 5, 2017

kanitw mentioned this issue Jan 17, 2018

BinTransform does not support rangeFormula #2368

Closed

abanh206 self-assigned this Feb 24, 2018

kanitw mentioned this issue May 26, 2018

Support setting fields for scale domain #3818

Open

kanitw unassigned abanh206 May 26, 2018

sirahd mentioned this issue Jun 27, 2018

Sh/binned #3937

Merged

kanitw removed the 🙏 Help wanted label Jun 29, 2018

domoritz added the 🙋 Feature Request label Jun 29, 2018

kanitw modified the milestones: 2.x Data & Transforms Patches, 3.0 Jun 30, 2018

kanitw pushed a commit that referenced this issue Jul 3, 2018

Add bin: "binned" support for data already binned

5a94fa4

- implements #2912 - I'll add examples after code change is approved

kanitw closed this as completed Jul 3, 2018
