Support pre-binned data #2912

Closed
kanitw opened this issue Aug 22, 2017 · 22 comments

@kanitw
Member

kanitw commented Aug 22, 2017

From an earlier conversation with @domoritz, supporting pre-binned data would be useful for connecting to a database. I think this would be useful for @leibatt as well.

From the conversation, the tricky part is how to know the bin step.

I wonder if we should just let users input the step in such cases?

@kanitw kanitw added this to the 2.1? Important Patches milestone Aug 22, 2017
@kanitw
Member Author

kanitw commented Aug 22, 2017

Relevant to tracking binned data -- #2862.

@kanitw
Member Author

kanitw commented Aug 22, 2017

It looks like we can make the prebin version of bin take the following parameters

  • prebin: always true for this mode
  • step: step size
  • endField: field name for the bin-end
  • start: bin start (optional) -- we can infer this from the min value of the field
  • end: bin end (optional) -- we can infer this from the max value of the field + step

Either step or endField must be specified so we can infer step size.
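
To make the proposal above concrete, here is a rough sketch of what such a spec fragment might look like. This is only an illustration of the parameters listed in this comment (`prebin`, `step`), not a syntax Vega-Lite is known to accept; the field names `bin_start` and `count` are placeholders.

```json
{
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "bin_start",
      "type": "quantitative",
      "bin": {"prebin": true, "step": 2}
    },
    "y": {"field": "count", "type": "quantitative"}
  }
}
```

Under this proposal, `start` and `end` are omitted because they could be inferred from the data, and `"endField": "bin_end"` could stand in for `step` when the data carries an explicit bin-end column.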

@kanitw
Member Author

kanitw commented Sep 13, 2017

For now, one workaround is to just use type: "ordinal".

Just be aware that you a) don’t get empty bins and b) don’t get the nice axis labels
(copied from @domoritz's comment on slack)

{
    "data": {
        "values": [
            {
                "bin": 0.0,
                "count": 28
            },
            {
                "bin": 0.1,
                "count": 55
            },
            {
                "bin": 0.2,
                "count": 43
            },
            {
                "bin": 0.3,
                "count": 91
            },
            {
                "bin": 0.4,
                "count": 81
            },
            {
                "bin": 0.5,
                "count": 53
            },
            {
                "bin": 0.6,
                "count": 19
            },
            {
                "bin": 0.7,
                "count": 87
            },
            {
                "bin": 0.8,
                "count": 52
            }
        ]
    },
    "description": "A simple bar chart with embedded data.",
    "mark": "bar",
    "encoding": {
        "x": {
            "field": "bin",
            "type": "ordinal"
        },
        "y": {
            "field": "count",
            "type": "quantitative"
        },
        "color": {
            "value": "#007AFF"
        }
    },
    "width": 600,
    "height": 450
}

image

@kanitw
Member Author

kanitw commented Sep 13, 2017

To make sure the scale includes missing bins, you can also use quantitative, but you need to manually set the scale domain to be a bit larger so that the bars fit within the plot:

{
    "data": {
        "values": [
            {
                "bin": 0.0,
                "count": 28
            },
            {
                "bin": 0.1,
                "count": 55
            },
            {
                "bin": 0.2,
                "count": 43
            },
            {
                "bin": 0.3,
                "count": 91
            },
            {
                "bin": 0.4,
                "count": 81
            },
            {
                "bin": 0.5,
                "count": 53
            },
            {
                "bin": 0.6,
                "count": 19
            },
            {
                "bin": 0.7,
                "count": 87
            },
            {
                "bin": 0.8,
                "count": 52
            }
        ]
    },
    "description": "A simple bar chart with embedded data.",
    "mark": "bar",
    "encoding": {
        "x": {
            "field": "bin",
            "type": "quantitative",
            "scale": {"domain": [-0.1, 0.9]}
        },
        "y": {
            "field": "count",
            "type": "quantitative"
        },
        "color": {
            "value": "#007AFF"
        },
        "size": {
          "value": 10
        }
    }
}

image

Without manual scale domain, the bars may exceed the plotting area:

image

@kanitw
Member Author

kanitw commented Oct 11, 2017

We may also want to think about supporting pre-binned data where the bins are non-uniform.

@domoritz
Member

There are two strategies: ask users to provide the start, end and step, or ask users to give us the fields that are the bin boundaries and compute the steps from those. The latter allows non-uniform bins.

@kanitw
Member Author

kanitw commented Jan 17, 2018

@sirahd
Contributor

sirahd commented May 7, 2018

I think these are required properties to have in prebin (as a new property in each encoding):

  • endField

The reason is that if users have step and startField in their data, they can generate endField using CalculateNode. The startField should be whichever field is specified in the encoding.

start and end could be inferred from min and max of the data
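
As a minimal sketch of the point about deriving endField, a `calculate` transform can add the fixed step to the start field. The field names (`bin_start`, `bin_end`, `count`), the step of 2, and the inline data are assumptions chosen for illustration.

```json
{
  "data": {
    "values": [
      {"bin_start": 8, "count": 7},
      {"bin_start": 10, "count": 29},
      {"bin_start": 12, "count": 71}
    ]
  },
  "transform": [
    {"calculate": "datum.bin_start + 2", "as": "bin_end"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "bin_start", "type": "quantitative"},
    "x2": {"field": "bin_end", "type": "quantitative"},
    "y": {"field": "count", "type": "quantitative"}
  }
}
```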

@kanitw
Member Author

kanitw commented May 8, 2018

I think we need to support two use cases without additional calculate
a) start field + end field
b) start field + fixed step


Thinking more, I would like to avoid adding an endField to the bin property if possible, as it makes the encoding object inconsistent. Basically, one key-value pair in the encoding represents one mapping between a visual channel and a data field. Adding endField to bin's prebin object breaks this core design assumption.

Alternatively, we could think about pre-binned data in two cases:

  1. For position encoding (x and y):
  • If there is a start field and a fixed step size, I think the logic should be similar to existing logic. (Thus, supporting bin: {prebin: true, step: ...} is probably sensible.)
  • If there are start and end fields, users can already use x and x2 to encode each field. We just need to understand what we do differently with the scale and axis for a binned field and see if we can infer it or allow users to do similar customization.
  2. For non-position encoding (e.g., color, size):
  • If there is a start field and a fixed step size, I think the logic should be similar to existing logic -- we just need to make sure the legend labels appear reasonably. (Thus, supporting bin: {prebin: true, step: ...} or an equivalent syntax is probably sensible.)
  • If there are start and end fields, this won't be ideal, but they can still derive a new range field by concatenating the two fields and treat the new field as ordinal.
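
The concatenation workaround for non-position channels could be sketched roughly as follows. The derived field name `bin_range`, the separator, and the inline data are assumptions for illustration only.

```json
{
  "data": {
    "values": [
      {"bin_start": 10, "bin_end": 12, "count": 29},
      {"bin_start": 12, "bin_end": 14, "count": 71},
      {"bin_start": 14, "bin_end": 16, "count": 127}
    ]
  },
  "transform": [
    {"calculate": "datum.bin_start + ' - ' + datum.bin_end", "as": "bin_range"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "bin_start", "type": "quantitative", "scale": {"zero": false}},
    "x2": {"field": "bin_end", "type": "quantitative"},
    "y": {"field": "count", "type": "quantitative"},
    "color": {"field": "bin_range", "type": "ordinal"}
  }
}
```

Because the derived labels are strings, their order is lexicographic, so an explicit sort on the start field may be needed when the bin boundaries have different numbers of digits.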

For the bin: {prebin: true, step: ...} syntax, I also wonder if this should be a special property of the corresponding scale instead as well.

@sirahd Anyway, it would help move our conversation forward if you could summarize what a binned field does differently for scales and axes of position channels, as well as scales and legends of non-position channels.

@sirahd
Contributor

sirahd commented May 14, 2018

For position channels,

  • axis:
    • tickCount defaults to no more than maxbins (I assume 20?) -- this won't apply to either of the pre-binned cases.
    • grid defaults to false for any binned field -- should be the same for prebinned
    • values uses the start and stop bin signals to generate explicit tick values on a quantitative binned field
  • scale:
    • type defaults to linear for positional channels
    • nice defaults to false for any binned field
    • domain uses the specified bin extent if possible; otherwise it uses the start and end fields from the transform

For non-position channels,

  • scale:
    • type defaults to bin-ordinal for the color channel and bin-linear for other non-position channels
    • nice defaults to false for any binned field
    • domain uses the start and stop bin signals as the domain
  • legend:
    • type defaults to symbol for a binned quantitative field

@kanitw Please feel free to add anything that I might've missed here!

@kanitw
Member Author

kanitw commented May 14, 2018

It seems like the only thing we need is a new property for letting the axis know the step size, and this property should affect the axis tickCount and values. I suggest that we name this tickStep.

Basically, tickStep would modify tickCount and values.

(Note that I cheat by using a signal down here -- it's not really officially supported in VL)

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {"values": [{"bin_start":12,"bin_end":14,"count_*":71},{"bin_start":10,"bin_end":12,"count_*":29},{"bin_start":8,"bin_end":10,"count_*":7},{"bin_start":16,"bin_end":18,"count_*":94},{"bin_start":14,"bin_end":16,"count_*":127},{"bin_start":20,"bin_end":22,"count_*":17},{"bin_start":18,"bin_end":20,"count_*":54},{"bin_start":22,"bin_end":24,"count_*":5},{"bin_start":24,"bin_end":26,"count_*":2}]},
  "mark": "rect",
  "encoding": {
    "x": {
      "field": "bin_start",
      "type": "quantitative",
      "scale": {
        "zero": false
      },
      "axis": {
        "grid": false,
        "tickCount": {
          "signal": "(domain('x')[1] - domain('x')[0]) / 2 + 1"
          
        },
        "values": {
          "signal": "sequence(domain('x')[0],domain('x')[1] + 2, 2)"
        }
      }
    },
    "x2": {
      "field": "bin_end",
      "type": "quantitative"
    },
    "y2": {
      "field": "count_*",
      "type": "quantitative"
    },
    "y": {
      "value": 0
    }
  }
}

image

With this new tickStep thing, our "prebin" syntax could simply be bin: "prebin", which is basically a shortcut to perform the following:

  1. scale.zero = false
  2. scale.nice = false
  3. axis.grid = false
  4. apply x/yOffset to the bar/rect
  5. apply x/yOffset to the binned dimension based on config.bar.binSpacing
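
For comparison, the scale and axis parts of that shortcut can be spelled out by hand with properties that already exist, using explicit tick values instead of a signal. This is only a sketch: the data is a subset of the example above with `count_*` renamed to `count`, and the offset items are not reproduced here.

```json
{
  "data": {
    "values": [
      {"bin_start": 8, "bin_end": 10, "count": 7},
      {"bin_start": 10, "bin_end": 12, "count": 29},
      {"bin_start": 12, "bin_end": 14, "count": 71},
      {"bin_start": 14, "bin_end": 16, "count": 127}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "bin_start",
      "type": "quantitative",
      "scale": {"zero": false, "nice": false},
      "axis": {"grid": false, "values": [8, 10, 12, 14, 16]}
    },
    "x2": {"field": "bin_end", "type": "quantitative"},
    "y": {"field": "count", "type": "quantitative"}
  }
}
```

The hard-coded `values` array is what a tickStep of 2 would be expected to generate from the domain, which is why a dedicated property is attractive.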

@kanitw
Member Author

kanitw commented May 15, 2018

@sirahd I corrected the comment above to include scale.nice = false

@kanitw
Member Author

kanitw commented May 20, 2018

@sirahd it might be better to have tickStep affect values only (and note in the docs that both tickStep and values are affected by tickCount).

The rationale is that we need to make tickStep behave like bin’s step on the tick in order to allow extraction of bin from encoding while preserving the same behavior.

@sirahd
Contributor

sirahd commented May 21, 2018

@kanitw I think we still need to override tickCount; otherwise vega-lite will default to width / 40, which may or may not match the number of steps implied by the bin step.

@kanitw
Member Author

kanitw commented May 21, 2018

Yeah, maybe you're right. In any case, we should make sure that extracting bin to a transform still produces 100% identical output. Thus, maybe we should merge any progress on this topic to a feature branch instead of master and then merge all of it to master later. :)

@sirahd
Contributor

sirahd commented May 23, 2018

After a long discussion, we decided that we'll add binned to scale's type property. Here is the rationale behind it:

  • Since binned will only affect properties on scale, as well as axis and legend (both of which are visualizations of scale), it is more appropriate to put it under scale rather than fieldDef's bin property, which implies data transformation.
  • We already have bin-linear and bin-ordinal in scale's type, but bin-linear is only for non-position channels. A binned field on x/y still uses linear. Adding binned suggests that the data is already binned prior to vega-lite, without entangling it with the existing scale type system.
  • It is easier to implement binned in scale than in fieldDef, since all existing logic for bin in fieldDef assumes that bin is either a boolean or binParams.

@kanitw
Member Author

kanitw commented May 24, 2018

binned will only affect properties on scale, as well as axis and legend

It will actually affect offset too, but we can argue that's also a part of how data gets converted to visual values (a scale = function from data domain to visual values).

@domoritz
Member

When we use a point mark, we only have one field. How would one create https://vega.github.io/editor/#/examples/vega-lite/circle_binned but with prebinned data?

@kanitw
Member Author

kanitw commented May 26, 2018

When we use a point mark, we only have one field. How would one create https://vega.github.io/editor/#/examples/vega-lite/circle_binned but with prebinned data?

In this case we want to encode the point position to be bin_mid, but set the scale domain to combine bin_start and bin_end. Thus, I think we should extend scale.domain to support fields. I'm adding a new issue in #3818.

@domoritz
Member

domoritz commented May 26, 2018

@sirahd @jakevdp @jheer @arvind Please vote for your favorite! We are trying to decide what the syntax should be if you already have binned data and want to render it but still have nice axes and legends. The dataset already has bin_start and bin_end in it.

In the specs below, replace ???

{
  "data": {"url": "binned_data.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      ???,
      "field": "bin_start",
      "type": "quantitative"
    },
    "x2": {
      "field": "bin_end",
      "type": "quantitative"
    },
    "y": {
      "aggregate": "count",
      "type": "quantitative"
    }
  }
}

Replace with:

  1. "bin": "prebin/binned"
  2. "scale": {"binned": true}
  3. "binned/prebin": true

@kanitw
Member Author

kanitw commented May 26, 2018

I thought more about this. I think both 2. and 3. require adding one more property, and thus are less discoverable. (People already know about bin.)

"bin": "prebin" reads a bit confusing because bin is normally a transform (telling the system to bin it for users). However, in this case, we want an annotation to tell the system that this is previously binned so the system should not bin again, but still generate the scale/axis in a coherent fashion with data that we bin inside the system. The term "prebin" isn't very clear about this.

I think "bin": "binned" is more explicit that this is different than normal bin: true.

Thus my vote is bin: "binned".
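
For reference, filling the ??? placeholder in the spec above with this option would look roughly like the sketch below. It reflects the syntax as proposed in this thread; what was eventually merged or released may differ, so check the current documentation. The `count` field on y is an assumption that the pre-binned file carries a precomputed count column.

```json
{
  "data": {"url": "binned_data.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      "bin": "binned",
      "field": "bin_start",
      "type": "quantitative"
    },
    "x2": {
      "field": "bin_end",
      "type": "quantitative"
    },
    "y": {
      "field": "count",
      "type": "quantitative"
    }
  }
}
```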

@sirahd sirahd mentioned this issue Jun 27, 2018
kanitw pushed a commit that referenced this issue Jul 3, 2018
- implements #2912 
- I'll add examples after code change is approved
@kanitw
Member Author

kanitw commented Jul 3, 2018

Fixed in #3937

@kanitw kanitw closed this as completed Jul 3, 2018