Some value of maxbins can cause scale to incorrectly calculate domain #468

Closed
kanitw opened this issue Dec 6, 2015 · 2 comments

kanitw commented Dec 6, 2015

The following spec for a binned scatterplot is broken. However, if we increase Displacement’s maxbins from 10 to 15, it works.

I think I’m pretty close to finding the root cause, but I’ve reached the point where it will be more productive for you guys to figure it out :) Here are the spec and my debug log:

{
  "width": 200,
  "height": 200,
  "padding": "auto",
  "data": [
    {
      "name": "source",
      "url": "data/cars.json",
      "format": {
        "type": "json",
        "parse": {
          "Displacement": "number",
          "Miles_per_Gallon": "number"
        }
      },
      "transform": [
        {
          "type": "filter",
          "test": "datum.Displacement!==null && datum.Miles_per_Gallon!==null"
        },
        {
          "type": "bin",
          "field": "Displacement",
          "output": {
            "start": "bin_Displacement_start",
            "mid": "bin_Displacement_mid",
            "end": "bin_Displacement_end"
          },
          "maxbins": 10
        },
        {
          "type": "bin",
          "field": "Miles_per_Gallon",
          "output": {
            "start": "bin_Miles_per_Gallon_start",
            "mid": "bin_Miles_per_Gallon_mid",
            "end": "bin_Miles_per_Gallon_end"
          },
          "maxbins": 10
        }
      ]
    },
    {
      "name": "summary",
      "source": "source",
      "transform": [
        {
          "type": "aggregate",
          "groupby": [
            "bin_Displacement_start",
            "bin_Displacement_mid",
            "bin_Displacement_end",
            "bin_Miles_per_Gallon_start",
            "bin_Miles_per_Gallon_mid",
            "bin_Miles_per_Gallon_end"
          ],
          "summarize": {"*": ["count"]}
        }
      ]
    },
    {
      "name": "layout",
      "source": "summary",
      "transform": [
        {
          "type": "aggregate",
          "summarize": [
            {
              "field": "bin_Displacement_start",
              "ops": ["distinct"]
            },
            {
              "field": "bin_Miles_per_Gallon_start",
              "ops": ["distinct"]
            }
          ]
        },
        {
          "type": "formula",
          "field": "cellWidth",
          "expr": "(datum.distinct_bin_Displacement_start + 1) * 21"
        },
        {
          "type": "formula",
          "field": "cellHeight",
          "expr": "(datum.distinct_bin_Miles_per_Gallon_start + 1) * 21"
        }
      ]
    }
  ],
  "marks": [
    {
      "name": "root",
      "type": "group",
      "from": {"data": "layout"},
      "properties": {
        "update": {"width": {"value": 200},"height": {"value": 200}}
      },
      "marks": [
        {
          "type": "symbol",
          "properties": {
            "update": {
              "x": {"scale": "x","field": "bin_Displacement_mid"},
              "y": {
                "scale": "y",
                "field": "bin_Miles_per_Gallon_mid"
              },
              "size": {"scale": "size","field": "count"},
              "shape": {"value": "circle"},
              "stroke": {"value": "#4682b4"},
              "strokeWidth": {"value": 2}
            }
          },
          "from": {"data": "summary"}
        }
      ],
      "scales": [
        {
          "name": "x",
          "type": "linear",
          "domain": {
            "data": "summary",
            "field": [
              "bin_Displacement_start",
              "bin_Displacement_end"
            ]
          },
          "rangeMin": 0,
          "rangeMax": 200,
          "round": true,
          "clamp": true,
          "nice": true,
          "zero": false
        },
        {
          "name": "y",
          "type": "linear",
          "domain": {
            "data": "summary",
            "field": [
              "bin_Miles_per_Gallon_start",
              "bin_Miles_per_Gallon_end"
            ]
          },
          "rangeMin": 200,
          "rangeMax": 0,
          "round": true,
          "clamp": true,
          "nice": true,
          "zero": false
        },
        {
          "name": "size",
          "type": "linear",
          "domain": {"data": "summary","field": "count","sort": true},
          "range": [10,320],
          "round": true,
          "clamp": true,
          "zero": false
        }
      ],
      "axes": [
        {
          "type": "x",
          "scale": "x",
          "format": "s",
          "grid": false,
          "title": "BIN(Displacement)",
          "properties": {
            "labels": {
              "angle": {"value": 270},
              "align": {"value": "right"},
              "baseline": {"value": "middle"}
            }
          }
        },
        {
          "type": "y",
          "scale": "y",
          "format": "s",
          "grid": false,
          "title": "BIN(Miles_per_Gallon)"
        }
      ],
      "legends": [
        {
          "size": "size",
          "title": "Number of Records",
          "properties": {
            "symbols": {
              "stroke": {"value": "#4682b4"},
              "fill": {"value": "transparent"},
              "strokeWidth": {"value": 2}
            }
          }
        }
      ]
    }
  ]
}

When I try to debug, the x scale’s domain comes back as NaN:

> ved.view._model._scene.items[0].items[0].items[0]._scales.x.domain()
[NaN, NaN]

Consider the x scale’s domain definition:

"domain": {
  "data": "summary",
  "field": [
    "bin_Displacement_start",
    "bin_Displacement_end"
  ]
}

I tried inspecting the domain data:

> ved.view.data('summary').values().reduce(function(a, x) { return a.concat([x.bin_Displacement_start, x.bin_Displacement_end]); }, []) 
[300, 400, 400, 500, 100, 200, 100, 200, 200, 300, 0, 100, 200, 300, 0, 100, 0, 100, 100, 200, 0, 100, 300, 400, 200, 300]

There is no NaN value!
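
As a sanity check on that dump, here is a minimal plain JavaScript sketch. The helper name `extent` is hypothetical and is not part of Vega or datalib. It checks that no entry is NaN and computes the minimum and maximum of the dumped values.

```javascript
// Values copied from the console dump above.
var values = [300, 400, 400, 500, 100, 200, 100, 200, 200, 300, 0, 100, 200,
              300, 0, 100, 0, 100, 100, 200, 0, 100, 300, 400, 200, 300];

// Hypothetical helper: returns [min, max] of an array of numbers.
function extent(arr) {
  return arr.reduce(function(e, v) {
    return [Math.min(e[0], v), Math.max(e[1], v)];
  }, [Infinity, -Infinity]);
}

// NaN is the only value that is not equal to itself.
var hasNaN = values.some(function(v) { return v !== v; });
var domain = extent(values);

console.log(hasNaN, domain); // false [ 0, 500 ]
```

So the raw field values alone would give a domain of [0, 500].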

I put a breakpoint on line 223 of Scale.js:

scale.domain(domain);

Looks like this line is evaluated twice for each scale. For x, the first time, it correctly has domain = [0, 500]. However, the second time, it has [Infinity, 500].

However, if I change maxbins to 15, then the same line is evaluated only once and has domain = [50, 500].

I think the bug has to do with the fact that when maxbins is lower (e.g., 10 in this case), the domain contains zero.
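
To illustrate why a domain bound of Infinity is harmful, here is a small sketch in plain JavaScript. The `normalize` function is a hypothetical stand-in for the normalization step of a linear scale. It is an assumption for illustration and is not taken from the Vega or d3 source, and it does not explain where the Infinity comes from.

```javascript
// Hypothetical stand-in for a linear scale's normalization step:
// maps x from the interval [lo, hi] onto [0, 1].
function normalize(x, lo, hi) {
  return (x - lo) / (hi - lo);
}

var good = normalize(100, 0, 500);        // finite domain
var bad = normalize(100, Infinity, 500);  // -Infinity / -Infinity

console.log(good, bad); // 0.2 NaN
```

Once one bound is infinite, ordinary floating-point arithmetic on the domain can produce NaN, which is at least consistent with the [NaN, NaN] reported by domain() above.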

@kanitw kanitw added the bug For bugs or other software errors label Dec 6, 2015

jheer commented Dec 6, 2015

Thanks @kanitw. I took a quick look and can confirm that the binning operator is not producing any invalid values. That means the problem most likely resides with Scale computations or (perhaps) with the underlying datalib aggregator. Maybe @arvind will have some insights regarding the scale.

kanitw added a commit to vega/vega-lite that referenced this issue Dec 6, 2015
@kanitw kanitw mentioned this issue Dec 6, 2015
3 tasks

jheer commented Dec 6, 2015

A couple notes:

  • The data set "layout" is superfluous and can be safely removed. After removal, the reported bug remains.
  • If one removes either "bin_Displacement_start" or "bin_Displacement_end" from the array of x-scale fields, the bug disappears. This appears to localize the problem to scale domain computation over multiple fields.
  • If the padding is set to a constant (e.g., 100), the bug disappears. So the problem is indeed happening on a second run of the scale domain computation (triggered after the auto-padding calculation).
  • Going deeper, if you run the following test code, you discover that a 'null' value is returned for min in the last instance. I traced this down to a faulty line in datalib's collector and will push a fix shortly. I've tested locally and this appears to resolve the issue. So I'm closing this issue as it is not a Vega bug. Look for a new datalib version (1.5.7) soon.
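// Assumption: running under Node.js; in a browser, datalib is available as the global dl.
var dl = require('datalib');

var v1 = [300, 400, 100, 100, 200, 0, 200, 0, 0, 100, 0, 300, 200];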
var v2 = [400, 500, 200, 200, 300, 100, 300, 100, 100, 200, 100, 400, 300];

var agg = dl.groupby().stream(true).summarize([{
  name:'value', get:dl.identity, ops:['min','max'], as:['min','max']
}]);

// add values to aggregator
v1.forEach(function(x) { agg._add(x); }); agg.changes();
v2.forEach(function(x) { agg._add(x); }); agg.changes();
console.log('ADD', JSON.stringify(agg.result()));

// test mark mod
v1.forEach(function(x) { agg._markMod(x); }); agg.changes();
v2.forEach(function(x) { agg._markMod(x); }); agg.changes();
console.log('MARK', JSON.stringify(agg.result()));

// test actual mod
v1.forEach(function(x) { agg._mod(x,x); }); agg.changes();
v2.forEach(function(x) { agg._mod(x,x); }); agg.changes();
console.log('MOD', JSON.stringify(agg.result()));
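
For comparison, here is a minimal reference model of what the phases above should report, written in plain JavaScript. It is not datalib's implementation: `NaiveMinMax` and its methods are hypothetical names, and the model simply keeps every value and recomputes the extremes on demand. Since each mod replaces a value with itself, min and max should stay at 0 and 500 throughout.

```javascript
// Hypothetical reference collector: stores all values, recomputes on demand.
function NaiveMinMax() { this.values = []; }

NaiveMinMax.prototype.add = function(x) { this.values.push(x); };

// Replace one occurrence of prev with next (a no-op when they are equal).
NaiveMinMax.prototype.mod = function(next, prev) {
  var i = this.values.indexOf(prev);
  if (i >= 0) this.values[i] = next;
};

NaiveMinMax.prototype.result = function() {
  return {
    min: Math.min.apply(null, this.values),
    max: Math.max.apply(null, this.values)
  };
};

var v1 = [300, 400, 100, 100, 200, 0, 200, 0, 0, 100, 0, 300, 200];
var v2 = [400, 500, 200, 200, 300, 100, 300, 100, 100, 200, 100, 400, 300];

var ref = new NaiveMinMax();
v1.concat(v2).forEach(function(x) { ref.add(x); });
var afterAdd = ref.result();

v1.concat(v2).forEach(function(x) { ref.mod(x, x); });
var afterMod = ref.result();

console.log('ADD', JSON.stringify(afterAdd)); // ADD {"min":0,"max":500}
console.log('MOD', JSON.stringify(afterMod)); // MOD {"min":0,"max":500}
```

Any streaming aggregator fed the same sequence should agree with this model, so a null min after the mod phase points at the aggregator's bookkeeping rather than at the data.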
