Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flatten nested fields #3676

Merged
merged 6 commits into from
May 3, 2018
Merged

Flatten nested fields #3676

merged 6 commits into from
May 3, 2018

Conversation

domoritz
Copy link
Member

@domoritz domoritz commented Apr 27, 2018

Always flattens nested fields. Flattening happens when data is parsed. This is similar to what we are doing already since an expression always returns flattened data and we often generate expressions instead of suing vega's native format.parse.

Fixes #3369
Fixes #3368
Fixes #3251
Fixes #3441

The logic is now as follows: the input to any transform could be nested. After the transforms, all data is flat and should be accessed as such.

@domoritz domoritz changed the title Flatten nested fields. Fixes #3369 Flatten nested fields Apr 28, 2018
@domoritz domoritz changed the title Flatten nested fields [WIP] Flatten nested fields Apr 28, 2018
@domoritz
Copy link
Member Author

@nyurik do you have a few test cases that I can try?

@nyurik
Copy link
Member

nyurik commented Apr 28, 2018

Not readily available, no.

@domoritz
Copy link
Member Author

In expressions, we require users to write the correct, non-nested, version.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "description": "Unable to truly filter data",
  "data": {
    "values":[
      {
        "rank": "1",
        "options": {
          "price": 10
        }
      },
      {
        "rank": "2",
        "options": {
          "price": 16 
        }
      },
      {
        "rank": "3",
        "options": {
          "price": 17 
        }
      } 
    ]
  },
  "transform": [{"filter": "datum['options.price'] > 10"}],
  "mark": "line",
  "encoding": {
    "x": {"field": "rank", "type": "ordinal"},
    "y": {"field": "options.price", "type": "quantitative"}
  }
}

However, for all other filters, aggregates, and so on where users specify a field, we automatically o the right thing.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "description": "Unable to truly filter data",
  "data": {
    "values":[
      {
        "rank": "1",
        "options": {
          "price": 10
        }
      },
      {
        "rank": "2",
        "options": {
          "price": 16 
        }
      },
      {
        "rank": "3",
        "options": {
          "price": 16
        }
      } 
    ]
  },
  "transform": [{"filter": {"field": "options.price", "equal": 16}}],
  "mark": "line",
  "encoding": {
    "x": {"field": "rank", "type": "ordinal"},
    "y": {"field": "options.price", "type": "quantitative"}
  }
}

@domoritz domoritz changed the title [WIP] Flatten nested fields Flatten nested fields Apr 29, 2018
@@ -74,6 +79,11 @@ export class ParseNode extends DataFlowNode {
return;
}
parse[fieldDef.field] = 'number';
} else if (countAccessPath(fieldDef.field) > 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

countAccessPath is a pretty confusing name.

accessPathDepth?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also why is this case checked after isTimeFieldDef / isNumberFieldDef?

What if a number field def has nested field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we flatten the field already. Note that every expression flattens.

@kanitw
Copy link
Member

kanitw commented Apr 29, 2018

In expressions, we require users to write the correct, non-nested, version.

Why is that the case? What if you never have an encoding that uses the field "options.price"?

Your example is correct because the encoding makes options.price get flattened so datum['options.price'] in the expression is fine.

but in general, isn't it more correct to filter by the nested expression?

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "description": "Unable to truly filter data",
  "data": {
    "values":[
      {
        "rank": "1",
        "options": {
          "price": 10
        }
      },
      {
        "rank": "2",
        "options": {
          "price": 16 
        }
      },
      {
        "rank": "3",
        "options": {
          "price": 17 
        }
      } 
    ]
  },
  "transform": [{"filter": "datum.options.price > 10"}],
  "mark": "point",
  "encoding": {
    "x": {"field": "rank", "type": "ordinal"}
  }
}

Note that what you do here is not literally flattening, but rather augment each data point with a flattened field, while the old nested field still remains.

@@ -0,0 +1,29 @@
{
"$schema": "https://vega.github.io/schema/vega-lite/v2.json",
"description": "Unable to truly filter data",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description is very confusing.

The description suggests that this spec does NOT work, but it seems to be working. So I don't understand what's the point you're trying to make here. (What is "truly filter" data?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ups, copy pasta.

@domoritz
Copy link
Member Author

domoritz commented Apr 29, 2018

@kanitw The case you described is okay. Yes, you don't have to use the flattened field and there may be fields that we don't know how to flatten.

@kanitw
Copy link
Member

kanitw commented Apr 29, 2018

Yes, you don't have to use the flattened field and there may be fields that we don't know how to flatten.

I don't understand this sentence. Can you explain more?

@domoritz
Copy link
Member Author

The idea with this PR is that as long as you only use fields and no expressions, everything will just work even with nested fields because we automatically flatten them. However, if you use an expression, we don't parse them and thus you will be able to access the nested fields as well. This is okay and I don't think we want to prevent it.

@@ -84,17 +84,17 @@
"transform": [
{
"type": "formula",
"expr": "toNumber(datum[\"source\"][\"reco\"])",
"expr": "toNumber(datum[\"source\"] && datum[\"source\"][\"reco\"])",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if source is missing, what this does output?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null

screen shot 2018-04-29 at 10 59 01

@kanitw
Copy link
Member

kanitw commented Apr 29, 2018

In #3369, there are links to many other related issues (mostly in the field nesting milestones).

Does this PR fix each of those, if so, listed them so we can close once we merge?

} else if (accessPathDepth(fieldDef.field) > 1) {
// For non-date/non-number (strings and booleans), derive a flattened field for a referenced nested field.
// (Parsing numbers / dates already flatten numeric and temporal fields.)
parse[fieldDef.field] = 'flatten';
Copy link
Member

@kanitw kanitw Apr 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@domoritz Don't you have to add the same logic above for transform -- for all of aggregate, timeUnit, window, filter.

Copy link
Member

@kanitw kanitw Apr 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side issue: I also wonder if the original logic above is behaving correctly for standalone timeUnit transform.
(the field in timeUnit won't get included in parse -- see line_timeunit_transform for example.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you have to add the same logic above for transform -- for all of aggregate, timeUnit, window, filter.

Hmm, yes. I'm surprised we don't do that already. That seems like an orthogonal bug, doesn't it?

Copy link
Member Author

@domoritz domoritz Apr 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side issue: I also wonder if the original logic above is behaving correctly for standalone timeUnit transform.

It looks correct although we parse the generates date as date again (which is not necessary but we need to track fields to remove it).

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {"url": "data/seattle-weather.csv"},
  "mark": "point",
  "transform": [{
    "timeUnit": "month",
    "field": "foo.date",
    "as": "month"
  }],
  "encoding": {
    "x": {
      "field": "month",
      "type": "temporal"
    }
  }
}

Vega

{
      "name": "source_0",
      "url": "data/seattle-weather.csv",
      "format": {"type": "csv"},
      "transform": [
        {
          "type": "formula",
          "as": "month",
          "expr": "datetime(0, month(datum[\"foo\"] && datum[\"foo\"][\"date\"]), 1, 0, 0, 0, 0)"
        },
        {
          "type": "formula",
          "expr": "toDate(datum[\"month\"])",
          "as": "month"
        },
        {
          "type": "filter",
          "expr": "datum[\"month\"] !== null && !isNaN(datum[\"month\"])"
        }
      ]
    }

@kanitw
Copy link
Member

kanitw commented Apr 29, 2018

@domoritz Don't you have to add the same logic above for transform -- for all of aggregate, timeUnit, window, filter.

Hmm, yes. I'm surprised we don't do that already. That seems like an orthogonal bug, doesn't it?

It's not orthogonal since you didn't add the new nested field logic (this PR's concern) to above? This means that nested "field" in aggregate, timeUnit, filter, window, bin may not work?

(It might, but you better check)

@domoritz
Copy link
Member Author

domoritz commented Apr 30, 2018

test

  • TimeUnit (works as shown above)
  • Calculate (we already had that)
  • Bin
  • Aggregate
  • Filter
  • Window

@domoritz
Copy link
Member Author

domoritz commented Apr 30, 2018

To test filter

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "values": {
      "hits": {
        "hits": [
          {"source": {"reco": 2, "yes": 1}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 2, "yes": 0}},
          {"source": {"reco": 1, "yes": 3}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 0}},
          {"source": {"reco": 1, "yes": 1}}
        ]
      }
    },
    "format": {"property": "hits.hits"}
  },
  "transform": [{
    "filter": {
      "field": "source.reco",
      "equal": 1
    }
  }],
  "mark": "point",
  "encoding": {
    "x": {"field": "source.reco", "type": "ordinal"},
    "y": {"field": "source.yes", "type": "ordinal"}
  }
}

To test calculate

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "values": {
      "hits": {
        "hits": [
          {"source": {"reco": 2, "yes": 1}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 2, "yes": 0}},
          {"source": {"reco": 1, "yes": 3}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 0}},
          {"source": {"reco": 1, "yes": 1}}
        ]
      }
    },
    "format": {"property": "hits.hits"}
  },
  "transform": [{
    "calculate": "datum.source.reco",
    "as": "reco"
  }],
  "mark": "point",
  "encoding": {
    "x": {"field": "reco", "type": "ordinal"},
    "y": {"field": "source.yes", "type": "ordinal"}
  }
}

to test aggregate

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "values": {
      "hits": {
        "hits": [
          {"source": {"reco": 2, "yes": 1}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 2, "yes": 0}},
          {"source": {"reco": 1, "yes": 3}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 0}},
          {"source": {"reco": 1, "yes": 1}}
        ]
      }
    },
    "format": {"property": "hits.hits"}
  },
  "transform": [{
    "aggregate": [{
      "op": "mean",
      "field": "source.reco",
      "as": "mean_reco"
    }],
    "groupby": ["source.yes"]
  }],
  "mark": "point",
  "encoding": {
    "x": {"field": "mean_reco", "type": "ordinal"},
    "y": {"field": "source.yes", "type": "ordinal"}
  }
}

to test bin

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "values": {
      "hits": {
        "hits": [
          {"source": {"reco": 2, "yes": 1}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 2, "yes": 0}},
          {"source": {"reco": 1, "yes": 3}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 0}},
          {"source": {"reco": 1, "yes": 1}}
        ]
      }
    },
    "format": {"property": "hits.hits"}
  },
  "transform": [{
    "bin": true,
    "field": "source.reco",
    "as": "reco_binned"
  }],
  "mark": "point",
  "encoding": {
    "x": {"field": "reco_binned", "type": "ordinal"},
    "y": {"field": "source.yes", "type": "ordinal"}
  }
}

to test window

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "values": {
      "hits": {
        "hits": [
          {"source": {"reco": 2, "yes": 1}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 2, "yes": 0}},
          {"source": {"reco": 1, "yes": 3}},
          {"source": {"reco": 3, "yes": 4}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 1}},
          {"source": {"reco": 1, "yes": 0}},
          {"source": {"reco": 1, "yes": 1}}
        ]
      }
    },
    "format": {"property": "hits.hits"}
  },
  "transform": [{
    "window": [{
      "op": "mean",
      "field": "source.reco",
      "as": "mean_reco"
    }],
    "groupby": ["source.yes"]
  }],
  "mark": "point",
  "encoding": {
    "x": {"field": "mean_reco", "type": "ordinal"},
    "y": {"field": "source.yes", "type": "ordinal"}
  }
}

Test escaped dots

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "values": {
      "hits": {
        "hits": [
          {"source.reco": 2 ,"source.yes": 1},
          {"source.reco": 3 ,"source.yes": 4},
          {"source.reco": 2 ,"source.yes": 0},
          {"source.reco": 1 ,"source.yes": 3},
          {"source.reco": 3 ,"source.yes": 4},
          {"source.reco": 1 ,"source.yes": 1},
          {"source.reco": 1 ,"source.yes": 1},
          {"source.reco": 1 ,"source.yes": 1},
          {"source.reco": 1 ,"source.yes": 0},
          {"source.reco": 1 ,"source.yes": 1}
        ]
      }
    },
    "format": {"property": "hits.hits"}
  },
  "mark": "point",
  "encoding": {
    "x": {"field": "source\\.reco", "type": "ordinal"},
    "y": {"field": "source\\.yes", "type": "ordinal"}
  }
}

@domoritz domoritz changed the title Flatten nested fields [WIP] Flatten nested fields Apr 30, 2018
@domoritz domoritz changed the title [WIP] Flatten nested fields Flatten nested fields Apr 30, 2018
@domoritz domoritz force-pushed the dom/flatten branch 2 times, most recently from 2fd6033 to a855ad9 Compare April 30, 2018 05:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants