Implement PartitionAvailability & PartitionCompositeTable #195

QubitPi · 2017-03-16T15:49:06Z

No description provided.

QubitPi · 2017-03-16T15:49:10Z

The current implementation of getAvailableIntervals takes a DataSourceConstraint and passes that same constraint to every Availability via Availability.getAvailableIntervals(DataSourceConstraint). My concern: should we instead pass a sub-constraint that is specific to each Availability? By sub-constraint I mean something similar to what is used in MetricUnionAvailability.

garyluoex · 2017-03-16T17:06:36Z

No, we do not need to. We should only call getAvailableIntervals on the tables of concern and skip the tables we do not care about, and all the tables should have the same schema.

cdeszaq

Part 1 review. Didn't get to the tests

cdeszaq · 2017-03-24T17:17:06Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+ * An implementation of {@link com.yahoo.bard.webservice.table.availability.Availability}
+ * that unions different tables in druid that have the same columns together,
+ * in a way that the same column in two different table contains different values.
+ * Therefore if only values in one of the table is requested we only need to consider that table's availability.


We should generally be careful about where we talk about tables vs. where we're actually talking about the availabilities of tables.

cdeszaq · 2017-03-24T17:17:45Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+            @NotNull Set<Column> columns,
+            @NotNull Function<DataSourceConstraint, Set<PhysicalTable>> partitionFunction
+    ) {
+        this.sourceTables = sourceTables;


We should check that the columns are the same across all tables if that's truly a requirement (as indicated in the JavaDoc).

cdeszaq · 2017-03-24T17:17:56Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+            @NotNull Function<DataSourceConstraint, Set<PhysicalTable>> partitionFunction
+    ) {
+        this.sourceTables = sourceTables;
+        this.columns = columns;


These should be defensively copied.

@cdeszaq Wow, beautiful design! Did you mean something like this?

columns.forEach(column -> { this.columns.add(new Column(column.getName())); });

so that columns in PartitionAvailability will stay immutable to outside world?

cdeszaq · 2017-03-24T17:18:33Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+ */
+public class PartitionAvailability implements Availability {
+
+    private final Set<PhysicalTable> sourceTables;


This likely should hold Availability objects, since it looks like that's what we're really using.

@cdeszaq Did you mean private final Set<Availability> sourceAvailabilities;?

I think so?

cdeszaq · 2017-03-24T17:19:45Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+     * Constructor.
+     *
+     * @param sourceTables  The set of tables that have the same columns
+     * @param columns  The set of columns in concern of this availability(i.e. ParitionAvailability)


I'm not sure quite what this means. What does "in concern" mean? We should probably be more detailed here as to how these columns are used.

Also, referring back to this class is probably not as good as just repeating ourselves here if that's what we're trying to do.

cdeszaq · 2017-03-24T17:22:22Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+    @Override
+    public Set<TableName> getDataSourceNames() {
+        return sourceTables.stream()
+                .map(PhysicalTable::getTableName)


Should be pulling from the availabilities of the physical tables, not the physical tables directly.

Also, this is the 2nd time we're seeing this "gather the leaves from multiple tables"... should / can we have a base "compositing availability" class to hold some of these sorts of common behaviors? I'm guessing this isn't the only method that has similar behavior.

(if it makes sense to dedupe the code in a follow-on PR once the similar things are merged, that's fine too, just make a techdebt issue to come back to)

@cdeszaq Issue #223 has been created for this. I can pick it up once it's approved to be implemented.

cdeszaq · 2017-03-24T17:28:52Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+    }
+
+    @Override
+    public SimplifiedIntervalList getAvailableIntervals(DataSourceConstraint constraints) {


Why is getAllAvailableIntervals() implemented so differently from getAvailableIntervals(DataSourceConstraint)? I would expect that, other than applying the constraints to determine which tables participate, the rest of the logic should be largely the same, right? Something seems off here.

Agree. I have refactored getAllAvailableIntervals()

cdeszaq · 2017-03-24T17:29:38Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+            for (Map.Entry<Column, List<Interval>> columnSetEntry
+                    : physicalTable.getAvailability().getAllAvailableIntervals().entrySet()) {
+                Column columnKey = columnSetEntry.getKey();
+                List<Interval> union = unionedColumnToIntervalMapping.getOrDefault(columnKey, new ArrayList<>());


More comments in here about what's happening would make this method much more understandable.

cdeszaq · 2017-03-24T17:32:01Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+                Column columnKey = columnSetEntry.getKey();
+                List<Interval> union = unionedColumnToIntervalMapping.getOrDefault(columnKey, new ArrayList<>());
+
+                union.addAll(columnSetEntry.getValue());


I think a generic List add won't do the right thing. If two columns have the same availability ranges, they will be duplicated here, rather than be collapsed. I think there needs to be a SimplifiedIntervalList in the mix for this to work correctly (possibly just use that instead of ArrayList for the default)
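
To illustrate the concern, here is a minimal sketch of the collapsing behavior a plain `List.addAll` does not give you. It assumes a hypothetical `Range` class (half-open, `long` endpoints) in place of Joda-Time's `Interval`, and `simplify` is an illustrative helper, not the actual `SimplifiedIntervalList` implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class IntervalMergeSketch {

    /** Hypothetical half-open range [start, end), standing in for a Joda-Time Interval. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the ranges sorted by start, with overlapping or abutting ranges collapsed into one. */
    static List<Range> simplify(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((Range range) -> range.start));

        List<Range> merged = new ArrayList<>();
        for (Range next : sorted) {
            int lastIndex = merged.size() - 1;
            if (lastIndex >= 0 && merged.get(lastIndex).end >= next.start) {
                // Overlaps (or touches) the previous range: extend it instead of appending a duplicate.
                Range last = merged.remove(lastIndex);
                merged.add(new Range(last.start, Math.max(last.end, next.end)));
            } else {
                merged.add(next);
            }
        }
        return merged;
    }
}
```

With a plain `addAll`, two tables reporting the same range leave two identical entries in the list; passing the combined list through something like `simplify` leaves one.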

cdeszaq · 2017-03-24T19:49:25Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+    @Override
+    public Map<Column, List<Interval>> getAllAvailableIntervals() {
+        return getColumnToIntervalsMapping().entrySet().stream()
+                .filter(entry -> columns.contains(entry.getKey()))


It feels like we might be able to use a single stream, rather than dumping to an intermediate collection, applying this filter before we start to aggregate intervals.
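
A rough sketch of the single-stream shape, under simplifying assumptions: column names are plain `String`s, each availability is represented only by its column-to-intervals map, and intervals are ISO-style strings collected into a `Set` (so identical entries dedupe, but overlapping ones are not merged). `SingleStreamSketch` and `unionByColumn` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class SingleStreamSketch {

    /** Unions the intervals of each configured column across all availabilities in one pass. */
    static Map<String, Set<String>> unionByColumn(
            List<Map<String, Set<String>>> perAvailability,
            Set<String> configuredColumns
    ) {
        return perAvailability.stream()
                .flatMap(columnToIntervals -> columnToIntervals.entrySet().stream())
                // Filter before aggregating, so unconfigured columns are never collected.
                .filter(entry -> configuredColumns.contains(entry.getKey()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        entry -> new TreeSet<String>(entry.getValue()),
                        (left, right) -> {
                            left.addAll(right);
                            return left;
                        }
                ));
    }
}
```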

cdeszaq

Changelog entry needed.

cdeszaq · 2017-04-13T22:43:03Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/PartitionCompositeTable.java

+     *
+     * @param name  Name of the physical table as TableName, also used as fact table name
+     * @param columns  The columns for this table
+     * @param physicalTables  A set of <tt>PhysicalTable</tt>s


What is the purpose or meaning of this set of physical tables? This description, as it stands, does not provide any information that the type of the parameter doesn't give us already.

cdeszaq · 2017-04-13T22:44:06Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/PartitionCompositeTable.java

+     * @param physicalTables  A set of <tt>PhysicalTable</tt>s
+     * @param logicalToPhysicalColumnNames  Mappings from logical to physical names
+     * @param partitionFunction  A function that transform a DataSourceConstraint to a set of
+     * Availabilities


Is there any expectation of the way in which this transform is done? I think there likely needs to be more information here about what kind of transform this needs to be. Again, we need more in this description than what we can get from the type signature.

cdeszaq · 2017-04-13T22:54:02Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/PartitionCompositeTable.java

+    ) {
+        super(
+                name,
+                getCoarsestTimeGrain(physicalTables),


Rather than compute the coarsest grain, let's change this contract to simply take a grain, and then (later in the constructor) validate that the provided grain is satisfiable by all of the dependent physicalTables.

And since both this table and the MetricUnion table (once this gets rebased onto master) have that same sort of contract, let's actually extract the common "take a grain and validate it" into the constructor of a common BaseCompositePhysicalTable class.

(Note: As part of this, let's also make MetricUnionPhysicalTable simply take a grain and pass it up to its new parent class (this new BaseCompositePhysicalTable class) for validation.)
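
The "take a grain and validate it" contract could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Grain` enum, the rule that a child satisfies a requested grain when it is stored at the same or a finer grain, and the helper names. The real check would use whatever satisfiability test Fili's time grains provide.

```java
import java.util.Collection;

final class CompositeGrainSketch {

    /** Hypothetical grains, declared from finest to coarsest. */
    enum Grain { HOUR, DAY, MONTH }

    /** Simplified rule: a child can serve the requested grain if it is stored at that grain or finer. */
    static boolean satisfies(Grain childGrain, Grain requested) {
        return childGrain.compareTo(requested) <= 0;
    }

    /** Returns the requested grain if every child table can satisfy it, otherwise throws. */
    static Grain validateGrain(Grain requested, Collection<Grain> childGrains) {
        for (Grain childGrain : childGrains) {
            if (!satisfies(childGrain, requested)) {
                throw new IllegalArgumentException(
                        "Grain " + requested + " cannot be satisfied by a child table stored at " + childGrain
                );
            }
        }
        return requested;
    }
}
```

A base class constructor would call something like `validateGrain` once, so both the partition table and the metric union table fail fast on a misconfigured grain instead of each computing a coarsest grain themselves.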

cdeszaq · 2017-04-13T22:58:19Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+import javax.validation.constraints.NotNull;
+
+/**
+ * An implementation of Availability that unions same columns from Druid tables together.


It would be good to expand this a little bit to indicate the sort of circumstances in which this would be useful, etc., as well as calling out the conditions under which it would not result in bad data. In particular, we should point out that this is for unioning across tables with the same dimension columns whose values are non-overlapping. An example would be a good idea too.

cdeszaq · 2017-04-13T23:02:13Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+        this.columns = new HashSet<>();
+        columns.forEach(column -> {
+            this.columns.add(new Column(column.getName()));
+        });


Why not something simpler like this.columns = new HashSet<>(columns);?
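
For reference, a minimal sketch of the copy-constructor form. It assumes a hypothetical holder class with `String` names standing in for `Column`, and the `Collections.unmodifiableSet` wrapper is an extra not asked for in the comment above. The copy is shallow, which is sufficient only if the elements themselves are immutable.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical holder; String names stand in for Fili's Column objects. */
final class DefensiveCopySketch {

    private final Set<String> columns;

    DefensiveCopySketch(Set<String> columns) {
        // Copy the caller's set so later changes to it are not visible here,
        // then wrap the copy so it cannot be modified through the getter either.
        this.columns = Collections.unmodifiableSet(new HashSet<>(columns));
    }

    Set<String> getColumns() {
        return columns;
    }
}
```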

cdeszaq · 2017-04-13T23:16:11Z

...rc/test/groovy/com/yahoo/bard/webservice/table/availability/PartitionAvailabilitySpec.groovy

+        ]
+        availability2.getAllAvailableIntervals() >> [
+                (column2): ['2018-01-01/2018-02-01']
+        ]


No need to wrap these maps, just inline them since there's only 1 entry

cdeszaq · 2017-04-13T23:17:18Z

...rc/test/groovy/com/yahoo/bard/webservice/table/availability/PartitionAvailabilitySpec.groovy

+        )
+
+        Function<DataSourceConstraint, Set<Availability>> partitionFunction = Mock(Function.class)
+        partitionFunction.apply(_ as DataSourceConstraint) >> Sets.newHashSet(availability1, availability2)


Groovier: `([availability1, availability2] as Set)`

cdeszaq · 2017-04-13T23:17:39Z

...rc/test/groovy/com/yahoo/bard/webservice/table/availability/PartitionAvailabilitySpec.groovy

+                availableIntervals2.collect{it -> new Interval(it)} as Set
+        )
+
+        Function<DataSourceConstraint, Set<Availability>> partitionFunction = Mock(Function.class)


No need for .class in Groovy. A bare class name already evaluates to the class object.

cdeszaq · 2017-04-13T23:24:01Z

...rc/test/groovy/com/yahoo/bard/webservice/table/availability/PartitionAvailabilitySpec.groovy

+        )
+
+        expect:
+        partitionAvailability.getAvailableIntervals(Mock(DataSourceConstraint)) == new SimplifiedIntervalList(


Do we have a test that verifies the behavior of the partition stuff? In particular, I would expect a test that has many availabilities, and a partition function that maps to a subset, and the resulting available intervals includes only the subset of available intervals, not the full set of available intervals.

This test looks like it might be doing it, but I don't think it is...
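
The scenario being asked for can be sketched in plain Java as follows. The modeling is an assumption made for brevity: a constraint is a `String`, each availability is just its set of interval strings, and `PartitionSubsetSketch` is a hypothetical stand-in for the class under test, not its real signature.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PartitionSubsetSketch {

    /** Unions the intervals of only those availabilities the partition function selects. */
    static Set<String> getAvailableIntervals(
            String constraint,
            Function<String, List<Set<String>>> partitionFunction
    ) {
        return partitionFunction.apply(constraint).stream()
                .flatMap(Set::stream)
                .collect(Collectors.toCollection(TreeSet::new));
    }
}
```

A test in this style would set up three availabilities, have the partition function return two of them for a given constraint, and assert that the third availability's intervals are absent from the result.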

cdeszaq · 2017-04-13T23:24:50Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+    public SimplifiedIntervalList getAvailableIntervals(DataSourceConstraint constraints) {
+        return new SimplifiedIntervalList(
+                partitionFunction.apply(constraints).stream()
+                        .map(availability -> availability.getAvailableIntervals(constraints))


Is the MetricUnionAvailability just a specialized (i.e. fixed) form of the PartitionAvailability? @garyluoex or @michael-mclawhorn, what do you think?

Actually, no.
Metric union contains inner-join semantics.
Partition applies more of an unconstrained outer join behavior.
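
One way to picture the distinction, reading "inner join" loosely as intersection and "outer join" as union of the participants' available periods. This is an interpretation for illustration, not a description of either class's code; days are modeled as plain integers and all names are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class JoinSemanticsSketch {

    /** Inner-join style: a day is available only if every participant has it. */
    static Set<Integer> intersection(List<Set<Integer>> parts) {
        if (parts.isEmpty()) {
            return new TreeSet<>();
        }
        Set<Integer> result = new TreeSet<>(parts.get(0));
        for (Set<Integer> part : parts.subList(1, parts.size())) {
            result.retainAll(part);
        }
        return result;
    }

    /** Outer-join style: a day is available if any participant has it. */
    static Set<Integer> union(List<Set<Integer>> parts) {
        Set<Integer> result = new TreeSet<>();
        for (Set<Integer> part : parts) {
            result.addAll(part);
        }
        return result;
    }
}
```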

michael-mclawhorn · 2017-04-14T21:32:41Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+ * For example, two availabilities of the following
+ * <pre>
+ * {@code
+ * +-------------------------+--------------------------+


This documentation is totally wrong.

Wrong how? What's wrong about it? Feedback pointing in the right direction would be nice.

My guess would be that @michael-mclawhorn is looking more for this example to talk about the values of a dimension, and show that partition and what the result is when combined, etc.

Ok. I'm confused. I'm pretty sure when I was looking at this code I was reading the metric union documentation in here. I think perhaps there was a gap between my local copy and the current commit.

michael-mclawhorn · 2017-04-18T17:17:05Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+ * +-------------------------+--------------------------+
+ * | [2017-01-01/2017-02-01] |  [2017-03-01/2017-04-01] |
+ * +-------------------------+--------------------------+
+ * }


This documentation highlights my confusion. Partition doesn't use column information at all. It simply uses the availability of child tables without considering columns.

michael-mclawhorn · 2017-04-18T17:18:15Z

fili-core/src/main/java/com/yahoo/bard/webservice/table/availability/PartitionAvailability.java

+     * @param columns  The set of configured columns
+     * @param partitionFunction  A function that transform a DataSourceConstraint to a set of
+     * Availabilities. The function is composed of the following works functions/transformations
+     * <pre>


Technically Partition may assume common columns, but it doesn't require it. It would be better to emphasize that the availability is the result of considering the availability of the child tables, full stop.

garyluoex · 2017-04-26T18:36:05Z

This PR was reopened as #244 and merged there.

garyluoex added NEED 2 REVIEWS REVIEWABLE labels Mar 16, 2017

garyluoex force-pushed the UDCore3 branch 2 times, most recently from 74b324b to f046ee2 Compare March 17, 2017 00:04

QubitPi changed the base branch from UDCore3 to FieldTagging March 23, 2017 20:37

QubitPi changed the base branch from FieldTagging to UDCore3 March 23, 2017 20:37

cdeszaq requested changes Mar 24, 2017

garyluoex force-pushed the UDCore3 branch 3 times, most recently from 4c65f95 to a36562f Compare March 24, 2017 21:51

QubitPi changed the base branch from UDCore3 to master March 30, 2017 15:05

QubitPi added WIP and removed REVIEWABLE NEED 2 REVIEWS labels Mar 30, 2017

QubitPi mentioned this pull request Mar 31, 2017

Implement base "compositing availability" #223

Open

QubitPi added NEED 2 REVIEWS REVIEWABLE and removed WIP labels Mar 31, 2017

QubitPi changed the title ~~Implement PartitionAvailability~~ Implement PartitionAvailability & PartitionCompositeTable Apr 3, 2017

michael-mclawhorn assigned cdeszaq Apr 11, 2017

garyluoex self-assigned this Apr 11, 2017

garyluoex self-requested a review April 13, 2017 16:18

garyluoex assigned garyluoex and unassigned garyluoex Apr 13, 2017

cdeszaq added NEED REBASE and removed NEED REBASE labels Apr 13, 2017

cdeszaq requested changes Apr 13, 2017

QubitPi mentioned this pull request Apr 14, 2017

Implement BaseCompositePhysicalTable #242

Merged

QubitPi added 2 commits April 14, 2017 11:45

Implement PartitionAvailability and PartitionCompositeTable

1080148

address @cdeszaq's comments

7209f61

cdeszaq approved these changes Apr 14, 2017

QubitPi added NEED 1 REVIEW and removed NEED 2 REVIEWS labels Apr 14, 2017

michael-mclawhorn reviewed Apr 14, 2017

michael-mclawhorn reviewed Apr 18, 2017

michael-mclawhorn mentioned this pull request Apr 18, 2017

Pr/195 #244

Merged

garyluoex closed this Apr 26, 2017
