-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement PartitionAvailability & PartitionCompositeTable #195
Conversation
@cdeszaq @garyluoex @michael-mclawhorn The current implementation of |
No, we do not need to since we should only call |
74b324b
to
f046ee2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part 1 review. Didn't get to the tests
* An implementation of {@link com.yahoo.bard.webservice.table.availability.Availability} | ||
* that unions different tables in druid that have the same columns together, | ||
* in a way that the same column in two different table contains different values. | ||
* Therefore if only values in one of the table is requested we only need to consider that table's availability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should generally be careful about where we talk about tables vs. where we're actually talking about the availablities of tables.
@NotNull Set<Column> columns, | ||
@NotNull Function<DataSourceConstraint, Set<PhysicalTable>> partitionFunction | ||
) { | ||
this.sourceTables = sourceTables; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check that the columns are the same across all tables if that's truly a requirement (as indicated in the JavaDoc).
@NotNull Function<DataSourceConstraint, Set<PhysicalTable>> partitionFunction | ||
) { | ||
this.sourceTables = sourceTables; | ||
this.columns = columns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be defensively coppied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdeszaq Wow, beautiful design! Did you mean something like this?
columns.forEach(column -> {
this.columns.add(new Column(column.getName()));
});
so that columns
in PartitionAvailability
will stay immutable to outside world?
*/ | ||
public class PartitionAvailability implements Availability { | ||
|
||
private final Set<PhysicalTable> sourceTables; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This likely should hold Availability
objects, since it looks like that's what we're really using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdeszaq Did you mean private final Set<Availability> sourceAvailabilities;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so?
* Constructor. | ||
* | ||
* @param sourceTables The set of tables that have the same columns | ||
* @param columns The set of columns in concern of this availability(i.e. ParitionAvailability) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure quite what this means. What does "in concern" mean? We should probably be more detailed here as to how these columns are used.
Also, referring back to this class is probably not as good as just repeating ourselves here if that's what we're trying to do.
@Override | ||
public Set<TableName> getDataSourceNames() { | ||
return sourceTables.stream() | ||
.map(PhysicalTable::getTableName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be pulling from the availabilities of the physical tables, not the physical tables directly.
Also, this is the 2nd time we're seeing this "gather the leaves from multiple tables"... should / can we have a base "compositing availability" class to hold some of these sorts of common behaviors? I'm guessing this isn't the only method has similar behavior.
(if it makes sense to dedupe the code in a follow-on PR once the similar things are merged, that's fine too, just make a techdebt issue to come back to)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
@Override | ||
public SimplifiedIntervalList getAvailableIntervals(DataSourceConstraint constraints) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is getAllAvailableIntervals()
implemented so differently from getAvailableIntervals(DataSourceConstraint)
? I would expect that, other than applying the constraints to determine which tables participate, the rest of the logic should be largely the same, right? Something seems off here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I have refactored getAllAvailableIntervals()
for (Map.Entry<Column, List<Interval>> columnSetEntry | ||
: physicalTable.getAvailability().getAllAvailableIntervals().entrySet()) { | ||
Column columnKey = columnSetEntry.getKey(); | ||
List<Interval> union = unionedColumnToIntervalMapping.getOrDefault(columnKey, new ArrayList<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More comments in here about what's happening would make this method much more understandable.
Column columnKey = columnSetEntry.getKey(); | ||
List<Interval> union = unionedColumnToIntervalMapping.getOrDefault(columnKey, new ArrayList<>()); | ||
|
||
union.addAll(columnSetEntry.getValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a generic List add won't do the right thing. If two columns have the same availability ranges, they will be duplicated here, rather than be collapsed. I think there needs to be a SimplifiedIntervalList in the mix for this to work correctly (possibly just use that instead of ArrayList for the default)
@Override | ||
public Map<Column, List<Interval>> getAllAvailableIntervals() { | ||
return getColumnToIntervalsMapping().entrySet().stream() | ||
.filter(entry -> columns.contains(entry.getKey())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feels like we might be able to use a single stream, rather than dumping to an intermediate collection, applying this filter before we start to aggregate intervals.
4c65f95
to
a36562f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changelog entry needed.
* | ||
* @param name Name of the physical table as TableName, also used as fact table name | ||
* @param columns The columns for this table | ||
* @param physicalTables A set of <tt>PhysicalTable</tt>s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose or meaning of this set of physical tables? This description, as it stands, does not provide any information that the type of the parameter doesn't give us already.
* @param physicalTables A set of <tt>PhysicalTable</tt>s | ||
* @param logicalToPhysicalColumnNames Mappings from logical to physical names | ||
* @param partitionFunction A function that transform a DataSourceConstraint to a set of | ||
* Availabilities |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any expectation of the way in which this transform is done? I think there likely needs to be more information here about what kind of transform this needs to be. Again, we need more in this description than what we can get from the type signature.
) { | ||
super( | ||
name, | ||
getCoarsestTimeGrain(physicalTables), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than compute the coarsest grain, let's change this contract to simply take a grain, and then (later in the constructor) validate that the provided grain is satisfiable by all of the dependent physicalTables.
And since both this table and the MetricUnion table (once this gets rebased onto master) have that same sort of contract, let's actually extract the common "take a grain and validate it" into the constructor of a common BaseCompositePhysicalTable
class.
Note: As part of this, let's also make MetricUnionPhysicalTable
simply take a grain, and pass it up to it's new parent class (this new BaseCompositePhysicalTable
class) for validation
import javax.validation.constraints.NotNull; | ||
|
||
/** | ||
* An implementation of Availability that unions same columns from Druid tables together. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to expand this a little bit to indicate what sort of circumstances where this would be useful, etc., as well as calling out the conditions under which it would not result in bad data. In particular, we should point out that this is for unioning across tables with the same dimension columns who's values are non-overlapping. An example would be a good idea too.
this.columns = new HashSet<>(); | ||
columns.forEach(column -> { | ||
this.columns.add(new Column(column.getName())); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not something simpler like this.columns = new HashSet<>(columns);
?
] | ||
availability2.getAllAvailableIntervals() >> [ | ||
(column2): ['2018-01-01/2018-02-01'] | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to wrap these maps, just inline them since there's only 1 entry
) | ||
|
||
Function<DataSourceConstraint, Set<Availability>> partitionFunction = Mock(Function.class) | ||
partitionFunction.apply(_ as DataSourceConstraint) >> Sets.newHashSet(availability1, availability2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Groovier: `([availability1, availability2] as Set)
availableIntervals2.collect{it -> new Interval(it)} as Set | ||
) | ||
|
||
Function<DataSourceConstraint, Set<Availability>> partitionFunction = Mock(Function.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for .class
in groovy. Class names are the class instance.
) | ||
|
||
expect: | ||
partitionAvailability.getAvailableIntervals(Mock(DataSourceConstraint)) == new SimplifiedIntervalList( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test that verifies the behavior of the partition stuff? In particular, I would expect a test that has many availabilities, and a partition function that maps to a subset, and the resulting available intervals includes only the subset of available intervals, not the full set of available intervals.
This test looks like it might be doing it, but I don't think it is...
public SimplifiedIntervalList getAvailableIntervals(DataSourceConstraint constraints) { | ||
return new SimplifiedIntervalList( | ||
partitionFunction.apply(constraints).stream() | ||
.map(availability -> availability.getAvailableIntervals(constraints)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the MetricUnionAvailability just a specialized (ie. fixed) form of the PartitionAvailability? @garyluoex or @michael-mclawhorn, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, no.
Metric union contains inner-join semantics.
Partition applies more of an unconstrained outer join behavior.
* For example, two availabilities of the following | ||
* <pre> | ||
* {@code | ||
* +-------------------------+--------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This documentation is totally wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong how? What's wrong about it? Feedback pointing in the right direction would be nice.
My guess would be that @michael-mclawhorn is looking more for this example to talk about the values of a dimension, and show that partition and what the result is when combined, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'm confused. I'm pretty sure when I was looking at this code I was reading the metric union documentation in here. I think perhaps there was a gap between my local copy and the current commit.
* +-------------------------+--------------------------+ | ||
* | [2017-01-01/2017-02-01] | [2017-03-01/2017-04-01] | | ||
* +-------------------------+--------------------------+ | ||
* } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This documentation highlights my confusion. Partition doesn't use column information at all. It simply uses the availability of child tables without considering columns.
* @param columns The set of configured columns | ||
* @param partitionFunction A function that transform a DataSourceConstraint to a set of | ||
* Availabilities. The function is composed of the following works functions/transformations | ||
* <pre> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically Partition may assume common columns, but it doesn't require it. It would be better to emphasize that the availability is the result of considering the availability of the child tables, full stop.
This PR is reopened here and merged #244 |
No description provided.