Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement PartitionAvailability & PartitionCompositeTable #195

Closed
wants to merge 2 commits into from
Closed

Implement PartitionAvailability & PartitionCompositeTable #195

wants to merge 2 commits into from

Conversation

QubitPi
Copy link
Contributor

@QubitPi QubitPi commented Mar 16, 2017

No description provided.

@QubitPi
Copy link
Contributor Author

QubitPi commented Mar 16, 2017

@cdeszaq @garyluoex @michael-mclawhorn

The current implementation of getAvailableIntervals takes a DataSourceConstraint and pass that constraint to all Availabilities.getAvailableIntervals(DataSourceConstraint). My concern is should we pass a sub-constraint that is specific to an Availability? By sub-constraint I mean something similar used in MetricUnionAvailability

@garyluoex
Copy link
Collaborator

No, we do not need to since we should only call getAvailableInterval on the table in concern while skipping other tables that we do not care and all the tables should have the same schema.

@garyluoex garyluoex force-pushed the UDCore3 branch 2 times, most recently from 74b324b to f046ee2 Compare March 17, 2017 00:04
@QubitPi QubitPi changed the base branch from UDCore3 to FieldTagging March 23, 2017 20:37
@QubitPi QubitPi changed the base branch from FieldTagging to UDCore3 March 23, 2017 20:37
Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part 1 review. Didn't get to the tests

* An implementation of {@link com.yahoo.bard.webservice.table.availability.Availability}
* that unions different tables in druid that have the same columns together,
* in a way that the same column in two different table contains different values.
* Therefore if only values in one of the table is requested we only need to consider that table's availability.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should generally be careful about where we talk about tables vs. where we're actually talking about the availablities of tables.

@NotNull Set<Column> columns,
@NotNull Function<DataSourceConstraint, Set<PhysicalTable>> partitionFunction
) {
this.sourceTables = sourceTables;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check that the columns are the same across all tables if that's truly a requirement (as indicated in the JavaDoc).

@NotNull Function<DataSourceConstraint, Set<PhysicalTable>> partitionFunction
) {
this.sourceTables = sourceTables;
this.columns = columns;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be defensively coppied

Copy link
Contributor Author

@QubitPi QubitPi Mar 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdeszaq Wow, beautiful design! Did you mean something like this?

columns.forEach(column -> {
    this.columns.add(new Column(column.getName()));
});

so that columns in PartitionAvailability will stay immutable to outside world?

*/
public class PartitionAvailability implements Availability {

private final Set<PhysicalTable> sourceTables;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This likely should hold Availability objects, since it looks like that's what we're really using.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdeszaq Did you mean private final Set<Availability> sourceAvailabilities;?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so?

* Constructor.
*
* @param sourceTables The set of tables that have the same columns
* @param columns The set of columns in concern of this availability(i.e. ParitionAvailability)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure quite what this means. What does "in concern" mean? We should probably be more detailed here as to how these columns are used.

Also, referring back to this class is probably not as good as just repeating ourselves here if that's what we're trying to do.

@Override
public Set<TableName> getDataSourceNames() {
return sourceTables.stream()
.map(PhysicalTable::getTableName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be pulling from the availabilities of the physical tables, not the physical tables directly.

Also, this is the 2nd time we're seeing this "gather the leaves from multiple tables"... should / can we have a base "compositing availability" class to hold some of these sorts of common behaviors? I'm guessing this isn't the only method has similar behavior.

(if it makes sense to dedupe the code in a follow-on PR once the similar things are merged, that's fine too, just make a techdebt issue to come back to)

Copy link
Contributor Author

@QubitPi QubitPi Mar 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdeszaq Issue #223 has been created for this. I can pick it up once it's approved to be implemented.

}

@Override
public SimplifiedIntervalList getAvailableIntervals(DataSourceConstraint constraints) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is getAllAvailableIntervals() implemented so differently from getAvailableIntervals(DataSourceConstraint)? I would expect that, other than applying the constraints to determine which tables participate, the rest of the logic should be largely the same, right? Something seems off here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I have refactored getAllAvailableIntervals()

for (Map.Entry<Column, List<Interval>> columnSetEntry
: physicalTable.getAvailability().getAllAvailableIntervals().entrySet()) {
Column columnKey = columnSetEntry.getKey();
List<Interval> union = unionedColumnToIntervalMapping.getOrDefault(columnKey, new ArrayList<>());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comments in here about what's happening would make this method much more understandable.

Column columnKey = columnSetEntry.getKey();
List<Interval> union = unionedColumnToIntervalMapping.getOrDefault(columnKey, new ArrayList<>());

union.addAll(columnSetEntry.getValue());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a generic List add won't do the right thing. If two columns have the same availability ranges, they will be duplicated here, rather than be collapsed. I think there needs to be a SimplifiedIntervalList in the mix for this to work correctly (possibly just use that instead of ArrayList for the default)

@Override
public Map<Column, List<Interval>> getAllAvailableIntervals() {
return getColumnToIntervalsMapping().entrySet().stream()
.filter(entry -> columns.contains(entry.getKey()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feels like we might be able to use a single stream, rather than dumping to an intermediate collection, applying this filter before we start to aggregate intervals.

@garyluoex garyluoex force-pushed the UDCore3 branch 3 times, most recently from 4c65f95 to a36562f Compare March 24, 2017 21:51
@QubitPi QubitPi changed the base branch from UDCore3 to master March 30, 2017 15:05
@QubitPi QubitPi changed the title Implement PartitionAvailability Implement PartitionAvailability & PartitionCompositeTable Apr 3, 2017
@garyluoex garyluoex self-assigned this Apr 11, 2017
@garyluoex garyluoex self-requested a review April 13, 2017 16:18
@garyluoex garyluoex assigned garyluoex and unassigned garyluoex Apr 13, 2017
Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog entry needed.

*
* @param name Name of the physical table as TableName, also used as fact table name
* @param columns The columns for this table
* @param physicalTables A set of <tt>PhysicalTable</tt>s
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose or meaning of this set of physical tables? This description, as it stands, does not provide any information that the type of the parameter doesn't give us already.

* @param physicalTables A set of <tt>PhysicalTable</tt>s
* @param logicalToPhysicalColumnNames Mappings from logical to physical names
* @param partitionFunction A function that transform a DataSourceConstraint to a set of
* Availabilities
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any expectation of the way in which this transform is done? I think there likely needs to be more information here about what kind of transform this needs to be. Again, we need more in this description than what we can get from the type signature.

) {
super(
name,
getCoarsestTimeGrain(physicalTables),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than compute the coarsest grain, let's change this contract to simply take a grain, and then (later in the constructor) validate that the provided grain is satisfiable by all of the dependent physicalTables.

And since both this table and the MetricUnion table (once this gets rebased onto master) have that same sort of contract, let's actually extract the common "take a grain and validate it" into the constructor of a common BaseCompositePhysicalTable class.

Note: As part of this, let's also make MetricUnionPhysicalTable simply take a grain, and pass it up to it's new parent class (this new BaseCompositePhysicalTable class) for validation

import javax.validation.constraints.NotNull;

/**
* An implementation of Availability that unions same columns from Druid tables together.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to expand this a little bit to indicate what sort of circumstances where this would be useful, etc., as well as calling out the conditions under which it would not result in bad data. In particular, we should point out that this is for unioning across tables with the same dimension columns who's values are non-overlapping. An example would be a good idea too.

this.columns = new HashSet<>();
columns.forEach(column -> {
this.columns.add(new Column(column.getName()));
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not something simpler like this.columns = new HashSet<>(columns);?

]
availability2.getAllAvailableIntervals() >> [
(column2): ['2018-01-01/2018-02-01']
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to wrap these maps, just inline them since there's only 1 entry

)

Function<DataSourceConstraint, Set<Availability>> partitionFunction = Mock(Function.class)
partitionFunction.apply(_ as DataSourceConstraint) >> Sets.newHashSet(availability1, availability2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Groovier: `([availability1, availability2] as Set)

availableIntervals2.collect{it -> new Interval(it)} as Set
)

Function<DataSourceConstraint, Set<Availability>> partitionFunction = Mock(Function.class)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for .class in groovy. Class names are the class instance.

)

expect:
partitionAvailability.getAvailableIntervals(Mock(DataSourceConstraint)) == new SimplifiedIntervalList(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test that verifies the behavior of the partition stuff? In particular, I would expect a test that has many availabilities, and a partition function that maps to a subset, and the resulting available intervals includes only the subset of available intervals, not the full set of available intervals.

This test looks like it might be doing it, but I don't think it is...

public SimplifiedIntervalList getAvailableIntervals(DataSourceConstraint constraints) {
return new SimplifiedIntervalList(
partitionFunction.apply(constraints).stream()
.map(availability -> availability.getAvailableIntervals(constraints))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the MetricUnionAvailability just a specialized (ie. fixed) form of the PartitionAvailability? @garyluoex or @michael-mclawhorn, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, no.
Metric union contains inner-join semantics.
Partition applies more of an unconstrained outer join behavior.

* For example, two availabilities of the following
* <pre>
* {@code
* +-------------------------+--------------------------+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation is totally wrong.

Copy link
Collaborator

@cdeszaq cdeszaq Apr 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong how? What's wrong about it? Feedback pointing in the right direction would be nice.

My guess would be that @michael-mclawhorn is looking more for this example to talk about the values of a dimension, and show that partition and what the result is when combined, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm confused. I'm pretty sure when I was looking at this code I was reading the metric union documentation in here. I think perhaps there was a gap between my local copy and the current commit.

* +-------------------------+--------------------------+
* | [2017-01-01/2017-02-01] | [2017-03-01/2017-04-01] |
* +-------------------------+--------------------------+
* }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation highlights my confusion. Partition doesn't use column information at all. It simply uses the availability of child tables without considering columns.

* @param columns The set of configured columns
* @param partitionFunction A function that transform a DataSourceConstraint to a set of
* Availabilities. The function is composed of the following works functions/transformations
* <pre>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically Partition may assume common columns, but it doesn't require it. It would be better to emphasize that the availability is the result of considering the availability of the child tables, full stop.

@michael-mclawhorn michael-mclawhorn mentioned this pull request Apr 18, 2017
@garyluoex
Copy link
Collaborator

This PR is reopened here and merged #244

@garyluoex garyluoex closed this Apr 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants