Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3. CompositePhsyicalTable Core Components Refactor #179

Merged
merged 8 commits into from
Mar 27, 2017
Merged

3. CompositePhsyicalTable Core Components Refactor #179

merged 8 commits into from
Mar 27, 2017

Conversation

garyluoex
Copy link
Collaborator

@garyluoex garyluoex commented Feb 22, 2017

Motivation
This PR is created to refactor the existing table availability structure to be more extensible such that it can support composite tables i.e. union datasource in druid. Composite tables depends on other concrete physical tables, and therefore its availability is a result of merging the availability of the underlying concrete physical table's availability. The details of the merging process is implemented in various different kinds of availability, such as metricUnionAvailability. The only implementation in this PR is ConcreteAvailability which resembles the basic table backed by druid directly that is supported in fili already but implemented in a new extensible structure. In addition, availability directly sources segment information from DataSourceMetadataService now instead of SegmentMetadataLoader which is now deprecated.

Detailed Changes

  • Added ConcretePhysicalTable and ConcreteAvailability to model table in druid datasource and its availabillity in the new table availability structure
  • Added method containsColumn to Schema
  • ConfigurationLoader now takes an additional constructor argument DataSourceMetadataService for creating tables
  • findMissingRequestTimeGrainIntervals method in PartialDataHandler now takes DataSourceConstraint
  • Removed deprecated method findMissingRequestTimeGrainIntervals from PartialDataHandler
  • Removed permissive_column_availability feature flag support and corresponding functionality in PartialDataHandler
  • Removed getIntersectSubintervalsForColumns and getUnionSubintervalsForColumns from PartialDataHandler since the logic is pushed into Availability now
  • Removed getIntervalsByColumnName, resetColumns and hasLogicalMapping methods in PhysicalTable since no longer needed with the availability structure
  • Removed deprecated buildTableGroup method in BaseTableLoader
  • getAvailability method on PartialDataHandler is now deprecated since the logic is pushed into Availability
  • Removed getAvailability method on PartialDataHandler since the logic is pushed into Availability

@garyluoex garyluoex added the WIP label Feb 22, 2017
@garyluoex garyluoex force-pushed the UDCore3 branch 3 times, most recently from 24db2d4 to 0ae6540 Compare February 23, 2017 19:31
@garyluoex garyluoex force-pushed the UDCore3 branch 9 times, most recently from da950e6 to aef5c33 Compare February 28, 2017 00:43
@garyluoex garyluoex changed the title CompositePhsyicalTable Core Components Refactor 3. CompositePhsyicalTable Core Components Refactor Feb 28, 2017
@garyluoex garyluoex force-pushed the UDCore3 branch 3 times, most recently from fe4f51c to 540c757 Compare March 6, 2017 19:03
@garyluoex garyluoex force-pushed the UDCore3 branch 2 times, most recently from 74b324b to f046ee2 Compare March 17, 2017 00:04
@@ -156,7 +156,6 @@ public void invoke(JsonNode rootNode) {
if (segmentMetadata.isEmpty()) {
LOG.warn("Empty segment metadata detected when loading table '{}'", table.getFactTableName());
}
table.resetColumns(segmentMetadata, dimensionDictionary);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I just remove the deprecated SegmentMetadata?

Copy link
Contributor

@archolewa archolewa Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, seeing as how removing this line means that this class will not work, we have two choices:

  1. Remove the class completely (so people relying on it get compile time rather than runtime errors).

  2. Find some way of getting the segments pulled out by SegmentMetadataLoader and shoved into the tables in the new scheme to preserve backwards compatibility.

I haven't dug deeply enough into the code yet to see how feasible number 2 is. Number 1 will make for some migration pain, but I think migration pain is unavoidable at this point. The only question is whether that pain should be up-front and found during migration, or subtle and found in production.

Copy link
Collaborator Author

@garyluoex garyluoex Mar 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove it since it was already Deprecated and we are using the new DataSourceMetadataLoader now.

@archolewa
Copy link
Contributor

Taking a look now.

@@ -82,21 +40,21 @@ public SimplifiedIntervalList findMissingRequestTimeGrainIntervals(
* present for a given combination of request metrics and dimensions (pulled from the API request and generated
* druid query) at the specified granularity.
*
* @param columnNames all the column names the request depends on
* @param constraints all the column names the request depends on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This JavaDoc should be updated, since we are now passing in something more general than just column names.

return column;
}
}
return null;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this null in next commit.

) {
return buildTableGroup(apiMetrics, druidMetrics, tableDefinitions, dictionaries);
this.metadataService = metadataService;
this.loadTableDictionary(dictionaries);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This setup is awkward.

We've deprecated the old version, which doesn't take the metadataService. This implies that we shouldn't be using the old method at all. Meanwhile, this version of the function just assigns the class level metadataService and then calls into the old version. That's fine as far as preserving backwards compatibility.

However, near as I can tell, the "bridge to preserve backwards compatibility" is now the new version of this class is intended to be used. People are meant to override loadTableDictionary, and then call into the various helper functions to do the heavy lifting for them. However, none of those helper functions take the metadataService as an argument, and instead rely on the metadataService field.

So in other words, if people want to do the "right" thing, and stop using the deprecated method, and they want to do the "right" thing and use the various helper methods, they have to:

  1. Set a field in the super class.
  2. Call their old code and let magic happen.

Instead, I think we should deprecate all the helper methods that rely on metadataService but don't take an argument (loadPhysicalTable, buildPhysicalTable, buildTableGroup), in favor of a version that does take the metadataService as an argument. It will be intended then, that people will call the helper methods that take the metadataService as an argument, and pass it through to them from their version of loadTableDictionary.

The deprecated versions can then call into the new versions, passing in the segmentMetadata field.

That way, the blessed way of using the BaseTableLoader doesn't have to do any awkward superclass field manipulation, and we still have the same backwards compatibility as we do now.

@michael-mclawhorn Does this make sense to you?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think simply removing the deprecated tag to the loadTableDictionary method that people are using today should fix all the problem, since user should not care and does not have to touch the metadataService, instead fili takes care of it in its AbstractBinderFactory.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@archolewa I think there's actually a deeper problem. We're setting state on this BaseTableLoader from a method, rather than via the constructor, and setting that into a private variable. The 1st part means that there's no thread safety, and it generally violates the existing state contract of this class. The 2nd part means that subclasses have no way of getting at the metadataService unless they override the method that takes it as a parameter and then hold onto it themselves.

I think, instead, that this class should take the metadataService as a constructor parameter, and then surface it as either a protected final field, or via a getter. That at least addresses the concerns about how the state gets set, as well as how the metadataService should be used.

Additionally, it's not clear what would happen in this class (and subclasses) if this service came in as null.

Set<FieldName> druidMetrics,
Set<PhysicalTableDefinition> tableDefinitions,
ResourceDictionaries dictionaries
public abstract void loadTableDictionary(ResourceDictionaries dictionaries);
Copy link
Contributor

@archolewa archolewa Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to be deprecating this, then we shouldn't make it abstract. We really shouldn't be forcing people to implement a deprecated method.

If we still want to force people to implement something, then we should make the new loadTableDictionary abstract, and include a small snippet in the migration path that basically tells users to do what we do below, or encourage them to use the new versions of the methods described below. That should be a very minor change for people using the blessed approach.

And for people who aren't, well there's no getting around pain for them. Not unless they want to start spitting out NullPointerException when they try to load segment metadata.

It's a little bit of pain for people upgrading to the new version, but I would argue it's better than forcing people to implement cruft, and it's a very minor compile-time pain so whatever.

@michael-mclawhorn Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will simply not deprecate this method since I think it is the best way without any breaking change but still adding the metadataService to it.

@@ -268,7 +261,11 @@ protected PhysicalTable loadPhysicalTable(
PhysicalTable existingTable = dictionaries.getPhysicalDictionary().get(definition.getName().asName());
if (existingTable == null) {
// Build the physical table
existingTable = buildPhysicalTable(definition, metricNames, dictionaries.getDimensionDictionary());
existingTable = buildPhysicalTable(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't need to be wrapped.

Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next part

@@ -168,8 +169,11 @@

private ObjectMappersSuite objectMappers;

private static DataSourceMetadataService dataSourceMetadataService = new DataSourceMetadataService();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this static and set up here? I think having an instance variable is good (for the get* method to use), but it should be initialized by the 1st thing that needs to use it. So, the bind call is probably the 1st thing to initialize it.

@@ -226,6 +227,9 @@ protected void configure() {
//Initialize the metrics filter helper
FieldConverterSupplier.metricsFilterSetBuilder = initializeMetricsFilterSetBuilder();

// Build the datasource metadata service containing the data segments
bind(dataSourceMetadataService).to(DataSourceMetadataService.class);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should bind to getDataSourceMetadataService() rather than to the local variable (which is really just a cache for memoizing the method)

return new DataSourceMetadataService();
protected DataSourceMetadataService getDataSourceMetadataService() {
if (Objects.isNull(dataSourceMetadataService)) {
dataSourceMetadataService = new DataSourceMetadataService();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should update the javadoc for this method to reflect that it should get one, which may be different from build one. In fact, it's probably OK to indicate that it will always give back the same one, making one if needed.

DimensionLoader dimensionLoader,
MetricLoader metricLoader,
TableLoader tableLoader
) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't need to wrap

) {
return buildTableGroup(apiMetrics, druidMetrics, tableDefinitions, dictionaries);
}
public abstract void loadTableDictionary(ResourceDictionaries dictionaries);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs @Override

@@ -310,7 +295,8 @@ protected PhysicalTable buildPhysicalTable(
definition.getName(),
definition.getGrain(),
columns,
definition.getLogicalToPhysicalNames()
definition.getLogicalToPhysicalNames(),
metadataService
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should access the metadataService via a getter. (Getter can be protected if you like). Importantly, by accessing the field directly, a.) there's no way for subclasses to manipulate the service used, and b.) there's no way for subclasses to access the service used!

));
}

@Deprecated
protected void setAvailability(Availability availability) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this locally and I can't determine why it's not working as I think it should, so I guess we can leave it as-is for now, since it's not that critical.

return getColumns().stream()
.filter(column -> column.getName().equals(columnName))
.findFirst();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to keep this method, we should make it a default method on the interface, since it's just using the existing method to get the columns, and then doing a filter.

That said, findFirst makes me a little nervous... what happens if there's more than 1 column with a given name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will throw an exception to it, since we only support druid right now so it should be fine for now. If not at least we hit an exception.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has nothing to do with Druid, it has to do with valid states of an object, and what it means to only get the 1st one if there are multiple columns somehow.

The extra work of going through all the columns to see if there may be others every time a single column is requested seems like massive overhead (ie. if we want that constraint, we should enforce it on construction), and throwing an exception is rather drastic behavior for what seems like an otherwise simple method (to the callers).

Since it's only used in tests, and not production code, and there are clearly more questions than answers here, I'd still vote for simply removing it.

@NotNull ZonedTimeGrain timeGrain,
@NotNull Set<Column> columns,
@NotNull Map<String, String> logicalToPhysicalColumnNames,
@NotNull Availability availability
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should restrict this to a ConcreteAvailability to ensure type safety, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that would force people to extend ConcreteAvailability is this ok?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. I can't think of a reason for someone to have a ConcretePhysicalTable that does not also have a ConcreteAvailability, since the primary contract of a ConcreteAvailability is that it is providing availability info for a single backing data source.

Copy link
Contributor

@archolewa archolewa Mar 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be more willing to buy that if there was something special that ConcreteAvailability contributes that Availability doesn't have (i.e. some additional methods, some tweak to the contract on the interface methods). However, that's not the case.

So from a contractual standpoint, ConcreteAvailibility doesn't really add anything, except you're forcing people to extend a class (which you can only extend one of) vs implementing an interface (which people can implement many of) if they want to add their own custom availability rules.

So I would argue that making it a ConcreteAvailability is making it the framework more rigid than it needs to be.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@archolewa There actually is an additional contract that a ConcreteAvailability offers over an Availability, but it's not currently used (and perhaps it should be, and may be indicating a deeper problem). A ConcretePhysicalTable has this same contract extension, so let's look at what that is:

A ConcretePhysicalTable exposes the additional getFactTableName, which is used by the DataSourceMetadataLoader and should also be used by the TableDataSource when we build one of those for the query. This additional method, which is there to not only indicate that there is, but also to simplify retrieval of, the name of the 1 and only 1 backing data source. Non-concrete physical tables cannot make this assertion, and it's an important one to have (1 vs. many).

Now, the possibly smelly part of the current implementation is that ConcretePhysicalTable uses it's name as the name of it's backing data source, rather than getting it from it's contained ConcreteAvailability. This is probably incorrect (ie. it should call down to the concrete availability to get it), and the following use case illustrates it:

Consider that we have a single Druid data source that contains both hourly and daily data, where it's hourly for the 1st 30 days, and then switches to daily from then on. If we wanted to be able to have Fili's table selection logic see that table as both an hourly table and a daily table, we need to have 2 different physical tables. Here's where the problem arises: Only one of those two tables (based on the current implementation) can have it's name match the name of Druid's data source. In order for both tables to have the same underlying data source, the only way to get there would be to have it sourced from the same ConcreteAvailability, which would need to have the equivalent singleton method.

So, given the above, not only should we keep ConcreteAvailability, but we should also fix the abstraction that ConcretePhysicalTable is giving us, by making it source it's factTableName from it's underlying ConcreteAvailability.

P.S. Sorry for the short saga, I'm in a storytelling mood I guess 😉

*
* @return A physical table schema.
* @return tableEntries map of column to set of available intervals
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't make sense... what's tableEntries?

@@ -39,12 +45,11 @@
DateTime getTableAlignment();

/**
* Get the schema for this physical table.
* Schemas contain granularity and column definitions.
* Getter for all possible column available intervals.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop possible. There could be data from the year 5billion, so this isn't "possible", but rather a "Getter for all of the available intervals for the columns in this table"

@Override
public Map<Column, Set<Interval>> getAllAvailableIntervals() {

Map<String, Set<Interval>> allAvailableIntervals = metadataService.getAvailableIntervalsByTable(name);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metadataService.getAvailableIntervalsByTable(name) is a common piece that's shown up more than once in this class, and while it's simple logic, it's probably worth pulling into a helper method. I think it's also one of the things that one of the follow-on PRs also does, making the case to make it a helper even stronger.

return new SimplifiedIntervalList(
requestColumns.stream()
.map(columnName -> allAvailableIntervals.getOrDefault(columnName, Collections.emptySet()))
.reduce(null, IntervalUtils::getOverlappingSubintervals)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a comment here about why this won't NPE. Since the reduce will return null if (somehow) requestColumns is empty, there's an NPE risk, but as-written, this method doesn't have it, and it's not immediately clear why that's true, so we should make it more clear.

dimensions,
logicalMetrics,
grainName
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can both unwrap

concreteAvailability.getAllAvailableIntervals() == [
(column1): [interval1].toSet(),
(column2): [interval2].toSet(),
(column3): [].toSet(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/NotABlocker

s/.toSet()/ as Set


@Override
protected DataSourceMetadataService getDataSourceMetadataService() {
this.dataSourceMetadataService = new TestDataSourceMetadataService();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only have 1 of these methods, not both (likely the latter one, not the former)

public void loadTableDictionary(ResourceDictionaries dictionaries) {
public void loadTableDictionary(
ResourceDictionaries dictionaries
) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can unwrap I think

dimensionLoader,
metricLoader,
tableLoader
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can unwrap?

CHANGELOG.md Outdated
@@ -48,15 +50,18 @@ Current
- [Support timeouts for lucene search provider](https://github.com/yahoo/fili/pull/183)

### Changed:
- [Refactor DatasourceMetadataService to fit composite table needs](https://github.com/yahoo/fili/pull/173)
- [CompositePhsyicalTable Core Components Refactor](https://github.com/yahoo/fili/pull/179)
* `ConfigurationLoader` now takes an additional constructor argument `DataSourceMetadataService` for creating tables
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change isn't accurate any more. The change is actually to the constructor of BaseTableLoader.

Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still a few things to address, but this is getting close. I don't want to merge this, however, until the Permissive ability is usable, so that means #190 likely needs to be ready (and possibly merged into this PR) before this PR (#179) can merge.

Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one little question but not blocking.

CHANGELOG.md Outdated
- [Refactor DatasourceMetadataService to fit composite table needs](https://github.com/yahoo/fili/pull/173)
- [CompositePhsyicalTable Core Components Refactor](https://github.com/yahoo/fili/pull/179)
* Added `ConcretePhysicalTable` and `ConcreteAvailability` to model table in druid datasource and its availabillity in the new table availability structure
* Added class variable for `DataSourceMetadataService` and `ConfigurationLoader` for easy access
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added them to what?

Copy link
Contributor

@will-lauer will-lauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor issues, but otherwise this looks good to me.

* @param physicalTables the tables whose column availabilities are checked
* @param requestedIntervals The intervals that may not be fully satisfied
* @param granularity The granularity at which to find missing intervals
*
* @return subintervals of the requested intervals with incomplete data
*/
public SimplifiedIntervalList findMissingTimeGrainIntervals(
Set<String> columnNames,
DataSourceConstraint constraints,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plurality of the type vs the parameter name should really match. Both should be plural or neither plural.

*/
@Deprecated
public TableGroup buildTableGroup(
Copy link
Contributor

@will-lauer will-lauer Mar 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some customers are still using this. what is the appropriate replacement? Or should it really be removed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants