RFC: Partial Data V2 #215

Open
garyluoex opened this issue Mar 30, 2017 · 5 comments

@garyluoex
Collaborator

garyluoex commented Mar 30, 2017

RFC: Partial Data V2

  • Feature Name: Partial Data V2
  • Start Date: 2017-04-21
  • RFC PR: TBD WIP

Introduction

When queried, Druid returns whatever data is available within the requested intervals and silently ignores any missing data. In many scenarios this behavior is undesirable: because Druid gives no indication that data is missing, the user cannot tell that a result is incomplete. To support a production-level data reporting system, Fili fills this gap with Partial Data V1: Fili retrieves metadata about data availability in Druid from the Coordinator and behaves differently depending on how the user expects missing data to be handled.

Motivation

Recently, a bug was discovered in Druid: the Brokers and the Coordinators can disagree about data availability for a short period, because data segments are moved between Historical nodes non-atomically. A Broker might not return data from a segment that is loaded in Druid but is currently being moved to another Historical node. In that case the Coordinator reports the segment containing the requested data as available, while the Broker returns a result that omits the data in the moving segment, without any indication. Because of this bug, Fili caches and reports "bad" data: a result with missing data is returned even when the API user explicitly asked for data only if all of it is present. Partial Data therefore needs additional power to handle this situation, which leads to the idea of Partial Data V2.

Method

Druid version 0.9.0 and later include a feature that returns the missing intervals for a given query in a header of the Broker's query response. Fili never took advantage of it, because the feature is not documented and Partial Data V1 was believed to be sufficient. Partial Data V2 will use this feature in addition to everything Partial Data V1 supports, and will validate that what Fili expects from the Broker matches what the Broker actually returned.

Below is an example of a Druid query that asks the Broker to return the missing intervals:

Content-Type: application/json
{
    "queryType": "groupBy",
    "dataSource": "semiAvailableTable",
    "granularity": "day",
    "dimensions": [ "line_id" ],
    "aggregations": [ { "type": "longSum", "name": "myMetric", "fieldName": "myMetric" } ],
    "intervals": [ "2016-11-21/2017-12-19" ],
    "context": { "uncoveredIntervalsLimit": 10 }
}

Below are the response headers Druid returned for the query above:

200 OK
Date:  Mon, 10 Apr 2017 16:24:24 GMT
Content-Type:  application/json
X-Druid-Query-Id:  92c81bed-d9e6-4242-836b-0fcd1efdee9e
X-Druid-Response-Context: {
"uncoveredIntervals": [
    "2016-11-22T00:00:00.000Z/2016-12-18T00:00:00.000Z",
    "2016-12-25T00:00:00.000Z/2017-01-03T00:00:00.000Z",
    "2017-01-31T00:00:00.000Z/2017-02-01T00:00:00.000Z",
    "2017-02-08T00:00:00.000Z/2017-02-09T00:00:00.000Z",
    "2017-02-10T00:00:00.000Z/2017-02-13T00:00:00.000Z",
    "2017-02-16T00:00:00.000Z/2017-02-20T00:00:00.000Z",
    "2017-02-22T00:00:00.000Z/2017-02-25T00:00:00.000Z",
    "2017-02-26T00:00:00.000Z/2017-03-01T00:00:00.000Z",
    "2017-03-04T00:00:00.000Z/2017-03-05T00:00:00.000Z",
    "2017-03-08T00:00:00.000Z/2017-03-09T00:00:00.000Z"
],
"uncoveredIntervalsOverflowed": true
}
Content-Encoding:  gzip
Vary:  Accept-Encoding, User-Agent
Transfer-Encoding:  chunked
Server:  Jetty(9.2.5.v20141112)

In the "context" section of the Druid query, the "uncoveredIntervalsLimit" property tells Druid that we want the Broker to return the list of intervals that are not covered by the response; that list is the "uncoveredIntervals" property shown above. The value 10 tells Druid to return only the first 10 contiguous uncovered intervals in the header, and to set the flag "uncoveredIntervalsOverflowed": true when there are more uncovered intervals beyond those 10.
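As a rough illustration of how a caller might attach this property, the sketch below copies an existing query context and adds "uncoveredIntervalsLimit" only when the configured limit is non-negative. The class and method names are hypothetical and belong to neither Fili nor Druid; the "timeout" entry is only a placeholder for whatever the context already holds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Fili or Druid.
class UncoveredIntervalsContext {

    // Returns a copy of the context; the limit is added only when it is >= 0,
    // so a negative limit leaves the query unchanged (feature disabled).
    static Map<String, Object> withUncoveredIntervalsLimit(Map<String, Object> context, int limit) {
        Map<String, Object> copy = new LinkedHashMap<>(context);
        if (limit >= 0) {
            copy.put("uncoveredIntervalsLimit", limit);
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> base = new LinkedHashMap<>();
        base.put("timeout", 1000); // placeholder for an existing context entry

        System.out.println(withUncoveredIntervalsLimit(base, 10)); // limit attached
        System.out.println(withUncoveredIntervalsLimit(base, -1)); // context unchanged
    }
}
```

Copying the map rather than mutating it keeps the original query context reusable for callers that do not want the feature.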

Using the "uncoveredIntervals" information in the Broker's response header, we can compare what Druid reports as missing against the missing intervals that Fili expects from Partial Data V1. If "uncoveredIntervals" contains any interval that is not in Fili's list of expected missing intervals, we can send back an error response indicating the mismatch in data availability before the response is cached.
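The comparison can be sketched with plain java.time types. This is a minimal illustration, not Fili's implementation: the class and method names are hypothetical, intervals are treated as half-open [start, end), and the "expected available" intervals in main are invented sample values. An uncovered interval counts as a mismatch when it overlaps anything Fili believes is available.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; Fili itself would work with its own interval types.
class UncoveredIntervalCheck {

    // Half-open interval [start, end) parsed from "start/end" ISO-8601 text.
    static final class Interval {
        final Instant start;
        final Instant end;

        Interval(Instant start, Instant end) {
            this.start = start;
            this.end = end;
        }

        static Interval parse(String text) {
            String[] parts = text.split("/");
            return new Interval(Instant.parse(parts[0]), Instant.parse(parts[1]));
        }

        boolean overlaps(Interval other) {
            return start.isBefore(other.end) && other.start.isBefore(end);
        }

        @Override
        public String toString() {
            return start + "/" + end;
        }
    }

    // Returns every uncovered interval that overlaps an interval expected to be available.
    static List<Interval> findMismatches(List<String> uncovered, List<String> expectedAvailable) {
        List<Interval> mismatches = new ArrayList<>();
        for (String uncoveredText : uncovered) {
            Interval candidate = Interval.parse(uncoveredText);
            for (String availableText : expectedAvailable) {
                if (candidate.overlaps(Interval.parse(availableText))) {
                    mismatches.add(candidate);
                    break;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<String> uncovered = Arrays.asList(
                "2016-11-22T00:00:00.000Z/2016-12-18T00:00:00.000Z",
                "2017-01-31T00:00:00.000Z/2017-02-01T00:00:00.000Z");
        // Invented availability, standing in for what Partial Data V1 would report.
        List<String> expectedAvailable = Arrays.asList(
                "2016-12-18T00:00:00.000Z/2016-12-25T00:00:00.000Z",
                "2017-01-03T00:00:00.000Z/2017-02-01T00:00:00.000Z");

        List<Interval> mismatches = findMismatches(uncovered, expectedAvailable);
        System.out.println(mismatches.isEmpty() ? "consistent" : "mismatch: " + mismatches);
    }
}
```

A non-empty result is the condition under which the error response would be sent instead of caching the result.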

Implementation

The following design is proposed for Partial Data V2 in Fili; it introduces no breaking changes and no API changes:

  1. Miscellaneous Preparation

    • Add a new query context property, "uncoveredIntervalsLimit", to QueryContext for Druid's uncovered intervals feature
    • Add a configurable property named druid_uncovered_interval_limit that defaults to -1, with a comment noting that a negative value disables the feature
    • Add new response error messages as needed by Partial Data V2
  2. Merge Druid Response Header into Druid Response Body Json Node in AsyncDruidWebServiceImplV2

    • Implement a new AsyncDruidWebServiceImplV2 class that extends AsyncDruidWebServiceImpl and overrides the sendRequest method, keeping the parent class's behavior and adding the following steps
    • Retrieve the "X-Druid-Response-Context" header from the Druid response
    • Add both the "X-Druid-Response-Context" header, parsed as a JsonNode, and the Druid response body, which is already parsed into a JsonNode, to a newly created ObjectNode
    • Return the newly created ObjectNode as a JsonNode
    • In AbstractBinderFactory::buildDruidWebService, check whether druid_uncovered_interval_limit is greater than or equal to 0; if so, use AsyncDruidWebServiceImplV2, otherwise use the original implementation
  3. Implement PartialDataV2ResponseProcessor implementing FullResponseProcessor

    • Create a new FullResponseProcessor class that extends ResponseProcessor and adds nothing to it; PartialDataV2ResponseProcessor implements it
    • Check the response status code: if it is 304, invoke the next response processor directly, following the rules in the last bullet point of this section; if it is 200, do the following
    • Extract uncoveredIntervalsOverflowed from the X-Druid-Response-Context inside the JsonNode passed into PartialDataV2ResponseProcessor::processResponse; if it is true, invoke an error response saying that the limit overflowed
    • Extract uncoveredIntervals from the X-Druid-Response-Context inside the JsonNode passed into PartialDataV2ResponseProcessor::processResponse
    • Parse both the uncoveredIntervals extracted above and allAvailableIntervals, obtained from DataSourceMetadataService as the union of the availabilities of all of the query's data sources, into SimplifiedIntervalLists
    • Compare the two SimplifiedIntervalLists: if allAvailableIntervals overlaps uncoveredIntervals at all, invoke an error response indicating that Druid is missing some data that Fili expects to exist
    • Otherwise, check whether the next responseProcessor is a FullResponseProcessor: if so, call it with the same JsonNode that was passed in; if not, call it with the response body JsonNode instead of the ObjectNode that contains the extra "X-Druid-Response-Context"
  4. Implement PartialDataV2RequestHandler implementing DataRequestHandler

    • Add the "uncoveredIntervalsLimit: $druid_uncovered_interval_limit" context to the DruidAggregationQuery passed into DataRequestHandler::druidQuery by calling DruidQuery::withContext
    • Pass the modified DruidQuery, instead of the original Druid query, to the next request handler
    • Append PartialDataV2ResponseProcessor to the existing chain of next ResponseProcessors
    • Add PartialDataV2RequestHandler to DruidWorkflow between AsyncDruidRequestHandler and CacheV2RequestHandler, guarded by a check that druid_uncovered_interval_limit is greater than or equal to 0
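
The checks in step 3 can be summarized as a small decision function. This is only a sketch under assumptions: the class, enum, and method names are hypothetical, the overlap between uncoveredIntervals and allAvailableIntervals is assumed to be computed elsewhere and passed in as a boolean, and status codes other than 200 and 304 (which this RFC does not discuss) are simply rejected.

```java
// Hypothetical sketch of the decision order described in step 3.
class PartialDataV2Decision {

    enum Outcome {
        FORWARD_NOT_MODIFIED,   // 304: hand straight to the next response processor
        ERROR_LIMIT_OVERFLOWED, // more uncovered intervals than the configured limit
        ERROR_MISSING_DATA,     // Druid lacks data that Fili expects to be available
        FORWARD_RESPONSE        // everything is consistent: continue the chain
    }

    static Outcome decide(int statusCode, boolean overflowed, boolean uncoveredOverlapsAvailable) {
        if (statusCode == 304) {
            return Outcome.FORWARD_NOT_MODIFIED;
        }
        if (statusCode != 200) {
            // Assumption: only 200 and 304 are expected to reach this processor.
            throw new IllegalArgumentException("Unhandled status code: " + statusCode);
        }
        if (overflowed) {
            return Outcome.ERROR_LIMIT_OVERFLOWED;
        }
        if (uncoveredOverlapsAvailable) {
            return Outcome.ERROR_MISSING_DATA;
        }
        return Outcome.FORWARD_RESPONSE;
    }

    public static void main(String[] args) {
        System.out.println(decide(304, false, false));
        System.out.println(decide(200, true, false));
        System.out.println(decide(200, false, true));
        System.out.println(decide(200, false, false));
    }
}
```

The overflow check comes first because a truncated uncoveredIntervals list cannot be compared reliably against the expected availability.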
@cdeszaq
Collaborator

cdeszaq commented Apr 13, 2017

A few questions:

  1. In which version of Druid is this available?
  2. What is the "uncoveredIntervalsLimit": 10 in the context block of the query doing?
  3. What is the structure of the X-Druid-Response-Context header?
    • What do the relevant sections of that header mean?
  4. Are details about how you're thinking this will get used or get hooked into Fili still being worked on?

@garyluoex garyluoex added REVIEWABLE and removed WIP labels Apr 21, 2017
@cdeszaq
Collaborator

cdeszaq commented Apr 24, 2017

This looks pretty solid. 👍

@cdeszaq
Collaborator

cdeszaq commented Apr 25, 2017

Some off-line design notes:

[Image: img_0382]

Note: Hooks up with Cache v3 as well.

@QubitPi
Contributor

QubitPi commented May 4, 2017

👍 Good design and article

@QubitPi
Contributor

QubitPi commented May 5, 2017

@garyluoex For implementation step 3, by "FullResponseProcessor class" do we mean a FullResponseProcessor interface that extends the ResponseProcessor interface?
