Issue #97, #98: Aggregate job info per day and per week as part of JobPr... #99
… as part of JobProcessing Step in the ETL. Also enable a REST API to fetch this aggregated app info
# hraven_agg_daily - stores daily aggregated job info
# the r column family has a TTL of 30 days
create 'hraven_agg_daily', {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
  {NAME => 'r', VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'}
We should add a comment explaining how we're using the TTL here and why we're dropping older data.
It will be hard to infer from the code that there is a column per job for each flow, kept only so that we can check whether we've seen the job already and keep an accurate count. We assume that no flow runs longer than the TTL, and once we have an accurate count we don't need the job IDs themselves to report the data (the counts are stored elsewhere).
I'm even wondering if we want to use a different column family name, to avoid confusion with other raw columns that we certainly do want to keep. An unsuspecting code change later could end up using the same raw column family, only to find out in production that the data is dropped.
Perhaps we want to call this 't' (for temp) or 's' (for scratch) or something like that.
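Putting the suggestion together, a revised schema might look like the sketch below. This is only an illustration of the review comment, not the merged code: the 's' family name and the comment wording are the reviewer's proposal.

```
# hraven_agg_daily - stores daily aggregated job info
# The 's' (scratch) column family holds one marker column per job ID per flow,
# used only to dedupe jobs so the aggregate counts stay accurate. We assume no
# flow runs longer than the 30-day TTL; once a job is counted, its ID is no
# longer needed (the counts live in the 'i' family), so HBase may drop it.
create 'hraven_agg_daily', {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
  {NAME => 's', VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'}
```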
…Byte-kind of functions are being used from several classes
Conflicts:
	hraven-core/src/main/java/com/twitter/hraven/Flow.java
	hraven-core/src/main/java/com/twitter/hraven/JobDetails.java
/**
 * Encodes the given timestamp for ordering by run ID
 */
It is very confusing to call this runId. It appears to suggest that we aggregate data within a run, which we don't. We aggregate up to a day or a week, so this field is a day or a week. aggregationId may be a better name, with an explanation that this could be a day, week, month, or whatever else we want to aggregate by. The fact that we end up getting this value by rounding down runId (which happens to be the timestamp when the flow started) is a slightly different story.
We do, by the way, need to clearly describe that all jobs in a flow get aggregated to the start of the first job, or the timestamp when the flow is submitted, or something like that. I can imagine that people might be confused about what happens when a flow spans more than one day.
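The rounding described here could be sketched as follows. This is a hypothetical helper for illustration only; the class and method names are not hraven's actual API, and the weekly window is epoch-aligned rather than calendar-aligned.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of deriving an aggregationId by rounding down a runId. */
public class AggregationId {
    static final long DAY_MS = TimeUnit.DAYS.toMillis(1);
    static final long WEEK_MS = TimeUnit.DAYS.toMillis(7);

    /** Round a flow's runId (epoch millis) down to the start of its day. */
    static long dailyAggregationId(long runIdMillis) {
        return runIdMillis - (runIdMillis % DAY_MS);
    }

    /** Round down to the start of the enclosing 7-day window (epoch-aligned). */
    static long weeklyAggregationId(long runIdMillis) {
        return runIdMillis - (runIdMillis % WEEK_MS);
    }
}
```

All jobs in a flow would share the flow's runId, so they all round down to the same aggregation bucket even when the flow itself spans a day boundary.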
It looks like the pull request got confused by several changes on master and the merge with my fork: it now shows files that were not part of this pull request as being changed.
Resubmitted with manually merged conflicts as #113