Issue #97, #98: Aggregate job info per day and per week as part of JobPr... #99
… as part of JobProcessing Step in the ETL. Also enable a REST API to fetch this aggregated app info
# hraven_agg_daily - stores daily aggregated job info
# the r column family has a TTL of 30 days
create 'hraven_agg_daily', {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
  {NAME => 'r', VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'}
We should add a comment explaining how we're using the TTL here and why we're dropping older data.
It will be hard to infer from the code that there is a column per job for each flow, kept only so that we can check whether we've seen the job already and keep an accurate count. We assume that no flow runs longer than the TTL, and once we have an accurate count we don't need the job IDs themselves to report the data (the counts are stored elsewhere).
I'm even wondering if we want to use a different column family name, to avoid confusion with other raw columns that we certainly do want to keep. An unsuspecting code change later could end up using the same raw column family, only to find out in production that the data is dropped.
Perhaps we want to call this 't' (for temp) or 's' (for scratch) or something like that.
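Putting the suggestion together, a revised schema might look like the sketch below. This is only an illustration of the review comment, not the merged code: the 's' family name and the comment wording are the reviewer's proposal.

```
# hraven_agg_daily - stores daily aggregated job info
# The 's' (scratch) column family holds one marker column per job ID per flow,
# used only to dedupe jobs so the aggregate counts stay accurate. We assume no
# flow runs longer than the 30-day TTL; once a job is counted, its ID is no
# longer needed (the counts live in the 'i' family), so HBase may drop it.
create 'hraven_agg_daily', {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
  {NAME => 's', VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'}
```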
…Byte-kind of functions are being used from several classes
Conflicts:
	hraven-core/src/main/java/com/twitter/hraven/Flow.java
	hraven-core/src/main/java/com/twitter/hraven/JobDetails.java
/**
 * Encodes the given timestamp for ordering by run ID
 */
It is very confusing to call this runId. It appears to suggest that we aggregate data within a run, which we don't. We aggregate up to a day or a week, so this field is a day or a week. aggregationId may be a better name, with an explanation that this could be a day, week, month, or whatever else we want to aggregate by. The fact that we end up getting this value by rounding down runId (which happens to be the timestamp when the flow started) is a slightly different story.
We do, by the way, need to clearly describe that all jobs in a flow get aggregated to the start of the first job, or the timestamp when the flow is submitted, or something like that. I can imagine that people might be confused about what happens when a flow spans more than one day.
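The rounding described here could be sketched as follows. This is a hypothetical helper for illustration only; the class and method names are not hraven's actual API, and the weekly window is epoch-aligned rather than calendar-aligned.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of deriving an aggregationId by rounding down a runId. */
public class AggregationId {
    static final long DAY_MS = TimeUnit.DAYS.toMillis(1);
    static final long WEEK_MS = TimeUnit.DAYS.toMillis(7);

    /** Round a flow's runId (epoch millis) down to the start of its day. */
    static long dailyAggregationId(long runIdMillis) {
        return runIdMillis - (runIdMillis % DAY_MS);
    }

    /** Round down to the start of the enclosing 7-day window (epoch-aligned). */
    static long weeklyAggregationId(long runIdMillis) {
        return runIdMillis - (runIdMillis % WEEK_MS);
    }
}
```

All jobs in a flow would share the flow's runId, so they all round down to the same aggregation bucket even when the flow itself spans a day boundary.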
It looks like the pull request got confused by several changes on master and the merge with my fork: it now shows files that were not part of this pull request as being changed.
Resubmitted with manually merged conflicts as #113