Skip to content
This repository has been archived by the owner on Jan 15, 2022. It is now read-only.

Issue #97, #98: Aggregate job info per day and per week as part of JobPr... #99

Closed
wants to merge 4 commits into from

Conversation

vrushalic
Copy link
Collaborator

...ocessing Step in the ETL. Also enable a rest api to fetch this aggregated app info

… as part of JobProcessing Step in the ETL. Also enable a rest api to fetch this aggregated app info
# hraven_agg_daily - stores daily aggregated job info
# the r column family has a TTL of 30 days
create 'hraven_agg_daily', {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
{NAME => 'r', VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a comment explaining how we're using the TTL here and why we're dropping older data.
It will be hard to infer from the code that there will be a column per job for each flow just so that we can check if we've see the job already to get an accurate count. We assume that no flow runs longer than the TTL time and once we have an accurate count, we don't need the job IDs themselves to report the data (the counts themselves are stored elsewhere).
I'm even wondering if we want to use a different column family name to avoid confusion with other raw columns that we certainly do want to keep.
Unsuspecting code change later could end up using the same raw column, only to find out later in production that the data is dropped.
Perhaps we want to call this a 't' (for temp) or 's' (for scratch) or something like that.

Vrushali Channapattan added 3 commits June 17, 2014 22:58
…Byte-kind of functions are being used from several classes
Conflicts:
	hraven-core/src/main/java/com/twitter/hraven/Flow.java
	hraven-core/src/main/java/com/twitter/hraven/JobDetails.java

/**
* Encodes the given timestamp for ordering by run ID
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very confusing to call this runId.
This appears to suggest that we aggregate data within a run, which we don't.
We aggregate up to a day or week. So this field is a day or week.
aggregationId may be better, with explanation this could be a day, week, month, or whatever we want to agg by.
The fact that we end up getting this value by rounding down runId (which happen to be timestamps when the flow started)
is a slightly different story.

We do by the way need to clearly describe that all jobs in a flow get aggregated to the start of the first job
or the timestamp when the flow is submitted or something like that.
I can imagine that people might be confused what happens when a flow spans more than one day.

@vrushalic
Copy link
Collaborator Author

It looks like the pull request is weirded out due to several changes on master and merging with my fork. My pull request now shows files that were not part of this pull request as being changed in this pull request.
I will resolve these conflict issues and ensure only my changes show up as changes in this pull.

@vrushalic vrushalic closed this Jul 1, 2014
@vrushalic
Copy link
Collaborator Author

Resubmitted with manually merged conflicts as #113

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants