This repository has been archived by the owner on Jan 15, 2022. It is now read-only.

Issue #97, #98: Aggregate job info per day and per week as part of JobPr... #113

Merged 6 commits into twitter:master from vrushalic:etl_aggregate_2 on Dec 8, 2014


vrushalic
Collaborator

...ocessing Step in the ETL. Also enables a REST API to fetch this aggregated app info, and uses check-and-put to store the number of runs, cost, and queues.

@vrushalic
Collaborator Author

Previous pull request for the same change: #99.
That one was closed because of merge conflicts on my side, caused by an outdated master on my fork.

@vrushalic
Collaborator Author

This pull request also updates columns in the raw table that record the daily and weekly aggregation status of each job. If a job has already been aggregated, aggregation is not re-attempted unless the re-aggregate flag is turned on. This helps avoid inadvertent re-aggregation, since aggregation is not idempotent.
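
For readers following along, the status check described above could look roughly like the sketch below. It is a minimal in-memory illustration, not the code in this pull request: the class `AggregationGuard`, its maps, and its method names are hypothetical stand-ins for the raw-table status column and the aggregation table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in; not the hRaven implementation. */
class AggregationGuard {
  // jobId -> true once aggregated (stand-in for the raw-table status column)
  private final Map<String, Boolean> aggregationStatus = new HashMap<>();
  // appId -> aggregated job count (stand-in for a daily aggregation row)
  private final Map<String, Long> jobCounts = new HashMap<>();

  /**
   * Aggregates the job unless it was already aggregated and reAggregate is off.
   * Returns true if aggregation was performed.
   */
  boolean aggregate(String jobId, String appId, boolean reAggregate) {
    boolean alreadyDone = aggregationStatus.getOrDefault(jobId, false);
    if (alreadyDone && !reAggregate) {
      return false; // skip: the job has been counted before
    }
    // Not idempotent: every call that reaches this line adds to the total again.
    jobCounts.merge(appId, 1L, Long::sum);
    aggregationStatus.put(jobId, true);
    return true;
  }

  long jobCount(String appId) {
    return jobCounts.getOrDefault(appId, 0L);
  }
}
```

In this sketch, forcing re-aggregation of a job that was already counted adds it a second time, which is the kind of inadvertent double counting the status column is meant to prevent.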

It also implements check-and-put methods, with retries, for the queue list, job cost, and number of runs in the info column family. The intention is to update these columns carefully, since multiple tasks or jobs may be updating them for the same app for that day or week. More details are in the code comments.
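
As a rough illustration of the check-and-put-with-retries idea (an assumption-laden sketch, not the code in this pull request), the example below uses a `ConcurrentHashMap` as a stand-in for an HBase column. The class `CheckAndPutSketch`, the row key format, and the retry count are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory stand-in for one column of the info column family. */
class CheckAndPutSketch {
  private final ConcurrentMap<String, Long> column = new ConcurrentHashMap<>();

  /**
   * Atomically writes newValue only if the stored value equals expected.
   * A null expected value means "the cell must not exist yet".
   */
  boolean checkAndPut(String row, Long expected, long newValue) {
    if (expected == null) {
      return column.putIfAbsent(row, newValue) == null;
    }
    return column.replace(row, expected, newValue);
  }

  /**
   * Read, compute, then check-and-put; retry if another writer changed the
   * cell in between. Returns false if every attempt lost the race.
   */
  boolean incrementWithRetries(String row, long delta, int maxRetries) {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      Long current = column.get(row);
      long updated = (current == null ? 0L : current) + delta;
      if (checkAndPut(row, current, updated)) {
        return true;
      }
    }
    return false;
  }

  Long get(String row) {
    return column.get(row);
  }
}
```

The point of the pattern is that a write based on a stale read is rejected rather than silently overwriting another task's update, and the caller then re-reads and tries again.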

# the s column family has a TTL of 30 days, it's used as a scratch col family
# it stores the run ids that are seen for that day
# we assume that a flow will not run for more than 30 days, hence it's fine to "expire" that data
create 'hraven_agg_daily', {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
Collaborator

Any reason you used "hraven_..." as the names of these new tables? Doesn't look like that is the naming convention followed by other tables?

Collaborator Author

I think an hraven prefix for table names is the better way forward, since the HBase datastore can also contain tables from other applications.
Renaming the existing tables now may be more complex, since HBase does not have a rename command. The recommended way has a few steps, as listed at http://hbase.apache.org/book/table.rename.html
We would need to disable the table and use clone_snapshot to do it:

hbase shell> disable 'tableName'
hbase shell> snapshot 'tableName', 'tableSnapshot'
hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
hbase shell> delete_snapshot 'tableSnapshot'
hbase shell> drop 'tableName'

Collaborator Author

I modified the table names to be prefixed with job_history so that they are more consistent with the other hRaven tables.

@coveralls

Coverage Status

Changes Unknown when pulling ba338ec on vrushalic:etl_aggregate_2 into twitter:master.

* in daily or weekly aggregation table
* @param {@link JobDetails}
*/
public Boolean aggregateJobDetails(JobDetails jobDetails,
Collaborator

Is returning the object type Boolean necessary? I think it is just fine to use the primitive type (boolean). Using the object would unnecessarily cause boxing and unboxing and sometimes cause subtle bugs.
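
One example of the subtle bugs mentioned here, sketched under the assumption of a hypothetical method that returns the wrapper type: a `Boolean` can be `null`, and auto-unboxing it throws a `NullPointerException` at the call site. The class and method names below are illustrative only.

```java
/** Illustrative only; these methods are not part of the pull request. */
class BoxedBooleanPitfalls {
  /** Wrapper return type: null can slip through as a third state. */
  static Boolean boxedResult(boolean fail) {
    return fail ? null : Boolean.TRUE;
  }

  /** Primitive return type: only true or false are possible. */
  static boolean primitiveResult(boolean fail) {
    return !fail;
  }

  /** Returns true if unboxing the wrapper result threw a NullPointerException. */
  static boolean unboxingThrows() {
    try {
      boolean ok = boxedResult(true); // auto-unboxing of null throws here
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }
}
```

With the primitive return type the compiler rules out the null case entirely, so callers need no defensive check.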

newAppsKeys = createNewAppKeysFromResults(scan, startTime, endTime, limit);
} catch (IOException e) {
LOG.error("Caught exception while trying to scan, returning empty list of flows: "
+ e.toString());
Collaborator

Although it is existing code, it makes me wonder. If this threw an exception, can you proceed to the next? I would think newAppsKeys is null, and you'd get a NullPointerException in line 117.

Collaborator Author

I looked through the code again. It looks like the function createNewAppKeysFromResults will always return a non-null list (either empty or populated). I also see some "Long" objects there that can be changed to "long".
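
A general way to make the exception path safe regardless of what the callee returns is to initialise the result to an empty list before the `try`. The sketch below is an assumption for illustration, not the surrounding hRaven code: `scanKeys` and `keysOrEmpty` are hypothetical names, and the scan failure is simulated.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only; a simulated scan stands in for reading keys from a table. */
class EmptyListOnFailure {
  static List<String> scanKeys(boolean fail) throws IOException {
    if (fail) {
      throw new IOException("simulated scan failure");
    }
    List<String> keys = new ArrayList<>();
    keys.add("cluster!user!app");
    return keys;
  }

  /** Never returns null: the list starts empty and is replaced only on success. */
  static List<String> keysOrEmpty(boolean fail) {
    List<String> keys = Collections.emptyList();
    try {
      keys = scanKeys(fail);
    } catch (IOException e) {
      System.err.println("Scan failed, returning empty list: " + e);
    }
    return keys;
  }
}
```

Callers can then iterate over the result directly, whether or not the scan succeeded.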

@coveralls

Coverage Status

Changes Unknown when pulling 71d4a5b on vrushalic:etl_aggregate_2 into twitter:master.

* name of the flag that determines whether or not re-aggregate
* (overrides aggregation status in raw table for that job)
*/
public static String RE_AGGREGATION_FLAG_NAME = "reaggregate";
Collaborator

final?

@sjlee
Collaborator

sjlee commented Nov 21, 2014

LGTM. Thanks for your patience @vrushalic! It seems like the travis build is stuck for some reason. It might be good to get a green build. We can merge this once @jrottinghuis gives his +1.

@vrushalic
Collaborator Author

Yes, that build isn't even starting up. I tried cancelling and restarting it, and I will keep an eye on it. If nothing changes by this evening, I will push a trivial commit (such as a comment update) to trigger another build on this branch.

@coveralls

Coverage Status

Changes Unknown when pulling 9661ad2 on vrushalic:etl_aggregate_2 into twitter:master.

@vrushalic
Collaborator Author

The Travis CI build has passed and is green.

jrottinghuis added a commit that referenced this pull request Dec 8, 2014
Issue #97, #98: Aggregate job info per day and per week as part of JobPr...
@jrottinghuis merged commit a7dd908 into twitter:master on Dec 8, 2014
4 participants