Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Added hRaven integration. #117

Closed
wants to merge 3 commits into
from

Conversation

Projects
None yet
5 participants
Contributor

twdima commented Jan 10, 2014

Adding HRavenStatsWriteService , that can be used with any of the existing "notifiers" to send statistics over to hRaven.
For now, I only "plugged" it into the cascading/scalding modules, but the implementation is generic enough, so that integrating with Pig (and hive) should be trivial. So, I put the service itself into ambrose-common to be available for future integrations (which may, probably, be able to benefit form the AmbroseCascadingNotifierFactory implementation I also added, which can be generalized and moved to common at that point as well).

@sagemintblue sagemintblue and 1 other commented on an outdated diff Jan 10, 2014

<mortbay.jetty.version>6.1.25</mortbay.jetty.version>
<mortbay.jetty.servlet.version>2.5-20081211</mortbay.jetty.servlet.version>
+ <hraven.version>0.9.6</hraven.version>
@sagemintblue

sagemintblue Jan 10, 2014

Member

consider s/hraven.version/twitter.hraven.version/

@twdima

twdima Jan 10, 2014

Contributor

done

@billonahill billonahill commented on the diff Jan 10, 2014

.../java/com/twitter/ambrose/cascading/CascadingJob.java
@@ -43,7 +43,10 @@
* It represents FlowStepJob and all related job metrics will be captured using HadoopStepStats class
* @author Ahmed Mohsen
*/
-@JsonTypeName("cascading")
+
+// FIXME: this makes this deserialize into PigJob on the hraven side, so that it does not need
+// to be rebuilt to work with cascading
+@JsonTypeName("pig")
@billonahill

billonahill Jan 10, 2014

Contributor

This need to be cascading. If that requires an hRaven update first, then lets get a PR for hRaven in the queue.

@dvryaboy

dvryaboy Jan 10, 2014

Contributor

Bill, we should get an hRaven issue going to ublock fixing this. Do you think the issue should block us from merging this change?

@twdima

twdima Jan 10, 2014

Contributor

Frankly, I don't think it should be either (cascading or pig) :)
PigJob and CascadingJob are virtually identical, I think, the "right" thing to do here is to make them use the same class.

I don' t think, there is anything that needs to be changed in hRaven for this. It looks like there is some code in science that depends on these annotations. I think, there is another iteration needed here to clean it up completely.

Until then, I thought, it would be nice if this could be made to work with a relatively small and localized "hack" like this ... (now seems like a good week for it ;))

@billonahill

billonahill Jan 10, 2014

Contributor

If we don't need an hRaven change then that makes things easier. CascadingJob and PigJob can both extend from a common type compose similarly, but we need a json distinction between the two. The client needs to know what type of job it has and this is the mechanism. Not all features will apply to different job types. Shipping this as a hack with Cascading in Pig's clothing is not a good idea imo.

@twdima

twdima Jan 10, 2014

Contributor

Can you explain why you think they need to be distinct though? Especially, from the json (client-side) standpoint, they seem to be entirely identical, as in having the same lists of properties (PigJob has two more right now, but, I don't think there is a real reason why they should not be present in cascading as well).

@billonahill

billonahill Jan 13, 2014

Contributor

It's fine if the json ends up looking similar, but each runtime can have support functionality (either now or in the future), so they should advertise themselves properly. The client should know what runtime is being used so view functionality can be registered for one runtime and not another. Also, deserializing a pig type back to java requires the pig classes, which would mean cascading would rely on pig to deseriaize back to java.

The classes need to be distinct since there is a 1:1 in jackson from JsonTypeName to the class that the json deserializes to. They can extend from a common super-class (i.e. MRJob) if that makes sense.

@sagemintblue sagemintblue and 1 other commented on an outdated diff Jan 10, 2014

...mbrose/cascading/AmbroseCascadingNotifierFactory.java
+package com.twitter.ambrose.cascading;
+
+import cascading.flow.Flow;
+import cascading.flow.FlowProcess;
+import cascading.flow.hadoop.HadoopFlowProcess;
+import org.apache.hadoop.conf.Configuration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+import java.util.*;
+
+/** Select and configure the type of listener to commincate cascading job progress to ambrose.
@sagemintblue

sagemintblue Jan 10, 2014

Member

communicate

@twdima

twdima Jan 10, 2014

Contributor

done

@billonahill billonahill and 1 other commented on an outdated diff Jan 10, 2014

...mbrose/cascading/AmbroseCascadingNotifierFactory.java
+ * and then in /ambrose-cascading.properties, the latter taking precedence for properties defined
+ * in both places. Additionally, if a system property ambrose.properties and/or AMBROSE_PROPERTIES
+ * environment variable are set to name(s) of exiting file(s), those are also loade.
+ * Finally, AMBROSE_PROPERTIES_OVERRIDE can be set to a semi-colon separated string of overrides:
+ * "foo.bar=baz;foo.baz=bar;" etc. (System properties are a better way to do this, but hadoop
+ * command isn't passing them into the app from comman line
+ *
+ * All properties will be copied into the job configuration object, passed to the notifier.
+ * One important property that must be set if using HRaven connector is batch.desc, which is the
+ * application id, required by HRaven. Cascading does not set it the same way Pig does, so, it
+ * must be set explicitly in the properties. Set it to a descriptive name of the job being run.
+ */
+
+public class AmbroseCascadingNotifierFactory {
+
+ private static Logger LOG = LoggerFactory.getLogger(AmbroseCascadingNotifierFactory.class);
@billonahill

billonahill Jan 10, 2014

Contributor

Standard for this project is 2 space indents in java code.

@twdima

twdima Jan 10, 2014

Contributor

done (I think :))

@sagemintblue sagemintblue commented on an outdated diff Jan 11, 2014

...ter/ambrose/service/impl/HRavenStatsWriteService.java
+import com.twitter.hraven.datasource.FlowEventService;
+import com.twitter.hraven.datasource.FlowQueueService;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.twitter.ambrose.model.DAGNode;
+import com.twitter.ambrose.model.Event;
+import com.twitter.ambrose.model.Job;
+import com.twitter.ambrose.model.WorkflowId;
+import com.twitter.ambrose.service.StatsWriteService;
+import com.twitter.ambrose.util.JSONUtil;
+
+/** StatsWriteService to send statistics to HRaven
+ * Requires that HRAVEN_HBASE_CONF_DIR environment variable be set to point to
+ * to a directory with hbase-site.xml for configuration.
+ * Also, JoobConf must contain a batch.desc property for the job name, that will

@sagemintblue sagemintblue commented on an outdated diff Jan 11, 2014

...ter/ambrose/service/impl/HRavenStatsWriteService.java
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.twitter.ambrose.model.DAGNode;
+import com.twitter.ambrose.model.Event;
+import com.twitter.ambrose.model.Job;
+import com.twitter.ambrose.model.WorkflowId;
+import com.twitter.ambrose.service.StatsWriteService;
+import com.twitter.ambrose.util.JSONUtil;
+
+/** StatsWriteService to send statistics to HRaven
+ * Requires that HRAVEN_HBASE_CONF_DIR environment variable be set to point to
+ * to a directory with hbase-site.xml for configuration.
+ * Also, JoobConf must contain a batch.desc property for the job name, that will
+ * be used as a part of workflowid.
+ * The subclass and override getJobConf() to return a configuration object (for lazy
@sagemintblue

sagemintblue Jan 11, 2014

Member

Subclass and override...

@sagemintblue sagemintblue commented on an outdated diff Jan 11, 2014

...ter/ambrose/service/impl/HRavenStatsWriteService.java
+import org.slf4j.LoggerFactory;
+
+import com.twitter.ambrose.model.DAGNode;
+import com.twitter.ambrose.model.Event;
+import com.twitter.ambrose.model.Job;
+import com.twitter.ambrose.model.WorkflowId;
+import com.twitter.ambrose.service.StatsWriteService;
+import com.twitter.ambrose.util.JSONUtil;
+
+/** StatsWriteService to send statistics to HRaven
+ * Requires that HRAVEN_HBASE_CONF_DIR environment variable be set to point to
+ * to a directory with hbase-site.xml for configuration.
+ * Also, JoobConf must contain a batch.desc property for the job name, that will
+ * be used as a part of workflowid.
+ * The subclass and override getJobConf() to return a configuration object (for lazy
+ * initialization) or use @link HRavenStatsWriteService#forJob to instantiate a
@sagemintblue

sagemintblue Jan 11, 2014

Member

or use {@link ...} to instantiate

@sagemintblue sagemintblue commented on an outdated diff Jan 11, 2014

...mbrose/cascading/AmbroseCascadingNotifierFactory.java
+ String overrides = System.getenv("AMBROSE_PROPERTIES_OVERRIDE");
+ if(overrides != null) {
+ for(StringTokenizer st = new StringTokenizer(overrides, ";"); st.hasMoreElements();) {
+ String kv[] = st.nextToken().trim().split("=");
+ if(kv.length == 2) props.put(kv[0], kv[1]);
+ }
+ }
+
+ return props;
+ }
+
+ public static AmbroseCascadingNotifier createNotifier(Flow flow) {
+ String clazz = System.getenv("AMBROSE_NOTIFIER_CLASS");
+ try {
+ Properties props = readProperties();
+ if(clazz == null) { clazz = props.getProperty("ambrose.notifer.class"); }

@aniket486 aniket486 referenced this pull request Apr 29, 2014

Merged

Add hraven integration #134

@aniket486 aniket486 closed this May 9, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment