Eagle Eye

An app built on Cloudera Enterprise for tracking metrics of jobs that run in the YARN framework. These runtime metrics are valuable because they can be used for:

  • real-time monitoring: check whether the most critical jobs have been allocated enough resources;
  • resource usage summary reporting at job/queue granularity: get deeper insights into your cluster;
  • resource allocation snapshots: quickly check which user, job, or queue is hogging resources during any time period.

Cloudera Manager, the Hadoop administration console, fulfills almost all administration requirements, e.g. monitoring overall cluster resource usage, service/host status, etc.

[figure: Cloudera Manager KPI dashboard]

In some circumstances, customers need to trace back and check job status within a specific time window. Currently, they usually go to the "yarn application" console:

[figure: YARN application console]

However, only summarized information is available there, and it is NOT straightforward to reconstruct what happened at a specific point in time. A better solution is to capture the behavior of each job over its whole lifecycle, index it for searching, and make it available to other analytic applications.

1. Architecture

[figure: high-level architecture diagram]

The high-level diagram above illustrates a simple setup. More specifically, the Eagle Eye service retrieves metrics for all running applications and sends them to a Kafka channel, which in turn passes them to a MorphlineSolr sink in a Flume agent. The morphline parses the messages, converts them into Solr documents, and sends them to the Solr server. Once the indexed documents appear in Solr, dashboards (e.g. in Hue) can be built for visualization. The metrics can also be sent to Spark (and/or Spark Streaming) for more advanced analysis.

If no Kafka brokers are deployed yet, the Eagle Eye service can also send the metrics to Flume directly (through an HTTP source).
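
As a rough orientation, below is a minimal sketch of what the service's core loop could look like, assuming the kafka-clients producer API; the class name and error handling are illustrative, not the actual Eagle Eye implementation.

import java.io.InputStream;
import java.net.URL;
import java.util.Properties;
import java.util.Scanner;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EagleEyeLoop {
    public static void main(String[] args) {
        // Kafka producer pointed at kafka_broker_list from the properties file
        Properties props = new Properties();
        props.put("bootstrap.servers", "172.31.13.76:9092,172.31.13.77:9092,172.31.13.78:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        final KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Poll the RM REST API every schedule_period seconds, publish to the eagleeye topic
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable() {
            @Override public void run() {
                try (InputStream in = new URL(
                        "http://172.31.13.75:8088/ws/v1/cluster/apps?states=RUNNING").openStream();
                     Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                    String payload = s.hasNext() ? s.next() : "";
                    producer.send(new ProducerRecord<String, String>("eagleeye", payload));
                } catch (Exception e) {
                    e.printStackTrace(); // keep polling across transient RM/Kafka hiccups
                }
            }
        }, 0, 5, TimeUnit.SECONDS);
    }
}

A real implementation would extract per-application metrics from the payload first (see the breakdown below) instead of forwarding the raw response.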

Application metrics (Resource Manager REST API) breakdown

<app>
    <id>application_1326815542473_0001</id>
    <user>user1</user>
    <name>word count</name>
    ...
    <allocatedMB>0</allocatedMB>
    <allocatedVCores>0</allocatedVCores>
    <runningContainers>0</runningContainers>
    <memorySeconds>151730</memorySeconds>
    <vcoreSeconds>103</vcoreSeconds>
</app>

For simplicity, we extract only allocatedMB and allocatedVCores to track the memory/CPU usage of a job.
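
As a hedged illustration of that extraction, the sketch below reads the JSON form of the same endpoint with Jackson (the real service may well parse differently; the field names follow the REST API response):

import java.net.URL;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class AppMetricsReader {
    public static void main(String[] args) throws Exception {
        // The endpoint returns {"apps":{"app":[{...}, ...]}} when serving JSON
        JsonNode root = new ObjectMapper().readTree(
                new URL("http://172.31.13.75:8088/ws/v1/cluster/apps?states=RUNNING"));
        for (JsonNode app : root.path("apps").path("app")) {
            System.out.printf("%s user=%s queue=%s allocatedMB=%d allocatedVCores=%d%n",
                    app.path("id").asText(),
                    app.path("user").asText(),
                    app.path("queue").asText(),
                    app.path("allocatedMB").asLong(),
                    app.path("allocatedVCores").asLong());
        }
    }
}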

2. Prerequisites

Given the requirements discussed above, we need at least Solr, Kafka, and Flume. Fortunately, all of them already ship in CDH (with professional support), so any Java developer with Cloudera Enterprise at hand can implement this prototype from scratch within a day. All you have to do is write some simple Java code and the morphline/Flume configurations.

Deploy Kafka

Add the Kafka service in the Cloudera Manager Admin console

[figure: adding the Kafka service in Cloudera Manager]

Create a topic for eagle eye

bin/kafka/create_topic.sh
After creating the topic, run bin/kafka/list_topics.sh to verify that it was created correctly; the output should look like:
eagleeye

Deploy Solr

Add the Solr service in the Cloudera Manager Admin console

[figure: adding the Solr service in Cloudera Manager]

bin/solr/create_collection.sh 
create solr collection...
Uploading configs from /tmp/eagleeye_configs/conf to 172.31.13.75:2181/solr. This may take up to a minute.

Later, you may want to purge old Solr data for storage/query efficiency, e.g. via Solr collection aliases.
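
One common pattern for such purging (an assumption here, not something shown in this repo) is to index into time-partitioned collections and keep a query alias pointing at the recent ones, so that expired collections can simply be deleted. A sketch using Solr's Collections API, with hypothetical collection names:

import java.io.InputStream;
import java.net.URL;

public class SolrAliasRoller {
    public static void main(String[] args) throws Exception {
        // Point the query alias at the two most recent monthly collections;
        // anything that falls out of the alias can later be dropped with action=DELETE
        URL createAlias = new URL("http://172.31.13.75:8983/solr/admin/collections"
                + "?action=CREATEALIAS&name=eagleeye_recent"
                + "&collections=eagleeye_2016_01,eagleeye_2016_02");
        try (InputStream in = createAlias.openStream()) {
            // Solr replies with its standard status payload; ignored in this sketch
        }
    }
}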

Deploy Flume

To consume metrics from Kafka, you may use a configuration like:

a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.morphlineFile = bin/flume/morphline/morphline-json.conf
...
# Use a channel which buffers events in memory
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.parseAsFlumeEvent = false
...

This Flume agent consumes data from Kafka directly and leverages the morphline to index the data in Solr in near real time (NRT). To start Flume agent a1, run the command below:

bin/flume/ng-flume.sh

To receive metrics directly from the Eagle Eye service, you may use a configuration like:

a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.morphlineFile = bin/flume/morphline/morphline-csv.conf
...
a1.sources.r1.type = http
a1.sources.r1.port = 4140
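
The HTTP source's default handler (JSONHandler) expects a POST body that is a JSON array of {headers, body} events. Under that assumption, the producer=http/format=csv path could look like the sketch below; the CSV column order is illustrative only.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FlumeHttpPublisher {
    public static void main(String[] args) throws Exception {
        // One Flume event whose body carries a single CSV metrics record
        String csv = "2016-02-04T01:54:08Z,root.hive,1024,1,application_1454403433606_0061,hive";
        byte[] payload = ("[{\"headers\":{},\"body\":\"" + csv + "\"}]").getBytes("UTF-8");

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://172.31.13.75:4140").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload);
        }
        System.out.println("Flume HTTP source returned " + conn.getResponseCode()); // 200 on success
    }
}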

3. Eagle Eye Setup & Demonstration

Let's first look at the properties Eagle Eye uses to configure the app dynamically:

# define resource manager address where to retrieve application information
yarn_applications_path=/ws/v1/cluster/apps?states=RUNNING
rm_address=http://172.31.13.75:8088

# define the sampling frequency, the format of metrics, and how to publish metrics
schedule_period=5
producer=http,logger
# producer=kafka,logger
format=csv
# format=json

# define the kafka broker and topic properties
kafka_topic=eagleeye
kafka_broker_list=172.31.13.76:9092,172.31.13.77:9092,172.31.13.78:9092

# define the http server
http_server=http://172.31.13.75:4140
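
For completeness, here is a sketch of how the app might consume this file with java.util.Properties (the file path is an assumption):

import java.io.FileInputStream;
import java.util.Properties;

public class EagleEyeConfig {
    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        try (FileInputStream in = new FileInputStream("conf/eagleeye.properties")) {
            conf.load(in);
        }
        // Settings derived from the sample above, with matching defaults
        String appsUrl = conf.getProperty("rm_address")
                + conf.getProperty("yarn_applications_path");
        int periodSeconds = Integer.parseInt(conf.getProperty("schedule_period", "5"));
        String[] producers = conf.getProperty("producer", "logger").split(",");
        System.out.printf("polling %s every %ds via %d producer(s)%n",
                appsUrl, periodSeconds, producers.length);
    }
}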

By now, the environment for Eagle Eye is all set, and you can start the app with the following command:

bin/eagleeye.sh 
16/02/04 01:40:28 INFO solutions.Main: Eagle eye starts monitoring ...

Eagle Eye is now good to go. You can run bin/job-submit.sh to submit several jobs on behalf of different users, e.g. hue, hive, and impala. Every 5 seconds (the schedule_period above), Eagle Eye pulls information from the Resource Manager, parses it, and publishes the resulting metrics via the configured producers (Kafka and/or the Flume HTTP source). While it runs, you may see console logs like:

16/02/04 01:54:09 INFO solutions.Publisher: [{"timestamp":"2016-02-04T01:54:08Z","queue":"root.hive","allocatedMB":1024,"allocatedVCores":1,"applicationId":"application_1454403433606_0061","user":"hive","name":"QuasiMonteCarlo","id":"application_1454403433606_0061$1454550848874"}]

To visualize the metrics, we use Hue, the easiest way to build custom dashboards:

[figure: Hue search dashboard]

Since the data all live in CDH behind standard/open protocols, you can easily retrieve them and build more dedicated dashboards, e.g. using the d3 library.

[figure: custom d3 dashboard]
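
Since Solr speaks plain HTTP, feeding a d3 page is one query away. A hedged example of pulling the most recent metrics as JSON, assuming the collection and field names from the log sample above:

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

public class SolrQueryExample {
    public static void main(String[] args) throws Exception {
        // Fetch the 100 newest metric documents for user hive; wt=json suits a d3 front end
        URL query = new URL("http://172.31.13.75:8983/solr/eagleeye/select"
                + "?q=user:hive&sort=timestamp+desc&rows=100&wt=json");
        try (InputStream in = query.openStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}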

4. References

Flume MorphlineSolrSink: https://flume.apache.org/FlumeUserGuide.html
Cloudera Distribution of Apache Kafka: http://www.cloudera.com/documentation/kafka/latest/topics/kafka_packaging.html
YARN ResourceManager REST API: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
