Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Data Management + Feed Processing Platform over Hadoop
branch: master
Failed to load latest commit information.
acquisition changed version to 0.2-SNAPSHOT
archival changed version to 0.2-SNAPSHOT
client Implementation for supporting feed replication with delays, exposed t…
common Replaced reference to MiniDFSCluster with EmbeddedCluster on all tests
docs Invalid example XML fragment
feed Fix, oozie cannot resolve input-events if the resolved path is before…
messaging Replaced reference to MiniDFSCluster with EmbeddedCluster on all tests
metrics Clean up of repository and dependency management with versions in sup…
monitoring Granting license to ASF
oozie update fix which was created multiple active coord
prism Fixed 1. Blacklisting superusers, hdfs, mapred, oozie, and ivory.
process Replaced reference to MiniDFSCluster with EmbeddedCluster on all tests
replication Add options to bandwidth options to FeedReplicator options
rerun Fix issue with in-memory retry queue implementation #2
retention Fixed 1. Blacklisting superusers, hdfs, mapred, oozie, and ivory.
test-util Replaced reference to MiniDFSCluster with EmbeddedCluster on all tests
webapp Fixed 1. Blacklisting superusers, hdfs, mapred, oozie, and ivory.
.gitignore Clean up of repository and dependency management with versions in sup…
Installation-steps.txt Added oozie download path
LICENSE.txt Move metrics to a separate module
README Removing unnecessary libs from war
backlog latest backlog
pom.xml Allow bandwidth customization

README

Ivory Overview

Ivory is a feed processing and feed management system aimed at making it
easier for end consumers to onboard their feed processing and feed
management on hadoop clusters.

Why?

* Dependencies across various data processing pipelines are not easy to
  establish. Gaps here typically leads to either incorrect/partial
  processing or expensive reprocessing. Repeated duplicate definition of
  a single feed multiple times can lead to inconsistencies / issues.

* Input data may not arrive always on time and it is required to kick off
  the processing without waiting for all data to arrive and accommodate
  late data separately

* Feed management services such as feed retention, replications across
  clusters, archival etc are tasks that are burdensome on individual
  pipeline owners and better offered as a service for all customers.

* It should be easy to onboard new workflows/pipelines

* Smoother integration with metastore/catalog

* Provide notification to end customer based on availability of feed
  groups (logical group of related feeds, which are likely to be used
  together)

Usage

a. Setup cluster definition
   $IVORY_HOME/bin/ivory entity -submit -type cluster -file /cluster/definition.xml -url http://ivory-server:ivory-port

b. Setup feed definition
   $IVORY_HOME/bin/ivory entity -submit -type feed -file /feed1/definition.xml -url http://ivory-server:ivory-port
   $IVORY_HOME/bin/ivory entity -submit -type feed -file /feed2/definition.xml -url http://ivory-server:ivory-port

c. Setup process definition
   $IVORY_HOME/bin/ivory entity -submit -type process -file /process/definition.xml -url http://ivory-server:ivory-port

d. Once submitted, entity definition, status and dependency can be queried.
   $IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -name <<name>> [-definition|-status|-dependency] -url http://ivory-server:ivory-port

   or entities for a particular type can be listed through
   $IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -list

e. Schedule process
   $IVORY_HOME/bin/ivory entity  -type process -name process -schedule -url http://ivory-server:ivory-port

f. Once scheduled entities can be suspended, resumed or deleted (post submit)
   $IVORY_HOME/bin/ivory entity  -type [cluster|feed|process] -name <<name>> [-suspend|-delete|-resume] -url http://ivory-server:ivory-port

g. Once scheduled process instances can be managed through irovy CLI
   $IVORY_HOME/bin/ivory instance -processName <<name>> [-kill|-suspend|-resume|-re-run] -start "yyyy-MM-dd'T'HH:mm'Z'" -url http://ivory-server:ivory-port

Example configurations

Cluster:

<?xml version="1.0"?>
<!--
    Production cluster configuration
  -->

<cluster colo="ua2" description="" name="staging-red" xmlns="uri:ivory:cluster:0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <interfaces>
        <interface type="readonly" endpoint="hftp://gsgw1001.red.ua2.inmobi.com:50070"
                   version="0.20.2-cdh3u0" />

        <interface type="write" endpoint="hdfs://gsgw1001.red.ua2.inmobi.com:54310"
                   version="0.20.2-cdh3u0" />

        <interface type="execute" endpoint="gsgw1001.red.ua2.inmobi.com:54311" version="0.20.2-cdh3u0" />

        <interface type="workflow" endpoint="http://gs1134.blue.ua2.inmobi.com:11000/oozie/"
                   version="3.1.4" />

        <interface type="messaging" endpoint="tcp://gs1134.blue.ua2.inmobi.com:61616?daemon=true"
                   version="5.1.6" />
    </interfaces>

    <locations>
        <location name="staging" path="/projects/ivory/staging" />
        <location name="temp" path="/tmp" />
        <location name="working" path="/projects/ivory/working" />
    </locations>

    <properties/>

</cluster>

Feed:

<?xml version="1.0" encoding="UTF-8"?>
<!--
    Hourly ad carrier summary. Generated by hourly processing of rr logs
  -->

<feed description="RRHourlyAdCarrierSummary" name="RRHourlyAdCarrierSummary" xmlns="uri:ivory:feed:0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <partitions/>

    <groups>rmchourly</groups>

    <frequency>hours</frequency>
    <periodicity>1</periodicity>

    <late-arrival cut-off="hours(6)" />

    <clusters>
        <cluster name="staging-red" type="source">
            <validity start="2009-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC" />
            <retention limit="months(24)" action="delete" />
        </cluster>
    </clusters>

    <locations>
        <location type="data" path="/projects/bi/rmc/rr/${YEAR}-${MONTH}-${DAY}-${HOUR}.concat/HourlyAdCarrierSummary" />
        <location type="stats" path="/none" />
        <location type="meta" path="/none" />
    </locations>

    <ACL owner="rmcuser" group="users" permission="0755" />

    <schema location="/none" provider="none" />

    <properties/>

</feed>


Process:

<?xml version="1.0" encoding="UTF-8"?>
<!--
    RMC Daily process, produces 34 new feeds
 -->
<process name="rmc-daily">
	<cluster name="staging-red" />

	<frequency>days(1)</frequency>

	<validity start="2012-04-03T06:00Z" end="2022-12-30T00:00Z" timezone="UTC" />

	<inputs>
		<input name="WapAd" feed="WapAd" start="today(0,0)" end="today(0,0)" />
	</inputs>

	<outputs>
	        <output name="TrafficDailyAdSiteSummary" feed="TrafficDailyAdSiteSummary" instance="yesterday(0,0)" />
	</outputs>

	<properties>
		<property name="lastday" value="${formatTime(yesterday(0,0), 'yyyy-MM-dd')}" />
	</properties>

	<workflow engine="oozie" path="/projects/bi/rmc/pipelines/workflow/rmcdaily" />

	<retry policy="backoff" delay="5" delayUnit="minutes" attempts="3" />
</process>
Something went wrong with that request. Please try again.