I Like Tez, DevOps Edition (WIP)

I work on Tez, so it would be hard not to like Tez. There's a reason for that too: whenever Tez does something I don't like, I can put my back into it and shove Tez towards the straight & narrow path.

Just before Hortonworks, I was part of the ZCloud division in Zynga - the casual disregard devs there had for operations hurt my sleep cycle and general peace of mind. I know they're chasing features, but whenever someone puts in a change that takes actual work to roll back, I cringe. And I like how Tez doesn't make the same mistakes here.

First of all, you don't install "Tez" on a cluster. The cluster runs YARN, which means two very important things.

There is no "installing Tez" on your 350 nodes and waiting for it to start up. You throw a few jars into an HDFS directory and write tez-site.xml on exactly one machine pointing to that HDFS path.

This means several important things for a professional deployment of the platform. There are no real pains around rolling upgrades, because there is nothing to restart - all existing queries use the old version, and all new queries automatically use the new version. This is particularly relevant for a round-the-clock data-insertion pipeline, but perhaps not for a BI-centric service where you can bounce it pretty quickly after emailing a few people.
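
In practice, a rolling upgrade is roughly an upload and a path flip - a sketch with hypothetical directory names:

```sh
# upload the new build alongside the old one
hadoop fs -mkdir -p /apps/tez-0.4.1
hadoop fs -put tez-0.4.1/*.jar /apps/tez-0.4.1/

# then point tez.lib.uris in tez-site.xml at /apps/tez-0.4.1/ -
# in-flight DAGs keep the old jars, new submissions pick up the new ones
```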

Letting you run different versions of Tez at the same time is very different from how MR used to behave. On a day-to-day basis, this helps me share a multi-tenant dev environment and improves the overall quality of my work - I test everything I write on a big cluster, without worrying about whether I'll nuke anyone else's Tez builds.
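
For example, to test a private build without touching anyone else's setup, the path can be overridden per client session - a sketch assuming a Hive client and a hypothetical per-user directory:

```sh
# each developer points at their own jars; nobody else's queries notice
hive --hiveconf tez.lib.uris=hdfs:///user/gopal/tez-dev/
```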

Next up, I like how Tez handles failure. You can lose connectivity to half your cluster and the tasks will keep running, perhaps a bit slowly. YARN takes care of bad nodes - nodes with disk failures or any such hiccup that is normal when you're maxing out 400+ nodes all day long. And coming from the MR school of thought, the task failure scenario is easily covered with re-execution mechanisms.
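
The re-execution behaviour is plain configuration - something like the following knobs in tez-site.xml (values illustrative) control it:

```xml
<property>
  <name>tez.am.task.max.failed.attempts</name>
  <value>4</value> <!-- re-run a task up to 4 attempts before failing the DAG -->
</property>
<property>
  <name>tez.am.max.app.attempts</name>
  <value>2</value> <!-- let YARN restart the AM itself once on failure -->
</property>
```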

There's something important to be covered here with failure. For any task attempt that accidentally kills a container (like a bad UDF with a memory leak), there is no loss of previously committed data, because the data already committed by a task is not served out of the container at all. The NodeManager serves all the data across the cluster with its own secure shuffle handlers. As long as the NodeManager is running, you could kill the existing containers on that node and hand off that capacity to another task.
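
That shuffle handler is the standard auxiliary service every NodeManager registers in yarn-site.xml:

```xml
<!-- yarn-site.xml: the NodeManager-hosted shuffle service Tez reuses -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```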

This is very important for busy clusters, because as the aphorism goes "The difference between time and space is that you can re-use space". I guess the same applies to a container holding onto an in-memory structure, waiting for its data to be pulled off to another task.

And any hadoop-2 installation already has NodeManager alerts/restarts coded in, without needing any new devops work to bring errant nodes back online.

This brings me to the next bit of error tolerance in the system - the ApplicationMaster. The old problem with hadoop-1.x was that the JobTracker was a single point of failure for every job. With YARN, that went away entirely, with each application type getting its own purpose-built ApplicationMaster.

Now, most applications do not want to write up all the bits and bobs required to run their own ApplicationMaster. Something like Hive could've built its own ApplicationMaster (or rather, we could've built it as part of our perf effort) - after all, Storm did, HBase did and so did Giraph.

The vision of Tez is that there's a possible generalization of the problem. Just like MR was a simple distribution mechanism for a bi-partite graph which spawned a huge variety of tools, there exists a way to express more complex graphs in a generic way, building a new assembly language for data-driven applications.

Make no mistake, Tez is an assembly language at its very core. It is raw and expressive, but it is an expert's tool, meant to be wielded by compiler developers catering to a tool userland. Pig and Hive already have compilers targeting this new backend. Cascading and then Scalding will add some API fun to the mix, but the framework sits below all of those and consolidates everyone's efforts into a common rich baseline for performance. And there's a MapReduce compiler for Tez hidden away as well, which often gets ignored.
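
To give a flavour of how raw that assembly is, here's a rough sketch of wiring a two-vertex shuffle by hand - the processor classes and the descriptor/resource variables are hypothetical, and the names approximate the Tez Java API of this era:

```java
// a sketch, not a working program: TokenProcessor/SumProcessor,
// taskResource and the shuffle descriptors are assumed to exist
DAG dag = new DAG("wordcount");

Vertex tokenizer = new Vertex("tokenizer",
    new ProcessorDescriptor(TokenProcessor.class.getName()), numMappers, taskResource);
Vertex summer = new Vertex("summer",
    new ProcessorDescriptor(SumProcessor.class.getName()), numReducers, taskResource);

// the edge spells out *how* data moves: scatter-gather (a shuffle)
// over persisted output, i.e. data served by the NodeManager afterwards
Edge shuffle = new Edge(tokenizer, summer,
    new EdgeProperty(DataMovementType.SCATTER_GATHER, DataSourceType.PERSISTED,
        SchedulingType.SEQUENTIAL, shuffleOutput, shuffleInput));

dag.addVertex(tokenizer).addVertex(summer).addEdge(shuffle);
```

Every data-movement decision is explicit - which is exactly why you want a compiler, not a human, emitting this.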

A generalization is fine, but it is often a limitation as well - nearly every tool listed above wants to write small parts of the scheduling mechanism, which allows for custom data routing and for connecting task outputs to task inputs manually (like a bucketed map-join). Tez is meant to be a good generalization to build each application's custom components on top of, without actually writing any of the complex YARN code required for error tolerance, rack/host locality and recovery from AM crashes. The VertexManager plugin API is one classic example of how an application can now interfere with how a DAG is scheduled and how its individual tasks are managed.
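
A sketch of the shape of that plugin, assuming the VertexManagerPlugin contract (method names approximate; the class itself is hypothetical):

```java
// hypothetical example: schedule a bucketed map-join's tasks bucket by bucket
public class BucketJoinVertexManager implements VertexManagerPlugin {

  public void initialize(VertexManagerPluginContext context) {
    // inspect the DAG here: bucket count, which source vertices feed this one
  }

  public void onSourceTaskCompleted(String srcVertexName, Integer taskId) {
    // custom routing decision: only schedule the consumer task for bucket N
    // once both sides of the join have produced bucket N's output
  }

  // ... other callbacks elided in this sketch
}
```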

And last of all, I like how Tez is not self-centered: it works towards the global utilization ratio of a cluster, not just its own latency figures. It can be built to elastically respond to queue/cluster pressure from other tasks running on the cluster.

People are doing Tez a disservice by comparing it to frameworks which rely on keeping slaves running, not just to execute CPU tasks but to hold onto temporary storage as well. On a production cluster, getting 4 fewer containers than you asked for will not stall Tez, because of the way it uses the shuffle mechanism as a temporary data store between DAG vertices - it is designed to run all-out or best-effort, instead of waiting for the perfect moment to run your entire query. A single stalling reducer doesn't require any of the other JVMs to stay resident and wait. Holding slaves resident isn't a problem for a dedicated daemon-based multi-tenant cluster, because if there is another job for that cluster it will execute; but on a hadoop ecosystem cluster built on YARN, it means your cluster utilization takes a nose-dive, due to the inability to acquire or release cluster resources incrementally/elastically during your actual data operation.

Among the frameworks I've played with, that is the real differentiating feature of Tez - Tez does not require containers to be kept running to do anything, just the AM running in the idle periods between different queries. You can hold onto containers, but that is an optimization, not a requirement, during the idle periods of a session.
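
You can see that in the knobs themselves - container reuse is a toggle, not a structural requirement (property names as I recall them from the Tez configuration; values illustrative):

```xml
<property>
  <name>tez.am.container.reuse.enabled</name>
  <value>true</value> <!-- an optimization: recycle idle containers -->
</property>
<property>
  <name>tez.am.session.min.held-containers</name>
  <value>0</value> <!-- hold nothing when idle; only the AM stays up -->
</property>
```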

I might not exactly be a fan of the user-friendliness of this assembly language layer for hadoop, but its flexibility more than compensates.
