leoneu edited this page Aug 3, 2011 · 8 revisions
Clone this wiki locally

The S4 Roadmap

Always in flux. Wanna help and don't know where to start? Contact leo at GitHub send message.

S4 v0.4

S4 v0.5 (New S4-piper API)

  • Volunteers: brucerobbins, matthieumorel, fpj, anishnair, leoneu

  • Initial design.

  • Make it work with comm layer.

  • Review comm layer API for the new design and modify if necessary but no need to reimplement in 0.4. (However, it would be nice to do so, make it truly pluggable, add protocols, etc.) Consider tighter integration of the comm layer with S4.

  • Make serialization pluggable. We should be able to support any format (Java objects, protobufs, etc.)

  • Port/test all example apps.

  • Review/modify client module.

  • Review checkpointing.

  • Review multithreading design. Clarify the contract between the framework and the app programmer.

  • The framework doesn't know the event types (except that it is an extension of Event). To map an event type to the corresponding process method efficiently, we need to do byte generation. We are already using this approach and need to adapt it to the new design.

  • One of the goals is to enable developers to create high level tools to create applications. We should make sure that the new design makes this possible. Potentially we should have a prototype to showcase this capability.

  • in 0.3, we guarantee that an event is only sent once between nodes even if it is sent with more than one key. In piper, this is a bit trickier, we need to think how this can be done to make sure the design works but there is no need to implement in this version.

  • Make sure [](dynamic balancing) will work with this design.

  • Start learning osgi so we can shape the design with dynamic loading in mind.

  • App packaging and deployment. osgi?

S4 Container for Hadoop

The goal is to run an S4 container inside Hadoop to be able to use the same App or PEs in batch and streaming mode. An experimental implementation by Jon Malkin is here. A typical use case consist in training models using large data sets (years or months of logged data). This is done efficiently in batch mode using Hadoop. In a machine learning system, we must use the exact same feature extractor for training models and to run off-line simulations and at run time. With the S4 container in Hadoop we can achieve this. A second goal is to be able to instantiate PEs using historical event data in Hadoop and deploy the PE objects to the S4 cluster to process future event data. For example, a trading system may analyze historical data over night and push updated PEs to S4 to start processing the new data when the market opens.

Jon's notes:

The basic system works fine, in the sense that I can run speech02 and get the right output. Finishing it off, however, has been somewhat contingent on a few other pieces.

My integration is with pig in particular -- Hadoop should be possible (and will initialize Spring fewer times) but then you need to do a lot more data management stuff. Every time data passes through a PE is a reduce stage, so direct Hadoop code will require a lot more developer effort. Streaming may be easier, but I think there will still be a lot of application-specific glue code.

My current system:

What I have is a Pig UDF that can be initialized with a PE type. It will then be able to process serialized incoming events and return a serialized PE (kind of -- see below), the key for that PE, and a bag of serialized output events.

Since events need to be serialized, I have an event creator as well as an event reader if you need to grab a field from the events for instance a timestamp to be able to keep things time-sorted. (You don't get ordering guarantees in the output of a previous stage.)

The main PE UDF is implemented as an accumulator so that results can be sent in in small batches. It seems like output events are reported out in one block. To capture the output events, I inject a fake emitter when I create the PE. That caches things, and will write them to a local disk if you get more than a certain amount. It'll then read them back in at the end when you need to return them all. Not being able to do that incrementally is my least favorite part.

Pig 0.9 apparently lets you define macros, which would help make the necessary pig script much shorter. It's long right now, but it's mostly a bunch of very repetitive code. I haven't had a chance to play with that, though.

Finally, I'm just using Kryo to serialize things.


There are a few things I don't have working, some of which are more important than others.

While I have a UDF to read Pig data and populate events, at least for simple event types (if you have pre-serialized data, that can be used directly), I haven't had a chance to sit down and figure out how to let you populate keys along with that. This is technically ok, but inefficient -- it requires a M/R pass just to create empty events which then need to be run through, say, a keyless PE.

Next, I'm not truly serializing the PEs themselves since I wasn't sure whether there'd be a better solution than just running it through Kryo. Obviously, something more configurable would be nice, but mostly it should match what we do for checkpointing. (I've been waiting to see what we're doing for pre-initialized PEs/recovering from checkpointing.)

And then, with that problem still not quite solved, I'm not reconstituting PEs yet. It shouldn't actually be all that hard once the basic facility exists in S4 itself. But until then, I can't support cycles.

Communication layer.

Rewrite communication layer. We need to summarize requirements here.