Answer how Stratosphere compares to Apache Spark #36

Open
rmetzger opened this issue May 2, 2014 · 0 comments

rmetzger commented May 2, 2014

This message from our mailing list, posted by @fhueske, might be a good skeleton:

Similar to Spark, Stratosphere is a complete data processing system, i.e., it has a programming API, a program compiler (optimizer), and its own execution runtime.
It is also an alternative to Hadoop MapReduce and resembles Spark in several design points:

  • Programs are executed as DAGs
  • Higher-level programming primitives (compared to Hadoop MR)
  • APIs in Scala and Java
  • Reads data from external data stores (has no data storage of its own), e.g., HDFS, S3, RDBMS, ...
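
To illustrate the first point, here is a minimal sketch of what "executed as a DAG" means: each operator runs only after all of its inputs are available. This is a toy in plain Python, not Stratosphere or Spark code; `run_dag` and the operator names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy example; none of these names come from Stratosphere or Spark.
def run_dag(operators, deps, source):
    """Run every operator after all of its inputs, in topological order."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, ())]
        # Operators without inputs read from the external source.
        results[name] = operators[name](*inputs) if inputs else operators[name](source)
    return results

operators = {
    "read": lambda lines: list(lines),
    "words": lambda lines: [w for line in lines for w in line.split()],
    "lengths": lambda lines: [len(line) for line in lines],
    "summary": lambda words, lengths: {"words": len(words), "chars": sum(lengths)},
}
# Each operator maps to the operators whose output it consumes.
deps = {"read": [], "words": ["read"], "lengths": ["read"], "summary": ["words", "lengths"]}

print(run_dag(operators, deps, ["a bb", "ccc"])["summary"])
```

Note that `read` feeds two operators and `summary` consumes two, which is the kind of plan a fixed map-then-reduce pipeline cannot express directly.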

However, Stratosphere is also different in some aspects:

  • Database-inspired processing using pipelining, gradually spilling to disk if memory is not sufficient (hybrid hash joins, external sorts)
  • Sophisticated cost-based optimizer choosing execution strategies (broadcasting vs. partitioning, sort vs. hash joins, ...)
  • Implemented in Java (in contrast to Spark which uses Scala)
  • Intermediate results are not yet materialized in memory (this is on the roadmap)
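
As a rough illustration of the optimizer point, the broadcasting vs. partitioning decision can be sketched as a comparison of estimated network cost. The cost model below is a deliberately simplified assumption, not Stratosphere's actual optimizer; `choose_join_strategy` and its parameters are hypothetical.

```python
def choose_join_strategy(left_bytes, right_bytes, parallelism):
    """Pick the join strategy with the lower estimated network cost (toy model)."""
    # Broadcast: ship a full copy of the smaller input to every parallel task.
    broadcast_cost = min(left_bytes, right_bytes) * parallelism
    # Repartition: ship both inputs across the network once, hashed by join key.
    repartition_cost = left_bytes + right_bytes
    if broadcast_cost < repartition_cost:
        return "broadcast", broadcast_cost
    return "repartition", repartition_cost

# A tiny input joined with a large one favours broadcasting the tiny side.
print(choose_join_strategy(10, 10_000, parallelism=8))
# Two inputs of similar size favour repartitioning both.
print(choose_join_strategy(5_000, 10_000, parallelism=8))
```

A real cost-based optimizer would also weigh disk I/O, existing partitioning or sort order, and the sort vs. hash choice for the local join, but the principle of estimating each plan and picking the cheapest is the same.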

Stratosphere and Spark are best seen as alternatives.
We do not build on any of Spark's components, as we have our own programming API and execution engine.
