RFC: Removing assemblies from Spark. #2

docs/rfc-no-assemblies.md

# Replacing the Spark Assembly with good old jars

Spark, since the 1.0 release (at least), uses assemblies (or “fat jars”) as the approach to deliver
the shared Spark code to users. In this document I’ll discuss a few problems this approach causes,
and how avoiding the use of the assemblies makes development and deployment of Spark easier for all.

What does it solve? And at what cost?

The first question to ask is: what problems is the assembly solving?

The assembly provides a convenient package including all the classes needed to run a Spark
application. Theoretically, you can easily move it from one place to the other; easily cache it
somewhere like HDFS for reuse; and easily include it as a library in 3rd-party applications.


I think the original motivation way, way back was that it's nicer when you run "ps -ef" - otherwise you get a huge output that is super long with the whole classpath and it's hard to see things like what arguments are passed to the executables. /cc @mateiz as this was a long time ago. At least bears mentioning here... not saying this justifies having it, though.


The "ps -ef" problem could be solved by using the CLASSPATH environment variable instead of the JVM argument. One can still use the jinfo command to see the classpath a Java process is running with.


Yes, you could either use the env variable or include libdir/* in the classpath, since the JVM handles that fine.
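
As a rough sketch of the two alternatives being discussed (the paths and the direct invocation of the main class are illustrative, not what the launch scripts actually do):

```sh
# Classpath via the environment variable: it no longer shows up in "ps -ef"
# output, but is still visible with "jinfo <pid>".
CLASSPATH="/opt/spark/lib/*" java org.apache.spark.deploy.SparkSubmit --help

# Directory wildcard in -cp: the JVM expands "lib/*" itself, so the command
# line stays short even if the directory holds hundreds of jars.
java -cp "/opt/spark/lib/*" org.apache.spark.deploy.SparkSubmit --help
```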

But that’s a very shallow look at things, and ignores a lot of problems caused by the assembly.

Spark has suffered in the past from problems caused by such a large archive with so many files
(pyspark incompatibilities with large archives created by newer JDKs). That has been solved by
moving to JDK 7, though.

The assembly makes dependencies very opaque. When a user adds the Spark assembly to an application,
what exactly is he pulling in? Is he inadvertently overriding classes needed by his application with
ones included in the Spark assembly?

Just to be clear, the user was never supposed to add Spark assemblies to their application. They link to Spark by adding it as a Maven dependency, and then they run their app using spark-submit, which uses the Spark build on their machine. The assembly was mostly meant to be an easy and efficient way to package all the classes on worker nodes, so that you don't have to send around a ton of JARs.


Hi @mateiz,

> then they run their app using spark-submit

I understand that point, but the fact is that quite a lot of people run Spark embedded in their applications too, without ever using spark-submit.

re: distribution to worker nodes, that mostly applies to YARN, right? Maybe Mesos, I'm not familiar with how to use Spark on it. But on standalone, the Worker instance itself already needs all this code to run, and propagates that code to all executors it starts. Whether the code is a single jar or multiple, it doesn't really make a difference.


For embedded Spark, can't they just rely on the Maven dependencies? I don't see how removing assemblies would fix this, because you still need to add the code for your specific configuration of Spark on the classpath (i.e. the right Spark version, the right Hadoop version, Hive if you wanted that, etc).


They can, mostly. Perhaps citing the embedded case is kind of a red herring. The biggest issue someone will have in the embedded case is with YARN, since it currently requires some non-trivial setup to get things right (see the Pig and Oozie bugs I mention later); I think that's the only change listed in this document that really would help the embedded case.


Alright, I just want to make sure that we're not recommending the wrong thing to people. YARN is its own world, but if we make a change like this we should make sure the result works in all resource managers and is worth the change. I don't remember the details exactly, but we actually started with JARs and moved to an assembly later; it might've been to make it easier to push onto YARN and Mesos but I forget. It was not done to let users link to a giant JAR though.


Absolutely; I have actually played with a proof of concept for this in YARN and the changes required are actually not that big. I'd have to familiarize myself a bit with the Mesos side, but I don't expect a lot of issues there either.


It would be great if we could simply depend on Spark (as a maven dependency) and launch our app on a YARN cluster (without using any installed Spark distro or even spark-submit). This would make things much easier.


For Mesos I don't think it will cause too many problems; most of the time it's about being able to build a Spark distribution tar that holds the Spark binaries and dependencies to run the executor via Mesos. Otherwise we don't upload dependencies for users and act much more like standalone mode.
I think as long as we can keep make-distribution.sh --tgz working it won't be a problem.


The assembly makes development slower. Many tests currently need an updated assembly to run
correctly (although SPARK-9284 aims to solve that). Updating a remote cluster is slower than
necessary - even rsyncing such a large archive is not terribly efficient. It slows down the build,
because repackaging the assemblies when files change is not exactly fast. Even deciding that there
is no need to rebuild the assembly takes time.

And it also does not solve the one problem it was meant to solve: it does not include all
dependencies needed by Spark, because the Datanucleus libraries do not work when included in the
assembly.


  • aws-java-sdk-s3 & other things needed for current s3 access.

There's another issue, which is conflict with the bundled artifacts (example: SLF4J) and anything on the classpath, especially when deployed on Hadoop clusters: warnings of conflict appear at the top of all logs.


@steveloughran I'm not sure this change by itself would solve the multiple slf4j binding issue, but it definitely would make it easier for people to fix it by customizing their Spark installation.

From the point of view of someone trying to embed Spark into their application, things become
trickier still. The assembly is not a published artifact, so what should the user pick up instead?
The recommendation has been to use “provided” dependencies and somehow ship the appropriate Spark
assembly with the user application. But that runs into all the issues above (dependency conflicts et
al), aside from being a very unnatural way to use dependencies when compared to other maven-based
projects.

Finally, as someone whose work involves packaging Spark as part of a larger distribution, the
assembly creates yet more problems. Because all dependencies are included in one big fat jar, it’s
harder to share libraries that are shipped as part of the distribution. This means packages are
unnecessarily bloated because of the code duplication, and patching becomes harder since now you
have to patch multiple components that ship that code.

Hacks were added to the Spark build to filter out such dependencies (all the *-provided profiles),
but those are brittle, require constant maintenance and policing, and require non-trivial work to
make sure Spark has all needed libraries at runtime. If you happen to miss a shared dependency, and
you are unlucky enough to have to patch it later on, you just made your work more complicated
because now there are two things to patch.

## How to replace it?

Ignoring potential backwards compatibility issues due to code that expects the current layout of a
Spark distribution, getting rid of the Spark assembly should be rather easy.

With a couple of exceptions that I’ll cover below, there is no code in Spark that actually depends
on the assembly. Whether the code comes from one or two hundred jars, everything just works. So from
the packaging side, all that is needed, instead of building a single jar file, is to use maven's
built-in functionality (and I assume sbt would have something similar) to create a directory with
the Spark jars and all needed dependencies.
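
As a rough sketch of what that could look like with the standard maven-dependency-plugin (the module name, goals and output directory below are illustrative, not a proposed layout):

```sh
# Build Spark as usual, then lay out the runtime dependencies of the assembly
# module as plain jars in a directory instead of merging them into one fat jar.
mvn -DskipTests install
mvn -pl assembly dependency:copy-dependencies \
    -DincludeScope=runtime \
    -DoutputDirectory="$PWD/dist/jars"
```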

The two parts of the code base that depend on the assembly are:

* The launcher library; fixing it to include all jars in a directory instead of the assembly is
trivial.


+1 for lib/*.jar. This addresses the problem today that you can't just drop in support for s3a:// or avs:// just by adding the JAR to the dir, but instead have to remember to explicitly add it on every run
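
For example (a sketch; jar versions, paths and the application are placeholders):

```sh
# Today the extra jars have to be remembered on every single run.
spark-submit --jars /opt/extra/hadoop-aws-2.7.1.jar,/opt/extra/aws-java-sdk-1.7.4.jar \
    --class com.example.MyApp my-app.jar

# With a plain jars directory, dropping them in once would be enough.
cp /opt/extra/hadoop-aws-2.7.1.jar /opt/extra/aws-java-sdk-1.7.4.jar "$SPARK_HOME/lib/"
```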


FYI for this you can also modify the spark-env.sh on each node to add it to the classpath.

* YARN integration.

The YARN backend assumes, by default, that there’s nothing Spark-related installed in the cluster.
So when you submit an app to YARN, it will try to upload the jar containing the Spark classes
(normally the assembly) to the cluster. There are config options that can be used to tell the YARN
backend where to find the assembly (e.g. somewhere in HDFS or on the local filesystem of cluster
nodes), but those configs assume that Spark is a single file. This is already an issue today when
trying to run a Spark application that needs Datanucleus jars in cluster mode.
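
For reference, this is roughly what the single-file assumption looks like today, assuming the 1.x `spark.yarn.jar` option (the HDFS path and the application are made up):

```sh
# Point the YARN backend at an assembly already cached on HDFS so it is not
# uploaded again on every submission; the option only accepts a single file.
spark-submit --master yarn-cluster \
    --conf spark.yarn.jar=hdfs:///apps/spark/spark-assembly-1.5.2-hadoop2.6.0.jar \
    --class com.example.MyApp my-app.jar
```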

Fixing this is not hard; it just requires a little more code. The YARN backend should be able to
handle directories / globs as well as the current “single jar” approach to uploading (or
referencing) dependencies.


I would suggest supporting it as a tarball if we go that way; it's more efficient than a bunch of jars in a directory. This could also allow you to include confs and other things in there.
tgz is currently how we deploy MapReduce from HDFS.


That's an interesting approach. Kinda similar to how we upload the configs to HDFS currently.

For those who do not have the files cached on HDFS, though, creating the tgz locally on every run would be pretty expensive. Perhaps supporting both ways would be better? (Support multiple jars, but also a cached tgz via a different config option.)


Yeah, I'm fine with that; I was talking more about the upload-to-HDFS side of things.


Spark has more than one assembly, though, so we need to look at how the other assemblies are used
too.

The examples assembly can receive a similar treatment. The run-examples script might need some
tweaking to include the extra jars in the Spark command to run. And running the examples by using
spark-submit directly might become a little bit more complicated - although even that is fixable, in
certain cases. The dependencies can be added to the example jar’s manifest, and spark-submit could
read the manifest and automatically include the jars in the running application.

The streaming backend assemblies could potentially just be removed. With the ivy integration in
Spark, the original artifacts can be used instead, and dependencies will be automatically handled.
For those using maven to build their streaming applications, including the dependencies is also
easy. To help with tidiness, the streaming backends should declare Spark dependencies such as
spark-core and spark-streaming as provided. There might be some tweaking needed to get the pyspark
streaming tests to work, since they currently depend on the backend assemblies being built. One last
thing that needs to be covered is the python unit tests; they use the assemblies to avoid having to
deal with maven / sbt to build their classpath. This could also be easily supported by having the
dependencies be copied to a known directory under the backend’s build directory - not much different
from how things work today.
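
To make the ivy point concrete, the existing `--packages` mechanism can already pull a streaming backend and its transitive dependencies at submit time (the coordinates and version below are only an example):

```sh
# Resolve the Kafka streaming backend through the ivy integration instead of
# shipping a pre-built streaming assembly with the application.
spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.2 \
    --class com.example.MyStreamingApp my-streaming-app.jar
```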

That leaves the YARN shuffle service. This is the only module where I see an assembly really adding
some benefit - deploying / updating the shuffle service on YARN is just a matter of copying /
replacing a single file (aside from configuration).


We could still generate a small assembly for this shuffle service. In fact I think it has its own assembly separate from the larger one, since it intentionally has a tiny number of dependencies.


Yes, that is what I meant.

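For context, deploying that single file today looks roughly like the following sketch (the jar name follows the 1.x naming convention and the NodeManager paths are illustrative):

```sh
# Copy the standalone shuffle service jar onto each NodeManager's classpath
# (the target directory varies by Hadoop installation).
cp "$SPARK_HOME/lib/spark-1.5.2-yarn-shuffle.jar" /usr/lib/hadoop-yarn/lib/

# Then enable it in yarn-site.xml (shown as comments, since that file is XML):
#   yarn.nodemanager.aux-services                      -> add spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class  -> org.apache.spark.network.yarn.YarnShuffleService
```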


## Summary of benefits

Removing the assembly brings forward the following benefits:

* Builds are faster
* Build code is simplified
* Spark behaves more like usual maven-based applications w.r.t. building, packaging and deployment
* Possibility of minor code cleanups in other parts of the code base (launch scripts, code that
starts Spark processes)
* More flexibility when embedding Spark into other applications

The cons of such a move are:

* Backwards compatibility, in case someone really depends on the assembly being there. We can have a
dummy jar, but that only solves a trivial part of the compatibility problem.


Yeah I think it's worth fully understanding all possible compatibility issues. There might be things we aren't thinking of.


BC issues

  1. in a downstream maven build you can't just switch version by changing spark.version=1.x; you need to declare dependencies on different artifacts.
  2. Same for IDE dev.
  3. uber-JARs are an opportunity to shade. That's always dangerous, but it does help reduce guava version conflicts.


@vanzin my understanding was that we now shade everything in our core jar, so the uber jar doesn't introduce any new shading. Is that right?


@pwendell that's right, we shade on every module, so the uber jar doesn't add any shading. The shaded guava classes are currently embedded in the "network-common" artifact.

@steveloughran re: (1), that's already the case today, since the assembly is not published to maven [1]. You could still just switch one variable with the Spark version for all artifacts.

[1] http://repo1.maven.org/maven2/org/apache/spark/spark-assembly_2.10/ (last one is 1.1.1)

* Running examples via spark-submit directly might become a little more complicated.
* Slightly more complicated code in the YARN backend, to deal with uploading all dependencies
(instead of just a single file).


## What about the “provided” profiles?

This change would allow most of the “provided” profiles to become obsolete. The only profile that
should be kept is “hadoop-provided”, since it allows users to easily deploy a Spark package on top
of any Hadoop distribution.
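
For example, a hadoop-provided package can already be built and then pointed at the cluster’s Hadoop jars at runtime; the exact profiles and Hadoop version below are illustrative:

```sh
# Build a distribution that leaves the Hadoop classes out of the package...
./make-distribution.sh --tgz -Phadoop-provided -Pyarn -Dhadoop.version=2.6.0

# ...and at runtime pick up the Hadoop jars already installed on the node,
# typically from conf/spark-env.sh.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```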

The other profiles mostly cover avoiding repackaging dependencies for examples (which is not
crucial) and streaming backends (which would be handled by the suggestions made in the discussion
above).

## Links to Assembly-related issues

* https://issues.apache.org/jira/browse/OOZIE-2277

Oozie needs to do a lot of gymnastics to get Spark deployed because of the way it needs to run apps.
Since the Spark assembly is not a maven artifact, it’s unrealistic for Oozie to use it.

* https://issues.apache.org/jira/browse/PIG-4667

Similar to Oozie. When not using an assembly, the YARN backend does the wrong thing (since it will
just upload the spark-yarn jar). The result is to either depend on the non-existent assembly
artifact, or do what Oozie does.

* https://issues.apache.org/jira/browse/HIVE-7292

The link is to the umbrella issue tracking the project, but Hive-on-Spark solves the same problem in
yet another way. It would be much simpler if Hive could just depend on Spark directly instead of
somehow having to embed a Spark installation in it.