RFC: Removing assemblies from Spark. #2

docs/rfc-no-assemblies.md

# Replacing the Spark Assembly with good old jars

Spark, since the 1.0 release (at least), uses assemblies (or “fat jars”) as the approach to deliver
the shared Spark code to users. In this document I’ll discuss a few problems this approach causes,
and how avoiding the use of the assemblies makes development and deployment of Spark easier for all.

What does it solve? And at what cost?

The first question to ask is: what problems is the assembly solving?

The assembly provides a convenient package including all the classes needed to run a Spark
application. Theoretically, you can easily move it from one place to the other; easily cache it
somewhere like HDFS for reuse; and easily include it as a library in 3rd-party applications.


I think the original motivation way, way back was that it's nicer when you run "ps -ef" - otherwise you get a huge output that is super long with the whole classpath and it's hard to see things like what arguments are passed to the executables. /cc @mateiz as this was a long time ago. At least bears mentioning here... not saying this justifies having it, though.


The "ps -ef" problem could be solved by using the CLASSPATH environment variable instead of the JVM argument. One can still use the jinfo command to see the classpath a Java process is running with.


Yes, you could either use the env variable or include libdir/* in the classpath, since the JVM handles that fine.
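
As a rough sketch of the two alternatives being discussed (the paths and the direct invocation of the main class are illustrative, not what the launch scripts actually do):

```sh
# Classpath via the environment variable: it no longer shows up in "ps -ef"
# output, but is still visible with "jinfo <pid>".
CLASSPATH="/opt/spark/lib/*" java org.apache.spark.deploy.SparkSubmit --help

# Directory wildcard in -cp: the JVM expands "lib/*" itself, so the command
# line stays short even if the directory holds hundreds of jars.
java -cp "/opt/spark/lib/*" org.apache.spark.deploy.SparkSubmit --help
```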

But that’s a very shallow look at things, and ignores a lot of problems caused by the assembly.

Spark has suffered in the past from problems caused by such a large archive with so many files
(pyspark incompatibilities with large archives created by newer JDKs). That has been solved by
moving to JDK 7, though.

The assembly makes dependencies very opaque. When a user adds the Spark assembly to an application,
what exactly is he pulling in? Is he inadvertently overriding classes needed by his application with
ones included in the Spark assembly?

Just to be clear, the user was never supposed to add Spark assemblies to their application. They link to Spark by adding it as a Maven dependency, and then they run their app using spark-submit, which uses the Spark build on their machine. The assembly was mostly meant to be an easy and efficient way to package all the classes on worker nodes, so that you don't have to send around a ton of JARs.


Hi @mateiz,

> then they run their app using spark-submit

I understand that point, but the fact is that quite a lot of people run Spark embedded in their applications too, without ever using spark-submit.

re: distribution to worker nodes, that mostly applies to YARN, right? Maybe Mesos, I'm not familiar with how to use Spark on it. But on standalone, the Worker instance itself already needs all this code to run, and propagates that code to all executors it starts. Whether the code is a single jar or multiple, it doesn't really make a difference.


For embedded Spark, can't they just rely on the Maven dependencies? I don't see how removing assemblies would fix this, because you still need to add the code for your specific configuration of Spark on the classpath (i.e. the right Spark version, the right Hadoop version, Hive if you wanted that, etc).


They can, mostly. Perhaps citing the embedded case is kind of a red herring. The biggest issue someone will have in the embedded case is with YARN, since it currently requires some non-trivial setup to get things right (see the Pig and Oozie bugs I mention later); I think that's the only change listed in this document that really would help the embedded case.


Alright, I just want to make sure that we're not recommending the wrong thing to people. YARN is its own world, but if we make a change like this we should make sure the result works in all resource managers and is worth the change. I don't remember the details exactly, but we actually started with JARs and moved to an assembly later; it might've been to make it easier to push onto YARN and Mesos but I forget. It was not done to let users link to a giant JAR though.


Absolutely; I have actually played with a proof of concept for this in YARN and the changes required are actually not that big. I'd have to familiarize myself a bit with the Mesos side, but I don't expect a lot of issues there either.


It would be great if we could simply depend on Spark (as a maven dependency) and launch our app on a YARN cluster (without using any installed Spark distro or even spark-submit). This would make things much easier.


For Mesos I don't think it will cause too many problems; most of the time it's about being able to build a Spark distribution tar that holds the Spark binaries and dependencies to run the executor via Mesos. Otherwise we don't upload dependencies for users and act much more like standalone mode.
I think as long as we can keep make-distribution.sh --tgz working it won't be a problem.


The assembly makes development slower. Many tests currently need an updated assembly to run
correctly (although SPARK-9284 aims to solve that). Updating a remote cluster is slower than
necessary - even rsyncing such a large archive is not terribly efficient. It slows down the build,
because repackaging the assemblies when files change is not exactly fast. Even deciding that there
is no need to rebuild the assembly takes time.

And it also does not solve the one problem it was meant to solve: it does not include all
dependencies needed by Spark, because the Datanucleus libraries do not work when included in the
assembly.


  • aws-java-sdk-s3 & other things needed for current s3 access.

There's another issue, which is conflict with the bundled artifacts (example: SLF4J) and anything on the classpath, especially when deployed on Hadoop clusters: warnings of conflict appear at the top of all logs.


@steveloughran I'm not sure this change by itself would solve the multiple slf4j binding issue, but it definitely would make it easier for people to fix it by customizing their Spark installation.

From the point of view of someone trying to embed Spark into their application, things become
trickier still. The assembly is not a published artifact, so what should the user pick up instead?
The recommendation has been to use “provided” dependencies and somehow ship the appropriate Spark
assembly with the user application. But that runs into all the issues above (dependency conflicts et
al), aside from being a very unnatural way to use dependencies when compared to other maven-based
projects.

Finally, as someone whose work involves packaging Spark as part of a larger distribution, the
assembly creates yet more problems. Because all dependencies are included in one big fat jar, it’s
harder to share libraries that are shipped as part of the distribution. This means packages are
unnecessarily bloated because of the code duplication, and patching becomes harder since now you
have to patch multiple components that ship that code.

Hacks were added to the Spark build to filter out such dependencies (all the *-provided profiles),
but those are brittle, require constant maintenance and policing, and require non-trivial work to
make sure Spark has all needed libraries at runtime. If you happen to miss a shared dependency, and
you are unlucky enough to have to patch it later on, you just made your work more complicated
because now there are two things to patch.

## How to replace it?

Ignoring potential backwards compatibility issues due to code that expects the current layout of a
Spark distribution, getting rid of the Spark assembly should be rather easy.

With a couple of exceptions that I’ll cover below, there is no code in Spark that actually depends
on the assembly. Whether the code comes from one or two hundred jars, everything just works. So from
the packaging side, all that is needed, instead of building a single jar file, is to use maven's
built-in functionality (and I assume sbt would have something similar) to create a directory with
the Spark jars and all needed dependencies.
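
As a rough sketch of what that could look like with the standard maven-dependency-plugin (the module name, goals and output directory below are illustrative, not a proposed layout):

```sh
# Build Spark as usual, then lay out the runtime dependencies of the assembly
# module as plain jars in a directory instead of merging them into one fat jar.
mvn -DskipTests install
mvn -pl assembly dependency:copy-dependencies \
    -DincludeScope=runtime \
    -DoutputDirectory="$PWD/dist/jars"
```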

The two parts of the code base that depend on the assembly are:

* The launcher library; fixing it to include all jars in a directory instead of the assembly is
trivial.


+1 for lib/*.jar. This addresses the problem today that you can't just drop in support for s3a:// or avs:// just by adding the JAR to the dir, but instead have to remember to explicitly add it on every run
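
For example (a sketch; jar versions, paths and the application are placeholders):

```sh
# Today the extra jars have to be remembered on every single run.
spark-submit --jars /opt/extra/hadoop-aws-2.7.1.jar,/opt/extra/aws-java-sdk-1.7.4.jar \
    --class com.example.MyApp my-app.jar

# With a plain jars directory, dropping them in once would be enough.
cp /opt/extra/hadoop-aws-2.7.1.jar /opt/extra/aws-java-sdk-1.7.4.jar "$SPARK_HOME/lib/"
```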


FYI for this you can also modify the spark-env.sh on each node to add it to the classpath.

* YARN integration.

The YARN backend assumes, by default, that there’s nothing Spark-related installed in the cluster.
So when you submit an app to YARN, it will try to upload the jar containing the Spark classes
(normally the assembly) to the cluster. There are config options that can be used to tell the YARN
backend where to find the assembly (e.g. somewhere in HDFS or on the local filesystem of cluster
nodes), but those configs assume that Spark is a single file. This is already an issue today when
trying to run a Spark application that needs Datanucleus jars in cluster mode.
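
For reference, this is roughly what the single-file assumption looks like today, assuming the 1.x `spark.yarn.jar` option (the HDFS path and the application are made up):

```sh
# Point the YARN backend at an assembly already cached on HDFS so it is not
# uploaded again on every submission; the option only accepts a single file.
spark-submit --master yarn-cluster \
    --conf spark.yarn.jar=hdfs:///apps/spark/spark-assembly-1.5.2-hadoop2.6.0.jar \
    --class com.example.MyApp my-app.jar
```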

Fixing this is not hard; it just requires a little more code. The YARN backend should be able to
handle directories / globs as well as the current “single jar” approach to uploading (or
referencing) dependencies.


I would suggest supporting it as a tarball if we go that way; it's more efficient than a bunch of jars in a directory. This could also allow you to include confs and other things in there.
tgz is currently how we deploy MapReduce from HDFS.


That's an interesting approach. Kinda similar to how we upload the configs to HDFS currently.

For those who do not have the files cached on HDFS, though, creating the tgz locally on every run would be pretty expensive. Perhaps supporting both ways would be better? (Support multiple jars, but also a cached tgz via a different config option.)


Yeah, I'm fine with that; I was talking more about the upload-to-HDFS side of things.


Spark has more than one assembly, though, so we need to look at how the other assemblies are used
too.

The examples assembly can receive a similar treatment. The run-examples script might need some
tweaking to include the extra jars in the Spark command to run. And running the examples by using
spark-submit directly might become a little bit more complicated - although even that is fixable, in
certain cases. The dependencies can be added to the example jar’s manifest, and spark-submit could
read the manifest and automatically include the jars in the running application.

The streaming backend assemblies could potentially just be removed. With the ivy integration in
Spark, the original artifacts can be used instead, and dependencies will be automatically handled.
For those using maven to build their streaming applications, including the dependencies is also
easy. To help with tidiness, the streaming backends should declare Spark dependencies such as
spark-core and spark-streaming as provided. There might be some tweaking needed to get the pyspark
streaming tests to work, since they currently depend on the backend assemblies being built. One last
thing that needs to be covered is the python unit tests; they use the assemblies to avoid having to
deal with maven / sbt to build their classpath. This could also be easily supported by having the
dependencies be copied to a known directory under the backend’s build directory - not much different
from how things work today.
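
To make the ivy point concrete, the existing `--packages` mechanism can already pull a streaming backend and its transitive dependencies at submit time (the coordinates and version below are only an example):

```sh
# Resolve the Kafka streaming backend through the ivy integration instead of
# shipping a pre-built streaming assembly with the application.
spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.2 \
    --class com.example.MyStreamingApp my-streaming-app.jar
```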

That leaves the YARN shuffle service. This is the only module where I see an assembly really adding
some benefit - deploying / updating the shuffle service on YARN is just a matter of copying /
replacing a single file (aside from configuration).


We could still generate a small assembly for this shuffle service. In fact I think it has its own assembly separate from the larger one, since it intentionally has a tiny number of dependencies.


Yes, that is what I meant.

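For context, deploying that single file today looks roughly like the following sketch (the jar name follows the 1.x naming convention and the NodeManager paths are illustrative):

```sh
# Copy the standalone shuffle service jar onto each NodeManager's classpath
# (the target directory varies by Hadoop installation).
cp "$SPARK_HOME/lib/spark-1.5.2-yarn-shuffle.jar" /usr/lib/hadoop-yarn/lib/

# Then enable it in yarn-site.xml (shown as comments, since that file is XML):
#   yarn.nodemanager.aux-services                      -> add spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class  -> org.apache.spark.network.yarn.YarnShuffleService
```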


## Summary of benefits

Removing the assembly brings forward the following benefits:

* Builds are faster
* Build code is simplified
* Spark behaves more like usual maven-based applications w.r.t. building, packaging and deployment
* Possibility of minor code cleanups in other parts of the code base (launch scripts, code that
starts Spark processes)
* More flexibility when embedding Spark into other applications

The cons of such a move are:

* Backwards compatibility, in case someone really depends on the assembly being there. We can have a
dummy jar, but that only solves a trivial part of the compatibility problem.


Yeah I think it's worth fully understanding all possible compatibility issues. There might be things we aren't thinking of.


BC issues

  1. in a downstream maven build you can't just switch version by changing spark.version=1.x; you need to declare dependencies on different artifacts.
  2. Same for IDE dev.
  3. uber-JARs are an opportunity to shade. That's always dangerous, but it does help reduce guava version conflicts.


@vanzin my understanding was that we now shade everything in our core jar, so the uber jar doesn't introduce any new shading. Is that right?


@pwendell that's right, we shade on every module, so the uber jar doesn't add any shading. The shaded guava classes are currently embedded in the "network-common" artifact.

@steveloughran re: (1), that's already the case today, since the assembly is not published to maven [1]. You could still just switch one variable with the Spark version for all artifacts.

[1] http://repo1.maven.org/maven2/org/apache/spark/spark-assembly_2.10/ (last one is 1.1.1)

* Running examples via spark-submit directly might become a little more complicated.
* Slightly more complicated code in the YARN backend, to deal with uploading all dependencies
(instead of just a single file).


## What about the “provided” profiles?

This change would allow most of the “provided” profiles to become obsolete. The only profile that
should be kept is “hadoop-provided”, since it allows users to easily deploy a Spark package on top
of any Hadoop distribution.
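
For example, a hadoop-provided package can already be built and then pointed at the cluster’s Hadoop jars at runtime; the exact profiles and Hadoop version below are illustrative:

```sh
# Build a distribution that leaves the Hadoop classes out of the package...
./make-distribution.sh --tgz -Phadoop-provided -Pyarn -Dhadoop.version=2.6.0

# ...and at runtime pick up the Hadoop jars already installed on the node,
# typically from conf/spark-env.sh.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```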

The other profiles mostly cover avoiding repackaging dependencies for examples (which is not
crucial) and streaming backends (which would be handled by the suggestions made in the discussion
above).

## Links to Assembly-related issues

* https://issues.apache.org/jira/browse/OOZIE-2277

Oozie needs to do a lot of gymnastics to get Spark deployed because of the way it needs to run apps.
Since the Spark assembly is not a maven artifact, it’s unrealistic for Oozie to use it.

* https://issues.apache.org/jira/browse/PIG-4667

Similar to Oozie. When not using an assembly, the YARN backend does the wrong thing (since it will
just upload the spark-yarn jar). The result is to either depend on the non-existent assembly
artifact, or do what Oozie does.

* https://issues.apache.org/jira/browse/HIVE-7292

The link is to the umbrella issue tracking the project, but Hive-on-Spark solves the same problem in
yet another way. It would be much simpler if Hive could just depend on Spark directly instead of
somehow having to embed a Spark installation in it.