In [6]:
val sparkDummy = spark
import sparkDummy.implicits._

# Comparison with DStream

1. DStreams made it hard to handle late data because a DStream is just a sequence of discretized batches.
1. Structured Streaming shares the same API as batch Datasets.
1. Structured Streaming offers better end-to-end fault-tolerance guarantees.
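
As a rough illustration of the shared API, the sketch below applies one transformation to both a batch DataFrame and a streaming one. It assumes a Spark version that ships the built-in `rate` test source (2.3 or later); the names `session` and `evens` are made up for this example, and the streaming query is never started.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Reuses the notebook's existing session if there is one (assumption: local mode otherwise).
val session = SparkSession.builder.master("local[*]").getOrCreate()

// Hypothetical helper: the same logic works on batch and streaming input.
def evens(df: DataFrame): DataFrame = df.filter(col("value") % 2 === 0)

// Batch: a static DataFrame with a single `value` column.
val batchEvens = evens(session.range(10).toDF("value"))

// Streaming: the `rate` source emits `timestamp` and `value` columns.
val streamEvens = evens(session.readStream.format("rate").load())

println(s"batch isStreaming = ${batchEvens.isStreaming}")
println(s"stream isStreaming = ${streamEvens.isStreaming}")
```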

# Benefits of Type Safety

# Benefits of Tungsten

Spark has traditionally focused on optimizing inter-computer network I/O efficiency, which was the major bottleneck in MapReduce and other traditional distributed computing systems.  At this point, Spark workloads are instead bound by intra-computer memory and CPU.

1. **Customized Memory Management:** Spark understands its own memory allocation needs better than the generic JVM garbage collector.  Tungsten takes advantage of this by using `sun.misc.Unsafe`, which exposes C-style off-heap memory access.  The result is fewer unnecessary garbage collection events and improved performance.
1. **Binary Encoding:** Spark traditionally needed to store data as JVM objects; it now uses the Tungsten binary encoding.  This has two major advantages.  First, it decreases the memory footprint.  While "ABCD" would take 4 bytes of UTF-8 to encode, it would be stored using 48 bytes as a JVM object.  Second, instead of deserializing into objects, Spark is able to perform many operations directly on the raw binary encoding, reducing the computational overhead of serialization / deserialization.
1. **Code Generation:** Spark historically used generic JVM function evaluation.  Virtual function lookups, automatic boxing of primitive types, and other JVM overhead dramatically slow down the computation.  By using type information, Tungsten is able to generate bytecode and speed up performance.
1. **Cache-aware Computation:** Tungsten lays out its memory in a way that takes advantage of CPU cache locality to reduce cache misses and speed up computations.
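
To get a feel for the JVM object overhead mentioned under Binary Encoding, the sketch below compares the UTF-8 length of a string with Spark's own estimate of its in-memory size. It assumes Spark is on the classpath and uses `org.apache.spark.util.SizeEstimator`, a developer API; the estimate depends on the JVM version and settings, so it may not match the 48 bytes quoted above.

```scala
import org.apache.spark.util.SizeEstimator

val text = "ABCD"

// Raw payload: one byte per ASCII character in UTF-8.
val utf8Bytes = text.getBytes("UTF-8").length

// Spark's estimate of the String object plus its backing array on this JVM.
val jvmBytes = SizeEstimator.estimate(text)

println(s"UTF-8 bytes: $utf8Bytes, estimated JVM object bytes: $jvmBytes")
```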

For more information, check out [this blog post](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html) or [this presentation](http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen).

# Tungsten In Action

The following two operations each cache and count a million integers in memory.
- The first uses RDDs and must store every element as a Java object.
- The second uses Tungsten and its more compact binary encoding.

Tungsten saves a factor of nearly 4 on memory:

![Tungsten Memory](images/Tungsten_Memory.png)

To reproduce the example, run the following code and go to the storage page of your Spark UI.  You can launch a new shell using `make spark-shell` (recommended), in which case the page will be at [http://localhost:5050/storage/](http://localhost:5050/storage/).  Otherwise, the default Spark UI port is 4040.

In [4]:
val million = 0 until math.pow(10, 6).toInt

// Using RDDs
sc.parallelize(million).cache.count

// Using Tungsten
million.toDF.cache.count

1000000
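
The same comparison can also be read programmatically instead of from the UI. The sketch below is one possible way to do it, assuming `SparkContext.getRDDStorageInfo` (a developer API) reports both cached datasets; the name `session` is made up for this example, and the reported sizes will vary with the Spark and JVM versions.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's existing session if there is one (assumption: local mode otherwise).
val session = SparkSession.builder.master("local[*]").getOrCreate()
import session.implicits._

val million = 0 until 1000000

// Cache the same data once as an RDD and once as a DataFrame.
session.sparkContext.parallelize(million).cache.count
million.toDF.cache.count

// Print the in-memory size of everything currently cached.
session.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"id=${info.id} memSize=${info.memSize} bytes name=${info.name}")
}
```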

# Benefits of Catalyst

For more information, check out [this blog post](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html).

Optimizations include:
1. **Constant Folding:** Evaluating constant expressions at compile time rather than at run time.
1. **Predicate Pushdown:** Running operations that reduce the data load (e.g. selecting columns or filtering rows) earlier in the query.  This reduces the amount of data that needs to be processed by later steps.
1. **Boolean Expression Simplification:** Simplifies boolean expressions, e.g. rewriting `x AND true` as `x`.
1. **Pipelining Operations:** Combines multiple projection and filter operations into a single map operation.
1. **Cost-based Optimization:** Spark actually builds multiple plans to compute a query, computes their cost, and chooses the cheapest one.
1. **Code Generation:** The final operational plan is transformed into optimized Java bytecode.
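
Several of these optimizations can be observed by printing a query's plans. The sketch below is a minimal example, assuming a local session; the name `session` and the toy query are made up for illustration. Comparing the analyzed and optimized logical plans in the output should show the constant expression folded and the two filters merged, though the exact text differs between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Reuses the notebook's existing session if there is one (assumption: local mode otherwise).
val session = SparkSession.builder.master("local[*]").getOrCreate()

val query = session.range(1000).toDF("id")
  .select(col("id"), (lit(1) + lit(2)).as("three")) // candidate for constant folding
  .filter(col("id") > 10)                            // two adjacent filters that
  .filter(col("id") < 100)                           // the optimizer can combine

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
query.explain(true)
```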

# Comparison with Streaming