In [20]:
import org.apache.spark.sql.types._

In [2]:
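// JSON records carry their own field names, and the schema is inferred by default,
// so the CSV-only "header" and "inferSchema" options are not needed here.
val df = spark.read.json("data/people.json")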
  
df.groupBy("age").count()

[age: bigint, count: bigint]

In [None]:
- Introduction to Spark and Datasets (30 minutes)?
    - (resilience, in memory, easy word count, lazy, etc ...)
- Introduction to Spark Structured Streaming
    - reduce, joining, etc ...
- What's New?
    - What's going on under the hood Tungsten / Catalyst
    - Comparison to non-structured streaming
- Building a Streaming Application
    - Meetup Example

## Benefits of Tungsten

Spark has traditionally focused on optimizing inter-computer network I/O efficiency, which was the major bottleneck in MapReduce and other traditional distributed computing systems.  With that bottleneck largely addressed, Spark workloads are now bound by intra-computer memory and CPU.  Tungsten targets that new bottleneck in four ways:

1. **Customized Memory Management:** Spark understands its own memory allocation needs better than the generic JVM garbage collector.  Tungsten takes advantage of this by using `sun.misc.Unsafe`, which exposes C-style off-heap memory access.  The result is fewer unnecessary garbage collection events and improved performance.
1. **Binary Encoding:** Spark traditionally needed to serialize data to JVM objects; it now uses the Tungsten binary encoding.  This has two major advantages.  First, it decreases the memory footprint.  While "ABCD" would take 4 bytes of UTF-8 to encode, it would be stored using 48 bytes as a JVM object.  Second, instead of deserializing into objects, Spark is able to perform many actions directly on the raw binary encoding, reducing the computational overhead of serialization / deserialization.
1. **Code Generation:** Spark historically used generic JVM function evaluation.  Given the virtual function lookups, automatic boxing of primitive types, and other JVM overhead, this dramatically slows down the computation.  By using type information, Tungsten is able to generate specialized bytecode and speed up performance.
1. **Cache-aware Computation:** Tungsten lays out its memory in a way that takes advantage of CPU cache locality to reduce cache misses and speed up computations.
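
To make the binary encoding idea more concrete, here is a minimal sketch in plain Scala. It is not Tungsten's actual row format: `Person`, `BinaryRows`, and the fixed 8-byte row layout are hypothetical names and choices made for illustration, and `ByteBuffer.allocateDirect` stands in for the off-heap access that Tungsten obtains through `sun.misc.Unsafe`. The sketch packs records into one contiguous buffer and then aggregates a column by reading at fixed offsets, without rebuilding any objects.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical record type, used only for this illustration.
final case class Person(id: Int, age: Int)

object BinaryRows {
  // Assumed layout: each row is two 4-byte ints stored back to back, [id][age].
  val RowSize: Int = 8

  // Pack the records into one contiguous off-heap buffer.
  def encode(people: Seq[Person]): ByteBuffer = {
    val buf = ByteBuffer.allocateDirect(people.size * RowSize)
    people.foreach { p =>
      buf.putInt(p.id)
      buf.putInt(p.age)
    }
    buf.flip() // limit = bytes written, position = 0
    buf
  }

  // Sum the age column by reading at fixed offsets; no Person objects are rebuilt.
  def sumAges(buf: ByteBuffer): Long = {
    var total = 0L
    var offset = 0
    while (offset < buf.limit()) {
      total += buf.getInt(offset + 4) // age sits 4 bytes into each row
      offset += RowSize
    }
    total
  }
}

val people = Seq(Person(1, 31), Person(2, 45), Person(3, 24))
val rows = BinaryRows.encode(people)
val totalAge = BinaryRows.sumAges(rows)

// The UTF-8 payload of "ABCD" is 4 bytes; the JVM String wrapping it is larger.
val utf8Bytes = "ABCD".getBytes(StandardCharsets.UTF_8).length
```

Because every row has the same width, the aggregation walks memory sequentially, which is also the kind of access pattern that benefits from CPU cache locality.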

For more information, check out [this blog post](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html) or [this presentation](http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen).

# Reading data from Implied Schema

In [26]:
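// Declare the schema by hand: column name, type, and whether nulls are allowed.
val explicitSchema = StructType(
  StructField("name", StringType, false) ::
  StructField("city", StringType, true) ::
  StructField("country", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil
)

// Streaming file sources require a schema up front, so pass the explicit one.
// maxFilesPerTrigger = 1 processes at most one new file per micro-batch.
val inExplicit = (spark.readStream
  .schema(explicitSchema)
  .format("csv")
  .option("header", true)
  .option("maxFilesPerTrigger", 1)
  .load("data/people.csv"))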

In [27]:
val inImplied = (spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("data/people.csv"))
  
val impliedSchema = inImplied.schema

println(impliedSchema)
println(explicitSchema)

StructType(StructField(Name,StringType,true), StructField( City,StringType,true), StructField(Country,StringType,true), StructField(Age,IntegerType,true))
StructType(StructField(name,StringType,false), StructField(city,StringType,true), StructField(country,StringType,true), StructField(age,IntegerType,true))
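
The two printed schemas differ in more than nullability: the inferred column names keep the capitalization of the CSV header, and ` City` carries a leading space. A minimal sketch of one way to reconcile the names is shown here in plain Scala; `normalizeNames` is a hypothetical helper, not part of Spark, and the input list is copied from the inferred schema printed above.

```scala
// Header names as they appear in the inferred schema printed above.
val inferredNames = Seq("Name", " City", "Country", "Age")

// Hypothetical helper: trim surrounding whitespace and lower-case each column name.
def normalizeNames(names: Seq[String]): Seq[String] =
  names.map(_.trim.toLowerCase)

val cleanedNames = normalizeNames(inferredNames)
```

In the notebook, the same helper could presumably be applied to the DataFrame itself with `inImplied.toDF(normalizeNames(inImplied.columns): _*)`, after which the column names match those of `explicitSchema`.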


# Reading data using Spark Streaming

In [28]:
import org.apache.spark._
import org.apache.spark.streaming._

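// Process the socket stream in 1-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9091)
val words = lines.flatMap(_.split("\\W+"))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Print the first few counts of every batch.
wordCounts.print()

ssc.start()
// Run for 20 seconds, then stop streaming but keep the notebook's SparkContext alive.
ssc.awaitTerminationOrTimeout(20000)
ssc.stop(stopSparkContext = false)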
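
For comparison with the DStream version, the same word count can be sketched with Structured Streaming. This follows the socket word count pattern from the Spark documentation and rests on several assumptions: the notebook's `spark` session is a `val` and its SparkContext is still running, a text server is listening on `localhost:9091` (for example one started with `nc -lk 9091`), and the names `socketLines`, `streamedCounts`, and `query` are chosen here for illustration.

```scala
import org.apache.spark.sql.streaming.StreamingQuery
import spark.implicits._ // assumes the notebook's `spark` session

// Read lines from the socket as an unbounded table with a single `value` column.
val socketLines = (spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9091)
  .load())

// Split each line into words, then count occurrences of each word.
val streamedCounts = (socketLines.as[String]
  .flatMap(_.split("\\W+"))
  .groupBy("value")
  .count())

// Print the complete running counts to the console on every trigger.
val query: StreamingQuery = (streamedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start())

query.awaitTermination(20000) // wait up to 20 seconds
query.stop()
```

Unlike the DStream version, which counts words within each 1-second batch, the `complete` output mode here keeps a running total across the whole stream.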