# 01: Introduction - A First Look at Spark

Our first exercise demonstrates the useful *Spark Shell*, which is a customized version of Scala's REPL (read, eval, print, loop). It allows us to work interactively with our algorithms and data.


## Local Mode Execution

This notebook exploits the fact that we can interactively run Spark expressions, using Scala, Python, or R.

The corresponding _script_ file in the GitHub project is
[Intro1-script.scala](https://github.com/deanwampler/spark-scala-tutorial/blob/master/src/main/scala/sparktutorial/Intro1-script.scala). It can be run using `spark-shell` in the Spark 2.2.0 distribution:

```shell
$SPARK_HOME/bin/spark-shell src/main/scala/sparktutorial/Intro1-script.scala
```

This assumes you're in the directory at the root of the GitHub project. Or, you can start the shell and load this script:

```shell
$SPARK_HOME/bin/spark-shell
...
scala> :load src/main/scala/sparktutorial/Intro1-script.scala
```

Here `scala>` is the Scala "REPL" (interpreter) prompt. See the project README.markdown and Tutorial.markdown for more details, including even more options.

Let's check that several important objects are defined:
* `spark` - the entry point object: `org.apache.spark.sql.SparkSession`
* `sc` - the pre-2.0 Spark entry point of type `org.apache.spark.SparkContext`, which is created by the `SparkSession`

In [1]:
spark

The output of the previous cell includes a link that is supposed to open the Spark console, but it doesn't route properly when running in the Docker image. However, the `run.sh` and `run.bat` commands tunnel port 4040, so all you have to do is open <a href="http://localhost:4040" target="spark_console">http://localhost:4040</a>.

Note that if you have several notebooks running, they will compete for port 4040, so subsequent notebook "kernels" will use ports 4041, 4042, etc.

In [2]:
spark.getClass

class org.apache.spark.sql.SparkSession

In [3]:
sc.getClass

class org.apache.spark.SparkContext

In compiled Spark jobs, like most of the examples in the GitHub project `src` tree, you need to start with logic like the following:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext

val spark = SparkSession.builder.        // builder pattern to construct the session
  master("local[*]").                    // run locally on all cores ([*])
  appName("Console").
  config("spark.app.id", "Console").     // to silence Metrics warning
  getOrCreate()                          // create it!

val sc = spark.sparkContext              // get the SparkContext
val sqlContext = spark.sqlContext        // get the old SQLContext, if needed
import sqlContext.implicits._            // import useful stuff
import org.apache.spark.sql.functions._  // here, import min, max, etc.
```

The [SparkSession](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession) drives everything else. Before Spark 2.0, the [SparkContext](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext) was the entry point. You can still extract it from the `SparkSession` as shown near the end, for convenience.

The `SparkSession.builder` uses the common _builder pattern_ to construct the session object. Note the value passed to `master`. The setting `local[*]` says to run locally on this machine, but use all available CPU cores. Replace `*` with a number to limit this to fewer cores. Just using `local` limits execution to 1 core. There are different arguments you would use for Hadoop, Mesos, Kubernetes, etc.

Next we define a read-only variable `input` of type [RDD](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD) by loading the text of the King James Version of the Bible, which has each verse on a line. We then map over the lines, converting the text to lower case:

In [4]:
val input = sc.textFile("../data/kjvdat.txt").  // load a text file, each line is a "record" of one field, a string
    map(line => line.toLowerCase)            // map each line to lowercase

input = MapPartitionsRDD[2] at map at <console>:28


MapPartitionsRDD[2] at map at <console>:28

More accurately, the type of `input` is `RDD[String]` (or `RDD<String>` in Java syntax). Think of it as a "collection of strings".

> The `data` directory has a `README` that discusses the files present and where they came from.

Next, we cache the data in memory for faster, repeated retrieval. You shouldn't always do this, as it's wasteful for data that's simply passed through, but when your workflow will repeatedly reread the data, caching provides performance improvements.

In [5]:
input.cache

MapPartitionsRDD[2] at map at <console>:28

Next, we filter the input for just those verses that mention "sin" (recall that the text is now lower case). Then count how many were found.

In [6]:
val sins = input.filter(line => line.contains("sin"))
val count = sins.count()         // How many sins?

sins = MapPartitionsRDD[3] at filter at <console>:29
count = 1292


1292

Next, convert the RDD to a Scala collection in the memory of the _driver_ process's JVM. That's the JVM behind this notebook interpreting these Spark expressions. If we weren't running in _local_ mode, the driver would delegate some processing to remote JVMs.

We'll also loop through the first five lines of the array, printing each one, then we'll do this again with the RDD itself.

In [7]:
val array = sins.collect()       // Convert the RDD into a collection (array)
array.take(5).foreach(println)  // Take the first 5, loop through them, and print them 1 per line.

gen|4|7| if thou doest well, shalt thou not be accepted? and if thou doest not well, sin lieth at the door. and unto thee shall be his desire, and thou shalt rule over him.~
gen|10|17| and the hivite, and the arkite, and the sinite,~
gen|12|2| and i will make of thee a great nation, and i will bless thee, and make thy name great; and thou shalt be a blessing:~
gen|13|13| but the men of sodom were wicked and sinners before the lord exceedingly.~
gen|18|20| and the lord said, because the cry of sodom and gomorrah is great, and because their sin is very grievous;~


array = Array(gen|4|7| if thou doest well, shalt thou not be accepted? and if thou doest not well, sin lieth at the door. and unto thee shall be his desire, and thou shalt rule over him.~, gen|10|17| and the hivite, and the arkite, and the sinite,~, gen|12|2| and i will make of thee a great nation, and i will bless thee, and make thy name great; and thou shalt be a blessing:~, gen|13|13| but the men of sodom were wicked and sinners before the lord exceedingly.~, gen|18|20| and the lord said, because the cry of sodom and gomorrah is great, and because their sin is very grievous;~, gen|20|6| and god said unto him in a dream, yea, i know that thou didst this in the integrity of thy heart; for i also withheld thee from sinning against me: therefore suffered i thee not to touc...




In [8]:
sins.take(5).foreach(println)   // ... but we don't have to "collect" first;
                                 // we can call "take" directly on the RDD, which returns an array.

gen|4|7| if thou doest well, shalt thou not be accepted? and if thou doest not well, sin lieth at the door. and unto thee shall be his desire, and thou shalt rule over him.~
gen|10|17| and the hivite, and the arkite, and the sinite,~
gen|12|2| and i will make of thee a great nation, and i will bless thee, and make thy name great; and thou shalt be a blessing:~
gen|13|13| but the men of sodom were wicked and sinners before the lord exceedingly.~
gen|18|20| and the lord said, because the cry of sodom and gomorrah is great, and because their sin is very grievous;~


Moving to a new topic, you can define *functions* as *values*. Here we create a separate filter function that we pass as an argument to the filter method. Previously we used an *anonymous function*. Note that `filterFunc` is a value that's a function of type `String` to `Boolean`.

In [9]:
val filterFunc: String => Boolean =
    (s:String) => s.contains("god") || s.contains("christ")

filterFunc = > Boolean = <function1>


<function1>

The following more concise form is equivalent, due to *type inference* of the argument's type:

In [10]:
val filterFunc: String => Boolean =
    s => s.contains("god") || s.contains("christ")

filterFunc = > Boolean = <function1>


<function1>

Now use the filter to find all the `sin` verses that also mention God or Christ, then count them. Note that this time, we drop the "punctuation" in the first line (the comment shows what we dropped), and we drop the parentheses after "count". Parentheses can be omitted when methods take no arguments.

In [11]:
val sinsPlusGodOrChrist  = sins filter filterFunc // same as: sins.filter(filterFunc)
val countPlusGodOrChrist = sinsPlusGodOrChrist.count

sinsPlusGodOrChrist = MapPartitionsRDD[4] at filter at <console>:33
countPlusGodOrChrist = 240


240
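
Because `filterFunc` is an ordinary value, the same idea works with plain Scala collections, with no Spark involved. The following is a minimal sketch under that assumption; the name `mentionsGodOrChrist` and the sample lines are made up for illustration and are not taken from the data file.

```scala
// A function value with the same shape as `filterFunc` in the notebook.
val mentionsGodOrChrist: String => Boolean =
  s => s.contains("god") || s.contains("christ")

// Made-up sample lines (not from kjvdat.txt).
val sample = List(
  "the word god appears here",
  "no match in this line",
  "christ is mentioned here")

// Pass the function value wherever a String => Boolean is expected.
val matched    = sample.filter(mentionsGodOrChrist)
val notMatched = sample.filterNot(mentionsGodOrChrist)

matched.foreach(println)
```

The function is defined once and reused by both `filter` and `filterNot`, which is the main reason to name it rather than repeat an anonymous function.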

Finally, let's do _Word Count_, where we load a corpus of documents, tokenize them into words and count the occurrences of all the words.

First, we'll define a helper method to look at the data. We need to import the RDD type:

In [12]:
import org.apache.spark.rdd.RDD

def peek(rdd: RDD[_], n: Int = 10): Unit = {
  println("RDD type signature: "+rdd+"\n")
  println("=====================")
  rdd.take(n).foreach(println)
  println("=====================")
}

peek: (rdd: org.apache.spark.rdd.RDD[_], n: Int)Unit


In the type signature `RDD[_]`, the `_` means "any type". In other words, we don't care what records this `RDD` is holding, because we're just going to call `toString` on each one. The second argument `n` is the number of records to print. It has a default value of `10`, which means if the caller doesn't provide this argument, we'll print `10` records.

The `peek` function prints the type of the `RDD` by calling `toString` on it (effectively). Then it takes the first `n` records, loops through them, and prints each one on a line.
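
To see the wildcard type and the default argument in isolation, here is a small sketch that uses a local `Seq` instead of an `RDD`. The helper `peekLocal` is hypothetical and exists only for this illustration; it returns the strings rather than printing them so the result is easy to inspect.

```scala
// Hypothetical local analog of `peek`: accepts a Seq of any element type.
def peekLocal(items: Seq[_], n: Int = 10): Seq[String] =
  items.take(n).map(_.toString)   // toString is available on every element

val firstTen = peekLocal(1 to 100)                    // n omitted, so it defaults to 10
val firstTwo = peekLocal(Seq("a", "b", "c"), n = 2)   // n passed by name

firstTwo.foreach(println)
```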

Let's use `peek` to remind ourselves what the `input` value is:

In [13]:
input

MapPartitionsRDD[2] at map at <console>:28

In [14]:
peek(input)

RDD type signature: MapPartitionsRDD[2] at map at <console>:28

gen|1|1| in the beginning god created the heaven and the earth.~
gen|1|2| and the earth was without form, and void; and darkness was upon the face of the deep. and the spirit of god moved upon the face of the waters.~
gen|1|3| and god said, let there be light: and there was light.~
gen|1|4| and god saw the light, that it was good: and god divided the light from the darkness.~
gen|1|5| and god called the light day, and the darkness he called night. and the evening and the morning were the first day.~
gen|1|6| and god said, let there be a firmament in the midst of the waters, and let it divide the waters from the waters.~
gen|1|7| and god made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so.~
gen|1|8| and god called the firmament heaven. and the evening and the morning were the second day.~
gen|1|9| and god said, let the waters under the heave

Note that `input` is an instance of a subtype of `RDD` called `MapPartitionsRDD`, and its type `RDD[String]` means the "records" are just strings. You might confirm for yourself that the lines shown by `peek(input)` match the input data file.

Now, let's split each line into words. We'll treat any run of non-alphabetic characters as the "delimiter". Note that digits count as delimiters too, so the chapter and verse numbers are dropped:

In [15]:
val words = input.flatMap(line => line.split("""[^\p{IsAlphabetic}]+"""))
peek(words)

RDD type signature: MapPartitionsRDD[5] at flatMap at <console>:32

gen
in
the
beginning
god
created
the
heaven
and
the


words = MapPartitionsRDD[5] at flatMap at <console>:32


MapPartitionsRDD[5] at flatMap at <console>:32

Does the output make sense to you? The type of the `RDD` hasn't changed, but the records are now individual words.
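
The split can be tried on plain strings, without Spark. The sketch below uses two lines copied from the `peek(input)` output above; the value names are made up for this example. It also contrasts `map`, which keeps one list per line, with `flatMap`, which flattens everything into a single sequence of words.

```scala
// The same delimiter pattern as in the notebook cell:
// one or more characters that are not alphabetic.
val delimiter = """[^\p{IsAlphabetic}]+"""

// Two lines copied from the peek(input) output.
val sampleLines = Seq(
  "gen|1|1| in the beginning god created the heaven and the earth.~",
  "gen|1|3| and god said, let there be light: and there was light.~")

val firstLineTokens = sampleLines.head.split(delimiter).toList

// map keeps the nesting: one List of words per line.
val nestedTokens = sampleLines.map(line => line.split(delimiter).toList)

// flatMap flattens the per-line results into one sequence of words.
val flatTokens = sampleLines.flatMap(line => line.split(delimiter).toList)

firstLineTokens.foreach(println)
```

The trailing `.~` produces no empty token because `split` discards trailing empty strings. A line that began with a delimiter would yield an empty first token, which doesn't occur here because every line starts with the book name.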

Now let's use our friend from SQL, `GROUP BY`, where we use the words as the "keys":

In [16]:
val wordGroups = words.groupBy(word => word)
peek(wordGroups)

RDD type signature: ShuffledRDD[7] at groupBy at <console>:34

(winefat,CompactBuffer(winefat, winefat))
(honeycomb,CompactBuffer(honeycomb, honeycomb, honeycomb, honeycomb, honeycomb, honeycomb, honeycomb, honeycomb, honeycomb))
(bone,CompactBuffer(bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone, bone))
(glorifying,CompactBuffer(glorifying, glorifying, glorifying))
(nobleman,CompactBuffer(nobleman, nobleman, nobleman))
(hodaviah,CompactBuffer(hodaviah, hodaviah, hodaviah))
(raphu,CompactBuffer(raphu))
(hem,CompactBuffer(hem, hem, hem, hem, hem, hem, hem))
(onyx,CompactBuffer(onyx, onyx, onyx, onyx, onyx, onyx, onyx, onyx, onyx, onyx, onyx))
(pigeon,CompactBuffer(pigeon, pigeon))


wordGroups = ShuffledRDD[7] at groupBy at <console>:34


ShuffledRDD[7] at groupBy at <console>:34

Note that the records are now two-element `Tuples`: `(String, Iterable[String])`, where `Iterable` is a Scala abstraction for an underlying, sequential collection. We see that these iterables are actually `CompactBuffers`, a Spark collection that wraps an array of objects. Note that these buffers just hold repeated occurrences of the corresponding keys. This is wasteful, especially at scale! We'll learn a better way to do this calculation shortly.
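
To make the waste concrete, here is a sketch with local Scala collections and a made-up word list. It is only an illustration of the idea, not the Spark approach this tutorial turns to later: `groupBy` keeps every occurrence of each word, while a fold keeps a single running total per word.

```scala
// Made-up sample; the value names are hypothetical.
val sampleWords = List("god", "light", "god", "day", "light", "god")

// groupBy keeps every occurrence: word -> List(word, word, ...)
val localGroups: Map[String, List[String]] = sampleWords.groupBy(word => word)

// A fold holds just one Int per distinct word while it runs.
val countsFromFold: Map[String, Int] =
  sampleWords.foldLeft(Map.empty[String, Int]) { (counts, word) =>
    counts.updated(word, counts.getOrElse(word, 0) + 1)
  }

println(localGroups("god"))      // the repeated copies held by the group
println(countsFromFold("god"))   // the single number we actually want
```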

Finally, let's compute the size of each `CompactBuffer`, which completes the calculation of how many occurrences there are for each word:

In [17]:
val wordCounts1 = wordGroups.map( word_group => (word_group._1, word_group._2.size))
peek(wordCounts1)

RDD type signature: MapPartitionsRDD[8] at map at <console>:36

(winefat,2)
(honeycomb,9)
(bone,19)
(glorifying,3)
(nobleman,3)
(hodaviah,3)
(raphu,1)
(hem,7)
(onyx,11)
(pigeon,2)


wordCounts1 = MapPartitionsRDD[8] at map at <console>:36


MapPartitionsRDD[8] at map at <console>:36

Note that the function passed to `map` expects a single two-element `Tuple` argument. We extract the two elements using the `_1` and `_2` methods. (Tuples index from 1, rather than 0, following historical convention.) The type of `wordCounts1` is `RDD[(String,Int)]`.

There is a more concise syntax we can use for the function passed to `map`, which exploits _pattern matching_ to break up the tuple into its constituents, which are then assigned to the value names:

In [18]:
val wordCounts2 = wordGroups.map{ case (word, group) => (word, group.size) }
peek(wordCounts2)

RDD type signature: MapPartitionsRDD[9] at map at <console>:36

(winefat,2)
(honeycomb,9)
(bone,19)
(glorifying,3)
(nobleman,3)
(hodaviah,3)
(raphu,1)
(hem,7)
(onyx,11)
(pigeon,2)


wordCounts2 = MapPartitionsRDD[9] at map at <console>:36


MapPartitionsRDD[9] at map at <console>:36

The results are exactly the same.
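
The two styles can be compared side by side on a local list of tuples. This is a minimal sketch with made-up data (the groups are much smaller than the real ones) and hypothetical value names:

```scala
// Made-up (word, occurrences) pairs shaped like the wordGroups records.
val sampleGroups = List(
  ("bone", List("bone", "bone")),
  ("hem",  List("hem")))

// Style 1: positional accessors _1 and _2.
val viaAccessors = sampleGroups.map(pair => (pair._1, pair._2.size))

// Style 2: a pattern-matching function literal that names both parts.
val viaPattern = sampleGroups.map { case (word, group) => (word, group.size) }

viaPattern.foreach(println)
```

The pattern-matching form needs curly braces rather than parentheses around the `case` clause in Scala 2, which is why the notebook cell switches to `map{ ... }`.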

But there is actually an even easier way. Note that we aren't modifying the
_keys_ (the words), so we can use a convenience function `mapValues`, where only
the value part (second tuple element) is passed to the anonymous function and
the keys are retained:

In [19]:
val wordCounts3 = wordGroups.mapValues(group => group.size)
peek(wordCounts3)

RDD type signature: MapPartitionsRDD[10] at mapValues at <console>:36

(winefat,2)
(honeycomb,9)
(bone,19)
(glorifying,3)
(nobleman,3)
(hodaviah,3)
(raphu,1)
(hem,7)
(onyx,11)
(pigeon,2)


wordCounts3 = MapPartitionsRDD[10] at mapValues at <console>:36


MapPartitionsRDD[10] at mapValues at <console>:36

Finally, let's save the results to the file system:

In [21]:
wordCounts3.saveAsTextFile("output/kjv-wc-groupby")  // "output" is a subdirectory of the "notebooks" directory

lastException: Throwable = null


If you look in the directory `output/kjv-wc-groupby` (use the Jupyter file browser), you'll see three files:

* `_SUCCESS`
* `part-00000`
* `part-00001`

The `_SUCCESS` file is empty. It's a marker used by the Hadoop File I/O libraries (which Spark uses) to signal to waiting processes that the file output has completed. The other two files each hold a _partition_ of the data. In this case, we had two partitions.
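
If you want to read the saved results back with plain Scala rather than Spark, one possible sketch follows. The helper `readParts` is hypothetical. It assumes the partition files are small enough to load into memory and that their names start with `part-`, as in the listing above.

```scala
import java.io.File
import scala.io.Source

// Hypothetical helper: read all `part-*` files in an output directory,
// in file-name order, and return their lines. Marker files such as
// _SUCCESS are skipped because their names don't start with "part-".
def readParts(dir: File): List[String] = {
  // listFiles returns null if the directory doesn't exist.
  val allFiles = Option(dir.listFiles()).getOrElse(Array.empty[File])
  val parts = allFiles.filter(f => f.isFile && f.getName.startsWith("part-")).sortBy(_.getName)
  parts.toList.flatMap { f =>
    val source = Source.fromFile(f, "UTF-8")
    try source.getLines().toList finally source.close()
  }
}

// Prints nothing if the directory hasn't been written yet.
readParts(new File("output/kjv-wc-groupby")).take(5).foreach(println)
```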

We're done, but let's finish by noting that a non-script program should shut down gracefully by calling `spark.stop()`, which also stops the underlying `SparkContext`. However, we don't need to do so here, because both our configured `console` environment for local execution and `spark-shell` do this for us:

```scala
// spark.stop()
```

If you exit the REPL immediately, this will happen implicitly. Still, it's a good practice to always call `stop`. Don't do this now; we want to keep the session alive...

### The Spark Web Console

When you have a `SparkContext` running, it provides a web UI with very useful information about how your job is mapped to JVM tasks, metrics about execution, etc.

Before we finish this exercise, visit the Spark driver UI at <a href="http://localhost:4040" target="spark-driver">http://localhost:4040</a>. You will find this console very useful for learning Spark internals and when debugging problems, such as performance bottlenecks. 

> **Tips:** 
> 1. If you can't get to this UI, it either means you aren't tunneling port 4040 out of the Docker container or no Spark drivers are currently running!
> 2. If you have two or more notebooks running, subsequent notebooks will use port 4041, 4042, etc. for the driver UIs. However, these ports aren't tunneled by default.

## Exercises

Solutions for some of the suggested exercises are provided in the [src/main/scala/sparktutorial/solns](https://github.com/deanwampler/spark-scala-tutorial/tree/master/src/main/scala/sparktutorial/solns) directory (although not for this notebook's suggestions).

### Exercise 1: Try Different Filters

The filter function could match on a regular expression, for example. Note also the line format in the input text files for the Bible: `book|chapter#|verse#|verse_text`. It would be easy to filter on the book of the Bible, etc.
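
As a starting point, here is one possible sketch using plain Scala. The `Verse` case class, `parseVerse`, `sinWord`, and `isSinInGenesis` are all hypothetical names introduced for this example. The regular expression is just one choice of word forms, and the book code `gen` is taken from the output shown earlier.

```scala
import scala.util.Try

// Hypothetical record type for one line: book|chapter#|verse#|verse_text
case class Verse(book: String, chapter: Int, verse: Int, text: String)

// Split into at most 4 fields so any "|" inside the verse text is preserved.
// Returns None for lines that don't match the expected format.
def parseVerse(line: String): Option[Verse] =
  line.split("\\|", 4) match {
    case Array(book, chapter, verse, text) =>
      Try(Verse(book, chapter.trim.toInt, verse.trim.toInt,
                text.trim.stripSuffix("~"))).toOption
    case _ => None
  }

// Matches "sin", "sins", "sinned", "sinner", "sinners" as whole words only,
// so "blessing" and "sinite" no longer count.
val sinWord = """\bsin(s|ned|ners?)?\b""".r

def isSinInGenesis(line: String): Boolean =
  parseVerse(line).exists(v => v.book == "gen" && sinWord.findFirstIn(v.text).isDefined)
```

In the notebook, a predicate like this could be passed to `input.filter` in place of the `contains("sin")` test; comparing the two counts shows how many matches were only substrings of other words.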

### Exercise 2: Try Different Sacred Texts in the "data" Directory

You can also download other texts from http://www.sacred-texts.com/, or just use any other texts you have. Are character sets like UTF-16 handled well, e.g., the Hebrew text in `../data/t3utf.dat`?