<style>
  h1, h2, h3, h4, h5, p, ul, li {
    color: #2C475C;
  }
  .output_html {
    color: skyblue;
  }
  hr { height: 2px; color: lightblue; }
</style>

# An overview of Scala and Spark

## Introduction

In this chapter we will walk through a simple data processing case to give an overview of what Scala and Spark bring to the table, as a language and as a data analytics framework respectively.

We will take a sample of stock market end-of-day data, compute some simple transformations and perform joins. This will allow us to preview some of the valuable features of Scala and hint at how Spark works.

## Prelude

Outside the Spark Notebook environment, a list of imports is necessary to get everything needed properly loaded.

In [ ]:
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession._
import org.apache.spark.SparkContext._

## Download some stock market data

We start by downloading some data to work with. We call the Yahoo Finance Historical data service with a simple HTTP query to get CSV files containing the requested data, _i.e._ end-of-day quotes for a selected stock or other financial instrument.

We will download the data using the native `scala.io.Source` API. There are other, more specific ways to access the internet, for instance using `sys.process.ProcessBuilder` to make a system call to `curl` or `wget`.

Before that, we set a variable `dataDir` as our base data directory, here by appending `/data/scala4spark` to the temporary directory (the `java.io.tmpdir` system property). We will discuss `val` variables and types a few instructions further on.

In [ ]:
val dataDir = sys.props("java.io.tmpdir") + "/data/scala4spark"

We make sure the folder exists.

In [ ]:
new java.io.File(dataDir).mkdirs()

In the following, we create a helper function to download the online data as _CSV_ (Comma Separated Values) files.

In [ ]:
val base = "https://s3-eu-west-1.amazonaws.com/kensuio-training/data"
def mkUrl(symbol: String) = s"${base}/${symbol}-2017.csv"

In [ ]:
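// Download the CSV file for symbol `q` into `dataDir`
def dl(q: String): Unit = {
  val source = scala.io.Source.fromURL(mkUrl(q))
  val f = new java.io.FileWriter(new java.io.File(s"${dataDir}/$q.csv"), false)
  // copy character by character, then release both resources
  try source.foreach(f.append(_))
  finally { f.close(); source.close() }
}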

Quite a few things to comment on for these instructions. The strings are preceded by `s`, the string interpolator, which is applied to the string to interpolate its content. This means that the dollar sign and curly braces can be used to embed Scala expressions that are evaluated within the `String`.

For example, `${symbol}` is replaced by the value of the `symbol` parameter, and the same goes for the `dataDir` variable.
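As a small illustration of the interpolation just described, here is a sketch that runs without any network access. The names `baseUrl`, `ticker`, `exampleUrl` and `summary` and the `example.org` address are hypothetical and only serve the example.

```scala
// Hypothetical values, used only for illustration.
val baseUrl = "https://example.org/data"
val ticker  = "COP"

// `$name` inserts a simple identifier, `${...}` inserts any Scala expression.
val exampleUrl = s"$baseUrl/${ticker}-2017.csv"
val summary    = s"${ticker.toLowerCase}.csv has ${ticker.length} letters in its symbol"

// exampleUrl: https://example.org/data/COP-2017.csv
// summary:    cop.csv has 3 letters in its symbol
```

The braces are only required when the inserted expression is more than a plain identifier, or when the identifier is immediately followed by characters that could be read as part of its name.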

We then use a combination of the Scala API and the Java API to read the online data (`Source`) and write it into a local file (`FileWriter`).

We can now use the helper to download the data we'll use in the remainder of the book.

In [ ]:
dl("GOLD")
dl("COP")

## Read data with Spark

Now that some data is available on the file system, we can start working on it. The first obvious thing to do is to check the content, by looking at a few lines in the file.
We start with opening the file with Spark:

In [ ]:
val oiltxt: RDD[String] = sparkContext.textFile(s"$dataDir/COP.csv")

Here again, there are quite a few things to note.

`oiltxt` is a variable declared as `val`. A `val` is an immutable variable: it must be initialized at creation and its value cannot be changed afterwards. Using immutable variables is a very important characteristic of functional programming, as practiced in Scala. It removes the temptation to spread side effects across scopes that are not well constrained, and thus makes code much less error prone and easier to test.

The type of the `oiltxt` variable is `RDD[String]`. `RDD` stands for Resilient Distributed Dataset. It is a Spark abstraction over a collection of items, here of type `String`. This abstraction allows us to define computations on this collection of items, for execution in the distributed environment.

We explicitly set the variable type here, but the compiler does not need the developer to set it, since it can infer it. The `textFile` function returns an `RDD[String]`, thus the following would be valid:

```scala
val oiltxt = sparkContext.textFile(s"$dataDir/COP.csv")
```

`sparkContext` is a variable defined by Spark. It is the gateway to the distributed environment: it contains the configuration for connecting to the cluster and, as you have already understood, it is used to access data stored in the distributed environment. It is in charge of building the DAG (Directed Acyclic Graph) that defines the succession of computations. It is the DAG that allows resiliency, because a computation plan is kept to reconstruct data in case of failure. The SparkContext also schedules computations on the cluster.

`textFile` is a function that reads text data; every single line becomes one item of the collection represented by the `RDD`.

Executing this line instantiates the `oiltxt` variable and confirms its type, `RDD[String]`.

At this point, we have an `RDD` which defines one operation: reading lines from a file. However, this operation has not been executed; only its definition has been created. If we look at the user interface that Spark provides to monitor the computations it evaluates, we will see that nothing has run: no computation has been scheduled yet. The Spark UI is generally available on port `4040` of the host running the program. In the Spark Notebook, you can use the menu entry `View > Open Spark UI`.
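The idea of defining a computation without running it can be sketched with plain Scala collections. This is only a local analogy, not how Spark is implemented: a Scala `Iterator` plays the role of the `RDD`, and the hypothetical `processed` counter shows when the work actually happens.

```scala
// Counts how many lines have actually been processed.
var processed = 0

// Hypothetical in-memory stand-in for the lines of a file.
val demoLines = List("Date,Close", "2017-01-03,50.1", "2017-01-04,50.9")

// Defining the transformation does not run it: Iterator.map is lazy.
val lengths: Iterator[Int] = demoLines.iterator.map { line =>
  processed += 1
  line.length
}
val processedAfterDefinition = processed   // still 0

// Consuming the iterator plays the role of a Spark action.
val totalLength = lengths.sum
val processedAfterAction = processed       // now 3
```

Nothing is counted until `sum` consumes the iterator, in the same way that nothing shows up in the Spark UI until an action is called on the `RDD`.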

> **Distributed (Partitioned) Data** (cf Chapter 07)
>
> The files are, of course, read locally in this example.
>
> Nowadays, however, it is good practice when processing data to always assume that it is distributed. The way a dataset is distributed is easy to understand: the data is split into a number of non-overlapping files stored on different machines, hence their name, `Partition`.
>
> This is why Spark, by default, always splits data into at least two partitions (see `spark.default.parallelism`).

## Actions

In [ ]:
oiltxt.count

We downloaded one year of daily data, so 255 points make sense. To compute this count, the data must be read, so the partitions have to be accessed in the cluster. Looking at the Spark UI again confirms this.

A single stage was executed, with two steps: reading the data and counting the items. The number of tasks in this stage depends on the number of partitions that were processed; in local mode it defaults to the number of cores on which you are running.

Let's take a look at the content of the data file, for instance by taking the first three lines and displaying their content (as `String`s).

In [ ]:
oiltxt.take(3)

The `take(n: Int)` function is also an action: it triggers computations, which we can confirm by looking at the Spark UI. In this case, there is only one task, because in order to take the first 3 lines we only need to process one partition of the data, whereas we needed the entire dataset to count the total number of items.

From the distributed dataset, `take(n: Int)` takes the first lines and collects them on the driver (which is running the current code), as an array of Strings (`Array[String]`).

## Filtering data

The first line is a header and the following lines contain the data. We want to do some computations on the numerical values, so we need to remove the header line. We can do this with a filter:

In [ ]:
oiltxt.filter( (s: String) => ! s.startsWith("Date"))

We called the `filter` function on the `oiltxt` RDD. This function takes an argument which may look "funny" if not cryptic. This argument is actually an anonymous function or lambda. 

`s` is the function argument and the function body comes after the `=>`. So here the function takes one `String` argument and returns a `Boolean`, which is `true` if the `String` doesn't start with `"Date"`.

This function will be applied to each element of the `oiltxt` RDD, using the item as the function argument. So for each item, `s` takes the value of the line of text; the lines that satisfy the predicate are kept in the resulting `RDD`, the others are discarded.

Note that we define the operation to be applied to every element, but we don't say anything about how to loop over the elements. Nor do we create the resulting `RDD` ourselves. Furthermore, we do not specify whether the transformation is executed sequentially (one core), in parallel (several cores) or distributed (cluster); the same code actually applies to all modes when using Spark.

This is one of the advantages of applying functional programming, the separation between:

* defining computations and 
* the execution plan (scheduling).

Again, `filter` returns an `RDD`, which means that it is lazy, just like `textFile`. Hence, to know the result we need to run an action like `count` or `take` to trigger the computation and see the results on the driver.

In [ ]:
oiltxt.filter( s => ! s.startsWith("Date") ).count

As expected, we removed only one element, as there is only 1 header line starting with `"Date"`. 
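The anonymous-function syntax used with `filter` is not specific to Spark. The following sketch applies the same predicate to a plain Scala `List`; the `sampleLines` values are hypothetical and only mimic the shape of the file.

```scala
// Hypothetical sample lines, standing in for the content of the file.
val sampleLines = List("Date,Close", "2017-01-03,50.1", "2017-01-04,50.9")

// Fully explicit anonymous function: parameter name and type are given.
val explicit = sampleLines.filter((s: String) => !s.startsWith("Date"))

// The parameter type can be inferred from the collection's element type.
val inferred = sampleLines.filter(s => !s.startsWith("Date"))

// A named function value can be passed in exactly the same way.
val isData: String => Boolean = s => !s.startsWith("Date")
val named = sampleLines.filter(isData)
```

All three calls produce the same list with the header removed. Unlike the `RDD` version, `filter` on a `List` is evaluated immediately.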

## Parsing data

Now we will parse the lines to extract some numerical values, for instance the close price of the stock, which is the second-to-last field:

In [ ]:
oiltxt.filter( s => ! s.startsWith("Date") )
.map( s => {
  val arr = s.split(",")
  arr(arr.size - 2).toDouble // last element is volume
}).first

This is a bit more involved. Like `filter`, the `map` function takes a function as argument, which is applied to each element of the `RDD`.

After filtering out the header line, we apply a function to each element that splits the line using `","` as separator. This returns an `Array[String]`, from which we take the second-to-last element (the last one being the volume) and convert it to a `Double`. Again, very straightforward.

Without the final call to `first`, the result is an `RDD` with `Double` elements (`RDD[Double]`). Next, we extract the date as a `String` together with the price and build an `RDD` of tuples:

In [ ]:
val oil = oiltxt.filter( s => ! s.startsWith("Date") )
                .map{ s =>
                  val tokens = s.split(",")
                  (tokens(1), tokens(tokens.size - 2).toDouble)
                }

Tuples are Scala structures used to hold a fixed number of elements of possibly different types, in this case two elements, a `String` and a `Double`. The type is `(String, Double)`.

Elements of a tuple can be accessed using `_1` (`_2`, `_3`, ...) for the first (second, third, ..., respectively) element.

For example, one way to get an RDD of dates would be:

```scala

oil.map( t => t._1 )

```

or

```scala

oil.map( _._1 )

```

Here, because we use the function parameter (`t`) only once, Scala lets us replace it with the `_` placeholder. Because it cannot be mistaken for anything else, we don't need to explicitly give the parameter a name. This notation is more compact and still very clear. We will see some examples with two such placeholders (two arguments, each used only once) later.
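To make the tuple accessors and the placeholder syntax concrete, here is a minimal sketch on a local `List`. The `pairs` data is hypothetical and only mirrors the `(String, Double)` shape of the `oil` RDD.

```scala
// Hypothetical (date, price) pairs, mirroring the shape of the `oil` RDD.
val pairs: List[(String, Double)] =
  List(("2017-01-03", 50.0), ("2017-01-04", 51.5), ("2017-01-05", 49.5))

val dates  = pairs.map(_._1)           // same as pairs.map(t => t._1)
val prices = pairs.map(_._2)

// Two placeholders: the first `_` is the first argument, the second `_` the second.
val sumOfPrices = prices.reduce(_ + _) // same as prices.reduce((a, b) => a + b)

// Pattern matching gives names to the tuple elements instead of _1 and _2.
val labels = pairs.map { case (date, price) => s"$date: $price" }
```

The `case (date, price)` form is often easier to read than `_1` and `_2` once the function body uses more than one element of the tuple.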

## Grouping elements


The next thing we will do is group the data to build a histogram of the price distribution. We first compute the mean and standard deviation of the prices using the `mean` and `stdev` functions that Spark provides on `RDD[Double]`.

In [ ]:
val meanoil = oil.map( _._2 ).mean
val sdoil   = oil.map( _._2 ).stdev

Then we group data by computing `z`, the distance to mean price in standard deviation units and we round that number:

In [ ]:
oil.groupBy( t => ((t._2 - meanoil)/sdoil).round )

The type of the tuples in the `RDD` is `(Long, Iterable[(String, Double)])`. The first element (`Long`) is the rounded distance to the mean; it is the grouping key. The values corresponding to each grouping key are accessible as an `Iterable[(String, Double)]`, each element of which is one element of the original `RDD[(String, Double)]`.

Using a strongly typed language really helps us understand what we are manipulating and how a function transforms our data. The compiler gives us direct feedback on what we are doing and helps to pinpoint errors without running the computations on the data.

From this point, little more is needed to get the histogram data: we just have to count the number of elements in each group. The `mapValues` function allows us to apply a function to each value (an `Iterable` here):

In [ ]:
oil.groupBy( t => ((t._2 - meanoil)/sdoil).round )
   .mapValues( _.size )

Now items are of type `(Long, Int)`: the first element of the tuple is the distance to the mean and the second element is the size of the group.

We then need to collect that data and sort using the key:

In [ ]:
TableChart(
  oil.groupBy( t => ((t._2 - meanoil)/sdoil).round )
   .mapValues( _.size )
   .collect
   .sortBy( _._1 )
)

You see how we have constructed a data transformation step by step, looking at the types returned at each new instruction to make sure we were computing the expected quantities. This is one of the great benefits of using a strongly typed language with a REPL or notebook.
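The same histogram logic can be tried on a handful of values with plain Scala collections, which makes each intermediate result easy to check by hand. This is a sketch on hypothetical data (`sampleQuotes`), not the Spark code itself; the standard deviation is computed as the population standard deviation, which is what `stdev` on an `RDD[Double]` returns.

```scala
// Hypothetical (date, price) pairs; the real data comes from the `oil` RDD.
val sampleQuotes = List(
  ("d1", 10.0), ("d2", 12.0), ("d3", 14.0), ("d4", 16.0), ("d5", 18.0)
)

val samplePrices = sampleQuotes.map(_._2)
val sampleMean = samplePrices.sum / samplePrices.size
// Population standard deviation (divide by n, not n - 1).
val sampleSd =
  math.sqrt(samplePrices.map(p => math.pow(p - sampleMean, 2)).sum / samplePrices.size)

// Group by rounded z-score, count each group, sort by key.
val histogram: List[(Long, Int)] =
  sampleQuotes
    .groupBy { case (_, price) => ((price - sampleMean) / sampleSd).round }
    .map { case (z, group) => (z, group.size) }
    .toList
    .sortBy(_._1)

// histogram: List((-1,2), (0,1), (1,2))
```

Here the mean is 14.0 and the standard deviation is about 2.83, so the five prices fall into the buckets -1, -1, 0, 1 and 1.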

## Joins

Now we will read and parse the other file, with GOLD (GLD) data, and join the two datasets on the date in order to compare GOLD and OIL prices. These are exactly the same functions as for the OIL data:

In [ ]:
val gld = sparkContext.textFile(s"$dataDir/GOLD.csv")
                      .filter( s => ! s.startsWith("Date"))
                      .map{ s =>
                        val tokens = s.split(",")
                        (tokens(1), tokens(tokens.size - 2).toDouble)
                      }

We will compute the mean and standard deviation of the GLD prices for scaling purposes:

In [ ]:
val meangold = gld.values.mean
val sdgold = gld.values.stdev

When a tuple has two elements, the first one is called the key and the second one is the value. If two `RDD`s have tuples as elements with keys of the same type, then it is possible to join them. Like here:

In [ ]:
val oilgld = oil.join(gld)

In [ ]:
oilgld.first

The resulting `RDD` has elements of type `(String, (Double, Double))`. 

The key is the common `String`, and the value is a tuple containing the `Double` for OIL and GLD.
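The shape of the joined elements can be reproduced on small local collections. The sketch below uses hypothetical data (`oilSample`, `goldSample`) and assumes that each date appears at most once on each side; with duplicate keys, Spark's `join` would emit every combination, which this simplified version does not attempt to show.

```scala
// Hypothetical keyed data, mirroring the shape of `oil` and `gld`.
val oilSample  = List(("2017-01-03", 50.0), ("2017-01-04", 51.5), ("2017-01-05", 49.5))
val goldSample = List(("2017-01-04", 17.0), ("2017-01-05", 17.5), ("2017-01-06", 18.0))

// Index one side by key, then keep only the keys present on both sides.
val goldByDate: Map[String, Double] = goldSample.toMap

val joined: List[(String, (Double, Double))] =
  for {
    (date, oilPrice) <- oilSample
    goldPrice        <- goldByDate.get(date).toList
  } yield (date, (oilPrice, goldPrice))

// joined: List((2017-01-04,(51.5,17.0)), (2017-01-05,(49.5,17.5)))
```

Dates that exist on only one side (`2017-01-03` and `2017-01-06` here) are dropped, which is the behaviour of an inner join.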

We can now collect this data and scale the prices to see a comparison of the GOLD and OIL prices:

In [ ]:
ScatterChart(
  // center each price on its mean and divide by its standard deviation
  oilgld.mapValues( t => ((t._1 - meanoil)/sdoil, (t._2 - meangold)/sdgold))
        .values
        .sortBy(_._1)
        .collect
)

## Spark DAG and resilience

All these computations, including the join, are done on distributed datasets.

How are they scheduled? We have seen (in <a href="#Actions">Actions</a>) that computations are only triggered when an action is called, usually one that requires data to come back to the driver (`collect`, `take`, `count`) or that writes outputs.

So a Directed Acyclic Graph (DAG) of computations is constructed and used to plan the execution of all computations when required.

Here is how to have information about this DAG:

In [ ]:
println(oilgld.toDebugString)

We clearly see here that two data sets are read, filtered and mapped before being joined. The complete set of computations is defined this way. If any failure occurred at the level of a computing node, it would be possible to reconstruct the data because the plan remains and the resiliency is guaranteed.

You can find the same DAG in a more user-friendly representation in the Spark UI.

## Caching

We will now work on a bigger dataset and see how working explicitly with memory (caching) can help with performance. We will prepare a dataset and use it several times.

For this we will define our own class instead of using tuples, because a dedicated class makes field access easier than with tuples.

In [ ]:
case class Quote(symbol:String, date:String, price:Double)

Here we defined a `case` class. Case classes are much like Java classes, except that they come with a set of very convenient features. The case class signature is also the constructor signature, and accessors to the fields are predefined for us. Instantiating and using case classes is much easier than with vanilla Scala classes. We will see this in action in a minute.

We will read data from a file containing lines like:
``` 
ASTE,2011-12-06,33.93
ASTE,2012-03-14,36.84
```

where the `(symbol, date, price)` fields of the `Quote` case class are clearly identified.

We start with a download of the files we prepared for you:

In [ ]:
if (!new java.io.File(s"${dataDir}/closes.csv").exists) {
  val source = scala.io.Source.fromURL(s"https://s3-eu-west-1.amazonaws.com/spark-notebook-data/closes.csv")
  val f = new java.io.FileWriter(new java.io.File(s"${dataDir}/closes.csv"), false)
  source.foreach(f.append(_))
  f.close
}

Then we will read the file and parse it:

In [ ]:
val closes:RDD[Quote] = sparkContext.textFile(s"${dataDir}/closes.csv")
                                   .map(_.split(",").toList)
                                   .map{ case List(s, d, p) => Quote(s, d, p.toDouble)}

Our resulting `RDD` is of type `RDD[Quote]` as expected.

We will key the stock quotes by date, so that they can then be grouped per day:

In [ ]:
val byDate:RDD[(String, Quote)] = closes.keyBy(_.date)

You see how we call the accessor function on the `date` field from the `Quote` case class.

The `keyBy` function takes another function that tells how to extract a key from each element; each element is thus turned into a tuple where the first element is the extracted key and the second element is the original element.

> Note that these objects are used as immutable values: changing a value means creating a new instance.
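The features of case classes mentioned so far (construction without `new`, predefined accessors, immutability) can be sketched in a few lines. The class `Tick` below is hypothetical and deliberately named differently from `Quote` to avoid clashing with the definition used in this chapter.

```scala
// Hypothetical class, named differently to avoid clashing with `Quote`.
case class Tick(symbol: String, date: String, price: Double)

// No `new` needed: the companion `apply` method builds the instance.
val tick = Tick("ASTE", "2011-12-06", 33.93)

// `copy` creates a new instance; the original is left untouched.
val repriced = tick.copy(price = 34.50)

// Structural equality and pattern matching come for free.
val sameAsTick = Tick("ASTE", "2011-12-06", 33.93)
val description = repriced match {
  case Tick(sym, day, _) => s"$sym on $day"
}
```

After the call to `copy`, `tick.price` is still 33.93: the "change" only exists in the new `repriced` instance.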

Now we can compute the minimum stock price per date. To do this we need a somewhat more sophisticated function. Without going into the details yet, the idea is to work with the symbols and prices for a given date and select the lowest price:

In [ ]:
def minByDate = byDate.groupByKey.mapValues(quotes => quotes.minBy(_.price))

In [ ]:
// `def` forces Spark to recompute at each call; otherwise it is smart enough to reuse previous RDDs
def minByDate = byDate.combineByKey[(String, Double)](
  (x:Quote) => (x.symbol, x.price),
  (d:(String, Double), l:Quote) =>
    if (d._2 < l.price) d else (l.symbol, l.price),
  (d1:(String, Double), d2:(String, Double)) => if (d1._2 < d2._2) d1 else d2
)

`minByDate` is defined with `def` and without an argument list, so the computation is executed at each call.

It returns an `RDD` of tuples with the date (`String`) as key and the `(symbol, price)` pair as value.
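The three functions passed to `combineByKey` are easier to follow when their roles are simulated on local data. The sketch below is an illustration under simplifying assumptions, not Spark's implementation: two inner lists stand in for two partitions, and the names `Q`, `Min`, `partitions`, `perPartition` and `minByDateLocal` are hypothetical.

```scala
// Hypothetical class and data; the two inner lists stand in for two partitions.
case class Q(symbol: String, date: String, price: Double)
type Min = (String, Double)

val partitions: List[List[Q]] = List(
  List(Q("A", "d1", 10.0), Q("B", "d1", 8.0), Q("A", "d2", 11.0)),
  List(Q("C", "d1", 9.0), Q("B", "d2", 7.0))
)

// The same three roles as in the `combineByKey` call above.
val createCombiner: Q => Min = q => (q.symbol, q.price)
val mergeValue: (Min, Q) => Min =
  (m, q) => if (m._2 < q.price) m else (q.symbol, q.price)
val mergeCombiners: (Min, Min) => Min =
  (m1, m2) => if (m1._2 < m2._2) m1 else m2

// Step 1: inside each partition, fold the quotes of each date into one combiner.
val perPartition: List[Map[String, Min]] = partitions.map { part =>
  part.foldLeft(Map.empty[String, Min]) { (acc, q) =>
    val updated = acc.get(q.date) match {
      case Some(m) => mergeValue(m, q)    // key already seen in this partition
      case None    => createCombiner(q)   // first value for this key
    }
    acc + (q.date -> updated)
  }
}

// Step 2: merge the per-partition results key by key.
val minByDateLocal: Map[String, Min] =
  perPartition.flatten.groupBy(_._1).map { case (date, kvs) =>
    date -> kvs.map(_._2).reduce(mergeCombiners)
  }

// minByDateLocal: Map(d1 -> (B,8.0), d2 -> (B,7.0))
```

Only one small `(symbol, price)` pair per date and per partition has to be exchanged in step 2, whereas a `groupByKey` followed by `minBy` first gathers all the quotes of a date together.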

Now we can trigger the computation:

In [ ]:
<pre>{minByDate.take(2).toList.mkString("\n")}</pre>

This computation takes a bit of time. If we do this again:

In [ ]:
<pre>{minByDate.take(2).toList.mkString("\n")}</pre>

The computation time is roughly the same, because we need to read the data from disk and recompute the steps in the DAG each time.

The solution is to cache in memory the result of the computation stages.

We define here another version of the combiner to compute the maximum stock price per day. This time we store the result in a variable and explicitly cache it. The resulting RDD, once computed in the cluster will be stored in memory as much as possible.

In [ ]:
val maxByDate = byDate.groupByKey.mapValues(quotes => quotes.maxBy(_.price)).cache()

We trigger the computation, as before:

In [ ]:
<pre>{maxByDate.take(2).toList.mkString("\n")}</pre>

The computation time is not better, because this is the first computation. But now, if we look at the Spark UI under the `Storage` tab, we can see that the data is stored in memory.

We can now expect a much faster access to this result:

In [ ]:
<pre>{maxByDate.take(2).toList.mkString("\n")}</pre>

Indeed the results are obtained in a small fraction of the time.
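The difference between recomputing at each call and reusing a stored result has a loose counterpart in plain Scala: `def` versus `lazy val`. The sketch below is only an analogy for the behaviour observed above, since Spark caching works on distributed partitions; the names `runs`, `expensive`, `recomputed` and `kept` are hypothetical.

```scala
// Counts how many times the "expensive" computation runs.
var runs = 0
def expensive(): Int = { runs += 1; (1 to 1000).sum }

// `def`: the body is re-evaluated on every access, like the uncached `minByDate`.
def recomputed: Int = expensive()
val a1 = recomputed
val a2 = recomputed
val runsAfterDef = runs        // 2

// `lazy val`: evaluated on first access, then kept in memory,
// loosely comparable to an RDD marked with `cache()`.
lazy val kept: Int = expensive()
val b1 = kept
val b2 = kept
val runsAfterLazyVal = runs    // 3: only one extra run
```

As with `cache()`, the first access to `kept` still pays the full cost; only the following accesses are served from memory.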

## Takeaways

We have already covered quite a lot in this chapter. The idea was to show something real done with Scala and Spark, highlighting some of the important features of the language as well as some features of Spark and how it works. You don't need to feel comfortable with everything at this stage. All of this will be covered in detail throughout the book, but you already have a good overview of what Spark and Scala can do.