In [ ]:
val a = 1

### Spark Basics: DataFrames and DataSets

There are three different kinds of data modeling primitives that you can use in a Spark application to keep track of transparently distributed collections:  
* RDDs (low-level)  
* DataFrames (conceptually inspired by PyData / Pandas, available in Scala and Python)
* DataSets (compile-time type-safe DataFrames, available in Scala but not in Python)

To illustrate DataFrames and DataSets, we start by defining some input data by hand.

In [ ]:
val rawData: String =
  "a,1\nb,7\na,4\nb,3\na,8\nb,2\na,5\nb,4\na,7\na,9\nb,1"

In [ ]:
val dataAfterSplit = rawData.split("\n")

In [ ]:
dataAfterSplit(0)

You'll notice that each record consists of a text key and a numeric value separated by a comma.  
Next, we define a case class to represent this type of data record.

In [ ]:
case class DataRecord(key: String, value: Int)

We also define a function to parse our raw CSV records into type-safe **`DataRecord`**s.

In [ ]:
def parseIntoDataRecord(s: String): DataRecord = {
  val afterSplit = s.split(",")
  DataRecord( afterSplit(0), afterSplit(1).toInt )
}
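
The parser above assumes that every line is well formed: `afterSplit(1).toInt` throws on a missing or non-numeric value. Below is a minimal sketch of a more defensive variant, written in plain Scala without Spark. The names `SafeRecord` and `parseSafely` are illustrative assumptions and not part of the original notebook.

```scala
import scala.util.Try

// Hypothetical stand-in for DataRecord, redefined so the sketch is self-contained
case class SafeRecord(key: String, value: Int)

// Returns None instead of throwing when a line is malformed
def parseSafely(s: String): Option[SafeRecord] =
  s.split(",") match {
    case Array(k, v) => Try(v.trim.toInt).toOption.map(value => SafeRecord(k, value))
    case _           => None
  }

// One good line, one with a non-numeric value, one with a missing value
val sampleLines = Seq("a,1", "b,seven", "c")

// Malformed lines are dropped rather than failing the whole job
val parsedSafely: Seq[SafeRecord] =
  sampleLines.flatMap(line => parseSafely(line))

println(parsedSafely)
```

Whether to drop, log, or reject malformed lines depends on the use case; dropping is just the simplest choice for a sketch.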

Next, we create a **`DataFrame`** from our data and then derive a **`DataSet`** from it.

In [ ]:
val myFirstDataFrame =
  sparkSession.createDataFrame( dataAfterSplit.map( parseIntoDataRecord(_) ) )

In [ ]:
val myFirstDataSet = myFirstDataFrame.as[DataRecord]

The data is now distributed across the Spark cluster. Let's cache it and count the records.

In [ ]:
myFirstDataSet.cache

In [ ]:
myFirstDataSet.count

We can now work with our data using Spark SQL, which supports a large part of the SQL:2003 standard.  
All we have to do is attach a table name to our **`DataSet`**.

In [ ]:
myFirstDataSet.createOrReplaceTempView("raw_data")

println("Done.")

In [ ]:
sparkSession.sql("""SELECT key,
                           SUM(value) AS sum_value
                    FROM raw_data
                    GROUP BY key
                    ORDER BY key""")

In [ ]:
val aggregatedDataFrame =
  sparkSession.sql("""SELECT key,
                             SUM(value) AS sum_value
                      FROM raw_data
                      GROUP BY key
                      ORDER BY key""")
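
As a sanity check on the aggregation, the same `GROUP BY` / `SUM` / `ORDER BY` logic can be sketched with plain Scala collections, without Spark. The variable names below are illustrative only. For the hand-written input, the sums work out to 34 for `a` and 17 for `b`.

```scala
// Same hand-written input as in the notebook
val rawCsv = "a,1\nb,7\na,4\nb,3\na,8\nb,2\na,5\nb,4\na,7\na,9\nb,1"

// Parse each line into a (key, value) pair
val pairs: Seq[(String, Int)] =
  rawCsv.split("\n").toSeq.map { line =>
    val parts = line.split(",")
    (parts(0), parts(1).toInt)
  }

// GROUP BY key, SUM(value), ORDER BY key
val sumsByKey: Seq[(String, Int)] =
  pairs.groupBy(_._1)
       .map { case (key, kvs) => (key, kvs.map(_._2).sum) }
       .toSeq
       .sortBy(_._1)

println(sumsByKey)
```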

In [ ]:
val filteredDataFrame =
  sparkSession.sql("SELECT * FROM raw_data WHERE value < 5")

In [ ]:
filteredDataFrame

#### User-Defined Functions (UDFs)

User-Defined Functions (UDFs) are a Spark SQL feature for defining new column-based functions.  
UDFs are effective because they can be materialized anywhere in the cluster, so the function can run at the same physical location as the data. This makes operating on data with UDFs efficient.

In [ ]:
// First we define a Scala function
def upCaseSqBr(inputStr: String): String = s"[${inputStr.toUpperCase}]"

println( upCaseSqBr("abc") )

In [ ]:
// Then we register its alias in Spark SQL
sparkSession.udf.register[String, String]( "upCaseSqBrUDF", upCaseSqBr(_) )

In [ ]:
val dataFrameAfterUDF =
  sparkSession.sql("""SELECT upCaseSqBrUDF(key) AS up_case_sq_br,
                             value
                      FROM raw_data""")

In [ ]:
dataFrameAfterUDF.cache

In [ ]:
dataFrameAfterUDF.count

In [ ]:
dataFrameAfterUDF.createOrReplaceTempView("data_after_udf")

In [ ]:
val stmt = """
  SELECT ROW_NUMBER() OVER(PARTITION BY up_case_sq_br
                           ORDER BY value DESC) as rownum, *
  FROM data_after_udf"""

In [ ]:
sparkSession.sql(stmt)

In [ ]:
sparkSession.sql(stmt)
            .where("rownum = 1")
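
To make the window query concrete, here is a plain-Scala sketch (no Spark) of what `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY value DESC)` followed by `rownum = 1` computes. The rows are the notebook's hand-written input with the keys already wrapped as the UDF would wrap them; the variable names are illustrative.

```scala
// (up_case_sq_br, value) rows, as produced from the hand-written input
val udfRows: Seq[(String, Int)] = Seq(
  "[A]" -> 1, "[B]" -> 7, "[A]" -> 4, "[B]" -> 3, "[A]" -> 8, "[B]" -> 2,
  "[A]" -> 5, "[B]" -> 4, "[A]" -> 7, "[A]" -> 9, "[B]" -> 1)

// PARTITION BY key, ORDER BY value DESC, then number the rows starting at 1
val numbered: Seq[(Int, String, Int)] =
  udfRows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (key, group) =>
    group.map(_._2)
         .sortBy(v => -v)
         .zipWithIndex
         .map { case (value, idx) => (idx + 1, key, value) }
  }

// WHERE rownum = 1 keeps the largest value of each partition
val topPerKey: Seq[(Int, String, Int)] = numbered.filter(_._1 == 1)

println(topPerKey)
```

In other words, the `rownum = 1` filter is a way of selecting the row with the maximum `value` per key.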

Returning to UDFs, a more realistic example is one that generates a UUID.

In [ ]:
import java.util.UUID

sparkSession.udf.register[String]("uuid", () => UUID.randomUUID.toString)

In [ ]:
val dataFrameWithUUID =
  sparkSession.sql("SELECT uuid() AS uuid, * FROM raw_data")

In [ ]:
dataFrameWithUUID.cache

#### Joining Two DataSets

In [ ]:
:sh head -5 /opt/SparkDatasets/geography/cities.csv

In [ ]:
:sh cat /opt/SparkDatasets/geography/cities_header.csv

In [ ]:
case class City (city_id: Long,
                 country_id: Long,
                 city_name: String)

In [ ]:
:sh head -5 /opt/SparkDatasets/geography/countries.csv

In [ ]:
:sh cat /opt/SparkDatasets/geography/countries_header.csv

In [ ]:
case class Country (country_id: Long,
                    continent_id: Long,
                    country_name: String)

In [ ]:
import org.apache.spark.sql.Encoders

val citySchema = Encoders.product[City].schema
val countrySchema = Encoders.product[Country].schema

In [ ]:
val citiesDS =
  sparkSession.read
              .schema(citySchema)
              .csv("/opt/SparkDatasets/geography/cities.csv")
              .as[City]

citiesDS.cache

citiesDS.count

citiesDS.createOrReplaceTempView("cities")

In [ ]:
val countriesDS =
  sparkSession.read
              .schema(countrySchema)
              .csv("/opt/SparkDatasets/geography/countries.csv")
              .as[Country]

countriesDS.cache

countriesDS.count

countriesDS.createOrReplaceTempView("countries")

In [ ]:
sparkSession.sql("""SELECT countries.country_name,
                           cities.city_name
                    FROM cities INNER JOIN countries
                    ON cities.country_id = countries.country_id""")
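
The semantics of the `INNER JOIN` can be sketched with plain Scala collections, without Spark. The case classes `CityRow` and `CountryRow` and all sample rows below are made up for illustration; they are not taken from the CSV files.

```scala
// Hypothetical stand-ins for the notebook's City and Country case classes
case class CityRow(cityId: Long, countryId: Long, cityName: String)
case class CountryRow(countryId: Long, continentId: Long, countryName: String)

// Made-up sample rows
val sampleCities = Seq(
  CityRow(1L, 10L, "Lyon"),
  CityRow(2L, 20L, "Osaka"),
  CityRow(3L, 99L, "Nowhere"))   // country_id 99 has no matching country

val sampleCountries = Seq(
  CountryRow(10L, 1L, "France"),
  CountryRow(20L, 2L, "Japan"))

// Index countries by id so each city can look up its match
val countryById: Map[Long, CountryRow] =
  sampleCountries.map(c => c.countryId -> c).toMap

// Inner join: cities without a matching country are dropped
val joined: Seq[(String, String)] =
  for {
    city    <- sampleCities
    country <- countryById.get(city.countryId).toList
  } yield (country.countryName, city.cityName)

println(joined)
```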

#### Ingesting well-behaved JSON

Given the population data below,  
and assuming that zip codes that are numerically close are also geographically close,  
let's compute the top 10 most populous "pockets" of zip codes, along with the U.S. state each one is in.  
(We arbitrarily define a "pocket" as the set of zip codes that share the same first 4 digits.)

In [ ]:
:sh head -5 /opt/SparkDatasets/zipcodes/zips.json

In [ ]:
sparkSession.read
            .json("/opt/SparkDatasets/zipcodes/zips.json")
            .printSchema

In [ ]:
import org.apache.spark.sql.Encoders

// Define the case class
case class PopulationData(city: String,
                          loc: Array[Double],
                          pop: Long,
                          state: String,
                          zip_code: String)

// Define the schema
val pDataSchema = Encoders.product[PopulationData].schema

In [ ]:
// Ingest the data
val pDataDS =
  sparkSession.read
              .schema(pDataSchema)
              .json("/opt/SparkDatasets/zipcodes/zips.json")
              .as[PopulationData]

pDataDS.cache

// Define the temp view
pDataDS.createOrReplaceTempView("population_data")

pDataDS.count

In [ ]:
// Define the UDF
sparkSession.udf.register[String, String]("first_four", _.take(4) )

// Write the SQL statement
sparkSession.sql("""SELECT first_four(zip_code) AS zc_first_four,
                           state,
                           SUM(pop) AS zone_population
                    FROM population_data
                    GROUP BY zc_first_four, state
                    ORDER BY zone_population DESC
                    LIMIT 10""")
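
The "pocket" aggregation can also be sketched in plain Scala, without Spark, which may help when checking the SQL on a small sample. `ZipRow` and the rows below are made-up illustrations, not values from `zips.json`.

```scala
// Hypothetical, trimmed-down stand-in for PopulationData
case class ZipRow(zipCode: String, state: String, pop: Long)

// Made-up sample rows
val sampleZips = Seq(
  ZipRow("10001", "NY", 100L),
  ZipRow("10002", "NY", 250L),
  ZipRow("10011", "NY", 300L),
  ZipRow("94110", "CA", 200L))

// GROUP BY first four digits and state, SUM(pop), ORDER BY total DESC, keep 10
val topPockets: Seq[((String, String), Long)] =
  sampleZips.groupBy(z => (z.zipCode.take(4), z.state))
            .map { case (pocket, zs) => (pocket, zs.map(_.pop).sum) }
            .toSeq
            .sortBy { case (_, total) => -total }
            .take(10)

println(topPockets)
```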

#### Ingesting real-world JSON

The dataset below contains DC/OS specific download counts for various open-source big data packages.  
Our goal is to extract the top 5 most downloaded packages.  
We'd like to see the package name, the month and the download count.  
First of all, here is the human-friendly representation of the JSON records we are going to ingest:

In [ ]:
:sh cat /opt/SparkDatasets/dcos/universe.json

And here is the exact same data, but in newline-separated JSON format:

In [ ]:
:sh cat /opt/SparkDatasets/dcos/universe.jsonline

Our first instinct would be to do as we did before.  
Let's see if that works.

In [ ]:
sparkSession.read
            .json("/opt/SparkDatasets/dcos/universe.jsonline")
            .printSchema

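Attempting to use the same methodology as before leads to a dead end: each record is keyed by its own package name, so the inferred schema follows the package names instead of giving us uniform columns to query.  
Therefore something better is needed.  
Let's start by breaking down the first record of the dataset, and after that we'll generalize to the entire dataset.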

In [ ]:
// First record only
val alluxio =
  sparkSession.read
              .text("/opt/SparkDatasets/dcos/universe.jsonline")
              .first

In [ ]:
alluxio.mkString

We parse it with `json4s`:

In [ ]:
import org.json4s.native.JsonMethods.parse

parse(alluxio.mkString)

Since we only have to deal with one key-value pair, let's treat it as a tuple:

In [ ]:
val (pkgName, pkgContents) = {
  import org.json4s.DefaultFormats

  implicit val formats = DefaultFormats

  parse(alluxio.mkString).extractOpt[(String,Map[String,Any])]
                         .getOrElse[(String,Map[String,Any])]( ("",Map()) )
}

In [ ]:
case class PackageDownloadCount(packageName: String,
                                month: String,
                                downloadCount: Long)

In [ ]:
pkgContents("downloads")
  .asInstanceOf[Map[String,BigInt]]
  .view
  .map { case (month, downloadCount) =>
           PackageDownloadCount(pkgName,
                                month,
                                downloadCount.longValue) }
  .toSeq

We define our parse function by simply copying and pasting the exploratory code we've written above.

In [ ]:
import org.apache.spark.sql.Row
import org.json4s.native.JsonMethods.parse

def parseOneJSONRecord(inputRow: Row): TraversableOnce[PackageDownloadCount] = {
  import org.json4s.DefaultFormats

  implicit val formats = DefaultFormats
  
  val (pkgName, pkgContents) =
    parse(inputRow.mkString).extractOpt[(String,Map[String,Any])]
                            .getOrElse[(String,Map[String,Any])]( ("",Map()) )
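  // A record that failed to parse, or that has no "downloads" key, yields no rows instead of throwing
  pkgContents.getOrElse("downloads", Map.empty)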
    .asInstanceOf[Map[String,BigInt]]
    .view
    .map { case (month, downloadCount) =>
             PackageDownloadCount(pkgName,
                                  month,
                                  downloadCount.longValue) }
    .toSeq
}
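
The flattening step can be checked in isolation, without Spark or json4s, by feeding it a hand-built map shaped like the extracted record. This is only a sketch: `DownloadRow`, `flattenDownloads`, the package name, and the numbers are all made up, and the map merely mimics what the extraction above is assumed to return.

```scala
// Hypothetical stand-in for PackageDownloadCount
case class DownloadRow(packageName: String, month: String, downloadCount: Long)

// Hand-built stand-in for one extracted record (made-up numbers)
val samplePkgName = "example-package"
val samplePkgContents: Map[String, Any] =
  Map("downloads" -> Map("2016-01" -> BigInt(12), "2016-02" -> BigInt(30)))

// Turns the nested month -> count map into one row per month.
// A missing or unexpectedly shaped "downloads" entry yields no rows.
def flattenDownloads(name: String, contents: Map[String, Any]): Seq[DownloadRow] =
  contents.get("downloads") match {
    case Some(m: Map[_, _]) =>
      m.toSeq.collect { case (month: String, count: BigInt) =>
        DownloadRow(name, month, count.toLong)
      }
    case _ => Seq.empty
  }

val flattened: Seq[DownloadRow] =
  flattenDownloads(samplePkgName, samplePkgContents).sortBy(_.month)
val noRows: Seq[DownloadRow] = flattenDownloads("", Map.empty)

println(flattened)
```

Pattern matching on the value types avoids the unchecked `asInstanceOf` cast, at the cost of silently skipping entries that are not `String` to `BigInt` pairs.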

In [ ]:
val downloadCountsDS =
  sparkSession.read
              .text("/opt/SparkDatasets/dcos/universe.jsonline")
              .flatMap( parseOneJSONRecord(_) )
              .as[PackageDownloadCount]

In [ ]:
downloadCountsDS.cache

In [ ]:
downloadCountsDS.count

In [ ]:
downloadCountsDS.orderBy('downloadCount.desc).limit(5)