# Spark basics

The entry point to a Spark application is the [`SparkSession`](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession).

Spark Notebook creates a `SparkSession` instance for us, available under the name `sparkSession`. By unofficial convention, Spark users name this instance `spark`.

In [ ]:
val spark = sparkSession

In [ ]:
// Provides Encoders for implicit conversions to Datasets
import spark.implicits._

The basic abstraction in Spark for representing distributed data is the [`RDD`](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD) (Resilient Distributed Dataset). An RDD represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are lazy: transformations only describe a computation, which runs when an action asks for a result.
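
The laziness of RDDs can be illustrated with a minimal sketch. It assumes a local Spark installation; the name `localSpark` and the sample numbers are hypothetical and are not part of the demo that follows.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session name; getOrCreate() reuses a session that is already running
val localSpark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-laziness-sketch")
  .getOrCreate()

// Transformations (map, filter) only describe the computation; nothing runs yet
val numbers = localSpark.sparkContext.parallelize(1 to 10)
val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

// An action (sum) triggers the actual distributed execution
val total = evenSquares.sum()
```

Until `sum()` is called, Spark has only recorded the lineage of `evenSquares`; no element has been squared or filtered.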

A [`Dataset`](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset) is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. We will use `Dataset`s in our demos.

Each `Dataset` also has an untyped view called a `DataFrame`, which is a `Dataset[Row]`. 

## Spark SQL

We are going to show basic capabilities of Spark by using a tiny use case. Don't hesitate to edit the input data as per your needs, in order to simulate different scenarios!

## Use case: Housing market
For the housing market, we are interested in the prices of the houses, their sizes, and their locations.
To keep things simple, all prices are given in euros, and the location is represented by the zip code and the country.
We'll be comparing a few countries in Europe.

In [ ]:
case class Home(m2: Int, price: Double, zipCode: String, country: String)

val data = Seq(
  Home(100, 561000, "1024AA", "Netherlands"),
  Home(100, 525000, "3011CD", "Netherlands"),
  Home(100, 916000, "75001", "France"),
  Home(100, 598000, "69002", "France"),
  Home(100, 354000, "28014", "Spain"),
  Home(100, 200000, "36400", "Spain"),
  Home(100, 1180000, "EC1A", "UK"),
  Home(100, 263000, "CF10", "UK"),
  Home(100, 336600, "1000", "Belgium"),
  Home(100, 299000, "9000", "Belgium"),
  Home(100, 820000, "2450", "Luxembourg"),
  Home(100, 570000, "4238", "Luxembourg"),
  Home(100, 375000, "50667", "Germany"),
  Home(100, 218000, "10117", "Germany")
)

## Create a Spark [Dataset](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset)

We can create a `Dataset` from a sequence as shown below.
Note that `toDS()` is provided by `spark.implicits._`, so that import must be in scope!

In [ ]:
val homesDS = data.toDS()
// val homesDS = spark.createDataset(data)

Although we already know the schema in this case, we often ingest data without knowing its schema beforehand.

The Dataset API lets us inspect the schema of the data, and also show a sample of rows to quickly get a feel for the content.

In [ ]:
homesDS.printSchema()

In [ ]:
homesDS.show(numRows = 10, truncate = false)

Conversion to `DataFrame` is direct, with the option of renaming columns at the same time (if needed).

In [ ]:
val homesDF = homesDS.toDF("size", "priceInEuros", "postalCode", "country")

Conversion back to `Dataset` is also direct, provided the column names match the field names of the case class!

In [ ]:
// val newHomesDS = homesDF.as[Home] // Fails: the renamed columns no longer match the fields of Home
val newHomesDS = homesDS.toDF().as[Home]
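
If only the renamed `DataFrame` is at hand, one possible way back is to restore the original field names first. The sketch below is self-contained, so it repeats the `Home` definition and builds its own one-row sample; `localSpark`, `renamedDF` and `restoredDS` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.SparkSession

case class Home(m2: Int, price: Double, zipCode: String, country: String)

// Hypothetical session name; getOrCreate() reuses a session that is already running
val localSpark = SparkSession.builder()
  .master("local[*]")
  .appName("dataframe-to-dataset-sketch")
  .getOrCreate()
import localSpark.implicits._

// A DataFrame whose column names differ from the fields of Home
val renamedDF = Seq(Home(100, 561000, "1024AA", "Netherlands")).toDS()
  .toDF("size", "priceInEuros", "postalCode", "country")

// toDF renames columns by position; afterwards the names match the case class again
val restoredDS = renamedDF.toDF("m2", "price", "zipCode", "country").as[Home]
val firstHome = restoredDS.head()
```

Because `toDF` renames positionally, this only works when the column order still matches the order of the case class fields.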

Rows in a `DataFrame` also have a schema.

In [ ]:
homesDF.printSchema()

In [ ]:
homesDF.show(numRows = 10, truncate = false)

## SQL vs Dataset

Spark SQL provides different approaches to run queries on data.

We can write the query as SQL text, or we can express it with the Dataset API. Below, we compare how each SQL query translates into the API.

But, first things first! SQL needs to fetch data from a table, right? So we will need to simulate that in our code...

`createOrReplaceTempView`: Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the `SparkSession` that was used to create this Dataset.

In [ ]:
homesDS.createOrReplaceTempView("homes")

And now we are all ready to run our queries...

## What's our dataset size?

In [ ]:
val allHomes = spark.sql("SELECT COUNT(*) FROM homes")
allHomes.show()

In [ ]:
homesDS.count()

In [ ]:
// Setting up some crazy price for a house, so we can include it in our queries below. Is it not crazy enough for you? You're lucky...
val crazyPrice = 300000

## Which countries have crazy prices in their Housing market?

In [ ]:
val allCrazyCountries = spark.sql(s"SELECT DISTINCT country FROM homes WHERE price > $crazyPrice")
allCrazyCountries.show()

In [ ]:
// Filter on price first, while that column is still part of the data, then project the country
homesDS.filter($"price" > crazyPrice).select($"country").distinct()
// homesDS.where($"price" > crazyPrice).select($"country").distinct()

## Order data

In [ ]:
// We can get the data ordered by different columns.
// Be careful in production! A global ORDER BY has to shuffle the whole dataset; show() then prints only the first 20 rows on the driver
val allSortedData = spark.sql("SELECT * FROM homes ORDER BY country ASC, price DESC")
allSortedData.show()

There are different ways of accessing columns. The `$"columnName"` syntax needs `import spark.implicits._` in scope!

In [ ]:
homesDS.orderBy($"country".asc, $"price".desc)
// homesDS.orderBy(col("country").asc, col("price").desc) // needs: import org.apache.spark.sql.functions.col
// homesDS.orderBy(homesDS("country").asc, homesDS("price").desc)

## Aggregations

In [ ]:
// Aggregations are also possible

val avgPricePerCountry = spark.sql("SELECT country, AVG(price) FROM homes GROUP BY country")
avgPricePerCountry.show()

In [ ]:
homesDS.groupBy($"country").avg("price")
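
When several aggregates or explicit column names are needed, `agg` together with the functions in `org.apache.spark.sql.functions` is one option. The sketch below is self-contained: it repeats the `Home` definition and uses a four-row sample taken from the data above; `localSpark`, `sample`, `avgPrice` and `maxPrice` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

case class Home(m2: Int, price: Double, zipCode: String, country: String)

// Hypothetical session name; getOrCreate() reuses a session that is already running
val localSpark = SparkSession.builder()
  .master("local[*]")
  .appName("aggregation-sketch")
  .getOrCreate()
import localSpark.implicits._

val sample = Seq(
  Home(100, 561000, "1024AA", "Netherlands"),
  Home(100, 525000, "3011CD", "Netherlands"),
  Home(100, 354000, "28014", "Spain"),
  Home(100, 200000, "36400", "Spain")
).toDS()

// Several aggregates per group, each with an explicit column name
val stats = sample.groupBy($"country")
  .agg(avg($"price").as("avgPrice"), max($"price").as("maxPrice"))
  .orderBy($"country")

// collect() brings the (small) result to the driver as an Array[Row]
val rows = stats.collect()
```

Naming the aggregated columns with `as` avoids generated names such as `avg(price)` in later selections.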

## Which locations in the UK are over the crazy price?

In [ ]:
// Dollar ($) is used for string interpolation here...
// Don't confuse it with $"columnName", which accesses a column in the Dataset

val crazyUK = spark.sql(s"SELECT zipCode FROM homes WHERE country = 'UK' AND price > $crazyPrice")
crazyUK.show()

In [ ]:
// Filter first, while country and price are still part of the data, then project the zip code
homesDS.filter($"country" === "UK" and $"price" > crazyPrice).select($"zipCode")

## TODO: Top 3 most expensive countries

## TODO: Percentage of houses in France over the crazy price

## TODO: Percentage of houses per country in the dataset over the crazy price

## TODO: Top 3 cheapest locations (country + zipCode)