# Preamble

In [None]:
import $file.sparksession
import sparksession._
import spark.implicits._
import org.apache.spark._
import org.apache.spark.sql.{functions => func, _}
import org.apache.spark.sql.types._, func._

# On `DataFrame`s

We can create datasets from external data sources in different formats, e.g. JSON, Parquet, CSV, etc.

In [None]:
val people: DataFrame = spark.read.json("data/people.json")
val prueba: DataFrame = spark.read.csv("data/economic-damage-from-natural-disasters.csv")

In [None]:
val temperature : DataFrame = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("data/GlobalLandTemperaturesByCountry.csv")
val disaster = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("data/number-of-natural-disaster-events.csv")

In [None]:
disaster.show
//temperature.limit(20).show

In [None]:
val disas: DataFrame = disaster.select($"Entity", $"Year", $"Number of reported natural disasters (reported disasters)" as "Number")

In [None]:
disas.show

In [None]:
disas.filter($"Entity" =!= "All natural disasters")
     .groupBy($"Year")
     .sum("Number")
     .orderBy($"Year")
     .show

disas.filter($"Entity" === "All natural disasters")
     .show

disas.filter($"Entity" =!= "All natural disasters")
     .groupBy($"Year", $"Entity")
     .sum("Number")
     .orderBy($"Year")
     .show

In [None]:
// A comparison with null is never true, so this filter drops rows without a temperature reading
temperature.filter($"AverageTemperature" > -1000000)
            .groupBy($"dt")
            .avg("AverageTemperature")
            .orderBy($"dt")
            .limit(20)
            .show

Note that we created a `DataFrame`, not a `Dataset`. Dataframes are like datasets, i.e. programs to generate distributed data sets, but *dynamically typed*. This means that, at compile time, Scala only knows that a dataframe consists of `Row`s.

In [None]:
people.collect

In [None]:
temperature.limit(100).collect
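
To make the dynamic typing of `Row` concrete, here is a minimal sketch. The session setup and the two-row sample data are assumptions made only to keep it self-contained; in this notebook the existing `spark` session and `people` dataframe would play the same role.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import scala.util.Try

// Local session, only here to keep the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("rows-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for data/people.json
val sample: DataFrame = Seq(("Ann", 31L), ("Bob", 25L)).toDF("name", "age")

// collect yields Array[Row]; the compiler knows nothing about the fields
val rows: Array[Row] = sample.collect()

// Fields are looked up by name, and we must state the expected type ourselves
val names: Array[String] = rows.map(_.getAs[String]("name"))

// A wrong type parameter still compiles; the mistake only surfaces when the value is used
val wrongType = Try(rows.map(_.getAs[String]("age").length))
println(s"names = ${names.mkString(", ")}, wrong type fails at run time: ${wrongType.isFailure}")
```

The compiler accepts both lookups because `getAs` trusts whatever type we supply; only the schema carried by the rows at run time can tell them apart.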

In fact, `DataFrame` is defined as a type alias of `Dataset[Row]`:

In [None]:
val peopleDs: Dataset[Row] = people

But the type of the information to be processed is there! 

In [None]:
people.schema
people.printSchema

and we can convert a dataframe into a dataset: 

In [None]:
// This is needed so that Spark can build encoders for case classes defined in the notebook
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

case class Person(name: String, age: Long)

In [None]:
val peopleDs: Dataset[Person] = people.as[Person]

In [None]:
peopleDs.show
people.show

# Untyped transformations

The `Dataset` API includes a section on _untyped transformations_. These are transformations that are not defined over the Scala types but over the inner Spark SQL types (i.e. `StructType`s). More exactly, these could be named *dynamically typed transformations*.

These transformations correspond closely to their SQL counterparts: `SELECT`, `WHERE`, `GROUP BY`, `FROM`, etc.

### The `select` transformation

For instance, the equivalent to the `map` typed transformation is `select`: 

In [None]:
val ds: Dataset[String] = peopleDs.map(_.name)
ds.collect
ds.show
ds.explain

In [None]:
val df: DataFrame = 
    spark.read.json("data/people.json").select($"name")
df.collect
df.show
df.schema

Note that we lost the column label (`name`) in the case of the dataset transformation: the resulting column is simply called `value`. This does not happen with `select`. Moreover, we have more control over the resulting schema:

In [None]:
peopleDs.map(p => (p.name, p.age + 1, p.name.substring(0,3)))
    .show

In [None]:
people.select($"name", $"age" + 1 as "age", $"name".substr(0,3) as "prefix")
    .show

// By default, each column is named after the expression that computes it
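
The difference in column labels can be checked directly through `columns`. This is a minimal sketch under assumed sample data; tuples stand in for the `Person` case class so that no notebook-specific encoder setup is needed.

```scala
import org.apache.spark.sql.SparkSession

// Local session, only here to keep the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("labels-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for data/people.json
val sample = Seq(("Ann", 31L), ("Bob", 25L)).toDF("name", "age")

// Typed route: the lambda is opaque to Spark, so the result gets generic tuple labels
val mapped = sample.as[(String, Long)].map { case (name, age) => (name, age + 1) }
println(mapped.columns.mkString(", "))

// Untyped route: labels come from the column expressions and their aliases
val selected = sample.select($"name", ($"age" + 1) as "age")
println(selected.columns.mkString(", "))
```

The first line printed lists `_1, _2`, the second `name, age`.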

In [None]:
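// prueba was read without a header, so its columns get default names (_c0, _c1, ...)
prueba.select($"_c2", $"_c3").show
// Rename all the columns at once...
val d: DataFrame = prueba.toDF("type", "code", "year", "money")
// ...or a single one (a no-op if the existing name is not found)
prueba.withColumnRenamed("existingName", "newName")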

In [None]:
d.select($"type").show

The [org.apache.spark.sql.functions](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$) object contains dozens of column operators.

Note that the _untyped_ (or, more properly, _dynamically typed_) character of these transformations means that the Scala compiler won't complain if we choose a non-existent column:

In [None]:
lazy val df: DataFrame = spark.read.json("data/people.json").select($"nam")

The error will be shown at runtime: 

In [None]:
df

By contrast, the error in the dataset transformation shows up at compile time:

In [None]:
peopleDs.map(_.nam)
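
The run-time nature of the dataframe error can be made explicit by capturing it. The sketch below assumes a local session and a one-row sample; it relies on the fact that, with the classic (non-Connect) API, Spark analyses a dataframe transformation as soon as it is built, so the failure appears before any action is run.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Success, Try}

// Local session, only here to keep the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("errors-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for data/people.json
val sample = Seq(("Ann", 31L)).toDF("name", "age")

// The misspelled column compiles; Spark rejects it while analysing the query
val attempt = Try(sample.select($"nam"))

attempt match {
  case Failure(e: AnalysisException) => println(s"Rejected by the analyser: ${e.getMessage}")
  case Failure(other)                => throw other
  case Success(_)                    => println("Unexpectedly resolved")
}
```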

### The `filter` transformation

This is the equivalent of the typed `filter` transformation:

In [None]:
people.filter($"age" > 2001)
    .show

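If the predicate is ill-typed, e.g. it compares a string column with a number, we may not even get a run-time exception. Spark may silently coerce the operands, in which case the filter simply matches nothing: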

In [None]:
def df: DataFrame = 
    people.filter($"name" > 2001)

In [None]:
df.show

### The `groupBy` transformation

In [None]:
val students: DataFrame = spark.read.json("data/students.json")

In [None]:
students.groupBy($"degree").count.show

In [None]:
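students.groupByKey(_.getAs[String]("degree")) // mapGroups requires the typed groupByKey
    .mapGroups((degree, rows) => (degree, rows.map(_.getAs[String]("name")).mkString(", ")))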
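
Per-group values can also be gathered without leaving the untyped API, by following `groupBy` with `agg`. This is a minimal sketch; the three sample students and the output column names (`students`, `names`) are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, count}

// Local session, only here to keep the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("groupby-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for data/students.json
val sampleStudents = Seq(("Ann", "Math"), ("Bob", "Physics"), ("Cid", "Math")).toDF("name", "degree")

// One output row per degree: how many students it has and who they are
val byDegree = sampleStudents
  .groupBy($"degree")
  .agg(count($"name") as "students", collect_list($"name") as "names")
  .orderBy($"degree")

byDegree.show(false)
```

Because the aggregation is expressed with column functions, Spark sees the whole computation and can plan it as a single aggregate.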

### `Join` transformations

We already discussed joins, but we didn't mention that the resulting type of a join is a dataframe, not a dataset: 

In [None]:
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

case class Student(name: String, degree: String)

In [None]:
val joined: DataFrame = peopleDs.join(students.as[Student], "name")

In [None]:
joined.show
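
When the static types matter, `joinWith` is an alternative that returns a dataset of pairs instead of a dataframe. The sketch below uses assumed sample data, with tuples standing in for the `Person` and `Student` case classes.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Local session, only here to keep the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data; tuple columns are matched by position
val samplePeople: Dataset[(String, Long)] =
  Seq(("Ann", 31L), ("Bob", 25L)).toDF("name", "age").as[(String, Long)]
val sampleStudents: Dataset[(String, String)] =
  Seq(("Ann", "Math")).toDF("name", "degree").as[(String, String)]

// join flattens both sides into one untyped row
val untyped: DataFrame = samplePeople.join(sampleStudents, "name")

// joinWith keeps each side as a typed value and returns a Dataset of pairs
val typed: Dataset[((String, Long), (String, String))] =
  samplePeople.joinWith(sampleStudents, samplePeople("name") === sampleStudents("name"))

untyped.printSchema()
typed.printSchema()
```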

# The problems of `Dataset`s

Datasets are nice because they are type safe, but, unfortunately, they are less efficient than dataframes in several respects. This is best shown by reading from Parquet source files.

Parquet is a _columnar_ format, which means that it physically stores data by column, allowing us to read a particular column without reading entire rows.

In [None]:
people.write.mode("overwrite").parquet("data/people.parquet")

In [None]:
spark.read.parquet("data/people.parquet").schema

### The `ReadSchema` optimization

Let's create a program that simply reads the _name_ column of the people dataset:

In [None]:
val ds: Dataset[String] = 
    spark.read.parquet("data/people.parquet").as[Person]
        .map(_.name)

which works as intended: 

In [None]:
ds.show

We have a problem, however: 

In [None]:
ds.explain

As we can see, the plan includes the directive `ReadSchema: struct<age:bigint,name:string>`, which means that the query scans every column of the Parquet file. But we just want to read the names! We can create an optimal program using dataframes:

In [None]:
val df: DataFrame = 
    spark.read.parquet("data/people.parquet").select($"name")

which works similarly: 

In [None]:
df.show

but more efficiently (note the value of the `ReadSchema` directive):

In [None]:
df.explain
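
A possible middle ground, sketched here under assumed sample data and a temporary file path, is to select a typed column: `select` with a `TypedColumn` returns a `Dataset[String]`, yet the projection remains a column expression that the optimizer can inspect.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

// Local session, only here to keep the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("readschema-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample file written to a temporary directory
val path = Files.createTempDirectory("readschema-sketch").resolve("people.parquet").toString
Seq(("Ann", 31L), ("Bob", 25L)).toDF("name", "age").write.parquet(path)

// A typed column keeps the projection visible to the optimizer
// and still yields a Dataset[String] rather than a DataFrame
val names: Dataset[String] = spark.read.parquet(path).select($"name".as[String])

// The ReadSchema entry of the scan is expected to mention only the name column
names.explain()
```

The column name is still checked only at run time, so this recovers the result type but not full compile-time safety.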

We can empirically check that it actually works using the Spark UI. First, we create a parquet file with enough rows and several columns:

In [None]:
spark.range(0, 1000000)
    .select($"id" as "_1", lit(1) as "_2")
    .write.mode("overwrite").parquet("data/test")

Now, we read the second column using both datasets and dataframes, and check the Spark UI for the _Input Size_ field.

In [None]:
val test = spark.read.parquet("data/test")
test.as[Tuple2[Long, Int]].map(_._2).collect

Using dataframes, the input size is much lower, since we only read the second column:

In [None]:
test.select($"_2").collect

### The `PushedFilters` optimization

Let's consider the following equivalent dataset and dataframe programs: 

In [None]:
val ds: Dataset[(Long, Int)] = 
    test.as[(Long, Int)]
        .filter(_._1 >= 999995)

val df: DataFrame = 
    test
        .filter($"_1" >= 999995)

Functionally, they are equivalent, but their performance differs significantly:

In [None]:
df.collect
ds.collect

The explanation of this difference lies in another optimization applied by the Spark SQL compiler: the so-called push-down filter optimization. In the previous `ReadSchema` optimization, we skipped certain columns of the dataset; now, we skip rows and read only the ones we are interested in (those that satisfy the predicate). The lambda passed to the typed `filter` is opaque to the optimizer, so it cannot be pushed down to the data source, whereas the column expression can. We can check if the push-down filter optimization is actually applied by inspecting the query plan.

In [None]:
df.explain
ds.explain
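
A typed dataset does not have to give up this optimization: `filter` also accepts a column expression and still returns a `Dataset[T]`. The sketch below assumes a small temporary Parquet file with the same two columns as the test file.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.lit

// Local session, only here to keep the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("pushdown-sketch").getOrCreate()
import spark.implicits._

// Hypothetical, smaller version of the test file
val path = Files.createTempDirectory("pushdown-sketch").resolve("test").toString
spark.range(0, 1000).select($"id" as "_1", lit(1) as "_2").write.parquet(path)

val typed: Dataset[(Long, Int)] = spark.read.parquet(path).as[(Long, Int)]

// Lambda predicate: opaque to the optimizer
val withLambda: Dataset[(Long, Int)] = typed.filter(_._1 >= 995)

// Column predicate: same static type, but the condition can reach the Parquet reader
val withColumn: Dataset[(Long, Int)] = typed.filter($"_1" >= 995)

// Compare the PushedFilters entries of both plans
withLambda.explain()
withColumn.explain()
```

Both programs return the same rows; only the second plan is expected to list the condition under `PushedFilters`.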

### The `PartitionFilters` optimization

Let's create a test file with an additional column: 

In [None]:
spark.range(0, 1000000)
    .select($"id" as "_1", lit(1) as "_2", round(rand() * 10) mod lit(10) as "_3")
    .write.mode("overwrite").parquet("data/test")

In [None]:
val test: DataFrame = spark.read.parquet("data/test")

In [None]:
test.show

Let's suppose that we want to read data with value `_3` equal to `9.0`:

In [None]:
test.filter($"_3" === lit(9.0)).show

A pushed filter optimization is applied, but it would be better if we could directly read just those rows with the exact value for the third column. We can achieve that as follows:

In [None]:
test.write.mode("overwrite").partitionBy("_3").parquet("data/testP")

As we can see, the Parquet file is split into ten partitions. Now, if we just want to process data with a particular key, Spark will generate an optimal query:

In [None]:
val testP: DataFrame = spark.read.parquet("data/testP")

In [None]:
testP.filter($"_3" === lit(9.0)).show

We can inspect the Spark UI to check that we read less data in the last action.
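
The effect of `partitionBy` on the file layout can also be observed on disk. This is a minimal sketch under assumptions: a temporary output path, 100 rows, and a hypothetical partition key with three values instead of ten.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session, only here to keep the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("partition-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: the partition key _3 takes the values 0, 1 and 2
val path = Files.createTempDirectory("partition-sketch").resolve("testP").toString
spark.range(0, 100)
  .select($"id" as "_1", ($"id" % 3) as "_3")
  .write.partitionBy("_3").parquet(path)

// Each key value gets its own subdirectory, named <column>=<value>
val partitionDirs = new File(path).listFiles().filter(_.isDirectory).map(_.getName).sorted
partitionDirs.foreach(println)

// A filter on the partition column only needs to open the matching directory
val matching = spark.read.parquet(path).filter($"_3" === 2).count()
println(matching)
```

Since the key is encoded in the directory name, it is not stored inside the data files; Spark restores it as a column when reading the directory back.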