# Chapter 3: DataFrames, Datasets, and Spark SQL

In this notebook, we will explore the DataFrame and Dataset concepts in Spark, as well as other aspects of the SQL API.

In [1]:
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.Row



## Basics of Schemas: Creating DataFrames and Datasets

In this section, we will explore how to define a schema and how to use it to create DataFrames and Datasets.

### Creating a DataFrame Specifying the Schema

In [2]:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, ArrayType, MapType}
import org.apache.spark.sql.Row

In [3]:
val peopleSchema = new StructType(Array(StructField("id", IntegerType, true),
                                        StructField("name", StringType, true),
                                        StructField("age", IntegerType, true)))

peopleSchema = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(age,IntegerType,true))


StructType(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(age,IntegerType,true))

In [4]:
val peopleData = sc.parallelize(Array((1,"John", 23), 
                                      (2,"Maria", 23), 
                                      (3,"Isabella", 25),
                                      (4,"Abe", 25),
                                      (5,"Connor", 47),
                                      (6,"Daniel", 19))).map(x => Row.fromTuple(x))

peopleData = MapPartitionsRDD[1] at map at <console>:36


MapPartitionsRDD[1] at map at <console>:36

In [5]:
val peopleDf = spark.createDataFrame(peopleData, peopleSchema)

peopleDf = [id: int, name: string ... 1 more field]


[id: int, name: string ... 1 more field]

In [6]:
peopleDf.show()

+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|    John| 23|
|  2|   Maria| 23|
|  3|Isabella| 25|
|  4|     Abe| 25|
|  5|  Connor| 47|
|  6|  Daniel| 19|
+---+--------+---+
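
To check that a DataFrame really carries the schema it was given, the schema can be inspected after creation. The following is a minimal sketch, not part of the original notebook: the names `session`, `sketchSchema`, `sketchRows` and `sketchDf` are hypothetical, and the session is built with `getOrCreate()` on the assumption that it will reuse the notebook session when one exists.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()

// The third argument of StructField states whether the column may hold nulls.
val sketchSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

val sketchRows = session.sparkContext.parallelize(Seq(Row(1, "John"), Row(2, "Maria")))
val sketchDf = session.createDataFrame(sketchRows, sketchSchema)

// Print the schema as a tree and read individual properties programmatically.
sketchDf.printSchema()
val fieldNames = sketchDf.schema.fieldNames.toSeq
val idNullable = sketchDf.schema("id").nullable
```

Reading `schema` back like this is a cheap way to catch a wrong type or nullability flag before running heavier transformations.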



### Creating a DataFrame without Specifying the Schema

In [7]:
val peopleDataTuple = sc.parallelize(Array((1,"John", 23), 
                                           (2,"Maria", 23), 
                                           (3,"Isabella", 25),
                                           (4,"Abe", 25),
                                           (5,"Connor", 47),
                                           (6,"Daniel", 19)))

peopleDataTuple = ParallelCollectionRDD[7] at parallelize at <console>:31


ParallelCollectionRDD[7] at parallelize at <console>:31

In [8]:
val peopleDfNoSchema = spark.createDataFrame(peopleDataTuple)

peopleDfNoSchema = [_1: int, _2: string ... 1 more field]


[_1: int, _2: string ... 1 more field]

In [9]:
peopleDfNoSchema.show()

+---+--------+---+
| _1|      _2| _3|
+---+--------+---+
|  1|    John| 23|
|  2|   Maria| 23|
|  3|Isabella| 25|
|  4|     Abe| 25|
|  5|  Connor| 47|
|  6|  Daniel| 19|
+---+--------+---+
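
When the schema is inferred from tuples, the columns get the positional names `_1`, `_2`, `_3`. One way to give them readable names afterwards is `toDF`, as in this small sketch (the names `session`, `tupleRdd` and `namedDf` are hypothetical and not part of the original notebook):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("todf-sketch").getOrCreate()

val tupleRdd = session.sparkContext.parallelize(Seq((1, "John", 23), (2, "Maria", 23)))

// Types are still inferred from the tuple; toDF only assigns names, by position.
val namedDf = session.createDataFrame(tupleRdd).toDF("id", "name", "age")
namedDf.show()
```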



### Creating a Dataset / DataFrame from Case Classes

In [10]:
case class Person(id: Integer, Name: String, Age: Integer)

defined class Person


In [11]:
val dataCaseClasses = sc.parallelize(Array(Person(1,"John", 23), Person(2,"Maria", 23),
                                           Person(3,"Isabella", 25), Person(4,"Abe", 25),
                                           Person(5,"Connor", 47), Person(6,"Daniel", 19)))

dataCaseClasses = ParallelCollectionRDD[12] at parallelize at <console>:33


ParallelCollectionRDD[12] at parallelize at <console>:33

In [12]:
val peopleDsCaseClasses = spark.createDataset(dataCaseClasses)

peopleDsCaseClasses = [id: int, Name: string ... 1 more field]


[id: int, Name: string ... 1 more field]

In [13]:
peopleDsCaseClasses.show()

+---+--------+---+
| id|    Name|Age|
+---+--------+---+
|  1|    John| 23|
|  2|   Maria| 23|
|  3|Isabella| 25|
|  4|     Abe| 25|
|  5|  Connor| 47|
|  6|  Daniel| 19|
+---+--------+---+



In [14]:
peopleDsCaseClasses.rdd.take(1)(0).getClass

class $line34.$read$$iw$$iw$Person

In [15]:
val peopleDfCaseClasses = spark.createDataFrame(dataCaseClasses)

peopleDfCaseClasses = [id: int, Name: string ... 1 more field]


[id: int, Name: string ... 1 more field]

In [16]:
peopleDfCaseClasses.show()

+---+--------+---+
| id|    Name|Age|
+---+--------+---+
|  1|    John| 23|
|  2|   Maria| 23|
|  3|Isabella| 25|
|  4|     Abe| 25|
|  5|  Connor| 47|
|  6|  Daniel| 19|
+---+--------+---+
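
The practical difference between the two results is typing: the Dataset keeps its elements as case class instances, while a DataFrame is a Dataset of generic `Row` objects. The sketch below illustrates this with a hypothetical case class `SketchPerson` and hypothetical value names; it is an illustration under those assumptions, not code from the original notebook.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for this sketch.
case class SketchPerson(id: Int, name: String, age: Int)

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import session.implicits._

val peopleDs = Seq(SketchPerson(1, "John", 23), SketchPerson(3, "Isabella", 25)).toDS()

// Typed operations: the lambdas receive SketchPerson objects, so field access is compile-checked.
val olderNames = peopleDs.filter(p => p.age > 24).map(_.name).collect().toSeq

// toDF() drops the static type; as[T] restores it when the column names and types match.
val roundTrip = peopleDs.toDF().as[SketchPerson]
```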



## DataFrame API

In this section, we will explore the DataFrame API.

### Transformations

#### Simple Transformations

We can apply many simple transformations to DataFrames, such as `filter()`, often combined with helper functions like `lit()`, which wraps a literal value in a column expression.

In [17]:
peopleDf.filter(F.col("age") > 24).show()

+---+--------+---+
| id|    name|age|
+---+--------+---+
|  3|Isabella| 25|
|  4|     Abe| 25|
|  5|  Connor| 47|
+---+--------+---+



In [18]:
peopleDf.filter(F.lit(24) >= F.col("age")).show()

+---+------+---+
| id|  name|age|
+---+------+---+
|  1|  John| 23|
|  2| Maria| 23|
|  6|Daniel| 19|
+---+------+---+
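
Column predicates can also be combined, and `lit()` can be used to add a constant column. A minimal sketch, with hypothetical names (`session`, `sketchDf`, `midAges`) and made-up sample rows:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("filter-sketch").getOrCreate()
import session.implicits._

val sketchDf = Seq((1, "John", 23), (3, "Isabella", 25), (5, "Connor", 47)).toDF("id", "name", "age")

// && joins two Column predicates; lit turns a Scala constant into a Column.
val midAges = sketchDf
  .filter(F.col("age") > 24 && F.col("age") < F.lit(40))
  .withColumn("source", F.lit("sketch"))

midAges.show()
```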



Another interesting function is `explode`, which decomposes columns that hold lists or maps (dictionaries) into one row per element.

In [19]:
val schema = new StructType(Array(StructField("a", IntegerType, true),
                                  StructField("intList", ArrayType(IntegerType), true),
                                  StructField("mapField", MapType(StringType, StringType), true)))

val eDF_rdd = sc.parallelize(Array(Row(1, Array(1,2,3), Map("a" -> "b"))))

val eDF = spark.createDataFrame(eDF_rdd, schema)
eDF.show()

+---+---------+--------+
|  a|  intList|mapField|
+---+---------+--------+
|  1|[1, 2, 3]|[a -> b]|
+---+---------+--------+



schema = StructType(StructField(a,IntegerType,true), StructField(intList,ArrayType(IntegerType,true),true), StructField(mapField,MapType(StringType,StringType,true),true))
eDF_rdd = ParallelCollectionRDD[31] at parallelize at <console>:37
eDF = [a: int, intList: array<int> ... 1 more field]


[a: int, intList: array<int> ... 1 more field]

In [20]:
eDF.withColumn("anInt", F.explode(F.col("intList"))).show()

+---+---------+--------+-----+
|  a|  intList|mapField|anInt|
+---+---------+--------+-----+
|  1|[1, 2, 3]|[a -> b]|    1|
|  1|[1, 2, 3]|[a -> b]|    2|
|  1|[1, 2, 3]|[a -> b]|    3|
+---+---------+--------+-----+



In [21]:
eDF.select(F.col("a"), F.explode(F.col("mapField"))).show()

+---+---+-----+
|  a|key|value|
+---+---+-----+
|  1|  a|    b|
+---+---+-----+
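
One detail worth knowing is that `explode` drops rows whose collection is empty or null. The sketch below contrasts it with `explode_outer`, which keeps such rows with a null value; it assumes Spark 2.2 or later (where `explode_outer` is available) and uses hypothetical names and data.

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import session.implicits._

// The second row has an empty list on purpose.
val listDf = Seq((1, Seq(1, 2, 3)), (2, Seq.empty[Int])).toDF("a", "intList")

// explode: one output row per element, so row 2 disappears.
val exploded = listDf.select(F.col("a"), F.explode(F.col("intList")).alias("anInt"))

// explode_outer: row 2 is kept, with anInt set to null.
val explodedOuter = listDf.select(F.col("a"), F.explode_outer(F.col("intList")).alias("anInt"))

exploded.show()
explodedOuter.show()
```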



Another interesting option is to apply "if/else" conditions directly to a DataFrame with the following syntax:

In [22]:
import org.apache.spark.sql.functions.when

In [23]:
val peopleDfTag = peopleDf.select(F.col("id"), F.col("name"), F.col("age"),(when(F.col("age") > 45, 2).when(F.col("age") <= 20, 0).otherwise(1)).alias("encodedAge"))

peopleDfTag = [id: int, name: string ... 2 more fields]


[id: int, name: string ... 2 more fields]

In [24]:
peopleDfTag.show()

+---+--------+---+----------+
| id|    name|age|encodedAge|
+---+--------+---+----------+
|  1|    John| 23|         1|
|  2|   Maria| 23|         1|
|  3|Isabella| 25|         1|
|  4|     Abe| 25|         1|
|  5|  Connor| 47|         2|
|  6|  Daniel| 19|         0|
+---+--------+---+----------+
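
The same branching logic can be written as a SQL `CASE` expression through `expr`. The following sketch (hypothetical names and sample rows) builds the column both ways so the two forms can be compared; note that when `otherwise` is omitted, unmatched rows get null.

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("when-sketch").getOrCreate()
import session.implicits._

val agesDf = Seq(("John", 23), ("Connor", 47), ("Daniel", 19)).toDF("name", "age")

// Chained when/otherwise: branches are evaluated in order, the first match wins.
val withWhen = agesDf.withColumn("encodedAge",
  F.when(F.col("age") > 45, 2).when(F.col("age") <= 20, 0).otherwise(1))

// Equivalent SQL CASE expression parsed by expr.
val withExpr = agesDf.withColumn("encodedAge",
  F.expr("CASE WHEN age > 45 THEN 2 WHEN age <= 20 THEN 0 ELSE 1 END"))

withWhen.show()
withExpr.show()
```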



Another important set of functions are aggregations, which combine the `groupBy` command with an aggregation function such as `approxCountDistinct`, `avg`, `count`, `countDistinct`, `first`, `last`, `stddev`, `stddev_pop`, `sum`, or `sumDistinct`.

In [25]:
peopleDfTag.groupBy(F.col("encodedAge")).agg(F.avg(F.col("age"))).show()

+----------+--------+
|encodedAge|avg(age)|
+----------+--------+
|         1|    24.0|
|         2|    47.0|
|         0|    19.0|
+----------+--------+



In [26]:
peopleDfTag.groupBy(F.col("encodedAge")).agg(F.count(F.col("age"))).show()

+----------+----------+
|encodedAge|count(age)|
+----------+----------+
|         1|         4|
|         2|         1|
|         0|         1|
+----------+----------+



In [27]:
peopleDfTag.groupBy(F.col("encodedAge")).agg(F.countDistinct(F.col("age"))).show()

+----------+-------------------+
|encodedAge|count(DISTINCT age)|
+----------+-------------------+
|         1|                  2|
|         2|                  1|
|         0|                  1|
+----------+-------------------+
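
Several aggregations can be computed in a single `agg` call, and `alias` gives the result columns friendlier names than the defaults such as `avg(age)`. A small sketch with hypothetical names; the sample rows mirror the ages and groups used above:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()
import session.implicits._

val groupedInput = Seq((23, 1), (23, 1), (25, 1), (25, 1), (47, 2), (19, 0)).toDF("age", "encodedAge")

// One pass over each group produces all four aggregates.
val summary = groupedInput.groupBy(F.col("encodedAge")).agg(
  F.count(F.col("age")).alias("n"),
  F.avg(F.col("age")).alias("avgAge"),
  F.countDistinct(F.col("age")).alias("distinctAges"),
  F.max(F.col("age")).alias("maxAge"))

summary.orderBy(F.col("encodedAge")).show()
```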



`Window` functions compute a value for each row based on an aggregate over a group of related rows, without collapsing the group into a single row. Here we calculate, for each person in `peopleDfTag`, the difference between the average age of their age group (indicated by the field `encodedAge`) and their own age.

In [28]:
import org.apache.spark.sql.expressions.Window

In [29]:
val windowSpec = Window.partitionBy(peopleDfTag("encodedAge"))

windowSpec = org.apache.spark.sql.expressions.WindowSpec@763d3592


org.apache.spark.sql.expressions.WindowSpec@763d3592

In [30]:
val colAvgAgeDif = F.avg(peopleDfTag("age")).over(windowSpec) - peopleDfTag("age")

colAvgAgeDif = (avg(age) OVER (PARTITION BY encodedAge unspecifiedframe$()) - age)


(avg(age) OVER (PARTITION BY encodedAge unspecifiedframe$()) - age)

In [31]:
peopleDfTag.select(F.col("id"), F.col("name"), 
                   F.col("age"), F.col("encodedage"),
                  colAvgAgeDif.alias("ageDifinGroup")).show()

+---+--------+---+----------+-------------+
| id|    name|age|encodedage|ageDifinGroup|
+---+--------+---+----------+-------------+
|  1|    John| 23|         1|          1.0|
|  2|   Maria| 23|         1|          1.0|
|  3|Isabella| 25|         1|         -1.0|
|  4|     Abe| 25|         1|         -1.0|
|  5|  Connor| 47|         2|          0.0|
|  6|  Daniel| 19|         0|          0.0|
+---+--------+---+----------+-------------+
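
A window specification can also carry an ordering, which enables ranking functions such as `row_number`. The sketch below (hypothetical names and sample rows) adds both the group average and a per-group rank by descending age:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}
import org.apache.spark.sql.expressions.Window

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("window-sketch").getOrCreate()
import session.implicits._

val taggedDf = Seq(("John", 23, 1), ("Isabella", 25, 1), ("Connor", 47, 2))
  .toDF("name", "age", "encodedAge")

// Unordered window: the aggregate sees every row of the partition.
val byGroup = Window.partitionBy("encodedAge")

val ranked = taggedDf
  .withColumn("groupAvg", F.avg(F.col("age")).over(byGroup))
  // Ordered window: row_number numbers the rows of each group, oldest first.
  .withColumn("rankInGroup", F.row_number().over(byGroup.orderBy(F.col("age").desc)))

ranked.show()
```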



Finally, another important function is `orderBy()`, which sorts the rows by one or more columns.

In [32]:
peopleDfTag.orderBy(F.col("age")).show()

+---+--------+---+----------+
| id|    name|age|encodedAge|
+---+--------+---+----------+
|  6|  Daniel| 19|         0|
|  1|    John| 23|         1|
|  2|   Maria| 23|         1|
|  3|Isabella| 25|         1|
|  4|     Abe| 25|         1|
|  5|  Connor| 47|         2|
+---+--------+---+----------+
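
`orderBy` sorts in ascending order by default. Sort direction can be set per column with `desc` and `asc`, and additional columns break ties, as in this sketch with hypothetical names and sample rows:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("orderby-sketch").getOrCreate()
import session.implicits._

val unsortedDf = Seq((1, "John", 23), (2, "Maria", 23), (4, "Abe", 25)).toDF("id", "name", "age")

// Oldest first; people with the same age are ordered alphabetically by name.
val sortedDf = unsortedDf.orderBy(F.col("age").desc, F.col("name").asc)
val sortedNames = sortedDf.select("name").as[String].collect().toSeq

sortedDf.show()
```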



#### Multi-DataFrame Transformations: Set-Like operations

Another important group of transformations are the "set-like" operations, most of which operate on two DataFrames. We can highlight the following: `unionAll`, `intersect`, `except`, and `distinct` (which operates on a single DataFrame).

In [33]:
val peopleSchema = new StructType(Array(StructField("id", IntegerType, true),
                                        StructField("name", StringType, true),
                                        StructField("age", IntegerType, true),
                                        StructField("encodedAge", IntegerType, true)))


val peopleData = sc.parallelize(Array((5, "Connor", 47, 2), 
                                      (6, "Daniel", 19, 0))).map(x => Row.fromTuple(x))

val peopleDfTag2 = spark.createDataFrame(peopleData, peopleSchema)                                   

peopleSchema = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(age,IntegerType,true), StructField(encodedAge,IntegerType,true))
peopleData = MapPartitionsRDD[90] at map at <console>:46
peopleDfTag2 = [id: int, name: string ... 2 more fields]


[id: int, name: string ... 2 more fields]

`unionAll()` --> to concatenate two DataFrames, keeping duplicate rows (an alias of `union()` in Spark 2.0 and later)

In [34]:
val peopleDfTagUnion = peopleDfTag.unionAll(peopleDfTag2)

peopleDfTagUnion = [id: int, name: string ... 2 more fields]




[id: int, name: string ... 2 more fields]

In [35]:
peopleDfTagUnion.show()

+---+--------+---+----------+
| id|    name|age|encodedAge|
+---+--------+---+----------+
|  1|    John| 23|         1|
|  2|   Maria| 23|         1|
|  3|Isabella| 25|         1|
|  4|     Abe| 25|         1|
|  5|  Connor| 47|         2|
|  6|  Daniel| 19|         0|
|  5|  Connor| 47|         2|
|  6|  Daniel| 19|         0|
+---+--------+---+----------+



`intersect()` --> to get the intersection between two DataFrames:

In [36]:
peopleDfTag.intersect(peopleDfTag2).show()

+---+------+---+----------+
| id|  name|age|encodedAge|
+---+------+---+----------+
|  6|Daniel| 19|         0|
|  5|Connor| 47|         2|
+---+------+---+----------+



`except()` --> to remove from one DataFrame the rows that also appear in another DataFrame

In [37]:
peopleDfTag.except(peopleDfTag2).show()

+---+--------+---+----------+
| id|    name|age|encodedAge|
+---+--------+---+----------+
|  4|     Abe| 25|         1|
|  1|    John| 23|         1|
|  2|   Maria| 23|         1|
|  3|Isabella| 25|         1|
+---+--------+---+----------+



`distinct()` --> to get the distinct rows of a DataFrame

In [38]:
peopleDfTagUnion.distinct().show()

+---+--------+---+----------+
| id|    name|age|encodedAge|
+---+--------+---+----------+
|  4|     Abe| 25|         1|
|  6|  Daniel| 19|         0|
|  1|    John| 23|         1|
|  2|   Maria| 23|         1|
|  5|  Connor| 47|         2|
|  3|Isabella| 25|         1|
+---+--------+---+----------+
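
The union operations match columns by position, not by name, so both DataFrames need the same column order. When the order differs, `unionByName` can be used instead. This is a sketch under the assumption of Spark 2.3 or later (where `unionByName` is available); all names and rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("union-sketch").getOrCreate()
import session.implicits._

val leftDf = Seq((1, "John")).toDF("id", "name")
// Same columns as leftDf, but declared in the opposite order.
val rightDf = Seq(("Maria", 2)).toDF("name", "id")

// Columns are aligned by name; the result keeps the column order of leftDf.
val byName = leftDf.unionByName(rightDf)
byName.show()
```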



### Plain Old SQL Queries

It is also possible to run plain SQL queries instead of using the DataFrame API. Here is one example of how to do it:

In [39]:
peopleDfTag.registerTempTable("people")



In [40]:
spark.sql("SELECT * FROM people WHERE age > 20").show()

+---+--------+---+----------+
| id|    name|age|encodedAge|
+---+--------+---+----------+
|  1|    John| 23|         1|
|  2|   Maria| 23|         1|
|  3|Isabella| 25|         1|
|  4|     Abe| 25|         1|
|  5|  Connor| 47|         2|
+---+--------+---+----------+
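
In Spark 2.0 and later, `registerTempTable` is deprecated in favor of `createOrReplaceTempView`. The sketch below uses the newer call and runs the same query through SQL and through the DataFrame API; the view name `sketch_people` and the other names are hypothetical.

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// Assumption: reuse the running session if there is one, otherwise start a local one.
val session = SparkSession.builder().master("local[*]").appName("sql-sketch").getOrCreate()
import session.implicits._

val sqlInput = Seq(("John", 23, 1), ("Connor", 47, 2), ("Daniel", 19, 0))
  .toDF("name", "age", "encodedAge")

// Register the DataFrame under a name that SQL queries can refer to.
sqlInput.createOrReplaceTempView("sketch_people")

val viaSql = session.sql("SELECT name FROM sketch_people WHERE age > 20 ORDER BY age")
val viaApi = sqlInput.filter(F.col("age") > 20).orderBy(F.col("age")).select("name")

viaSql.show()
```

Both forms describe the same query, so choosing between them is largely a matter of readability.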

