## Spark SQL and DataFrames: Interacting with External Data Sources

This notebook shows code examples from Chapter 5.

#### Creating a function, registering the UDF, and creating a temporary view

In [0]:
%scala
val cubed = (s: Long) => {
  s * s * s
}

spark.udf.register("cubed", cubed)

spark.range(1, 9).createOrReplaceTempView("udf_test")

In [0]:
%scala
spark.sql("SELECT id, cubed(id) AS id_cubed FROM udf_test").show()

## Speeding up and distributing PySpark UDFs with Pandas UDFs

In [0]:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

def cubed(a: pd.Series) -> pd.Series:
    return a * a * a

cubed_udf = pandas_udf(cubed, returnType=LongType())

#### Using a pandas Series

In [0]:
x = pd.Series([1, 2, 3])

print(cubed(x)) 

0     1
1     8
2    27
dtype: int64


#### Using a Spark DataFrame

In [0]:
df = spark.range(1, 4)

df.select("id", cubed_udf(col("id"))).show()

+---+---------+
| id|cubed(id)|
+---+---------+
|  1|        1|
|  2|        8|
|  3|       27|
+---+---------+



## Higher-order functions in DataFrames and Spark SQL

In [0]:
%scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val arrayData = Seq(
  Row(1, List(1, 2, 3)),
  Row(2, List(2, 3, 4)),
  Row(3, List(3, 4, 5))
)

val arraySchema = new StructType().add("id", IntegerType).add("values", ArrayType(IntegerType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData), arraySchema)
df.createOrReplaceTempView("table")
df.printSchema()
df.show()

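#### Option 1: Explode and Collect

In this nested SQL statement, the inner query first calls `explode(values)`, which creates a new row (with the `id`) for each element (`value`) within `values`. The outer query then adds 1 to each element and uses `collect_list` to regroup the results by `id`.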

In [0]:
%scala
spark.sql("""SELECT id, collect_list(value + 1) AS newValues FROM (SELECT id, explode(values) AS value FROM table) x GROUP BY id""").show()
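The explode-then-collect pattern can be sketched in plain Python to make the two steps visible. This is only an illustration of the semantics, not Spark code; the names `rows`, `exploded`, and `new_values` are made up for the sketch, and the data mirrors the `table` view defined earlier.

```python
from collections import defaultdict

# Same toy rows as the `table` view: (id, values)
rows = [(1, [1, 2, 3]), (2, [2, 3, 4]), (3, [3, 4, 5])]

# "explode": emit one (id, value) pair per array element
exploded = [(row_id, value) for row_id, values in rows for value in values]

# "GROUP BY id" with "collect_list(value + 1)": regroup the incremented values
grouped = defaultdict(list)
for row_id, value in exploded:
    grouped[row_id].append(value + 1)

new_values = dict(grouped)
print(new_values)
```

In this single-process sketch the element order is preserved. In Spark, `collect_list` runs after a shuffle, so the order of the collected elements is not guaranteed.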

#### Option 2: User-defined function

To perform the same task (adding 1 to each element in `values`), we can also create a user-defined function (UDF) that uses `map` to iterate over each element (`value`) and perform the addition.

In [0]:
%scala
def addOne(values: Seq[Int]): Seq[Int] = {
  values.map(value => value + 1)
}

val plusOneInt = spark.udf.register("plusOneInt", addOne(_: Seq[Int]): Seq[Int])

spark.sql("SELECT id, plusOneInt(values) AS values from table").show()
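For comparison with the explode-and-collect approach, here is a minimal plain-Python sketch of what the UDF does to each row. It does not register anything with Spark; `add_one` and `rows` are illustrative names chosen for this sketch.

```python
def add_one(values):
    """Return a new list with 1 added to every element."""
    return [value + 1 for value in values]

# Same toy rows as the `table` view: (id, values)
rows = [(1, [1, 2, 3]), (2, [2, 3, 4]), (3, [3, 4, 5])]

# The function is applied once per row, so no regrouping step is needed
result = [(row_id, add_one(values)) for row_id, values in rows]
print(result)
```

Because each array is processed inside its own row, the element order is kept as-is.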

## Higher-order functions

In addition to the built-in functions mentioned previously, there are higher-order functions that take anonymous lambda functions as arguments.

In [0]:
%scala
val t1 = Array(35, 36, 32, 30, 40, 42, 38)
val t2 = Array(31, 32, 34, 55, 56)
val tC = Seq(t1, t2).toDF("celsius")
tC.createOrReplaceTempView("tC")

tC.show()

#### Transform

`transform(array<T>, function<T, U>): array<U>`

The `transform` function produces an array by applying a function to each element of an input array (similar to a `map` function).

In [0]:
%scala
spark.sql("""SELECT celsius, transform(celsius, t -> ((t * 9) div 5) + 32) AS fahrenheit FROM tC""").show
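As a rough plain-Python equivalent of the `transform` call above, the sketch below applies the same integer arithmetic to each element. It assumes Spark's `div` behaves like Python's `//` for these non-negative values; `celsius_to_fahrenheit` is a name invented for the sketch.

```python
t1 = [35, 36, 32, 30, 40, 42, 38]
t2 = [31, 32, 34, 55, 56]

def celsius_to_fahrenheit(celsius):
    """Apply ((t * 9) div 5) + 32 to every element, using integer division."""
    return [((t * 9) // 5) + 32 for t in celsius]

# One output array per input row, same length as the input array
fahrenheit = [celsius_to_fahrenheit(row) for row in (t1, t2)]
print(fahrenheit)
```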

#### Filter

`filter(array<T>, function<T, Boolean>): array<T>`

The `filter` function produces an array containing only the elements of the input array for which the Boolean function is true.

In [0]:
%scala
spark.sql("""SELECT celsius, filter(celsius, t -> t > 38) as high FROM tC""").show()
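The same filtering logic, sketched in plain Python for illustration only. `high_temperatures` and its `threshold` parameter are hypothetical names, not part of any Spark API.

```python
t1 = [35, 36, 32, 30, 40, 42, 38]
t2 = [31, 32, 34, 55, 56]

def high_temperatures(celsius, threshold=38):
    """Keep only the elements strictly greater than the threshold."""
    return [t for t in celsius if t > threshold]

high = [high_temperatures(row) for row in (t1, t2)]
print(high)
```

Note that 38 itself is dropped because the predicate uses a strict comparison.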

#### Exists

`exists(array<T>, function<T, Boolean>): Boolean`

The `exists` function returns true if the Boolean function holds for any element in the input array.

In [0]:
%scala
spark.sql("""SELECT celsius, exists(celsius, t -> t = 38) AS threshold FROM tC""").show()
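In plain Python, the closest analogue is the built-in `any`. The sketch below is illustrative only; `contains_temperature` and `target` are names made up for it.

```python
t1 = [35, 36, 32, 30, 40, 42, 38]
t2 = [31, 32, 34, 55, 56]

def contains_temperature(celsius, target=38):
    """Return True if at least one element equals the target."""
    return any(t == target for t in celsius)

threshold = [contains_temperature(row) for row in (t1, t2)]
print(threshold)
```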

#### Reduce

`reduce(array<T>, B, function<B, T, B>, function<B, R>)`

The `reduce` function reduces the elements of the array to a single value by merging the elements into a buffer `B` using `function<B, T, B>` and then applying a finishing `function<B, R>` to the final buffer.

In [0]:
%scala
spark.sql("""SELECT celsius, reduce(
          celsius,
          0,
          (acc, t) -> acc + t,
          acc -> (acc div size(celsius) * 9 div 5) + 32) 
          AS avgFahrenheit FROM tC""").show()
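The merge-then-finish structure can be sketched with Python's `functools.reduce`. This is a plain-Python illustration under the assumption that Spark's `div` matches `//` for these non-negative values; `avg_fahrenheit` is a hypothetical helper name.

```python
from functools import reduce

t1 = [35, 36, 32, 30, 40, 42, 38]
t2 = [31, 32, 34, 55, 56]

def avg_fahrenheit(celsius):
    # Merge step: fold every element into the buffer, starting from 0
    total = reduce(lambda acc, t: acc + t, celsius, 0)
    # Finish step: integer average, then the Celsius-to-Fahrenheit conversion
    return (total // len(celsius) * 9 // 5) + 32

avg = [avg_fahrenheit(row) for row in (t1, t2)]
print(avg)
```

Every division here is an integer division, so the result is a truncated approximation of the true average temperature in Fahrenheit.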

## Common relational operators in DataFrames and Spark SQL

In this section, we focus on the following operators:
* Unions and Joins
* Windowing
* Modifications

In [0]:
%scala
import org.apache.spark.sql.functions._

val delaysPath = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"
val airportsPath = "/databricks-datasets/learning-spark-v2/flights/airport-codes-na.txt"

val airports = spark.read.options(
    Map(
      "header" -> "true", 
      "inferSchema" ->  "true", 
      "sep" -> "\t")
  ).csv(airportsPath)

airports.createOrReplaceTempView("airports_na")

val delays = spark.read.option("header", "true").csv(delaysPath)
  .withColumn("delay", expr("CAST(delay AS INT) AS delay"))
  .withColumn("distance", expr("CAST(distance AS INT) AS distance"))

delays.createOrReplaceTempView("departureDelays")

val foo = delays.filter(expr("""
  origin == 'SEA' AND
  destination == 'SFO' AND
  date like '01010%' AND delay > 0"""))

foo.createOrReplaceTempView("foo")

In [0]:
%scala
spark.sql("SELECT * FROM airports_na LIMIT 10").show()

In [0]:
%scala
spark.sql("SELECT * FROM departureDelays LIMIT 10").show()

In [0]:
%scala
spark.sql("SELECT * FROM foo LIMIT 10").show()

## Unions

In [0]:
%scala
val bar = delays.union(foo)
bar.createOrReplaceTempView("bar")
bar.filter(expr("origin == 'SEA' AND destination == 'SFO' AND date LIKE '01010%' AND delay > 0")).show

In [0]:
%scala
spark.sql("""SELECT * FROM bar WHERE origin = 'SEA' AND destination = 'SFO' AND date LIKE '01010%' AND delay > 0""").show()
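Because `union` keeps duplicate rows (like SQL `UNION ALL`), every row of `foo` shows up twice in the filtered `bar` result. A minimal plain-Python sketch of that behavior follows; the rows are hypothetical and not taken from the dataset, and `matches` is a helper invented for the sketch.

```python
# Hypothetical rows: (date, delay, origin, destination)
delays = [
    ("01010710", 31, "SEA", "SFO"),
    ("01010955", 104, "SEA", "SFO"),
    ("01011200", 0, "SEA", "JFK"),
]

def matches(row):
    """Same predicate as the filter used to build `foo`."""
    date, delay, origin, destination = row
    return (origin == "SEA" and destination == "SFO"
            and date.startswith("01010") and delay > 0)

foo = [row for row in delays if matches(row)]
bar = delays + foo  # concatenation, duplicates are kept

filtered = [row for row in bar if matches(row)]
print(filtered)
```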

## Joins

In [0]:
%scala
foo.join(airports.as('air), $"air.IATA" === $"origin").select("City", "State", "date", "delay", "distance", "destination").show()
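The inner join above matches each flight's `origin` with an airport's `IATA` code. A plain-Python sketch of the same idea is shown below. The rows are hypothetical, and the dictionary lookup assumes each IATA code appears only once in the airport data; a general inner join would emit one row per matching pair.

```python
# Hypothetical airport and flight rows, for illustration only
airports = [
    {"City": "Seattle", "State": "WA", "IATA": "SEA"},
    {"City": "San Francisco", "State": "CA", "IATA": "SFO"},
]
foo = [
    {"date": "01010710", "delay": 31, "distance": 590,
     "origin": "SEA", "destination": "SFO"},
    {"date": "01010955", "delay": 12, "distance": 100,
     "origin": "XXX", "destination": "SFO"},  # no matching airport
]

# Index airports by the join key
by_iata = {airport["IATA"]: airport for airport in airports}

# Inner join: flights without a matching airport are dropped
joined = [
    {"City": by_iata[f["origin"]]["City"],
     "State": by_iata[f["origin"]]["State"],
     "date": f["date"], "delay": f["delay"],
     "distance": f["distance"], "destination": f["destination"]}
    for f in foo
    if f["origin"] in by_iata
]
print(joined)
```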

In [0]:
%scala
spark.sql("""SELECT a.City, a.State, f.date, f.delay, f.distance, f.destination FROM foo f
    JOIN airports_na a
    ON a.IATA = f.origin
""").show()

## Window functions

In [0]:
%scala
spark.sql("DROP TABLE IF EXISTS departureDelaysWindow")
spark.sql("""CREATE TABLE departureDelaysWindow AS
SELECT origin, destination, sum(delay) as TotalDelays 
  FROM departureDelays 
 WHERE origin IN ('SEA', 'SFO', 'JFK') 
   AND destination IN ('SEA', 'SFO', 'JFK', 'DEN', 'ORD', 'LAX', 'ATL') 
 GROUP BY origin, destination
""")

spark.sql("""SELECT * FROM departureDelaysWindow""").show()

Which three destinations have the most delays for flights originating from SEA, SFO, and JFK?

In [0]:
%scala
spark.sql("""SELECT origin, destination, TotalDelays, rank FROM (
             SELECT origin, destination, TotalDelays, dense_rank()
             OVER (PARTITION BY origin ORDER BY TotalDelays DESC) AS rank
             FROM departureDelaysWindow) t WHERE rank <= 3""").show()
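To make the behavior of `dense_rank()` over a partition more concrete, here is a plain-Python sketch. The input rows are hypothetical (not the real totals), and `dense_rank_top` is a helper name invented for illustration.

```python
from itertools import groupby

def dense_rank_top(rows, n):
    """rows: (origin, destination, total_delays). Keep rows with dense rank <= n."""
    result = []
    # PARTITION BY origin, ORDER BY total_delays DESC
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    for origin, group in groupby(ordered, key=lambda r: r[0]):
        rank, previous = 0, None
        for _, destination, total in group:
            if total != previous:  # ties share a rank; the next rank has no gap
                rank += 1
                previous = total
            if rank <= n:
                result.append((origin, destination, total, rank))
    return result

# Hypothetical totals, including a tie for first place within SEA
rows = [
    ("SEA", "SFO", 100), ("SEA", "DEN", 100), ("SEA", "ORD", 50),
    ("SEA", "LAX", 20), ("SEA", "ATL", 10),
    ("JFK", "LAX", 70), ("JFK", "SFO", 30),
]
top3 = dense_rank_top(rows, 3)
print(top3)
```

With ties, a partition can return more than three rows, since tied rows share the same rank.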

## Modifications

#### Adding new columns

In [0]:
%scala
val foo2 = foo.withColumn("status", expr("CASE WHEN delay <= 10 THEN 'On-time' ELSE 'Delayed' END"))
foo2.show()

#### Dropping columns

In [0]:
%scala
val foo3 = foo2.drop("delay")
foo3.show()

#### Renaming columns

In [0]:
%scala
val foo4 = foo3.withColumnRenamed("status", "flight_status")
foo4.show()

#### Pivoting

In [0]:
%scala
spark.sql("""SELECT destination, CAST(SUBSTRING(date, 0, 2) AS int) AS month, delay FROM departureDelays WHERE origin = 'SEA'""").show(10)

In [0]:
%scala
spark.sql("""SELECT * FROM (
SELECT destination, CAST(SUBSTRING(date, 0, 2) AS int) AS month, delay 
  FROM departureDelays WHERE origin = 'SEA' 
) PIVOT (
  CAST(AVG(delay) AS DECIMAL(4, 2)) as AvgDelay, MAX(delay) as MaxDelay
  FOR month IN (1 JAN, 2 FEB, 3 MAR)
)
ORDER BY destination""").show

In [0]:
%scala
spark.sql("""SELECT * FROM (
SELECT destination, CAST(SUBSTRING(date, 0, 2) AS int) AS month, delay FROM departureDelays 
WHERE origin = 'SEA' 
) PIVOT (
  CAST(AVG(delay) AS DECIMAL(4, 2)) as AvgDelay, MAX(delay) as MaxDelay
  FOR month IN (1 JAN, 2 FEB)
)
ORDER BY destination""").show()
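The pivot above turns month values into columns and computes two aggregates per (destination, month) cell. A plain-Python sketch of that reshaping is shown below. The input rows are hypothetical, and the `<label>_<aggregate>` column names are chosen to resemble the pivot output rather than copied from actual results.

```python
from collections import defaultdict

# Hypothetical rows: (destination, month, delay)
rows = [("ABQ", 1, 10), ("ABQ", 1, 30), ("ABQ", 2, 5), ("ANC", 1, -4)]
months = {1: "JAN", 2: "FEB"}  # the values listed in FOR month IN (...)

# Collect the delays that fall into each (destination, month) cell
cells = defaultdict(list)
for destination, month, delay in rows:
    if month in months:
        cells[(destination, month)].append(delay)

# Build one output record per destination, with two columns per month
pivoted = {}
for destination in sorted({d for d, _, _ in rows}):
    record = {}
    for month, label in months.items():
        delays = cells.get((destination, month))
        record[f"{label}_AvgDelay"] = round(sum(delays) / len(delays), 2) if delays else None
        record[f"{label}_MaxDelay"] = max(delays) if delays else None
    pivoted[destination] = record

print(pivoted)
```

Cells with no matching rows are filled with `None`, standing in for SQL `NULL`.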