## Spark Data Sources

This notebook shows how to use the Spark Data Sources API to read the following file formats:
* Parquet
* JSON
* CSV
* Avro
* ORC
* Image
* Binary

#### Define the file paths and the schema

In [0]:
%scala
val parquetFile = "/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet"
val jsonFile = "/databricks-datasets/learning-spark-v2/flights/summary-data/json/*"
val csvFile = "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*"
val orcFile = "/databricks-datasets/learning-spark-v2/flights/summary-data/orc/*"
val avroFile = "/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*"
val schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"
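
The `schema` value above is a DDL-formatted string. As a minimal sketch of what that string describes, it can be parsed into a `StructType` with `StructType.fromDDL`. The names `ddl` and `parsedSchema` are introduced here only for illustration.

```scala
import org.apache.spark.sql.types.StructType

// Same DDL string as `schema` above, repeated so this cell stands alone.
val ddl = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"

// Parse the DDL string into the equivalent programmatic schema.
val parsedSchema: StructType = StructType.fromDDL(ddl)

// Print one line per column: its name and its Spark SQL type.
parsedSchema.fields.foreach(f => println(s"${f.name}: ${f.dataType.simpleString}"))
```

Either form can be passed to `DataFrameReader.schema`; the DDL string is simply the shorter way to write it.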

## Parquet Data Source

In [0]:
%scala
val parquetDf = spark.read.format("parquet").option("path", parquetFile).load()

#### Another way to read the file, using the `parquet()` shorthand of this API

In [0]:
%scala
val parquetDf2 = spark.read.parquet(parquetFile)

In [0]:
%scala
parquetDf.show(10, false)
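
Parquet files store their schema alongside the data, which is why neither read above needs a `schema(...)` call. A quick check, assuming `parquetDf` and `parquetDf2` from the cells above are still in scope (the name `sameSchema` is only for this sketch):

```scala
// Parquet keeps the schema in the file metadata, so it is recovered on read.
parquetDf.printSchema()

// Both read styles use the same data source, so the schemas should be equal.
val sameSchema = parquetDf.schema == parquetDf2.schema
println(s"Schemas match: $sameSchema")
```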

## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING parquet OPTIONS (path "/databricks-datasets/definitive-guide/data/flight-data/parquet/2010-summary.parquet")

Use SQL to display the contents of the view

In [0]:
%scala
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, false)
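
A temporary view can also be registered from the DataFrame API instead of SQL. This is a sketch that assumes `parquetDf` from the Parquet section is still in scope; the view name `us_delay_flights_df_view` is made up for this example.

```scala
// Register the existing DataFrame as a session-scoped temporary view.
parquetDf.createOrReplaceTempView("us_delay_flights_df_view")

// spark.table returns the view as a DataFrame, like SELECT * FROM <view>.
val fromView = spark.table("us_delay_flights_df_view")
fromView.show(10, false)
```

Like the SQL version, this view lives only for the duration of the Spark session and does not copy the underlying files.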

## JSON Data Source

In [0]:
%scala
val jsonDf = spark.read.format("json").option("path", jsonFile).load()

In [0]:
%scala
jsonDf.show(10, false)

## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING json OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/json/*")

Use SQL to display the contents of the view

In [0]:
%scala
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, false)

## CSV Data Source

In [0]:
%scala
val csvDf = spark.read.format("csv").option("header", "true").schema(schema)
  .option("mode", "FAILFAST") // abort the read as soon as a malformed record is found
  .option("nullValue", "") // read empty strings as null
  .option("path", csvFile).load()

In [0]:
%scala
csvDf.show(10, false)
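
To see what the `mode` and `nullValue` options do without touching the dataset, the following sketch parses a few hand-written CSV lines. The sample rows and the names `ddl`, `sampleLines`, `permissiveDf` and `droppedDf` are invented for illustration, and the code assumes the notebook's `spark` session.

```scala
import spark.implicits._

// Same DDL string as `schema` above, repeated so this cell stands alone.
val ddl = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"

// Three made-up rows: one valid, one with an empty field,
// and one whose count is not a number.
val sampleLines = Seq(
  "United States,Romania,15",
  "United States,,344",
  "Egypt,United States,not_a_number"
).toDS()

// PERMISSIVE (the default) keeps malformed rows and uses null
// where a value cannot be parsed.
val permissiveDf = spark.read.schema(ddl)
  .option("mode", "PERMISSIVE")
  .option("nullValue", "") // empty strings are read as null
  .csv(sampleLines)
permissiveDf.show(false)

// DROPMALFORMED discards rows that do not fit the schema.
val droppedDf = spark.read.schema(ddl)
  .option("mode", "DROPMALFORMED")
  .csv(sampleLines)
droppedDf.show(false)
```

With `FAILFAST`, as used in the cell above, the same malformed row would instead raise an exception when the data is actually read.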

In [0]:
%scala
csvDf.write.format("parquet").mode("overwrite").option("path", "/tmp/data/parquet/df_parquet").option("compression", "snappy").save()

In [0]:
%fs ls /tmp/data/parquet/df_parquet

path,name,size,modificationTime
dbfs:/tmp/data/parquet/df_parquet/_SUCCESS,_SUCCESS,0,1651228165000
dbfs:/tmp/data/parquet/df_parquet/_committed_4648300368014824127,_committed_4648300368014824127,624,1651228130000
dbfs:/tmp/data/parquet/df_parquet/_committed_5275549232530391348,_committed_5275549232530391348,1234,1651228164000
dbfs:/tmp/data/parquet/df_parquet/_started_4648300368014824127,_started_4648300368014824127,0,1651228128000
dbfs:/tmp/data/parquet/df_parquet/_started_5275549232530391348,_started_5275549232530391348,0,1651228163000
dbfs:/tmp/data/parquet/df_parquet/part-00000-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-335-1-c000.snappy.parquet,part-00000-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-335-1-c000.snappy.parquet,5449,1651228163000
dbfs:/tmp/data/parquet/df_parquet/part-00001-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-336-1-c000.snappy.parquet,part-00001-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-336-1-c000.snappy.parquet,5409,1651228163000
dbfs:/tmp/data/parquet/df_parquet/part-00002-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-337-1-c000.snappy.parquet,part-00002-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-337-1-c000.snappy.parquet,5363,1651228163000
dbfs:/tmp/data/parquet/df_parquet/part-00003-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-338-1-c000.snappy.parquet,part-00003-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-338-1-c000.snappy.parquet,5412,1651228163000
dbfs:/tmp/data/parquet/df_parquet/part-00004-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-339-1-c000.snappy.parquet,part-00004-tid-5275549232530391348-ad596182-f78f-4221-9e32-8e89691513bd-339-1-c000.snappy.parquet,5319,1651228163000
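
The listing shows several `part-*.snappy.parquet` files, since Spark writes the partitions of a DataFrame in parallel, plus commit marker files. When a single output file is preferable for a small dataset, one option is to coalesce to one partition before writing. This sketch assumes `csvDf` from the cells above; the output path `singleFilePath` is a hypothetical location chosen for this example.

```scala
// Number of partitions in csvDf; non-empty partitions typically become part files.
println(s"Partitions: ${csvDf.rdd.getNumPartitions}")

// Hypothetical output directory used only in this example.
val singleFilePath = "/tmp/data/parquet/df_parquet_single"

// coalesce(1) merges all partitions, so the write produces a single part file.
csvDf.coalesce(1)
  .write.format("parquet")
  .mode("overwrite")
  .option("compression", "snappy")
  .save(singleFilePath)

// Read the result back to confirm that the row count is unchanged.
val roundTrip = spark.read.parquet(singleFilePath)
println(s"Rows written: ${roundTrip.count()}, rows in source: ${csvDf.count()}")
```

Coalescing to one partition removes write parallelism, so it is best kept for small outputs like this one.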


In [0]:
%scala
val csvDf2 = spark.read.option("header", "true")
  .option("mode", "FAILFAST") // abort the read as soon as a malformed record is found
  .option("nullValue", "") // read empty strings as null
  .schema(schema).csv(csvFile)

In [0]:
%scala
csvDf2.show(10, false)

## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING csv OPTIONS (
    path "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*",
    header "true",
    inferSchema "true",
    mode "FAILFAST"
  )

Use SQL to display the contents of the view

In [0]:
%scala
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, false)
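
The view above relies on `inferSchema`, which requires an extra pass over the data, whereas `csvDf` was read with an explicit DDL schema. A quick way to compare the two, assuming `csvDf` and the CSV-backed view from the cells above both exist (the names `inferred` and `explicit` are only for this sketch):

```scala
// Schema that Spark inferred for the CSV-backed SQL view.
val inferred = spark.table("us_delay_flights_tbl").schema

// Schema supplied explicitly through the DDL string.
val explicit = csvDf.schema

println(s"Inferred: ${inferred.simpleString}")
println(s"Explicit: ${explicit.simpleString}")
```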

## ORC Data Source

In [0]:
%scala
val orcDf = spark.read.format("orc").option("path", orcFile).load()

In [0]:
%scala
orcDf.show(10, false)

## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING orc OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/orc/*")

Use SQL to display the contents of the view

In [0]:
%scala
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, false)

## Avro Data Source

In [0]:
%scala
val avroDf = spark.read.format("avro").option("path", avroFile).load()

In [0]:
%scala
avroDf.show(10, false)

## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING avro OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*")

Use SQL to display the contents of the view

In [0]:
%scala
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, false)

## Image

In [0]:
%scala
import org.apache.spark.ml.source.image

val imageDir = "/databricks-datasets/cctvVideos/train_images/"
val imagesDf = spark.read.format("image").load(imageDir)

imagesDf.printSchema
imagesDf.select("image.height", "image.width", "image.nChannels", "image.mode", "label").show(5, false)

## Binary

In [0]:
%scala
val path = "/databricks-datasets/learning-spark-v2/cctvVideos/train_images/"
val binaryFilesDf = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").load(path)

binaryFilesDf.show(5)

To ignore any data partitioning in a directory (that is, to disable partition discovery so that partition columns are not added), set the `recursiveFileLookup` option to `true`

In [0]:
%scala
val binaryFilesDf = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").option("recursiveFileLookup", "true").load(path)

binaryFilesDf.show(5)
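
The binary file source produces the columns `path`, `modificationTime`, `length` and `content`. With `recursiveFileLookup` enabled, partition discovery is turned off, so a partition column such as `label` should no longer appear. A small sketch to check this, assuming the `binaryFilesDf` defined in the cell above:

```scala
// List the columns; no partition column is expected with recursiveFileLookup.
println(binaryFilesDf.columns.mkString(", "))

// Show file metadata only, leaving out the raw bytes stored in `content`.
binaryFilesDf.select("path", "length", "modificationTime").show(5, false)
```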