## Spark Data Sources

This notebook shows how to use the Spark Data Sources API to read the following file formats:
 * Parquet
 * JSON
 * CSV
 * Avro
 * ORC
 * Image
 * Binary

#### Define the file paths

In [0]:
parquet_file = "/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet"
json_file = "/databricks-datasets/learning-spark-v2/flights/summary-data/json/*"
csv_file = "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*"
orc_file = "/databricks-datasets/learning-spark-v2/flights/summary-data/orc/*"
avro_file = "/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*"
schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"

## Parquet Data Source

In [0]:
parquet_df = spark.read.format("parquet").option("path", parquet_file).load()

#### Another way to read the file, using a shorthand variant of this API

In [0]:
parquet_df2 = spark.read.parquet(parquet_file)

In [0]:
parquet_df.show(10, False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING parquet OPTIONS (path "/databricks-datasets/definitive-guide/data/flight-data/parquet/2010-summary.parquet")

Use SQL to query the view

In [0]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## JSON Data Source

In [0]:
json_df = spark.read.format("json").option("path", json_file).load()

In [0]:
json_df.show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
|United States    |Singapore          |1    |
|United States    |Grenada            |62   |
|Costa Rica       |United States      |588  |
|Senegal          |United States      |40   |
|Moldova          |United States      |1    |
+-----------------+-------------------+-----+
only showing top 10 rows



#### Another way to read the file, using a shorthand variant of this API

In [0]:
json_df2 = spark.read.json(json_file)

In [0]:
json_df2.show(10, False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
|United States    |Singapore          |1    |
|United States    |Grenada            |62   |
|Costa Rica       |United States      |588  |
|Senegal          |United States      |40   |
|Moldova          |United States      |1    |
+-----------------+-------------------+-----+
only showing top 10 rows



## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING json OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/json/*")

Use SQL to query the view

In [0]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
|United States    |Singapore          |1    |
|United States    |Grenada            |62   |
|Costa Rica       |United States      |588  |
|Senegal          |United States      |40   |
|Moldova          |United States      |1    |
+-----------------+-------------------+-----+
only showing top 10 rows



## CSV Data Source

In [0]:
csv_df = (spark.read.format("csv").option("header", "true").schema(schema)
         .option("mode", "FAILFAST") # fail immediately if a malformed record is found
         .option("nullValue", "") # read empty strings as null
         .option("path", csv_file).load())

In [0]:
csv_df.show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



In [0]:
csv_df.write.format("parquet").mode("overwrite").option("path", "/tmp/data/parquet/df_parquet").option("compression", "snappy").save()

In [0]:
%fs ls /tmp/data/parquet/df_parquet

path,name,size,modificationTime
dbfs:/tmp/data/parquet/df_parquet/_committed_1586011079372287289,_committed_1586011079372287289,1223,1651236065000
dbfs:/tmp/data/parquet/df_parquet/_committed_4648300368014824127,_committed_4648300368014824127,624,1651228130000
dbfs:/tmp/data/parquet/df_parquet/_committed_5275549232530391348,_committed_5275549232530391348,1234,1651228164000
dbfs:/tmp/data/parquet/df_parquet/_committed_696789220469223954,_committed_696789220469223954,1217,1651236104000
dbfs:/tmp/data/parquet/df_parquet/_committed_vacuum8453186849780295908,_committed_vacuum8453186849780295908,96,1651236067000
dbfs:/tmp/data/parquet/df_parquet/_started_1586011079372287289,_started_1586011079372287289,0,1651236064000
dbfs:/tmp/data/parquet/df_parquet/_started_696789220469223954,_started_696789220469223954,0,1651236103000
dbfs:/tmp/data/parquet/df_parquet/part-00000-tid-696789220469223954-c980be47-e6ec-4748-b962-414d077a70a3-833-1-c000.snappy.parquet,part-00000-tid-696789220469223954-c980be47-e6ec-4748-b962-414d077a70a3-833-1-c000.snappy.parquet,5449,1651236103000
dbfs:/tmp/data/parquet/df_parquet/part-00001-tid-696789220469223954-c980be47-e6ec-4748-b962-414d077a70a3-834-1-c000.snappy.parquet,part-00001-tid-696789220469223954-c980be47-e6ec-4748-b962-414d077a70a3-834-1-c000.snappy.parquet,5409,1651236103000
dbfs:/tmp/data/parquet/df_parquet/part-00002-tid-696789220469223954-c980be47-e6ec-4748-b962-414d077a70a3-835-1-c000.snappy.parquet,part-00002-tid-696789220469223954-c980be47-e6ec-4748-b962-414d077a70a3-835-1-c000.snappy.parquet,5363,1651236103000


In [0]:
csv_df2 = (spark.read.option("header", "true")
  .option("mode", "FAILFAST") # fail immediately if a malformed record is found
  .option("nullValue", "") # read empty strings as null
  .schema(schema).csv(csv_file))

In [0]:
csv_df2.show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING csv OPTIONS (
      path "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*",
      header "true",
      inferSchema "true",
      mode "FAILFAST"
    )

Use SQL to query the view

In [0]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## ORC Data Source

In [0]:
orc_df = spark.read.format("orc").option("path", orc_file).load()

In [0]:
orc_df.show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING orc OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/orc/*")

Use SQL to query the view

In [0]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Avro Data Source

In [0]:
avro_df = spark.read.format("avro").option("path", avro_file).load()

In [0]:
avro_df.show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## SQL

#### Create an unmanaged temporary view

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING avro OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*")

Use SQL to query the view

In [0]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Image

In [0]:
from pyspark.ml import image

image_dir = "/databricks-datasets/cctvVideos/train_images/"
images_df = spark.read.format("image").load(image_dir)
images_df.printSchema()

images_df.select("image.height", "image.width", "image.nChannels", "image.mode", "label").show(5, truncate=False)

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |-- label: integer (nullable = true)

+------+-----+---------+----+-----+
|height|width|nChannels|mode|label|
+------+-----+---------+----+-----+
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |1    |
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |0    |
+------+-----+---------+----+-----+
only showing top 5 rows



## Binary

In [0]:
path = "/databricks-datasets/learning-spark-v2/cctvVideos/train_images/"
binary_files_df = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").load(path)

binary_files_df.show(5)

+--------------------+-------------------+------+--------------------+-----+
|                path|   modificationTime|length|             content|label|
+--------------------+-------------------+------+--------------------+-----+
|dbfs:/databricks-...|2020-01-02 20:42:21| 55037|[FF D8 FF E0 00 1...|    0|
|dbfs:/databricks-...|2020-01-02 20:42:31| 54634|[FF D8 FF E0 00 1...|    1|
|dbfs:/databricks-...|2020-01-02 20:42:21| 54624|[FF D8 FF E0 00 1...|    0|
|dbfs:/databricks-...|2020-01-02 20:42:22| 54505|[FF D8 FF E0 00 1...|    0|
|dbfs:/databricks-...|2020-01-02 20:42:22| 54475|[FF D8 FF E0 00 1...|    0|
+--------------------+-------------------+------+--------------------+-----+
only showing top 5 rows



To skip partition discovery in a directory, set the `recursiveFileLookup` option to `true`. The `label` column, which Spark infers from the partition directories, then no longer appears in the result.

In [0]:
binary_files_df = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").option("recursiveFileLookup", "true").load(path)

binary_files_df.show(5)

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|dbfs:/databricks-...|2020-01-02 20:42:21| 55037|[FF D8 FF E0 00 1...|
|dbfs:/databricks-...|2020-01-02 20:42:31| 54634|[FF D8 FF E0 00 1...|
|dbfs:/databricks-...|2020-01-02 20:42:21| 54624|[FF D8 FF E0 00 1...|
|dbfs:/databricks-...|2020-01-02 20:42:22| 54505|[FF D8 FF E0 00 1...|
|dbfs:/databricks-...|2020-01-02 20:42:22| 54475|[FF D8 FF E0 00 1...|
+--------------------+-------------------+------+--------------------+
only showing top 5 rows

