## Creating Databases, Tables, and Views

We create a database and tell Spark to use it. Note that every table created from now on is stored under this database.

In [16]:
spark.sql("""CREATE DATABASE learn_spark_db""")
spark.sql("""USE learn_spark_db""")

res14: org.apache.spark.sql.DataFrame = []
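
As a minimal sketch of the step above, the cell below makes the database creation safe to re-run and then checks which database is active. The `IF NOT EXISTS` clause and the `SparkSession.builder()` line are additions for illustration (the original cell relies on the notebook's predefined `spark`); `getOrCreate()` should simply return that existing session.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// IF NOT EXISTS avoids an error when the cell is executed a second time.
spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")
spark.sql("USE learn_spark_db")

// Confirm which database new tables will be created in.
val currentDb = spark.catalog.currentDatabase
println(currentDb)
```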


#### Creating a managed table

Note that in this case Spark manages both the metadata and the data.

In [29]:
//spark.sql("CREATE TABLE managed_us_delay_flights_tbl (date STRING, delay INT, distance INT, origin STRING, destination STRING)")

In [17]:
val schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
val flights_df = spark.read.schema(schema).csv("departuredelays.csv")

schema: String = date STRING, delay INT, distance INT, origin STRING, destination STRING
flights_df: org.apache.spark.sql.DataFrame = [date: string, delay: int ... 3 more fields]


In [18]:
flights_df.write.saveAsTable("managed_us_delay_fights_tble")
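
To illustrate what "managed" means here, the following sketch writes a tiny table and asks the catalog for its type. The table name `demo_managed_tbl` and the two-row DataFrame are hypothetical stand-ins for `flights_df`, chosen so the cell does not depend on the CSV file.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical two-row DataFrame standing in for flights_df.
val demoDf = Seq(("01010005", -8), ("01010010", -6)).toDF("date", "delay")

// No path option is given, so Spark stores the files under its warehouse directory.
spark.sql("DROP TABLE IF EXISTS demo_managed_tbl")
demoDf.write.saveAsTable("demo_managed_tbl")

// The catalog records whether Spark owns the data.
val managedType = spark.catalog.getTable("demo_managed_tbl").tableType
println(managedType)

// Dropping a managed table removes the metadata and the data files.
spark.sql("DROP TABLE demo_managed_tbl")
val stillThere = spark.catalog.tableExists("demo_managed_tbl")
```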

In [19]:
spark.sql(""" SELECT * FROM managed_us_delay_fights_tble ORDER BY DATE asc""").show()

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01010005|   -8|    2024|   LAX|        PBI|
|01010010|   -6|    1980|   SEA|        CLT|
|01010020|   -2|    1995|   SFO|        CLT|
|01010020|    0|    1273|   SFO|        DFW|
|01010023|   14|    1421|   SFO|        IAH|
|01010025|   -3|    1452|   PHX|        DTW|
|01010025|   33|    1198|   LAX|        IAH|
|01010029|   49|    1061|   LAS|        IAH|
|01010030|   -7|    2191|   SFO|        PHL|
|01010030|   -2|    1983|   PDX|        CLT|
|01010030|   -8|    1518|   LAS|        ATL|
|01010035|   -1|    1259|   ANC|        SEA|
|01010035|   -5|    1846|   LAX|        CLT|
|01010040|   -6|    1382|   SLC|        ATL|
|01010043|   18|    1413|   DEN|        JFK|
|01010045|  -11|    1891|   LAS|        PHL|
|01010050|   -6|    1340|   ANC|        PDX|
|01010053|   14|    1259|   ANC|        SEA|
|01010055|   -2|    2087|   LAX|        PHL|
|01010059|

#### Now for an unmanaged table

In SQL

In [64]:
spark.sql("""CREATE TABLE unmanaged_us_delay_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING)
USING csv OPTIONS (PATH "departuredelays.csv") """)

res61: org.apache.spark.sql.DataFrame = []


In [21]:
spark.sql("SELECT date FROM unmanaged_us_delay_flights_tbl").show()

+----+
|date|
+----+
+----+



Using the DataFrame API

In [30]:
flights_df.write.option("path", "spark-warehouse/unmanaged_us_flights_delay").saveAsTable("unmanaged_us_delay_flights_tbl2")
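
The practical consequence of supplying a `path` is that Spark only tracks the metadata. A small sketch of that behaviour follows, assuming a local filesystem as in this notebook; the table name `demo_unmanaged_tbl`, the temporary directory, and the two-row DataFrame are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical two-row DataFrame standing in for flights_df.
val demoDf = Seq(("01010005", -8), ("01010010", -6)).toDF("date", "delay")

// Hypothetical location outside the warehouse; the subdirectory does not exist yet.
val dataPath = Files.createTempDirectory("spark_demo").resolve("demo_unmanaged_tbl").toString

spark.sql("DROP TABLE IF EXISTS demo_unmanaged_tbl")
demoDf.write.option("path", dataPath).saveAsTable("demo_unmanaged_tbl")

// With an explicit path the catalog reports the table as external.
val unmanagedType = spark.catalog.getTable("demo_unmanaged_tbl").tableType
println(unmanagedType)

// Dropping the table removes only the catalog entry; the files stay on disk.
spark.sql("DROP TABLE demo_unmanaged_tbl")
val filesSurvive = new java.io.File(dataPath).exists()
println(filesSurvive)
```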

#### Creating a view

In SQL this would be "CREATE OR REPLACE GLOBAL TEMP VIEW us_origin_airport_SFO_global_tmp_view AS SELECT (...)".
With the DataFrame API it is done as follows.

In [55]:
spark.sql("""SELECT date, delay, origin, destination FROM managed_us_delay_fights_tble WHERE origin == "SFO" """).createOrReplaceGlobalTempView("us_origin_airport_SFO_global_tmp_view")
spark.sql("""SELECT date, delay, origin, destination FROM managed_us_delay_fights_tble WHERE origin == "JFK" """).createOrReplaceTempView("us_origin_airport_JFK_tmp_view")

The main difference between a global view and a non-global one is that the global view can be queried from any SparkSession within the same Spark application, whereas the non-global view is visible only in the current SparkSession. To query the global view, prefix its name with global_temp, as in global_temp.viewName.
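
The visibility difference can be checked with a second session. This is a sketch under the assumption that `spark.newSession()` is an acceptable way to simulate another session in the same application; the view names `demo_global_view` and `demo_session_view` and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val demo = Seq(("SFO", 10), ("JFK", -3)).toDF("origin", "delay")
demo.createOrReplaceGlobalTempView("demo_global_view")
demo.createOrReplaceTempView("demo_session_view")

// A second session in the same application shares the global_temp database.
val other = spark.newSession()
val globalCount = other.sql("SELECT * FROM global_temp.demo_global_view").count()

// The session-scoped view is registered only in the session that created it.
val visibleInOwner = spark.catalog.tableExists("demo_session_view")
val visibleInOther = other.catalog.tableExists("demo_session_view")
println(s"global rows: $globalCount, owner: $visibleInOwner, other: $visibleInOther")
```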

In [35]:
spark.sql("""SELECT * FROM global_temp.us_origin_airport_SFO_global_tmp_view""").show()
spark.sql("""SELECT * FROM us_origin_airport_JFK_tmp_view ORDER BY date DESC""").show()
spark.read.table("us_origin_airport_JFK_tmp_view").show()

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01011250|   55|   SFO|        JFK|
|01012230|    0|   SFO|        JFK|
|01010705|   -7|   SFO|        JFK|
|01010620|   -3|   SFO|        MIA|
|01010915|   -3|   SFO|        LAX|
|01011005|   -8|   SFO|        DFW|
|01011800|    0|   SFO|        ORD|
|01011740|   -7|   SFO|        LAX|
|01012015|   -7|   SFO|        LAX|
|01012110|   -1|   SFO|        MIA|
|01011610|  134|   SFO|        DFW|
|01011240|   -6|   SFO|        MIA|
|01010755|   -3|   SFO|        DFW|
|01010020|    0|   SFO|        DFW|
|01010705|   -6|   SFO|        LAX|
|01010925|   -3|   SFO|        ORD|
|01010555|   -6|   SFO|        ORD|
|01011105|   -8|   SFO|        DFW|
|01012330|   32|   SFO|        ORD|
|01011330|    3|   SFO|        DFW|
+--------+-----+------+-----------+
only showing top 20 rows

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|0

#### Catalog: the Spark SQL metastore

In [45]:
spark.catalog.listDatabases().show(truncate=false)

+--------------+----------------+--------------------------------------------------------+
|name          |description     |locationUri                                             |
+--------------+----------------+--------------------------------------------------------+
|default       |default database|file:/home/jovyan/work/spark-warehouse                  |
|learn_spark_db|                |file:/home/jovyan/work/spark-warehouse/learn_spark_db.db|
+--------------+----------------+--------------------------------------------------------+



In [56]:
spark.catalog.listTables().show(truncate=false)
spark.sql("show tables").show(truncate=false)

+-------------------------------+--------------+-----------+---------+-----------+
|name                           |database      |description|tableType|isTemporary|
+-------------------------------+--------------+-----------+---------+-----------+
|managed_us_delay_fights_tble   |learn_spark_db|null       |MANAGED  |false      |
|unmanaged_us_delay_flights_tbl |learn_spark_db|null       |EXTERNAL |false      |
|unmanaged_us_delay_flights_tbl2|learn_spark_db|null       |EXTERNAL |false      |
|unmanaged_us_delay_flights_tbl3|learn_spark_db|null       |EXTERNAL |false      |
|unmanaged_us_delay_flights_tble|learn_spark_db|null       |EXTERNAL |false      |
|us_origin_airport_jfk_tmp_view |null          |null       |TEMPORARY|true       |
+-------------------------------+--------------+-----------+---------+-----------+

+--------------+-------------------------------+-----------+
|namespace     |tableName                      |isTemporary|
+--------------+-------------------------------

In [47]:
spark.catalog.listColumns("managed_us_delay_fights_tble").show(truncate=false)

+-----------+-----------+--------+--------+-----------+--------+
|name       |description|dataType|nullable|isPartition|isBucket|
+-----------+-----------+--------+--------+-----------+--------+
|date       |null       |string  |true    |false      |false   |
|delay      |null       |int     |true    |false      |false   |
|distance   |null       |int     |true    |false      |false   |
|origin     |null       |string  |true    |false      |false   |
|destination|null       |string  |true    |false      |false   |
+-----------+-----------+--------+--------+-----------+--------+
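
Because the catalog methods return ordinary Datasets, their results can be filtered like any other table. A brief sketch, with `demo_catalog_view` as a hypothetical view created only so that there is something to find:

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical temporary view so the listing is not empty.
Seq(("SFO", 10)).toDF("origin", "delay").createOrReplaceTempView("demo_catalog_view")

// Keep only temporary views and collect their names to the driver.
val tempViewNames = spark.catalog.listTables()
  .filter("isTemporary = true")
  .select("name")
  .collect()
  .map(_.getString(0))

tempViewNames.foreach(println)
```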



#### Caching tables

This only shows how a table is cached and uncached. Caching strategies are discussed in Chapter 12.

In [49]:
spark.sql("""CACHE TABLE managed_us_delay_fights_tble""")
spark.sql("""UNCACHE TABLE managed_us_delay_fights_tble""")

res46: org.apache.spark.sql.DataFrame = []


Note that LAZY can go between CACHE and TABLE (CACHE LAZY TABLE ...), in which case the table is loaded into memory only when it is first used by an action, not when the statement runs.
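
A short sketch of the lazy variant, using a hypothetical view `demo_cache_view` instead of the flights table so the cell stands alone. `spark.catalog.isCached` is used here only to observe the effect.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical view to cache.
Seq(("SFO", 10), ("JFK", -3)).toDF("origin", "delay").createOrReplaceTempView("demo_cache_view")

// LAZY defers loading the data until the first action touches the view.
spark.sql("CACHE LAZY TABLE demo_cache_view")

// This action triggers the actual caching.
val rows = spark.table("demo_cache_view").count()
val cachedAfterAction = spark.catalog.isCached("demo_cache_view")

// Release the cached data again.
spark.sql("UNCACHE TABLE demo_cache_view")
val cachedAfterUncache = spark.catalog.isCached("demo_cache_view")
println(s"rows: $rows, cached: $cachedAfterAction, after uncache: $cachedAfterUncache")
```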

In [62]:
spark.sql("""DROP TABLE unmanaged_us_delay_flights_tbl3""")
spark.sql("""DROP TABLE unmanaged_us_delay_flights_tbl2""")
spark.sql("""DROP TABLE unmanaged_us_delay_flights_tbl""")
spark.sql("""DROP TABLE unmanaged_us_delay_flights_tble""")

res59: org.apache.spark.sql.DataFrame = []


In [67]:
spark.sql("""show tables""").show(truncate=false)

+--------------+------------------------------+-----------+
|namespace     |tableName                     |isTemporary|
+--------------+------------------------------+-----------+
|learn_spark_db|managed_us_delay_fights_tble  |false      |
|learn_spark_db|unmanaged_us_delay_flights_tbl|false      |
|              |us_origin_airport_jfk_tmp_view|true       |
+--------------+------------------------------+-----------+



#### .table(): useful for loading a table from the database into a DataFrame

In [68]:
spark.table("managed_us_delay_fights_tble").show()

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01010630|  -10|     928|   RSW|        EWR|
|01021029|   87|     974|   RSW|        ORD|
|01021346|    0|     928|   RSW|        EWR|
|01021044|   18|     928|   RSW|        EWR|
|01021730|   29|     748|   RSW|        IAH|
|01020535|  605|     974|   RSW|        ORD|
|01021820|   71|     974|   RSW|        ORD|
|01021743|    0|     928|   RSW|        EWR|
|01022017|    0|     928|   RSW|        EWR|
|01020600|   -2|     748|   RSW|        IAH|
|01021214|   29|     891|   RSW|        CLE|
|01020630|   -5|     928|   RSW|        EWR|
|01031029|   13|     974|   RSW|        ORD|
|01031346|  279|     928|   RSW|        EWR|
|01031740|   29|     748|   RSW|        IAH|
|01030535|    0|     974|   RSW|        ORD|
|01031808|   -3|     974|   RSW|        ORD|
|01031516|   -2|    1396|   RSW|        DEN|
|01032017|   14|     928|   RSW|        EWR|
|01031214|