# Objective: Get used to the Jupyter and Spark environment.

> Keep the ScalaDoc handy: open https://spark.apache.org/docs/latest/api/scala/index.html#package in a new tab.

## Which Spark version are we using?

In [2]:
spark.version

2.4.5

> The Spark Web UI can be accessed from http://localhost:4040. Keep this open in another tab.

## Let us check important Spark objects:

1. `SparkSession`: the entry point. A single Spark application can have multiple sessions.
2. `SparkContext`: the entry point before Spark 2.0.
3. `spark.catalog`: the interface to Spark's *current* metastore, i.e. the data catalog of relational entities such as databases, tables, views, table columns and user-defined functions (UDFs).

In [3]:
spark

org.apache.spark.sql.SparkSession@9ea10b2

In [4]:
sc

org.apache.spark.SparkContext@331abc9d

In [5]:
spark.sparkContext

org.apache.spark.SparkContext@331abc9d

In [6]:
spark.catalog

org.apache.spark.sql.internal.CatalogImpl@3346091b
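
The claim that one application can hold several sessions can be checked directly. The following is a minimal sketch, assuming a local session is acceptable when the code runs outside this notebook; `numbers_view` is a hypothetical view name. It uses `newSession()` to create a second session and then compares the two.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A second session inside the same Spark application.
val secondSession = spark.newSession()

// The sessions are distinct objects, but they share one SparkContext.
val distinctSessions = secondSession ne spark
val sharedContext = secondSession.sparkContext eq spark.sparkContext
println(s"distinct sessions: $distinctSessions, shared context: $sharedContext")

// Temporary views belong to the session that registered them.
spark.range(3).createOrReplaceTempView("numbers_view") // hypothetical view name
val visibleInFirst = spark.catalog.tableExists("numbers_view")
val visibleInSecond = secondSession.catalog.tableExists("numbers_view")
println(s"visible in first: $visibleInFirst, visible in second: $visibleInSecond")
```

Each session keeps its own temporary views and runtime SQL configuration, while cached data and the underlying `SparkContext` are shared.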

## SparkSession

`SparkSession` can be used to:

1. Create a `DataFrame` or `Dataset`
2. Interface with the internal metastore, a.k.a. the data catalog
3. Execute SQL queries
4. Access `DataFrameReader` to load datasets

**1. Create DataFrame**

In [7]:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Define the data as a local collection of Rows
val data = Seq(
    Row(1, "apple"),
    Row(2, "banana"),
    Row(3, "cherry")
)

//Create a distributed collection i.e. RDD
val rdd = sc.parallelize(data)

//Define schema
val schema = StructType(
    Seq(
          StructField("id", IntegerType, true),
          StructField("fruit", StringType, true)
    )
)

//Create DF
val df = spark.createDataFrame(rdd, schema)

df.show()

+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  2|banana|
|  3|cherry|
+---+------+



data = List([1,apple], [2,banana], [3,cherry])
rdd = ParallelCollectionRDD[0] at parallelize at <console>:40
schema = StructType(StructField(id,IntegerType,true), StructField(fruit,StringType,true))
df = [id: int, fruit: string]


[id: int, fruit: string]
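
For small local collections there is a shorter route than building an RDD and a `StructType` by hand. The sketch below is an alternative and not the approach used in the cell above: it relies on `spark.implicits._` to derive the schema from Scala tuple types. The variable names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._ // brings toDF and the tuple encoders into scope

// Column types are inferred from the tuple element types (Int, String).
val fruitsDF = Seq((1, "apple"), (2, "banana"), (3, "cherry")).toDF("id", "fruit")
fruitsDF.printSchema()

// The same rows as a typed Dataset; tuple fields are mapped to columns by position.
val fruitsDS = fruitsDF.as[(Int, String)]
val idsAboveOne = fruitsDS.filter(_._1 > 1).count()
println(s"rows with id > 1: $idsAboveOne")
```

The explicit `Row` plus `StructType` route remains useful when the schema is only known at runtime or when nullability has to be controlled per column.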

**2. Interface with the internal metastore, a.k.a. the data catalog**

Let us list databases and tables. 

In [8]:
spark.catalog.listDatabases().show(10, truncate=false)

+-------+----------------+-------------------------------------------+
|name   |description     |locationUri                                |
+-------+----------------+-------------------------------------------+
|default|default database|file:/home/jovyan/notebooks/spark-warehouse|
+-------+----------------+-------------------------------------------+



In [9]:
spark.catalog.listTables().show()

+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+



Check if a database or table exists

In [19]:
val dbExists = spark.catalog.databaseExists("default")
val tblExists = spark.catalog.tableExists("test_tbl")

dbExists = true


true
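
Since the catalog is empty at this point, `tableExists` has nothing to find. A minimal sketch of the same checks against something that does exist, assuming a temporary view is an acceptable stand-in for a table (`fruit_view` is a hypothetical name):

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

val viewName = "fruit_view" // hypothetical name, used only for this illustration

// Register a temporary view so the catalog has an entry to report.
spark.range(3).toDF("id").createOrReplaceTempView(viewName)

// tableExists covers both temporary views and persistent tables.
val existsAfterCreate = spark.catalog.tableExists(viewName)
spark.catalog.listTables().show(truncate = false)

// dropTempView returns true when a view was actually removed.
val dropped = spark.catalog.dropTempView(viewName)
val existsAfterDrop = spark.catalog.tableExists(viewName)
println(s"after create: $existsAfterCreate, dropped: $dropped, after drop: $existsAfterDrop")
```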

List all the functions in the "default" database

In [21]:
spark.catalog.listFunctions("default").show(20, truncate=false)

+----------+--------+-----------+------------------------------------------------------------+-----------+
|name      |database|description|className                                                   |isTemporary|
+----------+--------+-----------+------------------------------------------------------------+-----------+
|!         |null    |null       |org.apache.spark.sql.catalyst.expressions.Not               |true       |
|%         |null    |null       |org.apache.spark.sql.catalyst.expressions.Remainder         |true       |
|&         |null    |null       |org.apache.spark.sql.catalyst.expressions.BitwiseAnd        |true       |
|*         |null    |null       |org.apache.spark.sql.catalyst.expressions.Multiply          |true       |
|+         |null    |null       |org.apache.spark.sql.catalyst.expressions.Add               |true       |
|-         |null    |null       |org.apache.spark.sql.catalyst.expressions.Subtract          |true       |
|/         |null    |null       |org.

**3. Execute SQL queries**

In [1]:
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES").show()

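
`spark.sql` returns a `DataFrame`, so queries can also run against data registered in the session. A minimal sketch, assuming a temporary view named `fruits` (a hypothetical name) built from the same three rows used earlier:

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Register a small DataFrame under a name that SQL can refer to.
val fruits = Seq((1, "apple"), (2, "banana"), (3, "cherry")).toDF("id", "fruit")
fruits.createOrReplaceTempView("fruits") // hypothetical view name

// The result of spark.sql is an ordinary DataFrame.
val result = spark.sql(
  "SELECT id, upper(fruit) AS fruit_upper FROM fruits WHERE id >= 2 ORDER BY id")
result.show()

val upperNames = result.collect().map(_.getString(1)).toSeq
```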

**4. Access DataFrameReader**

In [25]:
val reader = spark.read

val userDF = reader.option("header", "true").csv("../data/users.csv.gz")
userDF.show(5, truncate=false)

+---+------------------+-------------+-------------+----------+
|id |name              |department_id|date_of_birth|company_id|
+---+------------------+-------------+-------------+----------+
|155|Dulce Rosse       |1            |1991-02-04   |26147     |
|354|Imogene Marchand  |1            |1991-02-04   |146257    |
|499|Cleveland Bonet   |1            |1993-12-04   |218370    |
|505|Sharice Landrith  |1            |1991-02-04   |307860    |
|511|Jacquelynn Gadbois|1            |1993-12-04   |24097     |
+---+------------------+-------------+-------------+----------+
only showing top 5 rows



reader = org.apache.spark.sql.DataFrameReader@65bafc4a
userDF = [id: string, name: string ... 3 more fields]


lastException: Throwable = null


[id: string, name: string ... 3 more fields]
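
Every column above is read as `string` because the CSV reader does not infer types unless asked to. One way to get typed columns is to pass an explicit schema to the reader. The sketch below assumes the column types suggested by the sample rows and writes a one-row temporary file as a hypothetical stand-in for `../data/users.csv.gz`, so that it does not depend on that file being present.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical stand-in file with the same header as users.csv.gz.
val csvPath = Files.createTempFile("users_sample", ".csv")
val csvText =
  "id,name,department_id,date_of_birth,company_id\n" +
  "155,Dulce Rosse,1,1991-02-04,26147\n"
Files.write(csvPath, csvText.getBytes(StandardCharsets.UTF_8))

// Assumed column types; adjust them to the real data.
val userSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("department_id", IntegerType, nullable = true),
  StructField("date_of_birth", DateType, nullable = true),
  StructField("company_id", IntegerType, nullable = true)
))

// With an explicit schema the reader parses values instead of keeping strings.
val typedDF = spark.read
  .option("header", "true")
  .schema(userSchema)
  .csv(csvPath.toUri.toString)

typedDF.printSchema()
typedDF.show(truncate = false)
```

Setting `.option("inferSchema", "true")` is an alternative, at the cost of an extra pass over the data.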

## Session runtime configuration

We can set and retrieve `spark.sql.*` configuration parameters at runtime, per session, using `spark.conf.set()` and `spark.conf.get()`.

NOTE: Depending on the deployment mode, some options (most notably `spark.*.extraJavaOptions`) cannot be set this way; they can be changed only through `spark-submit` arguments or configuration files.

In [35]:
val conf = spark.conf

val keyVal = conf.getAll

keyVal.foreach(entry => println(entry._1 +" = "+entry._2))

spark.driver.host = a9c6572bcfd9
spark.driver.port = 45969
spark.repl.class.uri = spark://a9c6572bcfd9:45969/classes
spark.jars = file:/opt/conda/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.3.0-incubating.jar
spark.repl.class.outputDir = /tmp/spark-6fcb36d9-9b79-4778-8d34-4886927c41d1/repl-61d11760-9d1b-45c9-a776-e3e3b1196d38
spark.app.name = Apache Toree
spark.executor.id = driver
spark.driver.extraJavaOptions = -Dlog4j.logLevel=info
spark.submit.deployMode = client
spark.master = local[*]
spark.app.id = local-1590885055845


conf = org.apache.spark.sql.RuntimeConfig@5174a669
keyVal = Map(spark.driver.host -> a9c6572bcfd9, spark.driver.port -> 45969, spark.repl.class.uri -> spark://a9c6572bcfd9:45969/classes, spark.jars -> file:/opt/conda/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.3.0-incubating.jar, spark.repl.class.outputDir -> /tmp/spark-6fcb36d9-9b79-4778-8d34-4886927c41d1/repl-61d11760-9d1b-45c9-a776-e3e3b1196d38, spark.app.name -> Apache Toree, spark.executor.id -> driver, spark.driver.extraJavaOptions -> -Dlog4j.logLevel=info, spark.submit.deployMode -> client, spark.master -> local[*], spark.app.id -> local-1590885055845)


Map(spark.driver.host -> a9c6572bcfd9, spark.driver.port -> 45969, spark.repl.class.uri -> spark://a9c6572bcfd9:45969/classes, spark.jars -> file:/opt/conda/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.3.0-incubating.jar, spark.repl.class.outputDir -> /tmp/spark-6fcb36d9-9b79-4778-8d34-4886927c41d1/repl-61d11760-9d1b-45c9-a776-e3e3b1196d38, spark.app.name -> Apache Toree, spark.executor.id -> driver, spark.driver.extraJavaOptions -> -Dlog4j.logLevel=info, spark.submit.deployMode -> client, spark.master -> local[*], spark.app.id -> local-1590885055845)
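
Listing the configuration is only half of the runtime API. The following is a minimal sketch of changing and safely reading values, using `spark.sql.shuffle.partitions` as an example key. `spark.some.unset.key` is a hypothetical key that is assumed not to be set, and `isModifiable` is assumed to be available (it was added in Spark 2.4.0).

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

val key = "spark.sql.shuffle.partitions"

// Remember the current value so it can be restored afterwards.
val before = spark.conf.get(key)

// Change the value for this session only, then read it back.
spark.conf.set(key, "8")
val after = spark.conf.get(key)
println(s"$key: $before -> $after")

// isModifiable reports whether a key can be changed at runtime.
val sqlKeyModifiable = spark.conf.isModifiable(key)
val javaOptsModifiable = spark.conf.isModifiable("spark.driver.extraJavaOptions")
println(s"modifiable: $key=$sqlKeyModifiable, extraJavaOptions=$javaOptsModifiable")

// Safe lookups for keys that may not be set (hypothetical key name).
val missing = spark.conf.getOption("spark.some.unset.key")
val withDefault = spark.conf.get("spark.some.unset.key", "<not set>")
println(s"missing=$missing, withDefault=$withDefault")

// Restore the original value.
spark.conf.set(key, before)
```

`getOption` and the two-argument `get` avoid the `NoSuchElementException` that the single-argument `get` throws for an unknown key.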

## Spark and Metadata Catalogs

Structured data analysis needs metadata management: the location of the data, comments, statistics, schema and so on. Metadata can be **temporary** (for example temp tables and UDFs registered on the SQL context) or **permanent**. Since 2.0.0, Spark provides the `Catalog` abstraction to interact with metadata stores such as the Hive metastore or HCatalog.
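
Temporary metadata can be observed through the same catalog API. The sketch below registers a UDF under the hypothetical name `plus_one` and then looks it up with `listFunctions`; it illustrates the idea and is not part of the original walkthrough.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Register a session-scoped UDF (hypothetical name and logic).
spark.udf.register("plus_one", (x: Int) => x + 1)

// The function is now callable from SQL in this session.
val answer = spark.sql("SELECT plus_one(41) AS answer").head().getInt(0)
println(s"plus_one(41) = $answer")

// The catalog lists it alongside the built-in functions, flagged as temporary.
val registered = spark.catalog.listFunctions().collect().filter(_.name == "plus_one")
registered.foreach(f => println(s"${f.name} isTemporary=${f.isTemporary}"))
```

The registration is gone once the session ends, which is what distinguishes it from permanent metadata kept in a metastore.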

By default, Spark ships with the following three catalog implementations, all of which implement the internal `ExternalCatalog` interface:

1. InMemoryCatalog
    - ephemeral, used for learning & testing
2. HiveExternalCatalog
    - Used when you `enableHiveSupport()` on `SparkSession`
    - Internally uses `HiveClient` & `HiveClientImpl` to interact with Hive metastore
    - Wrapper on top of the Hive package `org.apache.hadoop.hive.ql.metadata`
3. ExternalCatalogWithListener
    - Since 2.4.0
    - Just a wrapper around #1 or #2
    - Adds the ability to post catalog-related events (table created, deleted, etc.) to Spark's *listener bus*
    
A Hive **metastore warehouse** (aka `spark-warehouse`) is the directory where Spark SQL persists table **data**, whereas a Hive **metastore** (aka `metastore_db`) is a relational database that manages the **metadata** of persistent relational entities, e.g. databases, tables, columns, partitions.

By default, Spark SQL uses the embedded deployment mode of a Hive metastore with an Apache Derby database.

In [4]:
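// Both keys resolve in this session, so get() returns their values.
println(spark.conf.get("spark.sql.warehouse.dir"))
println(spark.conf.get("spark.sql.catalogImplementation"))
// get() on a key that is not set (here the empty string) throws NoSuchElementException.
println(spark.conf.get(""))
// Another key worth inspecting: spark.sql.hive.metastore.jars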

file:/home/jovyan/notebooks/spark-warehouse/
in-memory


lastException = null


Name: java.util.NoSuchElementException
Message: 
StackTrace:   at org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:2042)
  at org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:2042)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:2042)
  at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74)