# Spark Session in Azure Spark Notebook

This notebook explains and demonstrates how a **Spark Pool**, a **Spark Instance** and a **SparkSession** relate to one another. 
A Spark session gives a unified view of all the contexts and isolates configuration and environment.

**`Each notebook gets its own separate SparkSession, SparkContext and SQLContext`**

The default Spark session is already created for us and is available in the notebook through the variable `spark`.

## Default session

In [1]:
println("Spark Session 1 from variable spark: "+ spark)
val defaultSession = spark

// SparkSession encapsulates spark context, hive context, SQL context
println("SparkContext from spark session 1: "+spark.sparkContext)
println("SQLContext from spark session 1: "+spark.sqlContext)

The default SparkContext is also available through the variable `sc`.

In [2]:
println("SparkContext from variable sc: "+sc)

## Create a NEW session

In [3]:
val session2 = spark.newSession()

In [4]:
println("NEW Spark Session 2: "+ session2)

println("SparkContext from spark session 2: "+session2.sparkContext)
println("SQLContext from spark session 2: "+session2.sqlContext)

// The SparkContext is shared: both sessions point to the same one
assert(defaultSession.sparkContext == session2.sparkContext)

// Each session has its own SQLContext
assert(defaultSession.sqlContext != session2.sqlContext)

## Demo - Session Scoping & Isolation

In [5]:
val data = Seq(
                (1, "Aravind"),
                (2, "Triveni"),
                (3, "Esha"),
                (4, "Rishik")
            )

### Session level views

A temporary view is registered in the catalog of the session that created the DataFrame. Because this DF was created with `defaultSession`, a view on it is visible only in that session's scope.

In [6]:
val familyDf1 = defaultSession.createDataFrame(defaultSession.sparkContext.parallelize(data))
familyDf1.show()

familyDf1.createOrReplaceTempView("family_tbl")

println("Visible in 'defaultSession' scope")
defaultSession.sql("show tables like 'family*'").show()

println("NOT visible in 'session2' scope")
session2.sql("show tables like 'family*'").show()
//display(session2.sql("show tables"))

If you want the same data available in session 2, create a new DataFrame using `session2`. Once created, you can register it under the same view name.

In [8]:
val familyDf2 = session2.createDataFrame(session2.sparkContext.parallelize(data))
familyDf2.show()

familyDf2.createOrReplaceTempView("family_tbl")

println("Visible in 'session2' scope")
session2.sql("show tables like 'family*'").show()

println("'defaultSession' still has its own view that was registered earlier under the same name")
defaultSession.sql("show tables like 'family*'").show()
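Temporary views are session-scoped, as the cells above show. When the same data should be readable from every session of one application, a global temporary view is one option. The following is a minimal sketch, not part of the original notebook: the names `viewBase`, `viewOther`, `family_local` and `family_global` are made up for illustration, and the `local[*]` master is an assumption so the code also runs outside a notebook.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local master so the sketch also runs outside a notebook;
// inside the notebook, getOrCreate() returns the session that already exists.
val viewBase = SparkSession.builder.master("local[*]").appName("global-view-sketch").getOrCreate()
val viewOther = viewBase.newSession()

import viewBase.implicits._
val familyDf = Seq((1, "Aravind"), (2, "Triveni")).toDF("id", "name")

// Session-scoped view: registered only in viewBase's catalog
familyDf.createOrReplaceTempView("family_local")
// Application-scoped view: registered in the shared `global_temp` database
familyDf.createOrReplaceGlobalTempView("family_global")

// The other session cannot see the session-scoped view ...
val localSeenByOther = viewOther.catalog.tableExists("family_local")
// ... but it can query the global one through the `global_temp` qualifier
val globalCountFromOther = viewOther.sql("SELECT * FROM global_temp.family_global").count()

println(s"family_local visible in other session: $localSeenByOther")
println(s"rows read from global_temp.family_global: $globalCountFromOther")

// Clean up the views created by this sketch
viewBase.catalog.dropTempView("family_local")
viewBase.catalog.dropGlobalTempView("family_global")
```

A global temporary view lives until the Spark application ends or the view is dropped, so it should be used sparingly in a shared notebook.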

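## Demo - Session level configuration

Runtime SQL configuration is scoped to the session as well: a property changed on `session2` leaves `defaultSession` untouched.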

In [13]:
println("Value of 'spark.sql.crossJoin.enabled' prop on both sessions")
println("defaultSession: "+defaultSession.conf.get("spark.sql.crossJoin.enabled"))
println("session2: "+session2.conf.get("spark.sql.crossJoin.enabled"))

println("\nSet it to false on session 2")
session2.conf.set("spark.sql.crossJoin.enabled", false)

println("\nAfter setting it to false on session 2")
println("defaultSession: "+defaultSession.conf.get("spark.sql.crossJoin.enabled"))
println("session2: "+session2.conf.get("spark.sql.crossJoin.enabled"))

## Demo - Close NEW session

In [None]:
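// Caution: close() is an alias for stop(), which stops the SparkContext shared with defaultSession
session2.close()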

## Spark SQL Session Management commands

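- RESET: The RESET command restores runtime configurations that were set in the current session via SET to their default values.
- SET: The SET command sets a property, returns the value of an existing property, or returns all SQLConf properties with their value and meaning.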

In [2]:
%%sql
SET spark.sql.crossJoin.enabled;

In [3]:
%%sql
SET spark.sql.crossJoin.enabled=false;
SET spark.sql.crossJoin.enabled
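The cells above cover SET only. The sketch below runs both SET and RESET from Scala against a separate session, so the notebook's default configuration is left alone. It is an illustration under assumptions: `resetDemoSession` is a made-up name, the `local[*]` master is only there so the code runs outside a notebook, and the value after RESET is expected to match the starting value on Spark 3.x (older versions may behave differently if the property was set at application start).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local master so the sketch also runs outside a notebook;
// inside the notebook, getOrCreate() returns the session that already exists.
val resetBase = SparkSession.builder.master("local[*]").appName("set-reset-sketch").getOrCreate()
// Work on a new session so the default session's configuration is not modified
val resetDemoSession = resetBase.newSession()

val key = "spark.sql.crossJoin.enabled"
val before = resetDemoSession.conf.get(key)
// Flip the current boolean value so the change is observable on any Spark version
val flipped = (!before.toBoolean).toString

// SET changes the property for this session only
resetDemoSession.sql(s"SET $key=$flipped")
val afterSet = resetDemoSession.conf.get(key)

// RESET drops the session-level overrides again
resetDemoSession.sql("RESET")
val afterReset = resetDemoSession.conf.get(key)

println(s"before=$before afterSet=$afterSet afterReset=$afterReset")
```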

## Which session is currently active?

`SparkSession.builder.getOrCreate()` answers this question. The method first checks whether there is a valid thread-local SparkSession and, if so, returns it. 
**It then checks whether there is a valid global default SparkSession and, if so, returns that one.** If no valid global 
default SparkSession exists, the method creates a new SparkSession and assigns it as the 
global default. When an existing SparkSession is returned, the config options specified in the builder are applied 
to that existing SparkSession.

In [None]:
import org.apache.spark.sql.SparkSession
assert(defaultSession == SparkSession.builder.getOrCreate())
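The lookup order described above (thread-local active session first, then the global default) can be observed directly with `SparkSession.setActiveSession` and `SparkSession.clearActiveSession`. This is a small sketch rather than part of the original notebook; `lookupBase` and `lookupSecond` are illustrative names, and the `local[*]` master is an assumption for running outside a notebook.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local master so the sketch also runs outside a notebook;
// inside the notebook, getOrCreate() returns the session that already exists.
val lookupBase = SparkSession.builder.master("local[*]").appName("active-session-sketch").getOrCreate()
val lookupSecond = lookupBase.newSession()

// Remember the current thread-local session so it can be restored afterwards
val previouslyActive = SparkSession.getActiveSession

// 1. With a thread-local active session, getOrCreate() returns that session
SparkSession.setActiveSession(lookupSecond)
val resolvedWhileActive = SparkSession.builder.getOrCreate()

// 2. Without one, getOrCreate() falls back to the global default session
SparkSession.clearActiveSession()
val resolvedAfterClear = SparkSession.builder.getOrCreate()
val globalDefault = SparkSession.getDefaultSession

println(s"resolved while active is lookupSecond: ${resolvedWhileActive eq lookupSecond}")
println(s"resolved after clear is global default: ${globalDefault.contains(resolvedAfterClear)}")

// Restore the thread-local session that was active before the sketch ran
previouslyActive.foreach(s => SparkSession.setActiveSession(s))
```

The active session is thread-local, so in a notebook the outcome can depend on which thread executes the cell.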

## What happens when you run the first cell of the notebook?

When you run the first cell of a notebook, a new `SparkSession` object is instantiated. That object can be referenced in the notebook by the name `spark`. 
This object is our entry point for running Spark jobs. Behind the scenes:

1. Notebook submits the code to Livy job server
2. Livy creates a new Livy ID representing the `SparkSession` for the current notebook
3. Livy creates a new **Spark Instance** based on the **Spark Pool** definition provided by the pool the notebook is attached to (Livy uses YARN RM)
4. Livy creates a new *app tag* following the convention `<notebook-name>_<spark-pool-name>_<unix-epoch-timestamp>` and saves it in ZooKeeper (this step is for resiliency). 
This app tag is passed to YARN as the value of the property `spark.yarn.tags`
5. Livy submits the job to YARN, which creates an *app id* and returns it to Livy
6. Livy associates the *app tag* (from step 4) with *app id* (from step 5)
7. Livy returns the result of the job
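The *app id* and *app tag* from the steps above can be inspected from inside a running session. The sketch below reads them from the SparkContext. It assumes the tag is surfaced through the `spark.yarn.tags` property as described in step 4. When run locally (the `local[*]` master is an assumption for running outside a notebook) no tag is set and a placeholder is printed. `idSession` and `ctx` are illustrative names.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local master so the sketch also runs outside a notebook;
// inside the notebook, getOrCreate() returns the session that already exists.
val idSession = SparkSession.builder.master("local[*]").appName("app-id-sketch").getOrCreate()
val ctx = idSession.sparkContext

// Application id assigned by the cluster manager (YARN in step 5; "local-..." in local mode)
val appId: String = ctx.applicationId
// Tags passed at submission time, if any (step 4); None when the property is not set
val appTags: Option[String] = ctx.getConf.getOption("spark.yarn.tags")

println(s"Application id: $appId")
println(s"YARN tags: ${appTags.getOrElse("<not set>")}")
```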

## Additional Reading

- [Synapse Job Server Architecture](https://github.com/yaravind/technical-reference/blob/d280f647f7bbaafaefbbb9f90626da3eac30b1fc/src/spark/Synapse%20Job%20Service.png)
- [Synapse Spark Livy Endpoint Reference](https://docs.microsoft.com/en-us/rest/api/synapse/data-plane/spark-session)
- [Spark Pool](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-concepts#spark-pools)
- [Spark Instance](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-concepts#spark-instances)
- [Best practices for Spark memory management](https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/)
- [Spark SQL Session commands](https://spark.apache.org/docs/latest/sql-ref-syntax-aux-conf-mgmt-reset.html)