# Spark Configuration

Spark provides three locations to configure the system:

1. Spark Properties
    - Set programmatically by creating an `org.apache.spark.SparkConf` or through Java system properties.
    Properties set directly on `SparkConf` take precedence over Java system properties.
    - We can set arbitrary (our own) properties using `set(...)`
2. Environment variables
    - Per machine settings
    - Set through the `conf/spark-env.sh` script on each node
3. Logging
    - Configured through `log4j2.properties`
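
The precedence rule for Spark properties can be seen in a minimal sketch. It assumes only that the Spark core library is on the classpath; the keys `spark.example.greeting` and `spark.myapp.batchSize` are hypothetical names chosen for illustration. Note that this builds a standalone `SparkConf` object and does not change the configuration of an already running session.

```scala
import org.apache.spark.SparkConf

// Java system properties with the "spark." prefix are loaded by SparkConf
// when it is created with loadDefaults = true (the default constructor).
System.setProperty("spark.example.greeting", "from-system-property") // hypothetical key

val conf = new SparkConf()
val beforeSet = conf.get("spark.example.greeting")

// A value set directly on SparkConf replaces the one loaded from system properties.
conf.set("spark.example.greeting", "from-sparkconf")

// Arbitrary (custom) keys are accepted as well.
conf.set("spark.myapp.batchSize", "500") // hypothetical custom property

println(s"before set: $beforeSet")
println(s"after set:  ${conf.get("spark.example.greeting")}")
println(s"custom:     ${conf.getInt("spark.myapp.batchSize", 100)}")
```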

**Note**
```
Synapse Spark uses YARN in cluster mode. On YARN, environment variables need to be set using the 
spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. 
Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode.
```
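
The property naming from the note above can be sketched as follows. The variable name and value are hypothetical, and the snippet only builds a `SparkConf` to show the resulting keys; for the variables to reach the Application Master and executors, the same properties must be in place before the application is submitted (for example in `conf/spark-defaults.conf`).

```scala
import org.apache.spark.SparkConf

val envName = "MY_SERVICE_URL"               // hypothetical variable name
val envValue = "https://example.invalid/api" // hypothetical value

val yarnEnvConf = new SparkConf(false) // false: do not load Java system properties
  // Environment variable for the YARN Application Master process.
  .set(s"spark.yarn.appMasterEnv.$envName", envValue)
  // Environment variable for executor processes (stored as spark.executorEnv.<name>).
  .setExecutorEnv(envName, envValue)

yarnEnvConf.getAll.sorted.foreach { case (key, value) => println(s"$key = $value") }
```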

## View version information

In [1]:
import com.tccc.dna.synapse.spark.SynapseSpark
SynapseSpark.printVersions

## View App information

In [2]:

SynapseSpark.printAppInfo

## Viewing Spark Properties

- Spark properties can be viewed in two ways:
    1. From the **Environment tab** of the submitted application's web UI at http://<driver>:4040, or
    2. Programmatically through the `SparkConf.getXXX(...)` methods.
- Note that only values **explicitly** specified through `spark-defaults.conf`, `SparkConf`, or the command line will appear.
 For all other configuration properties, you can assume the default value is used.
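
As a rough illustration of the `getXXX(...)` family, the sketch below reads a few properties with typed getters. The sample configuration and the key `spark.myapp.missing` are made up for this example, and the fallback values passed to the getters are this sketch's own choices, not necessarily Spark's defaults.

```scala
import org.apache.spark.SparkConf

// Formats a handful of properties using typed getters with caller-supplied fallbacks.
def describe(conf: SparkConf): Seq[String] = Seq(
  s"spark.app.name = ${conf.get("spark.app.name", "<not set>")}",
  s"spark.executor.cores = ${conf.getInt("spark.executor.cores", 1)}",
  s"spark.dynamicAllocation.enabled = ${conf.getBoolean("spark.dynamicAllocation.enabled", false)}",
  s"spark.executor.memory (MiB) = ${conf.getSizeAsMb("spark.executor.memory", "1g")}",
  s"spark.myapp.missing = ${conf.getOption("spark.myapp.missing")}" // hypothetical key
)

// Illustrative values only.
val sampleConf = new SparkConf(false)
  .set("spark.app.name", "config-demo")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "2g")

describe(sampleConf).foreach(println)
```

In the notebook, `describe(sc.getConf)` would apply the same getters to the running application's configuration, assuming the predefined `sc` is available.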

### Viewing programmatically

You can retrieve the properties through either `SparkSession` or `SparkContext`. Both expose the same underlying properties, although values set on the session at runtime appear only in `spark.conf`. `com.tccc.dna.synapse.spark.SynapseSpark` encapsulates the logic required to view all configured properties. This notebook illustrates its usage.

In [3]:
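{
    // `spark` (SparkSession) and `sc` (SparkContext) are predefined in the notebook session.
    val confThruSession = spark.conf.getAll   // Map[String, String]
    val confThruContext = sc.getConf.getAll   // Array[(String, String)]

    // Note: an Array's hashCode is identity-based, so these two values are not expected to match.
    println("Conf HashCode from SparkSession/RuntimeConfig: " + confThruSession.hashCode)
    println("Conf HashCode from SparkContext: " + confThruContext.hashCode)

    println("Conf size from SparkSession/RuntimeConfig: " + confThruSession.size)
    println("Conf size from SparkContext: " + confThruContext.size)
}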
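
Hash codes and sizes only hint at whether the two views agree. A more direct comparison is to diff the key sets, as in the sketch below. The helper `keyDiff` and the sample data are hypothetical and exist only for this illustration.

```scala
// Returns (keys only in the session view, keys only in the context view).
def keyDiff(
    session: Map[String, String],
    context: Iterable[(String, String)]
): (Set[String], Set[String]) = {
  val sessionKeys = session.keySet
  val contextKeys = context.map(_._1).toSet
  (sessionKeys -- contextKeys, contextKeys -- sessionKeys)
}

// Illustrative data only.
val sampleSession = Map("spark.app.name" -> "demo", "spark.sql.shuffle.partitions" -> "200")
val sampleContext = Seq("spark.app.name" -> "demo", "spark.master" -> "yarn")

val (onlyInSession, onlyInContext) = keyDiff(sampleSession, sampleContext)
println(s"Only in session: ${onlyInSession.toSeq.sorted.mkString(", ")}")
println(s"Only in context: ${onlyInContext.toSeq.sorted.mkString(", ")}")
```

In the notebook, the call would be `keyDiff(spark.conf.getAll, sc.getConf.getAll)`, assuming the predefined `spark` and `sc` values.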

#### Get properties from SparkSession

In [4]:
SynapseSpark.printAppInfo

#### Get properties from SparkContext

In [5]:
println("App Name (from context): "+ sc.appName)
println("App Id (from context): "+ sc.applicationId)
SynapseSpark.printAppInfo

### Explicitly accessed props

***
```
Note that only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. 
For all other configuration properties, you can assume that the default value is used, that no value is set, 
or that the value is set through another configuration location.
```
***

*spark.sql.sources.default* 

- Default: parquet
- Used when:
    - Reading (DataFrameReader) or writing (DataFrameWriter) datasets
    - Creating external table from a path (in Catalog.createExternalTable)
    - Reading (DataStreamReader) or writing (DataStreamWriter) in Structured Streaming

Reference: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-properties.html
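
A small sketch of how this property comes into play: when no format is given, `save` and `load` fall back to the configured default source. It assumes the notebook's predefined `spark` session and write access to the target location; the path is a hypothetical placeholder.

```scala
// Hypothetical output location; replace with a path you are allowed to write to.
val demoPath = "/tmp/sources-default-demo"

println("spark.sql.sources.default = " + spark.conf.get("spark.sql.sources.default"))

// No .format(...) call: the writer uses the default data source.
spark.range(5).write.mode("overwrite").save(demoPath)

// No .format(...) call: the reader uses the default data source as well.
val reloaded = spark.read.load(demoPath)
reloaded.show()
```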

In [6]:
println("--AQE")
println("spark.sql.adaptive.enabled (When true, re-optimizes the query plan in the middle of query execution, based on accurate runtime statistics) = "+spark.conf.get("spark.sql.adaptive.enabled"))
println("spark.sql.adaptive.forceOptimizeSkewedJoin = "+spark.conf.getOption("spark.sql.adaptive.forceOptimizeSkewedJoin"))

println("\n--CSV")
println("spark.sql.csv.filterPushdown.enabled (When true, enable filter pushdown to CSV datasource.) = "+spark.conf.get("spark.sql.csv.filterPushdown.enabled"))

println("\n--JSON")
println("spark.sql.json.filterPushdown.enabled = "+spark.conf.get("spark.sql.json.filterPushdown.enabled"))

println("\n--Parquet")
println("spark.sql.parquet.aggregatePushdown = "+spark.conf.getOption("spark.sql.parquet.aggregatePushdown"))
println("spark.sql.parquet.filterPushdown = "+spark.conf.get("spark.sql.parquet.filterPushdown"))
println("spark.sql.parquet.outputTimestampType = "+spark.conf.get("spark.sql.parquet.outputTimestampType"))
println("spark.sql.parquet.recordLevelFilter.enabled = "+spark.conf.get("spark.sql.parquet.recordLevelFilter.enabled"))

println("\n--SQL DML/DDL")
/*
When true, Spark replaces CHAR type with VARCHAR type in CREATE/REPLACE/ALTER TABLE commands, so that newly created/updated tables will not have CHAR 
type columns/fields. Existing tables with CHAR type columns/fields are not affected by this config.
*/
println("spark.sql.charAsVarchar = "+spark.conf.getOption("spark.sql.charAsVarchar"))
println("spark.sql.groupByAliases (When true, aliases in a select list can be used in group by clauses. When false, an analysis exception is thrown in the case.) = "+spark.conf.get("spark.sql.groupByAliases"))
println("spark.sql.groupByOrdinal = "+spark.conf.get("spark.sql.groupByOrdinal"))

println("\n--Maven")
println("spark.sql.maven.additionalRemoteRepositories = "+spark.conf.get("spark.sql.maven.additionalRemoteRepositories"))

//These are not set in Synapse Spark
println("\n--Task")
println("spark.task.cpus (The number of CPU cores to schedule (allocate) to a task) = "+spark.conf.getOption("spark.task.cpus"))
println("spark.task.maxFailures (Number of failures of a single task (of a TaskSet) before giving up on the entire TaskSet and then the job) = "+spark.conf.getOption("spark.task.maxFailures"))

println("\n--Memory")
println("spark.memory.fraction (Fraction of JVM heap space used for execution and storage) = "+spark.conf.getOption("spark.memory.fraction"))
println("spark.memory.offHeap.enabled (Controls whether Tungsten memory will be allocated on the JVM heap (false) or off-heap) = "+spark.conf.getOption("spark.memory.offHeap.enabled"))
println("spark.memory.offHeap.size (Maximum memory (in bytes) for off-heap memory allocation) = "+spark.conf.getOption("spark.memory.offHeap.size"))

println("\n--Shuffle")
println("spark.sql.shuffle.partitions = "+spark.conf.get("spark.sql.shuffle.partitions"))
println("spark.shuffle.compress (Controls whether to compress shuffle output when stored) = "+spark.conf.getOption("spark.shuffle.compress"))
println("spark.shuffle.manager = "+spark.conf.getOption("spark.shuffle.manager"))
println("spark.plugins (A comma-separated list of class names implementing org.apache.spark.api.plugin.SparkPlugin to load into a Spark application.) = "+spark.conf.getOption("spark.plugins"))

println("\n--Driver")
println("spark.driver.log.dfsDir = "+spark.conf.getOption("spark.driver.log.dfsDir"))

println("\n--Other")
println("spark.sql.defaultCatalog = "+spark.conf.get("spark.sql.defaultCatalog"))
println("spark.sql.ansi.enabled = "+spark.conf.get("spark.sql.ansi.enabled"))
println("spark.sql.columnNameOfCorruptRecord (The name of internal column for storing raw/un-parsed JSON and CSV records that fail to parse.) = "+spark.conf.get("spark.sql.columnNameOfCorruptRecord"))
//Default number of partitions in resilient distributed datasets (RDDs) returned by transformations like join, reduceByKey, and parallelize when no partition number is set by the user.
println("spark.default.parallelism (Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize) = "+spark.conf.getOption("spark.default.parallelism"))
println("spark.sql.leafNodeDefaultParallelism = "+spark.conf.get("spark.sql.leafNodeDefaultParallelism"))
println("spark.sql.sources.default = "+spark.conf.get("spark.sql.sources.default"))
println("spark.sql.files.maxRecordsPerFile (Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.) = "+spark.conf.get("spark.sql.files.maxRecordsPerFile"))
println("spark.sql.files.minPartitionNum = "+spark.conf.get("spark.sql.files.minPartitionNum"))
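
The cell above repeats the same lookup pattern for each key. One possible way to condense it is a small helper that takes the lookup as a function, so that unset keys print a placeholder. The helper `render` and the sample map are hypothetical and only illustrate the idea.

```scala
// Formats "key = value", falling back to a placeholder when the lookup finds nothing.
def render(key: String, lookup: String => Option[String]): String =
  s"$key = ${lookup(key).getOrElse("<not set>")}"

// Illustrative values only.
val sampleProps = Map("spark.sql.shuffle.partitions" -> "200")

val keysToShow = Seq(
  "spark.sql.shuffle.partitions",
  "spark.task.cpus",
  "spark.memory.fraction"
)

keysToShow.foreach(key => println(render(key, k => sampleProps.get(k))))
```

In the notebook, `keysToShow.foreach(key => println(render(key, k => spark.conf.getOption(k))))` would read from the running session instead of the sample map.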

### Master props

In [7]:
SynapseSpark.printSparkConfMasterProps
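
The implementation of the `SynapseSpark.printSparkConf...Props` helpers is not shown in this notebook. Assuming they group properties by key prefix, a comparable listing could be produced with a sketch like the one below; the function `propsWithPrefix` and the sample data are hypothetical.

```scala
// Keeps the entries whose key starts with the given prefix, sorted by key.
def propsWithPrefix(all: Iterable[(String, String)], prefix: String): Seq[(String, String)] =
  all.filter { case (key, _) => key.startsWith(prefix) }.toSeq.sortBy(_._1)

// Illustrative values only.
val sampleAll = Seq(
  "spark.master" -> "yarn",
  "spark.driver.memory" -> "4g",
  "spark.driver.cores" -> "2",
  "spark.executor.cores" -> "4"
)

propsWithPrefix(sampleAll, "spark.driver.").foreach { case (key, value) =>
  println(s"$key = $value")
}
```

In the notebook, `propsWithPrefix(sc.getConf.getAll, "spark.driver.")` would apply the same filter to the running application's properties.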

### Driver props

In [8]:
SynapseSpark.printSparkConfDriverProps

### Executor props

In [9]:
SynapseSpark.printSparkConfExecutorProps

### Application props

In [10]:
SynapseSpark.printSparkConfAppProps

### YARN App props

In [11]:
SynapseSpark.printSparkConfYARNAppProps

### YARN props

In [12]:
SynapseSpark.printSparkConfYARNProps

### Livy props

In [13]:
SynapseSpark.printSparkConfLivyProps

### Spark SQL props

In [14]:
SynapseSpark.printSparkConfSQLProps

### Azure props

In [15]:
SynapseSpark.printSparkConfAzureProps

### Microsoft props

In [16]:
SynapseSpark.printSparkConfMicrosoftProps

### Synapse props

In [17]:
SynapseSpark.printSparkConfSynapseProps

### Spark UI props

In [18]:
SynapseSpark.printSparkConfSparkUIProps

### Spark History Server props

In [19]:
SynapseSpark.printSparkConfHistoryServerProps

### Cluster props

In [20]:
SynapseSpark.printSparkConfClusterProps

### Shuffle props

In [21]:
SynapseSpark.printSparkConfShuffleProps

### Dynamic Allocation props

In [22]:
SynapseSpark.printSparkConfDynamicAllocProps

### Synapse History Server conf

In [23]:
SynapseSpark.printSparkConfSynapseHistoryServerProps

### External Hive Metastore props

In [24]:
SynapseSpark.printSparkConfHiveMetastoreProps

### Event log props

In [25]:
SynapseSpark.printSparkConfEventLogProps

### Data Locality Props

In [26]:
SynapseSpark.printSparkConfLocalityProps()

### Other props

In [27]:
SynapseSpark.printSparkConfOtherProps

### Delta props

In [28]:
SynapseSpark.printSparkConfDeltaProps()

### Databricks props

In [30]:
SynapseSpark.printSparkConfDatabricksProps()

### Unsupported properties

In [31]:
SynapseSpark.getSparkConfProp("spark.databricks.delta.schema.autoMerge.enabled")
SynapseSpark.getSparkConfProp("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite")
SynapseSpark.getSparkConfProp("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact")

## Additional Reading

- [Spark Configuration documentation](https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties)
- [External Metastore](https://docs.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore#set-up-an-external-metastore-using-the-ui)
- [YARN properties](https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties)