<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# Partitioning
1. Get partitions and cores
1. Repartition DataFrames
1. Configure default shuffle partitions

##### Methods
- DataFrame (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">Scala</a>): `repartition`, `coalesce`, `rdd.getNumPartitions`
- SparkConf (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=sparkconf#pyspark.SparkConf" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html" target="_blank">Scala</a>): `get`, `set`
- SparkSession (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html" target="_blank">Scala</a>): `sparkContext.defaultParallelism`

##### SparkConf parameters
- `spark.sql.shuffle.partitions`, `spark.sql.adaptive.enabled`

In [0]:
%run ./Includes/Classroom-Setup

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Get partitions and cores

Use the `getNumPartitions` method of the DataFrame's underlying `rdd` to get the number of partitions

In [0]:
df = spark.read.parquet(eventsPath)
df.rdd.getNumPartitions()

Access SparkContext through SparkSession to get the number of cores or slots

SparkContext is also provided in Databricks notebooks as the variable `sc`

In [0]:
print(spark.sparkContext.defaultParallelism)
#print(sc.defaultParallelism)
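
Partition count and slot count together determine how many rounds of tasks a stage needs. The following is a minimal sketch, assuming one task per partition, one task per slot at a time, and roughly equal task durations. `task_waves` is a hypothetical helper for illustration, not part of the Spark API.

```python
import math


def task_waves(num_partitions: int, num_slots: int) -> int:
    """Estimate how many rounds of tasks are needed to process all partitions."""
    if num_partitions < 0 or num_slots <= 0:
        raise ValueError("num_partitions must be >= 0 and num_slots must be > 0")
    # Each slot handles one partition per round, so round up
    return math.ceil(num_partitions / num_slots)


print(task_waves(8, 8))  # 1: every slot is busy for a single round
print(task_waves(9, 8))  # 2: the second round uses only one slot
```

In this notebook the two arguments would come from `df.rdd.getNumPartitions()` and `spark.sparkContext.defaultParallelism`.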

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Repartition DataFrame

#### `repartition`
Returns a new DataFrame that has exactly `n` partitions.

In [0]:
repartitionedDF = df.repartition(8)

In [0]:
repartitionedDF.rdd.getNumPartitions()

#### `coalesce`
Returns a new DataFrame that has exactly `n` partitions when fewer partitions than the current number are requested

If a larger number of partitions is requested, the DataFrame keeps its current number of partitions

In [0]:
coalesceDF = df.coalesce(8)

In [0]:
coalesceDF.rdd.getNumPartitions()
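
The difference in resulting partition counts between `repartition` and `coalesce` can be illustrated with a toy model in plain Python. This is only a sketch of the behavior described above, not Spark's implementation: partitions are modeled as lists of rows, and the functions `repartition` and `coalesce` below are hypothetical stand-ins that assume round-robin redistribution and merging of adjacent partitions.

```python
def repartition(partitions, n):
    """Toy model: redistribute every row round-robin into exactly n partitions."""
    if n <= 0:
        raise ValueError("n must be positive")
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def coalesce(partitions, n):
    """Toy model: merge whole partitions into n groups, never increasing the count."""
    if n <= 0:
        raise ValueError("n must be positive")
    if n >= len(partitions):
        # Requesting more partitions than exist leaves the count unchanged
        return [list(part) for part in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Adjacent source partitions land in the same target partition
        merged[i * n // len(partitions)].extend(part)
    return merged


parts = [[1, 2], [3], [4, 5, 6], [7]]
print(len(repartition(parts, 8)))  # 8
print(len(coalesce(parts, 8)))     # 4
print(coalesce(parts, 2))          # [[1, 2, 3], [4, 5, 6, 7]]
```

In the toy model, `coalesce` keeps each source partition intact while `repartition` moves individual rows, which is one way to picture why the two calls can behave differently.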

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Configure default shuffle partitions

Use `spark.conf` to get the Spark configuration parameter for default shuffle partitions

In [0]:
spark.conf.get("spark.sql.shuffle.partitions")

Configure default shuffle partitions to match the number of cores

In [0]:
spark.conf.set("spark.sql.shuffle.partitions", "8")
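
Rather than hard-coding `8`, the value can be derived from the cluster. The following is a minimal sketch, assuming a session object shaped like the notebook's `spark` variable. `match_shuffle_partitions_to_cores` is a hypothetical helper name, not a Spark API.

```python
def match_shuffle_partitions_to_cores(spark) -> int:
    """Set spark.sql.shuffle.partitions to the session's default parallelism."""
    cores = spark.sparkContext.defaultParallelism
    # Pass the value as a string, matching the hard-coded "8" above
    spark.conf.set("spark.sql.shuffle.partitions", str(cores))
    return cores
```

After calling `match_shuffle_partitions_to_cores(spark)`, `spark.conf.get("spark.sql.shuffle.partitions")` should return the slot count as a string.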

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Adaptive Query Execution

In Spark 3, <a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution" target="_blank">Adaptive Query Execution (AQE)</a> can <a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" target="_blank">dynamically coalesce shuffle partitions</a> at runtime

Set `spark.sql.adaptive.enabled` to turn AQE on or off (disabled by default in Spark 3.0 and 3.1, enabled by default since Spark 3.2.0)

In [0]:
spark.conf.get("spark.sql.adaptive.enabled")
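
The idea of coalescing shuffle partitions at runtime can be pictured with a simplified model in plain Python. This is a sketch only, not Spark's algorithm: it assumes that adjacent partitions are merged greedily until adding the next one would exceed a target size. The function `coalesce_by_size` and its `target` parameter are hypothetical; `target` plays a role loosely analogous to the `spark.sql.adaptive.advisoryPartitionSizeInBytes` setting.

```python
def coalesce_by_size(sizes, target):
    """Group adjacent partition indices so each group stays near the target size."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Six shuffle partitions with uneven sizes (arbitrary units) and a target of 64
print(coalesce_by_size([10, 20, 30, 70, 5, 5], 64))  # [[0, 1, 2], [3], [4, 5]]
```

In this model, six small and uneven partitions become three, and a partition that is already larger than the target stays on its own.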

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup
