For this notebook, consider increasing the number of executors for the session. With the `livedemo` pool, you can allocate up to nine executors.

###### Load the NYC Safety Dataset from Azure Open Datasets, which contains roughly 10 million rows.

In [1]:
# This is a package in preview.
from azureml.opendatasets import NycSafety

from datetime import datetime
from dateutil import parser

start_date = parser.parse("2015-01-01")
end_date = parser.parse("2019-01-01")

safety = NycSafety(start_date=start_date, end_date=end_date)
safety = safety.to_spark_dataframe()

StatementMeta(livedemo, 11, 1, Finished, Available)



In [2]:
print(safety.count())

StatementMeta(livedemo, 11, 2, Finished, Available)

9931558

Determine the number of partitions and view how the DataFrame's contents are distributed among the partitions.

In [3]:
print(safety.rdd.getNumPartitions())

StatementMeta(livedemo, 11, 3, Finished, Available)

33

In [4]:
print('Data Distribution: ' + str(safety.rdd.glom().map(len).collect()))

StatementMeta(livedemo, 11, 4, Finished, Available)

Data Distribution: [297665, 300079, 297454, 300552, 0, 1240727, 0, 0, 1241253, 0, 0, 1242133, 0, 0, 1241342, 0, 0, 1242400, 0, 0, 1242229, 0, 0, 1241482, 0, 0, 0, 0, 0, 0, 0, 0, 44242]
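The skew in this output can be quantified with plain Python. The following is a minimal sketch that reuses the list printed above; the names `partition_sizes` and `summarize_partitions` are illustrative and are not part of the original notebook.

```python
# Partition sizes copied from the output of the previous cell.
partition_sizes = [
    297665, 300079, 297454, 300552, 0, 1240727, 0, 0, 1241253, 0, 0,
    1242133, 0, 0, 1241342, 0, 0, 1242400, 0, 0, 1242229, 0, 0, 1241482,
    0, 0, 0, 0, 0, 0, 0, 0, 44242,
]


def summarize_partitions(sizes):
    """Return basic skew statistics for a list of per-partition row counts."""
    total = sum(sizes)
    mean = total / len(sizes)
    return {
        "partitions": len(sizes),
        "rows": total,
        "empty": sum(1 for s in sizes if s == 0),
        "largest": max(sizes),
        # A ratio near 1.0 means an even spread; larger values mean skew.
        "largest_to_mean": max(sizes) / mean,
    }


summary = summarize_partitions(partition_sizes)
print(summary)
```

For this output, 21 of the 33 partitions are empty, the rows sum to the 9,931,558 reported by `count()`, and the largest partition holds about four times the mean number of rows.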

The following code cells will repartition the `safety` DataFrame and save the partitions as Parquet files. Please edit the `PARTITIONS_EXPERIMENT_ROOT` before running the cells below. Specify your ADLS Gen2 account and the root container name.

In [5]:
PARTITIONS_EXPERIMENT_ROOT = 'abfss://users@saisynapsebloglake.dfs.core.windows.net/SafetyDataPartitions/'

StatementMeta(livedemo, 11, 5, Finished, Available)



In [6]:
safety_1 = safety.repartition(1)
safety_1.write.mode('overwrite').parquet(PARTITIONS_EXPERIMENT_ROOT + '1')

StatementMeta(livedemo, 11, 6, Finished, Available)



In [7]:
safety_10 = safety.repartition(10)
safety_10.write.mode('overwrite').parquet(PARTITIONS_EXPERIMENT_ROOT + '10')

StatementMeta(livedemo, 11, 7, Finished, Available)



In [8]:
safety_100 = safety.repartition(100)
safety_100.write.mode('overwrite').parquet(PARTITIONS_EXPERIMENT_ROOT + '100')

StatementMeta(livedemo, 11, 8, Finished, Available)



In [1]:
safety_1k = safety.repartition(1000)
safety_1k.write.mode('overwrite').parquet(PARTITIONS_EXPERIMENT_ROOT + '1k')

StatementMeta(livedemo, 11, 9, Finished, Available)



In [2]:
safety_10k = safety.repartition(10000)
safety_10k.write.mode('overwrite').parquet(PARTITIONS_EXPERIMENT_ROOT + '10k')

StatementMeta(livedemo, 11, 10, Finished, Available)



In [3]:
safety_100k = safety.repartition(100000)
safety_100k.write.mode('overwrite').parquet(PARTITIONS_EXPERIMENT_ROOT + '100k')

StatementMeta(livedemo, 11, 11, Submitted, Running)



Here is a summary of what you should see from running the previous cells. These results assume that your notebook is allocated nine executors with eight cores each.
- With 1,000 or fewer partitions, each cell should finish within one minute (often within 30 seconds)
- The 10,000-partition cell may take slightly longer than one minute
- The 100,000-partition cell takes about 6 minutes, a considerable slowdown

For the sake of brevity, we did not test this on Small or Large executors. However, the performance penalty of writing a very large number of partitions should apply across all executor sizes.
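Some back-of-the-envelope arithmetic helps explain the write timings. The sketch below assumes that `repartition(n)` spreads rows roughly evenly and that each core runs one task at a time; `partition_profile` and the constants are illustrative names, not part of the notebook.

```python
import math

TOTAL_ROWS = 9931558          # row count reported by safety.count()
TOTAL_CORES = 9 * 8           # nine executors with eight cores each


def partition_profile(num_partitions, total_rows=TOTAL_ROWS, cores=TOTAL_CORES):
    """Estimate average rows per partition and the number of task waves."""
    rows_per_partition = total_rows / num_partitions
    # Assumption: one task per core, so tasks run in ceil(n / cores) waves.
    waves = math.ceil(num_partitions / cores)
    return rows_per_partition, waves


for n in (1, 10, 100, 1000, 10000, 100000):
    rows, waves = partition_profile(n)
    print(f"{n:>6} partitions: ~{rows:,.0f} rows each, {waves} task wave(s)")
```

Under these assumptions, each of the 100,000 partitions holds only about 99 rows, so per-task and per-file overhead is likely to dominate the useful work.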

Now, test read performance by executing the cells below.

In [4]:
safety_read_1 = spark.read.load(PARTITIONS_EXPERIMENT_ROOT + '1', format='parquet')
print(safety_read_1.count())

StatementMeta(livedemo, 11, 12, Finished, Available)

9931558

In [5]:
safety_read_10 = spark.read.load(PARTITIONS_EXPERIMENT_ROOT + '10', format='parquet')
print(safety_read_10.count())

StatementMeta(livedemo, 11, 13, Finished, Available)

9931558

In [6]:
safety_read_100 = spark.read.load(PARTITIONS_EXPERIMENT_ROOT + '100', format='parquet')
print(safety_read_100.count())

StatementMeta(livedemo, 11, 14, Finished, Available)

9931558

In [7]:
safety_read_1k = spark.read.load(PARTITIONS_EXPERIMENT_ROOT + '1k', format='parquet')
print(safety_read_1k.count())

StatementMeta(livedemo, 11, 15, Finished, Available)

9931558

In [1]:
safety_read_10k = spark.read.load(PARTITIONS_EXPERIMENT_ROOT + '10k', format='parquet')
print(safety_read_10k.count())

StatementMeta(livedemo, 11, 16, Finished, Available)

9931558

In [2]:
safety_read_100k = spark.read.load(PARTITIONS_EXPERIMENT_ROOT + '100k', format='parquet')
print(safety_read_100k.count())

StatementMeta(livedemo, 11, 17, Finished, Available)

9931558

Again, with nine executors and eight cores per executor, expect the following read results:
- For 1 to 100 partitions, the time is negligible, at roughly 1 second
- For 1,000 partitions, the read time increases to 3 seconds
- For 10,000 partitions, the read time increases to 14 seconds
- For 100,000 partitions, the read time increases to about 2 minutes
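To collect timings like these for your own data, a small helper can wrap each read. This is a minimal sketch: `timed` is a hypothetical helper, and the workload is a stand-in so that the example runs without a Spark session.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs anywhere. In the notebook you might
# pass something like: lambda: spark.read.parquet(path).count()
# where `path` is one of the folders written earlier.
result, seconds = timed(lambda: sum(range(1_000_000)))
print(f"result={result}, elapsed={seconds:.3f} s")
```

Wall-clock timings vary between runs, so repeating each measurement a few times gives a fairer comparison.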

Lastly, let's consider the effects of partitioning by column. In this case, within the hierarchical structure of Azure Data Lake Storage Gen2, each unique combination of `dataSubtype`, `category`, `subcategory`, and `status` is stored in its own directory of Parquet files.
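Spark's `partitionBy()` writes a nested directory layout with one `column=value` level per partition column. The sketch below builds such a path in plain Python to show the shape of the result; `partition_path`, the root URL, and the column values are all hypothetical, and Spark's escaping of special characters is not reproduced.

```python
def partition_path(root, partition_values):
    """Build a Hive-style partition directory path from column/value pairs.

    Simplification: Spark escapes some special characters in values;
    that escaping is not reproduced here.
    """
    segments = [f"{column}={value}" for column, value in partition_values.items()]
    return root.rstrip("/") + "/" + "/".join(segments)


# Hypothetical root and values, chosen only to illustrate the layout.
example = partition_path(
    "abfss://container@account.dfs.core.windows.net/root/4Parts/",
    {
        "dataSubtype": "typeA",
        "category": "catB",
        "subcategory": "subC",
        "status": "Closed",
    },
)
print(example)
```

Every distinct combination of the four columns produces its own leaf directory, which is why high-cardinality partition columns can create a very large number of small files.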

With nine executors and eight cores per executor, the partitioned write below takes roughly 8 minutes. It is the least performant operation in this notebook, though removing periods from the values of the partition columns may also contribute to the poor performance (recall Spark's lazy execution model: the `translate` transformations run only when the write is triggered).

In [1]:
display(safety.limit(10))

StatementMeta(livedemo, 11, 18, Finished, Available)

SynapseWidget(Synapse.DataFrame, 2cfe28e5-bcca-425d-99fc-1462a40c6fe1)

To avoid any problems when writing to ADLS Gen2, remove periods from the values of the partition columns. To save time, we delete them (replace them with nothing) and perform this operation only on the columns in the `partitionBy()` clause.

In [3]:
from pyspark.sql.functions import translate

safety = safety.withColumn('dataSubtype', translate('dataSubtype', '.', '')) \
                .withColumn('category', translate('category', '.', '')) \
                .withColumn('subcategory', translate('subcategory', '.', '')) \
                .withColumn('status', translate('status', '.', ''))

StatementMeta(livedemo, 11, 20, Finished, Available)
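For intuition, `translate(col, '.', '')` deletes every period because the matching character has no replacement at the same position. A plain-Python analogue is sketched below; `remove_periods` and the sample value are hypothetical and serve only as an illustration.

```python
def remove_periods(value):
    """Delete every '.' from a string; None passes through unchanged."""
    if value is None:
        return None
    # str.maketrans('', '', '.') builds a table that deletes '.' characters.
    return value.translate(str.maketrans("", "", "."))


# Hypothetical sample value, not taken from the dataset.
print(remove_periods("Dept. of Sanitation"))
```

Because the periods are deleted rather than substituted, two values that differ only in their periods end up in the same partition directory.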



In [4]:
safety.write.mode('overwrite').partitionBy('dataSubtype', 'category', 'subcategory', 'status').parquet(PARTITIONS_EXPERIMENT_ROOT + '4Parts')

StatementMeta(livedemo, 11, 21, Finished, Available)



Read performance is not nearly as poor. You may see different performance penalties with other datasets. The goal of this notebook is simply to introduce you to the performance considerations Apache Spark developers must keep in mind and to provide some guidance for benchmarking your own data engineering workflows.

In [8]:
safety_4parts = spark.read.load(PARTITIONS_EXPERIMENT_ROOT + '4Parts')
print(safety_4parts.rdd.getNumPartitions())

StatementMeta(livedemo, 11, 25, Finished, Available)

852