This repository contains two secondary sort implementations in PySpark:
- one using Spark's `groupByKey`
- one using Spark's `repartitionAndSortWithinPartitions`
Both applications use the Taxi Trips dataset reported to the City of Chicago (https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew/data) to compute the total downtime of each taxi cab.
Each implementation comes with two PySpark apps:
- one runs the secondary sort and just writes the sessionized data to disk
- one runs the secondary sort, computes the total downtime of each taxi cab, and writes the result to disk
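
The downtime logic itself can be illustrated without Spark. Once a cab's trips arrive sorted by start time (the output of the secondary sort), its downtime is the sum of the gaps between one trip's end and the next trip's start. This is a sketch of that aggregation; the helper name and sample timestamps are illustrative, not taken from the repository:

```python
from datetime import datetime

def total_downtime(trips):
    """Sum the idle gaps between consecutive trips of one cab, in seconds.

    Assumes `trips` is a list of (start, end) datetimes already sorted by
    start time, as produced by the secondary sort.
    """
    downtime = 0.0
    for (_, prev_end), (next_start, _) in zip(trips, trips[1:]):
        gap = (next_start - prev_end).total_seconds()
        if gap > 0:  # overlapping or back-to-back trips contribute no downtime
            downtime += gap
    return downtime

cab_trips = [
    (datetime(2016, 1, 1, 8, 0), datetime(2016, 1, 1, 8, 30)),
    (datetime(2016, 1, 1, 9, 0), datetime(2016, 1, 1, 9, 45)),
    (datetime(2016, 1, 1, 9, 40), datetime(2016, 1, 1, 10, 0)),  # overlap: no gap
]

print(total_downtime(cab_trips))  # 1800.0 (the 30-minute gap, in seconds)
```

In the Spark apps this per-cab reduction would run inside a partition-level pass over the sorted data, so each cab's trips are consumed in a single ordered stream.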
The code is explained in detail at https://www.qwertee.io/blog/spark-secondary-sort/.
The Docker template used for the Spark environment can be found at https://github.com/sebrestin/spark-docker-deployment.