# Workshop for PyDay BCN 2019
These are the slides and notebook for the workshop *The magic of PySpark* at PyDay Barcelona 2019.
This presentation is formatted in Markdown and prepared to be used with Deckset. The drawings were done on an iPad Pro using Procreate. Here only the final PDF and the source Markdown are available. Sadly the animated gifs are just static images in the PDF.
You can run the notebook in Binder
Note: The Arrow optimisation does not work in Binder. I'll try to fix it, but it won't be ready for the workshop. Check the output notebook to see the impact of this optimisation.
- Thanks to FiveThirtyEight for making their datasets available.
- The Binder start scripts are based on Hyukjin Kwon's scripts for his Spark Summit Europe 2019 talk
- Some of the images in the slides and explanations come from previous talks: *Internals of Speeding Up PySpark with Arrow* (Spark Summit Europe 2019) and *Welcome to Apache Spark* (SoCraTes UK 2018, with Carlos Peña)
- Thanks to PyBCN for organising the event and to Affectv for supporting it
## Details on running the notebook
To take full advantage of the workshop without using Binder (i.e. running locally) you'll need:
- PySpark installed (anything more recent than 2.3 should be fine)
- Jupyter installed
- Pandas and Arrow installed
- All able to talk to each other
- One or more datasets
The TL;DR if you don't want to use Docker should just be:

```
pip install pyarrow pandas pyspark numpy jupyter
```
You can install PySpark with `pip install pyspark`; installing it in the same environment where you have Jupyter installed should make them talk to each other just fine. You should also run `pip install pyarrow`, although if this one fails for some reason it's not a big problem. To make analysis more entertaining, also run `pip install pandas`, again, all in the same environment. You can also install these with conda, using `conda install -c conda-forge pyspark`, although it might be more convenient to use pip, since pyspark can get easily confused when many Python environments are around.
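To check that everything ended up in the same environment, here is a quick sanity check (plain Python, no assumptions beyond the packages listed above):

```python
import importlib.util

def installed(pkg):
    """True if the package can be imported in this environment."""
    return importlib.util.find_spec(pkg) is not None

# Check every package the workshop needs
for pkg in ("pyspark", "pyarrow", "pandas", "numpy", "jupyter"):
    print(f"{pkg}: {'ok' if installed(pkg) else 'MISSING'}")
```

Run this from a notebook cell: if anything prints `MISSING`, the package was installed into a different environment than the one Jupyter is running from.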
If you are familiar enough with Docker, I recommend using a Docker container instead.
Run this before the workshop:

```
docker pull rberenguel/pyspark_workshop
```
During the workshop (or before) you can use this Docker container with

```
docker run --name pyspark_workshop -d -p 8888:8888 -p 4040:4040 -p 4041:4041 -v "$PWD":/home/jovyan rberenguel/pyspark_workshop
```
in the folder where you want to create your Jupyter notebook. To open it in your browser, run

```
docker logs pyspark_workshop
```

and open the URL provided in the logs (it should look like `http://127.0.0.1:8888/?token=...`).
This container installs Arrow on top of the usual `jupyter/pyspark` image, to allow for some additional optimisations in Spark.