Internals of Speeding Up PySpark with Arrow
Presentation I (@berenguel) gave at the PyBCN meetup in June 2018, Spark London in September 2018, Spark Barcelona, and Spark Summit Europe 2019, explaining how Spark 2.3/2.4 has optimised UDFs for use with Pandas, as well as how PySpark works. A recording of this talk (the one given at Python Barcelona, in English) is available here, and the recording from Spark Summit is available here. You can find the slides here (some images might look slightly blurry). I recommend you check the version with presenter notes, which is only available here.
This presentation is formatted in Markdown and prepared to be used with Deckset. The drawings were done on an iPad Pro using Procreate. Only the final PDF and the source Markdown are available here. Sadly, the animated GIFs are just static images in the PDF.
You can find a reveal.js export of the version given at Spark Summit here. It is not 100% faithful to the PDF/Deckset version, but it is close enough (and the animated GIFs play). The export was generated with this and tweaked to add a footer.