- Session 1: Introduction to Spark and ShARC (HPC)
- Session 2: RDD, DataFrame, ML pipeline, & parallelization
- Session 3: Scalable matrix factorisation for collaborative filtering recommender systems
- Session 4: Scalable K-means clustering
- Session 5: Scalable PCA for dimensionality reduction (and data types in Spark)
The second half will be taught by Mauricio A Álvarez.
The materials draw on the following sources:
- The PySpark tutorial by Wenqiang Feng: Learning Apache Spark with Python, Release v1.0 (PDF) and its GitHub project page
- The official Apache Spark documentation
- The Introduction to Apache Spark course by Prof. Anthony D. Joseph, University of California, Berkeley
Many thanks to
- Mike Croucher, Neil Lawrence, Will Furnass, and Twin Karmakharm for their input and inspiration.
- Mauricio A Álvarez for jointly working on this course since we joined Sheffield together.
- Our teaching assistants (demonstrators) and students who have contributed in various ways.