COM6012 Scalable Machine Learning - University of Sheffield

Spring 2019. The first half (five sessions) is taught by Haiping Lu.

Session 1: Introduction to Spark and ShARC (HPC)
Session 2: RDD, DataFrame, ML pipeline, & parallelization
Session 3: Scalable matrix factorisation for collaborative filtering recommender systems
Session 4: Scalable K-means clustering
Session 5: Scalable PCA for dimensionality reduction (and data types in Spark)

The second half will be taught by Mauricio A Álvarez.

The materials are built with references to the following sources:

Many thanks to:

Mike Croucher, Neil Lawrence, Will Furnass, and Twin Karmakharm for their inputs and inspirations.
Mauricio A Álvarez for jointly working on this course since we joined Sheffield together.
Our teaching assistants (demonstrators) and students who have contributed in various ways.
