Spark Workshop

In this workshop you will learn how to:

  • use a notebook environment
  • write simple Apache Spark queries to filter and transform a dataset
  • do very simple outlier detection

The example dataset we will use is the Amazon Electronics reviews dataset:

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016

Installation instructions (~15 minutes)

To avoid installation problems, I created a Docker container that has everything we need for this workshop pre-installed and isolated from your system.

Please follow the instructions below to start up the container:

  1. If you don't have a Docker ID account yet, go to Docker Hub and create an account.

  2. Install Docker. On Mac OS X, you can download Docker for Mac for an easy-to-install desktop app.

  3. Open Docker and enter your Docker Hub credentials.

  4. Click on the Docker app icon and select "Preferences...". Under "Advanced", increase the memory available to containers to 8.0 GB.


  5. Clone this repository using git clone:
$ git clone
  6. Open a terminal, navigate to the spark_workshop directory, and run:
$ sh

This downloads a JSON dataset and starts the Docker container.

  7. You should be able to navigate to with your browser and see a Jupyter notebook instance. The password is spark.

  8. You can exit the Docker session using Ctrl+C.


If step 6 fails with the error "unauthorized: incorrect username or password", run

$ docker login

and enter your Docker Hub credentials (username and password; the username is not your email address).
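
If the container still does not start, it can help to confirm that Docker itself is set up as described in steps 2 to 4. The following is a minimal preflight sketch, not part of the workshop repository: the function names bytes_to_gib and preflight and the MIN_GIB threshold of 7 are assumptions made for illustration (a memory setting of 8.0 GB may be reported as slightly less than 8 GiB). It only uses the standard docker info command.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helper, not shipped with the workshop).

# Convert a byte count to whole GiB (integer division, rounds down).
bytes_to_gib() {
    echo $(( $1 / 1073741824 ))
}

preflight() {
    # 1. Is the docker CLI on the PATH? (step 2)
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found: install Docker first"
        return 1
    fi
    # 2. Is the Docker daemon running and reachable? (step 3)
    if ! docker info >/dev/null 2>&1; then
        echo "Docker daemon not reachable: open the Docker app"
        return 1
    fi
    # 3. How much memory can containers use? (step 4)
    #    docker info reports MemTotal in bytes.
    mem_bytes=$(docker info --format '{{.MemTotal}}')
    mem_gib=$(bytes_to_gib "$mem_bytes")
    echo "Memory available to Docker: about ${mem_gib} GiB"
    # MIN_GIB is an assumed threshold, overridable from the environment.
    if [ "$mem_gib" -lt "${MIN_GIB:-7}" ]; then
        echo "Consider raising the memory limit in the Docker preferences"
        return 1
    fi
    echo "Docker looks ready"
}

# Run the checks; report failure without aborting the calling shell.
preflight || echo "Preflight failed; see the message above."
```

Save it under any file name and run it with sh before step 6. Each failure message points back to the installation step that most likely needs another look.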

