Tutorial for Deploying Anaconda Cluster and PySpark on top of Red Hat Storage GlusterFS
Clone or download
Latest commit f8b6e5a Feb 14, 2015
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
README.md Fixed Download Link Feb 13, 2015
solution.png Updating Tutorial Feb 13, 2015
spark-wordcount.py Updating Tutorial Feb 13, 2015

README.md

pyspark-tutorial

This tutorial describes how to deploy Anaconda Cluster and PySpark on top of Red Hat Storage GlusterFS.

alt tag

Pre-Requisites

You'll first need a GlusterFS cluster. You can get the commercial version from Red Hat or you can use a vagrant script from Jay Vyas to spin up a simple 2 Node GlusterFS cluster and volume.

Designate a random node in your GlusterFS Cluster to be your Anaconda Cluster Master. From the master, setup passwordless SSH access into the other GlusterFS nodes in the cluster and into itself.

Install Anaconda on your client (laptop)

Install Anaconda 2.7 on a laptop (or client machine) that can reach the cluster. You can download this here and its free.

Install Anaconda Cluster on your client (laptop)

A license token can be obtained for Anaconda Cluster can contacting Continuum Analytics at this link. Once you have the token, Anaconda Cluster can be installed by running the following commands on your client machine.

# TOKEN={ You will need to get this from Continuum Analytics }
# conda install -c https://conda.binstar.org/t/$TOKEN/conda-cluster conda-cluster

Configuring your Cluster in Conda Cluster

  • From the server you designated to be the Anaconda Cluster Master, copy its private SSH Key to your client to ~/.conda/clusters.d/rhs-spark.key and set the permissions on it to chmod 600.

  • From the client, create the Cluster YAML file for your intended Spark Cluster that conda cluster will be managing:

The head IP listed below is the Anaconda Cluster Master we designated. The private_key value is its private key that we copied from the Master onto the client. The compute IP is the other node in the GlusterFS cluster. Add it but ignore the simple_aws provider. Lastly, the name of the file and the name of the cluster (rhs-spark) must match.

# vi ~/.conda/clusters.d/rhs-spark.yaml 
rhs-spark:
    private_key         : ~/.conda/clusters.d/rhs-spark.key
    user                : root
    machines            :
         head      : ['192.168.58.219']
         compute   : ['192.168.58.218']
    provider            : simple_aws
  • From the client, verify conda cluster can find the newly defined cluster:

# conda cluster list; conda cluster manage rhs-spark status

  • From the client, bootstrap the cluster (This install conda cluster on GlusterFS servers)

# conda cluster manage rhs-spark bootstrap --conda --loglevel DEBUG

  • From the client, deploy the PySpark runtime on the GlusterFS Servers.

# conda cluster manage rhs-spark bootstrap --spark --loglevel DEBUG

You have now completed setting up the Cluster. It is ready to ready PySpark Jobs.

Running PySpark Jobs on GlusterFS

  • Copy some data (“Grimm’s Fairy Tales”) into the GlusterFS Volume so that we can analyze with a PySpark Job:
# mkdir -p /mnt/glusterfs/grimm
# cd /mnt/glusterfs/grimm
# wget https://www.gutenberg.org/cache/epub/2591/pg2591.txt --no-check-certificate
  • Write a PySpark Job and copy it up to the Anaconda Master in the cluster. I have included a spark-wordcount.py in this repo that you can use. Note that it expects files to exist within the /mnt/glusterfs/grimm directory in the GlusterFS Volume. You will need to edit the file to change the setMaster value to your Anaconda Master hostname.

  • From the Anaconda Master, run the PySpark Job

# python spark-wordcount.py