# Preparation

Install Python. We recommend using Anaconda:

    http://continuum.io/downloads


Install Spark:

    $ wget http://apache.mirrors.hoobly.com/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
    
    $ tar -xzvf spark-1.4.1-bin-hadoop2.6.tgz
    
**(Remove all `*.pyc` files in `SPARK_HOME/python` after unpacking.)**
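
The cleanup can also be scripted. The following is a minimal sketch, assuming the directory passed in is the `python` folder of the unpacked Spark distribution; the helper name `remove_pyc` is illustrative and not part of Spark or SAGA-Hadoop.

```python
from pathlib import Path


def remove_pyc(python_dir):
    """Delete every *.pyc file below python_dir and return the removed paths."""
    removed = []
    # Materialize the listing first so files are not deleted while walking.
    for pyc in sorted(Path(python_dir).rglob("*.pyc")):
        pyc.unlink()
        removed.append(str(pyc))
    return removed
```

A possible call, once `SPARK_HOME` is set, would be `remove_pyc(os.path.join(os.environ["SPARK_HOME"], "python"))`.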
    
Note the location of your Spark binaries; it is needed later as `SPARK_HOME`.

Install SAGA-Hadoop:

    $ pip install saga-hadoop

    
Set up SSH port forwarding:

    ssh -fCN -L 8888:localhost:8888 sp
    ssh -fND 4223 sp 
    
SSH config (`$HOME/.ssh/config`):

    Host sp
        Hostname stampede.tacc.utexas.edu
        User <user>
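
To confirm that a forward is up, one can test whether the local port accepts TCP connections. This is a small sketch using only the standard library; the function name `port_is_open` is hypothetical. A successful connection only shows that something is listening locally (for example the `ssh` client), not that the remote service behind the tunnel is running.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

With the forwards above, `port_is_open("localhost", 8888)` and `port_is_open("localhost", 4223)` would be the two checks to run.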

## Start Spark Cluster on Stampede


Command line:

    saga-hadoop --resource=slurm://localhost --queue=normal --framework spark --walltime=59 --number_cores=16 --project=TG-MCB090174

Code:

In [1]:
import os, sys

%run ../util/init_spark.py
%run ../env.py

from pilot_hadoop import PilotComputeService
from IPython.display import HTML

os.environ["SAGA_VERBOSE"]="100"

Clean up before starting the Spark cluster:

In [4]:
!echo "Check environment"
!rm -f work/spark_started
!scancel -u tg804093
!java -version
!echo $JAVA_HOME

Check environment
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
/scratch/projects/xsede/jdk64/jdk1.8.0_25


In [2]:
pilot_compute_description = {
                            "resource_url":"slurm+ssh://stampede",
                            "number_cores": 32,
                            "cores_per_node":16,
                            "working_directory": os.getcwd(),
                            "project": "TG-CCR140028",
                            "queue": "normal", 
                            "walltime": 59,
                            "type":"spark"
                            }
pilot = PilotComputeService.create_pilot(pilot_compute_description)

# print out details of Pilot-Spark
details = pilot.get_details()
HTML("<a target='_blank' href='%s'>Spark Web UI</a>" % details["web_ui_url"])

Cleanup old Spark Installation
Starting Spark bootstrap job...

**** Job: [slurm+ssh://stampede]-[5575168] State : Pending
SPARK installation directory: /home1/01131/tg804093/src/supercomputing2015-tutorial/02_hadoop_on_hpc/work/spark-1.4.1-bin-hadoop2.6
(please allow some time until the SPARK cluster is completely initialized)
export PATH=/home1/01131/tg804093/src/supercomputing2015-tutorial/02_hadoop_on_hpc/work/spark-1.4.1-bin-hadoop2.6/bin:$PATH
Spark Web URL: http://c551-404:8080
Open master file: /home1/01131/tg804093/src/supercomputing2015-tutorial/02_hadoop_on_hpc/work/spark-1.4.1-bin-hadoop2.6/conf/masters
Create Spark Context for URL: spark://c551-404:7077
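
The pilot description requests 32 cores at 16 cores per node. The arithmetic behind the implied node count can be sketched as follows, assuming the scheduler hands out whole nodes; the helper `nodes_needed` is hypothetical and not part of `pilot_hadoop`.

```python
import math


def nodes_needed(number_cores, cores_per_node):
    """Number of whole nodes required to provide number_cores cores."""
    if number_cores <= 0 or cores_per_node <= 0:
        raise ValueError("core counts must be positive")
    return math.ceil(number_cores / cores_per_node)


print(nodes_needed(32, 16))  # 2
```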


In [6]:
sc = pilot.get_spark_context()
sc.version

u'1.4.1'

In [None]:
rdd = sc.parallelize([1,2,3])

In [None]:
rdd.count()

## Random Graph Generation

Summary statistics of the node degrees (the mean and standard deviation are reused below):

    count    24056.000000
    mean         5.971566
    std          1.305737
    min          2.000000
    25%          5.000000
    50%          6.000000
    75%          7.000000
    max         12.000000

In [None]:
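import numpy as np
import pandas as pd

# NOTE: `series` is not defined in this notebook; it is assumed to be the
# pandas Series of node degrees whose summary statistics are listed above.
number_of_nodes = [10]

for number in number_of_nodes:
    # Draw `number` degrees from a normal distribution fitted to `series`.
    degree_vector = np.random.normal(series.mean(), series.std(), number)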


In [9]:
from pyspark.mllib.random import RandomRDDs
r = RandomRDDs.normalRDD(sc, 10).map(lambda v: int(round(5.97 + 1.3057 * v)))

In [10]:
r.collect()

[8, 4, 7, 5, 8, 6, 8, 7, 6, 5]
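
The `map` above shifts and scales each standard-normal draw `v` to the mean (about 5.97) and standard deviation (about 1.3057) from the summary statistics at the top of this section, then rounds to an integer degree. Below is a Spark-free sketch of the same transformation, with Python's `random.gauss` standing in for `RandomRDDs.normalRDD`; the names `to_degree` and `sample_degrees` are illustrative only.

```python
import random

MEAN_DEGREE = 5.97    # mean from the summary statistics
STD_DEGREE = 1.3057   # standard deviation from the summary statistics


def to_degree(v):
    """Map a standard-normal draw v to an integer node degree."""
    return int(round(MEAN_DEGREE + STD_DEGREE * v))


def sample_degrees(n, seed=None):
    """Draw n integer degrees; passing a seed makes the draw reproducible."""
    rng = random.Random(seed)
    return [to_degree(rng.gauss(0.0, 1.0)) for _ in range(n)]
```

As in the Spark version, nothing clamps the result, so a very unlikely extreme draw could yield a zero or negative degree.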

In [14]:
r.map(lambda v: [(i, 0) for i in range(v)]).collect()

[[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)],
 [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]]
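
Each degree `v` is expanded into `v` pairs `(i, 0)`; what the second element stands for is not stated here. The same expansion can be reproduced without Spark, as in this sketch that uses the degrees collected above (the helper name `expand` is illustrative):

```python
def expand(degree):
    """Return the list [(0, 0), (1, 0), ...] with `degree` entries."""
    return [(i, 0) for i in range(degree)]


degrees = [8, 4, 7, 5, 8, 6, 8, 7, 6, 5]  # values collected above
expanded = [expand(d) for d in degrees]
print(expanded[1])  # [(0, 0), (1, 0), (2, 0), (3, 0)]
```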

In [None]:
!ls work

In [None]:
os.environ["SPARK_HOME"]

# FAQ

1. OSError: [Errno 2] No such file or directory

Most likely the Java environment needed to initialize the Spark client is not configured correctly. On Stampede, run:

    module load jdk64
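
Before creating a Spark context, it can help to check from Python whether a `java` executable is visible at all. This is a minimal sketch using only the standard library; the helper name `check_java` is hypothetical, and a missing executable is only one plausible cause of the error above.

```python
import os
import shutil


def check_java(env=None):
    """Return (java_path, java_home) as seen in env; either may be None."""
    env = os.environ if env is None else env
    # shutil.which returns None when no `java` executable is on the PATH.
    java_path = shutil.which("java", path=env.get("PATH", ""))
    return java_path, env.get("JAVA_HOME")
```

If `check_java()` returns `None` as the first element, loading the `jdk64` module before starting the notebook is the first thing to try.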

In [None]:
!pyspark

# Scratchpad

In [None]:
from pyspark import SparkContext, SparkConf, Accumulator, AccumulatorParam
spark_context = SparkContext("spark://129.114.66.28:7077",
                             "Pilot-Spark",
                             sparkHome="/home1/01131/tg804093/src/supercomputing2015-tutorial/01_hadoop_on_hpc/work/spark-1.4.1-bin-hadoop2.6")