# Getting Started with Pilot-Streaming on Stampede

First, import the required packages and modules and make them available on the Python path.

Pilot-Streaming utilizes [SAGA-Python](http://saga-python.readthedocs.io/en/latest/tutorial/part3.html) to manage the Spark cluster environment. All attributes of the SAGA Job map 1-to-1 to the Pilot Compute Description. 

`resource`: URL of the Local Resource Manager. All SAGA adaptors are supported. Examples:

* `slurm://localhost`: Submit to the local SLURM resource manager, e.g. on the master node of Wrangler or Stampede
* `slurm+ssh://login1.wrangler.tacc.utexas.edu`: Submit to the SLURM resource manager on the Wrangler master node via SSH (e.g. from a node that is running a job)
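
The two URL forms above differ only in their scheme. As a rough illustration (not part of Pilot-Streaming or SAGA), the sketch below splits a resource URL into the job adaptor, the access method, and the host using only the standard library. The helper name `split_resource_url` and the label `"local"` for URLs without a `+ssh` suffix are assumptions made for this example.

```python
from urllib.parse import urlparse


def split_resource_url(resource_url):
    """Split a resource URL into (adaptor, access method, host).

    Hypothetical helper for illustration; SAGA performs its own URL handling.
    """
    parsed = urlparse(resource_url)
    # "slurm+ssh" -> adaptor "slurm", access "ssh"; plain "slurm" -> access "local"
    adaptor, _, access = parsed.scheme.partition("+")
    return adaptor, access or "local", parsed.hostname


for url in ("slurm://localhost", "slurm+ssh://login1.wrangler.tacc.utexas.edu"):
    print(url, "->", split_resource_url(url))
```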

`type`: The `type` attribute specifies the cluster environment. It can be `Spark`, `Dask` or `Kafka`; the cells below pass it in lowercase (`spark`, `dask`, `kafka`).
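
Because a description is a plain dictionary, a typo in `type` is easy to miss until submission. The following is a minimal sketch of a pre-submission check. `SUPPORTED_TYPES` and `check_type` are hypothetical names introduced here, and lowercasing the value is a choice of this example; whether Pilot-Streaming itself accepts mixed case is not confirmed by this notebook.

```python
# Frameworks named in this notebook (lowercase, as used in the cells below)
SUPPORTED_TYPES = {"spark", "dask", "kafka"}


def check_type(description):
    """Return the lowercase framework name or raise ValueError.

    Hypothetical helper; not part of the Pilot-Streaming API.
    """
    framework = str(description.get("type", "")).lower()
    if framework not in SUPPORTED_TYPES:
        raise ValueError(
            "unsupported type %r, expected one of %s"
            % (description.get("type"), sorted(SUPPORTED_TYPES))
        )
    return framework


print(check_type({"type": "Kafka"}))
```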


Depending on the resource, additional configuration may be necessary. For example, to ensure that the correct subnet is used, the Spark driver can be configured through environment variables such as `os.environ["SPARK_LOCAL_IP"]='129.114.58.2'`. Note: this is no longer required on Stampede 2.
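
As the Spark cell later in this notebook notes, this variable has to be set before `pyspark` is imported. A small sketch of that ordering follows. The helper `set_spark_local_ip` is hypothetical, and the address is the one quoted above; replace it with an address that is valid on your own resource.

```python
import ipaddress
import os


def set_spark_local_ip(ip):
    """Validate the address and export it as SPARK_LOCAL_IP (hypothetical helper)."""
    ipaddress.ip_address(ip)  # raises ValueError for a malformed address
    os.environ["SPARK_LOCAL_IP"] = ip
    return os.environ["SPARK_LOCAL_IP"]


set_spark_local_ip("129.114.58.2")  # example address taken from the text above
# import pyspark  # import only after the variable has been set
```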

In [None]:
# System libraries
import sys, os
sys.path.append("..")
import pandas as pd

# Logging: show errors only, including for the py4j bridge used by PySpark
import logging
logging.basicConfig(level=logging.ERROR)
logging.getLogger("py4j").setLevel(logging.ERROR)

# Pilot-Streaming
import pilot.streaming
# Display the loaded module to confirm which installation is in use
sys.modules['pilot.streaming']

In [None]:
RESOURCE_URL="slurm+ssh://login4.stampede2.tacc.utexas.edu"
WORKING_DIRECTORY=os.path.join(os.environ["HOME"], "work")

# 1. Kafka

In [None]:
pilot_compute_description = {
    "resource":RESOURCE_URL,
    "working_directory": WORKING_DIRECTORY,
    "number_of_nodes": 1,
    "cores_per_node": 48,
    "project": "TG-MCB090174",
    "queue": "normal",
    "config_name": "stampede",
    "walltime": 59,
    "type":"kafka"
}
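
The Kafka description above shares most of its fields with the Dask and Spark descriptions later in this notebook. One way to avoid repeating them is sketched below. `COMMON` and `make_description` are hypothetical names introduced for this example; all values are copied from the cells in this notebook.

```python
import os

# Fields that are identical in all three descriptions in this notebook
COMMON = {
    "resource": "slurm+ssh://login4.stampede2.tacc.utexas.edu",
    "working_directory": os.path.join(os.path.expanduser("~"), "work"),
    "number_of_nodes": 1,
    "cores_per_node": 48,
    "project": "TG-MCB090174",
    "queue": "normal",
}


def make_description(framework, walltime, **extra):
    """Return a new description dict; COMMON itself is never modified."""
    description = dict(COMMON)
    description.update(extra)
    description["type"] = framework
    description["walltime"] = walltime
    return description


kafka_description = make_description("kafka", 59, config_name="stampede")
dask_description = make_description("dask", 359, dask_cores=24)
print(kafka_description)
```

Copying `COMMON` on every call keeps framework-specific keys such as `dask_cores` from leaking into the other descriptions.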

In [None]:
%%time
kafka_pilot = pilot.streaming.PilotComputeService.create_pilot(pilot_compute_description)
kafka_pilot.wait()

In [None]:
kafka_pilot.get_details()

In [None]:
kafka_pilot.cancel()
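
If a cell between `wait()` and `cancel()` fails, the pilot keeps running and continues to consume allocation. A possible safeguard, assuming only the `wait()` and `cancel()` methods used above, is a small context manager. `managed_pilot` and `FakePilot` are hypothetical names; the fake object stands in for a real pilot so that the sketch runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def managed_pilot(pilot):
    """Wait for the pilot to start and always cancel it on exit."""
    try:
        pilot.wait()
        yield pilot
    finally:
        pilot.cancel()


class FakePilot:
    """Stand-in that records calls; replace with a real pilot object."""

    def __init__(self):
        self.events = []

    def wait(self):
        self.events.append("wait")

    def cancel(self):
        self.events.append("cancel")


fake = FakePilot()
try:
    with managed_pilot(fake):
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    pass
print(fake.events)  # ['wait', 'cancel']
```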

# 2. Dask

In [None]:
import distributed

pilot_compute_description = {
    "resource":RESOURCE_URL,
    "working_directory": WORKING_DIRECTORY,
    "number_of_nodes": 1,
    "cores_per_node": 48,
    "dask_cores" : 24,
    "project": "TG-MCB090174",
    "queue": "normal",
    "walltime": 359,
    "type":"dask"
}

In [None]:
%%time
dask_pilot = pilot.streaming.PilotComputeService.create_pilot(pilot_compute_description)
dask_pilot.wait()

In [None]:
dask_pilot.get_details()

In [None]:
import distributed
dask_client  = distributed.Client(dask_pilot.get_details()['master_url'])
dask_client.scheduler_info()

In [None]:
dask_client.gather(dask_client.map(lambda a: a*a, range(10)))
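
The cell above submits one task per input value and then collects the results. To try the same submit-then-collect pattern without a running pilot, the sketch below uses the standard library's `ThreadPoolExecutor` as a local stand-in. This is not Dask and involves no cluster; it only mirrors the shape of the computation.

```python
from concurrent.futures import ThreadPoolExecutor

# Local stand-in for the map/gather cell above: square each number in range(10)
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(lambda a: a * a, n) for n in range(10)]
    results = [future.result() for future in futures]

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```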

# 3. Spark

In [None]:
# Spark configuration must be provided before pyspark is imported and the JVM is started
#os.environ["SPARK_LOCAL_IP"]='129.114.58.101'  # uncomment if the driver must bind to a specific address
import os
import pyspark

pilot_compute_description = {
    "resource":RESOURCE_URL,
    "working_directory": WORKING_DIRECTORY,
    "number_of_nodes": 1,
    "cores_per_node": 48,
    "project": "TG-MCB090174",
    "queue": "normal",
    "walltime": 359,
    "type":"spark"
}

Start Spark Cluster and Wait for Startup Completion

In [None]:
%%time
spark_pilot = pilot.streaming.PilotComputeService.create_pilot(pilot_compute_description)
spark_pilot.wait()

In [None]:
spark_pilot.get_details()

In [None]:
#conf=pyspark.SparkConf()
#conf.set("spark.driver.bindAddress", "129.114.58.101")
#sc = pyspark.SparkContext(master="spark://129.114.58.102:7077", appName="dfas")

In [None]:
#os.environ["SPARK_LOCAL_IP"]="129.114.58.101"
sc = spark_pilot.get_context()

In [None]:
rdd = sc.parallelize([1,2,3])
rdd.map(lambda a: a*a).collect()

In [None]:
spark_pilot.cancel()