# Spark Job Example
* Requires a running Spark master
    * Install Spark locally on your node (https://spark.apache.org/docs/latest/spark-standalone.html)
    * Start a Spark master instance on your node: `$ $SPARK_HOME/sbin/start-master.sh`
    * Start at least one worker: `$ $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077`
    * Verify that your master and at least one worker are up at http://localhost:8080
    * NOTE: If you are using this with a Spark standalone cluster, the installed Spark version (including the minor version) must match the PySpark version, or you may experience odd errors.
    * NOTE: You may run into port issues if you're on a corporate network or VPN. See https://stackoverflow.com/questions/52133731/how-to-solve-cant-assign-requested-address-service-sparkdriver-failed-after
* Requires AWS credentials configured to allow pushing files to S3
    * Install the AWS CLI: `$ pip install awscli`
    * Configure your credentials: `$ aws configure`
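
The two setup checks above (matching versions and a reachable master) can be sketched as a small preflight script. This is only an illustration using the standard library: `same_minor_version` and `master_reachable` are hypothetical helper names, not part of Spark or Disdat, and the sketch assumes "minor version" means the second component of a version string such as `3.2.0`.

```python
import socket


def same_minor_version(installed: str, client: str) -> bool:
    """Return True when both version strings share major and minor parts."""
    # Assumption: versions look like "MAJOR.MINOR.PATCH"; only the first two parts are compared.
    return installed.split(".")[:2] == client.split(".")[:2]


def master_reachable(host: str = "localhost", port: int = 7077, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the Spark master port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False
```

In practice you might compare `pyspark.__version__` with the version printed by `spark-submit --version`, and call `master_reachable()` before submitting a job. A successful TCP connection only shows that something is listening on the port, not that it is a healthy Spark master.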

## Illustrates how to:
* Submit a Spark job from a task
* Have the Spark job read from a bundle with S3 paths
* Have the Spark job write to S3 paths contained in the output bundle
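
The S3 paths mentioned above are ordinary `s3://bucket/key` URLs. As a rough sketch of what such a path contains, the hypothetical helper below (`split_s3_path`, not part of Disdat or Spark) splits one into its bucket and key using only the standard library. The example key in the usage note is made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_path(path: str) -> tuple:
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(path)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 path: {path!r}")
    # The bucket is the URL's network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_s3_path("s3://my-bucket/some/file_1.txt")` would separate the bucket `my-bucket` from the key `some/file_1.txt`.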


In [1]:
%load_ext autoreload
%autoreload 1

In [2]:
import disdat.api as api
from disdatluigi.api import apply
from disdat.api import Bundle
import pandas as pd
import pickle
import time
import luigi
from pyspark.sql import SparkSession
from pyspark import SparkConf
import os

%aimport pipelines.spark_tasks

# Make a bundle with S3 paths
NOTE: Requires a remote context to push to.

In [3]:
data_context = 'example-context'
remote_context_url = 's3://disdat-cdo-prd/'  # <------ Replace with the location of your Disdat contexts on S3
api.context(data_context)
api.remote(data_context, data_context, remote_context_url)

with Bundle(data_context, name="s3_files") as b:
    f1 = b.get_file("file_1.txt")
    with open(f1, mode='w') as f:
        f.write("This is our first file!")
    b.add_data(f1)

# Push and remove the local file
b.commit().push(delocalize=True)

Context already bound to remote at s3://disdat-cdo-prd/
Pushed committed bundle None uuid 2697d5ad-3100-409a-894c-1517faa075a7 to remote s3://disdat-cdo-prd/context


<disdat.api.Bundle at 0x7f88c0664e50>

In [8]:
spark_master = 'spark://localhost:7077'   # Fill in your Spark master URL (shown on the master web UI at http://localhost:8080)
app_name = "testapp"

In [None]:
apply("example-context",
      pipelines.spark_tasks.RunSparkJob,
      params={'spark_master': spark_master,
              'app_name': app_name},
      incremental_push=True,
      force=True)

DEBUG: Checking if RunSparkJob(spark_master=spark://localhost:7077, app_name=testapp, input_bundle_name=s3_files) is complete
DEBUG: Checking if ExternalDepTask(uuid=2697d5ad-3100-409a-894c-1517faa075a7, processing_name=_d41d8cd98f_d41d8cd98f) is complete
INFO: Informed scheduler that task   RunSparkJob_testapp_s3_files_spark___localhos_5ac5543660   has status   PENDING
INFO: Informed scheduler that task   ExternalDepTask__d41d8cd98f_d41d_2697d5ad_3100_40_57dd6189d4   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10905] Worker Worker(salt=542291413, workers=1, host=intuitdepe1ea6, username=kyocum, pid=10905) running   RunSparkJob(spark_master=spark://localhost:7077, app_name=testapp, input_bundle_name=s3_files)
22/01/14 18:49:21 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master localhost:7077
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	