# Spark README

This notebook shows how to run Spark in both local and yarn-client modes within TAP.

Several [Spark examples](/tree/examples/spark) are included with TAP.

More examples are available on the Spark website: http://spark.apache.org/examples.html

PySpark API documentation: http://spark.apache.org/docs/latest/api/python/

## Create a SparkContext in local mode

In local mode no cluster resources are used. It is easy to set up and is good for small-scale testing.
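
The master string passed to `SparkContext` also controls how many worker threads local mode uses: `local` runs a single thread, `local[K]` runs K threads, and `local[*]` uses one thread per logical core. A minimal sketch of building that string follows; `local_master` is a hypothetical helper introduced here for illustration and is not part of PySpark.

```python
def local_master(threads=1):
    """Build a Spark master string for local mode.

    Hypothetical helper, not a PySpark function.
    threads=1    -> "local"     (single worker thread)
    threads=N    -> "local[N]"  (N worker threads)
    threads=None -> "local[*]"  (one thread per logical core)
    """
    if threads is None:
        return "local[*]"
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        raise ValueError("threads must be a positive integer or None")
    return "local" if threads == 1 else "local[%d]" % threads


# Show the master strings this helper produces.
for n in (1, 4, None):
    print(local_master(n))
```

The result can be passed wherever this notebook uses `"local"`, for example `pyspark.SparkContext(local_master(4))`.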

In [None]:
import pyspark

# Create a SparkContext in local mode
sc = pyspark.SparkContext("local")

In [None]:
# Test that the context is working by creating an RDD and performing a simple operation
rdd = sc.parallelize(range(10))
print(rdd.count())

In [None]:
# Find out more information about your SparkContext
print('Python Version: ' + sc.pythonVer)
print('Spark Version: ' + sc.version)
print('Spark Master: ' + sc.master)
print('Spark Home: ' + str(sc.sparkHome))
print('Spark User: ' + str(sc.sparkUser()))
print('Application Name: ' + sc.appName)
print('Application Id: ' + sc.applicationId)

In [None]:
# Stop the context when you are done with it. Stopping the SparkContext releases
# its resources, and no further operations can be performed within that context.
sc.stop()
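
Because a stopped context cannot be reused and an unstopped one keeps holding resources, it can help to guarantee that `stop()` runs even when a cell raises an error. One possible pattern is sketched below. `stopping` is a hypothetical helper (not a PySpark API), and `FakeContext` is a stand-in so the sketch runs without Spark; with a real installation it would be replaced by `pyspark.SparkContext("local")`.

```python
from contextlib import contextmanager


@contextmanager
def stopping(context):
    """Yield `context` and always call its stop() method afterwards."""
    try:
        yield context
    finally:
        context.stop()


class FakeContext(object):
    """Stand-in for a SparkContext so this sketch runs without Spark."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


ctx = FakeContext()
try:
    with stopping(ctx) as sc:
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    pass

print(ctx.stopped)  # True: stop() ran despite the error
```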

## Create a SparkContext in yarn-client mode

In yarn-client mode, the driver runs in the notebook process and the executors are launched on the YARN cluster. Use this mode for data sets that are too large to process on a single machine.
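
Which master string to use depends on the Spark version. The `yarn-client` master used in this notebook is the Spark 1.x spelling; from Spark 2.0 onward it is deprecated in favor of master `yarn` with the deploy mode set to `client`. The sketch below gathers the settings as key/value pairs. `yarn_client_settings` and its `spark_major` parameter are hypothetical names used only for illustration, and the default memory sizes (512 MB for the driver, 1 GB per executor) mirror the ones in this notebook.

```python
def yarn_client_settings(spark_major, driver_memory="512m", executor_memory="1g"):
    """Return (key, value) pairs for running on YARN in client mode.

    Hypothetical helper, not part of PySpark. `spark_major` is the major
    Spark version of the cluster, which the caller is assumed to know.
    """
    if spark_major < 2:
        # Spark 1.x spelling, as used in this notebook.
        pairs = [("spark.master", "yarn-client")]
    else:
        # Spark 2.0+ spelling: master "yarn" plus an explicit deploy mode.
        pairs = [("spark.master", "yarn"), ("spark.submit.deployMode", "client")]
    pairs.append(("spark.driver.memory", driver_memory))
    pairs.append(("spark.executor.memory", executor_memory))
    return pairs


# Show the settings for a Spark 1.x cluster.
for key, value in yarn_client_settings(spark_major=1):
    print("%s=%s" % (key, value))
```

The pairs could then be applied with `pyspark.SparkConf().setAll(pairs)` before creating the context.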

In [None]:
import pyspark

# create a configuration
conf = pyspark.SparkConf()

# set the master to "yarn-client"
conf.setMaster("yarn-client")

# set other options as desired
conf.set("spark.driver.memory", "512m")
conf.set("spark.executor.memory", "1g")

# create the context
sc = pyspark.SparkContext(conf=conf)


In [None]:
# Test that the context is working by creating an RDD and performing a simple operation
rdd = sc.parallelize(range(10))
print(rdd.count())

In [None]:
# Find out more information about your SparkContext
print('Python Version: ' + sc.pythonVer)
print('Spark Version: ' + sc.version)
print('Spark Master: ' + sc.master)
print('Spark Home: ' + str(sc.sparkHome))
print('Spark User: ' + str(sc.sparkUser()))
print('Application Name: ' + sc.appName)
print('Application Id: ' + sc.applicationId)

In [None]:
# Stop the context when you are done with it. Stopping the SparkContext releases
# its resources, and no further operations can be performed within that context.
sc.stop()

## Using Spark Submit

You can upload jar files through Jupyter and run them with spark-submit.

In [None]:
# Call spark-submit to run the SparkPi example that ships with Spark.
# This jar does not need to be uploaded because it is already installed on the system.
!spark-submit --class org.apache.spark.examples.SparkPi \
    --master local \
    /usr/local/spark/lib/spark-examples-*.jar \
    10

Alternatively, you can access the [Jupyter dashboard](/tree) and then choose "New -> Terminal" to run spark-submit at the command line.
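
The spark-submit invocation shown above can also be assembled from Python, which may be convenient when the class name or arguments are computed in the notebook. The following is a sketch under stated assumptions: `spark_submit_command` and `run_spark_submit` are hypothetical helpers, the jar path is a placeholder, and only the `--class` and `--master` flags that appear in this notebook are used.

```python
import subprocess


def spark_submit_command(main_class, jar_path, app_args=(), master="local"):
    """Build a spark-submit argument list (hypothetical helper).

    Only the --class and --master flags shown in this notebook are used.
    """
    command = ["spark-submit", "--class", main_class, "--master", master, jar_path]
    command.extend(str(arg) for arg in app_args)
    return command


def run_spark_submit(command):
    """Run the command and return its exit code.

    Requires spark-submit to be on the PATH; not called in this sketch.
    """
    return subprocess.call(command)


command = spark_submit_command(
    "org.apache.spark.examples.SparkPi",
    "/path/to/spark-examples.jar",  # placeholder path, replace with a real jar
    app_args=[10],
)
print(" ".join(command))
```

Unlike the `!` shell escape, an argument list is not glob-expanded, so a pattern such as `spark-examples-*.jar` has to be resolved to a full path first, for example with the standard `glob` module.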