# PySpark README

This notebook shows how to run Spark in both local and yarn-client modes within TAP.

Several [Spark examples](/tree/examples/spark) come with TAP.

More examples are available on the Spark website: http://spark.apache.org/examples.html

PySpark RDD API documentation: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

## Create a SparkContext in local mode

In local mode, no cluster resources are used. It is easy to set up and well suited to small-scale testing.

In [None]:
import pyspark

# Create a SparkContext in local mode
sc = pyspark.SparkContext("local")

In [None]:
# Test that the context works by creating an RDD and running a simple operation
rdd = sc.parallelize(range(10))
# The parenthesized form of print works on both Python 2 and Python 3
print(rdd.count())

In [None]:
# Stop the context when you are done with it. Stopping the SparkContext releases
# its resources, and no further operations can be performed within that context.
sc.stop()
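
If a cell fails before `sc.stop()` runs, the context stays alive and keeps its resources. One way to guard against that is a small context manager that always calls `stop()`. The sketch below is an illustration only: the helper names `stopping` and `count_locally` are hypothetical and are not part of PySpark or TAP.

```python
from contextlib import contextmanager


@contextmanager
def stopping(context):
    """Yield `context`, then call its stop() method even if the body raises."""
    try:
        yield context
    finally:
        context.stop()


def count_locally(n):
    """Count n elements on a local-mode SparkContext and always release it."""
    import pyspark  # imported lazily so stopping() can be reused without Spark

    with stopping(pyspark.SparkContext("local")) as sc:
        return sc.parallelize(range(n)).count()
```

With PySpark available, `count_locally(10)` should give the same count as the cells above, and the context is stopped whether or not the job succeeds.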

## Create a SparkContext in yarn-client mode

In yarn-client mode, the Spark job is launched on the cluster while the driver keeps running in the notebook process. Use this mode when the data is too large to process on a single machine.

In [None]:
import pyspark

# create a configuration
conf = pyspark.SparkConf()

# set the master to "yarn-client"
conf.setMaster("yarn-client")

# set other options as desired
conf.set("spark.driver.memory", "512mb")
conf.set("spark.executor.memory", "2g")

# create the context
sc = pyspark.SparkContext(conf=conf)
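
When there are many options, it can be convenient to keep them in a dictionary and apply them in one call with `SparkConf.setAll`, which takes a list of key-value pairs. The following is a minimal sketch under that assumption. The helper names `conf_pairs` and `make_context` are hypothetical, and the option values are placeholders rather than recommendations for any particular cluster.

```python
def conf_pairs(master, options):
    """Return (key, value) pairs suitable for SparkConf.setAll()."""
    pairs = [("spark.master", master)]
    # Sort the options so the resulting list is deterministic
    pairs.extend(sorted(options.items()))
    return pairs


def make_context(master, options):
    """Build a SparkContext from a master setting and a dict of options."""
    import pyspark  # imported lazily so conf_pairs() works without Spark

    conf = pyspark.SparkConf().setAll(conf_pairs(master, options))
    return pyspark.SparkContext(conf=conf)


# Placeholder settings; tune the values for your cluster.
YARN_OPTIONS = {
    "spark.executor.memory": "2g",
    "spark.executor.instances": "2",
}

print(conf_pairs("yarn-client", YARN_OPTIONS))
```

Calling `make_context("yarn-client", YARN_OPTIONS)` would then play the role of the cell above, and `make_context("local", {})` would give a local-mode context from the same code path.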


In [None]:
# Test that the context works by creating an RDD and running a simple operation
rdd = sc.parallelize(range(10))
# The parenthesized form of print works on both Python 2 and Python 3
print(rdd.count())

In [None]:
# Stop the context when you are done with it. Stopping the SparkContext releases
# its resources, and no further operations can be performed within that context.
sc.stop()