# Spark README

Apache Spark&trade; is a general engine for cluster-scale computing. It provides APIs for multiple languages, including Python, Scala, and SQL.

This notebook shows how to run Spark in both local and yarn-client modes within TAP, as well as how to use Spark Submit.

Several [Spark examples](/tree/examples/spark) are included with TAP, and [others](http://spark.apache.org/examples.html) are available on the [Spark website](http://spark.apache.org/).

See the [PySpark API documentation](http://spark.apache.org/docs/latest/api/python/) for more information on the APIs used below.

## Supported Modes

Currently, the YARN scheduler is supported on TAP, and Spark jobs can be run in three different modes:

Mode | Good for Big Data | Supports Interactive Sessions | Supports Batch Jobs | Runtime | Use With | Best For
--- | --- | --- | --- | --- | --- | ---
**Local mode** | No | Yes | Yes | Both driver and workers run locally | pyspark, spark-shell, spark-submit | Fast small scale testing in an interactive shell or batch.  Best mode to start with if you are new to Spark.
**Yarn Client** | Yes | Yes | Yes | Driver runs locally and workers run in cluster | pyspark, spark-shell, spark-submit | Big data in an interactive shell.
**Yarn Cluster** | Yes | No | Yes | Both driver and workers run in cluster | spark-submit | Big data batch jobs.

More information is available in the [Spark Documentation](http://spark.apache.org/docs/latest/).
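
The mode table can be read as a small decision rule. The sketch below encodes its "Best For" column in a hypothetical helper, `choose_master`, which is not part of Spark or TAP. The strings `local` and `yarn-client` are the master values used in this notebook; `yarn-cluster` is assumed to be the matching Spark 1.x spelling for cluster mode, which is normally passed to spark-submit rather than set inside a notebook.

```python
def choose_master(big_data, interactive):
    """Pick a master string following the "Best For" column of the mode table.

    Hypothetical helper for illustration; it is not a Spark or TAP API.
    """
    if not big_data:
        # Small-scale testing: driver and workers both run locally.
        return "local"
    if interactive:
        # Big data in an interactive shell: driver local, workers in cluster.
        return "yarn-client"
    # Big data batch jobs: driver and workers both run in the cluster.
    # Assumption: "yarn-cluster" is the Spark 1.x master string for this mode.
    return "yarn-cluster"


print(choose_master(big_data=False, interactive=True))
print(choose_master(big_data=True, interactive=True))
print(choose_master(big_data=True, interactive=False))
```

If you are new to Spark, the first case (local mode) is the one to start with.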

## Create a SparkContext in local mode

In local mode, no cluster resources are used. It is easy to set up and is good for small-scale testing.

In [None]:
import pyspark

# Create a SparkContext in local mode
sc = pyspark.SparkContext("local")

In [None]:
# Test the context is working by creating an RDD and performing a simple operation
rdd = sc.parallelize(range(10))
print(rdd.count())

In [None]:
# Find out more information about your SparkContext
print('Python Version: ' + sc.pythonVer)
print('Spark Version: ' + sc.version)
print('Spark Master: ' + sc.master)
print('Spark Home: ' + str(sc.sparkHome))
print('Spark User: ' + str(sc.sparkUser()))
print('Application Name: ' + sc.appName)
print('Application Id: ' + sc.applicationId)

In [None]:
# Stop the context when you are done with it. Stopping the SparkContext releases its
# resources, and no further operations can be performed within that context.
sc.stop()
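
Because stopping a context is what releases its resources, it can help to guarantee that `stop()` runs even when a cell raises an error. The following is a minimal sketch, assuming a hypothetical helper named `managed_context` (not part of PySpark) that wraps any object exposing a `stop()` method:

```python
from contextlib import contextmanager


@contextmanager
def managed_context(factory):
    """Create a context with factory(), yield it, and always call stop().

    Hypothetical helper for illustration; `factory` is any zero-argument
    callable that returns an object with a stop() method.
    """
    sc = factory()
    try:
        yield sc
    finally:
        # Runs on normal exit and on exceptions, so resources are released.
        sc.stop()


# Possible usage in this notebook (requires pyspark, so it is left commented):
# import pyspark
# with managed_context(lambda: pyspark.SparkContext("local")) as sc:
#     print(sc.parallelize(range(10)).count())
```

The design choice here is to pass a factory rather than an existing context, so that creation and cleanup stay in one place.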

In [None]:
# Restart the kernel before switching to yarn-client mode.
# This is only needed if you already ran local mode in the same session.
# The kernel can be restarted via the menus above or with the following code:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Create a SparkContext in yarn-client mode

In yarn-client mode, the driver runs locally and the workers run in the cluster. This mode is needed to work with big data.

In [None]:
import pyspark

# create a configuration
conf = pyspark.SparkConf()

# set the master to "yarn-client"
conf.setMaster("yarn-client")

# set other options as desired
conf.set("spark.yarn.am.memory", "512m")
conf.set("spark.executor.memory", "1g")

# create the context
sc = pyspark.SparkContext(conf=conf)
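
Memory options such as `spark.executor.memory` are passed as strings made of a number and a unit suffix. A typo in such a string may only surface once the job is launched, so a quick local check can save a round trip to the cluster. The sketch below uses a hypothetical helper, `memory_to_mib`, which is not part of PySpark. It assumes the suffixes `k`, `m`, `g`, and `t`, each optionally followed by `b`. Spark's own parser remains the authority on what is accepted.

```python
import re

# Size of each unit expressed in MiB.
_UNIT_IN_MIB = {"k": 1.0 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def memory_to_mib(text):
    """Convert a memory string such as "1g" or "512m" to MiB.

    Hypothetical helper for illustration; raises ValueError on strings
    that do not look like <number><unit>.
    """
    match = re.match(r"^(\d+)([kmgt])b?$", text.strip().lower())
    if match is None:
        raise ValueError("unrecognized memory string: %r" % text)
    number, unit = match.groups()
    return int(number) * _UNIT_IN_MIB[unit]


print(memory_to_mib("1g"))
print(memory_to_mib("512m"))
```

A check like this could run just before the `conf.set(...)` calls, for example to confirm that the executor memory is at least as large as some minimum you care about.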

In [None]:
# Test the context is working by creating an RDD and performing a simple operation
rdd = sc.parallelize(range(10))
print(rdd.count())

In [None]:
# Find out more information about your SparkContext
print('Python Version: ' + sc.pythonVer)
print('Spark Version: ' + sc.version)
print('Spark Master: ' + sc.master)
print('Spark Home: ' + str(sc.sparkHome))
print('Spark User: ' + str(sc.sparkUser()))
print('Application Name: ' + sc.appName)
print('Application Id: ' + sc.applicationId)

In [None]:
# Stop the context when you are done with it. Stopping the SparkContext releases its
# resources, and no further operations can be performed within that context.
sc.stop()

## Using Spark Submit

It is possible to upload jars via Jupyter and use Spark Submit to run them. Jars can be uploaded by accessing the [Jupyter dashboard](/tree) and clicking the "Upload" button.

In [None]:
# Call spark-submit to run the SparkPi example that ships with Spark.
# We didn't need to upload this jar because it is already loaded on the system.
!spark-submit --class org.apache.spark.examples.SparkPi \
    --master local \
    /usr/local/spark/lib/spark-examples-*.jar \
    10

Alternatively, you can access the [Jupyter dashboard](/tree) and then choose "New -> Terminal" to run spark-submit at the command line.
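
A further option is to assemble the spark-submit call in Python and run it with `subprocess`, which makes it easier to vary the arguments from code. The sketch below is an illustration, not a TAP feature: `build_spark_submit` and `run_spark_submit` are hypothetical helper names. One detail to watch is that without a shell the `*` in the jar path is not expanded, so the pattern is resolved with `glob` first.

```python
import glob
import subprocess


def build_spark_submit(main_class, master, jar_pattern, app_args=()):
    """Return the spark-submit argument list for a jar matching jar_pattern.

    Hypothetical helper for illustration. If nothing matches the pattern,
    it is passed through unchanged so spark-submit reports the problem.
    """
    matches = sorted(glob.glob(jar_pattern))
    jar = matches[0] if matches else jar_pattern
    command = ["spark-submit", "--class", main_class, "--master", master, jar]
    command.extend(str(arg) for arg in app_args)
    return command


def run_spark_submit(command):
    """Run the command and return its exit code (needs spark-submit on PATH)."""
    return subprocess.call(command)


# Same SparkPi example as in the cell above, built as an argument list.
command = build_spark_submit(
    "org.apache.spark.examples.SparkPi",
    "local",
    "/usr/local/spark/lib/spark-examples-*.jar",
    [10],
)
print(" ".join(command))
```

Calling `run_spark_submit(command)` would then launch the job; it is kept separate so the command can be inspected before anything is executed.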

## Using the Scala spark-shell

Access the [Jupyter dashboard](/tree) and then choose "New -> Terminal" to open a terminal window.

In the terminal window type:

```bash
    spark-shell --master local
```

Wait for the prompt and then type a simple Spark program to verify it is working:

```scala
   // create an RDD and perform a count
   val rdd = sc.parallelize(1 to 10)
   rdd.count()

   // quit the shell when you are done
   :quit
```

## Viewing/Modifying Spark Configuration

Spark configuration can be modified using SparkConf, as in the example above.

Additionally, the default configuration can be viewed and modified in a terminal (access the [Jupyter dashboard](/tree) and then choose "New -> Terminal").

The default Spark configuration is stored in /etc/spark/conf. Important files include:

- spark-defaults.conf - default properties
- log4j.properties - logging configuration
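
To inspect the defaults from a notebook cell rather than a terminal, the properties file can be read with plain Python. This is a simplified sketch: `parse_spark_defaults` is a hypothetical helper, and it only handles the common whitespace-separated `key value` form with `#` comments, whereas the real file may use other properties-file syntax. The sample content is made up for illustration.

```python
import io


def parse_spark_defaults(lines):
    """Parse "key value" lines into a dict, skipping blanks and # comments.

    Hypothetical helper for illustration; `lines` is any iterable of
    strings, such as an open file.
    """
    properties = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace: key, then the rest as value.
        parts = line.split(None, 1)
        properties[parts[0]] = parts[1] if len(parts) > 1 else ""
    return properties


# Made-up sample standing in for /etc/spark/conf/spark-defaults.conf.
sample = io.StringIO(u"""# Example defaults
spark.master            yarn-client
spark.executor.memory   1g
""")
defaults = parse_spark_defaults(sample)
print(defaults["spark.executor.memory"])
```

To read the real file, pass an open file object instead of the sample, for example `parse_spark_defaults(open("/etc/spark/conf/spark-defaults.conf"))`.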

Additional Hadoop settings are in /etc/hadoop/conf. Important files include:

- core-site.xml - core Hadoop configuration
- hdfs-site.xml - HDFS configuration
- yarn-site.xml - YARN configuration

These settings are automatically downloaded from Cloudera Manager when provisioning a new Jupyter instance.
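
The Hadoop `*-site.xml` files share one layout: a `<configuration>` root containing `<property>` elements, each with a `<name>` and a `<value>`. A minimal sketch for reading them from Python follows; `parse_hadoop_xml` is a hypothetical helper, and the sample value (including the host name) is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_hadoop_xml(xml_text):
    """Return a dict of name -> value from Hadoop-style configuration XML.

    Hypothetical helper for illustration; properties without a <name>
    are skipped, and a missing <value> becomes an empty string.
    """
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            settings[name.strip()] = (prop.findtext("value") or "").strip()
    return settings


# Made-up sample standing in for /etc/hadoop/conf/core-site.xml.
sample_xml = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example:8020</value>
  </property>
</configuration>"""
print(parse_hadoop_xml(sample_xml)["fs.defaultFS"])
```

For a file on disk, read its contents first (for example with `open(path).read()`) and pass the text to the helper.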

View the [Spark Configuration](http://spark.apache.org/docs/latest/configuration.html) documentation for more information.