# PySpark

Apache Spark with Python in Jupyter Notebooks.

I have the Scala `spark-shell` running, and I set the following environment variables in the Bash session from which `jupyter notebook` is started.

    export PYSPARK_PYTHON=python3
    export PYSPARK_DRIVER_PYTHON=ipython
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

In [3]:
import sys
sys.path.append('/opt/spark/python')
import pyspark

## Connection to `local`

In [2]:
sc = pyspark.SparkContext(master='local', appName="notebooklearning")

The question is, did we just spawn a new Spark, or connect to what we have running?

Now some data. We use an RDD (resilient distributed dataset), which is Spark's basic distributed collection.

In [3]:
rdd1 = sc.parallelize([('a',7),('a',2),('b',2)])

Now, an operation.

In [4]:
rdd1.reduce(lambda a,b: a+b)

('a', 7, 'a', 2, 'b', 2)
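
The flat tuple above is what plain Python gives too: `+` on two tuples concatenates them, so reducing a list of pairs with `a+b` strings all the pairs together. A minimal sketch without Spark follows; the per-key sum at the end is only my guess at what one would usually want here, and it mirrors what `reduceByKey` with the same lambda would compute on an RDD.

```python
from functools import reduce

pairs = [('a', 7), ('a', 2), ('b', 2)]

# Tuple + tuple concatenates, so the fold flattens every pair into one tuple.
flat = reduce(lambda a, b: a + b, pairs)
print(flat)

# Assumed alternative goal: add up the values separately for each key.
totals = {}
for key, value in pairs:
    totals[key] = totals.get(key, 0) + value
print(sorted(totals.items()))
```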

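OK, that ran, but new Spark UIs appeared at http://raspberrypi:4041 and http://raspberrypi:4042, which suggests that each context I created started its own Spark instead of connecting to the one already running.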

## Connection to `spark://` in standalone mode

In [61]:
sc.stop()

Here I instead have `/opt/spark/sbin/start-master.sh` and `/opt/spark/sbin/start-slave.sh` running. I kept getting connection refused until I restarted the kernel after the above, so `sc.stop()` apparently left something behind in my Python kernel. I was also killing things from the command line and had a number of SparkContexts open, so I am not surprised. Anyway, the master UI is at http://raspberrypi:8080 and the detail UI at http://raspberrypi:4040.

In [None]:
sparkconf = pyspark.SparkConf().setAppName('notebookLearning2').setMaster('spark://raspberrypi:7077')
sc2 = pyspark.SparkContext(conf=sparkconf)
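
The difference between the two contexts is all in the master URL: `local` runs Spark inside the driver process, while `spark://host:port` connects to a standalone master. A rough sketch of that distinction, using a hypothetical helper `describe_master` that is not part of PySpark and only covers the URL forms used in this notebook plus `local[N]` and `local[*]`:

```python
import re

def describe_master(master):
    """Hypothetical helper: summarise what a Spark master URL asks for."""
    if master == 'local':
        return 'in-process, 1 worker thread'
    match = re.fullmatch(r'local\[(\d+|\*)\]', master)
    if match:
        threads = match.group(1)
        if threads == '*':
            return 'in-process, one worker thread per core'
        return 'in-process, ' + threads + ' worker threads'
    match = re.fullmatch(r'spark://([^:/]+):(\d+)', master)
    if match:
        return 'standalone cluster at {}, port {}'.format(match.group(1), match.group(2))
    return 'not covered by this sketch'

for url in ['local', 'local[2]', 'spark://raspberrypi:7077']:
    print(url, '->', describe_master(url))
```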

Now, data.

In [None]:
rdd1 = sc2.parallelize([('a',7),('a',2),('b',2)])
rdd2 = sc2.parallelize([("a",["x","y","z"]), ("b",["p", "r"])])
rdd3 = sc2.parallelize(range(100))

And operations.

In [4]:
rdd1.reduce(lambda a,b: a+b)

('a', 7, 'a', 2, 'b', 2)

In [5]:
rdd2.flatMapValues(lambda x: x).collect()

[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
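
For comparison, the same flattening written as a plain Python comprehension. This is only a sketch of what `flatMapValues` does with the identity function; it runs locally and involves no Spark.

```python
pairs = [("a", ["x", "y", "z"]), ("b", ["p", "r"])]

# Emit one (key, item) pair for every item in each key's list of values.
flattened = [(key, item) for key, values in pairs for item in values]
print(flattened)
```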

Simple access to the (key, value) tuples, which I have in `rdd1`:

In [6]:
rdd1.keys().collect()

['a', 'a', 'b']

In [23]:
rdd1.values().sum()

11

In [8]:
rdd3.stats()

(count: 100, mean: 49.5, stdev: 28.8660700477, max: 99.0, min: 0.0)
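
As a sanity check, the same numbers can be reproduced with the standard library. Note that the `stdev` reported above matches the population standard deviation (`pstdev`), not the sample one. The dictionary layout is just my own choice for this sketch.

```python
import statistics

data = range(100)

# Recompute the summary locally; pstdev is the population standard deviation.
summary = {
    'count': len(data),
    'mean': statistics.mean(data),
    'stdev': statistics.pstdev(data),
    'max': max(data),
    'min': min(data),
}
print(summary)
```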

I can run the Scala `spark-shell`... or could, if I didn't run out of memory. The little Raspberry Pi 3 is busy!

To constrain memory usage, I added the following four lines to `conf/spark-env.sh`, without knowing exactly what each of them means.

    SPARK_EXECUTOR_MEMORY=500m
    SPARK_DRIVER_MEMORY=500m
    SPARK_WORKER_MEMORY=500m
    SPARK_DAEMON_MEMORY=500m

I think I might prefer to do such settings in `conf/spark-defaults.conf` instead.
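
A small sketch of what that could look like for the driver and executor settings, assuming they map to the properties `spark.driver.memory` and `spark.executor.memory`. As far as I understand, `SPARK_WORKER_MEMORY` and `SPARK_DAEMON_MEMORY` configure the standalone daemons themselves and would stay in `conf/spark-env.sh`. The script only prints the lines; it does not write any file.

```python
# Values taken from the spark-env.sh lines above.
env_settings = {
    'SPARK_EXECUTOR_MEMORY': '500m',
    'SPARK_DRIVER_MEMORY': '500m',
}

# Assumed mapping from environment variable to Spark property name.
property_names = {
    'SPARK_EXECUTOR_MEMORY': 'spark.executor.memory',
    'SPARK_DRIVER_MEMORY': 'spark.driver.memory',
}

# spark-defaults.conf takes one "property value" pair per line.
conf_lines = [
    '{} {}'.format(property_names[name], value)
    for name, value in sorted(env_settings.items())
]
print('\n'.join(conf_lines))
```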

So in Scala, I now do

    val inputfile = sc.textFile("input.txt")
    val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
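
To make clear what `counts` will hold once an action is run on it, here is the same word count in plain Python. The two sample lines are made up and merely stand in for the contents of `input.txt`; `Counter` plays the role of the `map` to `(word, 1)` followed by `reduceByKey`.

```python
from collections import Counter

# Made-up sample lines standing in for input.txt.
lines = ["to be or not to be", "that is the question"]

# Split every line on single spaces, as the Scala flatMap does.
words = [word for line in lines for word in line.split(" ")]

# Count occurrences per word, the equivalent of (word, 1) pairs summed by key.
counts = Counter(words)
print(dict(counts))
```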