# Spark
![Spark](https://spark.apache.org/images/spark-logo-trademark.png)

- https://spark.apache.org/

## Setup

- Version 3.0.1 (pre-built for Apache Hadoop 3.2 and later)

In [None]:
%%bash

# Download package (superseded releases are served from archive.apache.org)
cd /opt/pkgs
wget -q -c https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz

# unpack file and create link
tar -zxf spark-3.0.1-bin-hadoop3.2.tgz -C /opt
ln -s /opt/spark-3.0.1-bin-hadoop3.2 /opt/spark

# update envvars.sh
cat >> /opt/envvars.sh << EOF
# Spark
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYTHONIOENCODING=utf8
export PATH=\${PATH}:\${SPARK_HOME}/bin

EOF

cat /opt/envvars.sh

In [None]:
# Load environment variables
%load_ext dotenv
%dotenv -o /opt/envvars.sh
%env
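
Before running the examples, it can help to confirm that the variables written to `/opt/envvars.sh` actually reached the notebook process. The sketch below is a minimal check and not part of the original setup; `missing_spark_vars` is a hypothetical helper name, and the list of required variables simply mirrors the exports above.

```python
import os

# Variables exported by /opt/envvars.sh in the setup cell above.
REQUIRED_VARS = ("SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")


def missing_spark_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_spark_vars()
if missing:
    print("Missing variables:", ", ".join(missing))
else:
    print("Spark environment looks complete")
```

If any name is reported as missing, rerunning the `%dotenv -o /opt/envvars.sh` cell is the first thing to try.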

## Example with Pi

In [None]:
%%bash

# Local execution
$SPARK_HOME/bin/run-example --master local SparkPi 10 2> /dev/null

# Local execution with 4 worker threads
# $SPARK_HOME/bin/run-example --master 'local[4]' SparkPi 10

# Execution using YARN
# $SPARK_HOME/bin/run-example --master yarn SparkPi 10

# Execution using spark-submit
# $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master local \
# $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 10
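
SparkPi estimates π with a Monte Carlo method: it samples random points in a square and counts how many fall inside the inscribed circle. The following is a plain-Python sketch of that idea with no Spark involved; `estimate_pi` and its `seed` parameter are illustrative names, not part of the Spark examples.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi from the fraction of random points inside the unit circle."""
    rng = random.Random(seed)  # local generator so a seed gives repeatable runs
    inside = 0
    for _ in range(num_samples):
        # Sample a point uniformly in the square [-1, 1] x [-1, 1]
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle-to-square area ratio is pi/4, so scale the hit fraction by 4
    return 4.0 * inside / num_samples


print(estimate_pi(100_000, seed=42))
```

In the `run-example` commands above, the trailing `10` is the number of partitions the sampling is split into, so the same loop runs in parallel on each partition.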

## Using pyspark

```bash
source /opt/envvars.sh
pyspark --master yarn
```

- Spark application UI: http://localhost:4040

```python
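# Read the input from HDFS; each RDD element is one line of text
text_file = sc.textFile("hdfs:///user/hadoop/shakespeare")
# Split lines into words, pair each word with 1, then sum the counts per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
# Write the result to HDFS (fails if the output directory already exists)
counts.saveAsTextFile("hdfs:///user/hadoop/shakespeare_result")
# Bring every (word, count) pair back to the driver
counts.collect()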
```
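
To make the three transformations above concrete, here is a plain-Python sketch that mimics them on a small in-memory list. It is not how Spark executes the job (there is no partitioning or shuffle); `word_count` and the sample lines are made up for illustration.

```python
from collections import defaultdict


def word_count(lines):
    """Mimic flatMap -> map -> reduceByKey on an in-memory list of lines."""
    # flatMap: split every line and flatten the results into one list of words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair each word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: combine the values of equal keys with a + b
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] = totals[word] + n
    return dict(totals)


sample = ["to be or not to be", "that is the question"]
print(word_count(sample))
```

Splitting on a single space keeps empty strings when a line contains consecutive spaces, so `""` can show up as a word; calling `line.split()` with no argument avoids that.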

```python
exit()
```

## Using Spark with Jupyter

In [None]:
%%bash

pip3 install findspark

In [None]:
import findspark
findspark.init()

import pyspark
import random

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

conf = SparkConf()
conf.setAppName("pi")
conf.setMaster('yarn')
conf.set('spark.yarn.dist.files','file:/opt/spark/python/lib/pyspark.zip,file:/opt/spark/python/lib/py4j-0.10.9-src.zip')
conf.setExecutorEnv('PYTHONPATH','pyspark.zip:py4j-0.10.9-src.zip')
sc = SparkContext(conf=conf)

num_samples = 10000

def inside(p):
    # p is the sample index and is ignored; each call draws a fresh random point
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()

pi = 4 * count / num_samples
print(pi)

sc.stop()

In [None]:
import findspark
findspark.init()

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

conf = SparkConf()
conf.setAppName("wordcount")
conf.setMaster('yarn')
conf.set('spark.yarn.dist.files','file:/opt/spark/python/lib/pyspark.zip,file:/opt/spark/python/lib/py4j-0.10.9-src.zip')
conf.setExecutorEnv('PYTHONPATH','pyspark.zip:py4j-0.10.9-src.zip')
sc = SparkContext(conf=conf)

text_file = sc.textFile("hdfs:///user/hadoop/shakespeare")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
# take(10) returns 10 pairs in no particular order, not the most frequent words
output = counts.take(10)
for (word, count) in output:
    print("%s: %i" % (word, count))

sc.stop()
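
Both Jupyter cells above hard-code the py4j archive name (`py4j-0.10.9-src.zip`), which changes between Spark releases. One possible way to avoid that is to discover the archives under `$SPARK_HOME/python/lib` at run time. The helper below is a hypothetical sketch (`pyspark_dist_settings` is not a Spark or findspark function); it only builds the two strings passed to `conf.set` and `conf.setExecutorEnv`.

```python
import glob
import os


def pyspark_dist_settings(spark_home):
    """Return (dist_files, pythonpath) built from the zips in python/lib."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    # Look for pyspark.zip and the py4j-<version>-src.zip archive
    # (if several py4j archives match, the last one in sorted order is used)
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not os.path.isfile(pyspark_zip) or not py4j_zips:
        raise FileNotFoundError("pyspark.zip or py4j zip not found in " + lib_dir)
    archives = [pyspark_zip, py4j_zips[-1]]
    # Value for spark.yarn.dist.files: comma-separated file: URIs
    dist_files = ",".join("file:" + path for path in archives)
    # Value for the executors' PYTHONPATH: bare file names, colon-separated
    pythonpath = ":".join(os.path.basename(path) for path in archives)
    return dist_files, pythonpath
```

With this in place, a cell could call `dist_files, pythonpath = pyspark_dist_settings(os.environ["SPARK_HOME"])` and pass the two values to `conf.set('spark.yarn.dist.files', dist_files)` and `conf.setExecutorEnv('PYTHONPATH', pythonpath)`.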

## pyspark-pictures

- https://github.com/jkthompson/pyspark-pictures/