# Spark
![Spark](https://spark.apache.org/images/spark-logo-trademark.png)

- https://spark.apache.org/

## Setup

- Version 3.0.1 (pre-built for Apache Hadoop 3.2 and later)

In [1]:
%%bash

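# Download the package (-c resumes a partial download).
# Older releases are moved from downloads.apache.org to archive.apache.org.
mkdir -p /opt/pkgs
cd /opt/pkgs
wget -q -c https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz

# Unpack the archive and create a version-independent link (-sfn replaces an existing link)
tar -zxf spark-3.0.1-bin-hadoop3.2.tgz -C /opt
ln -sfn /opt/spark-3.0.1-bin-hadoop3.2 /opt/spark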

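# Append the Spark settings to envvars.sh. The \$ escapes keep the variables
# unexpanded in the file; note that re-running this cell appends the block again.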
cat >> /opt/envvars.sh << EOF
# Spark
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
export PYTHONIOENCODING=utf8
export PATH=\${PATH}:\${SPARK_HOME}/bin

EOF

cat /opt/envvars.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export PDSH_RCMD_TYPE=ssh

export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}

export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

# Flume
export FLUME_HOME=/opt/flume
export PATH=${PATH}:${FLUME_HOME}/bin

# Sqoop
export SQOOP_HOME=/opt/sqoop
export PATH=${PATH}:${SQOOP_HOME}/bin

# Pig
export PIG_HOME=/opt/pig
export PATH=${PATH}:${PIG_HOME}/bin

# Hive
export HIVE_HOME=/opt/hive
export PATH=${PATH}:${HIVE_HOME}/bin

# Spark
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYTHONIOENCODING=utf8
export PATH=${PATH}:${SPARK_HOME}/bin



In [2]:
# Load environment variables into the kernel (-o overrides values that are already set), then list them
%load_ext dotenv
%dotenv -o /opt/envvars.sh
%env

{'HOSTNAME': 'hadoop',
 'OLDPWD': '/',
 'PWD': '/opt',
 'HOME': '/home/hadoop',
 'SHELL': '/bin/bash',
 'SHLVL': '1',
 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/hadoop/bin:/opt/hadoop/sbin:/opt/flume/bin:/opt/sqoop/bin:/opt/pig/bin:/opt/hive/bin:/opt/spark/bin',
 '_': '/usr/bin/nohup',
 'LANGUAGE': 'en.UTF-8',
 'LANG': 'en.UTF-8',
 'JPY_PARENT_PID': '1566',
 'TERM': 'xterm-color',
 'CLICOLOR': '1',
 'PAGER': 'cat',
 'GIT_PAGER': 'cat',
 'MPLBACKEND': 'module://ipykernel.pylab.backend_inline',
 'JAVA_HOME': '/usr/lib/jvm/java-1.8.0-openjdk-amd64',
 'PDSH_RCMD_TYPE': 'ssh',
 'HADOOP_HOME': '/opt/hadoop',
 'HADOOP_COMMON_HOME': '/opt/hadoop',
 'HADOOP_CONF_DIR': '/opt/hadoop/etc/hadoop',
 'HADOOP_HDFS_HOME': '/opt/hadoop',
 'HADOOP_MAPRED_HOME': '/opt/hadoop',
 'HADOOP_YARN_HOME': '/opt/hadoop',
 'FLUME_HOME': '/opt/flume',
 'SQOOP_HOME': '/opt/sqoop',
 'PIG_HOME': '/opt/pig',
 'HIVE_HOME': '/opt/hive',
 'SPARK_HOME': '/opt/spark',
 'PYSPARK_PYTHON': 'pyth
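
Once the variables are loaded, a quick sanity check can catch a missing `SPARK_HOME` or a `PATH` without `spark-submit` before any job is launched. The following is a minimal sketch in plain Python; `check_spark_env` is a hypothetical helper name (not part of Spark or python-dotenv), and the set of checks is an assumption about what matters for this setup.

```python
import os
import shutil


def check_spark_env(env=None):
    """Return a list of problems found; an empty list means the basics look fine.

    `env` is any mapping of variable names to values; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append(f"SPARK_HOME does not exist: {spark_home}")

    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set")

    # Look for the launcher on the PATH of the checked environment
    if shutil.which("spark-submit", path=env.get("PATH", "")) is None:
        problems.append("spark-submit is not on PATH")

    return problems


for problem in check_spark_env():
    print("WARNING:", problem)
```

If the loop prints nothing, the variables written to `/opt/envvars.sh` are visible to the kernel and the launcher can be found.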

## Example with Pi

In [5]:
%%bash

# Local execution (single worker thread); 2> /dev/null hides the log output
#$SPARK_HOME/bin/run-example --master local SparkPi 10 2> /dev/null

# Local execution with 4 worker threads (quoted so the shell does not treat [4] as a glob)
#$SPARK_HOME/bin/run-example --master "local[4]" SparkPi 10

# Execution using YARN
#$SPARK_HOME/bin/run-example --master yarn SparkPi 10

# Execution on YARN using spark-submit; the trailing 10 is the number of partitions (slices)
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
 $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 10

Pi is roughly 3.1416071416071416


2021-01-29 17:59:46,559 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-01-29 17:59:47,267 INFO spark.SparkContext: Running Spark version 3.0.1
2021-01-29 17:59:47,471 INFO resource.ResourceUtils: Resources for spark.driver:

2021-01-29 17:59:47,475 INFO spark.SparkContext: Submitted application: Spark Pi
2021-01-29 17:59:47,745 INFO spark.SecurityManager: Changing view acls to: hadoop
2021-01-29 17:59:47,746 INFO spark.SecurityManager: Changing modify acls to: hadoop
2021-01-29 17:59:47,747 INFO spark.SecurityManager: Changing view acls groups to: 
2021-01-29 17:59:47,748 INFO spark.SecurityManager: Changing modify acls groups to: 
2021-01-29 17:59:47,749 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set
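
The `SparkPi` example estimates π with a Monte Carlo method: it draws random points in the square [-1, 1] × [-1, 1] and counts how many fall inside the unit circle, and the fraction of hits approaches π/4. The sketch below reproduces that computation in plain Python so it runs without a cluster; the function names and the sample count are illustrative choices, not taken from the Spark sources.

```python
import random


def inside_unit_circle(_):
    """Draw one random point in [-1, 1] x [-1, 1]; return 1 if it lands in the unit circle."""
    x = random.uniform(-1.0, 1.0)
    y = random.uniform(-1.0, 1.0)
    return 1 if x * x + y * y <= 1.0 else 0


def estimate_pi(samples):
    """Estimate pi as 4 * (hits / samples)."""
    hits = sum(inside_unit_circle(i) for i in range(samples))
    return 4.0 * hits / samples


print("Pi is roughly", estimate_pi(200_000))
```

In the `pyspark` shell the same per-sample function could be distributed, for example with `sc.parallelize(range(n), 10).map(inside_unit_circle).reduce(lambda a, b: a + b)`, where the second argument to `parallelize` plays the role of the `10` passed to `SparkPi`.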

## Using pyspark

```bash
source /opt/envvars.sh
pyspark --master yarn
```

- Spark application UI (available while the shell is running): http://localhost:4040

```python
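# `sc` is the SparkContext that the pyspark shell creates at startup
text_file = sc.textFile("hdfs:///user/hadoop/shakespeare")
# split() without arguments splits on any whitespace and drops empty strings
counts = text_file.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
# saveAsTextFile fails if the output directory already exists
counts.saveAsTextFile("hdfs:///user/hadoop/shakespeare_result")
# collect() brings every (word, count) pair back to the driver
counts.collect()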
```

```python
exit()
```
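
To see what each stage of the word-count pipeline above produces, the same three steps can be traced on a small in-memory list in plain Python. This is only a sketch of the semantics of `flatMap`, `map`, and `reduceByKey`, not how Spark executes them; the two sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

# Made-up sample input standing in for the lines of the HDFS file
lines = ["to be or not to be", "that is the question"]

# flatMap: split every line into words and flatten the result into one list
words = list(chain.from_iterable(line.split() for line in lines))

# map: pair every word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts of pairs that share the same word
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
counts = dict(totals)

# Most frequent words first, ties broken alphabetically
print(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3])
# prints [('be', 2), ('to', 2), ('is', 1)]
```

In Spark the grouping step also shuffles data between executors so that all pairs with the same key end up in the same partition; the local loop above hides that cost.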

## pyspark-pictures

- https://github.com/jkthompson/pyspark-pictures/ (the PySpark API explained through pictures and short examples)