Showcase how to create a Python Spark application that can be launched in both client and cluster mode.
To run Spark in cluster mode it is necessary to ship the Spark application code with the spark-submit
command. To do so, we start by creating an egg file containing the code, as described in the setup.py file (packages property).
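For reference, a minimal setup.py for this layout might look like the sketch below. The name and version are inferred from the egg filename used later in spark-submit.sh; the rest is an assumption, not the repository's actual file:

from setuptools import setup

setup(
    name="spark-submit-cluster-python",
    version="0.1.0",
    # The packages property tells setuptools which Python packages
    # to bundle into the egg (here, the spark_cluster_mode package).
    packages=["spark_cluster_mode"],
)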
In the code's entry point (spark_cluster_mode/__init__.py) it is necessary to add the shipped package to the Python path so it can be imported:
import sys
sys.path.insert(0, "spark-submit-cluster-python-0.1.0-py2.7")
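For context, a minimal sketch of what the entry point could look like once the path is set. The spark_cluster_mode.jobs module and its run function are hypothetical placeholders for the actual job logic, not part of the repository:

import sys
sys.path.insert(0, "spark-submit-cluster-python-0.1.0-py2.7")

from pyspark.sql import SparkSession
from spark_cluster_mode import jobs  # hypothetical module shipped inside the egg

if __name__ == "__main__":
    spark = SparkSession.builder.appName("test_cluster_mode").getOrCreate()
    jobs.run(spark)  # hypothetical entry function with the actual job logic
    spark.stop()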
The spark-submit command must include the --py-files option
with the absolute path to the egg package (spark-submit.sh):
# CWD is assumed to point at the directory containing this script
# (its definition is not shown in the original snippet).
CWD="$(cd "$(dirname "$0")" && pwd)"

name=test_cluster_mode
app_path=$CWD/../spark_cluster_mode/__init__.py
master_mode=yarn
deploy_mode=cluster
spark_queue=spark
py_files=$CWD/../dist/spark_submit_cluster_python-0.1.0-py2.7.egg

spark-submit \
  --name $name \
  --master $master_mode \
  --deploy-mode $deploy_mode \
  --queue $spark_queue \
  --py-files $py_files \
  $app_path
Clone this repository to a Hadoop node and build the egg:
$ bash scripts/create-egg.sh
Now just run the code:
$ bash scripts/spark-submit.sh
To inspect the Spark application logs, run:
yarn logs --applicationId application_XXXXXXXXXXXX_XXXX
This project is licensed under the MIT License - see the LICENSE file for details.