
spark-submit-cluster-python

License: MIT

Showcase how to create a Python Spark application that can be launched in both client and cluster mode.

How it works

To run Spark in cluster mode, the application code must be shipped along with the spark-submit command. To do so, we first build an egg file containing the code, as described in the setup.py file (packages property).
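For reference, a minimal sketch of what such a setup.py might look like; the exact metadata is an assumption, with the name and version inferred from the egg filename used below:

# setup.py -- minimal sketch (assumed; the text only describes the packages property)
from setuptools import setup

setup(
    name="spark-submit-cluster-python",
    version="0.1.0",
    # "packages" lists the Python packages bundled into the egg
    packages=["spark_cluster_mode"],
)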

In the code's entry point (spark_cluster_mode/__init__.py), the egg must be added to the Python path before the package is imported:

import sys
# The egg shipped via --py-files lands in the container's working directory.
sys.path.insert(0, "spark_submit_cluster_python-0.1.0-py2.7.egg")
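The rest of the entry point is then an ordinary Spark job. A minimal sketch with a trivial, hypothetical job body (the actual job in this repository may differ):

from pyspark import SparkContext

def main():
    sc = SparkContext(appName="test_cluster_mode")
    # Trivial example job: sum the numbers 0..99 on the cluster.
    total = sc.parallelize(range(100)).sum()
    print("sum = %d" % total)
    sc.stop()

if __name__ == "__main__":
    main()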

The spark-submit command must include the --py-files option with the absolute path to the egg package (spark-submit.sh):

# Resolve the script's directory so the relative paths work from anywhere.
CWD="$(cd "$(dirname "$0")" && pwd)"

name=test_cluster_mode
app_path="$CWD/../spark_cluster_mode/__init__.py"
master_mode=yarn
deploy_mode=cluster
spark_queue=spark
py_files="$CWD/../dist/spark_submit_cluster_python-0.1.0-py2.7.egg"

spark-submit \
  --name "$name" \
  --master "$master_mode" \
  --deploy-mode "$deploy_mode" \
  --queue "$spark_queue" \
  --py-files "$py_files" \
  "$app_path"
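Since deploy_mode is an ordinary variable, the same script launches the application in client mode by setting deploy_mode=client; the driver then runs on the submitting host instead of inside a YARN container.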

Deploy

Clone this repository to a Hadoop node and build the egg:

$ bash scripts/create-egg.sh
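The build script is presumably a thin wrapper around setuptools; a minimal sketch of what it might contain (assumed, not taken from the repository):

# create-egg.sh -- assumed contents: build the egg into dist/
python setup.py bdist_egg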

Run

Now just run the code:

$ bash scripts/spark-submit.sh

See the logs

To inspect the Spark application logs, run:

yarn logs --applicationId application_XXXXXXXXXXXX_XXXX
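The application ID is printed in the spark-submit console output and is also shown in the YARN ResourceManager UI.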

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.
