# Hands-on Lab: Submit Apache Spark Applications

![https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/images/IDSN-logo.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/images/IDSN-logo.png)

**Estimated time needed: 20 minutes**

In this lab, you will learn how to submit Apache Spark applications from a Python script. Docker Compose makes setting up the cluster for this exercise straightforward.

## **Learning Objectives**

In this lab, you will:

- Install a Spark Master and Worker using Docker Compose
- Create a Python script containing a Spark job
- Submit the job to the cluster directly from Python (Note: you’ll learn how to submit a job from the command line in the Kubernetes lab)

## **Prerequisites**

Note: If you are running this lab within the Skills Network Lab environment, all prerequisites are already installed for you.

The only prerequisites for this lab are:

- A working *docker* installation
- Docker Compose
- The *git* command line tool
- A Python development environment

# **Install an Apache Spark cluster using Docker Compose**

On the right-hand side of these instructions, you’ll see the Cloud IDE. Select the *Lab* tab. On the menu bar, select *Terminal>New Terminal*.

![https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/images/NewTerminal.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/images/NewTerminal.png)

Get the latest code:



In [None]:
git clone https://github.com/big-data-europe/docker-spark.git




Change the directory to the downloaded code:



In [None]:
cd docker-spark




Start the cluster:



In [None]:
docker-compose up




After quite some time you should see the following message:

*Successfully registered with master spark://<server address>:7077*

![https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/images/reg_success.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/images/reg_success.png)

> NOTE: Please keep this terminal running and do not close it, as it is essential for further steps in the lab to run correctly.
>
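
Before moving on, it can help to confirm that the master is actually accepting connections on port 7077, the port shown in the registration message above. The following is a minimal sketch using only the Python standard library; the function name `master_is_reachable` is made up for this illustration, and the check only tests whether a TCP connection can be opened, not whether a Spark master is really listening behind it.

```python
import socket
from urllib.parse import urlparse


def master_is_reachable(master_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the master's host and port succeeds."""
    parsed = urlparse(master_url)
    if parsed.scheme != "spark" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected spark://host:port, got {master_url!r}")
    try:
        # Only the TCP handshake is tested; no Spark protocol traffic is sent.
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host could not be resolved.
        return False
```

Calling `master_is_reachable("spark://localhost:7077")` from a second terminal should return `True` once the cluster is up, assuming the Compose file publishes port 7077 on the local machine.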

# **Create code**

1. Click `Terminal` from the menu, and click `New Terminal` to open a new terminal window.

![https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/images/NewTerminal.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/images/NewTerminal.png)

1. Once the terminal opens at the bottom of the window, type the following command to create the Python script.



In [None]:
touch submit.py




A new Python file called `submit.py` is created.

1. Open the file in the file editor.

![https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/images/edit_submitpy.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/images/edit_submitpy.png)

1. Paste the following code into the file and save it.



In [None]:
import findspark
findspark.init()  # locates Spark via SPARK_HOME and makes pyspark importable
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType

sc = SparkContext.getOrCreate(SparkConf().setMaster('spark://localhost:7077'))  # connect to the cluster's master
sc.setLogLevel("INFO")

spark = SparkSession.builder.getOrCreate()  # reuses the SparkContext created above
df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    StructType(
        [
            StructField("id", IntegerType(), False),
            StructField("txt", StringType(), False),
        ]
    ),
)
print(df.dtypes)
df.show()
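
The script above hard-codes the master URL and never stops the Spark session. The following variant is a sketch rather than part of the lab: it sets the master on the session builder, reads the URL from an environment variable, and stops the session when the job finishes. The variable name `SPARK_MASTER_URL`, the application name, and both function names are assumptions introduced for this illustration.

```python
import os

DEFAULT_MASTER_URL = "spark://localhost:7077"  # master URL used by submit.py


def resolve_master_url(env=None) -> str:
    """Read the master URL from SPARK_MASTER_URL (hypothetical) or use the default."""
    env = os.environ if env is None else env
    return env.get("SPARK_MASTER_URL", DEFAULT_MASTER_URL)


def run_job() -> None:
    """Build the same two-row DataFrame as submit.py, then stop the session."""
    # Imported here so the helper above can be used without Spark installed.
    import findspark
    findspark.init()
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark = (
        SparkSession.builder
        .master(resolve_master_url())
        .appName("submit-lab-sketch")  # hypothetical application name
        .getOrCreate()
    )
    try:
        schema = StructType([
            StructField("id", IntegerType(), False),
            StructField("txt", StringType(), False),
        ])
        df = spark.createDataFrame([(1, "foo"), (2, "bar")], schema)
        print(df.dtypes)
        df.show()
    finally:
        # Stopping the session releases the application's resources on the cluster.
        spark.stop()
```

To try it, add a call to `run_job()` at the bottom of the file. The Spark imports sit inside the function so that `resolve_master_url` can be checked on its own, before `pyspark` and `findspark` are installed.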




# **Execute code / submit Spark job**

Now we execute the Python file we saved earlier.

1. In the terminal, run the following commands to upgrade pip to the latest version.



In [None]:
rm -r ~/.cache/pip/selfcheck/
pip3 install --upgrade pip
pip install --upgrade distro-info




> `rm -r ~/.cache/pip/selfcheck/` removes pip’s cached self-check data so that the latest version can be installed.
>
1. Enter the following command in the terminal to download Spark.



In [None]:
wget https://archive.apache.org/dist/spark/spark-3.3.3/spark-3.3.3-bin-hadoop3.tgz && tar xf spark-3.3.3-bin-hadoop3.tgz && rm -rf spark-3.3.3-bin-hadoop3.tgz





> This takes a while. The command downloads Spark as a compressed archive, extracts it into the current directory, and then deletes the archive.
>
1. Run the following commands to set `JAVA_HOME` (Java is preinstalled in the environment) and `SPARK_HOME` (the Spark distribution you just downloaded).



In [None]:
export JAVA_HOME=/usr/lib/jvm/java-1.11.0-openjdk-amd64
export SPARK_HOME=/home/project/spark-3.3.3-bin-hadoop3
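
Variables set with `export` exist only in the terminal session where they were set, so a script started from a different terminal will not see them. The sketch below is one way to check both variables before running the job; the function name `find_env_problems` is made up for this example, and it only verifies that each variable is set and points to an existing directory.

```python
import os
from pathlib import Path

REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME")


def find_env_problems(env=None) -> list:
    """Return human-readable problems with the variables this lab exports."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    return problems
```

An empty list from `find_env_problems()` means both variables are visible to Python and point to real directories. Any other result lists what needs fixing.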




1. Install the packages required to set up the Spark environment.



In [None]:
pip install pyspark


In [None]:
python3 -m pip install findspark




1. Type in the following command in the terminal to execute the Python script.



In [None]:
python3 submit.py




# **Experiment yourself**

Please have a look at the UI of the Apache Spark master and worker.

1. Click the button below to launch the `Spark Master`. Alternatively, click the Skills Network button on the left to open the “Skills Network Toolbox”. Then click `Other`, then `Launch Application`. From there, you should be able to enter the port number `8080` and launch.

Spark Master

![https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/images/Launch_Application--new_IDE.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/images/Launch_Application--new_IDE.png)

1. This will take you to the admin UI of the Spark master (if your popup blocker doesn’t prevent it).

![https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/images/spark_submit_lab_master_ui.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/images/spark_submit_lab_master_ui.png)

1. Notice that you can see all registered workers (one in this case) and submitted jobs (also one in this case).

> Note: Because of how the lab environment works, you can’t click links in the UI. In a real installation, this is of course possible.
>
1. Click the button below to open the `Spark Worker` on port 8081. Alternatively, click the Skills Network button on the left to open the “Skills Network Toolbox”. Then click `Other`, then `Launch Application`. From there, you should be able to enter the port number `8081` and launch.

Spark Worker

You should find your currently running job here as well.
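
Since links in the UI cannot be followed in the lab environment, the same information can also be read programmatically. The sketch below assumes that the standalone master serves a JSON summary at `/json` on its UI port and that the payload contains `workers` and `activeapps` lists with `state` and `name` fields. The path and field names are assumptions to verify against your Spark version, and both function names are made up for this example.

```python
import json
from urllib.request import urlopen


def summarize_master_state(state: dict) -> dict:
    """Count alive workers and list active application names from the payload."""
    workers = state.get("workers", [])
    apps = state.get("activeapps", [])
    return {
        "alive_workers": sum(1 for w in workers if w.get("state") == "ALIVE"),
        "active_apps": [a.get("name") for a in apps],
    }


def fetch_master_state(base_url: str = "http://localhost:8080") -> dict:
    """Download the master's JSON summary (endpoint path is an assumption)."""
    with urlopen(f"{base_url}/json", timeout=5) as response:
        return json.load(response)
```

With the cluster running, `summarize_master_state(fetch_master_state())` gives a compact view of what the master UI displays. Keeping the parsing separate from the HTTP call means it can be tried on a saved payload as well.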

# **Summary**

In this lab, you’ve learned how to set up an experimental Apache Spark cluster on top of Docker Compose. You are now able to submit a Spark job directly from Python code. In the Kubernetes lab, you’ll learn how to submit Spark jobs from the command line as well.

## **Author(s)**

Romeo Kienzler

[Lavanya T S](https://www.linkedin.com/in/lavanya-sunderarajan-199a445/)

### **© IBM Corporation. All rights reserved.**