This is a guide for running QBatcher on Apache Spark in a distributed Docker environment.
This manual assumes that the following environment is already in place. If it is not, set it up first and then follow the steps below.
- Multiple Linux-based servers (e.g., Ubuntu, CentOS).
- Docker installed on each server, with the user able to run it.
- Network connectivity between all servers, plus Internet access.
- One server chosen as the master, with its IP address identified.
- SSH access from the master to each worker.
Each machine in the cluster has an Intel Xeon E5-2450 CPU with 16 cores and 192 GB of RAM. On the software side, we installed Spark 3.0 and Hadoop 2.7.7 on Ubuntu 20.04.3 LTS. Keep disk usage below 80% before execution; otherwise Spark cannot be initialized.
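To check the 80% condition before starting, something like the following can be run on each machine (a sketch assuming GNU df, as shipped with Ubuntu):

```shell
# Warn if the root filesystem is already above 80% used.
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge 80 ]; then
  echo "disk usage is ${usage}%: free space before initializing Spark"
else
  echo "disk usage OK (${usage}%)"
fi
```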
The commands to be executed are written in code blocks with the following format.
[MACHINE]$ command
Here, MACHINE takes one of the following values. As you follow the manual, check that each command is being run in the correct location.
- master : Master node.
- master-docker : Inside docker container running on master node.
- worker : Worker node.
- worker-docker : Inside docker container running on worker node.
In a distributed programming environment there are usually multiple workers, so you must repeat the same setup on each worker machine. Name the worker machines worker1, worker2, worker3, and so on. In the manual below, workerX means to execute the instruction once for each worker machine, substituting X = 1, 2, 3, ...
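The per-worker repetition can also be scripted. The loop below is a minimal sketch; WORKERS and the echoed command are placeholders for your actual worker list and per-worker step (in practice the body would be e.g. `ssh "$W" '<setup command>'`):

```shell
# Iterate the same step over every worker machine.
WORKERS="worker1 worker2 worker3"
for W in $WORKERS; do
  echo "setting up $W"   # placeholder for the real per-worker command
done
```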
This step groups the machines where Docker is installed into one cluster.
- Connect via ssh to the master node.
- Enter the following command.
[master]$ docker swarm init --advertise-addr MASTER_IP
On success, the message Swarm initialized... appears, along with a docker swarm join ... command. Copy the entire command.
- (Via an ssh connection to each worker node) Paste the copied command and execute it. This joins each worker to the cluster network.
(example)
[worker]$ docker swarm join \
--token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-8vxv8rssmk743ojnwacrr2e7c \
MASTER_IP:2377
- (Back to the master node) Enter the following command.
[master]$ docker node ls
All machines to be included in the cluster (including the master) should now be listed.
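For reference, the output looks roughly like the sketch below; the IDs and hostnames will differ on your cluster, and the asterisk marks the node you are connected to:

```
ID              HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
abc123def456 *  master     Ready    Active         Leader
ghi789jkl012    worker1    Ready    Active
mno345pqr678    worker2    Ready    Active
```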
- Enter the following command.
[master]$ docker network create -d overlay --attachable spark-cluster-net
This step downloads the configuration files, creates the Docker images, and runs the containers.
- Download this repository on the master machine.
- Download the two datasets (i.e., the BRA and eBay datasets) into qbatcher/datasets.
  - Download the BRA dataset directory into qbatcher/datasets.
  - Download the eBay dataset directory into qbatcher/datasets.
- Download the two query sets and their parameters into qbatcher/querysets.
  - Download the brazilian-ecommerce directory into qbatcher/querysets.
  - Download the eBay directory into qbatcher/querysets.
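After these downloads, the repository should contain roughly the following layout (a sketch inferred from the directory names above):

```
qbatcher/
├── datasets/
│   ├── BRA/
│   └── eBay/
└── querysets/
    ├── brazilian-ecommerce/
    └── eBay/
```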
- Download the master docker image base-image-master.tar and load the docker image.
[master]$ docker load < base-image-master.tar
- Run the docker container. (Here, replace PATH_TO_DIR with the path of the repository.)
[master]$ docker run -itd --privileged -h master -v PATH_TO_DIR:/root/qbatcher --name master --net spark-cluster-net base-image-master
- Enter the docker container.
[master]$ docker exec -it master /bin/bash
- Enter the following commands and exit the docker container.
[master-docker]$ /root/init.sh
[master-docker]$ exit
- Download the worker docker image base-image-worker.tar on each worker machine.
- Load the docker image.
[worker]$ docker load < base-image-worker.tar
- Run the docker container. (Note workerX here: replace X with the number of each worker node (1, 2, 3, ...).)
[worker]$ docker run -itd --privileged -h workerX --name workerX --link master:master --net spark-cluster-net base-image-worker
- Enter the docker container. (Note workerX.)
[worker]$ docker exec -it workerX /bin/bash
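If you prefer to prepare the per-worker commands in one place, a loop like the following prints the concrete docker run line for each X (a sketch assuming three workers; adjust the range to your cluster):

```shell
# Print the docker run command for each worker, with X substituted.
for X in 1 2 3; do
  echo "[worker$X] docker run -itd --privileged -h worker$X --name worker$X --link master:master --net spark-cluster-net base-image-worker"
done
```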
- Go to /root/dev/hadoop-2.7.7/etc/hadoop and modify the corresponding files as follows.
In yarn-site.xml, add the configurations below.
A : (total memory of the worker node in MB) - 2048
B : (number of logical cores on the worker node) - 2
This setting lets each worker node use close to its maximum resources. A worker node does little other than distributed computation, so almost all of its resources are allocated to this task. The subtracted amounts reserve resources for system processes (all processes outside the YARN containers).
...
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>A</value>
</property>
...
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>B</value>
</property>
...
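As a concrete example, A and B can be computed on a worker node as follows (a sketch reading /proc/meminfo and nproc, both standard on Linux; on a node with exactly 196608 MB of RAM and 16 cores, like the machines described above, this gives A = 194560 and B = 14):

```shell
# Compute A (memory-mb) and B (cpu-vcores) for this node from the formulas above.
total_mb=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
cores=$(nproc)
echo "A (yarn.nodemanager.resource.memory-mb) = $(( total_mb - 2048 ))"
echo "B (yarn.nodemanager.resource.cpu-vcores) = $(( cores - 2 ))"
```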
In core-site.xml, change the configuration as follows.
<property>
<name>hadoop.tmp.dir</name>
<value>/root/qbatcher/tmp</value>
</property>
In hdfs-site.xml, change the configuration as follows.
<property>
<name>dfs.datanode.data.dir</name>
<value>/root/qbatcher/datanode</value>
</property>
- Set the password of the account.
[worker-docker]$ passwd
- Enter the following commands and exit the docker container.
[worker-docker]$ /root/init.sh
[worker-docker]$ exit
This step enables ssh communication between machines inside the Docker network.
- Enter the master docker container.
[master]$ docker exec -it master /bin/bash
- Repeat the following for all workers. You will be prompted for a password; enter the account password set earlier with passwd. (Note workerX.)
[master-docker]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@workerX
This step initializes HDFS and runs Hadoop and Spark services.
- Enter the master docker container.
[master]$ docker exec -it master /bin/bash
- Go to /root/dev/hadoop-2.7.7/etc/hadoop/ and modify each file as follows.
In yarn-site.xml, set the following values.
A, B : (total memory of the master node in MB) - 4096
C, D : (total number of logical cores of the master node) - 4
This setting allocates a moderate share of resources to the master node. Because the master handles execution ordering and resource management for the entire distributed application, it is given an appropriate amount of resources with some headroom for that role.
...
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>A</value>
</property>
...
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>B</value>
</property>
...
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>C</value>
</property>
...
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>D</value>
</property>
...
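For the 16-core, 192 GB machines described in this guide, the formulas above work out to:

```shell
# Worked example: master-node values for the hardware in this guide.
echo "A = B = $(( 192 * 1024 - 4096 )) MB"   # 192 GB total minus 4096 MB
echo "C = D = $(( 16 - 4 )) vcores"          # 16 cores minus 4
```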
In the slaves file, delete all existing lines and list the workers as follows, according to the number of workers. If you want the master node not only to manage the Spark program but also to perform actual work (the worker role), add master as the top line.
worker1
worker2
worker3
.....
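For example, on a three-worker cluster where the master also acts as a worker, the file would contain:

```
master
worker1
worker2
worker3
```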
In hdfs-site.xml, change the configuration as follows.
<property>
<name>dfs.namenode.name.dir</name>
<value>/root/qbatcher/namenode</value>
</property>
- Run the following commands in the docker container.
[master-docker]$ chmod 777 -R /root/qbatcher/
[master-docker]$ /root/qbatcher/scripts/setup.sh
Run the scripts in qbatcher/scripts. All three scripts generate result files (*.dat) in the directory given as [PATH_TO_OUTPUT_DIR] (e.g., /root/qbatcher/results).
[master-docker]$ cd /root/qbatcher/scripts; ./run-test.sh [PATH_TO_OUTPUT_DIR]