This is a simple example of Apache Spark working with Gluster, using glusterfs-hadoop.
To build the project, just run:
```
./mvnw package
```

The application jar will be written to `target/spark-gluster-example-<version>.jar`.
A working Gluster cluster is required. If you are looking for a simple way to test locally, we recommend using carmstrong/multinode-glusterfs-vagrant.
For this example, we will assume that:

- the `SPARK_HOME` environment variable contains the path to an Apache Spark distribution;
- `HADOOP_CONF_DIR` points to a directory with a Hadoop `core-site.xml` containing some Gluster configuration.

In `conf/core` you will find a minimal working example, assuming an existing Gluster volume named `gv0` mounted at `/mnt/gv0`.
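As a rough illustration of what such a `core-site.xml` contains, the fragment below is a sketch for the glusterfs-hadoop plugin, assuming the `gv0` volume and `/mnt/gv0` mount point above; property names may vary between plugin versions, so check the one you actually ship:

```xml
<configuration>
  <!-- Use the glusterfs-hadoop filesystem implementation -->
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
  <!-- Make glusterfs:/// the default filesystem -->
  <property>
    <name>fs.defaultFS</name>
    <value>glusterfs:///</value>
  </property>
  <!-- Gluster volume name and its local FUSE mount point -->
  <property>
    <name>fs.glusterfs.volumes</name>
    <value>gv0</value>
  </property>
  <property>
    <name>fs.glusterfs.volume.fuse.gv0</name>
    <value>/mnt/gv0</value>
  </property>
</configuration>
```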
Now you can run:
```
$SPARK_HOME/bin/spark-submit \
  --master 'local[2]' \
  target/spark-gluster-example-0.1.0-SNAPSHOT.jar
```
This will generate the numbers from 1 to 100000, write them to Gluster in Parquet format, read them back, and compare the result with the original. If they match, it will print `OK` and exit with status 0; otherwise, it will print `KO` and exit with status 1.
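Stripped of Spark, Parquet, and Gluster, the round-trip check follows a simple write/read/compare pattern. The stdlib-only sketch below illustrates it with a local text file; `round_trip_check` is a hypothetical name, not part of the example jar:

```python
# Sketch of the round-trip check: write a dataset out, read it back,
# and compare with the original (the real example uses Parquet on Gluster).
import os
import tempfile


def round_trip_check(n=100000):
    original = list(range(1, n + 1))
    # Write the numbers out, one per line.
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as f:
        f.write("\n".join(map(str, original)))
        path = f.name
    # Read them back and compare with the original.
    with open(path) as f:
        read_back = [int(line) for line in f]
    os.unlink(path)
    return read_back == original


if __name__ == "__main__":
    ok = round_trip_check()
    print("OK" if ok else "KO")
    raise SystemExit(0 if ok else 1)
```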
This example is released under the terms of the Apache 2 License.