This project is a showcase for Hadoop with CDC (Change Data Capture) on Quarkus.

The following make targets exist in the subfolder `podman`:
- pd-machine-create - Create a suitable QEMU machine
- pd-pod-create - Create the pod with port mappings
- pd-pod-rm - Remove the pod
- pd-pod-recreate - Remove and recreate the pod
- pd-build - Build all images
- pd-init - Create the machine and the pod and build all images
- pd-start - Start all containers
The following make targets are available here (see the usage example after the list):
- todo - Create a todo entry via curl
- list - List todo entries via curl
- kat-listen - Listen for Kafka messages
- kat-send - Send a Kafka message
- psql - Use the psql CLI to connect to Postgres
- beeline - Start the Beeline CLI and connect to Hive
- beeline-hive-select - Select data from hive_todos
- beeline-debezium-select - Select data from debezium_todos
- beeline-spark-select - Select data from spark_messages and spark_todos
- spark-shell - Start the Spark shell and connect to Spark
- spark-beeline - Start Spark Beeline and connect to Spark
- data-init - Initialize all Hive data
- copy - Copy the Scala jar into the Hadoop container
- open-namenode - Open the namenode in a browser
- open-datanode - Open the datanode in a browser
- open-spark-master - Open the Spark master in a browser
- open-spark-slave - Open the Spark slave in a browser
- open-spark-shell - Open the Spark shell in a browser
- open-resourcemanager - Open the ResourceManager in a browser
- open-debezium - Open Debezium in a browser
- open-app - Open the Quarkus Dev tools in a browser
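For example, once the stack is up (see the setup steps below), a todo entry can be created and the resulting CDC flow inspected with the targets above:

```shell
make todo                      # create a todo entry via curl
make list                      # list the todo entries via curl
make kat-listen                # watch the CDC messages arriving on Kafka
make beeline-debezium-select   # check the replicated data in debezium_todos
```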
- Create podman machine: `make -C podman pd-machine-create`
- Start podman machine: `make -C podman pd-machine-start`
- Create pod: `make -C podman pd-pod-create`
- Build all containers: `make -C podman pd-build`
- Start all containers: `make -C podman pd-start`
- Init Hive tables: `make init`
- Compile the Scala jar (a minimal sketch of such a Spark job follows after this list): `make scala`
- SSH into the Hadoop pod: `make ssh`
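The actual job is part of this repository; purely as an illustration, a minimal Spark Structured Streaming job that reads the Debezium CDC topic from Kafka and writes it to HDFS could look roughly like this (bootstrap servers, topic name, and paths are assumptions, not taken from this project):

```scala
import org.apache.spark.sql.SparkSession

object TodoStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("todo-cdc-stream")
      .enableHiveSupport()
      .getOrCreate()

    // Read the Debezium CDC topic; bootstrap server and topic name are assumptions.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "showcase.public.todos")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Write each micro-batch as Parquet to HDFS; the paths are placeholders.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs://localhost:9000/warehouse/spark_todos")
      .option("checkpointLocation", "hdfs://localhost:9000/checkpoints/spark_todos")
      .start()
      .awaitTermination()
  }
}
```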
Apparently, the datanodes use components such as native C libraries that I could not keep from dumping core with Alpine/musl.
Starting with Podman 4.4.1, the default privileges for chroot were dropped, which led to the following problems on connection:

```
ssh: Connection closed by 127.0.0.1 port 22
sshd: chroot("/run/sshd"): Operation not permitted [preauth]
```
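A possible workaround, not taken from this repository and therefore an assumption, is to explicitly grant the container the capability sshd needs for its chroot, e.g.:

```shell
# Assumption: grant SYS_CHROOT when starting the container; pod, container, and image names are placeholders.
podman run --cap-add=SYS_CHROOT --pod hadoop --name hadoop-ssh <hadoop-image>
```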
Errors like the following indicate a Scala version mismatch:

```
java.lang.NoSuchMethodError: 'scala.collection.immutable.ArraySeq scala.runtime.ScalaRunTime$.wrapRefArray(java.lang.Object[])'
Caused by: java.lang.ClassNotFoundException: scala.$less$colon$less
```

Make sure the Scala version of the jars/dependencies matches the Scala version of Spark. This can easily be checked with `mvn dependency:tree`.
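Illustrative output only (artifacts and versions are placeholders): the suffix of the Spark artifact id (_2.12 vs. _2.13) is the Scala binary version and has to match the scala-library on the classpath.

```
[INFO] +- org.apache.spark:spark-sql_2.12:jar:3.3.2:provided
[INFO] +- org.scala-lang:scala-library:jar:2.12.17:compile
```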
The JDBC connection string for Hive, for either anonymous or hduser, is the following:

```
jdbc:hive2://localhost:10000/default
```
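For example, to connect with Beeline as hduser (omit `-n` for an anonymous connection):

```shell
beeline -u jdbc:hive2://localhost:10000/default -n hduser
```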
And adding the external Debezium table works:

```sql
add jar /home/hduser/hive/lib/iceberg-hive-runtime-1.1.0.jar;
create external table debezium
  stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
  location 'hdfs://localhost:9000/warehouse/debeziumevents/debeziumcdc_showcase_public_todos'
  TBLPROPERTIES ('iceberg.catalog'='location_based_table');
```
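The table can then be queried like any other Hive table, for example:

```sql
select * from debezium limit 10;
```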
Spark executors use the JAVA_HOME value that was submitted with the job.
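A hedged sketch of submitting it explicitly through the executor environment (the JAVA_HOME path refers to a standard Spark config key, but path, class, and jar names are placeholders tied to the hypothetical sketch above):

```shell
spark-submit \
  --conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-17-openjdk \
  --class TodoStream \
  todo-stream.jar
```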
Hadoop includes a Jakarta-enabled version of Jetty, but even in v3.3.6 many of the servlets still implement javax.servlet.* interfaces, and this does not work in a Jakarta project:

```
java.lang.RuntimeException: java.lang.NoSuchMethodError: 'void org.eclipse.jetty.servlet.ServletHolder.<init>(javax.servlet.Servlet)'
```
Useful links:

- https://medium.com/analytics-vidhya/hadoop-single-node-cluster-on-docker-e88c3d09a256
- https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
- https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster
- https://stackoverflow.com/questions/41266403/how-to-access-hadoop-web-ui-in-linux
- https://www.ibm.com/docs/el/db2-big-sql/5.0?topic=applications-impersonation-in-big-sql
- https://www.dremio.com/blog/introduction-to-apache-iceberg-using-spark/
- https://spark.apache.org/docs/latest/sql-getting-started.html
- https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
- https://sparkbyexamples.com/apache-hive/how-to-connect-spark-to-remote-hive/
- https://sparkbyexamples.com/spark/spark-split-dataframe-column-into-multiple-columns/
- https://www.adaltas.com/en/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/