This repo was initially created to build a 3-node Hadoop cluster (1 master, 2 workers) with Hadoop, Hive and Spark, using docker.io containers. You can read more about the original project here: https://medium.com/@aditya.pal/setup-a-3-node-hadoop-spark-hive-cluster-from-scratch-using-docker-332dae6b98d0
UPDATE: Thanks to pedro-glongaron, this project now has 1 master, 2 workers, 1 edge node (with Flume, Sqoop and Kafka!), 1 Hue service node, 1 Zeppelin service node and 1 NiFi node.
NOTE: Please verify that the download links in the Dockerfiles are still active.
- For Hadoop: choose any version > 2.0 from https://archive.apache.org/dist/hadoop/core/
- For Hive: choose any version > 2.0.0 (preferably < 3.0) from https://archive.apache.org/dist/hive/
- For Spark: choose any version > 2.0 from https://archive.apache.org/dist/spark/
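For example, a pinned download URL can be assembled and checked before editing the Dockerfiles. This is only a sketch: `3.2.4` below is an illustrative version number, so substitute whichever version you picked from the archive.

```shell
# Hypothetical version pin -- substitute the version you chose from the archive.
HADOOP_VERSION=3.2.4
HADOOP_URL="https://archive.apache.org/dist/hadoop/core/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
echo "$HADOOP_URL"

# Verify the link is still active before building (prints the HTTP status code, 200 = OK):
# curl -sI -o /dev/null -w '%{http_code}\n' "$HADOOP_URL"
```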
UPDATE: Added PySpark support by installing Python 2.7. Change this to your preferred Python version in the Spark Dockerfile.
## Instructions for use
- Run `cd hadoop_hive_spark_docker`.
- Build the cluster images with `./build.sh`.
- Once all images are built, start the cluster with `./cluster.sh start`.
- Verify the running containers with `docker ps -as`. The `nodemaster`, `node2`, `node3`, `psqlhms`, `edge`, `nifi`, `huenode` and `zeppelin` containers should be running.
- Enter any container like this: `docker exec -u hadoop -it nodemaster /bin/bash`
- Once all work is done, bring the cluster down with `./cluster.sh stop`.
## Tests done with Hive
- Copy the file from the local machine into the `nodemaster` container: `docker cp test_data.csv nodemaster:/tmp/`
- Enter `nodemaster`: `docker exec -u hadoop -it nodemaster /bin/bash`
- Create a directory in HDFS: `hadoop@nodemaster:/$ hdfs dfs -mkdir -p /user/hadoop/test`
- Put the file from the container's local filesystem into HDFS: `hadoop@nodemaster:/$ hdfs dfs -put /tmp/test_data.csv /user/hadoop/test/`
- Start Hive: `hadoop@nodemaster:/$ hive`
- In the Hive terminal: `hive> create schema if not exists test;`
- In the Hive terminal: `hive> create external table if not exists test.test_data (row1 int, row2 int, row3 decimal(10,3), row4 int) row format delimited fields terminated by ',' stored as textfile location 'hdfs://172.20.1.1:9000/user/hadoop/test/';`
-
### Results
```
hive> select * from test.test_data where row3 > 2.499;
OK
1	122	5.000	838985046
1	185	4.500	838983525
1	231	4.000	838983392
1	292	3.500	838983421
1	316	3.000	838983392
1	329	2.500	838983392
1	377	3.500	838983834
1	420	5.000	838983834
1	466	4.000	838984679
1	480	5.000	838983653
1	520	2.500	838984679
1	539	5.000	838984068
1	586	3.500	838984068
1	588	5.000	838983339
Time taken: 0.175 seconds, Fetched: 14 row(s)
```
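As a quick offline sanity check, the same `row3 > 2.499` filter can be reproduced with plain Python before involving the cluster. The sample rows below are hypothetical stand-ins for the contents of `test_data.csv`; the column layout assumes the table definition above (int, int, decimal, int).

```python
import csv
import io
from decimal import Decimal

# Hypothetical stand-in for test_data.csv (four comma-separated columns).
sample = """1,122,5.000,838985046
1,329,2.500,838983392
1,999,2.499,838980000
"""

def filter_rows(fileobj, threshold=Decimal("2.499")):
    """Keep rows whose third column (row3) exceeds the threshold,
    mirroring: select * from test.test_data where row3 > 2.499"""
    reader = csv.reader(fileobj)
    return [row for row in reader if Decimal(row[2]) > threshold]

rows = filter_rows(io.StringIO(sample))
print(rows)  # the 2.499 row is excluded, matching the strict > in the query
```

Decimal comparison (rather than float) matches Hive's `decimal(10,3)` semantics for the boundary value 2.499.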