This repo was initially created to build a 3-node Hadoop cluster (1 master, 2 workers) with Hadoop, Hive and Spark, using docker.io containers. You can read more about the original project here: https://medium.com/@aditya.pal/setup-a-3-node-hadoop-spark-hive-cluster-from-scratch-using-docker-332dae6b98d0
UPDATE: Thanks to pedro-glongaron, this project now has 1 master, 2 workers, 1 edge node (with Flume, Sqoop and Kafka!), 1 Hue service node, 1 Zeppelin service node and 1 NiFi node.
NOTE: Please verify that the download links in the Dockerfiles are still active.
- For Hadoop: choose any version > 2.0 from https://archive.apache.org/dist/hadoop/core/
- For Hive: choose any version > 2.0.0 (preferably < 3.0) from https://archive.apache.org/dist/hive/
- For Spark: choose any version > 2.0 from https://archive.apache.org/dist/spark/
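For example, a pinned download URL can be assembled and checked before editing the Dockerfiles. This is only a sketch: `3.2.4` below is an illustrative version number, so substitute whichever version you picked from the archive.

```shell
# Hypothetical version pin -- substitute the version you chose from the archive.
HADOOP_VERSION=3.2.4
HADOOP_URL="https://archive.apache.org/dist/hadoop/core/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
echo "$HADOOP_URL"

# Verify the link is still active before building (prints the HTTP status code, 200 = OK):
# curl -sI -o /dev/null -w '%{http_code}\n' "$HADOOP_URL"
```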
UPDATE: Added PySpark support by installing Python 2.7. Change this to your preferred Python version in the Spark Dockerfile.
## Instructions for use
- Run `cd hadoop_hive_spark_docker`.
- Build the cluster images with `./build.sh`.
- Once all images are built, start the cluster with `./cluster.sh start`.
- Verify the running containers with `docker ps -as`. The `nodemaster`, `node2`, `node3`, `psqlhms`, `edge`, `nifi`, `huenode` and `zeppelin` containers should be running.
- Enter any container like this: `docker exec -u hadoop -it nodemaster /bin/bash`
- Once all work is done, bring the cluster down with `./cluster.sh stop`.
## Tests done with Hive
- Copy the file from the local machine into the `nodemaster` container: `docker cp test_data.csv nodemaster:/tmp/`
- Enter `nodemaster`: `docker exec -u hadoop -it nodemaster /bin/bash`
- Create a directory in HDFS: `hadoop@nodemaster:/$ hdfs dfs -mkdir -p /user/hadoop/test`
- Put the file from the container's local filesystem into HDFS: `hadoop@nodemaster:/$ hdfs dfs -put /tmp/test_data.csv /user/hadoop/test/`
- Start Hive: `hadoop@nodemaster:/$ hive`
- In the Hive terminal: `hive> create schema if not exists test;`
- In the Hive terminal: `hive> create external table if not exists test.test_data (row1 int, row2 int, row3 decimal(10,3), row4 int) row format delimited fields terminated by ',' stored as textfile location 'hdfs://172.20.1.1:9000/user/hadoop/test/';`
-
### Results
```
hive> select * from test.test_data where row3 > 2.499;
OK
1	122	5.000	838985046
1	185	4.500	838983525
1	231	4.000	838983392
1	292	3.500	838983421
1	316	3.000	838983392
1	329	2.500	838983392
1	377	3.500	838983834
1	420	5.000	838983834
1	466	4.000	838984679
1	480	5.000	838983653
1	520	2.500	838984679
1	539	5.000	838984068
1	586	3.500	838984068
1	588	5.000	838983339
Time taken: 0.175 seconds, Fetched: 14 row(s)
```
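As a quick offline sanity check, the same `row3 > 2.499` filter can be reproduced with plain Python before involving the cluster. The sample rows below are hypothetical stand-ins for the contents of `test_data.csv`; the column layout assumes the table definition above (int, int, decimal, int).

```python
import csv
import io
from decimal import Decimal

# Hypothetical stand-in for test_data.csv (four comma-separated columns).
sample = """1,122,5.000,838985046
1,329,2.500,838983392
1,999,2.499,838980000
"""

def filter_rows(fileobj, threshold=Decimal("2.499")):
    """Keep rows whose third column (row3) exceeds the threshold,
    mirroring: select * from test.test_data where row3 > 2.499"""
    reader = csv.reader(fileobj)
    return [row for row in reader if Decimal(row[2]) > threshold]

rows = filter_rows(io.StringIO(sample))
print(rows)  # the 2.499 row is excluded, matching the strict > in the query
```

Decimal comparison (rather than float) matches Hive's `decimal(10,3)` semantics for the boundary value 2.499.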