
Hadoop, Spark, Hive

This project provides Docker images for a small cluster running Hadoop, Spark and Hive.

Setup

Build all the Docker images listed below (it will take a while ...)

Build the Hive image:

cd hive
docker build -t hive-db-img-centos7 .

Build the MySQL image (used by Hive):

cd mysql
docker build -t mysql-for-hive-img .

Build the Hadoop master image (it contains Spark too):

cd master
docker build -t hadoop-master-img-centos7 .

Build the Hadoop slave image:

cd slave
docker build -t hadoop-slave-img-centos7 .
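
As a quick sanity check once all four builds finish, you can confirm the images are present:

docker images | grep -E 'hive-db-img-centos7|mysql-for-hive-img|hadoop-master-img-centos7|hadoop-slave-img-centos7'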

Create the bridge network:

docker network create -d bridge my-bridge-network
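
To verify the network was created, you can inspect it:

docker network inspect my-bridge-network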

Then, from the root project folder, run (read below first if you are using Windows):

./setup.sh

For Windows, run ./setup_windows.bat from a Bash shell.

If you are on a Windows machine with Git Bash installed, follow this workaround to mitigate the path interpretation issues: https://github.com/borekb/docker-path-workaround

Answer y when you are prompted to remove the kernel spec.

Also, if you hit org.freedesktop.PolicyKit1 was not provided by any .service files errors, you may have to reinstall polkit on hadoop-master:

docker exec -it --user root hadoop-master bash
yum reinstall polkit

The setup is not written up step by step here; please refer to the scripts in the database and master folders to see exactly what they do.

How do I know if I set it up correctly?

Go to http://localhost:8088 and you should see the Hadoop (YARN ResourceManager) web UI with 2 active nodes!
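
You can also check the active nodes from the command line (this assumes the yarn binary is on the container's PATH):

docker exec -it hadoop-master yarn node -list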

Try running a spark-submit command, then go to http://localhost:8080 to see the active jobs!

docker exec -it hadoop-master /bin/bash
cd spark
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    examples/jars/spark-examples*.jar \
    10
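
Once the job finishes, still inside the container, you can list finished YARN applications and grep the aggregated driver logs for SparkPi's output; <application_id> below is a placeholder you need to fill in from the list:

yarn application -list -appStates FINISHED
yarn logs -applicationId <application_id> | grep -i "pi is roughly"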

Check that you can access the Hive tables:

docker exec -it hadoop-master /bin/bash
spark/bin/spark-shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("show databases").show()

To start your Jupyter notebook, run:

docker exec -it --user hadoop hadoop-master jupyter notebook --ip=0.0.0.0 --port=8081
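
Jupyter prints a URL with a login token on startup; if you need the token again later, you can list the running notebook servers:

docker exec -it --user hadoop hadoop-master jupyter notebook list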

If your containers have stopped, use:

./startup.sh
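
To confirm the cluster came back up, you can list the running containers and their status:

docker ps --format '{{.Names}}: {{.Status}}'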

Yay!