
Apache Hadoop Cluster Docker images

Before we start

I'm falling in love with Hadoop and Docker, and I want to leverage them to "containerize" a Hadoop cluster. I created this project to build and run Hadoop modules inside Docker containers. It is just for practice and has not been tested in a production environment.

Use your Docker repo

If you wish to customize the source code and push the images to your own Docker repo for future use, you will need to change vietanh85 (my Docker account) to your account in the docker-compose.*.yml files. Look for the image property of each service.
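If you prefer not to edit the files by hand, a quick substitution works too. This is a minimal sketch, assuming GNU sed and that your-account stands in for your real Docker Hub account name:

# Replace my account name with yours in every compose file (GNU sed; macOS users need sed -i '')
$ sed -i 's/vietanh85/your-account/g' docker-compose.*.yml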

Why Dockerfile.onbuild and so many Dockerfile.*?

For this practice, I'm going to use two modules of the Hadoop system:

  • Yarn for node/resource management
  • HDFS for storage

Both come in the same Hadoop package; the only difference is the startup script. To save code and effort, I decided to create a base image for all the modules. That way, the Docker engine does not have to download the Hadoop package every time it builds the images. The hierarchy of our images is shown below:

[Image: Docker images hierarchy]
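If you ever want to build the images by hand instead of through docker-compose, the hierarchy implies building the base image first. A sketch only; the file names other than Dockerfile.onbuild and the image tags here are hypothetical:

# Build the shared base image first, so the Hadoop package is downloaded only once
$ docker build -f Dockerfile.onbuild -t your-account/hadoop-base .

# Each module image then builds FROM the base, which triggers its ONBUILD instructions
$ docker build -f Dockerfile.hdfs -t your-account/hadoop-hdfs .
$ docker build -f Dockerfile.yarn -t your-account/hadoop-yarn .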

Why docker-compose.build.yml?

You may know that docker-compose is a great tool to define your services with their dependencies and wire them up with a single command: docker-compose up.

If I used the same docker-compose.yml file for both building and running, docker-compose would automatically wire up all the services, including the base (onbuild) image, which I don't want to start. So I decided to create docker-compose.build.yml just for building.

Build the images

To build the images automatically, simply run this command and docker-compose will take care of the rest:

docker-compose -f docker-compose.build.yml build

To check the build result, run docker images; you will see your new images there.

Start the containers

Start a Pseudo-Distributed container

You can run Hadoop in Pseudo-Distributed Operation with this command:

docker-compose -f docker-compose.pseudo.yml up

Now your Pseudo-Distributed Hadoop system is ready, with HDFS and Yarn up and running inside a single container. If you run docker ps, you will see that one new container has been started. To access the HDFS NameNode web interface, go to http://localhost:50070; for the ResourceManager, go to http://localhost:8088.
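You can also verify both web interfaces from the command line. A quick sketch with curl, assuming the default ports are mapped to localhost as above:

# Expect an HTTP 200 from the NameNode and ResourceManager UIs
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088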


Testing

By default, docker-compose will name your container hadoopdocker_hadoop_pseudo_1; to confirm the name, run docker ps. Below are the steps to test your container:

# Make the HDFS directories required to execute MapReduce jobs
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hdfs dfs -mkdir -p /user/root/input"

# Copy the input files into the distributed filesystem
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hdfs dfs -put etc/hadoop/*.xml input"

# Run some of the examples provided:
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'"

# View the output files on the distributed filesystem:
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hdfs dfs -cat output/*"
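If you want to examine the results outside the container, you can copy them to the host. A minimal sketch, assuming the container name above and the output directory produced by the example job:

# Copy the job output from HDFS into the container's filesystem, then out to the host
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hdfs dfs -get output /tmp/output"
$ docker cp hadoopdocker_hadoop_pseudo_1:/tmp/output ./output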

Start Cluster containers

Now it's time to run your Hadoop system in "fully" distributed mode. I put the word "fully" in double quotes since it is not a real fully distributed system running on multiple hosts. Instead, we will run our cluster in multiple containers, all of them on one host.


To start your Hadoop Cluster, you can run this command:

docker-compose -f docker-compose.cluster.yml up

docker-compose will start three containers:

  • HDFS (name node)
  • Yarn (both resource and node manager)
  • HDFS (data node)

To see your containers, run docker ps.
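Alternatively, docker-compose can list just the services defined in this compose file:

$ docker-compose -f docker-compose.cluster.yml ps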

You can easily scale your data nodes using docker-compose as well:

docker-compose -f docker-compose.cluster.yml scale hdfs_data=3
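After scaling, you can confirm that the data nodes have registered with the name node. A sketch only; the container name below assumes the name-node service is called hdfs_name and follows docker-compose's default naming, so check docker ps for the actual name on your machine:

# Ask the name node to report its live data nodes
$ docker exec hadoopdocker_hdfs_name_1 bash -c "hdfs dfsadmin -report"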

Start Cluster on multiple hosts

Issue: moby/swarmkit#939

TBD

Todos

  • Run Pseudo-Distributed Operation
  • Run Cluster Operation
  • Use Docker Swarm to run and deploy in multiple hosts
