An example integration of StreamSets and HDFS, using Docker

zketley/streamsets-hdfs-demo

StreamSets demo

You must accept the Oracle Binary Code License Agreement for Java SE to use the StreamSets image.

This project provides a simple way to spin up StreamSets and standalone Hadoop Docker containers. Linux, Windows and Mac are all supported. On Linux, the Docker containers are started directly. On Mac and Windows, they are started inside an Ubuntu 16.04 VirtualBox host VM. An example pipeline is provided, based on the Taxi payments tutorial. The pipeline writes its output to HDFS running in the Hadoop container.

Getting started

Install dependencies

Windows / Mac only

  • Install Vagrant. Vagrant version 1.9.1 is tested.
  • Install VirtualBox. VirtualBox version 5.1.14 is tested.

Linux only

NOTE: If running on Ubuntu, you can install all these dependencies by running

sudo host/install.sh $USER
sudo usermod -aG docker $USER
  • Test that docker commands can be run without sudo
docker ps

Start Docker hosts

Windows / Mac

From the root of the project, run

vagrant up

Note that you will be prompted by UAC to grant admin privileges. This is so that your /etc/hosts file can be modified automatically.

Linux

From the root of the project, run

docker-compose up -d
Making HDFS download links work

If you wish to access the HDFS GUI from another machine, that machine must resolve the hostname hdfs to your Linux host. On the computer you want to access the HDFS GUI from, add the following entry to your

  • (On Windows) C:\Windows\System32\drivers\etc\hosts file
  • (On Linux / Mac) /etc/hosts
<IP of your Linux host> hdfs
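
Hosts files expect the IP address first, followed by the hostname. As a rough sketch of adding that entry without creating duplicates, the helper below appends a line only when the name is not already mapped. The function name add_hosts_entry and the address 192.0.2.10 are illustrative assumptions, not part of this project.

```shell
# Hypothetical helper: append "<ip> <name>" to a hosts-style file
# unless the name is already mapped on a non-comment line.
add_hosts_entry() {
    hosts_file="$1"
    ip="$2"
    name="$3"
    # Look for the name as a whole whitespace-delimited field.
    if grep -Eq "^[^#]*[[:space:]]${name}([[:space:]]|\$)" "$hosts_file" 2>/dev/null; then
        echo "$name is already present in $hosts_file"
    else
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
}

# Example (placeholder address, run as root on Linux / Mac):
# add_hosts_entry /etc/hosts 192.0.2.10 hdfs
```

Trying the function against a copy of the hosts file first is a cheap way to confirm it behaves as expected before touching the real one.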

Import the example StreamSets pipeline and run it

  • Wait for HDFS startup to finish. You can verify this by checking that "Safemode is off" appears beneath Summary in the HDFS GUI Overview tab.
  • From the StreamSets GUI, import the pipeline in examples/ and run it.
  • Navigate to http://localhost:50070/explorer.html#/opt/files/destination to see the output files. While the pipeline is running, the output files are prefixed _tmp_. Once you stop the pipeline, they are renamed with the prefix out_. You can download them from this page.
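
The safe mode check in the first step can also be scripted. The sketch below polls a status command until its output contains "Safe mode is OFF", which is what hdfs dfsadmin -safemode get prints once the NameNode has left safe mode. The function names are hypothetical, and the example invocation assumes the Hadoop service in docker-compose.yml is called hdfs and has the hdfs binary on its PATH; adjust both to match your setup.

```shell
# Succeed when the given "dfsadmin -safemode get" output reports safe mode off.
safemode_off() {
    case "$1" in
        *"Safe mode is OFF"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Run a status command repeatedly until safe mode is off
# or the attempt budget (default 30) is used up.
wait_for_hdfs() {
    status_cmd="$1"
    attempts="${2:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # $status_cmd is deliberately unquoted so it splits into command and arguments.
        if safemode_off "$($status_cmd 2>/dev/null)"; then
            return 0
        fi
        i=$((i + 1))
        sleep 2
    done
    return 1
}

# Example (service name "hdfs" is an assumption):
# wait_for_hdfs "docker-compose exec -T hdfs hdfs dfsadmin -safemode get" && echo "HDFS is ready"
```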

Other bits and pieces

(Windows / Mac only) Accessing the Ubuntu host

  • To SSH into the host VM, run
vagrant ssh

To add a new StreamSets package

  • From within the StreamSets app, navigate to Package Manager
  • Copy the library name you want
  • Uncomment the relevant build lines in docker-compose.yml
  • Comment out the sdc image line in docker-compose.yml
  • Add the library name to PACKAGES_TO_INSTALL in docker-compose.yml
  • From the Linux host, run
docker-compose stop sdc
docker-compose rm --force sdc
docker-compose build sdc
docker-compose up -d sdc
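
If you rebuild often, the four commands above can be wrapped in a small function that stops at the first failing step, so a broken build does not go on to start a stale container. This is only a convenience sketch: the function name rebuild_service and the COMPOSE override variable are assumptions introduced here.

```shell
# Hypothetical wrapper: stop, remove, rebuild and restart one compose service.
# Each step runs only if the previous one succeeded.
rebuild_service() {
    service="$1"
    # COMPOSE can be set to use a different compose command.
    compose="${COMPOSE:-docker-compose}"
    $compose stop "$service" &&
    $compose rm --force "$service" &&
    $compose build "$service" &&
    $compose up -d "$service"
}

# Example:
# rebuild_service sdc
```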

License

MIT License

Copyright (c) 2017 Dominic Ketley

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
