Advanced common functionality for hadoop

HadCom.utils

Overview

This project builds a runnable jar file that provides common advanced functionality for Hadoop.

This first version was built for CDH 4. I will be making it work for CDH 3 shortly.

Functionality

Put

A collection of layered functionality for advanced putting. For details on how to use this functionality, click here. The user can use all of the following:

Layer 1: Reading

CSV files, delimited files, flat files, variable-length delimited files, variable-length flat files

Layer 2: Aggregating

Many files into a few

Appending file name to every row of aggregated files
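The aggregation layer can be illustrated with a small plain-Java sketch. This is not the project's code: the class name AggregateSketch, the method aggregate, and the choice to append the file name as a trailing comma-separated column are all assumptions, and the sketch writes to the local file system rather than HDFS.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch: merge several small text files into one file,
// tagging every row with the name of the file it came from.
final class AggregateSketch {

    // Appends ",<fileName>" to each row of each input and writes all rows to `target`.
    // Returns the number of rows written.
    static long aggregate(List<Path> inputs, Path target) throws IOException {
        long rows = 0;
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (Path input : inputs) {
                String name = input.getFileName().toString();
                for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
                    out.write(line);
                    out.write(',');   // assumed separator for the extra column
                    out.write(name);
                    out.newLine();
                    rows++;
                }
            }
        }
        return rows;
    }
}
```

Keeping the source file name on each row means the origin of a record is still recoverable after many small files have been folded into a few large ones.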

Layer 3: Threading

Run in single-threaded or multi-threaded mode

Each thread writing to a different HDFS file to increase write speed
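The idea of giving every thread its own output file can be sketched as follows. This is only an illustration under stated assumptions: ThreadedPutSketch, put, and the "part-N" file naming are hypothetical, the round-robin split of inputs is one possible policy, and local files stand in for HDFS files. It needs Java 8 or later.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each worker thread owns exactly one output file.
final class ThreadedPutSketch {

    // Splits `inputs` round-robin across `threads` workers. Worker i copies the rows
    // of its inputs into "part-i" under `targetDir`, so no two threads share a file.
    // Returns the output files in worker order.
    static List<Path> put(List<Path> inputs, Path targetDir, int threads)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                final int worker = i;
                Callable<Path> task = () -> {
                    Path part = targetDir.resolve("part-" + worker);
                    try (BufferedWriter out = Files.newBufferedWriter(part, StandardCharsets.UTF_8)) {
                        for (int j = worker; j < inputs.size(); j += threads) {
                            for (String line : Files.readAllLines(inputs.get(j), StandardCharsets.UTF_8)) {
                                out.write(line);
                                out.newLine();
                            }
                        }
                    }
                    return part;
                };
                pending.add(pool.submit(task));
            }
            List<Path> parts = new ArrayList<>();
            for (Future<Path> result : pending) {
                try {
                    parts.add(result.get());   // wait for this worker to finish
                } catch (ExecutionException e) {
                    throw new IOException("put worker failed", e.getCause());
                }
            }
            return parts;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the workers never write to the same file, they need no locking around the output, which is what lets the write speed scale with the thread count.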

Layer 4: Listening

Report progress to console

Layer 5: Compressing

Use Snappy, Gzip, or Bzip2
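Of the three codecs listed, only Gzip ships with the JDK; Snappy and Bzip2 need additional libraries. The sketch below therefore shows a Gzip round trip using java.util.zip only. The class GzipSketch and its methods are hypothetical names for illustration and say nothing about how the project itself wires in compression.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: compress and decompress a byte array with the JDK's Gzip streams.
final class GzipSketch {

    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the Gzip stream writes the trailer, so read the buffer only afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        }
        return buffer.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }
}
```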

Layer 6: Writing

Sequence files, Avro files, RC files, or HBase

Route

This lets one or more directories pump files into HDFS in your favorite splittable format (sequence, Avro, or RC). Like the "put" functionality, the route logic is also layered.

Layer 1: Route

Event driven

Schedule driven

Layer 2: Put Threads

Define number of put threads in the thread pool

Layer 3: Put

Get all the functionality and options from the above put command
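The three route layers can be pictured with a schedule-driven sketch: a scheduler scans a directory, and each new file is handed to a pool of put threads. Everything here is an assumption made for illustration: the class RouteSketch, its methods, and the `put` callback are hypothetical, and the callback stands in for the real put logic. An event-driven variant could use java.nio.file.WatchService instead of periodic scans. Java 8 or later is assumed.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of a schedule-driven route feeding a pool of put threads.
final class RouteSketch implements AutoCloseable {
    private final Set<Path> seen = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService putPool;

    RouteSketch(int putThreads) {
        this.putPool = Executors.newFixedThreadPool(putThreads);
    }

    // Returns the regular files in `dir` that no earlier call has returned.
    List<Path> pollOnce(Path dir) throws IOException {
        List<Path> fresh = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry) && seen.add(entry)) {
                    fresh.add(entry);
                }
            }
        }
        return fresh;
    }

    // Scans `dir` every `periodMillis` and submits each new file to the put pool.
    void start(Path dir, long periodMillis, Consumer<Path> put) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                for (Path file : pollOnce(dir)) {
                    putPool.execute(() -> put.accept(file));
                }
            } catch (IOException | RuntimeException e) {
                // Swallow the failure so the schedule keeps running; the next tick rescans.
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
        putPool.shutdown();
    }
}
```

Separating the scanning thread from the put pool means a slow upload does not delay discovery of the next file.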

Get

hadoop fs -get is good, but what if you want to get a sequence, Avro, or RC file, and what if you want to be able to read the results? In that case you can use these get methods to uncompress sequence, Avro, or RC files into text on your local drive.

Out

hadoop fs -text only goes so far. This takes the next step by outputting RC files and Avro files in clear text. Click here for more information.

Env

Converts a {key}|{field}|{value} env file to an Avro file with a generated schema

Converts a file containing multiple row types to multiple Avro files, each with a generated schema
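To make the {key}|{field}|{value} conversion concrete, here is a sketch that groups such lines into one record per key and derives a schema from the field names it sees. It is not the project's implementation. The names EnvSketch, toRecords, and generateSchema are hypothetical, and the sketch assumes that every field becomes a nullable string, that field names are already valid Avro names, and that no JSON escaping is needed. It emits the schema as JSON text and does not depend on the Avro library. Java 8 or later is assumed.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: turn key|field|value lines into records plus a generated schema.
final class EnvSketch {

    // Groups "{key}|{field}|{value}" lines into one field-to-value map per key.
    // The value may itself contain '|', since only the first two bars are split on.
    static Map<String, Map<String, String>> toRecords(List<String> lines) {
        Map<String, Map<String, String>> records = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\\|", 3);
            if (parts.length != 3) {
                throw new IllegalArgumentException("expected key|field|value: " + line);
            }
            records.computeIfAbsent(parts[0], k -> new LinkedHashMap<>()).put(parts[1], parts[2]);
        }
        return records;
    }

    // Builds Avro record schema JSON with one nullable string field per distinct
    // field name, in first-seen order (an assumed typing rule, not the project's).
    static String generateSchema(String recordName, Map<String, Map<String, String>> records) {
        Set<String> fields = new LinkedHashSet<>();
        for (Map<String, String> record : records.values()) {
            fields.addAll(record.keySet());
        }
        StringBuilder json = new StringBuilder();
        json.append("{\"type\":\"record\",\"name\":\"").append(recordName).append("\",\"fields\":[");
        String separator = "";
        for (String field : fields) {
            json.append(separator)
                .append("{\"name\":\"").append(field)
                .append("\",\"type\":[\"null\",\"string\"],\"default\":null}");
            separator = ",";
        }
        return json.append("]}").toString();
    }
}
```

Making every field nullable lets keys that lack some fields still fit the one generated schema.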

NonSplittableGzip

Converts a non-splittable gzip file stored in HDFS to a sequence file with your choice of compression (Snappy, Gzip, or Bzip2)

NonSplittableZip

Converts a non-splittable zip file stored in HDFS to one or more sequence files with your choice of compression (Snappy, Gzip, or Bzip2). One sequence file is produced for every file in the original zip file.
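The "one output per zip entry" behavior can be sketched with the JDK's ZipInputStream. This is an illustration only: ZipSplitSketch and splitEntries are hypothetical names, and the sketch collects each entry's bytes in memory where the real tool would write a sequence file to HDFS.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch: read a zip stream sequentially and produce one output per entry.
final class ZipSplitSketch {

    // Returns entry name -> uncompressed bytes, in archive order. Directories are skipped.
    static Map<String, byte[]> splitEntries(InputStream zipStream) throws IOException {
        Map<String, byte[]> outputs = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            byte[] chunk = new byte[4096];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the archive.
                while ((n = zip.read(chunk)) != -1) {
                    content.write(chunk, 0, n);
                }
                outputs.put(entry.getName(), content.toByteArray());
            }
        }
        return outputs;
    }
}
```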
