This repository collects code samples for learning big data and Hadoop, and for reuse at work.
- /java/common: Commonly shared utilities and tools
- /java/ptn001: Default MapReduce program, ToolRunner, debugging & counters, inverted index sample (a minimal sketch follows this list)
- /java/ptn002: Database to file, file to database
- /java/ptn003: HBase to file and HFile, file to HBase
- /java/ptn004: XML to file, file to XML, customized InputFormat for XML
- /java/ptn005: Two ways to read and write Avro files
- /java/ptn006: SequenceFile reader and writer, customized WritableComparable
- /java/ptn007: Customized key, value, InputFormat, RecordReader, Partitioner
- /java/ptn008: Distributed cache usage
- /java/ptn009: Secondary and global sorting
- /java/ptn010: Combining small files into larger ones with Avro, SequenceFile, and CombineFileInputFormat
- /java/ptn011: Reading and writing compressed files, including LZOP
- /java/ptn012: Log processing utility
- /java/ptn013: Split reader to examine split content
- /java/ptn101: JUnit helper class for MRUnit
- /java/ptn102: Identity Map and Reduce test
- /java/ptn201: Hive UDF, UDAF, and GenericUDF
- /java/ptn202: Hive SerDe
- /java/ptn301: Customized Pig store/load functions for common log and SequenceFile formats
- /java/ptn302: Pig UDFs: LoadFunc, EvalFunc, and FilterFunc
- /java/ptn401: HBase CRUD operations: put, get, delete
- /java/ptn402: HBase scan and row locking
- /java/ptn403: HBase data import from other sources
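
The inverted index sample in ptn001 is the one exercised in the usage section at the end of this README. As a rough illustration of the technique, here is a minimal sketch; the class and member names are illustrative, not the repo's actual code:

```java
// Illustrative sketch of an inverted index: maps each word to the
// set of input files it appears in (Hadoop 0.20.x "new" API).
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexSketch {

  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text file = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // The source file name becomes the posting for every word in the line.
      file.set(((FileSplit) context.getInputSplit()).getPath().getName());
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, file);
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Deduplicate file names so each file is listed once per word.
      Set<String> files = new HashSet<String>();
      for (Text v : values) {
        files.add(v.toString());
      }
      context.write(key, new Text(files.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "inverted-index-sketch");
    job.setJarByClass(InvertedIndexSketch.class);
    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // All arguments but the last are input files; the last is the output dir.
    for (int i = 0; i < args.length - 1; i++) {
      FileInputFormat.addInputPath(job, new Path(args[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```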
All the code has been exercised against CDH3u2, which for the purposes of this code is equivalent to Hadoop 0.20.x. A couple of examples use features of Pig 0.9.1, which won't work on CDH3u1, which ships Pig 0.8.1.

To get started, clone and build the project:
git clone git://github.com/willddy/bigdata_pattern.git
cd bigdata_pattern
mvn package
Many of the examples use Snappy and LZOP compression, so you may hit runtime errors if they aren't installed and configured on your cluster.
Snappy can be installed on CDH by following the instructions here.
To install LZOP, follow the instructions here.
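
To verify that Snappy is wired up before running the larger examples, a pass-through job that compresses its output makes a quick smoke test. The class below is a minimal sketch, not part of this repo; it assumes the stock Hadoop 0.20.x APIs and that the native Snappy libraries are installed:

```java
// Minimal sketch (not part of this repo): an identity pass-through job whose
// only purpose is to confirm Snappy output compression works on the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SnappySmokeTest {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "snappy-smoke-test");
    job.setJarByClass(SnappySmokeTest.class);
    // The default (identity) mapper and reducer copy the input straight through.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Compress the job output with Snappy; this fails at runtime if the
    // native Snappy libraries are missing from the cluster.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A failure that mentions the native Snappy library usually means the codec isn't installed or configured correctly.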
# copy the input files into HDFS
hadoop fs -mkdir /tmp
hadoop fs -put your-test-data/* /tmp/
# replace the path below with the location of your Hadoop installation
# this isn't required if you are running CDH3, for example
export HADOOP_HOME=/usr/local/hadoop
# run the MapReduce job
bin/run.sh ptn001.InvertedIndexMapReduce /tmp/file1.txt /tmp/file2.txt output
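# inspect the job output (part-file names vary with the MapReduce API used)
hadoop fs -cat output/part-*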