GitHub - rajeshbalamohan/ifile_v2: Benchmark different ifile formats (KV, MultiKV)

Purpose: 
=======
To compare the KV and MultiKV ifile formats under different options, such as with/without compression and with/without RLE.

Ref: https://issues.apache.org/jira/browse/TEZ-1228

The current vertex-intermediate format used all across Tez is a flat file of variable length k,v pairs. For a significant number of use-cases, in particular the sorted output phase, a large number of consecutive identical keys are found within the same stream. The IFile format ends up writing each key out fully into the stream to generate (K,V) pairs instead of ordering it into a more efficient K,{V1, .. Vn} list.
This duplication of key data requires larger in-memory buffers and forces comparisons between keys already known to be identical during a merge sort.
This bug tracks building a prototype IFile format that is optimized for smaller uncompressed sizes in memory buffers and for less compute-intensive merge sorts during the reducer phase.
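
The difference between the flat (K,V) layout and the K,{V1, .. Vn} layout can be illustrated with a small sketch. This is not the code in this repository or in Tez: the class MultiKvSketch, its methods and the sample keys are hypothetical, and the byte counts cover only key and value payloads (no length headers, compression or RLE).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Tez IFile implementation.
class MultiKvSketch {

    /** One key followed by every value that shared it consecutively. */
    static final class Group {
        final String key;
        final List<String> values = new ArrayList<>();

        Group(String key) {
            this.key = key;
        }
    }

    /** Collapses runs of identical consecutive keys in an already sorted stream of {key, value} pairs. */
    static List<Group> group(List<String[]> sortedPairs) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        for (String[] kv : sortedPairs) {
            // Start a new group only when the key changes.
            if (current == null || !current.key.equals(kv[0])) {
                current = new Group(kv[0]);
                groups.add(current);
            }
            current.values.add(kv[1]);
        }
        return groups;
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Payload bytes when each pair carries its own copy of the key. */
    static int flatPayloadBytes(List<String[]> pairs) {
        int total = 0;
        for (String[] kv : pairs) {
            total += utf8Length(kv[0]) + utf8Length(kv[1]);
        }
        return total;
    }

    /** Payload bytes when each key is stored once per group. */
    static int groupedPayloadBytes(List<Group> groups) {
        int total = 0;
        for (Group g : groups) {
            total += utf8Length(g.key);
            for (String v : g.values) {
                total += utf8Length(v);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample data (made up): three consecutive pairs share the key "store_1".
        List<String[]> pairs = Arrays.asList(
            new String[] {"store_1", "a"},
            new String[] {"store_1", "b"},
            new String[] {"store_1", "c"},
            new String[] {"store_2", "d"});
        List<Group> groups = group(pairs);
        System.out.println("groups:        " + groups.size());
        System.out.println("flat bytes:    " + flatPayloadBytes(pairs));
        System.out.println("grouped bytes: " + groupedPayloadBytes(groups));
    }
}
```

For the four sample pairs, the flat layout carries 32 payload bytes and the grouped layout 18, because "store_1" is stored once instead of three times. A merge over the grouped form also compares one key per group rather than one key per pair.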

To run:
======

mvn clean package -DskipTests=true exec:java -Dexec.mainClass="org.apache.tez.runtime.library.common.ifile2.benchmark.Benchmark" -Dexec.args="file:////Users/...directory location.../store_sales_60_l.csv"

To run against all codecs, it is easier to use the hadoop command with the Tez classpaths in HADOOP_CLASSPATH:

HADOOP_CLASSPATH=./target/*:$TEZ_HOME/*:$TEZ_HOME/lib/*:$HADOOP_CLASSPATH hadoop jar ./target/ifile_v2-0.0.1-SNAPSHOT.jar org.apache.tez.runtime.library.common.ifile2.benchmark.Benchmark file:////grid/..../ifile_v2/store_sales_60_l.csv