Support for non-LZO files. #101

Open
wants to merge 2 commits into
from

Projects

None yet

4 participants

@rangadi
Twitter, Inc. member

set elephantbird.lzo.input.include.nonlzo to enable it.
by default only lzo files are read.

All the existing Lzo* input formats and Pig loaders would work with this.

(eventually we might remove "Lzo" prefix for many of these loader classes).

Raghu Angadi support non-lzo input files.
set elephantbird.lzo.input.include.nonlzo to enable it.
by default only lzo files are read.
136c2a0
@rangadi
Twitter, Inc. member

Once #97 is merged, I will update MultiInputFormat loader to read non-lzo files.

@julienledem

As this is changing the api should we add a method with the previous signature and a default value?
protected void addInputPath(List results, FileSystem fs,
FileStatus pathStat, boolean recursive) throws IOException {
this.addInputPath(results, fs, pathStat, recursive, false);
}

@julienledem julienledem commented on the diff Nov 22, 2011
...tter/elephantbird/mapreduce/input/LzoInputFormat.java
@@ -27,6 +29,11 @@ import com.hadoop.compression.lzo.LzoIndex;
public abstract class LzoInputFormat<K, V> extends FileInputFormat<K, V> {
@julienledem
julienledem Nov 22, 2011

If the LZOInputFormat does not do LZO only maybe we should split that class.
The LZOInputFormat would work the same and a new InputFormat would handle multiple compression formats.

@rangadi
rangadi Nov 22, 2011

Yes. Essentially we need to rename this class (along with many other higher level classes) and not have Lzo in the name. This the beginning of transition.

@julienledem julienledem commented on the diff Nov 22, 2011
...tter/elephantbird/mapreduce/input/LzoInputFormat.java
Path path = pathStat.getPath();
if (pathStat.isDir()) {
if (recursive) {
for(FileStatus stat: fs.listStatus(path, hiddenPathFilter)) {
- addInputPath(results, fs, stat, recursive);
+ addInputPath(results, fs, stat, recursive, includeNonLzo);
+ }
+ }
+ } else {
+ if (includeNonLzo) {
+ String name = path.getName();
+ if (!name.endsWith(lzoIndexSuffix) && !name.endsWith(lzoTmpIndexSuffix)) {
@julienledem
julienledem Nov 22, 2011

maybe add a comment here. We read all the files except the LZO index files.

@julienledem julienledem commented on the diff Nov 22, 2011
...tter/elephantbird/mapreduce/input/LzoInputFormat.java
@@ -99,8 +116,16 @@ public abstract class LzoInputFormat<K, V> extends FileInputFormat<K, V> {
* blocks and this.getSplits() adjusts the positions.
*/
try {
- FileSystem fs = filename.getFileSystem( context.getConfiguration() );
- return fs.exists( filename.suffix( LzoIndex.LZO_INDEX_SUFFIX ) );
+ if (visibleLzoFilter.accept(filename)) {
+ FileSystem fs = filename.getFileSystem( context.getConfiguration() );
+ return fs.exists( filename.suffix( LzoIndex.LZO_INDEX_SUFFIX ) );
+ } else { // non-lzo file
+ //only simple files are splitable (borrowed from TextInputFormat).
+ CompressionCodec codec =
+ new CompressionCodecFactory(context.getConfiguration()).getCodec(filename);
+ return codec == null;
+ // TODO: bzip2 files are splitable. reuse reader from PigStorage.
@julienledem
julienledem Nov 22, 2011

there should probably be an wrapping InputFormat which just delegates to the actual format of the file to know if it is splittable, etc.
There is a related work in: https://issues.apache.org/jira/browse/PIG-1722

@rangadi
rangadi Nov 22, 2011

I noticed AllLoader in Pig9 last week. It is interesting. will keep in mind.
briefly chatted with Dmitriy about it. It will of course need more patch to fit in here with Protobuf, Thrift loaders etc.

Also, as we discussed, we rework this patch more.. Lzo* classes only do Lzo etc.

@julienledem julienledem commented on the diff Nov 22, 2011
...tter/elephantbird/mapreduce/input/LzoInputFormat.java
@@ -121,6 +146,12 @@ public abstract class LzoInputFormat<K, V> extends FileInputFormat<K, V> {
FileSplit fileSplit = (FileSplit)genericSplit;
Path file = fileSplit.getPath();
+ if (!visibleLzoFilter.accept(file)) {
@julienledem
julienledem Nov 22, 2011

Same comment if there was a wrapping InputFormat delegating to the InputFormats dealing with the actual format there would not be a special case here

@rangadi
Twitter, Inc. member

after discussing with Julien.

  • lets refactor more and do the right thing now rather than later.
    • so Lzo* classes handle only lzo input
    • e.g. in addition to LzoProtobufBlockPigLoader, there will be ProtobufBlockPigLoader that loads files in other formats as well.
  • delegate to differerent inputformats as cleanly as the InputFormat interface allows
    • there are some limitations here.. e.g. what makes sure that "file.lzo.index" is not read by any input format etc.

will try to get better patch.

@dvryaboy
Twitter, Inc. member

Raghu are you still planning on tackling this?

@rangadi
Twitter, Inc. member

yeah. This is second on my elephant-bird list after the update to Thrift-RCFile. should have have a patch next week.

@xvrl

any update on this issue? I would like to see support for non-lzo files as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment