Add support for lzo files in input #104

benjben · 2020-03-31T15:17:48Z

hadoop-lzo.jar is preinstalled on EMR. To split indexed LZO files when reading them (and get one Spark partition per block-sized split instead of one Spark partition per file), we need to read them with sc.newAPIHadoopFile() instead of sc.textFile(), which is what we currently use.
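A minimal PySpark sketch of the kind of call this would involve, not the project's actual implementation. It assumes hadoop-lzo is on the classpath (as on EMR) and uses its `com.hadoop.mapreduce.LzoTextInputFormat` class. The helper name `read_lzo_lines` is hypothetical. The same `newAPIHadoopFile` method exists on the Scala `SparkContext`.

```python
# Fully qualified class names, passed to Spark as strings.
LZO_INPUT_FORMAT = "com.hadoop.mapreduce.LzoTextInputFormat"  # provided by hadoop-lzo
KEY_CLASS = "org.apache.hadoop.io.LongWritable"  # byte offset of each line
VALUE_CLASS = "org.apache.hadoop.io.Text"        # the line itself


def read_lzo_lines(sc, path):
    """Return an RDD of text lines read from the LZO files under `path`.

    `sc` is an existing SparkContext. Hypothetical helper, for illustration only.
    """
    # newAPIHadoopFile yields (key, value) pairs; only the line text is needed,
    # which matches what sc.textFile() would have returned.
    pairs = sc.newAPIHadoopFile(path, LZO_INPUT_FORMAT, KEY_CLASS, VALUE_CLASS)
    return pairs.values()
```

Splitting only applies to LZO files that have an index next to them. A file without an index is still read, but as a single partition.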

More info can be found here.

AcidFlow linked a pull request Apr 1, 2020 that will close this issue

Read splittable LZO effectively in transformer #105

Open
