This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

Add support for lzo files in input #104

Open
benjben opened this issue Mar 31, 2020 · 0 comments · May be fixed by #105

Comments


benjben commented Mar 31, 2020

hadoop-lzo.jar is preinstalled on EMR. To be able to split indexed LZO files when reading them (and get one Spark partition per block-sized split instead of one Spark partition per file), we need to read them with sc.newAPIHadoopFile() instead of sc.textFile(), which is what is currently used.

More info can be found here.
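
As a rough illustration of the change described above, here is a minimal PySpark sketch (the same method exists on the Scala SparkContext). It assumes that hadoop-lzo is on the classpath, that the input format to use is com.hadoop.mapreduce.LzoTextInputFormat from hadoop-lzo, and that the .lzo files have already been indexed. The helper name read_lzo_lines and the path in the usage note are hypothetical and do not come from this repository.

```python
# Sketch only: the class names below are assumptions about the setup
# (hadoop-lzo on the classpath, as on EMR), not code from this project.
LZO_TEXT_INPUT_FORMAT = "com.hadoop.mapreduce.LzoTextInputFormat"
KEY_CLASS = "org.apache.hadoop.io.LongWritable"   # byte offset of each line
VALUE_CLASS = "org.apache.hadoop.io.Text"         # content of each line


def read_lzo_lines(sc, path):
    """Read LZO-compressed text files as an RDD of lines.

    Hypothetical helper. Unlike sc.textFile(), going through the Hadoop
    input format lets indexed LZO files be split into several partitions.
    """
    pairs = sc.newAPIHadoopFile(path, LZO_TEXT_INPUT_FORMAT, KEY_CLASS, VALUE_CLASS)
    # Each record is an (offset, line) pair; keep only the line.
    return pairs.map(lambda kv: kv[1])
```

With a live SparkContext this would be called as read_lzo_lines(sc, "s3://my-bucket/input/*.lzo"), where the bucket path is a placeholder. If a file has no index next to it, it would most likely still be read as a single partition.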

@AcidFlow linked a pull request on Apr 1, 2020 that will close this issue