Processing of siva files bigger than 2Gb #31

Closed
smacker opened this issue Jan 3, 2018 · 7 comments

Comments

@smacker
Contributor

smacker commented Jan 3, 2018

There seems to be a 2 GB limit on a job in Spark. We need to investigate whether it can be changed and how changing it would affect Spark (the limit was presumably introduced for a reason).

A tip for anybody else who looks at this: the limit may actually come from the JVM rather than from Spark. I could be wrong, JFYI.

Exception:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 415 in stage 1.0 failed 4 times, most recent failure: Lost task 415.3 in stage 1.0 (TID 1072, 10.2.15.79, executor 8): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
	at tech.sourced.siva.SivaReader.getEntry(SivaReader.java:42)
	at tech.sourced.engine.provider.RepositoryObjectFactory$$anonfun$genSivaRepository$1.apply(RepositoryProvider.scala:209)
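
The bottom frame of the trace is FileChannelImpl.map, which suggests the exception comes from the JDK's argument check rather than from Spark itself. A minimal sketch that reproduces the same exception type on a tiny temporary file follows; the class and method names (MapLimitDemo, mapRejects) are made up for illustration and are not part of siva-java or the engine.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not taken from siva-java.
public class MapLimitDemo {

    /** Returns true if FileChannel.map rejects a request to map `size` bytes. */
    static boolean mapRejects(long size) throws IOException {
        Path tmp = Files.createTempFile("map-limit", ".bin");
        tmp.toFile().deleteOnExit();
        // 16 bytes of content is enough: the size argument is validated
        // before the file itself is examined.
        Files.write(tmp, new byte[16]);
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return false;
        } catch (IllegalArgumentException e) {
            // Thrown when size is greater than Integer.MAX_VALUE.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(mapRejects((long) Integer.MAX_VALUE + 1)); // request just over the limit
        System.out.println(mapRejects(16));                           // request within the limit
    }
}
```

Because the file here is only 16 bytes long, the rejection depends on the requested mapping size alone, not on how big the file on disk is.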
@smacker
Contributor Author

smacker commented Jan 4, 2018

Siva reader uses MappedByteBuffer:
https://github.com/src-d/siva-java/blob/master/src/main/java/tech/sourced/siva/SivaReader.java#L43

And the JDK limits the size of a single mapping to Integer.MAX_VALUE bytes: https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/sun/nio/ch/FileChannelImpl.java#L788
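
The JDK limit applies to the size of one mapping, while offsets into the file are still plain long values. One generic way to read a region that sits beyond the 2 GB mark is a positional read into an ordinary buffer. The sketch below only illustrates that idea and is not a description of how siva-java handles it; PositionalRead and readRegion are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not taken from siva-java.
public class PositionalRead {

    /** Reads `length` bytes starting at byte `offset`, without memory-mapping the file. */
    static byte[] readRegion(Path file, long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long pos = offset; // a long, so positions past 2 GB are representable
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos); // positional read, may return fewer bytes than asked
                if (n < 0) {
                    throw new EOFException("file ended before " + length + " bytes were read");
                }
                pos += n;
            }
        }
        return buf.array();
    }
}
```

A single region is still capped at what fits in one byte array, so an entry that is itself larger than 2 GB would have to be read in several calls.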

Separately, there is an issue in Spark: a partition can't be larger than 2 GB:
https://issues.apache.org/jira/browse/SPARK-6235
The issue has been open since 2015, with not much progress since then.

We can't change that.

@bzz @dpordomingo

@bzz
Contributor

bzz commented Jan 5, 2018

Two issues here:

  1. siva-java fails to read .siva files > 2 GB

    To understand whether that is by design or a 🐛, I would suggest:

    • find a repository that results in a > 2 GB .siva file
    • pack it locally with borges
    • try siva unpack using the Go implementation

    If it works, log an issue in https://github.com/src-d/siva-java

  2. Apache Spark can't have a partition larger than 2 GB
    Will update this later today

@smacker
Contributor Author

smacker commented Jan 5, 2018

  1. I have already created an issue, because even if it's by design it should be documented.

But here is a test:

$ ls -lah
-rw-r--r--. 1 root root 4.6G Jan  5 09:48 ec644e00e3cab40629bc32562269c011ec2a6b14.siva
$ siva unpack ec644e00e3cab40629bc32562269c011ec2a6b14.siva
$ ls
HEAD  config  ec644e00e3cab40629bc32562269c011ec2a6b14.siva  go  objects  refs

The big files are available in /apps/borges/too-big if anybody else needs them.

@bzz
Contributor

bzz commented Jan 5, 2018

Nice, why not link the issue here?

@smacker
Contributor Author

smacker commented Jan 5, 2018

Good point haha. I honestly believed I did, but no. Here you go: src-d/siva-java#18

@bzz
Contributor

bzz commented Jan 17, 2018

The siva-java issue was fixed (src-d/siva-java#18 (comment)), and a new v0.1.3 was released, plus a new engine version that includes it.

@smacker
Contributor Author

smacker commented Jan 23, 2018

With the latest engine, processing works.

@smacker smacker closed this as completed Jan 23, 2018