java.lang.StringIndexOutOfBoundsException: String index out of range #18

Closed
buyology opened this issue Dec 6, 2016 · 6 comments


buyology commented Dec 6, 2016

I am trying to load a file using PySpark with:

df = sqlContext.read.format("com.github.saurfang.sas.spark").load("file.sas7bdat")

I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o189.load.
: java.lang.StringIndexOutOfBoundsException: String index out of range: 32721
	at java.lang.String.substring(String.java:1951)
	at com.ggasoftware.parso.SasFileParser$ColumnNameSubheader.processSubheader(SasFileParser.java:723)
	at com.ggasoftware.parso.SasFileParser.processPageMetadata(SasFileParser.java:466)
	at com.ggasoftware.parso.SasFileParser.processSasFilePageMeta(SasFileParser.java:436)
	at com.ggasoftware.parso.SasFileParser.getMetadataFromSasFile(SasFileParser.java:360)
	at com.ggasoftware.parso.SasFileParser.<init>(SasFileParser.java:280)
	at com.ggasoftware.parso.SasFileParser$Builder.build(SasFileParser.java:264)
	at com.ggasoftware.parso.SasFileReader.<init>(SasFileReader.java:41)
	at com.github.saurfang.sas.spark.SasRelation.inferSchema(SasRelation.scala:98)
	at com.github.saurfang.sas.spark.SasRelation.<init>(SasRelation.scala:32)
	at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:34)
	at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:23)
	at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:11)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:211)
	at java.lang.Thread.run(Thread.java:745)
@8bit-pixies
Contributor

You might have better luck if you provide the file.sas7bdat which reproduces this so that people can have a look.

@buyology
Author

Sure, it was this file with results from PISA: http://vs-web-fs-1.oecd.org/pisa/PUF_SAS_COMBINED_CMB_STU_QQQ.zip

@niraj7848

When I parse a 32MB file, it works and I can convert it into a CSV file.

But when I try to convert a 7GB file, I get an IndexOutOfBoundsException, as below.

Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at com.ggasoftware.parso.SasFileParser.readSubheaderSignature(SasFileParser.java:493)
at com.ggasoftware.parso.SasFileParser.processPageMetadata(SasFileParser.java:460)
at com.ggasoftware.parso.SasFileParser.processNextPage(SasFileParser.java:937)
at com.ggasoftware.parso.SasFileParser.readNextPage(SasFileParser.java:919)
... 29 more

And when running with a 120MB file, it gives the following exception:
Caused by: java.nio.BufferUnderflowException
at java.nio.Buffer.nextGetIndex(Buffer.java:506)
at java.nio.HeapByteBuffer.getDouble(HeapByteBuffer.java:514)
at com.ggasoftware.parso.SasFileParser.bytesToDate(SasFileParser.java:1312)
at com.ggasoftware.parso.SasFileParser.processByteArrayWithData(SasFileParser.java:1106)
at com.ggasoftware.parso.SasFileParser.readNext(SasFileParser.java:887)
... 32 more

I am really not sure how to debug. Any help or guidance will be highly appreciated.
Thanks
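
A note for readers debugging the second trace: Java's `ByteBuffer.getDouble` throws `BufferUnderflowException` when fewer than eight bytes remain in the buffer, so the trace suggests that `bytesToDate` was handed a value shorter than 8 bytes. The Python sketch below reproduces the same failure mode with `struct` and shows one possible workaround, zero-padding a short value before decoding. The helper name `decode_double`, the little-endian default, and the choice of padding side are assumptions for illustration, not parso's actual logic.

```python
import struct


def decode_double(raw: bytes, little_endian: bool = True) -> float:
    """Decode an IEEE 754 double from up to 8 bytes.

    Assumption: a short value keeps the high-order bytes of the double, so the
    missing low-order bytes are filled with zeros.
    """
    if not 1 <= len(raw) <= 8:
        raise ValueError(f"expected 1 to 8 bytes, got {len(raw)}")
    pad = b"\x00" * (8 - len(raw))
    if little_endian:
        # Low-order bytes come first in little-endian layout, so pad the front.
        return struct.unpack("<d", pad + raw)[0]
    # High-order bytes come first in big-endian layout, so pad the end.
    return struct.unpack(">d", raw + pad)[0]


full = struct.pack("<d", 1.5)

# Decoding a truncated value without padding fails, much like getDouble does.
try:
    struct.unpack("<d", full[3:])
    unpadded_failed = False
except struct.error:
    unpadded_failed = True

# 1.5 has zero low-order mantissa bytes, so it survives truncation to 5 bytes.
restored = decode_double(full[3:])
```

Whether padding is the right fix depends on how the file actually stores the column, so treat this as a way to confirm the cause rather than as a patch.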

@niraj7848

The BufferUnderflowException issue is resolved.
The IndexOutOfBoundsException is caused by a wrong offset being used to read the subheader. The readSubheaderSignature method receives an index that is greater than the size of a page, and hence it throws IndexOutOfBoundsException.

Has anyone worked on this problem?
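
The failure described above, an offset that points past the end of a page, can be caught early with an explicit bounds check. A minimal Python sketch follows; `read_signature`, the 4-byte signature length, and the sample page are hypothetical and do not reproduce parso's `readSubheaderSignature`.

```python
def read_signature(page: bytes, offset: int, length: int = 4) -> bytes:
    """Return `length` bytes of `page` starting at `offset`, or raise a clear error."""
    if offset < 0 or offset + length > len(page):
        raise ValueError(
            f"subheader offset {offset} (+{length} bytes) is outside "
            f"a page of {len(page)} bytes"
        )
    return page[offset:offset + length]


page = bytes(range(16))  # stand-in for one page of the file

signature = read_signature(page, 4)  # offset within the page

try:
    read_signature(page, 32)  # offset larger than the page
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Reporting both the offset and the page size in the message makes it easier to tell whether the offset was computed wrongly or the page was read short.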


ghost commented Jun 26, 2017

@niraj7848 Can you explain how you resolved the BufferUnderflowException? I'm using 1.1.5

Owner

saurfang commented Feb 3, 2018

Please give the new version a try as we believe this might have been fixed.

saurfang closed this as completed Feb 3, 2018