slice bounds out of range #462

Open · programmerX1123 opened this issue May 12, 2022 · 2 comments

Comments

programmerX1123 commented May 12, 2022

Hi, I am parsing a parquet file whose schema, as generated by parquet-tools, is:

{
  "Tag": "name=Schema, repetitiontype=REQUIRED",
  "Fields": [
    {
      "Tag": "name=Timestamp, type=INT64, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=File_name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=Avro_name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=Offset, type=INT32, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=File_format, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=Meta_data, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    }
  ]
}

And I use the following struct to hold the content of the parquet file:

type Schema struct {
	Timestamp int64  `parquet:"name=timestamp, type=INT64"`
	AvroName  string `parquet:"name=avro_name, type=BYTE_ARRAY"`
	FileName  string `parquet:"name=file_name, type=BYTE_ARRAY"`
	Offset    int32  `parquet:"name=offset, type=INT32"`
}

When I try to parse a parquet file that has 4905 rows, the following error is thrown:

panic: runtime error: slice bounds out of range [:4905] with capacity 3072

But when I run the same code on a parquet file that has only 5 rows, there is no error (these two parquet files are generated by the same script, so they share the same schema). Here is the result:

[{211297138286 Image0.avro 211297138286.png 269475} 
{210997038286 Image0.avro 210997038286.png 58} 
{210997038286 Image0.avro 210997038286.png 58} 
{210997038286 Image0.avro 210997038286.png 58} 
{210997038286 Image0.avro 210997038286.png 58}]

So is there a limit on the size of the parquet file?
Besides, when I omit the AvroName field, the first parquet file can also be read successfully (but AvroName holds file names just like FileName, so I don't see any difference between them).
Moreover, I have tested several parquet files with different numbers of rows, and they all produce the same slice bounds out of range error, so I don't think it is caused by an occasional mistake during file generation.
Now I am really confused and would appreciate your help in fixing this bug. Thank you in advance!

hangxie (Contributor) commented May 19, 2022

The schema and Go struct don't match: OPTIONAL fields should be defined as pointers so they can be nil.
If it still does not work after changing the definition of type Schema, it would help to have a sample parquet file (and ideally a snippet of your source code) to troubleshoot.
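
For reference, here is a minimal sketch of what that change might look like, assuming the tag syntax from the struct posted above and mirroring the parquet-tools schema dump; the field names and tags are illustrative, not a verified fix.

package main

import "fmt"

// OPTIONAL columns are declared as pointers so a NULL cell can be represented as nil.
type Schema struct {
	Timestamp *int64  `parquet:"name=Timestamp, type=INT64, repetitiontype=OPTIONAL"`
	FileName  *string `parquet:"name=File_name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	AvroName  *string `parquet:"name=Avro_name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	Offset    *int32  `parquet:"name=Offset, type=INT32, repetitiontype=OPTIONAL"`
}

func main() {
	// After rows are read back into []Schema, a nil pointer marks a NULL value.
	var row Schema
	if row.FileName == nil {
		fmt.Println("File_name is NULL for this row")
	}
}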

zolstein pushed a commit to zolstein/parquet-go that referenced this issue Jun 23, 2023
When a Read is performed after SeekToRow on mergedRowGroups, the rowIndex is
checked against the seek index and advanced until the rowIndex == seek index.
Previously, the rowIndex was not advanced in the normal read path, resulting in
mistakenly dropping unread rows when advancing the rowIndex.
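
As a side note, here is a hypothetical sketch of the pattern that commit message describes (illustrative names only, not the library's actual code): rowIndex has to advance on the normal read path as well, otherwise a later SeekToRow skips rows that were never returned.

package main

import "fmt"

type mergedRows struct {
	rows     []int
	rowIndex int64 // number of rows already returned by Read
	seekTo   int64 // target set by SeekToRow; -1 means no pending seek
}

func (r *mergedRows) SeekToRow(n int64) { r.seekTo = n }

func (r *mergedRows) Read() (int, bool) {
	// Skip rows until rowIndex catches up with the seek target.
	for r.seekTo >= 0 && r.rowIndex < r.seekTo {
		if r.rowIndex >= int64(len(r.rows)) {
			return 0, false
		}
		r.rowIndex++
	}
	r.seekTo = -1
	if r.rowIndex >= int64(len(r.rows)) {
		return 0, false
	}
	v := r.rows[int(r.rowIndex)]
	r.rowIndex++ // the fix: also advance rowIndex on a normal read
	return v, true
}

func main() {
	r := &mergedRows{rows: []int{10, 20, 30, 40}, seekTo: -1}
	r.Read()         // returns 10; rowIndex becomes 1
	r.SeekToRow(2)   // should skip only row 20
	v, _ := r.Read() // returns 30, not 40
	fmt.Println(v)
}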

ZhenSh commented Jul 24, 2023

Hi @programmerX1123,
I have run into the same issue and am wondering how you got it resolved. Could you share the details?
Appreciate it. Thanks!
