issue with dot "." in field name #349

Closed
pwmcintyre opened this issue Jan 27, 2021 · 10 comments

@pwmcintyre

pwmcintyre commented Jan 27, 2021

hi

I know the trouble with using "." in field names has been briefly mentioned in other issues, but I'm hoping you can help.

Using the Java parquet-tools to inspect the schema of an existing Parquet file I have, I can see that it contains "." in the field names and works fine:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet

message spark_schema {
  optional binary version (STRING);
  optional binary meta.format (STRING);
  optional binary meta.id (STRING);
}

and while using your tool i get the following:

$ parquet-tools -cmd schema -file ./part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet

----- Go struct -----
Spark_schema struct {
  Version *string
  Meta46format *string
  Meta46id *string
}
----- Json schema -----
{
  "Tag": "name=Spark_schema, repetitiontype=REQUIRED",
  "Fields": [
    {
      "Tag": "name=Version, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    },
    {
      "Tag": "name=Meta46format, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    },
    {
      "Tag": "name=Meta46id, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    }
  ]
}
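
The `Meta46format` name in the output above looks like `meta.format` with the "." replaced by its decimal ASCII code (46) and the first letter upper-cased. The sketch below reproduces that mapping for illustration only; `toGoName` is a hypothetical helper, and the conversion inside parquet-go may be implemented differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// toGoName is a hypothetical helper: it keeps letters, digits and "_",
// replaces every other rune with its decimal code point, and upper-cases
// the first rune of the result.
func toGoName(parquetName string) string {
	var b strings.Builder
	for _, r := range parquetName {
		if unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_' {
			b.WriteRune(r)
		} else {
			b.WriteString(strconv.Itoa(int(r)))
		}
	}
	out := []rune(b.String())
	if len(out) > 0 {
		out[0] = unicode.ToUpper(out[0])
	}
	return string(out)
}

func main() {
	for _, name := range []string{"version", "meta.format", "meta.id"} {
		fmt.Printf("%s -> %s\n", name, toGoName(name))
	}
}
```

Note that the mapping is lossy in the other direction: from `Meta46format` alone there is no way to tell whether the original name was `meta.format` or literally `meta46format`.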

I'm similarly having trouble writing files with "." in the key — eg with this struct:

type Event struct {
	Version *string `parquet:"name=version, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
	MetaID *string `parquet:"name=meta.id, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
}

I get the following error when attempting to read it:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/output_test/struct/output.parquet

org.apache.parquet.io.InvalidRecordException: meta not found in message parquet_go_root {
  optional binary version (STRING) = 0;
  optional binary meta.id (STRING) = 0;
}

any ideas?

@xitongsys
Owner

hi, @pwmcintyre
Go doesn't allow a dot in an identifier, so you need to give the Go struct field a legal name.
The following is an example of writing and reading a parquet file with a field whose name contains a ".".

package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/reader"
	"github.com/xitongsys/parquet-go/writer"
)

type Student struct {
	// name is the Parquet field name; inname is the Go variable name
	Name    string  `parquet:"name=student.name, inname=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
	Age     int32   `parquet:"name=age, type=INT32, encoding=PLAIN"`
}

func main() {
	var err error
	fw, err := local.NewLocalFileWriter("output/flat.parquet")
	if err != nil {
		log.Println("Can't create local file", err)
		return
	}

	//write
	pw, err := writer.NewParquetWriter(fw, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet writer", err)
		return
	}

	pw.RowGroupSize = 128 * 1024 * 1024 //128M
	pw.PageSize = 8 * 1024 //8K
	pw.CompressionType = parquet.CompressionCodec_SNAPPY
	num := 10
	for i := 0; i < num; i++ {
		stu := Student{
			Name:   "StudentName",
			Age:    int32(20 + i%5),
		}
		if err = pw.Write(stu); err != nil {
			log.Println("Write error", err)
		}
	}
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
		return
	}
	log.Println("Write Finished")
	fw.Close()

	///read
	fr, err := local.NewLocalFileReader("output/flat.parquet")
	if err != nil {
		log.Println("Can't open file", err)
		return
	}

	pr, err := reader.NewParquetReader(fr, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet reader", err)
		return
	}
	num = int(pr.GetNumRows())
	stus := make([]Student, num) //read 10 rows
	if err = pr.Read(&stus); err != nil {
		log.Println("Read error", err)
	}
	log.Println(stus)

	pr.ReadStop()
	fr.Close()

}

running result:

2021/01/28 08:38:46 Write Finished
2021/01/28 08:38:46 [{StudentName 20} {StudentName 21} {StudentName 22} {StudentName 23} {StudentName 24} {StudentName 20} {StudentName 21} {StudentName 22} {StudentName 23} {StudentName 24}]

@pwmcintyre
Author

@xitongsys — appreciate your time, thank you

i have reproduced your result above — but similar to my example earlier, when attempting to read this new parquet file with my existing systems (i'm using AWS Athena), i get an error similar to the below error from parquet-tools:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/output.parquet
org.apache.parquet.io.InvalidRecordException: student not found in message parquet_go_root {
  required binary student.name (STRING) = 0;
  required int32 age = 0;
}

similarly, using another Go implementation, i still cannot read this file:

$ parquet-tool schema output.parquet
panic: line 2: expected ;, got unknown start of token '46' instead

and so i suspect there may be an issue in the handling of the "." in the output file?

@xitongsys
Owner

hi, @pwmcintyre
Could you provide a sample file like "/data/part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet"?

@pwmcintyre
Author

@xitongsys — emailed, and while not sensitive, we would prefer it not shared publicly :)

@pwmcintyre
Author

hi @xitongsys ... did your post about the java implementation get deleted? did you find the answer?

@xitongsys
Owner

hi, @pwmcintyre
I have found the reason. Parquet-go uses "." as a field delimiter, which caused this issue. I'm considering how to fix it while keeping compatibility with earlier versions.

@pwmcintyre
Author

@xitongsys — thanks for the update, please let me know if there's anything I can help with

@xitongsys
Owner

hi, @pwmcintyre
Fixed in this pull.
I simply use \x01 as the delimiter instead of ".".
You can find an example file here.
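
The idea behind the delimiter change can be sketched as follows: with a separator that is not expected to occur in field names, such as the control character `\x01`, joining and splitting a path round-trips even when a segment contains ".". The helpers `joinPath` and `splitPath` are hypothetical names used for illustration, not parquet-go functions.

```go
package main

import (
	"fmt"
	"strings"
)

// delim is a control character that is unlikely to appear in a field name.
const delim = "\x01"

// joinPath builds a single key from path segments (illustration only).
func joinPath(segments []string) string {
	return strings.Join(segments, delim)
}

// splitPath recovers the segments from a key built by joinPath.
func splitPath(key string) []string {
	return strings.Split(key, delim)
}

func main() {
	path := []string{"parquet_go_root", "student.name"}
	got := splitPath(joinPath(path))
	fmt.Println(len(got), got[1]) // 2 student.name
}
```

The same caveat applies in principle: a field name that itself contained `\x01` would break the round trip, but that is far less likely in practice than a name containing ".".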

@pwmcintyre
Author

@xitongsys — well done! thanks again

I can confirm AWS Athena is happy with this change 👌 (ignore the nulls, it's just a test)
image

@xitongsys
Owner

ok, I will close this issue.
