How to use snappy? #114

seaguest · 2022-03-22T04:13:09Z

Hello,

I am trying to use this library with snappy, but I can't find any example. Here is my code:

func write4() {
	f, err := os.OpenFile("output4.parquet", os.O_APPEND|os.O_WRONLY|os.O_CREATE, os.ModePerm)
	if err != nil {
		log.Error(err)
		return
	}
	defer f.Close()

	writer, _ := parquet3.Snappy.NewWriter(f)

	type record struct {
		Format   string `parquet:"format"`
		DataType int32  `parquet:"data_type"`
		Country  string `parquet:"country"`
	}

	num := 1000
	for i := 0; i < num; i++ {
		stu := record{
			Format:   "Test",
			DataType: 1,
			Country:  "IN",
		}

		writer.Write(stu) // here argument can only be []byte
	}
	// Closing the writer is necessary to flush buffers and write the file footer.
	if err := writer.Close(); err != nil {
		log.Error(err)
	}
}

I tried github.com/fraugster/parquet-go and github.com/fraugster/parquet-go/parquet; it is very easy to use SNAPPY with them.

writer.Write(stu) // here argument can only be []byte
What is the right way to do so?


seaguest · 2022-03-22T08:26:18Z

I tried to use

	type record struct {
		Format   string `parquet:"format,snappy"`
		DataType int32  `parquet:"data_type,snappy"`
		Country  string `parquet:"country,snappy"`
	}

but the output file is still very large, about 34M, while with the other 2 libraries it is only 1KB.

Pryz · 2022-03-22T17:31:53Z

Hi @seaguest,

Looking at the snippet you shared it doesn't look like you are using segmentio/parquet-go.

Using the following:

package main

import (
        "log"
        "os"

        segmentparquet "github.com/segmentio/parquet-go"
)

func write() {
        f, err := os.OpenFile("outputs.parquet", os.O_APPEND|os.O_WRONLY|os.O_CREATE|os.O_TRUNC, os.ModePerm)
        if err != nil {
                log.Println(err)
                return
        }
        defer f.Close()

        writer := segmentparquet.NewWriter(f)

        type record struct {
                Format   string `parquet:"format,snappy"`
                DataType int32  `parquet:"data_type,snappy"`
                Country  string `parquet:"country,snappy"`
        }

        num := 1000
        for i := 0; i < num; i++ {
                stu := record{
                        Format:   "Test",
                        DataType: 1,
                        Country:  "IN",
                }

                // Write takes the struct value directly; check the returned error.
                if err := writer.Write(stu); err != nil {
                        log.Println(err)
                        return
                }
        }
        // Closing the writer is necessary to flush buffers and write the file footer.
        if err := writer.Close(); err != nil {
                log.Println(err)
        }

}

func main() {
        write()
}

ends up creating a file of 1060 bytes so about 1KB.

seaguest · 2022-03-24T02:26:11Z

@Pryz

Indeed it works now.
But I am curious why we should have to put a snappy annotation on each field; usually we won't have different compression types for different fields in one struct.

I saw that another library has an option like this:

	pw.CompressionType = parquet2.CompressionCodec_GZIP

Why doesn't this library have such an option?

kevinburkesegment · 2022-03-24T02:32:22Z

> usually we won't have different compression type for different fields in one struct.

We're planning to use different compression types for different fields in one struct (tracing data), which is why we thought that choice was a good fit.

kevinburkesegment · 2022-03-24T02:32:34Z

I'm going to close this - thanks for the issue report and glad you got it working!

seaguest · 2022-03-24T08:39:49Z

Are you planning to provide an option to set the compression type for all fields in the future?
If we have hundreds of fields, it would be a disaster to add "snappy" to each one, and it is pointless when we only need one compression type.

Thanks for your quick reply.

himanshpal · 2022-04-04T10:31:14Z

+1 on this. It would be great to provide an alternative way to pass compression config while initialising a writer.
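As a rough illustration of what such a writer-level setting could look like, the sketch below uses Go's functional options pattern: one default codec for every column, with optional per-column overrides. All names here (`writerConfig`, `withDefaultCompression`, `withColumnCompression`, `codecFor`) are hypothetical and are not the segmentio/parquet-go API; codecs are plain strings to keep the example free of dependencies.

```go
package main

import "fmt"

// writerConfig is a hypothetical configuration object; none of these
// names come from segmentio/parquet-go.
type writerConfig struct {
	defaultCodec string
	perColumn    map[string]string
}

// option mutates a writerConfig (the functional options pattern).
type option func(*writerConfig)

// withDefaultCompression sets the codec used by every column that has
// no explicit override.
func withDefaultCompression(codec string) option {
	return func(c *writerConfig) { c.defaultCodec = codec }
}

// withColumnCompression overrides the codec for a single column.
func withColumnCompression(column, codec string) option {
	return func(c *writerConfig) { c.perColumn[column] = codec }
}

// newWriterConfig applies the options on top of an uncompressed default.
func newWriterConfig(opts ...option) *writerConfig {
	c := &writerConfig{
		defaultCodec: "uncompressed",
		perColumn:    make(map[string]string),
	}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

// codecFor returns the per-column override if present, else the default.
func (c *writerConfig) codecFor(column string) string {
	if codec, ok := c.perColumn[column]; ok {
		return codec
	}
	return c.defaultCodec
}

func main() {
	cfg := newWriterConfig(
		withDefaultCompression("snappy"),
		withColumnCompression("country", "gzip"),
	)
	for _, column := range []string{"format", "data_type", "country"} {
		fmt.Printf("%s -> %s\n", column, cfg.codecFor(column))
	}
}
```

With this shape, a struct with hundreds of fields needs a single option at writer construction, while per-field choices remain possible where they matter.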

Pryz · 2022-04-06T16:21:21Z

Created #124 as a follow-up.

kevinburkesegment closed this as completed Mar 24, 2022

Pryz mentioned this issue Apr 6, 2022

Add the ability to configure the compression for all Parquet fields #124

Closed

achille-roussel added the question label Jun 21, 2022

achille-roussel self-assigned this Jun 21, 2022
