This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

How to use snappy? #114

Closed
seaguest opened this issue Mar 22, 2022 · 8 comments

@seaguest

Hello,

I am trying to use this library with snappy, but I can't find any example. Here is my code:

func write4() {
	f, err := os.OpenFile("output4.parquet", os.O_APPEND|os.O_WRONLY|os.O_CREATE, os.ModePerm)
	if err != nil {
		log.Error(err)
		return
	}
	defer f.Close()

	writer, _ := parquet3.Snappy.NewWriter(f)

	type record struct {
		Format   string `parquet:"format"`
		DataType int32  `parquet:"data_type"`
		Country  string `parquet:"country"`
	}

	num := 1000
	for i := 0; i < num; i++ {
		stu := record{
			Format:   "Test",
			DataType: 1,
			Country:  "IN",
		}

		writer.Write(stu) // here argument can only be []byte
	}
	// Closing the writer is necessary to flush buffers and write the file footer.
	if err := writer.Close(); err != nil {
		log.Error(err)
	}
}

I tried github.com/fraugster/parquet-go and github.com/fraugster/parquet-go/parquet; it is very easy to use SNAPPY with them.

writer.Write(stu) // here argument can only be []byte
What is the right way to do this?

@seaguest
Author

I tried using:

	type record struct {
		Format   string `parquet:"format,snappy"`
		DataType int32  `parquet:"data_type,snappy"`
		Country  string `parquet:"country,snappy"`
	}

but the output file is still very large, about 34 MB, while with the other two libraries it is only about 1 KB.

@Pryz
Contributor

Pryz commented Mar 22, 2022

Hi @seaguest,

Looking at the snippet you shared, it doesn't look like you are using segmentio/parquet-go.

Using the following:

package main

import (
        "os"

        "log"

        segmentparquet "github.com/segmentio/parquet-go"
)

func write() {
        f, err := os.OpenFile("outputs.parquet", os.O_APPEND|os.O_WRONLY|os.O_CREATE|os.O_TRUNC, os.ModePerm)
        if err != nil {
                log.Println(err)
                return
        }
        defer f.Close()

        writer := segmentparquet.NewWriter(f)

        type record struct {
                Format   string `parquet:"format,snappy"`
                DataType int32  `parquet:"data_type,snappy"`
                Country  string `parquet:"country,snappy"`
        }

        num := 1000
        for i := 0; i < num; i++ {
                stu := record{
                        Format:   "Test",
                        DataType: 1,
                        Country:  "IN",
                }

                // Write accepts the struct value directly; check the returned error.
                if err := writer.Write(stu); err != nil {
                        log.Println(err)
                        return
                }
        }
        // Closing the writer is necessary to flush buffers and write the file footer.
        if err := writer.Close(); err != nil {
                log.Println(err)
        }

}

func main() {
        write()
}

this ends up creating a file of 1060 bytes, so about 1 KB.
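
For readers wondering why 1000 rows can fit in about 1 KB: every row in the example is identical, so the data is extremely compressible. Below is a minimal sketch of that effect using only the Go standard library. It is not what parquet-go does internally; it uses gzip as a stand-in because snappy is not in the standard library, and the `Test|1|IN` row layout and the helper names `repeatedRows` and `compressedSize` are made up for illustration.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedSize gzips data in memory and returns the compressed length.
// gzip is only a stand-in here; it is not the snappy codec.
func compressedSize(data []byte) (int, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return 0, err
	}
	// Close flushes pending data and writes the gzip trailer.
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// repeatedRows builds n copies of the same made-up 10-byte row.
func repeatedRows(n int) []byte {
	return bytes.Repeat([]byte("Test|1|IN\n"), n)
}

func main() {
	raw := repeatedRows(1000)
	size, err := compressedSize(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("raw: %d bytes, compressed: %d bytes\n", len(raw), size)
}
```

Real data with more varied values will compress far less than this, so the 1 KB figure says more about the test input than about the codec.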

@seaguest
Copy link
Author

@Pryz

Indeed, it works now.
But I am curious why we should put a snappy annotation on each field; usually we won't have different compression types for different fields in one struct.

I saw that another library has an option like this:

	pw.CompressionType = parquet2.CompressionCodec_GZIP

Why doesn't this library have such an option?

@kevinburkesegment
Contributor

usually we won't have different compression types for different fields in one struct.

We're planning to use different compression types for different fields in one struct (tracing data), which is why we thought that choice was a good fit.

@kevinburkesegment
Contributor

I'm going to close this - thanks for the issue report and glad you got it working!

@seaguest
Author

Are you planning to provide an option to set the compression type for all fields in the future?
If we have hundreds of fields, it would be a disaster to add "snappy" to each one, and that is pointless when we need only one compression type.

Thanks for your quick reply!
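
Until a writer-level option exists, one way to keep a large struct consistent is a small check that reports fields whose `parquet` tag lacks the codec. This is only a sketch under assumptions: `missingCodec` is a hypothetical helper built on the standard `reflect` package, not part of parquet-go, and it assumes the `name,option` tag layout used in the snippets above.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// missingCodec returns the names of exported struct fields whose `parquet`
// tag does not list codec among its comma-separated options.
// Hypothetical helper for illustration; not part of any parquet library.
func missingCodec(v interface{}, codec string) []string {
	t := reflect.TypeOf(v)
	if t == nil {
		return nil
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return nil
	}
	var missing []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		tag := f.Tag.Get("parquet")
		// Skip unexported fields and fields explicitly excluded with "-".
		if f.PkgPath != "" || tag == "-" {
			continue
		}
		found := false
		// The first element is the column name; the rest are options.
		for _, opt := range strings.Split(tag, ",")[1:] {
			if strings.TrimSpace(opt) == codec {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, f.Name)
		}
	}
	return missing
}

type record struct {
	Format   string `parquet:"format,snappy"`
	DataType int32  `parquet:"data_type"`
	Country  string `parquet:"country,snappy"`
}

func main() {
	fmt.Println(missingCodec(record{}, "snappy")) // prints [DataType]
}
```

Calling this from a unit test would flag any field added later without the codec option, which avoids auditing hundreds of tags by hand.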

@himanshpal

himanshpal commented Apr 4, 2022

+1 on this. It would be great to provide an alternative way to pass compression config while initialising a writer.

@Pryz
Contributor

Pryz commented Apr 6, 2022

Created #124 as a follow-up.

@achille-roussel achille-roussel added the question Further information is requested label Jun 21, 2022
@achille-roussel achille-roussel self-assigned this Jun 21, 2022