Production Use #2
I'm testing it out as well; it would be a great idea to provide godoc and CI badges. |
Hi @kelindar @nut-abctech, sorry about the delayed reply, and thanks for your suggestions. |
Hi @xitongsys, we are looking to convert .csv files on S3 to parquet. Would this tool be suitable? |
@IkiM0no Yes, it works. But you should provide a similar interface for S3, as in my example. In addition, I'm considering some other work: |
Very helpful @xitongsys thank you very much. |
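Regarding the S3 interface mentioned above: parquet-go reads and writes through a file-like abstraction, so an S3 adapter mainly needs to satisfy read/write/seek/close. The exact interface has changed between versions, so treat the following as a rough sketch only; the ParquetFile shape and the s3File type here are illustrative, not the library's actual API.

```go
package s3source

import (
	"bytes"
	"io"
)

// Approximate shape of the file abstraction parquet-go expects;
// the real interface in the library differs by version.
type ParquetFile interface {
	io.Reader
	io.Writer
	io.Seeker
	io.Closer
}

// s3File is an illustrative adapter. For simplicity it wraps an
// in-memory copy of the object; a real implementation would call
// the S3 SDK (ranged GetObject for reads, multipart upload on close).
type s3File struct {
	r *bytes.Reader // download side
	w bytes.Buffer  // upload side, to be flushed to S3 on Close
}

func newS3File(data []byte) *s3File { return &s3File{r: bytes.NewReader(data)} }

func (f *s3File) Read(p []byte) (int, error)  { return f.r.Read(p) }
func (f *s3File) Write(p []byte) (int, error) { return f.w.Write(p) }
func (f *s3File) Seek(off int64, whence int) (int64, error) {
	return f.r.Seek(off, whence)
}
func (f *s3File) Close() error { return nil } // a real version would upload f.w here

var _ ParquetFile = (*s3File)(nil) // compile-time interface check
```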
@IkiM0no Oh, sorry about my misleading wording. |
Thanks for the clarification @xitongsys :) I suspect this may be due to the parameters I am passing to |
@IkiM0no What do you mean by "broken"? Do you store some strings but get byte characters back, like |
@xitongsys, thanks, let me provide some clarification and my steps. My csv sitting on S3 is very simple, 3 rows and 3 columns, like this:
My test script is just grabbing this file from S3, no problems there.
When I SELECT * from that table temp.parqtest, the data is all jumbled together. It doesn't produce the rows and columns I expect. I'll post a screenshot tomorrow. |
Hi @IkiM0no, you can try it, and if there are still errors, please tell me :) |
Ah! Please forgive my user error. I have corrected this issue. I have also changed the variable names so they aren't the SQL reserved words 'First' and 'Last', so 'First' -> 'First_name', etc. Now I can inspect the schema and file using parquet-tools.
Great! Interestingly though, the columns appear in reverse order. I have seen that this can be an issue for Impala. Anyway, I dropped and re-created my table using either column order and refreshed:
Now when I query the table, Impala complains:
Can I specify some other encoding? EDIT: I found this page stating that Impala supports certain encodings: https://www.cloudera.com/documentation/enterprise/5-3-x/topics/impala_parquet.html parquet-tools meta shows:
|
@IkiM0no |
@xitongsys |
@IkiM0no |
@xitongsys something I've been working on is abstracting the need to write/hard-code a struct for each csv file I am converting to parquet (with the help of parquet-go, of course (: ). I'm expressing this as schemata.go in my project:
and generating a list of that type at runtime. The issue is that the type isn't known at compile time. How would you recommend approaching this? I feel like maybe Write.go should expose a method that accepts a dynamically described schema. |
@IkiM0no |
@xitongsys apologies for the delay, but I have now had time to test the CSVWriter and am happy to report this works brilliantly! Many thanks. I've been telling people about this library and how great it is. Will continue to test and provide feedback as we scale up our efforts :) |
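For later readers: the CSVWriter takes its schema as a slice of metadata strings rather than a compiled-in struct, which is what makes the dynamic per-file schema approach above workable. A minimal sketch, using the metadata syntax and import paths of a much later parquet-go release than the one in this thread; the input path is hypothetical, and every column is treated as a UTF8 string for simplicity:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/writer"
)

func main() {
	f, err := os.Open("input.csv") // hypothetical input path
	if err != nil {
		panic(err)
	}
	defer f.Close()
	r := csv.NewReader(f)

	// Build the schema from the header row instead of a hard-coded
	// struct; every column becomes a UTF8 string column here.
	header, err := r.Read()
	if err != nil {
		panic(err)
	}
	md := make([]string, len(header))
	for i, col := range header {
		md[i] = fmt.Sprintf("name=%s, type=BYTE_ARRAY, convertedtype=UTF8", col)
	}

	fw, err := local.NewLocalFileWriter("output.parquet")
	if err != nil {
		panic(err)
	}
	defer fw.Close()
	pw, err := writer.NewCSVWriter(md, fw, 1)
	if err != nil {
		panic(err)
	}
	for {
		row, err := r.Read()
		if err != nil {
			break // io.EOF ends the loop
		}
		rec := make([]*string, len(row))
		for i := range row {
			rec[i] = &row[i]
		}
		if err := pw.WriteString(rec); err != nil {
			panic(err)
		}
	}
	if err := pw.WriteStop(); err != nil {
		panic(err)
	}
}
```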
@IkiM0no
Maybe you should update your code. The readme and examples have been updated; you can read them. |
Thank you. We will test and provide feedback.
…On Oct 27, 2017 10:13 PM, "xitongsys" wrote:
@IkiM0no thanks for your test and feedback.
Parquet-go is still under active development and it will have some changes. If your code doesn't work with the latest code, you can read the readme file :)
(Today I added some features, and you should first call Flush() before WriteStop(). The readme and examples have been updated.)
|
Hi @xitongsys. I'm now using parquet-go to iterate over a data set and want to write to multiple parquet files based on a partition column in the data: all rows with partition col = "01" write to 01.parq, rows with partition col = "02" write to 02.parq, etc. I am spinning off a 'worker' goroutine for each row and passing the row over a channel to a worker() function. Now I would like to create a different version of the Create() method from your example, one that creates the parquet file if it does not exist (that partition has not yet been seen) or simply appends to that .parq file if it has already been created. I have done this successfully with plain .csv files, but am having trouble with parquet-go because ParquetFile is an interface and I'm having trouble extending it. This is my basic method:
Do you have advice on how I can add this new method to the *PqFile object? |
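One common shape for that fan-out, sketched generically: rather than a goroutine per row, keep one long-lived worker per partition, created lazily the first time a partition value is seen, so each output file ends up with exactly one writer. All names below are illustrative, not from this thread:

```go
package partition

import "sync"

// route fans rows out to one long-lived worker per partition value.
// handle is called once per partition with that partition's rows;
// it is where a file like 01.parq would be opened and written.
func route(rows <-chan []string, partCol int, handle func(part string, rows <-chan []string)) {
	var wg sync.WaitGroup
	workers := map[string]chan []string{}
	for row := range rows {
		part := row[partCol]
		ch, ok := workers[part]
		if !ok {
			// First time this partition is seen: start its worker.
			ch = make(chan []string, 64)
			workers[part] = ch
			wg.Add(1)
			go func(part string, ch <-chan []string) {
				defer wg.Done()
				handle(part, ch)
			}(part, ch)
		}
		ch <- row
	}
	for _, ch := range workers {
		close(ch) // lets each worker's range loop finish
	}
	wg.Wait()
}
```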
Hi @IkiM0no, do you mean you must use 'CreateOrAppend' as your function name? If not, why don't you put your code in the Create() method? |
@xitongsys good suggestion, I have implemented it.
Below is writer.go in my project. As a test, I added a csv writer alongside the parquet writer. The csv files write as expected, and the .parq files are created, but they only contain the string 'PAR1'. I know I have a bug somewhere. I have added lots of print statements and verified that the channels are sending and receiving data, but I cannot seem to locate it. Do you see that I have implemented something incorrectly in my code?
|
hi @IkiM0no
You should use |
Thank you for the suggestion @xitongsys :) I see what happened: I was not using it. I have implemented the change you suggested, but still see only "PAR1" in the parquet files, while the csv files (written by the same workers) contain complete data. I wonder what the issue is, because those are the only 4 bytes contained in each parquet file; "PAR1" is just parquet's magic header, so the files are being created but nothing is ever flushed into them. I have added a small change to drop the last item in the row slice, as that is the partition column and not part of the schema. This is my updated writer.go with the change you suggested. I wonder, do you think that I need to
|
Hi @IkiM0no, sorry about the delay. These days I have rewritten many parts of parquet-go, which gives a great performance improvement. Some of the functions have changed, so please use the latest version. The parquet writer has a buffer inside, so you can't create a new writer to append more data to the same parquet file. I have changed many places and fixed some bugs in your code; now it works. |
@IkiM0no I have updated parquet-go again; the reader/writer is now implemented inside the library, and users need not implement it themselves. Please use the latest version. I have changed and tested your code, and it is OK.
|
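Tying the thread together: since the writer buffers row groups internally and a finished file cannot be appended to, each partition should get its own file and writer for its whole lifetime, finalized once with WriteStop(). A sketch using import paths and signatures from a much later parquet-go release than the version under discussion, with hypothetical column metadata:

```go
package main

import (
	"fmt"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/writer"
)

// writePartition writes all rows of one partition to its own parquet
// file. The writer lives for the whole partition, because appending
// to an already finalized parquet file is not supported.
func writePartition(part string, md []string, rows [][]string) error {
	fw, err := local.NewLocalFileWriter(part + ".parq")
	if err != nil {
		return err
	}
	defer fw.Close()

	pw, err := writer.NewCSVWriter(md, fw, 1)
	if err != nil {
		return err
	}
	for _, row := range rows {
		rec := make([]*string, len(row))
		for i := range row {
			rec[i] = &row[i]
		}
		if err := pw.WriteString(rec); err != nil {
			return err
		}
	}
	return pw.WriteStop() // flushes buffered row groups and writes the footer
}

func main() {
	// Hypothetical two-column schema matching the earlier discussion.
	md := []string{
		"name=First_name, type=BYTE_ARRAY, convertedtype=UTF8",
		"name=Last_name, type=BYTE_ARRAY, convertedtype=UTF8",
	}
	if err := writePartition("01", md, [][]string{{"Ada", "Lovelace"}}); err != nil {
		fmt.Println(err)
	}
}
```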
Thank you @xitongsys very much for your continued support and development on parquet-go! :) |
@xitongsys happy to report this latest v0.9.8 is working very, very well! 👍 The new Writer capabilities make this very fast now that we can use concurrency. One thing I'm still wondering about is how to properly pass dates to parquet-go.
This is the output from inspecting the same column's schema:
Should I be formatting the timestamp(s) prior to writing to the file? Perhaps it prefers Unix epoch time format? |
@IkiM0no "DATE is used to for a logical date type, without a time of day. It must annotate an int32 that stores the number of days from the Unix epoch, 1 January 1970."
If you want to store a timestamp, I think you should use TIMESTAMP_MILLIS. |
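Concretely, that means converting a Go time.Time before handing it to the writer: DATE expects an int32 count of days since the Unix epoch, and TIMESTAMP_MILLIS an int64 count of milliseconds. A small sketch (the function names are just illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// toParquetDate returns the DATE representation: whole days since
// the Unix epoch (1 January 1970), stored as int32.
func toParquetDate(t time.Time) int32 {
	return int32(t.UTC().Unix() / 86400)
}

// toParquetTimestampMillis returns the TIMESTAMP_MILLIS representation:
// milliseconds since the Unix epoch, stored as int64.
func toParquetTimestampMillis(t time.Time) int64 {
	return t.UTC().UnixNano() / int64(time.Millisecond)
}

func main() {
	ts := time.Date(2017, 10, 27, 22, 13, 0, 0, time.UTC)
	fmt.Println(toParquetDate(ts))            // 17466 (days)
	fmt.Println(toParquetTimestampMillis(ts)) // 1509142380000 (ms)
}
```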
I was wondering if there's an update here as to whether this library should be considered stable for production usage. |
As far as I know, several people have used this library in their production environments, so I think you can also give it a try :) |
@tejasmanohar I'm already using it in a production application. This was the only Go library I found that can read and write parquet files. It saves a lot of work for me and it has worked very well so far. However, in my case performance is not critical. Why not give it a try; maybe you can help improve it as well. |
@xitongsys @aohua Sweet. I just wanted to know if folks are depending on it, because I can also just use the JVM like we have to for Spark. |
Hi @xitongsys, it seems like this is still under active development; do you think it's ready for production use? We'd like to replace some of our Spark pipelines :)