CSV to Parquet #183
ping @xrl and @sunchao. You guys are more familiar with Arrow and converting CSV to Parquet.
Thanks @sadikovi, for number 4, I forgot that when modifying a parquet file, you indeed rewrite a new file. On number 3, a csv use case is simpler because I won't have nested values. I might however have nulls. Your explanation addresses most of my question, so for nulls I might be able to do:
let all_fields: Vec<Type> = some_data;
let unique_vals: Vec<Type> = some_data.distinct();
let positions: Vec<usize> = fn_field_positions(&all_fields, &unique_vals);
typed_writer.write_batch(&all_fields, None, Some(positions)) // here I'm trying to write unique vals and their positions; this might obviously be incorrect
Note that you provide the non-null values and the list of definition levels. Also, it looks like people struggle to write values. I will start working on a high-level write API, so you will be able to easily map values from CSV or JSON.
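For reference, a rough sketch of what that looks like with the typed write API: one OPTIONAL INT32 column where the values slice holds only the non-null entries, and the definition levels mark per row whether a value is present (1) or null (0). The file name, column name, and the exact writer API shape here are illustrative and may differ between crate versions.

```rust
use std::{fs::File, rc::Rc};

use parquet::{
    column::writer::ColumnWriter,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, RowGroupWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

fn main() {
    // One nullable column; the max definition level for `age` is 1.
    let schema = Rc::new(parse_message_type("message schema { OPTIONAL INT32 age; }").unwrap());
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("ages.parquet").unwrap();

    let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
    let mut row_group_writer = writer.next_row_group().unwrap();
    while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
        if let ColumnWriter::Int32ColumnWriter(ref mut typed) = col_writer {
            // Five rows: 24, NULL, 25, 26, NULL.
            // The values slice contains only the non-null entries; the
            // definition levels say, per row, whether a value is present.
            typed
                .write_batch(&[24, 25, 26], Some(&[1, 0, 1, 1, 0]), None)
                .unwrap();
        }
        row_group_writer.close_column(col_writer).unwrap();
    }
    writer.close_row_group(row_group_writer).unwrap();
    writer.close().unwrap();
}
```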
I've made some progress with generating a schema from inspecting a sample of csv values. An easier write API would be great, as right now I don't know how to deal with nulls. To deal with nulls, would a function/interface that takes Option<Type> work?
Okay, I will start planning work for the high-level API, plus fixing a bunch of other tickets from the backlog.
Yes, that would work, but we already have a high-level Row API; a user would just have to map data to that, and we would write a proper file and handle nulls, arrays, maps, etc.
Cheers!
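Until that high-level API exists, the Option-based interface suggested above can be approximated with a small helper over the typed column writer. This is only a sketch for flat (non-nested) optional columns; the function name and the Clone bound are mine, not part of the crate.

```rust
use parquet::column::writer::ColumnWriterImpl;
use parquet::data_type::DataType;
use parquet::errors::Result;

// Hypothetical helper: turn one flat, optional column into the
// (non-null values, definition levels) pair that write_batch expects.
fn write_optional_column<T: DataType>(
    writer: &mut ColumnWriterImpl<T>,
    column: &[Option<T::T>],
) -> Result<usize>
where
    T::T: Clone,
{
    // Non-null values only, in row order.
    let values: Vec<T::T> = column.iter().filter_map(|v| v.clone()).collect();
    // One definition level per row: 1 = present, 0 = null.
    let def_levels: Vec<i16> = column.iter().map(|v| v.is_some() as i16).collect();
    writer.write_batch(&values, Some(&def_levels), None)
}
```

A fuller version would also have to produce repetition levels for nested and repeated fields, which is the part the planned Row-based writer is meant to take over.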
Great question! I had a similar problem and this is what I came up with:
and the awkwardly named
It gets even more complicated if your schema is nested, so thankfully I've avoided that! I wouldn't know where to start with building out that code. You'll also see that my code isn't as generic as it could be; that's mostly just related to the lack of generality for writing strings (they have to be owned bytes).
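To make the owned-bytes point concrete, here is a rough end-to-end sketch of writing a REQUIRED UTF8 string column: strings are converted into ByteArray values, and the (UTF8) annotation in the schema is what lets Python readers show them as strings rather than raw bytes. The file and column names are made up, and the writer API may differ slightly between versions.

```rust
use std::{fs::File, rc::Rc};

use parquet::{
    column::writer::ColumnWriter,
    data_type::ByteArray,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, RowGroupWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

fn main() {
    // (UTF8) marks the BYTE_ARRAY column as a string, so readers such as
    // pyarrow/pandas decode it as str instead of raw bytes.
    let schema = Rc::new(
        parse_message_type("message schema { REQUIRED BYTE_ARRAY name (UTF8); }").unwrap(),
    );
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("names.parquet").unwrap();

    let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
    let mut row_group_writer = writer.next_row_group().unwrap();
    while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
        if let ColumnWriter::ByteArrayColumnWriter(ref mut typed) = col_writer {
            // Strings have to be handed over as owned byte arrays.
            let values: Vec<ByteArray> =
                ["alice", "bob"].iter().map(|s| ByteArray::from(*s)).collect();
            // REQUIRED column: no definition or repetition levels needed.
            typed.write_batch(&values, None, None).unwrap();
        }
        row_group_writer.close_column(col_writer).unwrap();
    }
    writer.close_row_group(row_group_writer).unwrap();
    writer.close().unwrap();
}
```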
@sadikovi I do agree that a high-level write API is required. I've been hammering out my diesel-to-parquet code, and I've been writing many different flavors of Vec<Option> writers and stumbling over keeping the column writers lined up with the schema. I have some observations that are probably best put in an issue. I'll wait for you to post your design and I will definitely chime in!
@xrl, I haven't gotten there yet with nulls. Here's my code to read a csv with strings and integers: https://gist.github.com/nevi-me/443025fe11038e2709083db2e24a5e64. I can do trial & error for other field types. Not the most efficient, but it's a start.
Hi, I'm experimenting with creating a CSV to Parquet writer, and I have a few questions.
My end goal for the experiment is to create a crate that converts various file formats (csv, bson, json) to parquet. This would lend itself to being a possible CSV to Apache Arrow reader.
1. How can I write strings? I used message schema {REQUIRED BYTE_ARRAY name;}, but when I read a parquet file in Python, the strings are shown as bytes. I found this one: {REQUIRED BYTE_ARRAY name (UTF8);}.
2. If I have a csv file that looks like:
   I have written the below:
   Similar to Q1, am I writing my strings properly, or is there a better way?
3. From reading through the conversation on How to write a None value to a column #174, it looks like I have to specify an index of where my values are. So, does typed_writer.write_batch(&[24,25,24,26,27,28], None, None) produce a less compact file? Is it even valid?
4. In general, does this library allow appending to existing parquet files? I haven't tried it yet.
I'm trying a naive approach of first reading csv files and generating the schema with a string builder. If that works, I can look at macros.
The other thing that I'll explore separately is how to convert csv's StringRecord values into parquet Types. Might use regex to figure out if things are strings, i32, i64, timestamps, etc. If someone knows of an existing way, that'd also be welcome.
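As a starting point for that inference, here is a rough sketch that guesses a Parquet physical type for a single CSV field using plain parse attempts rather than regexes; timestamp detection and widening a column's type across multiple sampled rows are left out, and the function name is only illustrative.

```rust
use parquet::basic::Type as PhysicalType;

// Guess a Parquet physical type for a single CSV field (illustrative only).
fn infer_physical_type(value: &str) -> PhysicalType {
    if value.parse::<bool>().is_ok() {
        PhysicalType::BOOLEAN
    } else if value.parse::<i32>().is_ok() {
        PhysicalType::INT32
    } else if value.parse::<i64>().is_ok() {
        PhysicalType::INT64
    } else if value.parse::<f64>().is_ok() {
        PhysicalType::DOUBLE
    } else {
        // Anything else is treated as a (UTF8) string.
        PhysicalType::BYTE_ARRAY
    }
}

fn main() {
    assert_eq!(infer_physical_type("24"), PhysicalType::INT32);
    assert_eq!(infer_physical_type("3000000000"), PhysicalType::INT64);
    assert_eq!(infer_physical_type("1.5"), PhysicalType::DOUBLE);
    assert_eq!(infer_physical_type("hello"), PhysicalType::BYTE_ARRAY);
}
```

For a whole column you would keep the widest type seen across the sampled rows (for example, promote INT32 to INT64, and fall back to BYTE_ARRAY on any mismatch) before building the message type string.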