Implement support for schema in saving dask parquet (#1736) #1746
Conversation
Thanks for raising this @avsolatorio, it's looking real good!

@datajoely I have an idea, but this might take some time. I'm looking at using Fugue. I'll make this PR a draft until I finish that.

I don't think PyArrow provides a native way to define a complex schema as text (as mentioned in your issue), although simple stuff like `pa.type_for_alias` does handle primitive type names.
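For reference, a minimal sketch of what PyArrow does support natively (the nested-type limitation is exactly the point at issue):

```python
import pyarrow as pa

# Simple type aliases resolve to pyarrow types...
print(pa.type_for_alias("int64"))  # int64

# ...but there is no built-in text grammar for nested types such as
# list<list<int64>>, so a string like "[[int64]]" cannot be parsed by
# pyarrow alone.
```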
Force-pushed from 66edb47 to 3c3a2d6.
@deepyaman thanks for looking into this! I checked Fugue, and I don't think it supports complex schema structures such as nested list types. For example, I don't see any way to represent a field that has the representation `[[int64]]`. I have implemented a parser instead to supplant the use of Fugue.
Ugh, you're right. I wonder if they'd be open to a simple change to support this?
I think something to this extent would work, but I'm personally in favor of a solution that doesn't involve maintaining a schema parser, if we can. I assume the subset of people using complex schemas in Dask is fairly small.
Hi @avsolatorio, thank you so much for taking this on.
I don't know much about dask and/or parquet, so I'd like to ask some questions:

1. I see you've added `pyarrow` as a dependency to the dataset. If a user just wants to use the existing functionality, they will now also need to install this dependency. Would it perhaps make sense to have a separate dataset that makes use of `pyarrow`, or do you think in general any user that uses dask/parquet would probably use `pyarrow` as well?
2. You mention: "Alternatively, a full `pyarrow.Schema` type may also be provided if desired." How exactly would that work?
3. In the schema a user provides, is it necessary to provide all columns with datatypes?
4. How would more complex datatypes work? Perhaps a more advanced example would be useful as well.

I guess I'm mainly trying to understand how many use cases this covers or if it's rather specific.
That's a good catch. I don't think it's necessary to add `pyarrow` as an explicit dependency here.

@MerelTheisenQB you're very much correct. While I think pyarrow is widely used, it is an optional dependency of dask as well. And as @deepyaman pointed out, dask's `to_parquet` already accepts a `pyarrow.Schema` for its `schema` argument, so a full `pyarrow.Schema` can simply be passed through. The parser that I implemented allows for all of these specifications in the `schema` save argument.
Not necessarily. If the passed value is a dictionary of field names to types, dask will infer the types of any columns that are not listed (see the sketch below).
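For example, a minimal sketch of that partial-inference behavior using dask directly (the file path and column names are illustrative):

```python
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa

ddf = dd.from_pandas(
    pd.DataFrame({"col1": [1, 2], "col2": [[[1, 2]], [[3]]]}),
    npartitions=1,
)

# Only col2's type is pinned explicitly; dask infers col1's type
ddf.to_parquet(
    "out.parquet",
    engine="pyarrow",
    schema={"col2": pa.list_(pa.list_(pa.int64()))},
)
```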
I discovered this issue when I was trying to create a dataset where I had to store, in a column, a list of tokenized segments of text, e.g., sentences in a paragraph. I need to use this data in a transformer model that requires the ordering of a sequence of segmented text data. Therefore the data structure that my pipeline outputs has the nested form `[[int64]]` (see the sketch below).

Indeed, I'm not sure how many would have this use case. I just thought that it would be good to support the functionalities of the underlying backends, since kedro passes the save arguments through to them.

I think @deepyaman's PR in Fugue is a very good alternative to decouple this dependency. Nonetheless, Fugue strictly depends on pyarrow, so it is inevitable that pyarrow will become an implicit dependency of kedro.
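For concreteness, that nested type can be constructed directly with PyArrow (a minimal sketch; the column name `segments` is hypothetical):

```python
import pyarrow as pa

# A column holding a list of tokenized segments: a list of lists of token ids
schema = pa.schema([("segments", pa.list_(pa.list_(pa.int64())))])
print(schema)  # segments: list<item: list<item: int64>>
```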
Responding again to your answers on my questions:
On question 4:
Thanks for explaining your need for this functionality in more detail; your use case definitely makes sense. And I appreciate your thoughts on adding the support in Kedro; it would be good to have this functionality for other users who run into this issue. I think my main concern with this implementation is explainability and maintainability. The first point is mainly around whether this solution makes sense to others and is intuitive enough that they would make use of it. I think that can be solved by adding detailed docs and examples. The second point, about maintainability, echoes @deepyaman's thoughts above on maintaining a schema parser, which wouldn't be ideal for the team.
The above sounds okay to me, because …
@avsolatorio We've discussed this among the maintainer team, and while we think it's important to support this functionality, we'd like to do it in a way that results in the least maintenance on our end, especially given our current expectation that the use case of complex schemas is pretty niche. To this extent, our preferences are, in order: …

While we appreciate the work you've done, and think it's really cool, we probably won't go the third route. For now, this would also mean using a custom local dataset (which would be true regardless, until the next release). That being said, would you be open to updating your PR (or opening a new one) to leverage `triad`?
Confirmed with the maintainers that this is just a pending bug, so I will try to get it resolved this weekend!

@deepyaman I'm very much open to updating the PR to use `triad`.

@avsolatorio Great! Triad 0.6.7 was released a few hours ago, and it has the bug fix to support nested schemas. :)
Force-pushed from 3c3a2d6 to c1069c4.
Hello, @deepyaman. I have already updated the PR to use `triad`.

This is excellent work @avsolatorio, really appreciate the effort here 💪
I think this looks great overall! Took a quick pass and left 2 comments; I'll try and go through it more properly later.
```python
if isinstance(schema, dict):
    pa_schema = triad.Schema(
        ",".join([":".join(i) for i in schema.items()])
```
You can construct `triad.Schema` directly from a dict; see https://github.com/fugue-project/triad/blob/master/triad/collections/schema.py#L59. An example with string values:

```python
>>> Schema({"a": "int", "b": "float"}).pyarrow_schema
a: int32
b: float
```
I don't know that you can do more complex stuff, like nested lists, but I think it's a bit unnatural that you define the top level as a dict but the nested lists as strings (as in the test cases).
I think if you've got a power-user, they can just construct the schema string for more complex cases.
Oh, I chose to implement it this way because of how `dask.to_parquet` distinguishes between a `dict`-like vs. a `pa.Schema`-like value for its `schema` argument.

If we directly cast a `dict` input into a schema as your snippet shows, `triad` would return a schema containing only the `key:value` pairs explicitly declared in the `dict`, which would prevent dask from inferring the types for the other fields (if any). That could cause unexpected behavior, since a user may simply want to explicitly override the inferred field types for some of the columns and let dask do the inference for the rest.

In some sense, the `dict` input is the more general case, since we can choose to specify the field types for only certain fields or for all of them.

> it's a bit unnatural that you define the top level as a dict but the nested lists as strings (as in the test cases).

Ah, I was simulating how the parsed arguments from the `catalog.yml` would be passed into the dataset. Since in the `catalog.yml` the specification looks like:

```yaml
save_args:
  schema:
    col1: int64
    col2: "[[int64]]"
```

then I suppose that would be passed as a dict with this structure: `{"col1": "int64", "col2": "[[int64]]"}`.
For my reference (and anybody else who may be looking at this and is not familiar with the way `dask.to_parquet` works), the code that does the partial inference that @avsolatorio describes is here: https://github.com/dask/dask/blob/2022.6.0/dask/dataframe/io/parquet/arrow.py#L500-L529

In that case, I agree that it makes sense not to convert to a full PyArrow schema before passing to `dask.to_parquet`, and to instead pass a field name-type mapping. Perhaps the exception is if somebody passes a `pa.Schema` object as `schema`, in which case we should replicate the Dask behavior of not performing inference? It's a bit confusing...

Taking all of this into account, maybe it's simplest and most explicit to:

- Accept a string, dict, `pa.Schema`, or whatever else Triad takes to construct a `triad.Schema` for the `schema` argument.
- Add an `infer_schema` argument (defaulting to `True`), based on which the `schema` arg is either used as is or as overrides for the inferred schema.

As for converting the `triad.Schema` object to a field-type mapping, you can use the `fields` property on `triad.Schema` (no need to extract the `pyarrow_schema` and iterate over that yourself).

Let me know if that makes sense?
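For illustration, a minimal sketch of that conversion (assuming, per the Triad source linked above, that `triad.Schema` parses string expressions and that its `fields` property yields `pyarrow.Field` objects):

```python
import triad

# Build a schema from a string spec, including a nested list type
triad_schema = triad.Schema("col1:int64,col2:[[int64]]")

# Convert to the field-name -> type mapping that dask.to_parquet accepts
mapping = {field.name: field.type for field in triad_schema.fields}
```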
""" | ||
schema = self._save_args.get("schema") | ||
|
||
if isinstance(schema, dict): |
Do we need to also support the case in which somebody passes a `pa.Schema`? I assume so, since that's the native Dask expectation.
Yep, that's a good point!
Using `pa.Schema` is already supported, since whatever value is passed in `_save_args["schema"]` will also be passed to Dask's `to_parquet`.
I'm so sorry about the delay on this. I told myself I'd take a look this weekend, but I got busy with other things. I will try and make sure to take a look at it later this week and push it through!
Sorry for the very overdue response. I've finally had the chance to give this a much more thorough look, and made a suggestion. Let me know if you agree, or if I'm missing something again.
Hi @avsolatorio, do you still want to complete this PR, or should someone from the team take over? We'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details).

Hello @merelcht, apologies for not being able to get back and close this PR earlier. I'll finalize this and update it with @deepyaman's comments.

Hi @avsolatorio, do you think you'll have time to complete this in the next few days? We are going to do a release early next week and need any dataset changes to be merged by then. Otherwise, we'll close this PR and ask you to re-open it on the new datasets repo when it's ready.
Force-pushed from 8ce053a to 9d03036.
Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
Force-pushed from 9d03036 to 037d3d6.
Hello @merelcht, PR updated! @deepyaman, kindly check, because I decided not to implement the `infer_schema` argument. The inference behavior instead depends on how the schema is specified (a dict triggers partial inference), as documented in the doc-strings.
Thanks for this contribution and your patience with our reviews @avsolatorio 🙏 ⭐
I especially like the clear doc-strings you've added to explain the schema functionality.
@avsolatorio Thank you for the updates! I think this makes sense, and the way you've documented when inference happens (i.e. if you specify a dict) is great.
Looking forward to seeing some more people use this and how they like the added power! :)
Great work! Thanks for the contribution 😃
Two minor code optimisation suggestions from me, nothing blocking.
```python
# Create a schema from values that triad can handle directly
triad_schema = triad.Schema(
    {k: v for k, v in schema.items() if not isinstance(v, str)}
)

# Handle the schema keys that are represented as strings and add them to the triad schema
triad_schema.update(
    triad.Schema(
        ",".join(
            [
                ":".join([k, v])
                for k, v in schema.items()
                if isinstance(v, str)
            ]
        )
    )
)
```
It would be cleaner to build the schema in one loop; would something like the following work (replacing the two-step construction quoted above)?

```python
triad_schema_dict = {}
for k, v in schema.items():
    if isinstance(v, str):
        ...
    else:
        ...
triad_schema = triad.Schema(triad_schema_dict)
```
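A hedged sketch of how those elided branches might be filled in (parsing each string spec through a one-field `triad.Schema`, and indexing the result by name, are assumptions about triad's API, not the merged implementation):

```python
import pyarrow as pa
import triad

# Hypothetical input mixing a pyarrow type with a string type expression
schema = {"col1": pa.int64(), "col2": "[[int64]]"}

triad_schema_dict = {}
for k, v in schema.items():
    if isinstance(v, str):
        # Parse the string expression into a pyarrow type via a one-field schema
        triad_schema_dict[k] = triad.Schema(f"{k}:{v}")[k].type
    else:
        # Values triad can handle directly, e.g. pyarrow DataTypes
        triad_schema_dict[k] = v

triad_schema = triad.Schema(triad_schema_dict)
```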
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@merelcht @deepyaman @jmholzer, thanks for all the suggestions! Apologies that I was not able to work on the final set of comments; I was traveling and wasn't able to find time to implement the suggestions. Nonetheless, great to see this PR merged! 🥳

@merelcht btw, would this PR qualify for this: #2050 (comment)? It would be super to be added to this list: #2064 (comment)! 🔋
Absolutely! Let me create a new PR to add you 🙂 |
Oh actually you're already there 😄 https://github.com/kedro-org/kedro/blob/main/RELEASE.md#contributions-from-the-kedroid-community |
Yay! Thank you so much! 🥳 |
Signed-off-by: Aivin V. Solatorio avsolatorio@gmail.com
Description
Resolves: #1736
Development notes
This change allows the use of the `schema` argument of the `dask.to_parquet` API from kedro's `dask.ParquetDataSet`. A custom parser for the schema is implemented; the parser supports all kinds of schema declarations accepted by the underlying `dask.to_parquet` API.

The documentation was updated to show an example of the grammar for defining the schema in the `catalog.yml`.

The change parses the `_save_args` for the `schema` key and handles the transformation of the fields to a `pyarrow.DataType` or `pyarrow.Schema` accordingly.

Tests have also been written for this change.
Checklist

- Updated the `RELEASE.md` file