arrow - dealing with columns of type ArrowType.Null #323

Closed
hjrnunes opened this issue Sep 21, 2022 · 2 comments

@hjrnunes
hjrnunes commented Sep 21, 2022

TMD throws when it reads an Arrow dataset containing a column of type ArrowType.Null:

(tmd-arrow/stream->dataset "withnullcol.arrow")
;; java.lang.Exception: Failed to datafy datatype class org.apache.arrow.vector.types.pojo.ArrowType$Null
;;     at tech.v3.libs.arrow$read_schema$fn__47204.invoke(arrow.clj:711)

If one extends the Datafiable protocol to this type, for example:

(extend-protocol clj-proto/Datafiable
  ArrowType$Null
  (datafy [this] {:datatype :boolean}))

it then throws:

java.lang.IndexOutOfBoundsException: null
 at clojure.lang.RT.subvec (RT.java:1614)
    clojure.core$subvec.invokeStatic (core.clj:3830)
    clojure.core$subvec.invoke (core.clj:3819)
    tech.v3.libs.arrow$records__GT_ds$fn__20962.invoke (arrow.clj:1365)

From looking at the code, it seems to me that TMD assumes every datatype has at least 2 buffers, and that assumption does not hold for this odd type.
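To make the buffer-count point concrete, here is a simplified model of how a reader might split a record batch's flat buffer list across columns. It is only a sketch and not TMD's code: `BUFFERS_PER_LAYOUT` and `split_buffers` are hypothetical names, and the buffer counts follow the Arrow columnar format, where a null-typed column has no physical buffers, a fixed-width primitive has two, and a variable-width string has three.

```python
# Simplified model of splitting a record batch's flat buffer list by column.
# BUFFERS_PER_LAYOUT and split_buffers are hypothetical names for illustration;
# this is not TMD's implementation.

BUFFERS_PER_LAYOUT = {
    "null": 0,   # no physical buffers at all
    "int64": 2,  # validity bitmap + values
    "utf8": 3,   # validity bitmap + offsets + character data
}


def split_buffers(field_types, flat_buffers):
    """Assign each field its slice of the flat buffer list."""
    result = []
    pos = 0
    for ftype in field_types:
        n = BUFFERS_PER_LAYOUT[ftype]
        if pos + n > len(flat_buffers):
            raise IndexError(f"not enough buffers for field of type {ftype}")
        result.append(flat_buffers[pos:pos + n])
        pos += n
    return result


# Schema from the issue: an int64 column followed by a null column.
buffers = ["year-validity", "year-values"]
columns = split_buffers(["int64", "null"], buffers)
print(columns)  # [['year-validity', 'year-values'], []]
```

Treating the null column as if it had two buffers, for example `split_buffers(["int64", "int64"], buffers)`, runs past the end of the list, which is the same kind of failure as the `subvec` IndexOutOfBoundsException above.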

Python used to create the arrow dataset:

import pyarrow as pa
from pyarrow import feather

# One int64 column plus one column of Arrow's null type (no physical data).
my_schema = pa.schema([
    pa.field('year', pa.int64()),
    pa.field('nullcol', pa.null())])
pylist = [{'year': 2020, 'nullcol': None}]
table = pa.Table.from_pylist(pylist, schema=my_schema)
feather.write_feather(table, "withnullcol.arrow", compression="zstd", version=2)

@cnuernber
Collaborator

Great, thanks for the issue, will fix soon

@cnuernber
Collaborator

Release 6.100 fixes this. The Arrow docs state that the null schema type is for columns with no physical data, so a column of all missing entries is reasonably null. Whether this is broadly useful or not is a different question...
