-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[r] Update SOMADataFrame
creation and writes
#1835
Conversation
This pull request has been linked to Shortcut Story #35945: [TileDB-R] Incorporate core enumeration PRs for zero levels at create, and level extension at write. |
CI fails because I |
Codecov ReportAll modified and coverable lines are covered by tests ✅ ❗ Your organization needs to install the Codecov GitHub app to enable full functionality. see 38 files with indirect coverage changes 📢 Thoughts on this report? Let us know!. |
SOMADataFrame
creation and writes
b5dfa48
to
f8a6c4e
Compare
tiledb::tiledb_array_schema_set_enumeration_empty(schema = tdb_schema, | ||
attr = tdb_attrs[[field_name]], | ||
enum_name = field_name, | ||
type_str = "UTF8", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while UTF-8 strings are indeed the nominal (and by far the most common) value type, they are quite general. we don't have existing unit-test coverage for this in tiledbsoma-r to date, but, i think this should also be tiledb_type_from_arrow_type
here on the arrow-schema's value type -- so arrow::string
would map to '"UTF8", and
arrow::float64to
"FLOAT64"`, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry but are asking for a function such as
TileDB-SOMA/apis/r/R/utils-arrow.R
Lines 43 to 91 in 501ec84
tiledb_type_from_arrow_type <- function(x) { | |
stopifnot(is_arrow_data_type(x)) | |
switch(x$name, | |
int8 = "INT8", | |
int16 = "INT16", | |
int32 = "INT32", | |
int64 = "INT64", | |
uint8 = "UINT8", | |
uint16 = "UINT16", | |
uint32 = "UINT32", | |
uint64 = "UINT64", | |
float32 = "FLOAT32", | |
float = "FLOAT32", | |
float64 = "FLOAT64", | |
# based on tiledb::r_to_tiledb_type() | |
double = "FLOAT64", | |
boolean = "BOOL", | |
bool = "BOOL", | |
# large_string = "large_string", | |
# binary = "binary", | |
# large_binary = "large_binary", | |
# fixed_size_binary = "fixed_size_binary", | |
# tiledb::r_to_tiledb_type() returns UTF8 for characters but they are | |
# not yet queryable so we use ASCII for now | |
utf8 = "ASCII", | |
string = "ASCII", | |
large_utf8 = "ASCII", | |
# date32 = "date32", | |
# date64 = "date64", | |
# time32 = "time32", | |
# time64 = "time64", | |
# null = "null", | |
# timestamp = "timestamp", | |
# decimal128 = "decimal128", | |
# decimal256 = "decimal256", | |
# struct = "struct", | |
# list_of = "list", | |
# list = "list", | |
# large_list_of = "large_list", | |
# large_list = "large_list", | |
# fixed_size_list_of = "fixed_size_list", | |
# fixed_size_list = "fixed_size_list", | |
# map_of = "map", | |
# duration = "duration", | |
dictionary = "INT32", # for a dictionary the 'values' are ints, levels are character | |
stop("Unsupported Arrow data type: ", x$name, call. = FALSE) | |
) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eddelbuettel yes
Arrow dictionary types have an index type (often but not necessarily int8
) and a value type (very often but not necessariliy string
) -- both can vary and both need to be looked up, and tiledb_type_from_arrow_type
seems to be the best way to do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eddelbuettel IIUC we're lacking support, and unit-test, for non-string enumeration-value types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnkerl Do you have an example data set with such factors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up in #1846
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned I believe this to be due to the factor that your column titled enum
is a factor
, and as we saw you cannot create a factor outside the 'int as index, string as levels' pair. So in the second example we cannot return a tibble
because a tibble
would require a factor
and we cannot instantiate one with those types as far as I understand it. At least I have not been able to create one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in straight up arrow
(the R package) going away from int
seems to have issues:
> dictionary(int8(), utf8())
DictionaryType
dictionary<values=string, indices=int8>
>
> dictionary(float32(), utf8())
Error: Type error: Dictionary index type should be integer, got float
>
> dictionary(float64(), utf8())
Error: Type error: Dictionary index type should be integer, got double
>
Double values works but let me see if I can get it into R:
> dictionary(int8(), float64())
DictionaryType
dictionary<values=double, indices=int8>
>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally have an existence proof:
> dictionary(int8(), float64())
DictionaryType
dictionary<values=double, indices=int8>
> sch <- arrow::schema(arrow::field("ind", arrow::int8()), arrow::field("fct", arrow::dictionary(int8(), float64())))
> sch
Schema
ind: int8
fct: dictionary<values=double, indices=int8>
> arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3))
Table
3 rows x 2 columns
$ind <int32>
$fct <double>
> tibble::as_tibble(arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3)))
# A tibble: 3 × 2
ind fct
<int> <dbl>
1 1 2.1
2 2 3.2
3 3 4.3
>
but note that even here we loose the factor
-ness. The second column is now a straight-up double
with no labels:
> str(tibble::as_tibble(arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3))))
tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
$ ind: int [1:3] 1 2 3
$ fct: num [1:3] 2.1 3.2 4.3
>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eddelbuettel !
* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2 * Point at dev tiledb-r from #1835
This requires tiledb-r version 0.21.2 ("to be") or 0.21.1.10 as of right now
b4567b8
to
d8b8882
Compare
* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2 * Point at dev tiledb-r from #1835
* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2 * Point at dev tiledb-r from #1835
* [r] Updated SOMADataFrame creation and writing This requires tiledb-r version 0.21.2 ("to be") or 0.21.1.10 as of right now * [r] Update two test files to reflect SOMADataFrame updates * [r] *Temporarily* 'park' tests file affected by SeuratObject 5.0.0 * [r] Set 'UTF8' as the string type * [r] Extend the type inference to dictionary index type * [r] Following #1809 and rebase re-activeate four 'parked' tests
* [r] Updated SOMADataFrame creation and writing This requires tiledb-r version 0.21.2 ("to be") or 0.21.1.10 as of right now * [r] Update two test files to reflect SOMADataFrame updates * [r] *Temporarily* 'park' tests file affected by SeuratObject 5.0.0 * [r] Set 'UTF8' as the string type * [r] Extend the type inference to dictionary index type * [r] Following #1809 and rebase re-activeate four 'parked' tests Co-authored-by: Dirk Eddelbuettel <edd@debian.org>
Issue and/or context:
TileDB Core 2.17.3 provides updated semantics for Enumerations (aka
factor
variables). The TileDB R package (in upcoming release 0.21.2 or the current snapshot 0.21.1.10) connects this to R, and this PR lets the TileDB SOMA package take advantage of it.Changes:
Data frame objects are created with 'empty' enumerations, i.e. without factor levels. These are added as needed during writes.
Notes for Reviewer:
The PR is still in draft form as it needed to 'park' four test files affected by an unrelated update toSeuratObject
.The CI setup is once again modified to let is use 'newer than CRAN' package versions of TileDB R via r-universe.
We can lift the 'draft' status when either or both of these constraints can be lifted.
SC 35945