Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Update SOMADataFrame creation and writes #1835

Merged
merged 6 commits into from
Nov 1, 2023

Conversation

eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Oct 28, 2023

Issue and/or context:

TileDB Core 2.17.3 provides updated semantics for Enumerations (aka factor variables). The TileDB R package (in upcoming release 0.21.2 or the current snapshot 0.21.1.10) connects this to R, and this PR lets the TileDB SOMA package take advantage of it.

Changes:

Data frame objects are created with 'empty' enumerations, i.e. without factor levels. These are added as needed during writes.

Notes for Reviewer:

The PR is still in draft form as it needed to 'park' four test files affected by an unrelated update to SeuratObject.

The CI setup is once again modified to let is use 'newer than CRAN' package versions of TileDB R via r-universe.

We can lift the 'draft' status when either or both of these constraints can be lifted.

SC 35945

@shortcut-integration
Copy link

@eddelbuettel
Copy link
Member Author

eddelbuettel commented Oct 28, 2023

CI fails because I overlooked that TileDB R 0.2.1.10 does not automatically bring in forgot to update the package fallback-pin to TileDB Core 2.17.3. This has been taken care of in 66118ca.

@codecov-commenter
Copy link

codecov-commenter commented Oct 28, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

see 38 files with indirect coverage changes

📢 Thoughts on this report? Let us know!.

@johnkerl johnkerl changed the title [r] Update SOMADataFrame creation and writes [r] Update SOMADataFrame creation and writes Oct 30, 2023
tiledb::tiledb_array_schema_set_enumeration_empty(schema = tdb_schema,
attr = tdb_attrs[[field_name]],
enum_name = field_name,
type_str = "UTF8",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while UTF-8 strings are indeed the nominal (and by far the most common) value type, they are quite general. we don't have existing unit-test coverage for this in tiledbsoma-r to date, but, i think this should also be tiledb_type_from_arrow_type here on the arrow-schema's value type -- so arrow::string would map to '"UTF8", and arrow::float64to"FLOAT64"`, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry but are asking for a function such as

tiledb_type_from_arrow_type <- function(x) {
stopifnot(is_arrow_data_type(x))
switch(x$name,
int8 = "INT8",
int16 = "INT16",
int32 = "INT32",
int64 = "INT64",
uint8 = "UINT8",
uint16 = "UINT16",
uint32 = "UINT32",
uint64 = "UINT64",
float32 = "FLOAT32",
float = "FLOAT32",
float64 = "FLOAT64",
# based on tiledb::r_to_tiledb_type()
double = "FLOAT64",
boolean = "BOOL",
bool = "BOOL",
# large_string = "large_string",
# binary = "binary",
# large_binary = "large_binary",
# fixed_size_binary = "fixed_size_binary",
# tiledb::r_to_tiledb_type() returns UTF8 for characters but they are
# not yet queryable so we use ASCII for now
utf8 = "ASCII",
string = "ASCII",
large_utf8 = "ASCII",
# date32 = "date32",
# date64 = "date64",
# time32 = "time32",
# time64 = "time64",
# null = "null",
# timestamp = "timestamp",
# decimal128 = "decimal128",
# decimal256 = "decimal256",
# struct = "struct",
# list_of = "list",
# list = "list",
# large_list_of = "large_list",
# large_list = "large_list",
# fixed_size_list_of = "fixed_size_list",
# fixed_size_list = "fixed_size_list",
# map_of = "map",
# duration = "duration",
dictionary = "INT32", # for a dictionary the 'values' are ints, levels are character
stop("Unsupported Arrow data type: ", x$name, call. = FALSE)
)
}
?

Copy link
Member

@johnkerl johnkerl Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddelbuettel yes

Arrow dictionary types have an index type (often but not necessarily int8) and a value type (very often but not necessariliy string) -- both can vary and both need to be looked up, and tiledb_type_from_arrow_type seems to be the best way to do that

Copy link
Member

@johnkerl johnkerl Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddelbuettel IIUC we're lacking support, and unit-test, for non-string enumeration-value types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnkerl Do you have an example data set with such factors?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@eddelbuettel eddelbuettel Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned I believe this to be due to the factor that your column titled enum is a factor, and as we saw you cannot create a factor outside the 'int as index, string as levels' pair. So in the second example we cannot return a tibble because a tibble would require a factor and we cannot instantiate one with those types as far as I understand it. At least I have not been able to create one.

Copy link
Member Author

@eddelbuettel eddelbuettel Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in straight up arrow (the R package) going away from int seems to have issues:

> dictionary(int8(), utf8())
DictionaryType
dictionary<values=string, indices=int8>
>
> dictionary(float32(), utf8())
Error: Type error: Dictionary index type should be integer, got float
> 
> dictionary(float64(), utf8())
Error: Type error: Dictionary index type should be integer, got double
> 

Double values works but let me see if I can get it into R:

> dictionary(int8(), float64())
DictionaryType
dictionary<values=double, indices=int8>
> 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally have an existence proof:

> dictionary(int8(), float64())
DictionaryType
dictionary<values=double, indices=int8>
> sch <- arrow::schema(arrow::field("ind", arrow::int8()), arrow::field("fct", arrow::dictionary(int8(), float64())))
> sch
Schema
ind: int8
fct: dictionary<values=double, indices=int8>
> arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3))
Table
3 rows x 2 columns
$ind <int32>
$fct <double>
> tibble::as_tibble(arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3)))
# A tibble: 3 × 2
    ind   fct
  <int> <dbl>
1     1   2.1
2     2   3.2
3     3   4.3
> 

but note that even here we loose the factor-ness. The second column is now a straight-up double with no labels:

> str(tibble::as_tibble(arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3))))
tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
 $ ind: int [1:3] 1 2 3
 $ fct: num [1:3] 2.1 3.2 4.3
> 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eddelbuettel !

johnkerl added a commit that referenced this pull request Oct 31, 2023
johnkerl added a commit that referenced this pull request Oct 31, 2023
johnkerl added a commit that referenced this pull request Oct 31, 2023
* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2

* Point at dev tiledb-r from #1835
nguyenv pushed a commit that referenced this pull request Oct 31, 2023
* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2

* Point at dev tiledb-r from #1835
github-actions bot pushed a commit that referenced this pull request Oct 31, 2023
* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2

* Point at dev tiledb-r from #1835
johnkerl added a commit that referenced this pull request Oct 31, 2023
* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2

* Point at dev tiledb-r from #1835

Co-authored-by: John Kerl <kerl.john.r@gmail.com>
@johnkerl johnkerl self-requested a review November 1, 2023 14:22
@eddelbuettel eddelbuettel merged commit 9c18c49 into main Nov 1, 2023
13 checks passed
@eddelbuettel eddelbuettel deleted the de/sc-35945/enumeration branch November 1, 2023 15:20
github-actions bot pushed a commit that referenced this pull request Nov 1, 2023
* [r] Updated SOMADataFrame creation and writing

This requires tiledb-r version 0.21.2 ("to be") or 0.21.1.10 as of right now

* [r] Update two test files to reflect SOMADataFrame updates

* [r] *Temporarily* 'park' tests file affected by SeuratObject 5.0.0

* [r] Set 'UTF8' as the string type

* [r] Extend the type inference to dictionary index type

* [r] Following #1809 and rebase re-activeate four 'parked' tests
johnkerl pushed a commit that referenced this pull request Nov 1, 2023
* [r] Updated SOMADataFrame creation and writing

This requires tiledb-r version 0.21.2 ("to be") or 0.21.1.10 as of right now

* [r] Update two test files to reflect SOMADataFrame updates

* [r] *Temporarily* 'park' tests file affected by SeuratObject 5.0.0

* [r] Set 'UTF8' as the string type

* [r] Extend the type inference to dictionary index type

* [r] Following #1809 and rebase re-activeate four 'parked' tests

Co-authored-by: Dirk Eddelbuettel <edd@debian.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants