[r] Update `SOMADataFrame` creation and writes #1835

eddelbuettel · 2023-10-28T17:14:34Z

Issue and/or context:

TileDB Core 2.17.3 provides updated semantics for Enumerations (aka factor variables). The TileDB R package (in upcoming release 0.21.2 or the current snapshot 0.21.1.10) connects this to R, and this PR lets the TileDB SOMA package take advantage of it.

Changes:

Data frame objects are created with 'empty' enumerations, i.e. without factor levels. These are added as needed during writes.
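The "empty at creation, extended at write" behaviour can be illustrated conceptually in base R. This is only a sketch of the idea, not the TileDB or tiledbsoma implementation; `extend_levels()` and the sample labels are hypothetical names introduced here.

```r
# Conceptual sketch in base R only; this is NOT the tiledb / tiledbsoma API.
# extend_levels() is a hypothetical helper: known levels keep their position,
# previously unseen values are appended at the end.
extend_levels <- function(existing_levels, new_values) {
  union(existing_levels, unique(as.character(new_values)))
}

# At creation time the enumeration has no levels at all
levels_on_disk <- character(0)

# First write: the levels seen in this batch are added
batch1 <- c("B cell", "T cell", "B cell")
levels_on_disk <- extend_levels(levels_on_disk, batch1)

# Second write: only the new value is appended, existing positions are kept
batch2 <- c("T cell", "NK cell")
levels_on_disk <- extend_levels(levels_on_disk, batch2)

# Integer codes for the second batch, relative to the extended level set
codes2 <- match(batch2, levels_on_disk)
```

Appending rather than re-sorting keeps the integer codes written by earlier batches valid.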

Notes for Reviewer:

~~The PR is still in draft form as it needed to 'park' four test files affected by an unrelated update to SeuratObject.~~

The CI setup is once again modified to let it use 'newer than CRAN' package versions of TileDB R via r-universe.

We can lift the 'draft' status when either or both of these constraints can be lifted.

SC 35945

shortcut-integration · 2023-10-28T17:14:37Z

This pull request has been linked to Shortcut Story #35945: [TileDB-R] Incorporate core enumeration PRs for zero levels at create, and level extension at write.

eddelbuettel · 2023-10-28T17:24:58Z

CI fails because I ~~overlooked that TileDB R 0.21.1.10 does not automatically bring in~~ forgot to update the package fallback-pin to TileDB Core 2.17.3. This has been taken care of in 66118ca.

codecov-commenter · 2023-10-28T18:56:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

see 38 files with indirect coverage changes

johnkerl · 2023-10-30T21:31:44Z

apis/r/R/SOMADataFrame.R

+              tiledb::tiledb_array_schema_set_enumeration_empty(schema = tdb_schema,
+                                                                attr = tdb_attrs[[field_name]],
+                                                                enum_name = field_name,
+                                                                type_str = "UTF8",


While UTF-8 strings are indeed the nominal (and by far the most common) value type, enumeration value types are quite general. We don't have existing unit-test coverage for this in tiledbsoma-r to date, but I think this should also be `tiledb_type_from_arrow_type` here on the arrow-schema's value type, so `arrow::string` would map to `"UTF8"`, `arrow::float64` to `"FLOAT64"`, etc.

I am sorry, but are you asking for a function such as

TileDB-SOMA/apis/r/R/utils-arrow.R

Lines 43 to 91 in 501ec84

	tiledb_type_from_arrow_type <- function(x) {
	  stopifnot(is_arrow_data_type(x))
	  switch(x$name,
	    int8 = "INT8",
	    int16 = "INT16",
	    int32 = "INT32",
	    int64 = "INT64",
	    uint8 = "UINT8",
	    uint16 = "UINT16",
	    uint32 = "UINT32",
	    uint64 = "UINT64",
	    float32 = "FLOAT32",
	    float = "FLOAT32",
	    float64 = "FLOAT64",
	    # based on tiledb::r_to_tiledb_type()
	    double = "FLOAT64",
	    boolean = "BOOL",
	    bool = "BOOL",
	    # large_string = "large_string",
	    # binary = "binary",
	    # large_binary = "large_binary",
	    # fixed_size_binary = "fixed_size_binary",
	    # tiledb::r_to_tiledb_type() returns UTF8 for characters but they are
	    # not yet queryable so we use ASCII for now
	    utf8 = "ASCII",
	    string = "ASCII",
	    large_utf8 = "ASCII",
	    # date32 = "date32",
	    # date64 = "date64",
	    # time32 = "time32",
	    # time64 = "time64",
	    # null = "null",
	    # timestamp = "timestamp",
	    # decimal128 = "decimal128",
	    # decimal256 = "decimal256",
	    # struct = "struct",
	    # list_of = "list",
	    # list = "list",
	    # large_list_of = "large_list",
	    # large_list = "large_list",
	    # fixed_size_list_of = "fixed_size_list",
	    # fixed_size_list = "fixed_size_list",
	    # map_of = "map",
	    # duration = "duration",
	    dictionary = "INT32", # for a dictionary the 'values' are ints, levels are character
	    stop("Unsupported Arrow data type: ", x$name, call. = FALSE)
	  )
	}

?

@eddelbuettel yes

Arrow dictionary types have an index type (often but not necessarily int8) and a value type (very often but not necessarily string). Both can vary and both need to be looked up, and `tiledb_type_from_arrow_type` seems to be the best way to do that.
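A minimal sketch of such a two-sided lookup, assuming the arrow R package is installed. `arrow_name_to_tiledb()` and `describe_dictionary()` are hypothetical helpers written for this illustration; the mapping is a trimmed-down stand-in for the switch quoted above, and sending strings to `"UTF8"` is an assumption based on the `type_str = "UTF8"` in the reviewed diff.

```r
library(arrow)

# Hypothetical, trimmed-down lookup from an Arrow type's name to a TileDB
# type string; it only covers the handful of types used below.
arrow_name_to_tiledb <- function(type) {
  switch(type$name,
    int8 = "INT8", int16 = "INT16", int32 = "INT32", int64 = "INT64",
    float = "FLOAT32", double = "FLOAT64",
    utf8 = "UTF8", string = "UTF8", large_utf8 = "UTF8",  # assumption: UTF8
    stop("Unsupported Arrow data type: ", type$name, call. = FALSE)
  )
}

# Hypothetical helper: resolve BOTH sides of an Arrow dictionary type
describe_dictionary <- function(dict_type) {
  stopifnot(inherits(dict_type, "DictionaryType"))
  list(
    index = arrow_name_to_tiledb(dict_type$index_type),
    value = arrow_name_to_tiledb(dict_type$value_type)
  )
}

str_dict <- describe_dictionary(dictionary(int8(), utf8()))
dbl_dict <- describe_dictionary(dictionary(int16(), float64()))
```

Resolving the index and value types separately avoids hard-coding either the `INT32` index or the string value type.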

@eddelbuettel IIUC we're lacking support, and unit tests, for non-string enumeration-value types

@johnkerl Do you have an example data set with such factors?

Follow-up in #1846

cc @aaronwolen @mojaveazure @mlin @pablo-gar

As mentioned, I believe this to be due to the fact that your column titled enum is a factor and, as we saw, you cannot create a factor outside the 'int as index, string as levels' pair. So in the second example we cannot return a tibble, because a tibble would require a factor and we cannot instantiate one with those types as far as I understand it. At least I have not been able to create one.
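The 'int as index, string as levels' constraint can be seen in base R alone. The short sketch below is an illustration added here, not code from the PR: building a factor from doubles still yields integer codes and character levels, so the numeric values only come back through an explicit conversion.

```r
# Base R only: a factor stores integer codes plus character levels.
f <- factor(c(2.1, 3.2, 4.3))

code_storage <- typeof(f)          # storage type of the codes
level_class  <- class(levels(f))   # levels are coerced to character

# Recovering the original numeric values requires an explicit conversion
vals <- as.numeric(levels(f))[f]
```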

Even in straight up arrow (the R package) going away from int seems to have issues:

	> dictionary(int8(), utf8())
	DictionaryType
	dictionary<values=string, indices=int8>
	>
	> dictionary(float32(), utf8())
	Error: Type error: Dictionary index type should be integer, got float
	>
	> dictionary(float64(), utf8())
	Error: Type error: Dictionary index type should be integer, got double
	>

Double values work, but let me see if I can get them into R:

	> dictionary(int8(), float64())
	DictionaryType
	dictionary<values=double, indices=int8>
	>

I finally have an existence proof:

	> dictionary(int8(), float64())
	DictionaryType
	dictionary<values=double, indices=int8>
	> sch <- arrow::schema(arrow::field("ind", arrow::int8()), arrow::field("fct", arrow::dictionary(int8(), float64())))
	> sch
	Schema
	ind: int8
	fct: dictionary<values=double, indices=int8>
	> arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3))
	Table
	3 rows x 2 columns
	$ind <int32>
	$fct <double>
	> tibble::as_tibble(arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3)))
	# A tibble: 3 × 2
	    ind   fct
	  <int> <dbl>
	1     1   2.1
	2     2   3.2
	3     3   4.3
	>

But note that even here we lose the factor-ness. The second column is now a straight-up double with no labels:

	> str(tibble::as_tibble(arrow_table(ind=1:3, fct=c(2.1, 3.2, 4.3))))
	tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
	 $ ind: int [1:3] 1 2 3
	 $ fct: num [1:3] 2.1 3.2 4.3
	>

Thanks @eddelbuettel !

* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2
* Point at dev tiledb-r from #1835

Co-authored-by: John Kerl <kerl.john.r@gmail.com>

* [r] Updated SOMADataFrame creation and writing
  This requires tiledb-r version 0.21.2 ("to be") or 0.21.1.10 as of right now
* [r] Update two test files to reflect SOMADataFrame updates
* [r] *Temporarily* 'park' tests file affected by SeuratObject 5.0.0
* [r] Set 'UTF8' as the string type
* [r] Extend the type inference to dictionary index type
* [r] Following #1809 and rebase re-activeate four 'parked' tests

Co-authored-by: Dirk Eddelbuettel <edd@debian.org>

eddelbuettel requested review from aaronwolen, johnkerl, awenocur and mojaveazure October 28, 2023 17:14

johnkerl changed the title ~~[r] Update SOMADataFrame creation and writes~~ [r] Update SOMADataFrame creation and writes Oct 30, 2023

mojaveazure mentioned this pull request Oct 30, 2023

[r] Force exporting v3 assays with SeuratObject v5 installed #1809

Merged

eddelbuettel force-pushed the de/sc-35945/enumeration branch from b5dfa48 to f8a6c4e Compare October 30, 2023 21:22

johnkerl reviewed Oct 30, 2023

johnkerl added a commit that referenced this pull request Oct 31, 2023

Point at dev tiledb-r from #1835

416d480

johnkerl added a commit that referenced this pull request Oct 31, 2023

Point at dev tiledb-r from #1835

65ff77e

johnkerl added a commit that referenced this pull request Oct 31, 2023

[c++/python] Depend on core 2.17.3 and TileDB-Py 0.23.2 (#1838)

928bb14

* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2 * Point at dev tiledb-r from #1835

eddelbuettel added 6 commits October 31, 2023 10:33

[r] Updated SOMADataFrame creation and writing

5ab8596

This requires tiledb-r version 0.21.2 ("to be") or 0.21.1.10 as of right now

[r] Update two test files to reflect SOMADataFrame updates

a2424bf

[r] *Temporarily* 'park' tests file affected by SeuratObject 5.0.0

a49052f

[r] Set 'UTF8' as the string type

1f167ed

[r] Extend the type inference to dictionary index type

5a7b862

[r] Following #1809 and rebase re-activeate four 'parked' tests

d8b8882

eddelbuettel force-pushed the de/sc-35945/enumeration branch from b4567b8 to d8b8882 Compare October 31, 2023 15:36

nguyenv pushed a commit that referenced this pull request Oct 31, 2023

[c++/python] Depend on core 2.17.3 and TileDB-Py 0.23.2 (#1838)

bfdf20b

* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2 * Point at dev tiledb-r from #1835

github-actions bot pushed a commit that referenced this pull request Oct 31, 2023

[c++/python] Depend on core 2.17.3 and TileDB-Py 0.23.2 (#1838)

37bbbfd

* [c++/python] Depends on core 2.17.3 and TileDB-Py 0.23.2 * Point at dev tiledb-r from #1835

johnkerl added the backport release-1.5 label Oct 31, 2023

eddelbuettel mentioned this pull request Oct 31, 2023

[r] Write string attrs as UTF-8 (Python compatibility) #1843

Merged

johnkerl mentioned this pull request Nov 1, 2023

[r] Consider a more robust way to handle R rendering of non-string enum values #1846

Open

johnkerl approved these changes Nov 1, 2023

johnkerl self-requested a review November 1, 2023 14:22

johnkerl approved these changes Nov 1, 2023

eddelbuettel merged commit 9c18c49 into main Nov 1, 2023
13 checks passed

eddelbuettel deleted the de/sc-35945/enumeration branch November 1, 2023 15:20

github-actions bot mentioned this pull request Nov 1, 2023

[Backport release-1.5] [r] Update SOMADataFrame creation and writes #1847

Merged
