
Out of core concatenation support #955

Merged
141 commits merged into scverse:main on Jul 21, 2023

Conversation

selmanozleyen
Member

@selmanozleyen selmanozleyen commented Mar 17, 2023

Hi,

Now that I have an idea of how this might work, I decided that copying the concat function's signature almost 1-to-1 was the best option for me. This way I could get familiar with the existing code more easily using unit tests.

Signature

def concat_on_disk(
	in_files: Union[Collection[str], typing.MutableMapping],
	out_file: Union[str, typing.MutableMapping],
	overwrite: bool = False,
	*,
	axis: Literal[0, 1] = 0,
	join: Literal["inner", "outer"] = "inner",
	merge: Union[StrategiesLiteral, Callable, None] = None,
	uns_merge: Union[StrategiesLiteral, Callable, None] = None,
	label: Optional[str] = None,
	keys: Optional[Collection] = None,
	index_unique: Optional[str] = None,
	fill_value: Optional[Any] = None,
	pairwise: bool = False,
):

"""Concatenates multiple AnnData objects along a specified axis using their
corresponding stores or paths, and writes the resulting AnnData object
to a target location on disk.

Unlike the `concat` function, this method does not require
loading the input AnnData objects into memory,
making it a memory-efficient alternative for large datasets.
The resulting object written to disk should be equivalent
to the concatenation of the loaded AnnData objects using
the `concat` function."""

Some notes and doubts on the signature decision:

  • I made the file-related parameters positional to make the signature somewhat easier to read.
  • Regarding the above item, I don't know whether that will turn out to be a good or a bad thing in the future.
  • I copied all the parameters of the original function; I figured that even if we don't support every variation, we can raise an error for the unsupported ones.

Functionality

Nothing yet. I intend to first implement the case where the format is zarr and the other parameters are the defaults.

Unit Tests for Equivalence to Concat

I believe I added all the tests needed to ensure that concat and concat_on_disk give equivalent results. From the unit tests of concat, I took all the adatas given to that function, wrote them to disk, and passed them to concat_on_disk with the same arguments (except the filenames), then called assert_equal on both results.
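The test strategy above can be sketched with a toy analogue (plain Python lists and JSON files standing in for AnnData objects and zarr stores; all names here are hypothetical, not anndata's API): write the inputs to disk, run the in-memory and on-disk variants with the same arguments, and assert the results are equal.

```python
import json
import tempfile
from pathlib import Path


def concat(lists):
    """In-memory concatenation (stands in for anndata.concat)."""
    out = []
    for lst in lists:
        out.extend(lst)
    return out


def concat_on_disk(in_files, out_file):
    """Stream each input file and append to the output, so only one
    input is held in memory at a time (stands in for concat_on_disk)."""
    result = []
    for path in in_files:
        with open(path) as f:
            result.extend(json.load(f))
    with open(out_file, "w") as f:
        json.dump(result, f)


def assert_equivalent(inputs):
    """The equivalence-test pattern: write inputs, run both, compare."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, lst in enumerate(inputs):
            p = Path(tmp) / f"in_{i}.json"
            p.write_text(json.dumps(lst))
            paths.append(p)
        out = Path(tmp) / "out.json"
        concat_on_disk(paths, out)
        assert json.loads(out.read_text()) == concat(inputs)


assert_equivalent([[1, 2], [3], [4, 5]])
```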

Unit Tests for Memory and Disk stuff

Not done yet, and probably should be (memory leaks, filename handling, etc.).

@ivirshup

@selmanozleyen selmanozleyen marked this pull request as draft March 17, 2023 10:06
@codecov

codecov bot commented Mar 17, 2023

Codecov Report

Merging #955 (c0408bb) into main (dc793fe) will increase coverage by 0.17%.
The diff coverage is 88.62%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #955      +/-   ##
==========================================
+ Coverage   84.13%   84.30%   +0.17%     
==========================================
  Files          33       35       +2     
  Lines        4733     4932     +199     
==========================================
+ Hits         3982     4158     +176     
- Misses        751      774      +23     
Impacted Files Coverage Δ
anndata/tests/helpers.py 96.01% <ø> (ø)
anndata/experimental/merge.py 87.56% <87.56%> (ø)
anndata/_core/merge.py 93.26% <100.00%> (+0.01%) ⬆️
anndata/_io/specs/methods.py 87.81% <100.00%> (+0.03%) ⬆️
anndata/experimental/__init__.py 100.00% <100.00%> (ø)
anndata/experimental/_dispatch_io.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

@ivirshup ivirshup left a comment
Member

Looks like a good start! I've made a few requests for changes. A few more points:

read_groups

The biggest one is probably the request to use read_dispatched instead of adding new methods to the registry class. However, I'm not sure I understand what exactly the read_groups method is supposed to do. Could you explain that?

Commented + unused code

Could you do a pass and try to remove commented out code and unused code paths that you're no longer using? I think it would increase the readability here quite a bit.

Expanding tests

This would be the other big point. Do you think you could expand the test suite to cover more cases? I've given some suggestions in my comments.

Resolved review comments on:
  • anndata/_io/merge.py (2 threads)
  • anndata/_io/specs/registry.py
  • anndata/tests/test_concatenate_disk.py
@selmanozleyen
Member Author

@ivirshup ah, very sorry. I was doing the work without the reindexing on a different branch, because this one got very messy.

@ivirshup ivirshup left a comment
Member
Looking good! Mostly minor.

Have you done any benchmarking with this? E.g. showing lower peak memory?

Resolved review comments on:
  • anndata/tests/test_concatenate_disk.py
  • anndata/experimental/merge.py
Comment on lines +142 to +143:

    elif iospec.encoding_type in EAGER_TYPES:
        return read_elem(elem)
Member
Does removing this case do anything? If so, is this block special casing nested types?

Member Author
It reads the dataframes, so it would. What do you mean by nested types? I have a special case for dict. Could you give some examples?

Member Author
I am not sure I understand read_dispatched very well. I replaced read_elem(elem) with func(elem) and it gives errors; shouldn't they be the same? I will look into it once I have other things figured out.

Member Author
Update: I checked and the documentation also uses it this way: https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread%2Cwrite%7D_dispatched.html.
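For readers unfamiliar with the pattern being discussed, here is a minimal toy sketch of the read_dispatched callback contract (the store and all names here are made up for illustration; anndata's real callback additionally receives the element name and an iospec, and operates on h5py/zarr groups): the callback gets the default reader as its first argument and can either handle an element itself or delegate to the default, which is why read_elem(elem) and func(elem) are expected to behave the same for simple cases.

```python
# Toy "store": every element is a dict with an encoding type and a value.
EAGER_TYPES = {"dataframe", "categorical"}


def default_read(elem):
    """Default reader, standing in for anndata's read_elem."""
    return elem["value"]


def read_dispatched(elem, callback):
    """Hand the default reader to the callback and let it decide."""
    return callback(default_read, elem)


def my_callback(func, elem):
    if elem["encoding-type"] in EAGER_TYPES:
        return func(elem)      # delegate: read eagerly into memory
    return ("lazy", elem)      # otherwise keep a backed placeholder


df_elem = {"encoding-type": "dataframe", "value": [1, 2]}
arr_elem = {"encoding-type": "array", "value": [3, 4]}
```

Under this contract, calling the default reader directly and calling the passed-in func should indeed be interchangeable, since func is the default reader.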

Resolved review comment on anndata/experimental/merge.py
    elems = _gen_slice_to_append(
        datasets, reindexers, max_loaded_sparse_elems, axis, fill_value
    )
    init_elem = (csr_matrix, csc_matrix)[axis](next(elems))
Member
Does this load the first element into memory if it's not already?

Member Author
Yes, but they will be in memory anyway; the problem is that when I take a slice of the sparse dataset (i.e., read it partially), it sometimes comes back in a different sparse format, even though the SparseDataset itself has a fixed format. I already check that the SparseDatasets are in the correct format and that they all have the same format. You can see this here:

def write_concat_sparse(
    datasets: Sequence[SparseDataset],
    output_group: Union[ZarrGroup, H5Group],
    output_path: Union[ZarrGroup, H5Group],
    max_loaded_sparse_elems: int,
    axis: Literal[0, 1] = 0,
    reindexers: Reindexer = None,
    fill_value=None,
):
    elems = _gen_slice_to_append(
        datasets, reindexers, max_loaded_sparse_elems, axis, fill_value
    )
    init_elem = next(elems)
    write_elem(output_group, output_path, init_elem)
    del init_elem
    out_dataset: SparseDataset = read_as_backed(output_group[output_path])
    for temp_elem in elems:
        out_dataset.append(temp_elem)
        del temp_elem

I am still not sure why this happens; I just used that as a workaround.
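The workaround described here amounts to normalizing each chunk's sparse format before appending. A minimal sketch of that idea (assuming scipy is available; append_chunks is a hypothetical helper for illustration, not the PR's code):

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def append_chunks(chunks, fmt="csr"):
    """Concatenate row chunks, coercing each to `fmt` first so a chunk
    that arrives in a different sparse format cannot break the append."""
    normalized = [chunk.asformat(fmt) for chunk in chunks]
    return vstack(normalized).asformat(fmt)


a = csr_matrix(np.eye(2))
b = csr_matrix(np.ones((1, 2)))
out = append_chunks([a, b.tocoo()])  # mixed formats on input
```

This mirrors the shape of write_concat_sparse above: take the first chunk to initialize the output in the expected format, then append the remaining chunks one at a time.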

Review comments on:
  • anndata/experimental/merge.py (3 threads, resolved)
  • anndata/experimental/__init__.py (1 thread)
@selmanozleyen
Member Author

@ivirshup

Looking good! Mostly minor.

Have you done any benchmarking with this? E.g. showing lower peak memory?

Yes, I just updated the branch concat-on-disk-benchmark.

I am trying to set up a memory-usage limit for dask, but it is currently filling my tmp directory.

@ivirshup
Member

Yes I just updated the branch concat-on-disk-benchmark

I'm not sure I see what you're talking about here on that branch.

@flying-sheep flying-sheep left a comment
Member
Love it! I looked at its surface (docs, API); there are a few minor improvements in typing and docs I'd like to see, and one possible change in the API. Maybe 15 minutes of work.

Review comments on anndata/experimental/merge.py (9 threads, 8 resolved)
flying-sheep and others added 6 commits July 20, 2023 11:20
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Philipp A. <flying-sheep@web.de>
@flying-sheep flying-sheep left a comment
Member
perfect!

@flying-sheep flying-sheep merged commit bd47cf9 into scverse:main Jul 21, 2023
10 checks passed