
Out of core concatenation support #955

Merged
141 commits merged into scverse:main on Jul 21, 2023

Conversation

selmanozleyen
Member

@selmanozleyen selmanozleyen commented Mar 17, 2023

Hi,

Now that I have an idea of how this might work, I decided that copying the concat function's signature almost 1-to-1 was the best option for me. This way I could get familiar with the existing code more easily using unit tests.

Signature

def concat_on_disk(
	in_files: Union[Collection[str], typing.MutableMapping],
	out_file: Union[str, typing.MutableMapping],
	overwrite: bool = False,
	*,
	axis: Literal[0, 1] = 0,
	join: Literal["inner", "outer"] = "inner",
	merge: Union[StrategiesLiteral, Callable, None] = None,
	uns_merge: Union[StrategiesLiteral, Callable, None] = None,
	label: Optional[str] = None,
	keys: Optional[Collection] = None,
	index_unique: Optional[str] = None,
	fill_value: Optional[Any] = None,
	pairwise: bool = False,
):

"""Concatenates multiple AnnData objects along a specified axis using their
corresponding stores or paths, and writes the resulting AnnData object
to a target location on disk.

Unlike the `concat` function, this method does not require
loading the input AnnData objects into memory,
making it a memory-efficient alternative for large datasets.
The resulting object written to disk should be equivalent
to the concatenation of the loaded AnnData objects using
the `concat` function."""

Some notes and doubts on the signature decision:

  • I made the file-related parameters positional to make the signature somewhat easier to read.
  • Regarding the above item, I don't know whether that will turn out to be a good or a bad thing in the future.
  • I copied all the parameters of the original function; I figured that even if we don't support every variation, we can raise an error for the unsupported ones.

Functionality

Nothing yet. I intend to first implement the case where the format is zarr and the other parameters are the defaults.

Unit Tests for Equivalence to Concat

I believe I added all the tests needed to ensure that concat and concat_on_disk give equivalent results. From the unit tests of concat, I took all the adatas given to that function, wrote them to disk, and passed them to concat_on_disk with the same arguments (except the filenames), then called assert_equal on both results.
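The test strategy above can be sketched with a toy analogue (plain Python lists and JSON files standing in for AnnData objects and zarr stores; all names here are hypothetical, not anndata's API): write the inputs to disk, run the in-memory and on-disk variants with the same arguments, and assert the results are equal.

```python
import json
import tempfile
from pathlib import Path


def concat(lists):
    """In-memory concatenation (stands in for anndata.concat)."""
    out = []
    for lst in lists:
        out.extend(lst)
    return out


def concat_on_disk(in_files, out_file):
    """Stream each input file and append to the output, so only one
    input is held in memory at a time (stands in for concat_on_disk)."""
    result = []
    for path in in_files:
        with open(path) as f:
            result.extend(json.load(f))
    with open(out_file, "w") as f:
        json.dump(result, f)


def assert_equivalent(inputs):
    """The equivalence-test pattern: write inputs, run both, compare."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, lst in enumerate(inputs):
            p = Path(tmp) / f"in_{i}.json"
            p.write_text(json.dumps(lst))
            paths.append(p)
        out = Path(tmp) / "out.json"
        concat_on_disk(paths, out)
        assert json.loads(out.read_text()) == concat(inputs)


assert_equivalent([[1, 2], [3], [4, 5]])
```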

Unit Tests for Memory and Disk stuff

Not done yet, and probably should be (memory leaks, filename handling, etc.).

@ivirshup

@selmanozleyen selmanozleyen marked this pull request as draft March 17, 2023 10:06
@codecov

codecov bot commented Mar 17, 2023

Codecov Report

Merging #955 (c0408bb) into main (dc793fe) will increase coverage by 0.17%.
The diff coverage is 88.62%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #955      +/-   ##
==========================================
+ Coverage   84.13%   84.30%   +0.17%     
==========================================
  Files          33       35       +2     
  Lines        4733     4932     +199     
==========================================
+ Hits         3982     4158     +176     
- Misses        751      774      +23     
Impacted Files Coverage Δ
anndata/tests/helpers.py 96.01% <ø> (ø)
anndata/experimental/merge.py 87.56% <87.56%> (ø)
anndata/_core/merge.py 93.26% <100.00%> (+0.01%) ⬆️
anndata/_io/specs/methods.py 87.81% <100.00%> (+0.03%) ⬆️
anndata/experimental/__init__.py 100.00% <100.00%> (ø)
anndata/experimental/_dispatch_io.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

@ivirshup ivirshup left a comment
Member

Looks like a good start! I've made a few requests for changes. A few more points:

read_groups

The biggest one is probably the request to use read_dispatched instead of adding new methods to the registry class. However, I'm not sure I understand what exactly the read_groups method is supposed to do. Could you explain that?

Commented + unused code

Could you do a pass and try to remove commented out code and unused code paths that you're no longer using? I think it would increase the readability here quite a bit.

Expanding tests

This would be the other big point. Do you think you could expand the test suite to cover more cases? I've given some suggestions in my comments.

Resolved review comments on:
  • anndata/_io/merge.py (2 threads)
  • anndata/_io/specs/registry.py
  • anndata/tests/test_concatenate_disk.py
@selmanozleyen
Member Author

@ivirshup ah, very sorry. I was doing the work without the reindexing on a different branch, because this one got very messy.

@ivirshup ivirshup left a comment
Member
Looking good! Mostly minor.

Have you done any benchmarking with this? E.g. showing lower peak memory?

Resolved review comments on:
  • anndata/tests/test_concatenate_disk.py
  • anndata/experimental/merge.py
Comment on lines +142 to +143:

    elif iospec.encoding_type in EAGER_TYPES:
        return read_elem(elem)
Member
Does removing this case do anything? If so, is this block special casing nested types?

Member Author
It reads the dataframes, so it would. What do you mean by nested types? I have a special case for dict. Could you give some examples?

Member Author
I am not sure I understand read_dispatched very well. I replaced read_elem(elem) with func(elem) and it gives errors; shouldn't they be the same? I will look into it once I have other things figured out.

Member Author
Update: I checked and the documentation also uses it this way: https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread%2Cwrite%7D_dispatched.html.
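For readers unfamiliar with the pattern being discussed, here is a minimal toy sketch of the read_dispatched callback contract (the store and all names here are made up for illustration; anndata's real callback additionally receives the element name and an iospec, and operates on h5py/zarr groups): the callback gets the default reader as its first argument and can either handle an element itself or delegate to the default, which is why read_elem(elem) and func(elem) are expected to behave the same for simple cases.

```python
# Toy "store": every element is a dict with an encoding type and a value.
EAGER_TYPES = {"dataframe", "categorical"}


def default_read(elem):
    """Default reader, standing in for anndata's read_elem."""
    return elem["value"]


def read_dispatched(elem, callback):
    """Hand the default reader to the callback and let it decide."""
    return callback(default_read, elem)


def my_callback(func, elem):
    if elem["encoding-type"] in EAGER_TYPES:
        return func(elem)      # delegate: read eagerly into memory
    return ("lazy", elem)      # otherwise keep a backed placeholder


df_elem = {"encoding-type": "dataframe", "value": [1, 2]}
arr_elem = {"encoding-type": "array", "value": [3, 4]}
```

Under this contract, calling the default reader directly and calling the passed-in func should indeed be interchangeable, since func is the default reader.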

Resolved review comment on anndata/experimental/merge.py
    elems = _gen_slice_to_append(
        datasets, reindexers, max_loaded_sparse_elems, axis, fill_value
    )
    init_elem = (csr_matrix, csc_matrix)[axis](next(elems))
Member
Does this load the first element into memory if it's not already?

Member Author
Yes, but they will be in memory anyway; the problem is that when I take a slice of the sparse dataset (i.e., read it partially), it sometimes comes back in a different sparse format, even though the SparseDataset itself has a fixed format. I already check that the SparseDatasets are in the correct format and that they all have the same format. You can see this here:

def write_concat_sparse(
    datasets: Sequence[SparseDataset],
    output_group: Union[ZarrGroup, H5Group],
    output_path: Union[ZarrGroup, H5Group],
    max_loaded_sparse_elems: int,
    axis: Literal[0, 1] = 0,
    reindexers: Reindexer = None,
    fill_value=None,
):
    elems = _gen_slice_to_append(
        datasets, reindexers, max_loaded_sparse_elems, axis, fill_value
    )
    init_elem = next(elems)
    write_elem(output_group, output_path, init_elem)
    del init_elem
    out_dataset: SparseDataset = read_as_backed(output_group[output_path])
    for temp_elem in elems:
        out_dataset.append(temp_elem)
        del temp_elem

I am still not sure why this happens; I just used that as a workaround.
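The workaround described here amounts to normalizing each chunk's sparse format before appending. A minimal sketch of that idea (assuming scipy is available; append_chunks is a hypothetical helper for illustration, not the PR's code):

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def append_chunks(chunks, fmt="csr"):
    """Concatenate row chunks, coercing each to `fmt` first so a chunk
    that arrives in a different sparse format cannot break the append."""
    normalized = [chunk.asformat(fmt) for chunk in chunks]
    return vstack(normalized).asformat(fmt)


a = csr_matrix(np.eye(2))
b = csr_matrix(np.ones((1, 2)))
out = append_chunks([a, b.tocoo()])  # mixed formats on input
```

This mirrors the shape of write_concat_sparse above: take the first chunk to initialize the output in the expected format, then append the remaining chunks one at a time.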

Review comments on:
  • anndata/experimental/merge.py (3 threads, resolved)
  • anndata/experimental/__init__.py (1 thread)
@selmanozleyen
Member Author

@ivirshup

Looking good! Mostly minor.

Have you done any benchmarking with this? E.g. showing lower peak memory?

Yes, I just updated the branch concat-on-disk-benchmark.

I am trying to set up a memory-usage limit for dask, but it is currently filling my tmp directory.

@ivirshup
Member

Yes I just updated the branch concat-on-disk-benchmark

I'm not sure I see what you're talking about here on that branch.

@flying-sheep flying-sheep left a comment
Member
Love it! I looked at its surface (docs, API); there are a few minor improvements in typing and docs I'd like to see, and one possible change in the API. Maybe 15 minutes of work.

Review comments on anndata/experimental/merge.py (9 threads, 8 resolved)
flying-sheep and others added 6 commits July 20, 2023 11:20
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Philipp A. <flying-sheep@web.de>
@flying-sheep flying-sheep left a comment
Member
perfect!

@flying-sheep flying-sheep merged commit bd47cf9 into scverse:main Jul 21, 2023
10 checks passed