Out of core concatenation support #955
Conversation
Codecov Report
Additional details and impacted files

@@ Coverage Diff @@
##             main     #955      +/-   ##
==========================================
+ Coverage   84.13%   84.30%   +0.17%
==========================================
  Files          33       35       +2
  Lines        4733     4932     +199
==========================================
+ Hits         3982     4158     +176
- Misses        751      774      +23
for more information, see https://pre-commit.ci
…nto concat-on-disk
This reverts commit 41f6369.
Looks like a good start! I've made a few requests for changes. A few more points:
read_groups
The biggest one is probably the request to use read_dispatched
instead of adding new methods to the registry class. However, I'm not sure I understand what exactly the read_groups
method is supposed to do. Could you explain that?
Commented + unused code
Could you do a pass and try to remove commented out code and unused code paths that you're no longer using? I think it would increase the readability here quite a bit.
Expanding tests
This would be the other big point. Do you think you could expand the test suite to cover more cases? I've given some suggestions in my comments.
@ivirshup ah, very sorry, I was doing the work without the reindexing on a different branch because this got very messy.
Looking good! Mostly minor.
Have you done any benchmarking with this? E.g. showing lower peak memory?
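For a quick, dependency-free way to answer this kind of question, Python's stdlib tracemalloc can report the peak memory allocated during a call. This is a generic sketch, not part of the PR; note it only tracks Python-level allocations, not buffers allocated inside C extensions like HDF5.

```python
import tracemalloc

def run_with_peak(fn, *args, **kwargs):
    """Run fn and report the peak Python memory allocated during the call."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# e.g. compare the peak of an in-memory concat vs. the on-disk variant;
# here a toy allocation stands in for the real workload
data, peak = run_with_peak(lambda: [0] * 1_000_000)
print(f"peak ~ {peak / 1e6:.1f} MB")
```

For the real benchmark one would wrap the in-memory and on-disk concat calls with the same helper and compare the two peaks.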
elif iospec.encoding_type in EAGER_TYPES:
    return read_elem(elem)
Does removing this case do anything? If so, is this block special casing nested types?
It reads the dataframes, so it would. What do you mean by nested types? I have a special case for dict. Could you give some examples?
I am not sure I understand read_dispatched very well. I replaced read_elem(elem)
with func(elem)
and it gives errors; shouldn't they be the same? I will look into it once I have other things figured out.
Update: I checked and the documentation also uses it this way: https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread%2Cwrite%7D_dispatched.html.
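A pure-Python toy model of the dispatch pattern may help clarify why func(elem) and read_elem(elem) are not interchangeable (this is a hypothetical sketch, not anndata's actual code; the real callback signature is callback(func, elem_name, elem, iospec)): the func handed to the callback re-enters the dispatcher, so the callback also fires for nested elements, while read_elem reads children without invoking it.

```python
# Toy model of the dispatch pattern (hypothetical, not anndata's real code).
# Elements are nested dicts; leaves are plain values.

def read_elem(elem):
    # Plain reader: descends into children WITHOUT invoking any callback.
    if isinstance(elem, dict):
        return {k: read_elem(v) for k, v in elem.items()}
    return elem

def read_dispatched(elem, callback):
    # Routes the element through the callback; the `func` it hands over
    # re-enters the dispatcher, so the callback also fires for children.
    def func(e):
        if isinstance(e, dict):
            return {k: read_dispatched(v, callback) for k, v in e.items()}
        return e
    return callback(func, elem)

seen = []
def callback(func, elem):
    seen.append(type(elem).__name__)
    # `func(elem)` keeps dispatching children through this callback;
    # returning `read_elem(elem)` here would read them silently instead.
    return func(elem)

store = {"obs": {"a": 1}, "X": 2}
result = read_dispatched(store, callback)
```

In this toy, `seen` ends up recording every nested element, not just the root, which is the behavioral difference between the two calls.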
anndata/experimental/merge.py
Outdated
elems = _gen_slice_to_append(
    datasets, reindexers, max_loaded_sparse_elems, axis, fill_value
)
init_elem = (csr_matrix, csc_matrix)[axis](next(elems))
Does this load the first element into memory if it's not already?
Yes, but they will be in memory anyway; the problem is that when I take a slice of the sparse dataset (i.e., read it partially), it is sometimes returned in a different sparse format even though the SparseDataset itself has a fixed format. I already check that the SparseDatasets are in the correct format and that they all have the same format. You can see this here:
def write_concat_sparse(
    datasets: Sequence[SparseDataset],
    output_group: Union[ZarrGroup, H5Group],
    output_path: Union[ZarrGroup, H5Group],
    max_loaded_sparse_elems: int,
    axis: Literal[0, 1] = 0,
    reindexers: Reindexer = None,
    fill_value=None,
):
    elems = _gen_slice_to_append(
        datasets, reindexers, max_loaded_sparse_elems, axis, fill_value
    )
    init_elem = next(elems)
    write_elem(output_group, output_path, init_elem)
    del init_elem
    out_dataset: SparseDataset = read_as_backed(output_group[output_path])
    for temp_elem in elems:
        out_dataset.append(temp_elem)
        del temp_elem
I am still not sure why this happens; I just used that as a workaround.
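A small scipy-only illustration (hypothetical data, not the PR's code) of the coercion trick from the earlier diff hunk: regardless of what format a partial read hands back, each slice is forced into the one format matching the concatenation axis before appending.

```python
from scipy.sparse import csr_matrix, csc_matrix, random

axis = 0  # concatenating along rows -> the output should stay CSR
# Pretend this COO matrix came back from a partial read of a SparseDataset
slice_ = random(10, 5, density=0.3, format="coo", random_state=0)
# Coerce to the format the output dataset expects before appending
coerced = (csr_matrix, csc_matrix)[axis](slice_)
assert coerced.format == ("csr", "csc")[axis]
```

Converting every slice up front this way makes the append path insensitive to whatever format the partial read happens to return.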
Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
Yes, I just updated the branch concat-on-disk-benchmark. I am trying to configure a memory usage limit for dask, but it is currently filling my tmp directory.
I'm not sure I see what you're talking about here on that branch.
Love it! I looked at its surface (docs, API); there are a few minor improvements in typing and docs I'd like to see, and one possible change in the API. Maybe 15 minutes of work.
Co-authored-by: Philipp A. <flying-sheep@web.de>
perfect!
Hi,
Now that I have an idea how this might work, I decided that copying the concat function's signature almost 1-to-1 was the best option for me. This way I could get familiar with the existing code more easily using unit tests.

Signature
Some notes and doubts on the signature decision:

Functionality
Nothing yet. I intend to implement the case where the format is zarr and the other parameters are the defaults first.

Unit Tests for Equivalence to Concat
I believe I added all the tests that should ensure concat and concat_on_disk give equivalent results. From the unit tests of concat, I took all the adatas that are given to that function, wrote them to disk, and gave them to concat_on_disk with the same arguments except the filenames, then called assert_equal on both results.

Unit Tests for Memory and Disk stuff
Not done yet and probably should be done. (Memory leaks, filenames, etc.)
@ivirshup
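The equivalence-testing strategy described above can be sketched with stand-ins (toy concat/concat_on_disk functions over JSON files, not anndata's real API): write the same inputs to disk, run the on-disk variant with the same arguments modulo filenames, and compare against the in-memory result.

```python
import json
import os
import tempfile

def concat(dicts):
    """Toy in-memory concat: extend lists under each key (stand-in for ad.concat)."""
    out = {}
    for d in dicts:
        for k, v in d.items():
            out.setdefault(k, []).extend(v)
    return out

def concat_on_disk(paths, out_path):
    """Toy on-disk analogue: same idea, but filenames in and a filename out."""
    loaded = []
    for p in paths:
        with open(p) as f:
            loaded.append(json.load(f))
    with open(out_path, "w") as f:
        json.dump(concat(loaded), f)

# Equivalence check mirroring the strategy above: write inputs to disk,
# run the on-disk variant, and compare (assert_equal-style) with memory.
with tempfile.TemporaryDirectory() as tmp:
    inputs = [{"x": [1, 2]}, {"x": [3]}]
    paths = []
    for i, d in enumerate(inputs):
        p = os.path.join(tmp, f"in{i}.json")
        with open(p, "w") as f:
            json.dump(d, f)
        paths.append(p)
    out_path = os.path.join(tmp, "out.json")
    concat_on_disk(paths, out_path)
    with open(out_path) as f:
        from_disk = json.load(f)
    in_memory = concat(inputs)
assert from_disk == in_memory
```

The real tests would swap in AnnData objects, zarr/h5ad files, and anndata's assert_equal, but the parametrization shape stays the same.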