[WIP] Add memmap capability for binary data element values #1267
base: main
Conversation
* read-only so far
* begin editing numpy_handler
* add "TB" to size_in_bytes units
Hello @darcymason! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
Codecov Report
```
@@            Coverage Diff            @@
##           master    #1267     +/-  ##
=========================================
- Coverage   95.22%   95.19%   -0.03%
=========================================
  Files          60       60
  Lines        9501     9527      +26
=========================================
+ Hits         9047     9069      +22
- Misses        454      458       +4
```
Continue to review full report at Codecov.
Does it have to be a numpy memmap? What about Python's mmap?
I got scared off by this (the mmap docs' requirement that offset be a multiple of ALLOCATIONGRANULARITY), and pagesize is I think 4096 by default. For us, the offset to the data can be completely arbitrary. Also, it seemed easier when returning pixel_array, to just adjust the dtype and shape of an existing memmap (for non-compressed pixels), because memmap objects act just like an ndarray; even …
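(For illustration, a minimal sketch of the kind of memmap adjustment described above; the file name, offset, and dimensions are placeholders, not values from the PR.)

```python
import numpy as np

# Hypothetical values for illustration - in the PR these would come from the
# parsed dataset (Rows/Columns/BitsAllocated) and the byte offset of PixelData.
filename = "large_image.dcm"
pixel_data_offset = 0x0A2C   # arbitrary byte offset; np.memmap page-aligns internally
rows, cols = 4096, 4096

# The memmap behaves like a read-only ndarray backed directly by the file.
arr = np.memmap(filename, dtype=np.uint16, mode="r",
                offset=pixel_data_offset, shape=(rows, cols))
print(arr[0, :10])           # only the pages actually touched get read
```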
Hmm, fair enough. You've beat me to it, but my thinking was going to be memmapping the whole file instead, similar to the RawDataElement concept. I'd really like the frame handling to not require numpy, but I suppose it's not super important.
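(A rough sketch of what per-frame access from a whole-file map could look like without numpy; the offsets and sizes are placeholders, not anything from this PR.)

```python
from mmap import mmap, ACCESS_READ

# Hypothetical values for illustration - in practice these would come from
# parsing the dataset (offset of the PixelData value, Rows * Columns * bytes/pixel).
pixel_data_offset = 0x0A2C
frame_size = 512 * 512 * 2
frame_index = 3

with open("large_multiframe.dcm", "rb") as f:
    mm = mmap(f.fileno(), 0, access=ACCESS_READ)
    start = pixel_data_offset + frame_index * frame_size
    frame_bytes = mm[start:start + frame_size]   # plain bytes; only this range is paged in
```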
Hmmm, hadn't thought of that, and I'm not clear how that would work - do you mean still parse through the data elements, and memmap just the values? If so, it seems like a performance hit because it would effectively read the file twice, since skipping bytes isn't any faster than reading them into memory (smaller values are almost certainly already completely in memory, due to buffering in file reading).

Also, many elements take memory after conversion - e.g. all text items because of decoding them to Python (unicode) strings, IS/DS to Python number types, multivalues to be split, etc. My thought here is that pixel data (uncompressed) should be able to stay in the file, but not too many values can.

Also in this PR, the file handle is not obtained until a large data element is actually accessed - if keeping file handles for all files, then we might run into too many open handles, as people have seen in the past.
Don't they eventually have to be a numpy array anyway, if the user is to do any manipulation? Maybe I'm thinking of frames only in terms of pixel data; are there others? I'm happy to hear your thoughts, because this looks like it could get quite complex; better to nail down the issues as early on as possible.
Those are all good points, using numpy seems like the way to go.
```
@@ -1008,22 +1021,41 @@ def read_deferred_data_element(fileobj_type, filename_or_obj, timestamp,
                  if is_filename else filename_or_obj)
    is_implicit_VR = raw_data_elem.is_implicit_VR
    is_little_endian = raw_data_elem.is_little_endian

    if config._memmap_size and raw_data_elem.length > config._memmap_size:
```
Should also check for binary VR here as above (or move the check into a variable).
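(One possible shape for that suggestion, with a hypothetical set of binary VRs; the actual check used elsewhere in the PR may differ.)

```python
# Sketch only - illustrates pulling the repeated condition into one variable
# and including a (hypothetical) binary-VR check.
BINARY_VRS = {"OB", "OD", "OF", "OL", "OV", "OW", "UN"}

use_memmap = (
    config._memmap_size
    and raw_data_elem.length > config._memmap_size
    and raw_data_elem.VR in BINARY_VRS
)
if use_memmap:
    ...  # take the memmap path
```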
```
        raise ImportError(
            "Numpy required for memmap'd values, but numpy is missing"
        )
    mode = 'r' if config._memmap_read_only else 'r+'
```
I was already asking here why not use the Python module, and why not map the whole file - but now noticed that @scaramallion had asked exactly the same :) Makes sense to me, too.
```
_memmap_read_only: bool = True


def memmap_size(size: Union[int, str]) -> None:
    """Set Data element value lenght beyond which values are memmap'd
```
Typo: lenght
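(For context, a hedged sketch of how the new config call might be used; the accepted string forms are an assumption based on the size_in_bytes units mentioned in the commits, and the threshold and file name are illustrative only.)

```python
from pydicom import config, dcmread

# Assumed usage of the memmap_size() config function added in this PR.
config.memmap_size("500 MB")    # element values larger than this stay on disk
ds = dcmread("huge_file.dcm")
```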
"original {1:s}".format(data_elem.VR, | ||
raw_data_elem.VR)) | ||
"original {1:s}".format(data_elem.VR, | ||
raw_data_elem.VR)) |
Seems like your editor (or black?) is messing up the indentation.
Okay, since you were both thinking this, it made me question whether I'm thinking about this correctly. And looking again, perhaps Python's mmap … Looking into numpy's memmap … But one question - since dcmread accepts file-like objects, the whole file can apparently just be mapped and passed in:

```python
from mmap import mmap, ACCESS_READ
from pydicom.data import get_testdata_file
from pydicom import dcmread

f = open(get_testdata_file("RG3_UNCR.dcm"), 'rb')
mm = mmap(f.fileno(), 0, access=ACCESS_READ)
ds = dcmread(mm)
```

So, since after the large parts there isn't much file left anyway, and numpy.memmap doesn't seem to add special behavior beyond the ndarray interface … I'm not sure how full-file … The full file …

Thoughts?
You already had me convinced that the …
I don't have experience with Python's mmap.
And just to be clear, I didn't mean anything bad by my comment, it's just that I respect your judgment, and I jumped in without a good understanding of how these two choices actually work. And now I understand much better, which will hopefully translate into a better solution that won't need reworking later.

I'll try some tests with huge files and see how it looks. I think my concern about 'double'-reading may not actually be an issue, because the data elements before the pixel data will almost certainly be in memory in a small number of pages (usually one for most images, I would think). Although some SOP classes (e.g. RTSTRUCT) can be pretty large because of non-binary data.

I think it will come down to the open file handles issue, and also making the 'defer' concept work again. I plan to think a bit more about this and the different use cases.
I don't think we can do anything to help with the 32-bit 4 GB addressing limit. People with files over that size will just have to be on a 64-bit system.
That goes without saying - I can't even imagine that from you :)
Sounds like a good plan to me.
Agreed. Also, I couldn't find it in the documentation, but I expect to get an exception if the mapping fails, which we could handle by falling back to no mapping.
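(A minimal sketch of that fallback idea, assuming mapping problems surface as OSError or ValueError - mmap raises OSError on failure and numpy.memmap can raise either; the file name, offset, and length variables are placeholders.)

```python
import numpy as np

# Hypothetical names for illustration: filename, value_offset and value_length
# would come from the deferred-read bookkeeping.
try:
    value = np.memmap(filename, dtype="uint8", mode="r",
                      offset=value_offset, shape=(value_length,))
except (OSError, ValueError):
    # Mapping failed - fall back to reading the value into memory as before.
    with open(filename, "rb") as f:
        f.seek(value_offset)
        value = f.read(value_length)
```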
As an interesting side comment, during my reading, I see that mmap has the "copy on write" option, which basically mirrors the existing array and only actually copies the pages that are modified, thereby conserving memory - this might be relevant for large files with respect to the solution in #717 / #742, where a copy of the numpy array is returned. It would be ideal for the case where the user doesn't actually change anything. It wouldn't help for modifying all pixel data, but perhaps if only a section of an image were modified it might vastly improve memory use.
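(For reference, copy-on-write is also exposed by numpy.memmap as mode 'c'; a minimal sketch with placeholder offset and dimensions.)

```python
import numpy as np

# Placeholder values for illustration only.
pixel_data_offset = 0x0A2C
rows, cols = 4096, 4096

# mode='c' is copy-on-write: the array is writeable, but changes stay in memory
# (the file is never modified) and only the pages actually written get copied.
arr = np.memmap("large_image.dcm", dtype=np.uint16, mode="c",
                offset=pixel_data_offset, shape=(rows, cols))
arr[0, 0] = 0   # duplicates just the affected page; the file is untouched
```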
In terms of setting up testing for this, I'm wondering whether anybody knows of any good test files - large ones that were previously attached to pydicom issues, or just any freely available on the web. I had a quick look but they are hard to find. I'm looking for files of ~1 GB to ~20 GB, both uncompressed and compressed (although I can create the latter from the former using dcmtk or gdcm if necessary).

Alternatively (or perhaps in addition), does anyone know of best practices for simulating limited memory in Python, especially with …

Thanks
Here's some data that might work for you: https://www.dropbox.com/s/r3xdpym8wb6ztan/LUNG-1-LN_DICOM_OneChannelPerImage_20200724.tar.bz2

It's public microscopy data converted to dicom multiframe by @dclunie. https://www.synapse.org/#!Synapse:syn17774625 as described in detail in https://www.nature.com/articles/s41597-019-0332-y
Thanks, @pieper, that covers an important part of the tests for sure.
Describe the changes
Adds memmap config capability (reading of large values deferred, then mapped directly to file, not read into memory).
Closes #139. Related to #291.
Defers reading of any large binary data element value (above specified threshold). When the data element value is accessed, returns a numpy memmap for the value. This part basically replaces previous "deferred" reading.
First step does this for any binary VR above the size threshold.
Next step is to modify pixel_array to avoid copying the PixelData into memory, but leave as memmap where possible (uncompressed pixel data).
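A rough usage sketch of the behaviour this PR is aiming for (config.memmap_size as shown in the diff; the threshold and file name are illustrative only):

```python
import pydicom
from pydicom import config

config.memmap_size("100 MB")              # values above this are left on disk
ds = pydicom.dcmread("very_large_ct.dcm")
raw = ds.PixelData                        # read lazily, returned as a numpy memmap
arr = ds.pixel_array                      # next step: keep this memmap-backed for uncompressed data
```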