[WIP] Add memmap capability for binary data element values #1267

Open

wants to merge 1 commit into main
Conversation

darcymason
Member

Describe the changes

Adds memmap config capability (reading of large values is deferred, then mapped directly to the file rather than read into memory).
Closes #139. Related to #291.

Defers reading of any large binary data element value (above a specified threshold). When the data element value is accessed, a numpy memmap is returned for the value. This essentially replaces the previous "deferred" reading.

The first step does this for any binary VR above the size threshold.

The next step is to modify pixel_array to avoid copying the PixelData into memory, leaving it as a memmap where possible (uncompressed pixel data).
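For illustration, enabling the option might look something like the sketch below. This is based on the names appearing in this diff (config.memmap_size() and the deferred-read path in filereader); the file name and threshold are made up, and the exact API may still change:

```python
from pydicom import config, dcmread

# Defer and memmap any binary data element value larger than ~100 MB
# (the size can also be given as a string, parsed by misc.size_in_bytes)
config.memmap_size("100 MB")

ds = dcmread("huge_ct_series.dcm")

# The large value was not read into memory during dcmread(); accessing it
# triggers the deferred read, which returns a numpy.memmap backed by the file
raw_pixel_bytes = ds.PixelData
```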

Tasks

  • Unit tests added that reproduce the issue or prove feature is working
  • Fix or feature added
  • Documentation updated (if relevant)
    • No warnings during build
    • Preview link (CircleCI -> Artifacts -> [...]/_build/html/index.html)
  • Unit tests passing and overall coverage the same or better

* read-only so far
* begin editing numpy_handler
* add "TB" to size_in_bytes units
@pep8speaks

Hello @darcymason! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 364:1: E302 expected 2 blank lines, found 1

Line 1046:37: E128 continuation line under-indented for visual indent
Line 1053:25: E128 continuation line under-indented for visual indent
Line 1057:25: E128 continuation line under-indented for visual indent

Line 41:5: E303 too many blank lines (2)
Line 56:38: W292 no newline at end of file

@codecov

codecov bot commented Nov 27, 2020

Codecov Report

Merging #1267 (360ff40) into master (3a28b59) will decrease coverage by 0.02%.
The diff coverage is 87.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1267      +/-   ##
==========================================
- Coverage   95.22%   95.19%   -0.03%     
==========================================
  Files          60       60              
  Lines        9501     9527      +26     
==========================================
+ Hits         9047     9069      +22     
- Misses        454      458       +4     
| Impacted Files | Coverage Δ | |
|---|---|---|
| pydicom/pixel_data_handlers/numpy_handler.py | 98.86% <80.00%> (-1.14%) | ⬇️ |
| pydicom/filereader.py | 93.47% <86.66%> (-0.30%) | ⬇️ |
| pydicom/config.py | 95.45% <90.90%> (-0.70%) | ⬇️ |
| pydicom/misc.py | 100.00% <100.00%> (ø) | |
| pydicom/pixel_data_handlers/util.py | 99.71% <100.00%> (+<0.01%) | ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3a28b59...310e70f.

@scaramallion
Member

Does it have to be a numpy memmap? What about Python's mmap?

@darcymason
Member Author

What about Python's mmap?

I got scared off by this:

... offset must be a multiple of the ALLOCATIONGRANULARITY which is equal to PAGESIZE on Unix systems.

and the page size is, I think, 4096 bytes by default. For us, the offset to the data can be completely arbitrary.

Also, when returning pixel_array it seemed easier to just adjust the dtype and shape of an existing memmap (for non-compressed pixels), because memmap objects act just like an ndarray; even isinstance(np_memmap, np.ndarray) is True.
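For context, a minimal sketch of that (file name, offset and shape are made up) - numpy.memmap accepts an arbitrary byte offset, handling the allocation granularity internally, and the resulting object passes isinstance checks against ndarray:

```python
import numpy as np

# Map raw 16-bit pixel data starting at an arbitrary byte offset in the file
mm = np.memmap("image.dcm", dtype=np.uint8, mode="r",
               offset=1234, shape=(512 * 512 * 2,))

# A memmap behaves like an ndarray, so the dtype and shape can simply be adjusted
pixels = mm.view(np.uint16).reshape(512, 512)

print(isinstance(mm, np.ndarray))  # True - memmap is an ndarray subclass
```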

@scaramallion
Member

scaramallion commented Nov 27, 2020

Hmm, fair enough. You've beaten me to it, but my thinking was going to be memmapping the whole file instead, similar to the RawDataElement concept. I'd really like the frame handling to not require numpy, but I suppose it's not super important.

@darcymason
Member Author

my thinking was going to be memmapping the whole file instead, similar to the RawDataElement concept

Hmmm, I hadn't thought of that, and I'm not clear on how that would work - do you mean still parse through the data elements and memmap just the values? If so, it seems like a performance hit, because it would effectively read the file twice; skipping bytes isn't any faster than reading them into memory (smaller values are almost certainly already in memory anyway, due to buffering during file reading).

Also, many elements take up memory after conversion anyway - e.g. all text values because they are decoded to Python (unicode) strings, IS/DS values are converted to Python number types, multi-values are split, etc. My thought here is that uncompressed pixel data should be able to stay in the file, but not many other values can.

Also, in this PR the file handle is not obtained until a large data element is actually accessed - if we kept file handles open for all files, we might run into too many open handles, as people have seen in the past.

I'd really like the frame handling to not require numpy, but I suppose it's not super important.

Don't they eventually have to be a numpy array anyway, if the user is to do any manipulation? Maybe I'm thinking of frames only in terms of pixel data; are there others?

I'm happy to hear your thoughts, because this looks like it could get quite complex; better to nail down the issues as early as possible.

@scaramallion
Member

Those are all good points, using numpy seems like the way to go.

@@ -1008,22 +1021,41 @@ def read_deferred_data_element(fileobj_type, filename_or_obj, timestamp,
if is_filename else filename_or_obj)
is_implicit_VR = raw_data_elem.is_implicit_VR
is_little_endian = raw_data_elem.is_little_endian

if config._memmap_size and raw_data_elem.length > config._memmap_size:
Member
Should also check for binary VR here as above (or move the check into a variable).

raise ImportError(
"Numpy required for memmap'd values, but numpy is missing"
)
mode = 'r' if config._memmap_read_only else 'r+'
Member
I was already asking here why not use the Python module, and why not map the whole file - but now noticed that @scaramallion had asked exactly the same :) Makes sense to me, too.

_memmap_read_only: bool = True

def memmap_size(size: Union[int, str]) -> None:
"""Set Data element value lenght beyond which values are memmap'd
Member
typo lenght

Comment on lines -1021 to +1054
"original {1:s}".format(data_elem.VR,
raw_data_elem.VR))
"original {1:s}".format(data_elem.VR,
raw_data_elem.VR))
Member
Seems like your editor (or black?) is messing up the indentation.

@darcymason
Member Author

I was already asking here why not use the Python module, and why not map the whole file - but now noticed that @scaramallion had asked exactly the same :)

Okay, since you were both thinking this, it made me question whether I'm thinking about this correctly. And looking again, perhaps Python's mmap has some advantages. Unfortunately, I find the online explanations don't really explain what is going on - the best I can tell is that it optimizes reading by using system calls that work more directly with the operating system's virtual memory (paging).

Looking into numpy's memmap, it uses mmap under the hood anyway; it just takes the allocation granularity into account and works around it.

But one question - since mmap has all the needed file methods (read, seek, tell), can it not just be passed to dcmread right now as an open file object? It seems it can - I just tried it on one of the larger (but only ~7 MB) pydicom data files and it seemed to work, although I didn't check memory use to see what it is doing:

```python
from mmap import mmap, ACCESS_READ
from pydicom.data import get_testdata_file
from pydicom import dcmread

# Memory-map the whole file read-only (length 0 means "the entire file"),
# then pass the map to dcmread as if it were an open file object
f = open(get_testdata_file("RG3_UNCR.dcm"), 'rb')
mm = mmap(f.fileno(), 0, access=ACCESS_READ)
ds = dcmread(mm)
```

So, since there isn't much of the file left after the large parts anyway, and numpy.memmap doesn't seem to add special behavior beyond mmap, mapping the whole file probably makes more sense - and apparently we already have that for reading, as shown above.

I'm not sure how a full-file mmap would compare with the 'defer' concept that the code I started here implements. Does it still use large amounts of (real) memory or page swapping for the raw data element value if the read isn't deferred? We would need some tests on this with huge files.

The full-file mmap is probably easier for adding a write capability, especially if we want to do more than in-place changes, where the size of the pixel data changes. We could add flags to dcmread, or create a new function, to allow read/write access to a file handle, which could include mmaps.
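Something along these lines, perhaps (purely hypothetical usage - no such flag or function exists yet; this just shows that a writable map exposes the same file-like interface):

```python
from mmap import mmap, ACCESS_WRITE

from pydicom import dcmread

# A writable map: changes written through the mapping go back to the file
with open("large_file.dcm", "r+b") as f:
    mm = mmap(f.fileno(), 0, access=ACCESS_WRITE)
    ds = dcmread(mm)
    # ... modify values in place through the mapping ...
    mm.flush()
    mm.close()
```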

Thoughts?

@mrbean-bremen
Member

Okay, since you were both thinking this, it made me question whether I'm thinking about this correctly.

You already had me convinced that the NumPy approach makes sense (my comment was meant that way, in case that was not clear). But if using mmap is as easy as you found out, I don't see many problems with that.

I'm not sure how a full-file mmap would compare with the 'defer' concept that the code I started here implements. Does it still use large amounts of (real) memory or page swapping for the raw data element value if the read isn't deferred? We would need some tests on this with huge files.

I don't have experience with Python's mmap, but I have used memory mapping before via OS functions. As far as I know, the virtual memory is only reserved, not actually committed, until it is accessed, and the OS usually does a good job of accessing it efficiently and transparently (page by page). The only problem I remember is that the virtual memory for a mapped file must be available as a contiguous address range, which should not be a problem on 64-bit OSes, but could be with a 32-bit OS and large files (not sure if this is still a thing).

@darcymason
Member Author

You already had me convinced that the NumPy approach makes sense (my comment was meant that way, in case that was not clear). But if using mmap is as easy as you found out, I don't see many problems with that.

And just to be clear, I didn't mean anything bad by my comment, it's just that I respect your judgment, and I jumped in without a good understanding of how these two choices actually work. Now I understand much better, which will hopefully translate into a better solution that won't need reworking later.

I'll try some tests with huge files and see how it looks. I think my concern about 'double'-reading may not actually be an issue, because the data elements before the pixel data will almost certainly be in memory in a small number of pages (usually one for most images, I would think). Although some SOP classes (e.g. RTSTRUCT) can be pretty large because of non-binary data.

I think it will come down to the open file handles issue, and also making the 'defer' concept work again. I plan to think a bit more about this and the different use cases.

should not be a problem on 64-bit OSes, but could be with a 32-bit OS and large files (not sure if this is still a thing).

I don't think we can do anything to help with the 32-bit 4 GB addressing limit. People with files over that size will just have to be on a 64-bit system.

@mrbean-bremen
Member

mrbean-bremen commented Nov 29, 2020

And just to be clear, I didn't mean anything bad by my comment

That goes without saying - I can't even imagine that from you :)

I think it will come down to the open file handles issue, and also making the 'defer' concept work again. I plan to think a bit more about this and the different use cases.

Sounds like a good plan to me.

I don't think we can do anything to help with the 32-bit 4 GB addressing limit. People with files over that size will just have to be on a 64-bit system.

Agreed. Also, I couldn't find it in the documentation, but I expect to get an exception if the mapping fails, which we could handle by falling back to no mapping.
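e.g. something along these lines (just a sketch; the fallback path is hypothetical):

```python
from mmap import mmap, ACCESS_READ


def try_mmap(f):
    """Return a read-only map of the open file, or None if mapping fails."""
    try:
        return mmap(f.fileno(), 0, access=ACCESS_READ)
    except (OSError, ValueError):
        # e.g. not enough contiguous address space on a 32-bit system -
        # fall back to normal buffered reading
        return None
```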

@darcymason
Member Author

As an interesting side comment, during my reading I saw that mmap has a "copy on write" option, which basically mirrors the existing data and only actually copies the pages that are modified, thereby conserving memory - this might be relevant for large files with respect to the solution in #717 / #742, where a copy of the numpy array is returned. It would be ideal if the user doesn't actually change anything. It wouldn't help when modifying all of the pixel data, but if only a section of an image were modified it might vastly improve memory use.
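For reference, a minimal sketch of that copy-on-write mode (the file name is made up; I believe this is mmap's ACCESS_COPY, where reads come from the original pages and only pages actually written to are copied into memory - the file on disk is never modified):

```python
from mmap import mmap, ACCESS_COPY

with open("large_image.dcm", "rb") as f:
    # Private, writable view of the file: modifications stay in memory only
    mm = mmap(f.fileno(), 0, access=ACCESS_COPY)
    mm[:4] = b"\x00\x00\x00\x00"  # copies only the affected page
    mm.close()
```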

@darcymason
Member Author

In terms of setting up testing for this, I'm wondering whether anybody knows of any good test files - large ones that were previously attached to pydicom issues, or just any freely available on the web. I had a quick look but they are hard to find. I'm looking for files of ~1 GB to ~20 GB, both uncompressed and compressed (although I can create the latter from the former using dcmtk or gdcm if necessary).

Alternatively (or perhaps in addition), does anyone know of best practices for simulating limited memory in Python, especially with pytest? I found one example on Stack Overflow that could possibly be adapted. I'm looking for ways to force virtual memory use with more reasonably-sized files (a few hundred MB). The alternative is to use a virtual Linux box with limited memory, but running tests in that is not a completely automated process.
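One option might be a fixture along these lines (a rough sketch, Unix-only, using the standard resource module; the fixture name and the 200 MB cap are arbitrary):

```python
import resource

import pytest


@pytest.fixture
def limited_memory():
    """Temporarily cap this process's address space to simulate low memory."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (200 * 1024 * 1024, hard))
    yield
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```

One caveat: RLIMIT_AS caps the virtual address space, so it would also restrict the size of the mapping itself - it may only be useful for checking that the non-memmap code paths stay within bounds.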

Thanks

@pieper
Contributor

pieper commented Dec 1, 2020

@darcymason
Member Author

Thanks, @pieper, that covers an important part of the tests for sure.

Successfully merging this pull request may close these issues:

Use numpy.memmap() instead of loading pixel data to memory