[WIP] Add memmap capability for binary data element values #1267

Open

wants to merge 1 commit into main
Conversation

darcymason
Member

Describe the changes

Adds memmap config capability (reading of large values is deferred, then mapped directly to the file rather than read into memory).
Closes #139. Related to #291.

Defers reading of any large binary data element value (above a specified threshold). When the data element value is accessed, a numpy memmap is returned for the value. This essentially replaces the previous "deferred" reading.

The first step does this for any binary VR above the size threshold.

The next step is to modify pixel_array to avoid copying the PixelData into memory, leaving it as a memmap where possible (uncompressed pixel data).
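For illustration, enabling the option might look something like the sketch below. This is based on the names appearing in this diff (config.memmap_size() and the deferred-read path in filereader); the file name and threshold are made up, and the exact API may still change:

```python
from pydicom import config, dcmread

# Defer and memmap any binary data element value larger than ~100 MB
# (the size can also be given as a string, parsed by misc.size_in_bytes)
config.memmap_size("100 MB")

ds = dcmread("huge_ct_series.dcm")

# The large value was not read into memory during dcmread(); accessing it
# triggers the deferred read, which returns a numpy.memmap backed by the file
raw_pixel_bytes = ds.PixelData
```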

Tasks

  • Unit tests added that reproduce the issue or prove feature is working
  • Fix or feature added
  • Documentation updated (if relevant)
    • No warnings during build
    • Preview link (CircleCI -> Artifacts -> [...]/_build/html/index.html)
  • Unit tests passing and overall coverage the same or better

* read-only so far
* begin editing numpy_handler
* add "TB" to size_in_bytes units
@pep8speaks

Hello @darcymason! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 364:1: E302 expected 2 blank lines, found 1

Line 1046:37: E128 continuation line under-indented for visual indent
Line 1053:25: E128 continuation line under-indented for visual indent
Line 1057:25: E128 continuation line under-indented for visual indent

Line 41:5: E303 too many blank lines (2)
Line 56:38: W292 no newline at end of file

@codecov

codecov bot commented Nov 27, 2020

Codecov Report

Merging #1267 (360ff40) into master (3a28b59) will decrease coverage by 0.02%.
The diff coverage is 87.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1267      +/-   ##
==========================================
- Coverage   95.22%   95.19%   -0.03%     
==========================================
  Files          60       60              
  Lines        9501     9527      +26     
==========================================
+ Hits         9047     9069      +22     
- Misses        454      458       +4     
| Impacted Files | Coverage Δ | |
|---|---|---|
| pydicom/pixel_data_handlers/numpy_handler.py | 98.86% <80.00%> (-1.14%) | ⬇️ |
| pydicom/filereader.py | 93.47% <86.66%> (-0.30%) | ⬇️ |
| pydicom/config.py | 95.45% <90.90%> (-0.70%) | ⬇️ |
| pydicom/misc.py | 100.00% <100.00%> (ø) | |
| pydicom/pixel_data_handlers/util.py | 99.71% <100.00%> (+<0.01%) | ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3a28b59...310e70f.

@scaramallion
Member

Does it have to be a numpy memmap? What about Python's mmap?

@darcymason
Member Author

What about Python's mmap?

I got scared off by this:

... offset must be a multiple of the ALLOCATIONGRANULARITY which is equal to PAGESIZE on Unix systems.

and the page size is, I think, 4096 bytes by default. For us, the offset to the data can be completely arbitrary.

Also, when returning pixel_array it seemed easier to just adjust the dtype and shape of an existing memmap (for non-compressed pixels), because memmap objects act just like an ndarray; even isinstance(np_memmap, np.ndarray) is True.
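For context, a minimal sketch of that (file name, offset and shape are made up) - numpy.memmap accepts an arbitrary byte offset, handling the allocation granularity internally, and the resulting object passes isinstance checks against ndarray:

```python
import numpy as np

# Map raw 16-bit pixel data starting at an arbitrary byte offset in the file
mm = np.memmap("image.dcm", dtype=np.uint8, mode="r",
               offset=1234, shape=(512 * 512 * 2,))

# A memmap behaves like an ndarray, so the dtype and shape can simply be adjusted
pixels = mm.view(np.uint16).reshape(512, 512)

print(isinstance(mm, np.ndarray))  # True - memmap is an ndarray subclass
```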

@scaramallion
Member

scaramallion commented Nov 27, 2020

Hmm, fair enough. You've beaten me to it, but my thinking was going to be memmapping the whole file instead, similar to the RawDataElement concept. I'd really like the frame handling to not require numpy, but I suppose it's not super important.

@darcymason
Member Author

my thinking was going to be memmapping the whole file instead, similar to the RawDataElement concept

Hmmm, I hadn't thought of that, and I'm not clear on how that would work - do you mean still parse through the data elements and memmap just the values? If so, it seems like a performance hit, because it would effectively read the file twice; skipping bytes isn't any faster than reading them into memory (smaller values are almost certainly already in memory anyway, due to buffering during file reading).

Also, many elements take up memory after conversion anyway - e.g. all text values because they are decoded to Python (unicode) strings, IS/DS values are converted to Python number types, multi-values are split, etc. My thought here is that uncompressed pixel data should be able to stay in the file, but not many other values can.

Also, in this PR the file handle is not obtained until a large data element is actually accessed - if we kept file handles open for all files, we might run into too many open handles, as people have seen in the past.

I'd really like the frame handling to not require numpy, but I suppose it's not super important.

Don't they eventually have to be a numpy array anyway, if the user is to do any manipulation? Maybe I'm thinking of frames only in terms of pixel data; are there others?

I'm happy to hear your thoughts, because this looks like it could get quite complex; better to nail down the issues as early as possible.

@scaramallion
Member

Those are all good points, using numpy seems like the way to go.

@@ -1008,22 +1021,41 @@ def read_deferred_data_element(fileobj_type, filename_or_obj, timestamp,
if is_filename else filename_or_obj)
is_implicit_VR = raw_data_elem.is_implicit_VR
is_little_endian = raw_data_elem.is_little_endian

if config._memmap_size and raw_data_elem.length > config._memmap_size:
Member
Should also check for binary VR here as above (or move the check into a variable).

raise ImportError(
"Numpy required for memmap'd values, but numpy is missing"
)
mode = 'r' if config._memmap_read_only else 'r+'
Member
I was already asking here why not use the Python module, and why not map the whole file - but now noticed that @scaramallion had asked exactly the same :) Makes sense to me, too.

_memmap_read_only: bool = True

def memmap_size(size: Union[int, str]) -> None:
"""Set Data element value lenght beyond which values are memmap'd
Member
typo lenght

Comment on lines -1021 to +1054
"original {1:s}".format(data_elem.VR,
raw_data_elem.VR))
"original {1:s}".format(data_elem.VR,
raw_data_elem.VR))
Member
Seems like your editor (or black?) is messing up the indentation.

@darcymason
Member Author

I was already asking here why not use the Python module, and why not map the whole file - but now noticed that @scaramallion had asked exactly the same :)

Okay, since you were both thinking this, it made me question whether I'm thinking about this correctly. And looking again, perhaps Python's mmap has some advantages. Unfortunately, I find the online explanations don't really explain what is going on - the best I can tell is that it optimizes reading by using system calls that work more directly with the operating system's virtual memory (paging).

Looking into numpy's memmap, it uses mmap under the hood anyway; it just takes the allocation granularity into account and works around it.

But one question - since mmap has all the needed file methods (read, seek, tell), can it not just be passed to dcmread right now as an open file object? It seems it can - I just tried it on one of the larger (but only ~7 MB) pydicom data files and it seemed to work, although I didn't check memory use to see what it is doing:

```python
from mmap import mmap, ACCESS_READ
from pydicom.data import get_testdata_file
from pydicom import dcmread

# Memory-map the whole file read-only (length 0 means "the entire file"),
# then pass the map to dcmread as if it were an open file object
f = open(get_testdata_file("RG3_UNCR.dcm"), 'rb')
mm = mmap(f.fileno(), 0, access=ACCESS_READ)
ds = dcmread(mm)
```

So, since there isn't much of the file left after the large parts anyway, and numpy.memmap doesn't seem to add special behavior beyond mmap, mapping the whole file probably makes more sense - and apparently we already have that for reading, as shown above.

I'm not sure how a full-file mmap would compare with the 'defer' concept that the code I started here implements. Does it still use large amounts of (real) memory or page swapping for the raw data element value if the read isn't deferred? We would need some tests on this with huge files.

The full-file mmap is probably easier for adding a write capability, especially if we want to do more than in-place changes, where the size of the pixel data changes. We could add flags to dcmread, or create a new function, to allow read/write access to a file handle, which could include mmaps.
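Something along these lines, perhaps (purely hypothetical usage - no such flag or function exists yet; this just shows that a writable map exposes the same file-like interface):

```python
from mmap import mmap, ACCESS_WRITE

from pydicom import dcmread

# A writable map: changes written through the mapping go back to the file
with open("large_file.dcm", "r+b") as f:
    mm = mmap(f.fileno(), 0, access=ACCESS_WRITE)
    ds = dcmread(mm)
    # ... modify values in place through the mapping ...
    mm.flush()
    mm.close()
```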

Thoughts?

@mrbean-bremen
Member

Okay, since you were both thinking this, it made me question whether I'm thinking about this correctly.

You already had me convinced that the NumPy approach makes sense (my comment was meant that way, in case that was not clear). But if using mmap is as easy as you found out, I don't see many problems with that.

I'm not sure how a full-file mmap would compare with the 'defer' concept that the code I started here implements. Does it still use large amounts of (real) memory or page swapping for the raw data element value if the read isn't deferred? We would need some tests on this with huge files.

I don't have experience with Python's mmap, but I have used memory mapping before via OS functions. As far as I know, the virtual memory is only reserved, not actually committed, until it is accessed, and the OS usually does a good job of accessing it efficiently and transparently (page by page). The only problem I remember is that the virtual memory for a mapped file must be available as a contiguous address range, which should not be a problem on 64-bit OSes, but could be with a 32-bit OS and large files (not sure if this is still a thing).

@darcymason
Member Author

You already had me convinced that the NumPy approach makes sense (my comment was meant that way, in case that was not clear). But if using mmap is as easy as you found out, I don't see many problems with that.

And just to be clear, I didn't mean anything bad by my comment, it's just that I respect your judgment, and I jumped in without a good understanding of how these two choices actually work. Now I understand much better, which will hopefully translate into a better solution that won't need reworking later.

I'll try some tests with huge files and see how it looks. I think my concern about 'double'-reading may not actually be an issue, because the data elements before the pixel data will almost certainly be in memory in a small number of pages (usually one for most images, I would think). Although some SOP classes (e.g. RTSTRUCT) can be pretty large because of non-binary data.

I think it will come down to the open file handles issue, and also making the 'defer' concept work again. I plan to think a bit more about this and the different use cases.

should not be a problem on 64-bit OSes, but could be with a 32-bit OS and large files (not sure if this is still a thing).

I don't think we can do anything to help with the 32-bit 4 GB addressing limit. People with files over that size will just have to be on a 64-bit system.

@mrbean-bremen
Member

mrbean-bremen commented Nov 29, 2020

And just to be clear, I didn't mean anything bad by my comment

That goes without saying - I can't even imagine that from you :)

I think it will come down to the open file handles issue, and also making the 'defer' concept work again. I plan to think a bit more about this and the different use cases.

Sounds like a good plan to me.

I don't think we can do anything to help with the 32-bit 4 GB addressing limit. People with files over that size will just have to be on a 64-bit system.

Agreed. Also, I couldn't find it in the documentation, but I expect to get an exception if the mapping fails, which we could handle by falling back to no mapping.
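e.g. something along these lines (just a sketch; the fallback path is hypothetical):

```python
from mmap import mmap, ACCESS_READ


def try_mmap(f):
    """Return a read-only map of the open file, or None if mapping fails."""
    try:
        return mmap(f.fileno(), 0, access=ACCESS_READ)
    except (OSError, ValueError):
        # e.g. not enough contiguous address space on a 32-bit system -
        # fall back to normal buffered reading
        return None
```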

@darcymason
Member Author

As an interesting side comment, during my reading I saw that mmap has a "copy on write" option, which basically mirrors the existing data and only actually copies the pages that are modified, thereby conserving memory - this might be relevant for large files with respect to the solution in #717 / #742, where a copy of the numpy array is returned. It would be ideal if the user doesn't actually change anything. It wouldn't help when modifying all of the pixel data, but if only a section of an image were modified it might vastly improve memory use.
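For reference, a minimal sketch of that copy-on-write mode (the file name is made up; I believe this is mmap's ACCESS_COPY, where reads come from the original pages and only pages actually written to are copied into memory - the file on disk is never modified):

```python
from mmap import mmap, ACCESS_COPY

with open("large_image.dcm", "rb") as f:
    # Private, writable view of the file: modifications stay in memory only
    mm = mmap(f.fileno(), 0, access=ACCESS_COPY)
    mm[:4] = b"\x00\x00\x00\x00"  # copies only the affected page
    mm.close()
```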

@darcymason
Member Author

In terms of setting up testing for this, I'm wondering whether anybody knows of any good test files - large ones that were previously attached to pydicom issues, or just any freely available on the web. I had a quick look but they are hard to find. I'm looking for files of ~1 GB to ~20 GB, both uncompressed and compressed (although I can create the latter from the former using dcmtk or gdcm if necessary).

Alternatively (or perhaps in addition), does anyone know of best practices for simulating limited memory in Python, especially with pytest? I found one example on Stack Overflow that could possibly be adapted. I'm looking for ways to force virtual memory use with more reasonably-sized files (a few hundred MB). The alternative is to use a virtual Linux box with limited memory, but running tests in that is not a completely automated process.
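One option might be a fixture along these lines (a rough sketch, Unix-only, using the standard resource module; the fixture name and the 200 MB cap are arbitrary):

```python
import resource

import pytest


@pytest.fixture
def limited_memory():
    """Temporarily cap this process's address space to simulate low memory."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (200 * 1024 * 1024, hard))
    yield
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```

One caveat: RLIMIT_AS caps the virtual address space, so it would also restrict the size of the mapping itself - it may only be useful for checking that the non-memmap code paths stay within bounds.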

Thanks

@pieper
Contributor

pieper commented Dec 1, 2020

@darcymason
Member Author

Thanks, @pieper, that covers an important part of the tests for sure.

Successfully merging this pull request may close these issues:

Use numpy.memmap() instead of loading pixel data to memory