Refine and simplify the DICOM anonymise API #1011
Originally posted by @SimonBiggs in #1009 |
frozenmap (python.org/dev/peps/pep-0603) would be an elegant and transparent way of doing this, but then users are restricted to Python 3.9+. I'll have to learn whether the frozen dataclass and frozenmap have compatible accessors. If they do, user code that just creates and uses a strategy would probably be safe if we were to migrate from frozen dataclass to frozenmap. It's clear, though, that mutation will not have the same interface, so more complex use and the underlying support would see some ripple. |
With dataclasses, there is a pypi package that provides 3.6 support. We already install this if a user installs pymedphys on Python 3.6: Line 31 in 864d1fb
|
On further thought here, I think we should expose the following two functions:
anonymise(dataset, strategy)
check_strategy(strategy)
And then have for import a range of strategies exposed just as dictionaries:
from pymedphys.dicom.anonymisation import HASHING_STRATEGY, anonymise
anonymise(dataset, strategy=HASHING_STRATEGY)
The strategy that is passed to anonymise should be able to be just a dictionary. Internally within anonymise a thing it'll run early on is Within pymedphys, I'm happy for
def check_strategy(strategy):
    AnonymisationStrategy(strategy)
That way, the code can still be written as an object that undergoes verification on initialisation. But users are exposed to just two functions and a range of dictionary strategies. That way we keep our API surface area small. If we want to change the Strategy object it won't affect our API. It also means users don't have to be comfortable with object initialisation and usage in order to use these pymedphys functions. But through |
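A minimal sketch of the two-function surface proposed above, under loud assumptions: the dataset is stood in for by a plain dict rather than a pydicom dataset, the validation rules inside `AnonymisationStrategy` are invented for illustration, and the contents of `HASHING_STRATEGY` and the `_hash_value` helper are hypothetical.

```python
import hashlib


class AnonymisationStrategy:
    """Internal object that validates a strategy on initialisation.

    The checks here are hypothetical; the thread does not specify them.
    """

    def __init__(self, strategy):
        if not isinstance(strategy, dict):
            raise TypeError("strategy must be a dictionary")
        for keyword in strategy:
            if not isinstance(keyword, str):
                raise TypeError(f"keyword {keyword!r} must be a string")
        self.replacements = dict(strategy)


def _hash_value(value):
    # Hypothetical helper: a short, stable digest of the original value.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:16]


# Hypothetical contents; the real strategy is not shown in the thread.
HASHING_STRATEGY = {"PatientName": _hash_value, "PatientID": _hash_value}


def check_strategy(strategy):
    """Public check: raises if the strategy does not validate."""
    AnonymisationStrategy(strategy)


def anonymise(dataset, strategy):
    """Return an anonymised copy of a dict standing in for a dataset."""
    check_strategy(strategy)
    anonymised = dict(dataset)
    for keyword, replacement in strategy.items():
        if keyword in anonymised:
            anonymised[keyword] = replacement(anonymised[keyword])
    return anonymised
```

Users would only ever touch `anonymise`, `check_strategy` and the dictionary constants, so the validating class could change shape without touching the public API.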
For the deprecation pathway, I propose we do the following: We mark anonymise as deprecated: Line 6 in e5f8fda
using the deprecated decorator: pymedphys/pymedphys/beta/trf.py Lines 8 to 10 in e5f8fda
We tell people, that soon from pymedphys.dicom import anonymisation
anonymisation.anonymise(dataset, strategy=anonymisation.HASHING_STRATEGY) |
I've also been thinking that people should be able to have either a function or a constant value be able to be placed within Anonymise will check to see if the item is a callable. If it is, then it will call the function with the current value. If it is not a function, then it will just assign the value to the item. Maybe something like the following?
import collections.abc
def apply_replacement(value, method):
    if isinstance(method, collections.abc.Callable):
        return method(value)
    return method |
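For illustration, the same callable-or-constant dispatch can be written with the built-in `callable()`. The strategy contents and the `redact` helper below are hypothetical, and a plain dict stands in for a dataset.

```python
def apply_replacement(value, method):
    """Return method(value) if method is callable, otherwise method itself."""
    if callable(method):
        return method(value)
    return method


def redact(value):
    # Hypothetical replacement function: keep the length, hide the content.
    return "X" * len(value)


# One entry is a function, the other a constant value.
strategy = {"PatientName": redact, "PatientSex": "O"}
dataset = {"PatientName": "Doe^Jane", "PatientSex": "F"}

anonymised = {
    keyword: apply_replacement(value, strategy[keyword])
    for keyword, value in dataset.items()
}
```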
After thinking about it in the car I've actually come full circle... We really don't want people to be able to edit the strategies that come with pymedphys. And the natural approach to adjusting a strategy will be to adjust those dictionaries... And that can really cause some nasty mistakes. ...so I'm back at your original idea, an immutable mapping based Strategy class exposed via the API. ...however, I remembered something neat. We are free to use frozenmap in Python 3.5+. We just have to install it from https://github.com/MagicStack/immutables So, I propose we make |
It looks like the MagicStack immutables frozenmap has a compatible API with what is going in to 3.9 (if not identical, including implementation), so I'd like us to try to do this in a way that automatically switches between the parent namespace for python version >= (major=3, minor=9). |
Yup, I think we should provide them a |
At 36:00 -- 47:15 Raymond details a Validator class that I think will be quite neat and helpful here: |
There's a problem with using this because of one of the VR, "OB or OW"
So I think just using a dict that is exposed with a method that does a deepcopy and is not lru-cached would accomplish the same end. |
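A rough sketch of the "dict behind a deep-copying accessor" idea, assuming a module-private default and a hypothetical accessor name `get_default_vr_replacements`. The replacement values are placeholders. A plain dict has no trouble with a key such as "OB or OW".

```python
import copy

# Hypothetical module-level default, private by convention.
_DEFAULT_VR_REPLACEMENTS = {
    "PN": "Anonymous",
    "OB or OW": b"",
    "SQ": [],
}


def get_default_vr_replacements():
    """Return a fresh deep copy so callers cannot mutate the shared default.

    Deliberately not cached: a cached return value would hand every caller
    the same object, which defeats the purpose of copying.
    """
    return copy.deepcopy(_DEFAULT_VR_REPLACEMENTS)


# A caller can edit their copy freely, including nested values.
mine = get_default_vr_replacements()
mine["PN"] = "Changed"
mine["SQ"].append("nested edit")
```

Because each call returns an independent copy, edits to `mine` never reach the defaults that other callers receive.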
The current implementation of anonymise_dataset, in which copy_dataset affects the return value, prevents putting the copy_dataset directive/modifier in an "anonymisation strategy". There is an interesting deviation between the defaults for copy_dataset in This is rooted in anonymise_dataset() returning None if copy_dataset=False. Because the return value of anonymise_dataset() is not consistent and depends on copy_dataset, the programmer had to make a hard-coded choice: either perform further operations on the dataset that was passed in and set copy_dataset=False, or perform further operations on the return value, which is None unless copy_dataset=True. It's not wrong to have it work that way, but it means that if one wants to put that into the strategy itself, the strategy has to default to "It Just Works" for either anonymising a dataset "in hand" or "It Just Works" for dealing with a file. But it won't work correctly for both. So the programmer has to know that altering the "all inclusive" strategy is necessary for one or the other. While expecting in-place modification for working on a file saves on memory (and the time to make a deep copy of the dataset), it results in the above conundrum. |
I have code changes in my local environment that returns the anonymised data from anonymise_dataset() regardless of whether it's in-place or copied, and modifications to anonymise_file() that take advantage of that, and the automated tests pass. |
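The consistent-return behaviour described above could look roughly like the following. This is not the local change itself, only a sketch: the signature is a simplified stand-in for the real `anonymise_dataset`, the `replacements` parameter is hypothetical, and a plain dict stands in for a dataset.

```python
import copy


def anonymise_dataset(dataset, replacements, copy_dataset=True):
    """Anonymise and always return whichever dataset was anonymised."""
    # Work on a deep copy or on the caller's object, but return it either way.
    target = copy.deepcopy(dataset) if copy_dataset else dataset
    for keyword, replacement in replacements.items():
        if keyword in target:
            target[keyword] = replacement
    return target


original = {"PatientName": "Doe^Jane", "Modality": "CT"}

# Copying: the original is left untouched.
copied = anonymise_dataset(original, {"PatientName": "Anonymous"})
name_after_copy = original["PatientName"]

# In place: the same object comes back, now anonymised.
in_place = anonymise_dataset(
    original, {"PatientName": "Anonymous"}, copy_dataset=False
)
```

With this shape, calling code can always continue with the return value and no longer needs to know which mode was chosen.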
I think immutables works just fine in this case by using the following API:
In [1]: import immutables
In [2]: map = immutables.Map()
In [3]: with map.mutate() as mm:
   ...:     mm["OB or OW"] = 123
   ...:     mm["OB"] = 'something'
   ...:     mm["OW"] = 'something else!'
   ...:     map = mm.finish()
   ...:
In [4]: map
Out[4]: <immutables.Map({'OW': 'something else!', 'OB or OW': 123, 'OB': 'something'}) at 0x7fc9a23857d0>
or
In [7]: map = immutables.Map({"OB or OW": 123})
In [8]: map
Out[8]: <immutables.Map({'OB or OW': 123}) at 0x7fc9a2250b90>
I'm okay to have this API changed so that the object is returned and the function has a consistent signature no matter what is passed. Also, I've been thinking more about the following:
replace_values=True, keywords_to_leave_unchanged=(),
delete_private_tags=True, delete_unknown_tags=None,
copy_dataset=True, replacement_strategy=None,
identifying_keywords=None
Potentially the strategy configuration object (built on top of the https://numpy.org/doc/stable/reference/generated/numpy.array.html#numpy.array So the new API would look like this: pymedphys.dicom.anonymise(dicom_dataset, strategy, copy=True) And its usage would look like the following: import pydicom
import pymedphys
dicom_dataset = pydicom.dcmread('path/to/file.dcm')
strategy = pymedphys.dicom.replace_values_strategy
# or strategy = pymedphys.dicom.pseudonymise_strategy
anonymised_dicom_dataset = pymedphys.dicom.anonymise(
dicom_dataset, strategy
) |
I'm okay with making this change. In a world of pydicom/pydicom#1014 (comment) And then before actually making the switch, a major version bump would be needed. As such, if there is anything that can be done to tidy up the currently exposed API before going 1.0.0 that is massively preferable. |
The alternative API to immutables.Map() works. I'm happy with that. It also means I can start by passing the existing dict (saves a little typing). copy_dataset -> copy is fine by me, but I'd rather hold off on that until a second phase (the first phase being to implement the additional content in the strategy and use immutables.Map without breaking any signatures, and to issue a deprecation warning). Once the change is made to get anonymise_dataset() to consistently return whichever dataset is the anonymised one, there is no need to preserve copy_dataset or copy separate from the strategy itself; it can then live inside the strategy (soon to be based on a map). I don't think 3.9 is going to get frozenmap. I've looked at 3.9.0rc1 and it doesn't include it.
I'm leaning towards the last. Sets things up nicely for the future (minimise renaming later on, just make the import python version dependent... PEP 603 makes it in at some point) and the code will read well.
|
It also isn't too painful to depend on immutables and use it even if it is also available within the standard library.
I'm personally leaning towards the first one, given that, should there be small API changes between the current library and the standard lib, it is clear which one we were using in the code. But I'm more than happy to go with number three, as I see the value it gives in being able to more seamlessly straddle the two import locations. |
@SimonBiggs What do you think of explicit_checker: |
Yup, I could imagine a really nice little decorator that works a bit like:
@deprecate_parameters(["deprecated", "parameter", "names"], message="Please use etc etc instead")
def function(staying, constant, deprecated, parameter=True, names="boo"):
    do_stuff()
And that would be quite a neat way to flag |
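One way such a `deprecate_parameters` decorator might be built, offered as a sketch only: it assumes that "deprecated" means "warn whenever the caller passes the parameter explicitly", and it detects that with `inspect.Signature.bind_partial` rather than the explicit_checker approach mentioned earlier. The decorated `anonymise` function and its message are placeholders.

```python
import functools
import inspect
import warnings


def deprecate_parameters(names, message):
    """Warn when any of the named parameters is passed explicitly."""

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # bind_partial maps positional and keyword arguments to parameter
            # names without filling in defaults, so only explicit ones appear.
            passed = signature.bind_partial(*args, **kwargs).arguments
            used = [name for name in names if name in passed]
            if used:
                warnings.warn(
                    f"Parameters {used} are deprecated. {message}",
                    DeprecationWarning,
                    stacklevel=2,
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Placeholder function: only copy_dataset is flagged as deprecated.
@deprecate_parameters(["copy_dataset"], message="Please use the strategy instead")
def anonymise(dataset, strategy=None, copy_dataset=True):
    return dataset
```

Calling `anonymise(ds)` stays silent, while `anonymise(ds, copy_dataset=False)` emits a `DeprecationWarning`, whether the argument is passed by keyword or by position.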
@sjswerdloff How about, instead of bundling the knobs and dials, given they work and have been around for a while without needing a change, we opt for leaving them as is for now and not bundle them into I have been thinking about the innate linking of So, how about, we bundle
def some_function(identifying_value, vr):
    anonymised_value = do_something()
    return anonymised_value
strategy = immutables.Map({
    'vr_replacement': {
        "AE": "Anonymous",  # can be defined as a plain lookup table
        "LO": some_function  # or as a function
    },
    'identifying_keywords': [
        'PatientName',
        'OtherPatientNames'
    ]
})
... hmmm scratch all that ... ...so I got to this point... and I realised, sometimes what the anonymisation value is will depend upon the keyword as well as the vr: pymedphys/pymedphys/_dicom/anonymise/core.py Lines 185 to 195 in 8484043
... at that point I realised ... maybe we have made the strategy API too inflexible. Maybe what's really needed is the following. Leave
def replacement_strategy(vr, keyword, value):
    anonymised_value = do_something()
    return anonymised_value
That way, the default replacement strategy can look like the following:
def replacement_strategy(vr, keyword, _):
    if keyword == "PatientSex":
        return "O"
    if vr == "SQ":
        return [pydicom.Dataset()]
    return VR_TO_REPLACEMENT_MAP[vr]
...okay... I much prefer that second approach. Way simpler, far easier to understand for the user. What are your thoughts? |
I am thinking that the infrastructure is fine the way it is, with the exception of handling identifying keys that are of VR CS, and those require either:
Right now, there is only PatientSex. I am of the opinion that for humans there will not be another element of VR CS until work being done involving more nuanced descriptions of gender (non-binary, transexual, Fa'afafine, etc.) in the DICOM Standard is completed and implemented. I don't believe in trying to anticipate the outcome of that work. For gender, whether for the current definition in the Standard or any future definition, I think it is reasonable to expect the information to have significant clinical implications, and important to be able to use for classifying. Take COVID-19 as an example. The outcomes have been reported as being significantly affected by gender. If a programmer or user wants to take approach 1 (leave it alone), they can do so quite easily by specifying that the key/tag/attribute/element is to be ignored (or by removing the entry from the list of identifiers). That is what I have done as a user of PyMedPhys pseudonymisation. The difference in value to the clinician or researcher between "" and "O" for PatientSex is minimal. So I think a default replacement value for CS should be "". If one goes to "need to know the tag", then there is no need for the VR because that can be obtained from the tag (we already do that one level higher!), and more importantly if you want "per tag anonymisation" then a dictionary of tags and replacement values/functions provides that level of detail without doing a bunch of dynamic special casing. Memory is cheap. Constructing that dictionary would not be hard, especially given the existing VR based dictionaries, because the special cases could be overwritten after constructing the tag based dictionary from the VR based dictionaries. |
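The construction described above, a per-tag dictionary expanded from the VR based one and then overwritten for special cases, might be sketched as follows. Both lookup tables are heavily trimmed and hypothetical, as are the function name `build_keyword_replacements` and the replacement values; keywords are used in place of numeric tags for readability.

```python
# Hypothetical, trimmed lookup tables for illustration only.
VR_TO_REPLACEMENT = {
    "PN": "Anonymous",
    "LO": "Anonymous",
    "CS": "",
    "DA": "20190303",
}
KEYWORD_TO_VR = {
    "PatientName": "PN",
    "PatientID": "LO",
    "PatientSex": "CS",
    "PatientBirthDate": "DA",
}


def build_keyword_replacements(keyword_to_vr, vr_to_replacement, overrides=None):
    """Expand VR defaults to one entry per keyword, then apply special cases."""
    replacements = {
        keyword: vr_to_replacement[vr] for keyword, vr in keyword_to_vr.items()
    }
    # Special cases simply overwrite the VR-derived defaults.
    replacements.update(overrides or {})
    return replacements


# Default behaviour: CS elements such as PatientSex become "".
defaults = build_keyword_replacements(KEYWORD_TO_VR, VR_TO_REPLACEMENT)

# A user who prefers "O" for PatientSex overrides just that entry.
replacements = build_keyword_replacements(
    KEYWORD_TO_VR, VR_TO_REPLACEMENT, overrides={"PatientSex": "O"}
)
```

The lookup at anonymisation time is then a single dictionary access per element, with no dynamic special casing on keyword or VR.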
I've also reconsidered the approach to "refining and simplifying the DICOM anonymise API". If they want to do something more complicated, then there are the existing more or less public functions.
Most of Python assumes that the programmers will act as adults, and accept the potential consequences for violating the norms. |
Yup, perfect. I like it. I'm convinced. |
Originally posted by @sjswerdloff in #1009