Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refine and simplify the DICOM anonymise API #1011

Closed
SimonBiggs opened this issue Aug 17, 2020 · 23 comments
Closed

Refine and simplify the DICOM anonymise API #1011

SimonBiggs opened this issue Aug 17, 2020 · 23 comments

Comments

@SimonBiggs
Copy link
Member

SimonBiggs commented Aug 17, 2020

if you mean to condense the various "knobs and buttons" that are currently used to control what happens within the anonymisation, then yes, having a single object represent the set of values currently defined separately:
replace_values=True, keywords_to_leave_unchanged=(), delete_private_tags=True, delete_unknown_tags=None, copy_dataset=True, replacement_strategy=None, identifying_keywords=None
While it could be done as a dictionary, it might make sense to incorporate it in to a Class, so the class can be derived from, and in that class, have the "keys validate against the dispatch table" (potentially automatically at creation and when any particular control is modified, as well as explicit if a programmer wants to confirm the strategy is valid before sending it in)
This also helps to avoid people mutating the dictionary unintentionally (although just having a function that returns a copy rather than directly hitting at the pseudonymisation_dispatch would be appropriate and is simple).
But it won't reduce the work a programmer has to do. It will simply shift it from the anonymisation call to the specification of the contents of the strategy. The benefit is in making a clear separation of where the concern is addressed.
It might motivate creation of either a variety of child classes or a set of factory creation methods, but it will at least shift the signature from invocation of anonymisation over to creation of a strategy object.

Originally posted by @sjswerdloff in #1009

@SimonBiggs
Copy link
Member Author

Yup, I think an "anonymisation strategy"/config object which has validation methods on initialisation would be quite a tidy way to go.

Building it as a frozen dataclass under the hood might be a neat solution:
docs.python.org/3/library/dataclasses.html#frozen-instances

(It would be nice to be able to use python.org/dev/peps/pep-0603 though...)

Originally posted by @SimonBiggs in #1009

@sjswerdloff
Copy link
Collaborator

frozenmap (python.org/dev/peps/pep-0603) would be an elegant and transparent way of doing this but then users are restricted to python 3.9+
dataclass would require Python 3.7+
Looks like Python 3.6 is still a rather popular version of 3.x
see:
https://dev.to/hugovk/python-version-share-over-time-6-1jb8
and
https://w3techs.com/technologies/details/pl-python/3

I'll have to learn if the frozen dataclass and frozenmap have compatible accessors (if they do, then user code that just creates and uses would probably be safe if we were to migrate from frozen dataclass to frozenmap, but it's clear that mutation will not have the same interface, so more complex use and underlying support would have some ripple).

@SimonBiggs
Copy link
Member Author

With dataclasses, there is a pypi package that provides 3.6 support. We already install this if a user installs pymedphys on Python 3.6:

dataclasses = { version = "*", python = "~3.6.0" }

@SimonBiggs
Copy link
Member Author

SimonBiggs commented Aug 17, 2020

On further thought here, I think we should expose the following two functions:

anonymise(dataset, strategy)
check_strategy(strategy)

And then have for import a range of strategies exposed just as dictionaries:

from pymedphys.dicom.anonymisation import HASHING_STRATEGY, anonymise

anonymise(dataset, strategy=HASHING_STRATEGY)

The strategy that is passed to anonymise should be able to be just a dictionary. Internally within anonymise a thing it'll run early on is check_strategy (or equivalent).

Within pymedphys, I'm happy for AnonymisationStrategy to be an object that we use internally, but, I think it shouldn't be a part of our API. Instead, check_strategy internally could just be:

def check_strategy(strategy):
    AnonymisationStrategy(strategy)

That way, the code can still be written as an object that undergoes verification on initialisation. But users are exposed to just two functions and a range of dictionary strategies.

That way we keep our API surface area small. If we want to change the Strategy object it won't affect our API. It also means users don't have to be comfortable with object initialisation and usage in order to use these pymedphys functions. But through check_strategy all the benefit of a strategy verification can be achieved.

@SimonBiggs
Copy link
Member Author

SimonBiggs commented Aug 17, 2020

For the deprecation pathway, I propose we do the following:

We mark anonymise as deprecated:

from ._dicom.anonymise import anonymise_dataset as anonymise

using the deprecated decorator:

read = _deprecated(
reason="This API has left beta. It is found at `pymedphys.trf.read`."
)(_read)

We tell people, that soon pymedphys.dicom.anonymise will be removed in favor of the function that currently (will) resides at pymedphys.beta.dicom.anonymisation.anonymise. Normally such a deep import path would make me cringe... maybe it still should. But this is pointing towards eventually having the function be imported/used as:

from pymedphys.dicom import anonymisation

anonymisation.anonymise(dataset, strategy=anonymisation.HASHING_STRATGY)

@SimonBiggs
Copy link
Member Author

I've been also thinking, that people should be able to have either a function or a constant value be able to be placed within STRATEGY['replacement'][VR]. That way there can be an "easy mode" strategy that is just a dictionary of replacement constants. And then, for some of the VRs, should people choose, they can also have a function.

Anonymise will check to see if the item is a callable, if it is, then it will call the function with the current value. If it is not a function, then it will just assign the value to the item. Maybe something like the following?

import collections

def apply_replacement(value, method):
    if isinstance(method, collections.abc.Callable):
        return method(value)
    
    return method

@SimonBiggs
Copy link
Member Author

SimonBiggs commented Aug 17, 2020

After thinking about it in the car I've actually come full circle... We really don't want people to be able to edit the strategies that come with pymedphys. And the natural approach to adjusting a strategy will be to adjust those dictionaries... And that can really cause some nasty mistakes.

...so I'm back at your original idea, an immutable mapping based Strategy class exposed via the API.

...however, I remembered something neat. We are free to use frozenmap in Python 3.5+. We just have to install it from https://github.com/MagicStack/immutables

So, I propose we make immutables a dependency, and build the strategy class on top of frozenmap.

@sjswerdloff
Copy link
Collaborator

It looks like the MagicStack immutables frozenmap has a compatible API with what is going in to 3.9 (if not identical, including implementation), so I'd like us to try to do this in a way that automatically switches between the parent namespace for python version >= (major=3, minor=9).
Or whatever version in which PEP-0603 gets implemented.
And encourage programmers who utilise it in a way that has them specifying the name space to do the same (so there can be a seamless transition), although in theory it should be possible for them to only import from anonymise and still accomplish a variety of tasks.

@SimonBiggs
Copy link
Member Author

Yup, I think we should provide them a Strategy class within pymedphys which they should use. That way we can make the API the same no matter the Python version, and we can under the hood fall back to the immutables library if Python <3.9 is installed.

@SimonBiggs
Copy link
Member Author

At 36:00 -- 47:15 Raymond details a Validator class that I think will be quite neat and helpful here:

https://youtu.be/UANN2Eu6ZnM

@sjswerdloff
Copy link
Collaborator

I propose we make immutables a dependency, and build the strategy class on top of frozenmap.

There's a problem with using this because of one of the VR, "OB or OW"

import immutable
my_map = immutables.Map(OB = 1234)
my_map = immutables.Map(OB or OW = 1234)
File "", line 1
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
my_map = immutables.Map("OB or OW" = 1234)
File "", line 1
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
my_map['OB']
1234

So I think just using a dict that is exposed with a method that does a deepcopy and is not lru-cached would accomplish the same end.
the dict would have additional keys beyond the dispatch table, e.g. 'delete_private_tags', and developers who create their own strategies and put in additional keys/options need to know to not use two character uppercase keys because they have the potential to clash with a future VR. At any rate, those keys would need to be known to the anonymisation harness, which is essentially a wrapper (decorator?) for pydicom.dataset.Dataset that enables the Visitor pattern. We are passing in a Visitor, which I've called a Strategy (because it is a strategy for anonymisation, but not an implementation of the Strategy pattern of overriding protected methods that are called from public methods in base/derived classes).
If a programmer wants to avoid having to keep getting the anonymisation strategy, they can just hold on to it.
If a programmer wants to alter the strategy, they can alter the dict that they have in hand, because it's a deep copy, not the original.
This also has the benefit of being backwards compatible.

@sjswerdloff
Copy link
Collaborator

The current implementation of anonymise_dataset with respect to copy_dataset affecting the return value prevents putting the copy_dataset directive/modifier in an "anonymisation strategy".

There is an interesting deviation between the defaults for copy_dataset in
anonymise_dataset() (default value for copy_dataset=True)
and the parameters passed in to it via anonymise_file() (which is hardcoded and sets the copy_dataset to False)

This is rooted in anonymise_dataset() returning None if copy_dataset=False.
I believe the paradigm for these types of operations is to not only modify the object in place, but to return the same object.
So I look at the current implementation as being either defective or at least not suitable for purpose given the desire to encode all modifiers/directives without having knowledge of whether the strategy is to be applied to data in memory vs. operating on files.

Because the return of anonymise_data() is not consistent based on copy_dataset, the programmer had to make a hard coded choice of either performing further operations on the dataset that was passed in and setting copy_dataset=False or performing further operations on the return value, which is implicitly hardcoded to None if copy_dataset=True

It's not wrong to have it work that way, but it means that if one wants to put that in to the strategy itself, the strategy has to default to "It Just Works" for either anonymising a dataset "in hand" or "It Just Works" for dealing with a file. But it won't work correctly for both. So the programmer has to know that altering the "all inclusive" strategy is necessary for one or the other.

While expecting in-place modification for working on a file saves on memory (and the time to make a deep copy of the dataset), it results in the above conundrum.
I think the choices are:
a) leave copy_dataset in the api as an individual optional parameter
b) provide the programmer with the means of getting the two variants when requesting a copy of the strategy (pass it in there)
c) change the code in anonymise_file to return the anonymised dataset (whether that was the copy or the object passed in and modified in place) appropriate to the copy_dataset directive (but always return the anonymised data).

@sjswerdloff
Copy link
Collaborator

I have code changes in my local environment that returns the anonymised data from anonymise_dataset() regardless of whether it's in-place or copied, and modifications to anonymise_file() that take advantage of that, and the automated tests pass.
My concern is whether there is code somewhere that is relying on getting None back from anonymise_dataset() as a means of informing subsequent behaviour. On the other hand, even though anonymise is not experimental, PyMedPhys doesn't claim a stable API. And in theory, pretty much the right thing should happen with anonymise_dataset() returning the in-place modified dataset that was passed in as a parameter. Unless someone made assumptions about what got returned not being their "in-place" copy...

@SimonBiggs
Copy link
Member Author

SimonBiggs commented Sep 15, 2020

There's a problem with using this because of one of the VR, "OB or OW"

I think immutables works just fine in this case by using the following API:

In [1]: import immutables                                                       

In [2]: map = immutables.Map()                                                  

In [3]: with map.mutate() as mm: 
   ...:     mm["OB or OW"] = 123 
   ...:     mm["OB"] = 'something' 
   ...:     mm["OW"] = 'something else!' 
   ...:     map = mm.finish() 
   ...:                                                                         

In [4]: map                                                                     
Out[4]: <immutables.Map({'OW': 'something else!', 'OB or OW': 123, 'OB': 'something'}) at 0x7fc9a23857d0>

or

In [7]: map = immutables.Map({"OB or OW": 123})                           

In [8]: map                                                               
Out[8]: <immutables.Map({'OB or OW': 123}) at 0x7fc9a2250b90>

This is rooted in anonymise_dataset() returning None if copy_dataset=False.
I believe the paradigm for these types of operations is to not only modify the object in place, but to return the same object.

I'm okay to have this API changed so that the object is returned and the function has a consistent signature no matter what is passed.


Also, I've been thinking more about the following:

if you mean to condense the various "knobs and buttons" that are currently used to control what happens within the anonymisation, then yes, having a single object represent the set of values currently defined separately:

	replace_values=True, keywords_to_leave_unchanged=(), 
	delete_private_tags=True, delete_unknown_tags=None, 
	copy_dataset=True, replacement_strategy=None, 
	identifying_keywords=None

While it could be done as a dictionary, it might make sense to incorporate it in to a Class, so the class can be derived from, and in that class, have the "keys validate against the dispatch table" (potentially automatically at creation and when any particular control is modified, as well as explicit if a programmer wants to confirm the strategy is valid before sending it in)
This also helps to avoid people mutating the dictionary unintentionally (although just having a function that returns a copy rather than directly hitting at the pseudonymisation_dispatch would be appropriate and is simple).

But it won't reduce the work a programmer has to do. It will simply shift it from the anonymisation call to the specification of the contents of the strategy. The benefit is in making a clear separation of where the concern is addressed.
It might motivate creation of either a variety of child classes or a set of factory creation methods, but it will at least shift the signature from invocation of anonymisation over to creation of a strategy object.

Originally posted by @sjswerdloff in #1009 (comment)

Potentially the strategy configuration object (built on top of the immutables.Map) should just take care of the parameters that specifically affect how the values themselves are adjusted replacement_strategy, identifying_keywords, replace_values, delete_private_tags, delete_unknown_tags. I believe keywords_to_leave_unchanged should be scrapped as it is now able to be achieved by removing an entry from identifying_keywords. Therefore, the only parameter left is copy_dataset. I propose we rename copy_dataset to just copy to mimic the numpy API:

https://numpy.org/doc/stable/reference/generated/numpy.array.html#numpy.array

So the new API would look like this:

pymedphys.dicom.anonymise(dicom_dataset, strategy, copy=True)

And its usage would look like the following:

import pydicom
import pymedphys

dicom_dataset = pydicom.read_file('path/to/file.dcm')

strategy = pymedphys.dicom.replace_values_strategy
# or strategy = pymedphys.dicom.pseudonymise_strategy

anonymised_dicom_dataset = pymedphys.dicom.anonymise(
    dicom_dataset, strategy
)

@SimonBiggs
Copy link
Member Author

My concern is whether there is code somewhere that is relying on getting None back from anonymise_dataset() as a means of informing subsequent behaviour.

I'm okay with making this change. In a world of pymedphys>=1.0.0 this would certainly need to have a deprecation marking, and we could then implement something like the following within pymedphys:

pydicom/pydicom#1014 (comment)

And then before actually making the switch, a major version bump would be needed.

As such, if there is anything that can be done to tidy up the currently exposed API before going 1.0.0 that is massively preferable.

@sjswerdloff
Copy link
Collaborator

sjswerdloff commented Sep 15, 2020

alternative API to immutables.Map() works. I'm happy with that. It also means I can start by passing the existing dict (save a little typing).

copy_dataset -> copy is fine by me, but I'd rather hold off on that in a second phase (first phase implementing the additional content in the strategy and using immutables.Map without breaking any signatures and issue deprecation warning).

Once the change is made to get anonymise_dataset() to consistently return whichever dataset is the anonymised one, there is no need to preserve copy_dataset or copy separate from the strategy itself, it can then live inside the strategy (soon to be based on a map).

I don't think 3.9 is going to get frozenmap. I've looked at 3.9.0rc1 and it doesn't include it.
3.9.0rc2 should be available in the next few days and the release should be out in early October, which is quite soon.
The reason I mention this is three-fold:

  1. specification of the dependency in pyproject.toml and poetry.lock (no need to constrain based on python version, yet)
  2. whether to
    import immutables
    or
    from immutables import Map
    or
    from immutables import Map as frozenmap

I'm leaning towards the last. Sets things up nicely for the future (minimise renaming later on, just make the import python version dependent... PEP 603 makes it in at some point) and the code will read well.
3) Stuff is going to break in 3.9, I think we need to get on that before 3.9 is released and people start using it (when it becomes the default Python shipping...) see Issue #958

pymedphys/tests/winstonlutz/test_end_to_end.py::test_end_to_end
/home/osboxes/PythonProjects/pymedphys4/pymedphys/pymedphys/_vendor/pylinac/core/utilities.py:135:
DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python > 3.3, and in 3.9 it will stop working
return isinstance(obj, collections.Iterable)

@SimonBiggs
Copy link
Member Author

  1. specification of the dependency in pyproject.toml and poetry.lock (no need to constrain based on python version, yet)

It also isn't too painful to depend on immutables and use it even if it is also available within the standard library.

2. whether to
import immutables
or
from immutables import Map
or
from immutables import Map as frozenmap

I'm personally leaning towards the first one, given that should there be small API changes between the current library and the standard lib it is clear which one we were using in the code. But more than happy to go with number three, as I see the value it gives in being able to more seamlessly saddle the two import locations.

@sjswerdloff
Copy link
Collaborator

@SimonBiggs What do you think of explicit_checker:
https://stackoverflow.com/questions/14749328/how-to-check-whether-optional-function-parameter-is-set/58166804#58166804
as a means of notifying a user/programmer that certain optional parameters are being deprecated?
The inspect package is a built-in as of Python 3.6, so this can be done with zero ripple.

@SimonBiggs
Copy link
Member Author

SimonBiggs commented Sep 24, 2020

as a means of notifying a user/programmer that certain optional parameters are being deprecated?

Yup, I could imagine a really nice little decorator that works a bit like:

@deprecate_parameters(["deprecated", "parameter", "names"], message="Please use etc etc instead")
def function(staying, constant, deprecated, parameter=True, names="boo"):
    do_stuff()

And that would be quite a neat way to flag

@SimonBiggs
Copy link
Member Author

SimonBiggs commented Sep 24, 2020

@sjswerdloff How about, instead of bundling the knobs and dials, given they work, and given they've been around for a while without needing a change, how bout we opt for leaving them as is for now and not bundle them into strategy.

I have been thinking about the innate linking of replacement_strategy and identifying_keywords. They have a degree of dependency on one another, and do both partially form part of the "strategy" so-to-speak.

So, how about, we bundle replacement_strategy and identifying_keywords into the one parameter called strategy which is an immutables.Map and have it be structured like so:

def some_function(identifying_value, vr):
    anonymised_value = do_something()

    return anonymised_value


strategy = immutables.Map({
    'vr_replacement': {
        "AE": "Anonymous",  # can be defined as a plain lookup table
        "LO": some_function  # or as a function
    },
    'identifying_keywords': [
        'PatientName',
        'OtherPatientNames'
    ]
})

... hmmm scratch all that ...


...so I got to this point... and I realised, sometimes what the anonymisation value is will depend upon the keyword as well as the vr:

if vr == "CS":
# An example, although this exact code breaks unit tests because
# the unit tests are expecting the CS hardcoded replacement string "ANON"
# if keyword == "PatientSex":
# replacement_value = "O" # or one can replace with an empty string because PatientSex is typically type 2
# else:
logging.warning(
"Keyword %s has Value Representation CS and may require special processing to avoid breaking DICOM conformance or interoperability",
keyword,
)
# elif ...

... at that point I realised ... maybe we have made the strategy API too inflexible. Maybe what's really needed is the following.

Leave identifying_keywords as a parameter, just as it is. Make replacement_strategy be a function instead of a dictionary or an immutables.Map. And have the function be defined as the following:

def replacement_strategy(vr, keyword, value):
    anonymised_value = do_something()
    return anonymised_value

That way, the default replacement strategy can look like the following:

def replacement_strategy(vr, keyword, _):
    if keyword == "PatientSex":
        return "O"

    if vr == "SQ":
        return [pydicom.Dataset()]

    return VR_TO_REPLACEMENT_MAP[vr]

...okay... I much prefer that second approach. Way simpler, far easier to understand for the user.

What are your thoughts?

@sjswerdloff
Copy link
Collaborator

sjswerdloff commented Sep 24, 2020

I am thinking that the infrastructure is fine the way it is, with the exception of handling identifying keys that are of VR CS, and those require either:

  1. leaving them alone
  2. putting in an empty value (fine for type 2, and not uncommon IRL)
  3. putting in an element specific value that counts as anonymising in some fashion

Right now, there is only PatientSex. I am of the opinion that for humans there will not be another element of VR CS until work being done involving more nuanced descriptions of gender (non-binary, transexual, Fa'afafine, etc.) in the DICOM Standard is completed and implemented. I don't believe in trying to anticipate the outcome of that work.

For gender, whether for the current definition in the Standard or any future definition, I think it is reasonable to expect the information to have significant clinical implications, and important to be able to use for classifying. Take COVID19 as an example. The outcomes have been reported as being significantly affected by gender.

If a programmer or user wants to take approach 1 (leave it alone), they can do so quite easily by specifying that key/tag/attribute/element is to be ignored (or remove the entry from the list of identifiers). That is what I have done as a user of PyMedPhys pseudonymisation.

The difference in value to the clinician or researcher between "" and "O" for PatientSex is minimal.
Any implementation that requires a value but accepts "O" is likely to have problems with typical data.
At least that was my experience "back in the day".

So I think a default replacement value for CS should be "".
It will work for any identifying CS that is not type 1.
And any identifying CS that is type 1... is going to fail for sure if the substituted value isn't one of the enumerated types and almost certain to fail if it is a defined type.
In other words, it works very well for what we know we have.
It should work well enough (not cause immediate failure) for almost any foreseeable situation .
But note that the test has to change (I tried it... which is why I had the example code and documented it out with explanation?).

If one goes to "need to know the tag", then there is no need for the VR because that can be obtained from the tag (we already do that one level higher!), and more importantly if you want "per tag anonymisation" then a dictionary of tags and replacement values/functions provides that level of detail without doing a bunch of dynamic special casing. Memory is cheap. Constructing that dictionary would not be hard, especially given the existing VR based dictionaries, because the special cases could be overwritten after constructing the tag based dictionary from the VR based dictionaries.
Considering we've gone from "eliminate or hardcode" and not worrying about CS getting mangled to where we are so far, going all the way to tag based when we could easily tidy up with 2 above if the user doesn't choose 1 above... Diminishing returns.

@sjswerdloff
Copy link
Collaborator

I've also reconsidered the approach to "refining and simplifying the DICOM anonymise API".
I think each strategy should provide a facade method.
If the user wants to keep it simple:
anonymise_clear(pydicom.dataset.Dataset | Path)
anonymise_hardcode(pydicom.dataset.Dataset | Path)
pseudonymise(pydicom.dataset.Dataset | Path)

If they want to do something more complicated, then there are the existing more or less public functions.
leave strategy as a dispatch table and nothing more. the knobs and buttons work. they have value, although some have diminished in value (replace_value could be done as a strategy, but at this point, why bother... it works).
The interface is not broken. It's just more awkward than we would like for the typical programmer.
That is exactly what a facade is supposed to fix.
The drawback is the larger surface area that has to be maintained, but the benefits are:

  • flexibility is maintained
  • backward compatibility is maintained
  • internally consistent behaviour is ensured for the nominal cases
  • the nominal cases have dirt simple interfaces

Most of Python assumes that the programmers will act as adults, and accept the potential consequences for violating the norms.

@SimonBiggs
Copy link
Member Author

I've also reconsidered the approach to "refining and simplifying the DICOM anonymise API".
I think each strategy should provide a facade method.
If the user wants to keep it simple:
anonymise_clear(pydicom.dataset.Dataset | Path)
anonymise_hardcode(pydicom.dataset.Dataset | Path)
pseudonymise(pydicom.dataset.Dataset | Path)

The interface is not broken. It's just more awkward than we would like for the typical programmer.
That is exactly what a facade is supposed to fix.

Yup, perfect. I like it. I'm convinced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants