-
-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support to parse private tags #125
Conversation
Signed-off-by: vsoch <vsochat@stanford.edu>
this means small updates to most functions to have special handling in the case of being provided a string (field name) versus a tag (BaseTag) that will produce a DataElement for dicom.get(field). This will need robust testing to ensure that we dont need to be more specific about some if these if/else (e.g., checking for a type instead). Additionally, Ive added the dicom_dir() function under fields that will serve the same purpose as dicom.dir() but also provide the keys to private tags, which are type pydicom.tag.BaseTag Signed-off-by: vsoch <vsochat@stanford.edu>
Sorry for my delay in getting back to this. I started a review this morning and have been trying to do some testing. I believe that there's an issue that may require a change in approach. Currently, the dicom_dir function with fields.py returns a list of strings or tags. Since the type of the key value is variable, at present, this causes exceptions throughout because once we start operating on the Tag-keyed items, we're trying to compare a Tag object to a string.
My gut reaction says that it may be better to convert everything to a string within dicom_dir() or somehow separate the string-key lists vs the tag-key lists - but this would then cause some other necessary conversion later. Otherwise, keeping it as is, we'll have to introduce conversions at the points listed above (which is probably not a exhaustive list - Knowing that you'd have a better perspective of impacts across the board, I stopped and decided to comment when I got to these three). Overall, I'm starting to do some testing/review now and would be more than happy to add test cases as I work through the testing. |
@wetzelj I'm aware of the list of both tags and strings, we need to preserve the tag objects to correctly index the dicom still. For Line 222 in 90bfbd6
I'm also aware of Line 105 in 90bfbd6
Line 315 in 90bfbd6
The approach to convert everything to a string will lose the ability to index the dicom using the tag. Do you have a suggested approach then for being able to:
That seems like a lot of work to have to do, because it would mean any time a tag isn't found as a string we'd need to try to test it being a tag. And then try to generate it on the fly. It seems like the approach that I took to try to maintain using the actual tag to the largest extent possible avoids that, but you are correct that we need to look out for these edge cases / errors where the tag would be missed. |
Great thank you! I greatly appreciate this - it's really been lovely to work with you and get your help and feedback after I do a large round of changes. |
Honestly, no. I can't. :) The problem isn't that the private tag is in skip. The problem is that the private tag becomes a contender, since dicom_dir() is used to build the contenders list. If we agree that private tags could never be used as "skips", the following would be perfectly acceptable. if isinstance(contender, str) and contender in skip:
continue
Unfortunately, no, they're not guaranteed to be flat. The image below shows a sample from one of our images to be deidentified where there is a private sequence. Regardless, as with the skip value above, the breaking issue here is that "x" ends up being a Tag object and we're trying to use it in the second parameter of re.search(). I could potentially see a need to say something like
Yes, this is in the put config, so it's not 100% relevant to my recipe-based solution... As far as a suggestion goes - As I've been trying to come up with a suggestion that I like, one (probably naive) question keeps coming up (and yes, I realize that this would upend the design, so I'm not necessarily suggesting it): Is there some benefit that we get by building the list of fields vs using the pydicom Dataset object directly? It seems to me that, rather than dealing with the string identifiers and then using the get() to actually get the tag, we could be acting directly on the Dataset object. If we were to be able to act on the Dataset object itself, the entire question of name vs Tag goes away. Everything would be a Tag. Thoughts? Am I crazy? :) |
The reason I decided to leave like this:
is because someone potentially could provide a tag in the list, in which case it would be skipped.
In that the dicom.dir() is just a list of strings, and that doing a dicom.get() works for a string or tag (on the Dataset object) we don't gain anything in not extracting some lookup (of strings) and tags first. What we lose, however, is being able to run a |
A potential refactor that I could see is to derive a recursive model of a Field, where it would expose get, set, and (inner) handle all of the discrepancies between tag and other. But that would just be cleaning up the current PR here and doing a major refactor - if the design here is fundamentally flawed that probably won't work either. But it would be an overall better design to provide a consistent interface to interact with "some field object from a dicom Dataset" |
And sharing ideas is never crazy - it's great and fun, and better to get them out for discussion than let them cause a ruckus in your frontal and parietal lobes! |
There's still a bug with When performing this loop and on the iteration for (0009,0001), we get an exception "Cannot compare Tag with non-Tag item". If we want to enable Tags as skips, then I think we're going to need to include type comparisons as well. I'm going to let the rest of it sit overnight and see what new ideas come to me tomorrow. |
I'm looking over get_fields, and the primary use of the list skip is to prevent the lookup from returning PixelData and other fields that normally aren't parsed. Now that contenders comes from dicom_dir, I can see that this loop would be triggered. So - I'm going to test providing a list of tags to the function:
but there isn't an error, I get back the full data structure (private tags included!) {'AcquisitionContextSequence': '[]',
'AcquisitionDate': '20200209',
...
'ViewCodeSequence__CodeMeaning': 'antero-posterior',
'ViewPosition': 'AP',
'WindowCenter': '2048',
'WindowWidth': '4096',
(0011, 0003): 'Agfa DR'} # this is the only private tag that wasn't in the skip list so I'm not able to reproduce this (I don't get an exception):
Could there be some differences in our versions of pydicom, or something like that? Can you provide me with an exact reproduction of this error? Also note in case you missed it that we've tested expand_field_expression to use a string to return a list of tags (works!)
|
Take a look at this (recreating the issue with the cat image): wetzelj@85de348 The two new test cases (in progress - for discussion purposes only) demonstrate the issues that I've seen above. Both of these cases generate exceptions and are representative of the issues that I hit when trying to run replace_identifiers on a "real" run. I've also added a new dicom test file. The file was a Head CT. The image has been replaced and the header has been fully deidentified, but otherwise all fields included in the header maintain the structure as it was in the original image. |
…ings Signed-off-by: vsoch <vsochat@stanford.edu>
Thank you @wetzelj ! For the testing of fields, I've reproduced the issue with having expand_sequences parsing a list of fields that have private tags. The fix I added just now is to parse these as a string: 90e322b What this means is that the list of expanded sequences, if a string for a private tag matches (unlikely) is unlikely to actually be useful for a further action. However, I don't see it being reasonable to support private tags nested in other private tags, so I think this is okay for now. I'm not able to trigger this error locally when I run the test, however - is this leading to an error for you?
If you are getting an error (and I'm not) it is possibly related to a version of pydicom so we should test these details. Also, any changes that you add (tests, new datasets, which are greatly appreciated!) when all is said and done, you can PR to the branch here so that they are included. |
I agree that there shouldn't be a conflict between named and non-named fields. However, I think at some point, we will have to handle private tag sequences. If you step through the new medical image through It's probably okay to take a pass on this issue for now. Looking at the new medical dicom image I uploaded - we would just have to be aware that if the user was processing a rule like:
Whereas something like: BTW... I'm about to hop into a few meetings, but am planning to first submit a PR to this branch for just the new medical test images. |
Regarding test_skip_list - yes, this is triggering an error. On the iteration of Following is the complete list of packages with version that I'm testing against. Pydicom is still using 1.2.1 as this exact version is required in version.py. Could this be an issue with the version of python?
|
Weird, so the line for that particular error (https://github.com/pydicom/pydicom/blame/caf0db105ddf389ff5025937fd5f3aa1e61e85e7/pydicom/tag.py#L156) was added 8 years ago, so it should be triggering for me too. Based on the function that I see here it looks like it's a less than function, so it means we are asking if a tag is somehow less than another object? I'm wondering - if you open up a PR with this test, does the error reproduce? It didn't trigger for me locally when I ran the test file - only the second example did. For the expansion of private tags - I see the issue. I think we could work on this, but do you have an example image that has a sequence of nested private tags that we could use to develop? |
Oops and one correction to the above - we are using the equals comparator here https://github.com/pydicom/pydicom/blob/master/pydicom/tag.py#L168 and not less than (of course!). But this is still 8 years old, so the conclusions (that I should see the error too!) are the same. |
Oh! And here is a difference - I actually have pydicom 1.4.1
which is (almost) at the latest - I think I installed it a while back because I wanted to test if everything continued to work. Could perhaps this be something that would explain the differences? |
That did it! I downgraded to 1.2.1 and I reproduced your issue:
|
We probably need to be able to still support both at this point, I'll dig into this now that I've reproduced! I love it when that happens :) |
Yep! See PR #126. Sequence (0029, 0010) contains other private tags.
I'm pretty sure the only reason I was back at 1.2.1 was for deid. I'm going to upgrade to 1.4.1 (or latest?) Either way... Yay! You were able to reproduce the issue! :) |
Adding new sample image created from a Head CT. Updated number of expected test files.
currently, if you parse through a list of fields with a private tag, for pydicom 1.2.1 you will get an error about comparison of a non tag with a tag. This doesnt happen with 1.4.1, however 1.2.1 needs to still be supported Signed-off-by: vsoch <vsochat@stanford.edu>
okay I just added a fix for the issue of parsing through tags (and not being able to do the comparison) 87a1103#diff-0c993b1a3139505b68bd0fe50eedc74eR223. What I'm looking at now is that dicom_dir() doesn't return private tag sequences, the reason being that get_private doesn't return sequences with private tags. I'm going to look at this next - my thinking is that if dicom.dir() can return a sequence as long as it's a string index, the get_private should also be able to return private tags that are of type Sequence (or a tag equivalent?) Haven't looked deeply into it yet. |
One small note - it's |
Ah I see what is happening - so we actually are getting all private tags with get_private, however we are unwrapping them from their structure. For example, to start we have:
And then the result returns a flattened:
So this get_private might work as we want, because it's essentially providing a lookup for private tags. |
So your concern is with respect to:
meaning that we somehow need to represent these nested tags as full fledged citizens when this action is happening... :/ |
I think that approach should work. However, in starting to look in more detail at sequence expansion, I've noticed that there's actually a bug here. As currently implemented, when extract_sequences() flattens the sequence, it simply flattens the fields within the sequence, ignoring the fact that sequences can be arrays. Given a header of:
This get_fields()/extract_sequences() function converts this (correctly in this case) to:
However, when there are multiple items in the sequence:
The get_fields()/extract_sequences() function incorrectly converts this to below (losing the first occurrence item:
Instead, it is really important to maintain the integrity of these occurrences - potentially building something like this:
The cthead1.dcm file that I added with #126 has an instance of a private tag sequence with multiple occurrences. In my 125/new-samples branch, I've added an additional sample file with multiple-occurrence public tags. As well as a test case for this situation. Before submitting a PR, I want to make the test case less tightly coupled to the humans dataset. All that said, do you think it would be better to create an additional issue for sequence expansion, and keep this PR to supporting private tags in non-Sequences? |
I see what you mean about interacting with fields directly, because when we do dicom.dir() we don't get private tags, when we do dicom_dir(dicom) we get private tags, but then they are unwrapped. I'm not really sure what action to take at this point, because we are extracting flattened structures but deid doesn't have a way for the user to even reference sequences. |
I just saw your most recent comments... (after I posted my last one). I noticed that too about how private sequences are unwrapped. Ultimately, I think there is quite a bit outstanding with sequences in general. |
But @wetzelj please see the example that I just posted - #125 (comment) at least in the case of get_private we do preserve each of entries indexed with 0 and 1. In the case that they are converted to strings, however, I can see this being lost. |
I can help only to the extent there is a reasonable idea / discussion for how to implement a change or fix. If it requires a total refactor of the library that sort of sucks, but I'm open to the idea. |
I understand what you're saying in #125 (comment), but with the multiple occurrences in the lookup, looking up (0029, 1041) would result in multiple values. Maybe this is okay? Either way, do you think that it would be worthwhile to forego the sequence issue to a new issue and focus this PR on non-sequence private tags? From my perspective, being able to handle non-sequence private tags has huge benefit. |
@wetzelj this work is all being done on your behalf, so I will go with whatever you prefer. If the library is improved from where it was, I say we should move forward. But I'd still like to work with you (in likely another scoped issue) to address these other issues, because I won't sleep at night if I've realized that my software sucks. :P |
Here is something to think about - how important is it to have separate get and set identifiers functions? Because I can imagine implementing a version that removes the intermediate dictionary layer and just interacts with the pydicom Dataset. Instead of deriving values and looping through later, we figure out how to do it once. We’d still need to balance that with some ability to define custom functions and lists, however. |
Your software is great! These are just some little tweaks to make it more awesome than it already is. Regarding separate get/set identifiers functions - I'm formulating my thoughts on this, I want to make sure that I have a coherent, well thought out proposal before I respond. |
Sounds good! Let me know after you'd had some time to think:
I do think it would be worth a try to maintain the current get/set identifiers, but write a class that takes in dicom files, custom functions and a recipe, and can just be used to clean the headers. It would be akin to the DicomCleaner but for header (and not pixels). |
Ping! Hope you are doing ok :) |
This PR isn't perfect, but it improves the library quite a bit, so I'm going to merge. Please open a new issue @wetzelj when you are ready to discuss next steps. |
Description
This is a first shot to add parsing of private tag groups for deid. I'll definitely need help to do further testing, possibly add more tests for extracting the tags as fields, and tips based on dicom expertise / real world use cases that I might not be aware of. This PR addresses:
Related issues:
@wetzelj here is my thinking for how we should go about this:
Other small changes
That's about as far as my dinosaur brain can think through this today. Hopefully this is a good start, definitely fun to work on!