Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DS.save_as after ds.decode() exception #1739

Open
alipairon opened this issue Dec 1, 2022 · 9 comments
Open

DS.save_as after ds.decode() exception #1739

alipairon opened this issue Dec 1, 2022 · 9 comments
Milestone

Comments

@alipairon
Copy link

Describe the bug
There is a file = filename
The filename tag SpecificCharacterSet = ISO_IR 144

ds = dcmread(filename)
ds.decode()
ds['SpecificCharacterSet'].value = 'ISO_IR 192'
ds.save_as(fileout)

Causes an exception:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u044d' in position 0: ordinal not in range(256)

I understand that there is the wrong symbol somewhere in the tags causing the exception but I dunno where.

Doing:

ds = dcmread(filename)
ds.save_as(fileout)

Causes no error.

What is the best way to catch exceptions like this one? It's totally unpredictable in my datasets.

@mrbean-bremen
Copy link
Member

Hm... the string values decoded from the Russian encoding should always be able to be written back as UTF-8. Maybe there are tags that cannot be encoded (e.g. should be ASCII) but are actualy encoded, and these are the problem? Or this is actually a bug...

Can you provide an anonymized dataset with that behavior? That would be helpful to understand this behavior.

@alipairon
Copy link
Author

Yep.
Deleted the PatientName and InstitutionName from dataset - it didn't helped thou.
Attached the file to the comment.
Thank you!
filename.zip

@alipairon
Copy link
Author

The only solution I came up with is to pass 2 identical datasets to the wrapper function, decode one of them, try to save it to a temporary file, return the original dataset if it fails. But this is an extremely inefficient way.

@mrbean-bremen
Copy link
Member

Thank you! At a first glance I can't see any regular tags with non-ASCII encoding, so this maybe an issue with the private tags.
I will have a closer look tonight.

@mrbean-bremen
Copy link
Member

As I suspected, the problem is with the private tags, specifically (01f1, 1026) and (01f1, 1027). With the private creator "ELSCINT1", these are known private tags Pitch and Rotation Time which are both of type DS. These tags contain some gibberish (certainly not numbers, but also nothing sensible in cyrillic - specifically 'э\rО0\x99*ш?' and '333333у?'), and pydicom crashes while trying to encode them as ASCII.
The only workarounds I see is either deleting these tags before encoding them back (provided they are always the same ones), or writing a fixer that ignores these tags on writing.

Here is something that I gobbled together that should work, at least for the given file:

from pydicom.dataelem import _private_vr_for_tag
from pydicom.valuerep import VR

def remove_invalid_ds(raw_elem, **kwargs):
    if raw_elem.tag.is_private and raw_elem.value:
        vr = _private_vr_for_tag(kwargs["ds"], raw_elem.tag)
        if vr == VR.DS:
            try:
                raw_elem.value.decode("ascii")
            except UnicodeDecodeError:
                print(f"Removing invalid tag value for tag {raw_elem.tag}")
                return RawDataElement(raw_elem.tag, "DS", 0, None, 0,
                                      raw_elem.is_implicit_VR,
                                      raw_elem.is_little_endian)
    return raw_elem


ds = dcmread(filename)
config.data_element_callback = remove_invalid_ds
config.data_element_callback_kwargs = {"ds": ds}
ds.decode()
ds['SpecificCharacterSet'].value = 'ISO_IR 192'
ds.save_as(new_filename)

Note that I used the "protected" method _private_vr_for_tag here, because at the time of the callback we don't have the VR yet, and there is no public method for that.

I'm not sure if we want to change the behavior here - probably not. @darcymason, what do you think?
What we may do for the next major version is to give better possibilities for writing fixups like this one.

@darcymason
Copy link
Member

I'm not sure if we want to change the behavior here - probably not. @darcymason, what do you think?

I also think probably not. I supposed there could be a case for pydicom trying to capture unicode errors and recovering gracefully, but I can't see how we could come up with a universal 'safe' choice on behalf of the user - delete the whole data element, change it to some other value, ...? Making fixers more easily available would be a big help though.

@alipairon
Copy link
Author

Thanks for the answers and for the fix!
If it would be useful, I have done an experiment with DCMTK dcmconv and this file.
It works without errors, but I'm not sure what steps are taken to avoid them, but the private tags are not removed from the dataset.
Please forgive me if my actions and conclusions are naive, it's just that I've been doing DICOM for not more than a month.

@mrbean-bremen
Copy link
Member

I have done an experiment with DCMTK dcmconv and this file.

dcmconv leaves these tags alone, and I'm inclined to say that we also may do so if using some setting (like removing the write validation). On the one hand, we don't want to write invalid data, but on the other hand, it is a common use case to modify an existing dataset and write it back without changing the data that we didn't intentionally change. Given that changing the character set should have no influence on tags that do not allow different character sets, I probably also would expect this to be left alone.
This is part of the discussion about changing the validation for pydicom 3.0 and won't be implemented anytime soon, so I hope you can solve your problem with a fixer as shown.

Please forgive me if my actions and conclusions are naive, it's just that I've been doing DICOM for not more than a month.

Your problems and questions are in no way naive. Problems with invalid DICOM data are very common and valid, and handling them is not trivial.

@darcymason
Copy link
Member

Given that changing the character set should have no influence on tags that do not allow different character sets, I probably also would expect this to be left alone.
This is part of the discussion about changing the validation for pydicom 3.0 and won't be implemented anytime soon, so I hope you can solve your problem with a fixer as shown.

Just wanted to highlight the above quote as being the heart/conclusion of this issue for v3.0 - it would be ideal to not decode text values that do not need decoding so they can be written back if they have not been modified.

@darcymason darcymason added this to the v3.0 milestone Apr 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants