DS.save_as after ds.decode() exception #1739

alipairon · 2022-12-01T19:19:36Z

Describe the bug
There is a file = filename
The filename tag SpecificCharacterSet = ISO_IR 144

ds = dcmread(filename)
ds.decode()
ds['SpecificCharacterSet'].value = 'ISO_IR 192'
ds.save_as(fileout)

Causes an exception:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u044d' in position 0: ordinal not in range(256)

I understand that there is the wrong symbol somewhere in the tags causing the exception but I dunno where.

Doing:

ds = dcmread(filename)
ds.save_as(fileout)

Causes no error.

What is the best way to catch exceptions like this one? It's totally unpredictable in my datasets.

mrbean-bremen · 2022-12-01T20:53:55Z

Hm... the string values decoded from the Russian encoding should always be able to be written back as UTF-8. Maybe there are tags that cannot be encoded (e.g. should be ASCII) but are actualy encoded, and these are the problem? Or this is actually a bug...

Can you provide an anonymized dataset with that behavior? That would be helpful to understand this behavior.

alipairon · 2022-12-02T06:14:25Z

Yep.
Deleted the PatientName and InstitutionName from dataset - it didn't helped thou.
Attached the file to the comment.
Thank you!
filename.zip

alipairon · 2022-12-02T06:21:56Z

The only solution I came up with is to pass 2 identical datasets to the wrapper function, decode one of them, try to save it to a temporary file, return the original dataset if it fails. But this is an extremely inefficient way.

mrbean-bremen · 2022-12-02T12:30:27Z

Thank you! At a first glance I can't see any regular tags with non-ASCII encoding, so this maybe an issue with the private tags.
I will have a closer look tonight.

mrbean-bremen · 2022-12-02T19:54:38Z

As I suspected, the problem is with the private tags, specifically (01f1, 1026) and (01f1, 1027). With the private creator "ELSCINT1", these are known private tags Pitch and Rotation Time which are both of type DS. These tags contain some gibberish (certainly not numbers, but also nothing sensible in cyrillic - specifically 'э\rО0\x99*ш?' and '333333у?'), and pydicom crashes while trying to encode them as ASCII.
The only workarounds I see is either deleting these tags before encoding them back (provided they are always the same ones), or writing a fixer that ignores these tags on writing.

Here is something that I gobbled together that should work, at least for the given file:

from pydicom.dataelem import _private_vr_for_tag
from pydicom.valuerep import VR

def remove_invalid_ds(raw_elem, **kwargs):
    if raw_elem.tag.is_private and raw_elem.value:
        vr = _private_vr_for_tag(kwargs["ds"], raw_elem.tag)
        if vr == VR.DS:
            try:
                raw_elem.value.decode("ascii")
            except UnicodeDecodeError:
                print(f"Removing invalid tag value for tag {raw_elem.tag}")
                return RawDataElement(raw_elem.tag, "DS", 0, None, 0,
                                      raw_elem.is_implicit_VR,
                                      raw_elem.is_little_endian)
    return raw_elem


ds = dcmread(filename)
config.data_element_callback = remove_invalid_ds
config.data_element_callback_kwargs = {"ds": ds}
ds.decode()
ds['SpecificCharacterSet'].value = 'ISO_IR 192'
ds.save_as(new_filename)

Note that I used the "protected" method _private_vr_for_tag here, because at the time of the callback we don't have the VR yet, and there is no public method for that.

I'm not sure if we want to change the behavior here - probably not. @darcymason, what do you think?
What we may do for the next major version is to give better possibilities for writing fixups like this one.

darcymason · 2022-12-02T21:43:53Z

I'm not sure if we want to change the behavior here - probably not. @darcymason, what do you think?

I also think probably not. I supposed there could be a case for pydicom trying to capture unicode errors and recovering gracefully, but I can't see how we could come up with a universal 'safe' choice on behalf of the user - delete the whole data element, change it to some other value, ...? Making fixers more easily available would be a big help though.

alipairon · 2022-12-03T05:08:47Z

Thanks for the answers and for the fix!
If it would be useful, I have done an experiment with DCMTK dcmconv and this file.
It works without errors, but I'm not sure what steps are taken to avoid them, but the private tags are not removed from the dataset.
Please forgive me if my actions and conclusions are naive, it's just that I've been doing DICOM for not more than a month.

mrbean-bremen · 2022-12-03T19:23:04Z

I have done an experiment with DCMTK dcmconv and this file.

dcmconv leaves these tags alone, and I'm inclined to say that we also may do so if using some setting (like removing the write validation). On the one hand, we don't want to write invalid data, but on the other hand, it is a common use case to modify an existing dataset and write it back without changing the data that we didn't intentionally change. Given that changing the character set should have no influence on tags that do not allow different character sets, I probably also would expect this to be left alone.
This is part of the discussion about changing the validation for pydicom 3.0 and won't be implemented anytime soon, so I hope you can solve your problem with a fixer as shown.

Please forgive me if my actions and conclusions are naive, it's just that I've been doing DICOM for not more than a month.

Your problems and questions are in no way naive. Problems with invalid DICOM data are very common and valid, and handling them is not trivial.

darcymason · 2023-04-21T20:55:53Z

Given that changing the character set should have no influence on tags that do not allow different character sets, I probably also would expect this to be left alone.
This is part of the discussion about changing the validation for pydicom 3.0 and won't be implemented anytime soon, so I hope you can solve your problem with a fixer as shown.

Just wanted to highlight the above quote as being the heart/conclusion of this issue for v3.0 - it would be ideal to not decode text values that do not need decoding so they can be written back if they have not been modified.

darcymason added this to the v3.0 milestone Apr 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DS.save_as after ds.decode() exception #1739

DS.save_as after ds.decode() exception #1739

alipairon commented Dec 1, 2022

mrbean-bremen commented Dec 1, 2022

alipairon commented Dec 2, 2022

alipairon commented Dec 2, 2022

mrbean-bremen commented Dec 2, 2022

mrbean-bremen commented Dec 2, 2022

darcymason commented Dec 2, 2022

alipairon commented Dec 3, 2022

mrbean-bremen commented Dec 3, 2022

darcymason commented Apr 21, 2023

DS.save_as after ds.decode() exception #1739

DS.save_as after ds.decode() exception #1739

Comments

alipairon commented Dec 1, 2022

mrbean-bremen commented Dec 1, 2022

alipairon commented Dec 2, 2022

alipairon commented Dec 2, 2022

mrbean-bremen commented Dec 2, 2022

mrbean-bremen commented Dec 2, 2022

darcymason commented Dec 2, 2022

alipairon commented Dec 3, 2022

mrbean-bremen commented Dec 3, 2022

darcymason commented Apr 21, 2023