New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DS.save_as after ds.decode() exception #1739
Comments
Hm... the string values decoded from the Russian encoding should always be able to be written back as UTF-8. Maybe there are tags that cannot be encoded (e.g. should be ASCII) but are actualy encoded, and these are the problem? Or this is actually a bug... Can you provide an anonymized dataset with that behavior? That would be helpful to understand this behavior. |
Yep. |
The only solution I came up with is to pass 2 identical datasets to the wrapper function, decode one of them, try to save it to a temporary file, return the original dataset if it fails. But this is an extremely inefficient way. |
Thank you! At a first glance I can't see any regular tags with non-ASCII encoding, so this maybe an issue with the private tags. |
As I suspected, the problem is with the private tags, specifically (01f1, 1026) and (01f1, 1027). With the private creator "ELSCINT1", these are known private tags Here is something that I gobbled together that should work, at least for the given file: from pydicom.dataelem import _private_vr_for_tag
from pydicom.valuerep import VR
def remove_invalid_ds(raw_elem, **kwargs):
if raw_elem.tag.is_private and raw_elem.value:
vr = _private_vr_for_tag(kwargs["ds"], raw_elem.tag)
if vr == VR.DS:
try:
raw_elem.value.decode("ascii")
except UnicodeDecodeError:
print(f"Removing invalid tag value for tag {raw_elem.tag}")
return RawDataElement(raw_elem.tag, "DS", 0, None, 0,
raw_elem.is_implicit_VR,
raw_elem.is_little_endian)
return raw_elem
ds = dcmread(filename)
config.data_element_callback = remove_invalid_ds
config.data_element_callback_kwargs = {"ds": ds}
ds.decode()
ds['SpecificCharacterSet'].value = 'ISO_IR 192'
ds.save_as(new_filename) Note that I used the "protected" method I'm not sure if we want to change the behavior here - probably not. @darcymason, what do you think? |
I also think probably not. I supposed there could be a case for pydicom trying to capture unicode errors and recovering gracefully, but I can't see how we could come up with a universal 'safe' choice on behalf of the user - delete the whole data element, change it to some other value, ...? Making fixers more easily available would be a big help though. |
Thanks for the answers and for the fix! |
Your problems and questions are in no way naive. Problems with invalid DICOM data are very common and valid, and handling them is not trivial. |
Just wanted to highlight the above quote as being the heart/conclusion of this issue for v3.0 - it would be ideal to not decode text values that do not need decoding so they can be written back if they have not been modified. |
Describe the bug
There is a file = filename
The filename tag SpecificCharacterSet = ISO_IR 144
ds = dcmread(filename)
ds.decode()
ds['SpecificCharacterSet'].value = 'ISO_IR 192'
ds.save_as(fileout)
Causes an exception:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u044d' in position 0: ordinal not in range(256)
I understand that there is the wrong symbol somewhere in the tags causing the exception but I dunno where.
Doing:
ds = dcmread(filename)
ds.save_as(fileout)
Causes no error.
What is the best way to catch exceptions like this one? It's totally unpredictable in my datasets.
The text was updated successfully, but these errors were encountered: