New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Fix decoding of multibyte text with backslash #1724
Conversation
- backslashes in multibyte encodings were mishandled as separators * Add unit test for the issue
Codecov Report
@@ Coverage Diff @@
## master #1724 +/- ##
==========================================
- Coverage 97.58% 88.38% -9.21%
==========================================
Files 66 66
Lines 10744 10773 +29
==========================================
- Hits 10485 9522 -963
- Misses 259 1251 +992
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - please also add an entry in the release notes.
pydicom/values.py
Outdated
as_strings = [convert_single_string(value, encodings, vr) | ||
for value in values] | ||
single_string = convert_single_string(byte_string, encodings, vr) | ||
as_strings = single_string.split('\\') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - good catch!
pydicom/values.py
Outdated
if len(as_strings) == 1: | ||
return as_strings[0] | ||
as_strings = [value.rstrip('\0 ') for value in as_strings] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? As far as I know, null bytes are only used in UI
strings for padding, and the space padding for the last value should already be handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the DICOM standard, but test cases below failed without this line.
pydicom/pydicom/tests/test_values.py
Lines 114 to 122 in a8be738
bytestring = b'Values \\with spaces ' | |
assert ['Values', 'with spaces'] == convert_text(bytestring) | |
bytestring = b'Value ending with zeros\0\0\0' | |
assert 'Value ending with zeros' == convert_single_string(bytestring) | |
assert 'Value ending with zeros' == convert_text(bytestring) | |
bytestring = b'Values\0\0\\with zeros\0' | |
assert ['Values', 'with zeros'] == convert_text(bytestring) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, maybe there has been some invalid DICOM that did this, so the tests have been added.
Anyway, I'm not sure why these test cases would fail - I will have a closer look tonight or over the weekend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I had another look and understand it now. The removal of the padding is done in convert_single_string
, alongside with the validation - which may be a problem, because the validation assumes single values and may fail if the combined string is longer than a single string is allowed to be. I don't have a nice solution right now. We could do the decoding part first, and then do the validation and stripping on each part, e.g. something like:
def convert_text(...):
def handle_value(v):
if vr is not None:
validate_value(
vr, v, config.settings.reading_validation_mode)
return v.rstrip('\0 ')
encodings = encodings or [default_encoding]
decoded_string = decode_bytes(byte_string, encodings, TEXT_VR_DELIMS)
values = decoded_string.split("\\")
as_strings = [handle_value(value) for value in values]
if len(as_strings) == 1:
return as_strings[0]
return MultiValue(str, as_strings,
validation_mode=config.settings.reading_validation_mode)
I'm not sure about the validation, because it will have to encode the string again to check the length, so this is a bit ugly. I will think about it over the weekend, or maybe you have a better idea meanwhile...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering convert_text
is applied to LO, SH, and UC and the maximum lengths for LO and SH are specified in chars not in bytes (and validation for UC is unnecessary?), the code you provided looks perfect to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank YOU for reviewing and everything!
I've committed what I thought was minimally required for this PR. I tried to fix similar/related issues as you suggested but the scope of the relevant code seemed broader than I could handle at the moment. So, I hope you can handle the rest (I don't mind where it would be done; whether in this PR or in a new PR).
FWIW, let me dump here what I thought had to be added/fixed (and gave up actually doing).
Add unit test for convert_PN
Decoding of PN seemed need fixing.
class TestConvertPN:
def test_valid_PN(self):
bytestring = b'John^Doe'
assert 'John^Doe' == convert_PN(bytestring)
bytestring = b'Yamada^Tarou=\x1b$B;3ED\x1b(B^\x1b$BB@O:'
assert 'Yamada^Tarou=山田^太郎' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])
# 0x5c (\) in multibyte codes
bytestring = b'Baishaku^Honme=\x1b$BG\\<\\\x1b(B^\x1b$BK\\L\\'
assert 'Baishaku^Honme=倍尺^本目' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])
# 0x3d (=) in multibyte codes
bytestring = b'Sunatouge^Ayano=\x1b$B:=F=\x1b(B^\x1b$B0=G='
assert 'Sunatouge^Ayano=砂峠^綾能' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])
Fix tests for value length
pydicom/pydicom/tests/test_dataelem.py
Lines 939 to 982 in 1db21d6
def test_write_invalid_length_non_ascii_text( | |
self, enforce_writing_invalid_values): | |
fp = DicomBytesIO() | |
ds = Dataset() | |
ds.is_implicit_VR = True | |
ds.is_little_endian = True | |
ds.SpecificCharacterSet = "ISO_IR 192" # UTF-8 | |
# the string length is 9, so constructing the data element | |
# is possible | |
ds.add(DataElement(0x00080050, "SH", "洪^吉洞=홍^길동")) | |
# encoding the element during writing shall fail, | |
# as the encoded length is 21, while only 16 bytes are allowed for SH | |
msg = r"The value length \(21\) exceeds the maximum length of 16 *" | |
with pytest.raises(ValueError, match=msg): | |
ds.save_as(fp) | |
def test_write_invalid_non_ascii_pn(self, enforce_writing_invalid_values): | |
fp = DicomBytesIO() | |
ds = Dataset() | |
ds.is_implicit_VR = False | |
ds.is_little_endian = True | |
ds.SpecificCharacterSet = "ISO_IR 192" # UTF-8 | |
# string length is 40 | |
ds.add(DataElement(0x00100010, "PN", "洪^吉洞" * 10)) | |
msg = r"The PN component length \(100\) exceeds the maximum allowed *" | |
with pytest.raises(ValueError, match=msg): | |
ds.save_as(fp) | |
def test_read_invalid_length_non_ascii_text(self): | |
fp = DicomBytesIO() | |
ds = Dataset() | |
ds.is_implicit_VR = True | |
ds.is_little_endian = True | |
ds.SpecificCharacterSet = "ISO_IR 192" # UTF-8 | |
ds.add(DataElement(0x00080050, "SH", "洪^吉洞=홍^길동")) | |
# disable value validation to write an invalid value | |
with config.disable_value_validation(): | |
ds.save_as(fp) | |
# no warning will be issued during reading, as only RawDataElement | |
# objects are read | |
ds = dcmread(fp, force=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I added a commit which should fix the issue, though that raises another question (probably outside of this PR).
@darcymason - I just basically removed the write_validation part, as the string validation I thought was needed is not actually needed. I'm not sure if we should add validation to all values on writing instead... otherwise that write validation config does not make much sense. Or maybe this is another thing that shall be considered for 3.0...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been sorting through my thoughts on validation config and playing with some code snippets. Hopefully will start a separate discussion on that in the next couple of days.
I'll look at and comment on the specifics here too, hopefully tomorrow, but in general I don't think we should worry too much because I expect it will all be overhauled after the next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked at this again, and I think there still is a role for validation in these (at the Setting stage in discussion #1726), but it should happen before encoding, maybe just checking max length, e.g. But we can do that in a separate PR (for 2.4) with some initial moves in the direction of pydicom 3.0 where possible.
- DICOM standard specifies some maximum value lengths in characters rather than bytes
- also adapt ot remove related tests
Hi guys, def convert_PN(
byte_string: bytes, encodings: Optional[List[str]] = None
) -> Union[PersonName, MutableSequence[PersonName]]:
def get_valtype(x: str) -> PersonName:
return PersonName(x)
encodings = encodings or [default_encoding]
value = decode_bytes(byte_string.rstrip(b'\x00 '), encodings, TEXT_VR_DELIMS)
value_splitted = value.split('\\')
if len(value_splitted) == 1:
return get_valtype(value_splitted[0])
return MultiValue(get_valtype, value_splitted) Please let me know if you want us to fix this on this PR, or another one 🙏 |
@LaurentTrk - thanks, I had missed that! I think this also belongs to this PR, so feel free to add it. |
Hi guys, |
[MRG] Fix decoding PersonName in convert_PN
Those typing errors each time a new mypy version is released get a bit annoying. @darcymason - maybe we pin the mypy version (like we did for the patch release) and update |
Sounds good to me. |
@darcymason - I think we somewhat forgot this PR. From my POV this is ready to merge, what do you think? |
Sure, this is more your expertise than mine, so please merge if you are happy with it. |
Describe the changes
Fix handling of multibyte characters with backslashes decoded in
values.convert_text
.Tasks
doc/_build/html/index.html
)