
[MRG] Fix decoding of multibyte text with backslash #1724

Merged: 7 commits into pydicom:master on Jan 20, 2023

Conversation

@ykszk (Contributor) commented Nov 4, 2022

Describe the changes

Fix the handling of multibyte characters containing backslashes when text is decoded in values.convert_text.

  • Multibyte characters whose encoded bytes contain a backslash (0x5C), e.g. 倍 in Japanese, were mishandled as value separators.
  • A unit test is added that decodes text containing such characters together with an actual separator (see the sketch below).
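
A minimal sketch of the failure mode, using only the Python standard library (not pydicom code): in ISO 2022 JP (JIS X 0208) the kanji 倍 and 尺 are encoded as 0x47 0x5C and 0x3C 0x5C, so the raw bytes contain 0x5C even though no DICOM value separator is present.

    # "倍尺" encoded with iso2022_jp; both characters contain a 0x5C byte
    raw = b'\x1b$BG\\<\\\x1b(B'

    # Splitting on the backslash *before* decoding breaks the multibyte
    # sequences into fragments that no longer decode to the intended text:
    fragments = raw.split(b'\\')         # [b'\x1b$BG', b'<', b'\x1b(B']

    # Decoding first and splitting the decoded string afterwards keeps the
    # characters intact and only splits on real separators:
    decoded = raw.decode('iso2022_jp')   # '倍尺'
    values = decoded.split('\\')         # ['倍尺'], no spurious split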

Tasks

  • Unit tests added that reproduce the issue or prove feature is working
  • Fix or feature added
  • Code typed and mypy shows no errors
  • Documentation updated (if relevant)
    • No warnings during build
    • Preview link (CircleCI -> Artifacts -> doc/_build/html/index.html)
  • Unit tests passing and overall coverage the same or better

- backslashes in multibyte encodings were mishandled as separators
- Add unit test for the issue
@codecov bot commented Nov 4, 2022

Codecov Report

Merging #1724 (9b02444) into master (a8be738) will decrease coverage by 9.20%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1724      +/-   ##
==========================================
- Coverage   97.58%   88.38%   -9.21%     
==========================================
  Files          66       66              
  Lines       10744    10773      +29     
==========================================
- Hits        10485     9522     -963     
- Misses        259     1251     +992     
Impacted Files Coverage Δ
pydicom/valuerep.py 99.13% <ø> (-0.28%) ⬇️
pydicom/filewriter.py 97.56% <100.00%> (-0.27%) ⬇️
pydicom/values.py 90.19% <100.00%> (-7.22%) ⬇️
pydicom/pixel_data_handlers/util.py 17.93% <0.00%> (-82.07%) ⬇️
pydicom/waveforms/numpy_handler.py 20.31% <0.00%> (-79.69%) ⬇️
pydicom/encoders/gdcm.py 20.96% <0.00%> (-75.81%) ⬇️
pydicom/pixel_data_handlers/numpy_handler.py 25.67% <0.00%> (-74.33%) ⬇️
pydicom/pixel_data_handlers/gdcm_handler.py 26.01% <0.00%> (-66.67%) ⬇️
pydicom/overlays/numpy_handler.py 35.18% <0.00%> (-64.82%) ⬇️
pydicom/pixel_data_handlers/pillow_handler.py 31.73% <0.00%> (-64.43%) ⬇️
... and 15 more


@mrbean-bremen (Member) left a comment

Thanks - please also add an entry in the release notes.

-        as_strings = [convert_single_string(value, encodings, vr)
-                      for value in values]
+        single_string = convert_single_string(byte_string, encodings, vr)
+        as_strings = single_string.split('\\')
Member

Thanks - good catch!

    if len(as_strings) == 1:
        return as_strings[0]
    as_strings = [value.rstrip('\0 ') for value in as_strings]
Member

Why is this needed? As far as I know, null bytes are only used in UI strings for padding, and the space padding for the last value should already be handled.

Contributor Author

I'm not familiar with the DICOM standard, but the test cases below failed without this line.

bytestring = b'Values \\with spaces '
assert ['Values', 'with spaces'] == convert_text(bytestring)
bytestring = b'Value ending with zeros\0\0\0'
assert 'Value ending with zeros' == convert_single_string(bytestring)
assert 'Value ending with zeros' == convert_text(bytestring)
bytestring = b'Values\0\0\\with zeros\0'
assert ['Values', 'with zeros'] == convert_text(bytestring)

Member

Hm, maybe there has been some invalid DICOM that did this, so the tests have been added.
Anyway, I'm not sure why these test cases would fail - I will have a closer look tonight or over the weekend.

Member

Ok, I had another look and understand it now. The removal of the padding is done in convert_single_string, along with the validation - which may be a problem, because the validation assumes single values and may fail if the combined string is longer than a single value is allowed to be. I don't have a nice solution right now. We could do the decoding part first, and then do the validation and stripping on each part, e.g. something like:

def convert_text(...):
    def handle_value(v):
        if vr is not None:
            validate_value(
                vr, v, config.settings.reading_validation_mode)
        return v.rstrip('\0 ')

    encodings = encodings or [default_encoding]
    decoded_string = decode_bytes(byte_string, encodings, TEXT_VR_DELIMS)
    values = decoded_string.split("\\")
    as_strings = [handle_value(value) for value in values]
    if len(as_strings) == 1:
        return as_strings[0]
    return MultiValue(str, as_strings,
                      validation_mode=config.settings.reading_validation_mode)

I'm not sure about the validation, because it will have to encode the string again to check the length, so this is a bit ugly. I will think about it over the weekend, or maybe you have a better idea meanwhile...
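
For reference, a quick usage sketch of the restructured convert_text, assuming the behaviour the test cases earlier in this thread assert (pydicom.values.convert_text, default encoding, no vr so validation is skipped):

    from pydicom.values import convert_text

    # Padding is stripped per value and the split happens on the decoded
    # string, so the multi-valued cases behave as the earlier tests expect.
    assert convert_text(b'Values \\with spaces ') == ['Values', 'with spaces']
    assert convert_text(b'Values\0\0\\with zeros\0') == ['Values', 'with zeros']
    # A single value comes back as a plain str with the padding removed.
    assert convert_text(b'Value ending with zeros\0\0\0') == 'Value ending with zeros'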

Contributor Author

Considering that convert_text is applied to LO, SH, and UC, and that the maximum lengths for LO and SH are specified in characters rather than bytes (and validation for UC is unnecessary?), the code you provided looks perfect to me.

Contributor Author

Thank YOU for reviewing and everything!

I've committed what I thought was minimally required for this PR. I tried to fix the similar/related issues you suggested, but the scope of the relevant code seemed broader than I could handle at the moment. So I hope you can handle the rest (I don't mind where it is done, whether in this PR or in a new one).

FWIW, let me dump here what I thought had to be added/fixed (but gave up actually doing).

Add unit test for convert_PN

Decoding of PN seemed to need fixing.

class TestConvertPN:
    def test_valid_PN(self):
        bytestring = b'John^Doe'
        assert 'John^Doe' == convert_PN(bytestring)

        bytestring = b'Yamada^Tarou=\x1b$B;3ED\x1b(B^\x1b$BB@O:'
        assert 'Yamada^Tarou=山田^太郎' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])

        # 0x5c (\) in multibyte codes
        bytestring = b'Baishaku^Honme=\x1b$BG\\<\\\x1b(B^\x1b$BK\\L\\'
        assert 'Baishaku^Honme=倍尺^本目' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])

        # 0x3d (=) in multibyte codes
        bytestring = b'Sunatouge^Ayano=\x1b$B:=F=\x1b(B^\x1b$B0=G='
        assert 'Sunatouge^Ayano=砂峠^綾能' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])

Fix tests for value length

def test_write_invalid_length_non_ascii_text(
        self, enforce_writing_invalid_values):
    fp = DicomBytesIO()
    ds = Dataset()
    ds.is_implicit_VR = True
    ds.is_little_endian = True
    ds.SpecificCharacterSet = "ISO_IR 192"  # UTF-8
    # the string length is 9, so constructing the data element
    # is possible
    ds.add(DataElement(0x00080050, "SH", "洪^吉洞=홍^길동"))
    # encoding the element during writing shall fail,
    # as the encoded length is 21, while only 16 bytes are allowed for SH
    msg = r"The value length \(21\) exceeds the maximum length of 16 *"
    with pytest.raises(ValueError, match=msg):
        ds.save_as(fp)

def test_write_invalid_non_ascii_pn(self, enforce_writing_invalid_values):
    fp = DicomBytesIO()
    ds = Dataset()
    ds.is_implicit_VR = False
    ds.is_little_endian = True
    ds.SpecificCharacterSet = "ISO_IR 192"  # UTF-8
    # string length is 40
    ds.add(DataElement(0x00100010, "PN", "洪^吉洞" * 10))
    msg = r"The PN component length \(100\) exceeds the maximum allowed *"
    with pytest.raises(ValueError, match=msg):
        ds.save_as(fp)

def test_read_invalid_length_non_ascii_text(self):
    fp = DicomBytesIO()
    ds = Dataset()
    ds.is_implicit_VR = True
    ds.is_little_endian = True
    ds.SpecificCharacterSet = "ISO_IR 192"  # UTF-8
    ds.add(DataElement(0x00080050, "SH", "洪^吉洞=홍^길동"))
    # disable value validation to write an invalid value
    with config.disable_value_validation():
        ds.save_as(fp)
    # no warning will be issued during reading, as only RawDataElement
    # objects are read
    ds = dcmread(fp, force=True)

Member

Thank you! I added a commit which should fix the issue, though that raises another question (probably outside of this PR).

@darcymason - I basically just removed the write_validation part, as the string validation I thought was needed is actually not needed. I'm not sure if we should instead add validation to all values on writing... otherwise that write-validation config does not make much sense. Or maybe this is another thing that should be considered for 3.0...

Member

I've been sorting through my thoughts on validation config and playing with some code snippets. Hopefully will start a separate discussion on that in the next couple of days.

I'll look at and comment on the specifics here too, hopefully tomorrow, but in general I don't think we should worry too much because I expect it will all be overhauled after the next release.

Member

Ok, thanks!

Member

I've looked at this again, and I think there is still a role for validation here (at the Setting stage in discussion #1726), but it should happen before encoding, perhaps just checking the maximum length. We can do that in a separate PR (for 2.4), with some initial moves in the direction of pydicom 3.0 where possible.
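
For illustration, a rough sketch of such a pre-encoding check (a hypothetical helper, not pydicom API), assuming the character-based maxima this PR uses for SH (16) and LO (64):

    # Hypothetical helper: reject over-long text values before encoding.
    MAX_CHARS = {"SH": 16, "LO": 64}  # maximum lengths in characters

    def check_max_length(vr: str, value: str) -> None:
        maximum = MAX_CHARS.get(vr)
        if maximum is not None and len(value) > maximum:
            raise ValueError(
                f"The value length ({len(value)}) exceeds the "
                f"maximum length of {maximum} allowed for VR {vr}"
            )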

ykszk and others added 2 commits November 6, 2022 12:56
- DICOM standard specifies some maximum value lengths in characters rather than bytes
- also adapt or remove related tests
@LaurentTrk (Contributor)

Hi guys,
We ran into a similar issue, but with the PatientName tag: it was not correctly decoded and was split into two values.
It seems that we need to fix the convert_PN function as well.
Something like this:

def convert_PN(
    byte_string: bytes, encodings: Optional[List[str]] = None
) -> Union[PersonName, MutableSequence[PersonName]]:
    def get_valtype(x: str) -> PersonName:
        return PersonName(x)

    encodings = encodings or [default_encoding]
    value = decode_bytes(byte_string.rstrip(b'\x00 '), encodings, TEXT_VR_DELIMS)
    value_splitted = value.split('\\')
    if len(value_splitted) == 1:
        return get_valtype(value_splitted[0])

    return MultiValue(get_valtype, value_splitted)
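
A usage sketch of the convert_PN above, mirroring the PN test proposed earlier in this thread (same assumptions: the latin_1/iso2022_jp encodings); the 0x5C bytes inside the kanji must not produce a second PersonName value:

    # Mirrors the PN test sketched above: a name whose kanji bytes contain
    # 0x5C must decode to a single PersonName, not be split in two.
    name_bytes = b'Baishaku^Honme=\x1b$BG\\<\\\x1b(B^\x1b$BK\\L\\'
    name = convert_PN(name_bytes, ['latin_1', 'iso2022_jp'])
    assert name == 'Baishaku^Honme=倍尺^本目'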

Please let me know if you want us to fix this in this PR, or in another one 🙏

@mrbean-bremen (Member)

@LaurentTrk - thanks, I had missed that! I think this also belongs to this PR, so feel free to add it.

@LaurentTrk (Contributor)

Hi guys,
I made some changes in this PR for the PersonName decoding.
Please let me know if you need anything else.

@mrbean-bremen (Member)

These typing errors that show up each time a new mypy version is released are getting a bit annoying. @darcymason - maybe we pin the mypy version (like we did for the patch release) and update mypy ourselves when needed? Or switch to pre-commit, which will create PRs for updated tools that can then be adapted to pass.
My suggestion would actually be to pin mypy to 0.971 in this PR and handle the new mypy version separately, in whatever way we decide.

@darcymason (Member)

> My suggestion would actually be to pin mypy to 0.971 in this PR

Sounds good to me.

@mrbean-bremen (Member)

@darcymason - I think we somewhat forgot this PR. From my POV this is ready to merge, what do you think?

@darcymason (Member)

Sure, this is more your expertise than mine, so please merge if you are happy with it.

@mrbean-bremen merged commit 9c7f4dc into pydicom:master on Jan 20, 2023