
[MRG] Fix decoding of multibyte text with backslash #1724

Merged: 7 commits into pydicom:master on Jan 20, 2023

Conversation

@ykszk (Contributor) commented Nov 4, 2022

Describe the changes

Fix the handling of multibyte characters containing backslashes when text is decoded in values.convert_text.

  • Multibyte characters whose encoded bytes contain a backslash (0x5C), e.g. 倍 in Japanese, were mishandled as value separators.
  • A unit test is added that decodes text containing such characters together with an actual separator (see the sketch below).
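
A minimal sketch of the failure mode, using only the Python standard library (not pydicom code): in ISO 2022 JP (JIS X 0208) the kanji 倍 and 尺 are encoded as 0x47 0x5C and 0x3C 0x5C, so the raw bytes contain 0x5C even though no DICOM value separator is present.

    # "倍尺" encoded with iso2022_jp; both characters contain a 0x5C byte
    raw = b'\x1b$BG\\<\\\x1b(B'

    # Splitting on the backslash *before* decoding breaks the multibyte
    # sequences into fragments that no longer decode to the intended text:
    fragments = raw.split(b'\\')         # [b'\x1b$BG', b'<', b'\x1b(B']

    # Decoding first and splitting the decoded string afterwards keeps the
    # characters intact and only splits on real separators:
    decoded = raw.decode('iso2022_jp')   # '倍尺'
    values = decoded.split('\\')         # ['倍尺'], no spurious split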

Tasks

  • Unit tests added that reproduce the issue or prove feature is working
  • Fix or feature added
  • Code typed and mypy shows no errors
  • Documentation updated (if relevant)
    • No warnings during build
    • Preview link (CircleCI -> Artifacts -> doc/_build/html/index.html)
  • Unit tests passing and overall coverage the same or better

- backslashes in multibyte encodings were mishandled as separators
- Add unit test for the issue
@codecov bot commented Nov 4, 2022

Codecov Report

Merging #1724 (9b02444) into master (a8be738) will decrease coverage by 9.20%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1724      +/-   ##
==========================================
- Coverage   97.58%   88.38%   -9.21%     
==========================================
  Files          66       66              
  Lines       10744    10773      +29     
==========================================
- Hits        10485     9522     -963     
- Misses        259     1251     +992     
Impacted Files Coverage Δ
pydicom/valuerep.py 99.13% <ø> (-0.28%) ⬇️
pydicom/filewriter.py 97.56% <100.00%> (-0.27%) ⬇️
pydicom/values.py 90.19% <100.00%> (-7.22%) ⬇️
pydicom/pixel_data_handlers/util.py 17.93% <0.00%> (-82.07%) ⬇️
pydicom/waveforms/numpy_handler.py 20.31% <0.00%> (-79.69%) ⬇️
pydicom/encoders/gdcm.py 20.96% <0.00%> (-75.81%) ⬇️
pydicom/pixel_data_handlers/numpy_handler.py 25.67% <0.00%> (-74.33%) ⬇️
pydicom/pixel_data_handlers/gdcm_handler.py 26.01% <0.00%> (-66.67%) ⬇️
pydicom/overlays/numpy_handler.py 35.18% <0.00%> (-64.82%) ⬇️
pydicom/pixel_data_handlers/pillow_handler.py 31.73% <0.00%> (-64.43%) ⬇️
... and 15 more


@mrbean-bremen (Member) left a comment

Thanks - please also add an entry in the release notes.

-        as_strings = [convert_single_string(value, encodings, vr)
-                      for value in values]
+        single_string = convert_single_string(byte_string, encodings, vr)
+        as_strings = single_string.split('\\')
Member

Thanks - good catch!

    if len(as_strings) == 1:
        return as_strings[0]
    as_strings = [value.rstrip('\0 ') for value in as_strings]
Member

Why is this needed? As far as I know, null bytes are only used in UI strings for padding, and the space padding for the last value should already be handled.

Contributor Author

I'm not familiar with the DICOM standard, but the test cases below failed without this line.

bytestring = b'Values \\with spaces '
assert ['Values', 'with spaces'] == convert_text(bytestring)
bytestring = b'Value ending with zeros\0\0\0'
assert 'Value ending with zeros' == convert_single_string(bytestring)
assert 'Value ending with zeros' == convert_text(bytestring)
bytestring = b'Values\0\0\\with zeros\0'
assert ['Values', 'with zeros'] == convert_text(bytestring)

Member

Hm, maybe there has been some invalid DICOM that did this, so the tests have been added.
Anyway, I'm not sure why these test cases would fail - I will have a closer look tonight or over the weekend.

Member

Ok, I had another look and understand it now. The removal of the padding is done in convert_single_string, along with the validation - which may be a problem, because the validation assumes single values and may fail if the combined string is longer than a single value is allowed to be. I don't have a nice solution right now. We could do the decoding part first, and then do the validation and stripping on each part, e.g. something like:

def convert_text(...):
    def handle_value(v):
        if vr is not None:
            validate_value(
                vr, v, config.settings.reading_validation_mode)
        return v.rstrip('\0 ')

    encodings = encodings or [default_encoding]
    decoded_string = decode_bytes(byte_string, encodings, TEXT_VR_DELIMS)
    values = decoded_string.split("\\")
    as_strings = [handle_value(value) for value in values]
    if len(as_strings) == 1:
        return as_strings[0]
    return MultiValue(str, as_strings,
                      validation_mode=config.settings.reading_validation_mode)

I'm not sure about the validation, because it will have to encode the string again to check the length, so this is a bit ugly. I will think about it over the weekend, or maybe you have a better idea meanwhile...
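
For reference, a quick usage sketch of the restructured convert_text, assuming the behaviour the test cases earlier in this thread assert (pydicom.values.convert_text, default encoding, no vr so validation is skipped):

    from pydicom.values import convert_text

    # Padding is stripped per value and the split happens on the decoded
    # string, so the multi-valued cases behave as the earlier tests expect.
    assert convert_text(b'Values \\with spaces ') == ['Values', 'with spaces']
    assert convert_text(b'Values\0\0\\with zeros\0') == ['Values', 'with zeros']
    # A single value comes back as a plain str with the padding removed.
    assert convert_text(b'Value ending with zeros\0\0\0') == 'Value ending with zeros'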

Contributor Author

Considering that convert_text is applied to LO, SH, and UC, and that the maximum lengths for LO and SH are specified in characters rather than bytes (and validation for UC is unnecessary?), the code you provided looks perfect to me.

Contributor Author

Thank YOU for reviewing and everything!

I've committed what I thought was minimally required for this PR. I tried to fix the similar/related issues you suggested, but the scope of the relevant code seemed broader than I could handle at the moment. So I hope you can handle the rest (I don't mind where it is done, whether in this PR or in a new one).

FWIW, let me dump here what I thought had to be added/fixed (but gave up actually doing).

Add unit test for convert_PN

Decoding of PN seemed to need fixing.

class TestConvertPN:
    def test_valid_PN(self):
        bytestring = b'John^Doe'
        assert 'John^Doe' == convert_PN(bytestring)

        bytestring = b'Yamada^Tarou=\x1b$B;3ED\x1b(B^\x1b$BB@O:'
        assert 'Yamada^Tarou=山田^太郎' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])

        # 0x5c (\) in multibyte codes
        bytestring = b'Baishaku^Honme=\x1b$BG\\<\\\x1b(B^\x1b$BK\\L\\'
        assert 'Baishaku^Honme=倍尺^本目' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])

        # 0x3d (=) in multibyte codes
        bytestring = b'Sunatouge^Ayano=\x1b$B:=F=\x1b(B^\x1b$B0=G='
        assert 'Sunatouge^Ayano=砂峠^綾能' == convert_PN(bytestring, ['latin_1', 'iso2022_jp'])

Fix tests for value length

def test_write_invalid_length_non_ascii_text(
        self, enforce_writing_invalid_values):
    fp = DicomBytesIO()
    ds = Dataset()
    ds.is_implicit_VR = True
    ds.is_little_endian = True
    ds.SpecificCharacterSet = "ISO_IR 192"  # UTF-8
    # the string length is 9, so constructing the data element
    # is possible
    ds.add(DataElement(0x00080050, "SH", "洪^吉洞=홍^길동"))
    # encoding the element during writing shall fail,
    # as the encoded length is 21, while only 16 bytes are allowed for SH
    msg = r"The value length \(21\) exceeds the maximum length of 16 *"
    with pytest.raises(ValueError, match=msg):
        ds.save_as(fp)

def test_write_invalid_non_ascii_pn(self, enforce_writing_invalid_values):
    fp = DicomBytesIO()
    ds = Dataset()
    ds.is_implicit_VR = False
    ds.is_little_endian = True
    ds.SpecificCharacterSet = "ISO_IR 192"  # UTF-8
    # string length is 40
    ds.add(DataElement(0x00100010, "PN", "洪^吉洞" * 10))
    msg = r"The PN component length \(100\) exceeds the maximum allowed *"
    with pytest.raises(ValueError, match=msg):
        ds.save_as(fp)

def test_read_invalid_length_non_ascii_text(self):
    fp = DicomBytesIO()
    ds = Dataset()
    ds.is_implicit_VR = True
    ds.is_little_endian = True
    ds.SpecificCharacterSet = "ISO_IR 192"  # UTF-8
    ds.add(DataElement(0x00080050, "SH", "洪^吉洞=홍^길동"))
    # disable value validation to write an invalid value
    with config.disable_value_validation():
        ds.save_as(fp)
    # no warning will be issued during reading, as only RawDataElement
    # objects are read
    ds = dcmread(fp, force=True)

Member

Thank you! I added a commit which should fix the issue, though that raises another question (probably outside of this PR).

@darcymason - I basically just removed the write_validation part, as the string validation I thought was needed is actually not needed. I'm not sure if we should instead add validation to all values on writing... otherwise that write-validation config does not make much sense. Or maybe this is another thing that should be considered for 3.0...

Member

I've been sorting through my thoughts on validation config and playing with some code snippets. Hopefully will start a separate discussion on that in the next couple of days.

I'll look at and comment on the specifics here too, hopefully tomorrow, but in general I don't think we should worry too much because I expect it will all be overhauled after the next release.

Member

Ok, thanks!

Member

I've looked at this again, and I think there is still a role for validation here (at the Setting stage in discussion #1726), but it should happen before encoding, perhaps just checking the maximum length. We can do that in a separate PR (for 2.4), with some initial moves in the direction of pydicom 3.0 where possible.
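
For illustration, a rough sketch of such a pre-encoding check (a hypothetical helper, not pydicom API), assuming the character-based maxima this PR uses for SH (16) and LO (64):

    # Hypothetical helper: reject over-long text values before encoding.
    MAX_CHARS = {"SH": 16, "LO": 64}  # maximum lengths in characters

    def check_max_length(vr: str, value: str) -> None:
        maximum = MAX_CHARS.get(vr)
        if maximum is not None and len(value) > maximum:
            raise ValueError(
                f"The value length ({len(value)}) exceeds the "
                f"maximum length of {maximum} allowed for VR {vr}"
            )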

ykszk and others added 2 commits November 6, 2022 12:56
- DICOM standard specifies some maximum value lengths in characters rather than bytes
- also adapt or remove related tests
@LaurentTrk (Contributor)

Hi guys,
We ran into a similar issue, but with the PatientName tag: it was not correctly decoded and was split into two values.
It seems that we need to fix the convert_PN function as well.
Something like this:

def convert_PN(
    byte_string: bytes, encodings: Optional[List[str]] = None
) -> Union[PersonName, MutableSequence[PersonName]]:
    def get_valtype(x: str) -> PersonName:
        return PersonName(x)

    encodings = encodings or [default_encoding]
    value = decode_bytes(byte_string.rstrip(b'\x00 '), encodings, TEXT_VR_DELIMS)
    value_splitted = value.split('\\')
    if len(value_splitted) == 1:
        return get_valtype(value_splitted[0])

    return MultiValue(get_valtype, value_splitted)
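
A usage sketch of the convert_PN above, mirroring the PN test proposed earlier in this thread (same assumptions: the latin_1/iso2022_jp encodings); the 0x5C bytes inside the kanji must not produce a second PersonName value:

    # Mirrors the PN test sketched above: a name whose kanji bytes contain
    # 0x5C must decode to a single PersonName, not be split in two.
    name_bytes = b'Baishaku^Honme=\x1b$BG\\<\\\x1b(B^\x1b$BK\\L\\'
    name = convert_PN(name_bytes, ['latin_1', 'iso2022_jp'])
    assert name == 'Baishaku^Honme=倍尺^本目'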

Please let me know if you want us to fix this in this PR, or in another one 🙏

@mrbean-bremen (Member)

@LaurentTrk - thanks, I had missed that! I think this also belongs to this PR, so feel free to add it.

@LaurentTrk (Contributor)

Hi guys,
I made some changes in this PR for the PersonName decoding.
Please let me know if you need anything else.

@mrbean-bremen (Member)

These typing errors that show up each time a new mypy version is released are getting a bit annoying. @darcymason - maybe we pin the mypy version (like we did for the patch release) and update mypy ourselves when needed? Or switch to pre-commit, which will create PRs for updated tools that can then be adapted to pass.
My suggestion would actually be to pin mypy to 0.971 in this PR and handle the new mypy version separately, in whatever way we decide.

@darcymason (Member)

> My suggestion would actually be to pin mypy to 0.971 in this PR

Sounds good to me.

@mrbean-bremen (Member)

@darcymason - I think we somewhat forgot this PR. From my POV this is ready to merge, what do you think?

@darcymason (Member)

Sure, this is more your expertise than mine, so please merge if you are happy with it.

@mrbean-bremen merged commit 9c7f4dc into pydicom:master on Jan 20, 2023