
Round-tripping through dicom.Parse()/dicom.Write() does not yield bit-for-bit identical files. #198

Open
bpeake-illuscio opened this issue Apr 2, 2021 · 5 comments


@bpeake-illuscio (Contributor) commented Apr 2, 2021

Hello!

Opening an issue with some findings after kicking the tires a little more around the write feature. I bumped into this when running some code that reads in a dicom, scrapes some data, then writes it out to a storage location. I am migrating this application from python, where I use pydicom to do the task currently. The Go version is much faster! So that's good.

As part of the process, I do a checksum of the original file and make sure that the newly written file has the same checksum before completing the job. When using the Go code, I am getting written files whose checksums do not match the original source. It seems like there's some value-cleaning either on the read-in or write-out that yields differing data.
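For reference, the checksum guard in that pipeline looks roughly like the sketch below. The `rewriteDICOM` helper is hypothetical: it stands in for the real parse-then-serialize round trip (dicom.Parse followed by dicom.Write) and here just copies the input, so the check passes trivially.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"fmt"
)

// rewriteDICOM is a stand-in for the real round trip through the
// library (parse, then write back out). Here it is an identity copy.
func rewriteDICOM(src []byte) []byte {
	out := make([]byte, len(src))
	copy(out, src)
	return out
}

// sameChecksum reports whether the rewritten bytes hash identically
// to the source bytes.
func sameChecksum(src, dst []byte) bool {
	a, b := md5.Sum(src), md5.Sum(dst)
	return bytes.Equal(a[:], b[:])
}

func main() {
	src := []byte("DICM...fake file contents")
	dst := rewriteDICOM(src)
	fmt.Println(sameChecksum(src, dst)) // true for an identity rewrite
}
```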

The good news is that this still seems to result in correctly written DICOMs; I can open them in OsiriX, so I don't think this is a critical issue, just one to be aware of.

I put some tests together to show this. There's a copy-paste of the ROUNDTRIP_TESTS_README.MD at the end of this issue for you to see how the tests are laid out, and my branch with these tests can be found here.

I've attached an html-dump of the tests from Goland so you can see what sort of issues they are catching: Test Results - roundtrip_test-with_env_go.html.zip

Here are a few examples of the errors it is catching:

Mismatch VR on Private Elements

All private elements are being re-written with a VR of "UN":

(screenshot: dicom_err_private_elements)

I have a fix for this one up already: #194 (dedicated issue: #193)

Value Discrepancy for *PaletteColorLookupTableData

This is an interesting one. It seems like the values for elements like RedPaletteColorLookupTableData have been prefixed with a number of zero-bytes equal to their size. Each written element is exactly twice as large as the source element.

I wonder if there is some sort of buffer allocation error happening either on the read or write side where a buffer is being created for the value size, then appended to rather than written to. These are the only 'OW' values in the files aside from the PixelData, which is handled specially, so the bug may be somewhere in the read or write method for that VR.
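If that guess is right, the bug would look something like the classic Go slip below. This is a hypothetical illustration, not the library's actual code: allocating a buffer with length n (which Go zero-fills) and then appending to it, instead of copying into it, yields exactly the observed symptom of a doubled size with n leading zero bytes.

```go
package main

import "fmt"

// buggyFill allocates a buffer of length len(data) and then appends,
// producing a result twice as large with a zero-byte prefix.
func buggyFill(data []byte) []byte {
	buf := make([]byte, len(data)) // length n, already zero-filled
	buf = append(buf, data...)     // appends AFTER the zeros
	return buf
}

// fixedFill allocates capacity only, so append starts at offset 0.
func fixedFill(data []byte) []byte {
	buf := make([]byte, 0, len(data))
	return append(buf, data...)
}

func main() {
	data := []byte{0xCA, 0xFE}
	fmt.Println(buggyFill(data)) // [0 0 202 254] — doubled, zero-prefixed
	fmt.Println(fixedFill(data)) // [202 254]
}
```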

Even more interesting is that pydicom reports these elements as matching, so it seems maybe they are doing some value-cleaning on their end that catches this? Super interesting / odd. I'll poke around a little and see if I can get to the bottom of it.

(screenshot: dicom_err_mismatch_color_pallete_values)

Raw bytes Mismatch

Once re-written, none of the files match the original checksum:

(screenshot: dicom_err_raw_bytes)

Given the discrepancies above I think this is expected; hopefully resolving them will solve the issue, but I think there might be more going on. 4.dcm passes the element equality tests, but fails the checksum comparisons.

Other thoughts

One interesting thing of note is that pydicom can parse all these files fine, and actually passes when comparing the source elements to the re-written elements.

I think this may be due to a lot of error-correction codepaths on their side. For instance, pydicom compiles a list of well-known private elements and will insert those known VRs into their tag info, masking the fact that we write all private elements out as UN. I imagine something like that is also going on with the Color Palette elements.

But, it's also encouraging. It means we are writing dicoms that pydicom can parse without complaining!

If we want to add a more thorough set of test files in the future, I also found this page from the Aliza DICOM viewer website which has a good set of files as a starting point, though we would need to double-check licensing.

Finally, here is the text of ROUNDTRIP_TESTS_README.MD, for reference, let me know if you have any questions!

ROUNDTRIP_TESTS_README.MD

Overview

A few of our Go tests are designed to verify that this library creates correct and compliant DICOM files by feeding parsed and re-written DICOMs to the python pydicom library and verifying that the results are as we would expect.

Requirements and Installation

Python 3.6+

  • pydicom
  • pytest

These can be installed by running the following command from the top-level directory:

pip install -r ./roundtrip_requirements.txt

You can optionally set the following environmental variable:

PYTHON_INTERPRETER=[path_to_your_interpreter]

This tells go test which python interpreter to call in the case that you wish to use a non-default interpreter (such as a venv when running tests in an IDE). If this variable is not set, the default will be python3.

Test Files

The following files are used to run these tests:

roundtrip_test.go

The main test file to be run by go test. This test reads in each of the test-data files then writes them to a buffer. This buffer is parsed again and both datasets are compared element-by-element to ensure they are identical.

The resulting bytes of the write are also compared against the original bytes of the source file to ensure they are identical. If the results are not a bit-for-bit match, the test will fail. Bytes are compared both directly and by an md5 checksum. If the comparison fails, the position of the first non-matching byte will be reported for investigation.
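The first-mismatch report can be computed with a simple linear scan. A sketch of the approach (the function name is mine, not the test's actual helper):

```go
package main

import "fmt"

// firstMismatch returns the index of the first byte at which a and b
// differ, or -1 if they are identical. If one slice is a prefix of
// the other, the mismatch position is the shorter length.
func firstMismatch(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	fmt.Println(firstMismatch([]byte("DICM"), []byte("DICX"))) // 3
	fmt.Println(firstMismatch([]byte("same"), []byte("same"))) // -1
}
```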

roundtrip_test.py

This script is called by roundtrip_test.go and uses pydicom to parse our written file, ensuring it can be parsed at all. The script then ALSO loads the original file and repeats our Go test of making sure all newly written elements are identical to those in the source file.

By using pydicom, a widely-used and popular library, to validate our read-in, write-out results, we increase confidence that we are correctly parsing and writing DICOM files.

roundtrip_pydicom_assessment.py

This script shows that pydicom is capable of doing a round-trip read/write of a dicom file in which the original file and the file written by pydicom are exactly identical byte-for-byte.

Running the Tests

To run the roundtrip_test.go Go tests, simply run:

go test ./... -v

... as normal after installing the required python version and libraries. The go test runs roundtrip_test.py as a sub-test for each test file and reports the results.

To run the roundtrip_pydicom_assessment.py tests, run:

pytest ./roundtrip_pydicom_assessment.py

ReadWriter Utility

There is a new Go utility at ./testdata/readwriter which can be used to read all DICOM files in a directory and re-write them to a sub-directory, 'rewritten'. This aids in manually testing reading and re-writing files when verifying that DICOM viewers can read the written files.

The utility can be used as follows:

go run ./testdata/readwriter [directory]

If no argument is passed, the utility will process the current directory.

Example usage:

go run ./testdata/readwriter ./testdata
REWRITING: 'testdata/1.dcm' to 'testdata/rewritten/1_rewrite.dcm'
REWRITING: 'testdata/2.dcm' to 'testdata/rewritten/2_rewrite.dcm'
REWRITING: 'testdata/3.dcm' to 'testdata/rewritten/3_rewrite.dcm'
REWRITING: 'testdata/4.dcm' to 'testdata/rewritten/4_rewrite.dcm'
REWRITING: 'testdata/5.dcm' to 'testdata/rewritten/5_rewrite.dcm'
5 DICOM files successfully read and rewritten
@bpeake-illuscio (Contributor, Author)
Just posted #200 and its dedicated issue #199, which fixes the OW leading-zero-bytes issue.

@bpeake-illuscio (Contributor, Author)
Just tried merging #200 and #194 into the round-trip test branch and that took care of the element equality tests not passing. All that's failing now is the checksum tests, which are less concerning to me, though still worth investigating I think.

Also, let me know if you would like me to open up a PR for that branch. I think the round-trip tests are a good high-level test for verifying that read and writes work as intended.

I would probably remove roundtrip_pydicom_assessment.py and the /testdata/readwriter utility, and would just need to add a pip install step to the CI pipeline for running the pydicom verification tests. Happy to do that work if you would like to get it in.

@suyashkumar (Owner)
Hi! Thanks so much for spending time on this. The current write package should probably be considered more of a "draft"--lots of tires to kick. It was a known issue (that simplified some internal assumptions) that reading and then immediately writing out DICOMs would not be byte-for-byte equal, but they should be conformant DICOMs.

One of these assumptions is that we would always write out files in little endian implicit regardless of the input files, which may lead to this issue.

I think it would be good to work towards byte-for-byte equal, but I think it's a tradeoff between complexity and functionality--writing out an equivalent, fully conformant DICOM is a good first goal.

Will go through this in more detail shortly this week!

@suyashkumar (Owner)
Regarding the tests--I agree we should have some flavor of that.

I'm not sure if we need to test bit-for-bit identity yet (I don't think that will pass). We do already have a write test that tests roundtripping within this library (e.g. write out a known dataset, and then read it back in and see that it's exactly equal to the original known dataset).

I think a python-linked test that verifies we can parse the written DICOM with pydicom is certainly a good one. We will have to think about what the best test infrastructure for that would be. (Currently the development setup is very simple and doesn't rely on python, which is nice.) One option is to detail the python requirements and have a helper script set up a python environment for running this test (perhaps it should be separate from the typical go test that runs by default). We should definitely make sure this test always runs on CI; locally you can specifically go to that test and run it if needed, but it may not run as part of go test ./... (we could probably also have make test run the python setup and run the test if needed).

For the python stuff locally, it would probably be good to run the test in a virtualenv so as to not interact with the system python installation. I am pretty wary about adding a hard python dependency for local development, but as a test suite that runs on CI I am fine with it for the moment.

@bpeake-illuscio (Contributor, Author)
It was a known issue (that simplified some internal assumptions) that reading and then immediately writing out DICOMs would not be byte-for-byte equal, but they should be conformant DICOMs.

Thanks, that's good context to have. Do you think it would be worth putting that somewhere near the beginning of the README and/or docs as a warning to developers looking at this lib for DICOM writes? Since the lib is 1.0, some users may find it surprising, but a warning near the beginning of the write-feature documentation would make it quick to discover.

We do already have a write test that tests roundtripping within this library (e.g. write out a known dataset, and then read it back in and see that it's exactly equal to the original known dataset).

Totally, I think it's good to have targeted tests like the ones currently in the write feature, where you know EXACTLY what values are involved.

I also think it's worthwhile to have a round-trip test of non-synthesized data, where we pull in a number of full DICOMs with real-world data and make sure every element in them is correctly read, written, and read again. That was what let me find #199 and #193. And the tests as-written would be easy to expand to larger test datasets, should we want. Further down, I discuss using build tags for some more complicated tests, so maybe we could add an integration tag for these tests so they are not run by default, but are still run on CI or when a developer wants to run the full test suite.

I think it would be good to work towards byte-for-byte equal, but I think it's a tradeoff between complexity and functionality--writing out an equivalent, fully conformant DICOM is a good first goal.

Agreed completely.

I'm not sure if we need to test bit-for-bit identity yet (I don't think that will pass).

Also agreed, although we could always call t.Skip() on it for now and leave it in as an aspirational test. If bit-for-bit parse/write is something you would like to strive for in the future, anyone working towards that goal would then have a pre-written set of tests they can re-enable if they so desire.

We should definitely make sure this test always runs on CI; locally you can specifically go to that test and run it if needed, but it may not run as part of go test ./... (we could probably also have make test run the python setup and run the test if needed).

I think this is a natural use case for build tags, IMO. That way, by default, developers don't have to worry about setting up python unless they want to, while requiring minimal scripting on our part. We could have a validation build tag, and maybe also a bytes_symmetry tag for the bit-for-bit tests instead of using t.Skip().

Then we could have make test run the basic tests using the current test command, maybe have make test-validation run go test ./... -v -tags=validation, and have make test-future run go test ./... -tags=bytes_symmetry, or something like that. I've also been looking into other third-party DICOM validation tools, so we could add tests that call those as well.
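The gating itself would just be a build constraint at the top of the gated test file, along these lines (a sketch; the tag name matches the proposal above, and the file and package names are hypothetical):

```go
//go:build validation
// +build validation

// This file is only compiled and run when the tag is supplied:
//     go test ./... -tags=validation
package dicom_test
```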

For the python stuff locally, it would probably be good to run the test in a virtualenv so as to not interact with the system python installation.

IMO, this is a choice we should leave to the developer. I use python -m venv to set up my virtual environments, but some people use pyenv, and still others use poetry. I think having a python_requirements.txt file and an explanation in the README is probably enough to let anyone who wants to run those tests locally do so without too much setup, and for CI I don't think there's much of a problem using the system Python.

I'll open a couple of draft PRs for the things discussed here, and we can maybe discuss the pros and cons of the individual elements there and close this issue.
