Output file names and forbidden characters #425

chrisfoulon · 2020-09-10T15:55:15Z

Hi!

I am trying to reproduce the file naming of dcm2niix and I noticed something that I am not sure how to handle it.
When I try the format %d_%t%s so SeriesDescription, Time and InstanceNumber I have 2 different behaviours.
For example, 2 files where the SeriesDescription contains forbidden characters, as explained in https://github.com/rordenlab/dcm2niix/blob/master/FILENAMING.md:
'COR T2* GRE' gives me 'COR_T2_GRE_20160812173101_6'
But:
'MPR Ob_Ax_I -> S_Min IP_sp:10.0_th:10.0' gives me 'MPR_Ob_Ax_I_-__S_Min_IP_sp_10.0_th_10.0_20150805144117_900'

So, I don't know why '>' adds an underscore in the second case but not the other ':' '*' ... characters?
Is there a different rule for some of the forbidden characters, or is it not normal behaviour?

Thank you in advance!
Chris.

neurolabusc · 2020-09-11T12:21:25Z

@chrisfoulon the forbidden characters are converted to an underscore at the file naming stage. However, the asterisk is deleted from the string earlier (during the header reading stage) - not only will it not be allowed for file names, it will not be allowed in any strings saved to the JSON file, along with a few other characters (comma, space, /, , %, Tab, LineFeed, Vertical Tab, Carriage Return) - all of which are replaced by an underscore ("_") with multiple sequential underscores replaced by a single underscore. This has been the consistent behavior, but does describe a major difference in the BIDS JSON files created by dcm2niix versus dicm2nii. I really do not have a strong opinion about this, and do understand that reporting "T2*" in the series description may be less misleading than "T2" (as a T2* image attempts to capture only the susceptibility artifact of a T2-weighted image). However, it will mean all JSON files created by future versions of dcm2niix will be different from prior versions..

@satra, @xiangruili and @effigies would you have any concerns if future versions of dcm2niix preserved line spaces, commas, slashes and backslashes (converted to double backslash), and asterisk symbols to mimic dicm2nii:

	"SeriesDescription": "COR T2* GRE",
	"SequenceName": "*epfid2d1_64",
	"InstitutionAddress": "Medical Center Dr 5,Columbia,SC,US,29203",

versus the current behavior:

	"SeriesDescription": "COR_T2_GRE",
	"SequenceName": "_epfid2d1_64",
	"InstitutionAddress": "Medical_Center_Dr_5_Columbia_SC_US_29203",

Since dicm2nii is widely used, I suspect there will not be many unintended consequences.

If we include the asterisks, we should also consider other symbols like the comma, slash and backslash. However, I am less sure we should include the control Characters, CR, LF, FF, and ESC allowed by DICOM. In general, it might be a good time to make dcm2niix and dicm2nii consistent.

A full list of dicm2nii behavior:

space: allowed
,: allowed
%: allowed
asterisk: allowed
/: allowed
ascii9 tab: allowed
ascii10 line feed: allowed
ascii11 vertical tab: allowed
asci13 carriage return: allowed
backslash converted to double backslash. Note that this character is only allowed for some DICOM VRs.

effigies · 2020-09-11T13:36:45Z

space: allowed

,: allowed

%: allowed

asterisk: allowed

/: allowed

backslash converted to double backslash.

These all make sense to me. I really can't see any good reason to include '\r', '\n', '\t' or vertical tab, and would drop them.

Alternatively, I would be okay with translating any unprintable or non-space whitespace characters to hex escapes and double-backslashing, like \\x00-\\xFF. This would allow downstream tools to reconstruct the original values if desired.

satra · 2020-09-11T13:47:03Z

@neurolabusc - i didn't realize dcm2niix was sanitizing info. i think the values should just reflect what is in dicom and should use any standard json encoding to convert to the json file.

@effigies - if someone wanted to put a structured field in patient description, that should be pulled out as was entered on the scanner console without sanitization. is there a specific reason to sanitize anything at the dcm2niix level other than json compatibility?

also is there any mandate on which form of utf encoding to adopt on the bids side? i can see many dicom fields containing arbitrary characters depending on locations on the planet.

effigies · 2020-09-11T13:58:02Z

Oh right, JSON will just accept \uABCD in strings. That probably makes the most sense.

xiangruili · 2020-09-11T14:10:27Z

I agree with @satra that we won't bother to convert anything, but simply keep what it is in dicom, unless it is forbidden by JSON.

dicm2nii.m uses output file names which are also legal variable names for Matlab. The consideration was to save a full dicom header in a Matlab struct (field name has the same requirement as variable name) with the same name as the file name.

neurolabusc · 2020-09-11T17:04:47Z

@effigies @satra and @xiangruili it is great to have a consensus! While we are changing this, do you all have opinions on how to handle non-English characters for filenames. Consider the name "Søren Müller‬":

DICOM will store this string using ISO 8859, so Sø is stored 0x53F8
JSON and Unix file systems use UTF-8, so Sø is stored 0x53C3B8
Windows filenames require UCS-2/UTF-16, so Sø is stored 0x005300F8

In my experience, many tools have issues with filenames/paths that include non-English characters. This may have changed over the last few years, but I recall large projects like Matlab and Slicer 3D both having issues with this. I suspect many smaller projects will also have issues.

Currently, dcm2niix will translate "Søren Müller‬" to read "Soren Muller‬" in the JSON and "Soren_Muller‬" for the filename (preserving the string length). dicm2nii directly copies the byte values, so the JSON reports "S¯ren M¸ller" and the filename is "S_ren_M_ller". My own sense is we want to stick with English characters, as there may be a lot of unintended consequences if other characters are used in filenames. However, I wonder if some will advocate for a better translation, e.g. "Soren Mueller‬".

effigies · 2020-09-11T19:51:00Z

I don't really have a strong opinion. I do generally think that it's each tool's job to address itself to standards, and not to implement sanitization to accommodate other tools' failures. It seems a shame to spend time implementing a hodge podge of heuristic conversions to ASCII, when this is part of what Unicode was designed to put an end to.

I guess I would say that you should either do nothing, or do something simple. It's not your job to maintain an elaborate normalization scheme.

satra · 2020-09-11T19:54:24Z

@neurolabusc - i would make a distinction between filenames and json files. json files can encode a lot more things than filenames in different OS'es can. so json files should be able to directly save anything in dicom which can be sanitized for filename compliance across OSes. but the value of the key in the json should just directly reflect the dicom value.

neurolabusc · 2020-09-11T20:31:52Z

@effigies I agree. UTF-8 must be rated as one of the most successful hacks in computer science. It provides robust backward compatibility with old data, compact representation for English characters, and the extensibility for other scripts. Unfortunately, both DICOM and the Windows file system predate UTF-8.

@satra - To support the UTF-8 used by JSON we need to convert the 95 written characters that differ with ISO 8859. It is worth noting that the Attribute Specific Character Set (0008,0005) allows one to specify additional non-UTF8 character sets, none of which I have examples of. It is interesting that neither gdcmdump or dcmdump support non-English characters, even though they are led by European developers. Horos does support these features.

neurolabusc · 2020-09-11T21:28:16Z

I have submitted a path that converts ISO8859 to UTF8 for the JSON (e.g. Søren Müller‬ -> Søren Müller‬ )‬, and applies a simple ISO8859 filter to the filename (e.g. Søren Müller‬ -> Soren Muller)‬. Due to the changes in the JSON, this will cause the dcm_qa_* repositories to report differences. It would be great if those with large datasets can check for unintended consequences (@captainnova). It is about time for a new stable release, so hopefully this modification has no unintended consequences.

Unix users can test the latest developmental commit:

git clone --branch development https://github.com/rordenlab/dcm2niix.git
cd dcm2niix/console
make
./dcm2niix

Windows users should be able to get a compiled version from AppVeyor.

captainnova · 2020-09-11T22:47:37Z

It would be great if those with large datasets can check for unintended consequences (@captainnova).

I look forward to testing it, but don't think we have much non-ASCII in our data, which is mostly from countries where keyboards tend to make it difficult to enter non-ASCII characters.

I think using UTF8 in the BIDS file and simple ASCIIization of the filenames is the right approach. I know it's a chicken-and-egg problem, but because unicode would break much of our existing downstream processing, we tend to manually convert fields to restricted ASCII when the need arises. At this moment a colon in a PatientID is causing heartburn because colons are not allowed in qsub job names. (Apparently Siemens scanners fill it in with the time if it is left blank, as often happens with volunteer/phantom scans.)

captainnova · 2020-09-19T19:37:24Z

It works fine with the test suite I have gathered, and minimized alteration of values in the .json matches my expectation better. That said, the test suite I used does not include any nonASCII PatientNames or IDs.

neurolabusc · 2020-09-24T12:57:40Z

OK, I also had several European users test the new version. New behavior will be noted in the release notes for the upcoming stable release.

…ation and Windows function GetOpenFileName will truncate (#425)

* tag 'v1.0.20210317': (23 commits) CI: Travis updates submodules always from remote. Update dcm_qa_nih submodule. Remove diagnostic message help should describe accession number (rordenlab#496) Describe and provide kludge for mangled Canon DICOMs (rordenlab#495) Update Philips notes (rordenlab#493) Update dcm_qa submodule. Support Canon Enhanced DICOM (rordenlab#491) Use first recognized manufacturer tag (rordenlab#487) If mosaics where ANY volume is non-planar, save ALL volumes as 3D(rordenlab#481) Add notes Kludge to separate vNav files as 3D (rordenlab#481) Tell user to override merging (-m o) to separate coils, specific with fmrib usage that disrupts Siemens DCE processing (rordenlab#187) Forbid semicolon in filenames as Linux uses this for command concatenation and Windows function GetOpenFileName will truncate (rordenlab#425) report totalReadoutTime once EstimatedEffectiveEchoSpacing -> EffectiveEchoSpacing (rordenlab#480) GE Round factor for Total Readout Time Update explicit naming of DICOMs (rordenlab#252), GE file naming changes (rordenlab#476) Update issue templates GE TotalReadouTime, BEP009 fixes (rordenlab#476) ...

neurolabusc added a commit that referenced this issue Sep 11, 2020

JSON ISO8859 -> UTF8, filenames use simple ISO8859 filter (#425)

c88e02e

neurolabusc mentioned this issue Sep 17, 2020

DICOM files with Icon Image Sequence where the icon has been purged #415

Closed

neurolabusc closed this as completed Sep 24, 2020

neurolabusc added a commit to neurolabusc/dcm_qa that referenced this issue Oct 30, 2020

Support more UTF8 characters (rordenlab/dcm2niix#425)

2f59a06

neurolabusc mentioned this issue Nov 3, 2020

version 2-November-2020 (v1.0.20201102) #444

Merged

neurolabusc mentioned this issue Jan 11, 2021

Series Description and Protocol Name in the .json file #473

Closed

neurolabusc added a commit that referenced this issue Feb 9, 2021

Forbid semicolon in filenames as Linux uses this for command concaten…

4a4ec3e

…ation and Windows function GetOpenFileName will truncate (#425)

neurolabusc mentioned this issue Feb 24, 2021

JSON keys with dots (periods) #483

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Output file names and forbidden characters #425

Output file names and forbidden characters #425

chrisfoulon commented Sep 10, 2020

neurolabusc commented Sep 11, 2020

effigies commented Sep 11, 2020

satra commented Sep 11, 2020

effigies commented Sep 11, 2020

xiangruili commented Sep 11, 2020 •

edited

neurolabusc commented Sep 11, 2020

effigies commented Sep 11, 2020

satra commented Sep 11, 2020

neurolabusc commented Sep 11, 2020

neurolabusc commented Sep 11, 2020

captainnova commented Sep 11, 2020

captainnova commented Sep 19, 2020

neurolabusc commented Sep 24, 2020

Output file names and forbidden characters #425

Output file names and forbidden characters #425

Comments

chrisfoulon commented Sep 10, 2020

neurolabusc commented Sep 11, 2020

effigies commented Sep 11, 2020

satra commented Sep 11, 2020

effigies commented Sep 11, 2020

xiangruili commented Sep 11, 2020 • edited

neurolabusc commented Sep 11, 2020

effigies commented Sep 11, 2020

satra commented Sep 11, 2020

neurolabusc commented Sep 11, 2020

neurolabusc commented Sep 11, 2020

captainnova commented Sep 11, 2020

captainnova commented Sep 19, 2020

neurolabusc commented Sep 24, 2020

xiangruili commented Sep 11, 2020 •

edited