Support dutch timestamp within headers #1095

SimonBiggs · 2020-10-14T09:09:19Z

See #1093 for the base PR.

The new testing file can be seen over at https://zenodo.org/record/4087961. The new files added are highlighted below:

I ran pymedphys trf to-csv on that file without this fix and it failed to decode, raising "ValueError: Logfile header not of an expected form." With this fix, it decodes as intended. No other baseline or reference decoding results within the testing suite were altered by this change.

See the error message from pytest when this new TRF file is included but this change is not made:

simon@dads-desktop:~/git/pymedphys$ poetry run pytest pymedphys/tests/trf --run-only-slow
Test session starts (platform: linux, Python 3.7.8, pytest 6.1.0, pytest-sugar 0.9.4)
rootdir: /home/simon/git/pymedphys
plugins: hypothesis-5.36.1, pylint-0.17.0, sugar-0.9.4
collecting ... 2020-10-14 20:25:16.469664: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1


――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― test_conversions ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

    @pytest.mark.slow
    def test_conversions():
        data_paths = pymedphys.zip_data_paths("trf-references-and-baselines.zip")
    
        files_with_references = [
            path
            for path in data_paths
            if path.parent.name == "with_reference" and path.suffix == ".trf"
        ]
    
        assert len(files_with_references) >= 5
    
        files_without_references = [
            path
            for path in data_paths
            if path.parent.name == "with_baseline" and path.suffix == ".trf"
        ]
    
        assert len(files_without_references) >= 4
    
        with tempfile.TemporaryDirectory() as output_directory:
            for filepath in files_with_references:
                convert_and_check_against_reference(filepath, output_directory)
    
            for filepath in files_without_references:
>               convert_and_check_against_baseline(filepath, output_directory)

pymedphys/tests/trf/test_decode.py:99: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pymedphys/tests/trf/test_decode.py:56: in convert_and_check_against_baseline
    convert_and_check(filepath, output_directory, baseline_dataframe)
pymedphys/tests/trf/test_decode.py:68: in convert_and_check
    _, table_filepath = trf2csv(filepath, output_directory=output_directory)
pymedphys/_trf/decode/trf2csv.py:64: in trf2csv
    dataframes["header"], dataframes["table"] = trf2pandas(trf_filepath)
pymedphys/_trf/decode/trf2pandas.py:32: in trf2pandas
    header_dataframe = header_as_dataframe(trf_header_contents)
pymedphys/_trf/decode/trf2pandas.py:43: in header_as_dataframe
    header = decode_header(trf_header_contents)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

trf_header_contents = b'\x1320-09-24 06:29:58 Z\x06+02:00\x0b6_320/6_320\x042325\x00\x00\x00\x00\x00\xb3\xc0@\x03\x00\x00\x00^\x01\x00\x00\x...dc\x00\xe4\t\xdc\x00\xe5\t\xdc\x00\xe6\t\xdc\x00\xe7\t\xdc\x00\xe8\t\xdc\x00\xe9\t\xdc\x00\xea\t\xdc\x00\xeb\t\xdc\x00'

    def decode_header(trf_header_contents):
        match = re.match(
            br"[\x00-\x19]"  # start bit
            br"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d Z)"  # date
            br"[\x00-\x19]"  # divider bit
            br"((\+|\-)\d\d:\d\d)"  # time zone
            br"[\x00-\x25]"  # divider bit
            br"([\x20-\xFF]*)"  # field label and name
            br"[\x00-\x19]"  # divider bit
            br"([\x20-\xFF]+)"  # machine name
            br"[\x00-\x19]",  # divider bit
            trf_header_contents,
        )
    
        if match is None:
            print(trf_header_contents)
>           raise ValueError("Logfile header not of an expected form.")
E           ValueError: Logfile header not of an expected form.

pymedphys/_trf/decode/header.py:49: ValueError
------------------------------------------------------------- Captured stdout call --------------------------------------------------------------
b'\x1320-09-24 06:29:58 Z\x06+02:00\x0b6_320/6_320\x042325\x00\x00\x00\x00\x00\xb3\xc0@\x03\x00\x00\x00^\x01\x00\x00\xc0\x08o\x00\x81\x08d\x00\xef\to\x00\xee\to\x00\xbe\x08o\x00r\x08e\x00\x98\x08o\x00\xed\to\x00\xb0\x08\x81\x00\xb0\x08\xdc\x00\xb1\x08\x81\x00\xb1\x08\xdc\x00\xb2\x08\x81\x00\xb2\x08\xdc\x00\xb3\x08\x81\x00\xb3\x08\xdc\x00\xb4\x08\x81\x00\xb4\x08\xdc\x00\xb5\x08\x81\x00\xb5\x08\xdc\x00\x0c\x08\x81\x00\r\x08\x81\x00\x0c\x08\xdc\x00\r\x08\xdc\x00\x10\x08\x81\x00\x11\x08\x81\x00\x10\x08\xdc\x00\x11\x08\xdc\x00t\t\xe3\x00\xc4\t\xe3\x00L\t\x81\x00M\t\x81\x00N\t\x81\x00O\t\x81\x00P\t\x81\x00Q\t\x81\x00R\t\x81\x00S\t\x81\x00T\t\x81\x00U\t\x81\x00V\t\x81\x00W\t\x81\x00X\t\x81\x00Y\t\x81\x00Z\t\x81\x00[\t\x81\x00\\\t\x81\x00]\t\x81\x00^\t\x81\x00_\t\x81\x00`\t\x81\x00a\t\x81\x00b\t\x81\x00c\t\x81\x00d\t\x81\x00e\t\x81\x00f\t\x81\x00g\t\x81\x00h\t\x81\x00i\t\x81\x00j\t\x81\x00k\t\x81\x00l\t\x81\x00m\t\x81\x00n\t\x81\x00o\t\x81\x00p\t\x81\x00q\t\x81\x00r\t\x81\x00s\t\x81\x00t\t\x81\x00u\t\x81\x00v\t\x81\x00w\t\x81\x00x\t\x81\x00y\t\x81\x00z\t\x81\x00{\t\x81\x00|\t\x81\x00}\t\x81\x00~\t\x81\x00\x7f\t\x81\x00\x80\t\x81\x00\x81\t\x81\x00\x82\t\x81\x00\x83\t\x81\x00\x84\t\x81\x00\x85\t\x81\x00\x86\t\x81\x00\x87\t\x81\x00\x88\t\x81\x00\x89\t\x81\x00\x8a\t\x81\x00\x8b\t\x81\x00\x8c\t\x81\x00\x8d\t\x81\x00\x8e\t\x81\x00\x8f\t\x81\x00\x90\t\x81\x00\x91\t\x81\x00\x92\t\x81\x00\x93\t\x81\x00\x94\t\x81\x00\x95\t\x81\x00\x96\t\x81\x00\x97\t\x81\x00\x98\t\x81\x00\x99\t\x81\x00\x9a\t\x81\x00\x9b\t\x81\x00\x9c\t\x81\x00\x9d\t\x81\x00\x9e\t\x81\x00\x9f\t\x81\x00\xa0\t\x81\x00\xa1\t\x81\x00\xa2\t\x81\x00\xa3\t\x81\x00\xa4\t\x81\x00\xa5\t\x81\x00\xa6\t\x81\x00\xa7\t\x81\x00\xa8\t\x81\x00\xa9\t\x81\x00\xaa\t\x81\x00\xab\t\x81\x00\xac\t\x81\x00\xad\t\x81\x00\xae\t\x81\x00\xaf\t\x81\x00\xb0\t\x81\x00\xb1\t\x81\x00\xb2\t\x81\x00\xb3\t\x81\x00\xb4\t\x81\x00\xb5\t\x81\x00\xb6\t\x81\x00\xb7\t\x81\x00\xb8\t\x81\x00\xb9\t\x81\x00\xba\t\x81\x00\xbb\t\x81\x00\xbc\t\x81\x00\xbd\t\x81\x00\xbe
\t\x81\x00\xbf\t\x81\x00\xc0\t\x81\x00\xc1\t\x81\x00\xc2\t\x81\x00\xc3\t\x81\x00\xc4\t\x81\x00\xc5\t\x81\x00\xc6\t\x81\x00\xc7\t\x81\x00\xc8\t\x81\x00\xc9\t\x81\x00\xca\t\x81\x00\xcb\t\x81\x00\xcc\t\x81\x00\xcd\t\x81\x00\xce\t\x81\x00\xcf\t\x81\x00\xd0\t\x81\x00\xd1\t\x81\x00\xd2\t\x81\x00\xd3\t\x81\x00\xd4\t\x81\x00\xd5\t\x81\x00\xd6\t\x81\x00\xd7\t\x81\x00\xd8\t\x81\x00\xd9\t\x81\x00\xda\t\x81\x00\xdb\t\x81\x00\xdc\t\x81\x00\xdd\t\x81\x00\xde\t\x81\x00\xdf\t\x81\x00\xe0\t\x81\x00\xe1\t\x81\x00\xe2\t\x81\x00\xe3\t\x81\x00\xe4\t\x81\x00\xe5\t\x81\x00\xe6\t\x81\x00\xe7\t\x81\x00\xe8\t\x81\x00\xe9\t\x81\x00\xea\t\x81\x00\xeb\t\x81\x00L\t\xdc\x00M\t\xdc\x00N\t\xdc\x00O\t\xdc\x00P\t\xdc\x00Q\t\xdc\x00R\t\xdc\x00S\t\xdc\x00T\t\xdc\x00U\t\xdc\x00V\t\xdc\x00W\t\xdc\x00X\t\xdc\x00Y\t\xdc\x00Z\t\xdc\x00[\t\xdc\x00\\\t\xdc\x00]\t\xdc\x00^\t\xdc\x00_\t\xdc\x00`\t\xdc\x00a\t\xdc\x00b\t\xdc\x00c\t\xdc\x00d\t\xdc\x00e\t\xdc\x00f\t\xdc\x00g\t\xdc\x00h\t\xdc\x00i\t\xdc\x00j\t\xdc\x00k\t\xdc\x00l\t\xdc\x00m\t\xdc\x00n\t\xdc\x00o\t\xdc\x00p\t\xdc\x00q\t\xdc\x00r\t\xdc\x00s\t\xdc\x00t\t\xdc\x00u\t\xdc\x00v\t\xdc\x00w\t\xdc\x00x\t\xdc\x00y\t\xdc\x00z\t\xdc\x00{\t\xdc\x00|\t\xdc\x00}\t\xdc\x00~\t\xdc\x00\x7f\t\xdc\x00\x80\t\xdc\x00\x81\t\xdc\x00\x82\t\xdc\x00\x83\t\xdc\x00\x84\t\xdc\x00\x85\t\xdc\x00\x86\t\xdc\x00\x87\t\xdc\x00\x88\t\xdc\x00\x89\t\xdc\x00\x8a\t\xdc\x00\x8b\t\xdc\x00\x8c\t\xdc\x00\x8d\t\xdc\x00\x8e\t\xdc\x00\x8f\t\xdc\x00\x90\t\xdc\x00\x91\t\xdc\x00\x92\t\xdc\x00\x93\t\xdc\x00\x94\t\xdc\x00\x95\t\xdc\x00\x96\t\xdc\x00\x97\t\xdc\x00\x98\t\xdc\x00\x99\t\xdc\x00\x9a\t\xdc\x00\x9b\t\xdc\x00\x9c\t\xdc\x00\x9d\t\xdc\x00\x9e\t\xdc\x00\x9f\t\xdc\x00\xa0\t\xdc\x00\xa1\t\xdc\x00\xa2\t\xdc\x00\xa3\t\xdc\x00\xa4\t\xdc\x00\xa5\t\xdc\x00\xa6\t\xdc\x00\xa7\t\xdc\x00\xa8\t\xdc\x00\xa9\t\xdc\x00\xaa\t\xdc\x00\xab\t\xdc\x00\xac\t\xdc\x00\xad\t\xdc\x00\xae\t\xdc\x00\xaf\t\xdc\x00\xb0\t\xdc\x00\xb1\t\xdc\x00\xb2\t\xdc\x00\xb3\t\xdc\x00\xb4\t\xdc\x00\xb5\t\xdc\x00\xb6\t\xdc\x00\xb7\t\xdc\x00
\xb8\t\xdc\x00\xb9\t\xdc\x00\xba\t\xdc\x00\xbb\t\xdc\x00\xbc\t\xdc\x00\xbd\t\xdc\x00\xbe\t\xdc\x00\xbf\t\xdc\x00\xc0\t\xdc\x00\xc1\t\xdc\x00\xc2\t\xdc\x00\xc3\t\xdc\x00\xc4\t\xdc\x00\xc5\t\xdc\x00\xc6\t\xdc\x00\xc7\t\xdc\x00\xc8\t\xdc\x00\xc9\t\xdc\x00\xca\t\xdc\x00\xcb\t\xdc\x00\xcc\t\xdc\x00\xcd\t\xdc\x00\xce\t\xdc\x00\xcf\t\xdc\x00\xd0\t\xdc\x00\xd1\t\xdc\x00\xd2\t\xdc\x00\xd3\t\xdc\x00\xd4\t\xdc\x00\xd5\t\xdc\x00\xd6\t\xdc\x00\xd7\t\xdc\x00\xd8\t\xdc\x00\xd9\t\xdc\x00\xda\t\xdc\x00\xdb\t\xdc\x00\xdc\t\xdc\x00\xdd\t\xdc\x00\xde\t\xdc\x00\xdf\t\xdc\x00\xe0\t\xdc\x00\xe1\t\xdc\x00\xe2\t\xdc\x00\xe3\t\xdc\x00\xe4\t\xdc\x00\xe5\t\xdc\x00\xe6\t\xdc\x00\xe7\t\xdc\x00\xe8\t\xdc\x00\xe9\t\xdc\x00\xea\t\xdc\x00\xeb\t\xdc\x00'

 pymedphys/tests/trf/test_decode.py ⨯                                                                                             100% ██████████
=============================================================== warnings summary ================================================================
pymedphys/tests/trf/test_decode.py::test_conversions
pymedphys/tests/trf/test_decode.py::test_conversions
  /home/simon/.pyenv/versions/3.7.8/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
    return f(*args, **kwds)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================ short test summary info ============================================================
FAILED pymedphys/tests/trf/test_decode.py::test_conversions - ValueError: Logfile header not of an expected form.

Results (19.19s):
       1 failed
         - pymedphys/tests/trf/test_decode.py:74 test_conversions
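To illustrate why the header above is rejected, here is a minimal sketch that isolates the date piece of the pattern. The two sub-patterns are copied from the diff in this PR, and the dash-separated timestamp is copied from the captured stdout. The slash-separated timestamp is a hypothetical variant added only for comparison, and the names OLD_DATE, NEW_DATE, dash_timestamp and slash_timestamp are not part of pymedphys.

```python
import re

# Date sub-patterns taken from the diff in this PR.
OLD_DATE = re.compile(rb"\d\d/\d\d/\d\d \d\d:\d\d:\d\d Z")
NEW_DATE = re.compile(rb"\d\d[/-]\d\d[/-]\d\d \d\d:\d\d:\d\d Z")

# Timestamp bytes copied from the failing header in the captured stdout.
dash_timestamp = b"20-09-24 06:29:58 Z"
# Hypothetical slash-separated variant, for comparison only.
slash_timestamp = b"20/09/24 06:29:58 Z"

for label, stamp in [("dash", dash_timestamp), ("slash", slash_timestamp)]:
    old_ok = OLD_DATE.fullmatch(stamp) is not None
    new_ok = NEW_DATE.fullmatch(stamp) is not None
    print(f"{label}: old pattern matches={old_ok}, new pattern matches={new_ok}")
```

The old sub-pattern rejects the dash-separated timestamp, while the new one accepts both forms. Note that the character class [/-] is applied to each separator independently, so a mixed form such as 20/09-24 would also be accepted.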

sjswerdloff

Worth taking a look at the detailed comments and having separate issues raised for them if there isn't time to address them right now.

sjswerdloff · 2020-10-14T22:09:38Z

pymedphys/_trf/decode/header.py

@@ -33,7 +33,7 @@ def determine_header_length(trf_contents):
 def decode_header(trf_header_contents):


Needs a docstring.
It would also be helpful to have comments describing what each piece of the regex is expected to match, perhaps with an example, or a reference to any TRF file format documentation.

SimonBiggs · 2020-10-15T02:19:11Z

Yup, fair regarding the docstring. It wasn't part of this PR though, so I will address it separately.

Regarding the other questions, this has been reverse engineered from the data we have. TRF file format documentation would help, but unfortunately I don't have anything official on that front.
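As a rough illustration of what the regex pieces capture, the sketch below applies the pattern from the traceback (with the date piece from this PR's diff) to the first bytes of the header printed in the failing test run. The per-piece comments and the names HEADER_PATTERN, sample and pieces are assumptions made for this example. They reflect the reverse-engineered reading in the existing code comments, not any official TRF specification.

```python
import re

# Pattern as shown in the traceback, with the date piece from this PR's diff.
HEADER_PATTERN = re.compile(
    rb"[\x00-\x19]"  # one low-valued byte before the date
    rb"(\d\d[/-]\d\d[/-]\d\d \d\d:\d\d:\d\d Z)"  # date, e.g. b"20-09-24 06:29:58 Z"
    rb"[\x00-\x19]"  # divider byte
    rb"((\+|\-)\d\d:\d\d)"  # time zone offset, e.g. b"+02:00"
    rb"[\x00-\x25]"  # divider byte
    rb"([\x20-\xFF]*)"  # field label and name, e.g. b"6_320/6_320"
    rb"[\x00-\x19]"  # divider byte
    rb"([\x20-\xFF]+)"  # machine name, e.g. b"2325"
    rb"[\x00-\x19]"  # divider byte
)

# First bytes of the header printed in the failing test run.
sample = b"\x1320-09-24 06:29:58 Z\x06+02:00\x0b6_320/6_320\x042325\x00\x00"

match = HEADER_PATTERN.match(sample)
assert match is not None

# Group 3 holds only the sign of the offset, so it is skipped here.
pieces = {
    "date": match.group(1),
    "timezone": match.group(2),
    "field": match.group(4),
    "machine": match.group(5),
}
print(pieces)
```

Each divider is matched as a single byte in a low-valued range, which is why the captured groups stop at the first byte below \x20.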

sjswerdloff · 2020-10-14T22:14:29Z

pymedphys/_trf/decode/header.py

@@ -33,7 +33,7 @@ def determine_header_length(trf_contents):
 def decode_header(trf_header_contents):
    match = re.match(
        br"[\x00-\x19]"  # start bit
-        br"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d Z)"  # date
+        br"(\d\d[/-]\d\d[/-]\d\d \d\d:\d\d:\d\d Z)"  # date


What other date formats might be expected?
The Dutch timestamp format is presumably not just Dutch, and may be used in other parts of Western Europe. That raises the question of what other regional formats to expect. Might Eastern Europe use a different date format? Different places in Asia (Japan, Korea, China)?

Is anyone concerned about rolling over the century mark with a two-digit year (previously known as the Y2K problem)?

Further down, the bytes are decoded as ASCII, but the regex accepts bytes up to \xFF, while ASCII only goes up to \x7F. So is it really ASCII, or is it Latin-1 (ISO-8859-1, or something similar)?

SimonBiggs · 2020-10-15T02:23:25Z

Fair point about decoding with ASCII; I should possibly be using something else. It would be helpful to have someone create a field with some very unusual characters in it and then retrieve the resulting TRF file.

Regarding most of the other points, I am far more comfortable changing the decoding logic only when a file shows up that isn't properly decoded, and then making the change needed to decode that file. I'm not so keen on trying to predict what possible formats of TRF are out there.

sjswerdloff · 2020-10-15

"Not so keen on trying to predict what possible formats of TRF are out there."

Without the official documentation, that's a very fair approach.
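The ASCII concern raised in this thread can be sketched as follows. The field bytes here are hypothetical and not taken from any real TRF file. They only show that a byte the regex accepts (anything up to \xFF) can fail to decode as ASCII, whereas Latin-1 maps every byte value to a character. Which encoding the TRF files actually use remains unknown.

```python
# Hypothetical field label containing a byte above 0x7F; not from a real TRF file.
raw_field = b"caf\xe9/plan"

try:
    raw_field.decode("ascii")
    ascii_result = "decoded"
except UnicodeDecodeError as error:
    # ASCII only covers byte values 0x00 to 0x7F.
    ascii_result = f"failed: {error.reason}"

# Latin-1 assigns a character to every byte value, so this never raises.
latin1_result = raw_field.decode("latin-1")

print(ascii_result)
print(latin1_result)
```

Because Latin-1 never raises, a successful decode does not confirm that Latin-1 is the right choice. A real file with non-ASCII characters in a field name would still be needed to check the result.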

SimonBiggs

Yup, thanks @sjswerdloff. Let's pull these comments out into a new issue and address them separately to this PR.

SimonBiggs · 2020-10-15T02:26:58Z

Before merging, I'm keen for @chrisootes to take the dev version for a spin and make sure it works across his files.

#1093 (comment)

ootesc and others added 5 commits October 13, 2020 15:35

support for headers with other date format

e7ffc8b

Merge pull request #1093 from chrisootes/master

0f9468a

update trf baselines to include dutch timestamp baseline

cdb7cc3

bump dev version

59bda35

appropriately bump dev version

cd11493

SimonBiggs mentioned this pull request Oct 14, 2020

support for headers with other date format #1093

Merged

SimonBiggs requested a review from sjswerdloff October 14, 2020 09:17

SimonBiggs changed the title ~~Support dutch headers~~ Support dutch timestamp within headers Oct 14, 2020

sjswerdloff approved these changes Oct 14, 2020

SimonBiggs commented Oct 15, 2020

SimonBiggs mentioned this pull request Oct 15, 2020

TRF header.py improvements #1096

Closed

SimonBiggs merged commit d15464a into master Oct 21, 2020

SimonBiggs deleted the support-dutch-headers branch November 8, 2020 02:35
