Support dutch timestamp within headers #1095

Merged
SimonBiggs merged 5 commits into master from support-dutch-headers on Oct 21, 2020

SimonBiggs
Member

@SimonBiggs SimonBiggs commented Oct 14, 2020

See #1093 for the base PR.

The new testing file can be seen over at https://zenodo.org/record/4087961. The new files added are highlighted below:

[image: the newly added files, highlighted]

Without this fix, running `pymedphys trf to-csv` on that file failed to decode, raising `ValueError: Logfile header not of an expected form.` With the fix applied, it decodes as intended. No other baseline or reference decoding results in the test suite were altered by this change.

See the error message from pytest when this new TRF file is included but this change is not made:

simon@dads-desktop:~/git/pymedphys$ poetry run pytest pymedphys/tests/trf --run-only-slow
Test session starts (platform: linux, Python 3.7.8, pytest 6.1.0, pytest-sugar 0.9.4)
rootdir: /home/simon/git/pymedphys
plugins: hypothesis-5.36.1, pylint-0.17.0, sugar-0.9.4
collecting ... 2020-10-14 20:25:16.469664: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1


――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― test_conversions ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

    @pytest.mark.slow
    def test_conversions():
        data_paths = pymedphys.zip_data_paths("trf-references-and-baselines.zip")
    
        files_with_references = [
            path
            for path in data_paths
            if path.parent.name == "with_reference" and path.suffix == ".trf"
        ]
    
        assert len(files_with_references) >= 5
    
        files_without_references = [
            path
            for path in data_paths
            if path.parent.name == "with_baseline" and path.suffix == ".trf"
        ]
    
        assert len(files_without_references) >= 4
    
        with tempfile.TemporaryDirectory() as output_directory:
            for filepath in files_with_references:
                convert_and_check_against_reference(filepath, output_directory)
    
            for filepath in files_without_references:
>               convert_and_check_against_baseline(filepath, output_directory)

pymedphys/tests/trf/test_decode.py:99: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pymedphys/tests/trf/test_decode.py:56: in convert_and_check_against_baseline
    convert_and_check(filepath, output_directory, baseline_dataframe)
pymedphys/tests/trf/test_decode.py:68: in convert_and_check
    _, table_filepath = trf2csv(filepath, output_directory=output_directory)
pymedphys/_trf/decode/trf2csv.py:64: in trf2csv
    dataframes["header"], dataframes["table"] = trf2pandas(trf_filepath)
pymedphys/_trf/decode/trf2pandas.py:32: in trf2pandas
    header_dataframe = header_as_dataframe(trf_header_contents)
pymedphys/_trf/decode/trf2pandas.py:43: in header_as_dataframe
    header = decode_header(trf_header_contents)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

trf_header_contents = b'\x1320-09-24 06:29:58 Z\x06+02:00\x0b6_320/6_320\x042325\x00\x00\x00\x00\x00\xb3\xc0@\x03\x00\x00\x00^\x01\x00\x00\x...dc\x00\xe4\t\xdc\x00\xe5\t\xdc\x00\xe6\t\xdc\x00\xe7\t\xdc\x00\xe8\t\xdc\x00\xe9\t\xdc\x00\xea\t\xdc\x00\xeb\t\xdc\x00'

    def decode_header(trf_header_contents):
        match = re.match(
            br"[\x00-\x19]"  # start bit
            br"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d Z)"  # date
            br"[\x00-\x19]"  # divider bit
            br"((\+|\-)\d\d:\d\d)"  # time zone
            br"[\x00-\x25]"  # divider bit
            br"([\x20-\xFF]*)"  # field label and name
            br"[\x00-\x19]"  # divider bit
            br"([\x20-\xFF]+)"  # machine name
            br"[\x00-\x19]",  # divider bit
            trf_header_contents,
        )
    
        if match is None:
            print(trf_header_contents)
>           raise ValueError("Logfile header not of an expected form.")
E           ValueError: Logfile header not of an expected form.

pymedphys/_trf/decode/header.py:49: ValueError
------------------------------------------------------------- Captured stdout call --------------------------------------------------------------
b'\x1320-09-24 06:29:58 Z\x06+02:00\x0b6_320/6_320\x042325\x00\x00\x00\x00\x00\xb3\xc0@\x03\x00\x00\x00^\x01\x00\x00\xc0\x08o\x00\x81\x08d\x00\xef\to\x00\xee\to\x00\xbe\x08o\x00r\x08e\x00\x98\x08o\x00\xed\to\x00\xb0\x08\x81\x00\xb0\x08\xdc\x00\xb1\x08\x81\x00\xb1\x08\xdc\x00\xb2\x08\x81\x00\xb2\x08\xdc\x00\xb3\x08\x81\x00\xb3\x08\xdc\x00\xb4\x08\x81\x00\xb4\x08\xdc\x00\xb5\x08\x81\x00\xb5\x08\xdc\x00\x0c\x08\x81\x00\r\x08\x81\x00\x0c\x08\xdc\x00\r\x08\xdc\x00\x10\x08\x81\x00\x11\x08\x81\x00\x10\x08\xdc\x00\x11\x08\xdc\x00t\t\xe3\x00\xc4\t\xe3\x00L\t\x81\x00M\t\x81\x00N\t\x81\x00O\t\x81\x00P\t\x81\x00Q\t\x81\x00R\t\x81\x00S\t\x81\x00T\t\x81\x00U\t\x81\x00V\t\x81\x00W\t\x81\x00X\t\x81\x00Y\t\x81\x00Z\t\x81\x00[\t\x81\x00\\\t\x81\x00]\t\x81\x00^\t\x81\x00_\t\x81\x00`\t\x81\x00a\t\x81\x00b\t\x81\x00c\t\x81\x00d\t\x81\x00e\t\x81\x00f\t\x81\x00g\t\x81\x00h\t\x81\x00i\t\x81\x00j\t\x81\x00k\t\x81\x00l\t\x81\x00m\t\x81\x00n\t\x81\x00o\t\x81\x00p\t\x81\x00q\t\x81\x00r\t\x81\x00s\t\x81\x00t\t\x81\x00u\t\x81\x00v\t\x81\x00w\t\x81\x00x\t\x81\x00y\t\x81\x00z\t\x81\x00{\t\x81\x00|\t\x81\x00}\t\x81\x00~\t\x81\x00\x7f\t\x81\x00\x80\t\x81\x00\x81\t\x81\x00\x82\t\x81\x00\x83\t\x81\x00\x84\t\x81\x00\x85\t\x81\x00\x86\t\x81\x00\x87\t\x81\x00\x88\t\x81\x00\x89\t\x81\x00\x8a\t\x81\x00\x8b\t\x81\x00\x8c\t\x81\x00\x8d\t\x81\x00\x8e\t\x81\x00\x8f\t\x81\x00\x90\t\x81\x00\x91\t\x81\x00\x92\t\x81\x00\x93\t\x81\x00\x94\t\x81\x00\x95\t\x81\x00\x96\t\x81\x00\x97\t\x81\x00\x98\t\x81\x00\x99\t\x81\x00\x9a\t\x81\x00\x9b\t\x81\x00\x9c\t\x81\x00\x9d\t\x81\x00\x9e\t\x81\x00\x9f\t\x81\x00\xa0\t\x81\x00\xa1\t\x81\x00\xa2\t\x81\x00\xa3\t\x81\x00\xa4\t\x81\x00\xa5\t\x81\x00\xa6\t\x81\x00\xa7\t\x81\x00\xa8\t\x81\x00\xa9\t\x81\x00\xaa\t\x81\x00\xab\t\x81\x00\xac\t\x81\x00\xad\t\x81\x00\xae\t\x81\x00\xaf\t\x81\x00\xb0\t\x81\x00\xb1\t\x81\x00\xb2\t\x81\x00\xb3\t\x81\x00\xb4\t\x81\x00\xb5\t\x81\x00\xb6\t\x81\x00\xb7\t\x81\x00\xb8\t\x81\x00\xb9\t\x81\x00\xba\t\x81\x00\xbb\t\x81\x00\xbc\t\x81\x00\xbd\t\x81\x00\xbe
\t\x81\x00\xbf\t\x81\x00\xc0\t\x81\x00\xc1\t\x81\x00\xc2\t\x81\x00\xc3\t\x81\x00\xc4\t\x81\x00\xc5\t\x81\x00\xc6\t\x81\x00\xc7\t\x81\x00\xc8\t\x81\x00\xc9\t\x81\x00\xca\t\x81\x00\xcb\t\x81\x00\xcc\t\x81\x00\xcd\t\x81\x00\xce\t\x81\x00\xcf\t\x81\x00\xd0\t\x81\x00\xd1\t\x81\x00\xd2\t\x81\x00\xd3\t\x81\x00\xd4\t\x81\x00\xd5\t\x81\x00\xd6\t\x81\x00\xd7\t\x81\x00\xd8\t\x81\x00\xd9\t\x81\x00\xda\t\x81\x00\xdb\t\x81\x00\xdc\t\x81\x00\xdd\t\x81\x00\xde\t\x81\x00\xdf\t\x81\x00\xe0\t\x81\x00\xe1\t\x81\x00\xe2\t\x81\x00\xe3\t\x81\x00\xe4\t\x81\x00\xe5\t\x81\x00\xe6\t\x81\x00\xe7\t\x81\x00\xe8\t\x81\x00\xe9\t\x81\x00\xea\t\x81\x00\xeb\t\x81\x00L\t\xdc\x00M\t\xdc\x00N\t\xdc\x00O\t\xdc\x00P\t\xdc\x00Q\t\xdc\x00R\t\xdc\x00S\t\xdc\x00T\t\xdc\x00U\t\xdc\x00V\t\xdc\x00W\t\xdc\x00X\t\xdc\x00Y\t\xdc\x00Z\t\xdc\x00[\t\xdc\x00\\\t\xdc\x00]\t\xdc\x00^\t\xdc\x00_\t\xdc\x00`\t\xdc\x00a\t\xdc\x00b\t\xdc\x00c\t\xdc\x00d\t\xdc\x00e\t\xdc\x00f\t\xdc\x00g\t\xdc\x00h\t\xdc\x00i\t\xdc\x00j\t\xdc\x00k\t\xdc\x00l\t\xdc\x00m\t\xdc\x00n\t\xdc\x00o\t\xdc\x00p\t\xdc\x00q\t\xdc\x00r\t\xdc\x00s\t\xdc\x00t\t\xdc\x00u\t\xdc\x00v\t\xdc\x00w\t\xdc\x00x\t\xdc\x00y\t\xdc\x00z\t\xdc\x00{\t\xdc\x00|\t\xdc\x00}\t\xdc\x00~\t\xdc\x00\x7f\t\xdc\x00\x80\t\xdc\x00\x81\t\xdc\x00\x82\t\xdc\x00\x83\t\xdc\x00\x84\t\xdc\x00\x85\t\xdc\x00\x86\t\xdc\x00\x87\t\xdc\x00\x88\t\xdc\x00\x89\t\xdc\x00\x8a\t\xdc\x00\x8b\t\xdc\x00\x8c\t\xdc\x00\x8d\t\xdc\x00\x8e\t\xdc\x00\x8f\t\xdc\x00\x90\t\xdc\x00\x91\t\xdc\x00\x92\t\xdc\x00\x93\t\xdc\x00\x94\t\xdc\x00\x95\t\xdc\x00\x96\t\xdc\x00\x97\t\xdc\x00\x98\t\xdc\x00\x99\t\xdc\x00\x9a\t\xdc\x00\x9b\t\xdc\x00\x9c\t\xdc\x00\x9d\t\xdc\x00\x9e\t\xdc\x00\x9f\t\xdc\x00\xa0\t\xdc\x00\xa1\t\xdc\x00\xa2\t\xdc\x00\xa3\t\xdc\x00\xa4\t\xdc\x00\xa5\t\xdc\x00\xa6\t\xdc\x00\xa7\t\xdc\x00\xa8\t\xdc\x00\xa9\t\xdc\x00\xaa\t\xdc\x00\xab\t\xdc\x00\xac\t\xdc\x00\xad\t\xdc\x00\xae\t\xdc\x00\xaf\t\xdc\x00\xb0\t\xdc\x00\xb1\t\xdc\x00\xb2\t\xdc\x00\xb3\t\xdc\x00\xb4\t\xdc\x00\xb5\t\xdc\x00\xb6\t\xdc\x00\xb7\t\xdc\x00
\xb8\t\xdc\x00\xb9\t\xdc\x00\xba\t\xdc\x00\xbb\t\xdc\x00\xbc\t\xdc\x00\xbd\t\xdc\x00\xbe\t\xdc\x00\xbf\t\xdc\x00\xc0\t\xdc\x00\xc1\t\xdc\x00\xc2\t\xdc\x00\xc3\t\xdc\x00\xc4\t\xdc\x00\xc5\t\xdc\x00\xc6\t\xdc\x00\xc7\t\xdc\x00\xc8\t\xdc\x00\xc9\t\xdc\x00\xca\t\xdc\x00\xcb\t\xdc\x00\xcc\t\xdc\x00\xcd\t\xdc\x00\xce\t\xdc\x00\xcf\t\xdc\x00\xd0\t\xdc\x00\xd1\t\xdc\x00\xd2\t\xdc\x00\xd3\t\xdc\x00\xd4\t\xdc\x00\xd5\t\xdc\x00\xd6\t\xdc\x00\xd7\t\xdc\x00\xd8\t\xdc\x00\xd9\t\xdc\x00\xda\t\xdc\x00\xdb\t\xdc\x00\xdc\t\xdc\x00\xdd\t\xdc\x00\xde\t\xdc\x00\xdf\t\xdc\x00\xe0\t\xdc\x00\xe1\t\xdc\x00\xe2\t\xdc\x00\xe3\t\xdc\x00\xe4\t\xdc\x00\xe5\t\xdc\x00\xe6\t\xdc\x00\xe7\t\xdc\x00\xe8\t\xdc\x00\xe9\t\xdc\x00\xea\t\xdc\x00\xeb\t\xdc\x00'

 pymedphys/tests/trf/test_decode.py ⨯                                                                                             100% ██████████
=============================================================== warnings summary ================================================================
pymedphys/tests/trf/test_decode.py::test_conversions
pymedphys/tests/trf/test_decode.py::test_conversions
  /home/simon/.pyenv/versions/3.7.8/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
    return f(*args, **kwds)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================ short test summary info ============================================================
FAILED pymedphys/tests/trf/test_decode.py::test_conversions - ValueError: Logfile header not of an expected form.

Results (19.19s):
       1 failed
         - pymedphys/tests/trf/test_decode.py:74 test_conversions
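
The failure above can be reproduced without the test data. The following is a minimal sketch, not the project's test: it copies the regex from the traceback into a hypothetical helper, `build_header_pattern`, so that only the date sub-pattern varies, and applies it to the first bytes of the header printed in the captured stdout. The names `OLD_DATE`, `NEW_DATE`, `sample_header`, and `slash_header` are illustrative only.

```python
import re

# Date sub-patterns: the one shown in the traceback, and the one proposed in this PR.
OLD_DATE = br"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d Z)"
NEW_DATE = br"(\d\d[/-]\d\d[/-]\d\d \d\d:\d\d:\d\d Z)"


def build_header_pattern(date_pattern):
    """Hypothetical helper: the header regex from the traceback with a swappable date part."""
    return re.compile(
        br"[\x00-\x19]"  # start byte
        + date_pattern  # date
        + br"[\x00-\x19]"  # divider byte
        br"((\+|\-)\d\d:\d\d)"  # time zone
        br"[\x00-\x25]"  # divider byte
        br"([\x20-\xFF]*)"  # field label and name
        br"[\x00-\x19]"  # divider byte
        br"([\x20-\xFF]+)"  # machine name
        br"[\x00-\x19]"  # divider byte
    )


# First bytes of the failing header, copied from the captured stdout.
sample_header = b"\x1320-09-24 06:29:58 Z\x06+02:00\x0b6_320/6_320\x042325\x00"

# Hypothetical slash-separated variant, built only to check the old format still matches.
slash_header = sample_header.replace(b"-", b"/")

old_match = build_header_pattern(OLD_DATE).match(sample_header)
new_match = build_header_pattern(NEW_DATE).match(sample_header)
slash_match = build_header_pattern(NEW_DATE).match(slash_header)

print(old_match)  # None: the old pattern only accepts "/" separators
print(new_match.group(1))  # b'20-09-24 06:29:58 Z'
print(slash_match.group(1))  # b'20/09/24 06:29:58 Z'
```

The old pattern returns `None` for the hyphenated timestamp, which is the condition that triggers the `ValueError` in `decode_header`.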

@SimonBiggs SimonBiggs changed the title Support dutch headers Support dutch timestamp within headers Oct 14, 2020
Collaborator

@sjswerdloff sjswerdloff left a comment

Worth taking a look at the detailed comments and having separate issues raised for them if there isn't time to address them right now.

@@ -33,7 +33,7 @@ def determine_header_length(trf_contents):
def decode_header(trf_header_contents):
Collaborator

Needs a docstring.
It would also be helpful to have comments describing what each piece of the regex is expected to match, perhaps with an example,
or a reference to any TRF file format documentation...

Member Author

Yup, fair regarding the docstring. Wasn't a part of this PR though so will address this separately.

Regarding the other questions, this has been reverse engineered from the data we have. TRF file format documentation would help, but unfortunately I don't have anything official on that front.
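
As a rough illustration of what each piece of the regex matches, the sketch below splits the first bytes of the failing header from the pytest output in this PR into (control byte, text) pairs. It is not how pymedphys decodes the header; `split_on_control_bytes` and `FIELD_LABELS` are hypothetical names, and the labels are taken from the comments in the existing regex.

```python
# First bytes of the failing header, copied from the pytest output in this PR.
sample_header = b"\x1320-09-24 06:29:58 Z\x06+02:00\x0b6_320/6_320\x042325\x00"

# Labels borrowed from the comments in the existing regex.
FIELD_LABELS = ["date", "time zone", "field label and name", "machine name"]


def split_on_control_bytes(data):
    """Return (control_byte, text) pairs, stopping at the first empty text run."""
    pairs = []
    position = 0
    while position < len(data):
        control = data[position]
        position += 1
        end = position
        # Text runs are bytes >= 0x20, matching the [\x20-\xFF] classes in the regex.
        while end < len(data) and data[end] >= 0x20:
            end += 1
        if end == position:
            break
        pairs.append((control, data[position:end]))
        position = end
    return pairs


pairs = split_on_control_bytes(sample_header)
for label, (control, text) in zip(FIELD_LABELS, pairs):
    print(f"{label}: control byte {control:#04x}, text {text!r}")
```

In this one sample, each control byte equals the length of the text that follows it (19, 6, 11, and 4). That is only an observation about a single file, not documented format behaviour.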

@@ -33,7 +33,7 @@ def determine_header_length(trf_contents):
def decode_header(trf_header_contents):
match = re.match(
br"[\x00-\x19]" # start bit
- br"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d Z)" # date
+ br"(\d\d[/-]\d\d[/-]\d\d \d\d:\d\d:\d\d Z)" # date
Collaborator

What other date formats might be expected?
Dutch timestamp format is presumably not really just Dutch, but might include other parts of Western Europe. Which then raises the question regarding what other region formats might be expected. Might Eastern Europe use a different date format? Different places in Asia (Japan, Korea, China)?

Is anyone concerned about rolling over the century mark (previously known as the Y2K problem)?

Further down, the bytes are decoded as ASCII,
but the regex accepts bytes up to \xFF, while ASCII only goes up to \x7F.
So is it really ASCII, or is it Latin-1 (ISO-8859, or something similar)?
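
To make the encoding question concrete: the text groups in the regex accept any byte from \x20 to \xFF, but Python's ASCII codec rejects anything above \x7F, whereas Latin-1 maps every byte value to a character. A minimal sketch with a made-up field name follows; whether TRF headers actually use Latin-1 is unknown, and `raw_field` is purely illustrative.

```python
# Hypothetical field name containing the byte 0xE9; not taken from a real TRF file.
raw_field = b"caf\xe9"

try:
    raw_field.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # Bytes above 0x7F are outside the ASCII range.
    ascii_ok = False

# Latin-1 assigns a character to every byte value, so this cannot fail.
latin1_text = raw_field.decode("latin-1")

print(ascii_ok)  # False
print(latin1_text)  # café
```

Latin-1 never raises on decode, which also means it cannot signal a wrong guess: a file written in another single-byte encoding would decode without error but show the wrong characters.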

Member Author

Fair point about decoding with ASCII; I should possibly be using something else. Would be helpful to have someone create a field with some very weird characters in it and then get the resulting TRF file.

Regarding most of the other points, I am far more comfortable changing the decoding logic only when a file shows up that isn't properly decoded, and then making the change needed to decode that file. Not so keen on trying to predict what possible formats of TRF are out there.

Collaborator

> Not so keen on trying to predict what possible formats of TRF are out there.

Without the official documentation, that's a very fair approach.

Member Author

@SimonBiggs SimonBiggs left a comment

Yup, thanks @sjswerdloff. Let's pull these comments out into a new issue and address them separately to this PR.

@SimonBiggs
Member Author

Before merging, I'm keen for @chrisootes to take the dev version for a spin and make sure it works across his files.

#1093 (comment)

@SimonBiggs SimonBiggs merged commit d15464a into master Oct 21, 2020
@SimonBiggs SimonBiggs deleted the support-dutch-headers branch November 8, 2020 02:35