New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support dutch timestamp within headers #1095
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth taking a look at the detailed comments and having separate issues raised for them if there isn't time to address them right now.
@@ -33,7 +33,7 @@ def determine_header_length(trf_contents): | |||
def decode_header(trf_header_contents): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs docstring
and it would be helpful to have comments that discuss what the various regex pieces were expected to look like, perhaps with an example.
Or a reference to any trf file format documentation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, fair regarding the docstring. Wasn't a part of this PR though so will address this separately.
Regarding the other questions this has been reverse engineered based on the data we have. TRF file format documentation would help, but unfortunately I don't have anything official on that front.
@@ -33,7 +33,7 @@ def determine_header_length(trf_contents): | |||
def decode_header(trf_header_contents): | |||
match = re.match( | |||
br"[\x00-\x19]" # start bit | |||
br"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d Z)" # date | |||
br"(\d\d[/-]\d\d[/-]\d\d \d\d:\d\d:\d\d Z)" # date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what other date formats might be expected?
Dutch timestamp format is presumably not really just Dutch, but might include other parts of Western Europe. Which then raises the question regarding what other region formats might be expected. Might Eastern Europe use a different date format? Different places in Asia (Japan, Korea, China)?
Is anyone concerned about rolling over the century mark (previously known as the Y2K problem)?
Further down, the bytes are being decoded as ascii.
but the regex that got parsed includes up to \xFF, while ascii is only up to \xEF or maybe \xF0.
so it is really ascii or is it Latin1 (or iso-8859, or something similar)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point about decoding with ASCII, I should possibily be using something else for sure. Would be helpful to have someone create a field with some very weird characters in it and then get the resulting TRF file.
Regarding most of the other points, I am far more comfortable only actually making changes to the decoding logic if a file shows up that isn't properly decoded. And then make the changes to appropriately decode said file with a change. Not so keen on trying to predict what possible formats of TRF are out there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not so keen on trying to predict what possible formats of TRF are out there.
Without the official documentation, that's a very fair approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, thanks @sjswerdloff. Let's pull these comments out into a new issue and address them separately to this PR.
@@ -33,7 +33,7 @@ def determine_header_length(trf_contents): | |||
def decode_header(trf_header_contents): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, fair regarding the docstring. Wasn't a part of this PR though so will address this separately.
Regarding the other questions this has been reverse engineered based on the data we have. TRF file format documentation would help, but unfortunately I don't have anything official on that front.
@@ -33,7 +33,7 @@ def determine_header_length(trf_contents): | |||
def decode_header(trf_header_contents): | |||
match = re.match( | |||
br"[\x00-\x19]" # start bit | |||
br"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d Z)" # date | |||
br"(\d\d[/-]\d\d[/-]\d\d \d\d:\d\d:\d\d Z)" # date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point about decoding with ASCII, I should possibily be using something else for sure. Would be helpful to have someone create a field with some very weird characters in it and then get the resulting TRF file.
Regarding most of the other points, I am far more comfortable only actually making changes to the decoding logic if a file shows up that isn't properly decoded. And then make the changes to appropriately decode said file with a change. Not so keen on trying to predict what possible formats of TRF are out there.
Before merging, I'm keen for @chrisootes to take the dev version for a spin and make sure it works across his files. |
See #1093 for the base PR.
The new testing file can be seen over at https://zenodo.org/record/4087961. The new files added are highlighted below:
I ran
pymedphys trf to-csv
on that file without this fix and it failed to decode with an error message of "unexpected header format". After this, it worked as intended. No other baseline or reference decoding results within the testing suite were altered with this change.See the error message from pytest when this new TRF file is included but this change is not made: