suggest test for date format consistency and suggested revision #143

Closed
debpaul opened this issue Jul 2, 2018 · 9 comments

@debpaul

debpaul commented Jul 2, 2018

Hi all, following a DwC Hour observation that dates in a dataset are sometimes MM-DD-YYYY and sometimes DD-MM-YYYY, and so cannot always be distinguished, we suggest a two-part test. See tdwg/dwc#100 for the complete issue description.

  1. test whether the date format is consistent within a given dataset (all MM-DD-YYYY or all DD-MM-YYYY, but not mixed).
  2. give feedback to the provider, noting any clearly mixed datasets where both formats exist.
  3. suggest a change to the standard-compliant, unambiguous YYYY-MM-DD.
  4. this also helps the researcher looking for date issues before using a dataset.
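
The consistency test in the list above could be sketched roughly as follows. This is only an illustration, not the Kurator or TDWG implementation: the function names `classify_date` and `check_consistency` and the category labels are hypothetical, and the sketch assumes values shaped like NN-NN-YYYY or NN/NN/YYYY. A value counts as evidence for one ordering only when the other ordering does not give a real calendar date.

```python
import re
from collections import Counter
from datetime import date

# Assumed input shape: two 1-2 digit numbers and a 4-digit year, split by - or /
_PATTERN = re.compile(r"^(\d{1,2})[-/](\d{1,2})[-/](\d{4})$")


def _is_real_date(year, month, day):
    """True if year-month-day is a valid calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def classify_date(value):
    """Return 'DMY', 'MDY', 'undetermined' (fits both), or 'invalid'."""
    match = _PATTERN.match(value.strip())
    if not match:
        return "invalid"
    first, second, year = (int(g) for g in match.groups())
    dmy = _is_real_date(year, second, first)  # first number read as the day
    mdy = _is_real_date(year, first, second)  # first number read as the month
    if dmy and mdy:
        return "undetermined"  # e.g. 03-06-2018 gives no evidence either way
    if dmy:
        return "DMY"
    if mdy:
        return "MDY"
    return "invalid"


def check_consistency(values):
    """Return (verdict, counts); verdict is 'mixed', 'DMY', 'MDY' or 'unknown'."""
    counts = Counter(classify_date(v) for v in values)
    if counts["DMY"] and counts["MDY"]:
        verdict = "mixed"
    elif counts["DMY"]:
        verdict = "DMY"
    elif counts["MDY"]:
        verdict = "MDY"
    else:
        verdict = "unknown"  # only undetermined or invalid values were seen
    return verdict, dict(counts)
```

For example, `check_consistency(["25-12-2017", "12-25-2017", "03-06-2018"])` gives the verdict `"mixed"`, which is the case that would be reported back to the provider.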

Note @tucotuco has added this test idea to the Kurator repo.

@mobb

mobb commented Jul 2, 2018

Hi all-
I am new to this group, but not to the concepts of checking data. I work with the LTER and EDI (a repository that grew out of LTER). We have checks for EML-based datasets, including one for datetimes. Here is a link directly to that material (related to Deb's comment); the rest should be browsable from there, if you are interested. https://github.com/EDIorg/ECC/blob/master/practices/dateTimeFormatString/best_practice.md

@ArthurChapman
Collaborator

ArthurChapman commented Jul 3, 2018

Thanks Deb for raising this again. Ambiguous dates are a real problem. There are a number of ways of helping to decide which of the possible readings may be the correct one. In addition to your list above, you can make some reasonable guesses - if it is an Australian dataset you can be pretty sure it is going to be DD-MM-YYYY, but if it is North American (USA or Canada) it is a lot more difficult. See some of the discussion under Issue #86: where a date is ambiguous, we have suggested a range. There are also a number of other DATE/TIME tests - if you click on the TIME label, they will come up as a group.
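
One way to picture "suggesting a range" for an ambiguous value is sketched below. This is an assumption about how such a range might be built, not the specification agreed under #86: the helper name `candidate_range` is hypothetical, and the sketch simply spans the earliest and latest valid readings as an ISO 8601 interval.

```python
from datetime import date


def candidate_range(first, second, year):
    """Return an ISO 8601 date or interval covering every valid reading.

    `first` and `second` are the two leading numbers of a verbatim date,
    in the order they were written; either one may be the month.
    """
    candidates = set()
    for month, day in ((first, second), (second, first)):
        try:
            candidates.add(date(year, month, day))
        except ValueError:
            pass  # this reading is not a real calendar date
    if not candidates:
        return None
    earliest, latest = min(candidates), max(candidates)
    if earliest == latest:
        return earliest.isoformat()  # only one reading survives
    return f"{earliest.isoformat()}/{latest.isoformat()}"
```

With this sketch, `candidate_range(3, 6, 2018)` returns `"2018-03-06/2018-06-03"`, while `candidate_range(25, 12, 2017)` returns the single date `"2017-12-25"`.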

@ArthurChapman
Collaborator

Thanks Mobb for the link. Our problem is not with the ISO format per se - it is interpreting ambiguous dates in datasets where it may be 3rd June or 6th March - e.g. where it is written in a Verbatim field as 03/06/2018.

@mobb

mobb commented Jul 3, 2018

@ArthurChapman, that was the issue for us as well: a data value alone cannot be interpreted, and at a minimum, interpretation requires some metadata. I thought you might be interested in seeing what other EML-based systems were doing in the area of data checking.

There are some significant differences between our system and TDWG's:
Our data packages all use entity-level EML, where each data column's metadata contains more than simply a column name (e.g., several forms of typing). For datetimes, EML metadata requires a format string that ostensibly matches the data value, and our datetime checks are designed to confirm that.
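
A check of that kind might look roughly like the sketch below. It is not the EDI/ECC implementation: the function names `to_strptime` and `mismatches` are hypothetical, and the token mapping is an assumption that covers only the `YYYY`, `MM` and `DD` tokens of a declared format string.

```python
from datetime import datetime

# Assumed, deliberately incomplete mapping from format-string tokens
# to Python strptime directives (date-only tokens).
_TOKENS = (("YYYY", "%Y"), ("MM", "%m"), ("DD", "%d"))


def to_strptime(format_string):
    """Translate a declared format such as 'YYYY-MM-DD' to '%Y-%m-%d'."""
    pattern = format_string
    for token, directive in _TOKENS:
        pattern = pattern.replace(token, directive)
    return pattern


def mismatches(values, format_string):
    """Return the values that do not parse under the declared format.

    Note: strptime also accepts months and days without a leading zero,
    so this check is more lenient than a strict character-by-character match.
    """
    pattern = to_strptime(format_string)
    bad = []
    for value in values:
        try:
            datetime.strptime(value, pattern)
        except ValueError:
            bad.append(value)
    return bad
```

For instance, `mismatches(["2018-07-03", "03/06/2018", "2018-13-01"], "YYYY-MM-DD")` returns `["03/06/2018", "2018-13-01"]`. A value such as `03/06/2018` parses under both `DD/MM/YYYY` and `MM/DD/YYYY`, which is why the declared metadata, not the value, has to settle the reading.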

However, in a DC archive (the format I imagine TDWG is interested in interpreting), the vocabularies are external, so your solution will be different, and probably involves interpreting that external metadata (analogous to the way you might confirm a species binomial from the taxonomic_id + system_name in related fields)

@cgendreau
Contributor

A mix of MM-DD-YYYY and DD-MM-YYYY in verbatimEventDate within the same dataset is possible and not necessarily wrong. I would make it explicit here that the target is eventDate.

@debpaul
Author

debpaul commented Jul 3, 2018 via email

@Tasilee
Collaborator

Tasilee commented Aug 14, 2018

Do we agree that this issue is covered (as best as we can for now) by #61, #66, #67, #84, #86 and #141?

@ArthurChapman
Collaborator

The original discussion was really about methodology: how one could determine some of the difficult cases, for example ambiguous dates, by looking in different places to resolve possible ambiguities. I think our #86 handles this for now, and if someone needs to resolve it further they could try different options for resolving the problem. I don't think we can hard-wire those methodologies in at present - maybe with more coding in the future (I see that @tucotuco has added it to the Kurator repo) - or through the Profiles. I think it can be closed for now.

@tucotuco
Member

tucotuco commented Aug 14, 2018 via email

@Tasilee Tasilee closed this as completed Aug 14, 2018