Add 008 field in for machine-processable dates #1151

Open
ahankinson opened this issue Oct 21, 2021 · 9 comments

@ahankinson
Contributor

ahankinson commented Oct 21, 2021

Edit: The decision (10.11.2021) is to implement this as an 008 field; see the discussion below.


For RISM Online, we are parsing the 260 $c statement to try and extract numeric dates for sources so that we can do proper date range searches (e.g., "Find me sources between year XXXX and year YYYY"). In order to do ranges, it is a requirement that we use numeric data so that arithmetic can be done in the search system.

The 260 $c field is uncontrolled, which makes it difficult to extract dates when attempting to parse this to numbers. We are using some advanced heuristics, but problems remain, and we are reaching the point where correcting some dates causes other corrections to start failing.

Some examples of various systemic problems include:

  • data entry typos, e.g., 171784-1799, 1846-01-05{8'
  • use of natural language(s), e.g., Ende 17. Jh., 23 gennaio 1973
  • non-Arabic numerals, e.g., Año IX
  • multiple ways of writing 'no date', e.g., [s/d/], [s..d], (n.d.)
  • ambiguous number formats, e.g., 05-1836-1853, 1845-1874-03-30
  • possible direct transcriptions from the source, e.g., "BONN 14.XI 79", " 23.X.76| Reiner Bredemeyer"

... and many, many others. This is leading to many problems in RISM Online, where we have extreme dates (e.g., 1871 BC, or 171784 CE).
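
To illustrate why free-text parsing is fragile, here is a minimal sketch. It is not the actual RISM Online heuristics: the helper name `plausible_years` and the 1000 to 2100 window are assumptions chosen only for illustration. It pulls runs of exactly four digits out of a 260 $c string and drops anything outside the window.

```python
import re
from typing import List

# Assumed plausibility window; the real bounds would be a project decision.
MIN_YEAR, MAX_YEAR = 1000, 2100

# Exactly four digits that are not part of a longer run of digits.
YEAR_RE = re.compile(r"(?<!\d)\d{4}(?!\d)")


def plausible_years(statement: str) -> List[int]:
    """Return the four-digit numbers in `statement` that fall inside the window."""
    years = [int(match) for match in YEAR_RE.findall(statement)]
    return [year for year in years if MIN_YEAR <= year <= MAX_YEAR]


for example in ["171784-1799", "Ende 17. Jh.", "05-1836-1853", "23 gennaio 1973"]:
    print(repr(example), plausible_years(example))
```

Even this cautious version silently loses the mistyped start year in `171784-1799` and returns nothing for `Ende 17. Jh.`, so each extra rule added to recover one case risks breaking another.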

It would be really useful if we had a field that could handle numeric-only dates, and validate these values. I recognize that this is likely not the goal for 260 $c, since cataloguers will need the flexibility to capture a date statement as written.

I would like to request that we re-introduce the 033 field (Date/Time and Place of an Event), but with the requirement that it contains a standard formatted date. The MARC specification includes a format for how incomplete dates should be encoded as well. This would be restricted to allowing only yyyymmdd values.

Validation on this field would restrict the values to only allow digits and the - character. While this field would be optional, cataloguers would be encouraged to fill this field in with a value if any datable evidence for the source allows.

The $a on the 033 field is repeatable, allowing for a single date or a range of dates. We should restrict this to no more than two -- a "start" (or a single date) and an optional "end" for dates that may be a range.

In the case of a single date, the first field indicator should be a 0. In the case that the optional end date is provided, the first field indicator should be a 2.

A few examples:

Single date, no month or day
033 0#$a1879----

November, 1973
033 0#$a197311--

November 11, 1973
033 0#$a19731111

1685 to 1750
033 2#$a1685----$a1750----

March, 1685 to 1750
033 2#$a168503--$a1750----

Sometime in the 1860s
033 2#$a1860----$a1869----

Sometime in the 1700s
033 2#$a1700----$a1799----

A span of 300 years
033 2#$a1200----$a1500----

None of the hyphens should be omitted in the MARC source, but they could optionally be omitted in the data entry field.
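
The validation described above could be sketched roughly as follows. The function names are hypothetical, and the pattern is an assumption inferred from the examples in this comment (a four-digit year, then two digits or two hyphens each for month and day); it is not taken from Muscat.

```python
import re
from typing import List, Optional, Tuple

# Year is mandatory. Month and day are two digits or "--";
# a day is only accepted when the month is also given.
DATE_RE = re.compile(r"(\d{4})(?:(\d{2})(\d{2}|--)|----)")


def parse_date(value: str) -> Optional[Tuple[int, Optional[int], Optional[int]]]:
    """Return (year, month, day), using None for missing parts, or None if invalid."""
    match = DATE_RE.fullmatch(value)
    if match is None:
        return None
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else None
    day = int(match.group(3)) if match.group(3) not in (None, "--") else None
    # Coarse range checks only; this does not reject e.g. 30 February.
    if month is not None and not 1 <= month <= 12:
        return None
    if day is not None and not 1 <= day <= 31:
        return None
    return (year, month, day)


def validate_field(indicator: str, subfields: List[str]) -> bool:
    """First indicator 0 expects one $a (single date); 2 expects two (range)."""
    expected = {"0": 1, "2": 2}.get(indicator)
    if expected != len(subfields):
        return False
    return all(parse_date(value) is not None for value in subfields)
```

For instance, `validate_field("2", ["168503--", "1750----"])` passes, while `validate_field("0", ["171784"])` is rejected because the value is not eight characters of digits and hyphens in the expected positions.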

@jenniferward
Contributor

I agree that an encoded date field is necessary, but 033 isn't exactly correct for this. The 033 is not for the date of creation of an item; rather, it encodes what is stated in the 518 field, which is for performance notes. In libraries that is generally when a CD was recorded; in our context it is generally when a score was used in a performance (somewhat rare). See here for an explanation of standard MARC practice from Yale: https://web.library.yale.edu/cataloging/music/033field

The sort of encoded date that we need is recorded in the 008 and this information is generally derived from the 260 (264), so a connection between the two fields is an established one in MARC.
https://www.loc.gov/marc/bibliographic/bd008a.html
We would need position 06 to tell us what kind of date(s) it is, then position 07-10 for the first year and 11-14 for the second year.

There are several options for position 06 but I think we can limit it to:
s = single date
q = questionable date (for estimates)
i = known range of dates for a collection

Position 06 includes the possibility of detailed dates that include months and days (e, "detailed date"), but I don't think this level of granularity is needed for retrieval - as Andrew says, we just want "between year XXXX and year YYYY". So I suggest skipping months and days.

Unknowns are recorded with u.

So adapting and expanding on the examples above, in the 008 it would look like this (with some links to the Princeton catalog, for example of usage in libraries):
Single year, 1739
008/06: s
008/07-10: 1739
https://catalog.princeton.edu/catalog/9935454763506421/staff_view

Circa 1745
008/06: s
008/07-10: 1745
https://catalog.princeton.edu/catalog/9935507213506421/staff_view

Possibly 1740?
008/06: s
008/07-10: 1740
https://catalog.princeton.edu/catalog/9933418353506421/staff_view

Before 1748
008/06: q
008/07-10: 1uuu
008/11-14: 1748

After 1748
008/06: q
008/07-10: 1748
008/11-14: 1uuu

Note: Princeton encodes just a single year in before/after statements; not sure if we want that:
"Not before 1771": s1771 https://catalog.princeton.edu/catalog/9971601333506421/staff_view
"after 1774": s1774 https://catalog.princeton.edu/catalog/99107566203506421/staff_view

Middle of the 18th century using RISM standard 1740-1760
008/06: q
008/07-10: 1740
008/11-14: 1760

November 11, 1973
008/06: s
008/07-10: 1973

1750 to 1799 (estimated by the cataloger)
008/06: q
008/07-10: 1750
008/11-14: 1799
https://catalog.princeton.edu/catalog/9935470083506421/staff_view

1738 to 1743 (based on evidence in the source)
008/06: i
008/07-10: 1738
008/11-14: 1743
https://catalog.princeton.edu/catalog/9935399813506421/staff_view

Sometime in the 1740s
008/06: s
008/07-10: 174u

Between 1740 and 1749
008/06: q
008/07-10: 1740
008/11-14: 1749
https://catalog.princeton.edu/catalog/9980042393506421/staff_view

Sometime in the 1800s
008/06: s
008/07-10: 18uu
https://catalog.princeton.edu/catalog/99104392823506421/staff_view

No date
008/06: q
008/07-10: 1600
008/11-14: 1900 (see Comments)

A span of 300 years
008/06: q
008/07-10: 12uu
008/11-14: 15uu (see Comments)

Comments:
The span of 300 years does not come out as clearly in the last example. I'd be open to having a local RISM practice that uses 1200 / 1500 instead of 12uu / 15uu.
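
To make the difference concrete, here is a rough sketch of how a search index might expand a year containing `u` into the earliest and latest years it could stand for. The helper names are hypothetical, and reading `u` as "any digit from 0 to 9" is an assumption about how such values would be indexed.

```python
from typing import Tuple


def year_bounds(year: str) -> Tuple[int, int]:
    """Expand a four-character 008 year such as '174u' into (earliest, latest)."""
    if len(year) != 4 or any(char not in "0123456789u" for char in year):
        raise ValueError(f"not a valid 008 year: {year!r}")
    earliest = int(year.replace("u", "0"))
    latest = int(year.replace("u", "9"))
    return earliest, latest


def range_bounds(date1: str, date2: str) -> Tuple[int, int]:
    """Widest span covered by a Date 1 / Date 2 pair."""
    return year_bounds(date1)[0], year_bounds(date2)[1]


print(range_bounds("12uu", "15uu"))  # expands to 1200 through 1599
print(range_bounds("1200", "1500"))  # stays at 1200 through 1500
```

Under that reading, 12uu / 15uu covers 1200 to 1599 rather than the intended 300-year span, whereas 1200 / 1500 expresses it exactly.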

In my opinion this field should be required, otherwise people won't fill it out. We really need to encourage people to date their sources more often. It is comfortable to say "s.d." but surely we can figure out at least reasonable centuries based on archival context, institutional history, composer life dates, etc. Perhaps we can come up with standardized estimates that people can apply to make it feel less rigid, for example 1600-1900 or so.

@ahankinson
Contributor Author

ahankinson commented Nov 1, 2021

008 looks OK. It also allows BCE dates, which the 033 does not. 008 has the disadvantage that it also encodes a lot of other material (Place of publication, Language, etc.) while 033 is specifically date and time.

I'm not completely convinced by the Yale application note; the MARC21 description just says it's a date/time of an event, with no additional semantics. The examples in the documentation seem to indicate that it can be for any material. There are also at least two examples where a 033 does not have a corresponding 518. So I don't think those should be hard-and-fast reasons to not use it.

For 008, I'm not particularly crazy about using 9999. I can't see a case for encoding this sort of date -- we have no serials or things that have not ceased publication. So I don't think that should be an option. The latest possible date should be the current year.

I'm also not particularly keen on showing uu to the users for unknown parts of a date. A dash is a neutral space indicator; u implies some semantics (e.g., "unknown"), which may give some cataloguers pause. ("Is it really unknown? How do I know if it's unknown?")

I envision a field with a fixed number of spaces indicated by dashes. Typing in the field would fill the spaces from left to right; no more, no less. The field would not allow any more than four digits; backspacing would clear a digit and replace it with a dash.

We can, of course, transform the dashes to u behind the scenes if we wanted to stick with the MARC spec when we store it. (Tangentially, I wish the MARC people would sort out their date-time format specs... it seems like there's a different standard for each field!).

We also need to be clear that 12-- does not mean 1300-1399.

There is a difference between 12-- and 1200 -- the latter is a certain date, the former makes no claims on the fixed year. Numerically we would transform it to 1200, but we could also analyze it for extra data like "is an uncertain date" if it contains the uncertainty characters.
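
A minimal sketch of the three transformations described in this comment follows; all function names are hypothetical and nothing here reflects existing Muscat code. It covers padding the typed digits with dashes, swapping dashes for `u` on storage, and deriving a numeric year plus an uncertainty flag for search.

```python
from typing import Tuple


def fill_year(typed: str) -> str:
    """Left-align up to four typed digits and pad the remainder with dashes."""
    if any(char not in "0123456789" for char in typed):
        raise ValueError("only digits may be typed")
    if len(typed) > 4:
        raise ValueError("at most four digits are allowed")
    return typed.ljust(4, "-")


def to_marc(year: str) -> str:
    """Swap the neutral dashes for MARC's 'u' before storing."""
    return year.replace("-", "u")


def to_numeric(year: str) -> Tuple[int, bool]:
    """Return (searchable year, is_uncertain) for a dash-padded year."""
    return int(year.replace("-", "0")), "-" in year


entry = fill_year("12")
print(entry, to_marc(entry), to_numeric(entry))
```

Here `12--` and `1200` both map to the number 1200, but only the former carries the uncertainty flag, which keeps the distinction available to the search system.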

We could make it a guideline policy that the field was required, but we would still need it to be optional when saving a record, because people editing a record would need to be able to save it without filling in a date for an item they don't have to hand.

ahankinson changed the title from "Add 033 field in for machine-processable dates" to "Add 008 field in for machine-processable dates" on Nov 10, 2021
@ahankinson
Contributor Author

At the 10 November meeting we agreed to use the 008 field for storing machine-readable dates, so I think we can move forward with this. I'll change the title and description above to avoid confusion.

One thing left to decide is how to enter the value for position 06 of the 008 field.

@jenniferward
Contributor

I was thinking the cataloger would enter the 06 (see above for my reduced list) from a dropdown list but now I wonder if we could simplify this in the Muscat application of 008 and agree to just 2 choices, perhaps:
s = single date
i = range of dates

Would it be possible for an 06 to be added automatically by Muscat, based on whether a single year or a range of years is entered?
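
In principle this looks straightforward. The sketch below uses a hypothetical `dates_008` function (not Muscat code) and assumes the two-code scheme proposed here; it derives position 06 from whether a second year was entered. The blank Date 2 for a single date follows the MARC 21 convention for code `s`.

```python
from typing import Optional


def dates_008(first: str, second: Optional[str] = None) -> str:
    """Return the nine characters for 008/06-14 from one or two entered years."""
    for year in (first, second):
        if year is None:
            continue
        if len(year) != 4 or any(char not in "0123456789u" for char in year):
            raise ValueError(f"not a valid 008 year: {year!r}")
    if second is None:
        # Single date: Date 2 (positions 11-14) is left blank.
        return "s" + first + "    "
    # Two years entered: treat as a range.
    return "i" + first + second


print(repr(dates_008("1739")))
print(repr(dates_008("1738", "1743")))
```

The cataloguer would then never see position 06 at all, at the cost of no longer being able to distinguish an estimated range (`q`) from an evidenced one (`i`).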

@ahankinson
Contributor Author

I recently came across the 046 field, which seems like it might be a better fit for this.

The advantages of 046 over 008 are:

  • 008 only accepts 4-digit years, while 046 can accept full dates
  • 008 mixes a number of other material descriptions, while 046 is dedicated to the dates only
  • 046 is specifically designed to represent dates that 008 cannot handle, including BCE dates
  • 046 can accept EDTF, which is a formalized method of encoding date uncertainty; if we don't want to accept EDTF, we can also specify the date standard (ISO8601, etc.) in $2. This means we can validate the input.

@HirschSt
Contributor

@ahankinson
Contributor Author

Yes; the idea would be to actually add the dates of the record in there, rather than the 'created' date.

@jenniferward
Contributor

Sure! Thanks for spotting it. I don't see the same range of current applications in library catalogs as for the 008 (I checked Princeton above, Northwestern, and the DNB and found only a few uses, so I can't generalize), but the 046 is clear enough, and our application of it would be clear enough, that I don't foresee any misunderstandings.

@ahankinson
Contributor Author

Keeping well-structured data in 046 could mean that we could automatically add it (or an extract of it) to the 008, but I don't think we could go the other way around.
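
A simplified sketch of that one-way extraction is given below. It assumes the 046 value is a plain ISO 8601 style date (YYYY, YYYY-MM, or YYYY-MM-DD) or two such dates separated by `/`. The function name is hypothetical, and EDTF uncertainty markers and BCE years are deliberately not handled.

```python
import re

# Only the plain calendar-date subset is accepted in this sketch.
DATE_RE = re.compile(r"\d{4}(?:-\d{2}(?:-\d{2})?)?")


def extract_008_dates(value: str) -> str:
    """Reduce a structured date or date interval to the 008/06-14 characters."""
    parts = value.split("/")
    if len(parts) > 2 or not all(DATE_RE.fullmatch(part) for part in parts):
        raise ValueError(f"unsupported date value: {value!r}")
    years = [part[:4] for part in parts]  # month and day are discarded here
    if len(years) == 1:
        return "s" + years[0] + "    "
    return "i" + years[0] + years[1]


print(repr(extract_008_dates("1973-11-11")))
print(repr(extract_008_dates("1685-03/1750")))
```

Because the month and day are thrown away in the extraction, the original value cannot be rebuilt from the 008, which is why the conversion only works in this direction.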
