Create relaton-data-nist #53

Closed
ronaldtse opened this issue Jul 5, 2021 · 30 comments

@ronaldtse
Contributor

ronaldtse commented Jul 5, 2021

There are two kinds of NIST bibdata:

We should synchronise this information daily into relaton-data-nist for easy citation.

For relaton-nist, if a document is found in the former, use it. Otherwise, search in the latter set.

@ronaldtse ronaldtse added the enhancement New feature or request label Jul 5, 2021
@ronaldtse ronaldtse added this to TRIAGE in Andrei Kislichenko via automation Jul 5, 2021
@ronaldtse
Contributor Author

Related to usnistgov/NIST-Tech-Pubs#1

@ronaldtse ronaldtse moved this from TRIAGE to High priority in Andrei Kislichenko Jul 7, 2021
@andrew2net andrew2net moved this from High priority to In progress in Andrei Kislichenko Jul 18, 2021
@andrew2net
Contributor

andrew2net commented Jul 18, 2021

@ronaldtse what should the references for those documents be? For example, the first document has citation-id 78696207 and report-number NBS BH 1. Should we cite it as "NIST 78696207" or as "NIST NBS BH 1"?

<body>
   <query key="BH">
      <doi type="report-paper_title">10.6028/NBS.BH.1</doi>
      <crm-item name="publisher-name" type="string">National Institute of Standards and Technology (NIST)</crm-item>
      <crm-item name="prefix-name" type="string">National Institute of Standards and Technology</crm-item>
      <crm-item name="member-id" type="number">4068</crm-item>
      <crm-item name="citation-id" type="number">78696207</crm-item>
      <crm-item name="book-id" type="number">2050209</crm-item>
      <crm-item name="deposit-timestamp" type="number">201511031134</crm-item>
      <crm-item name="owner-prefix" type="string">10.6028</crm-item>
      <crm-item name="last-update" type="date">2018-03-06T09:55:24Z</crm-item>
      <crm-item name="created" type="date">2015-11-04T17:31:05Z</crm-item>
      <crm-item name="citedby-count" type="number">0</crm-item>
      <doi_record>
         <report-paper>
            <report-paper_metadata language="en">
               <contributors>
                  <person_name sequence="first" contributor_role="author">
                     <given_name>Ira H</given_name>
                     <surname>Woolson</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Edwin H</given_name>
                     <surname>Brown</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>John A</given_name>
                     <surname>Newlin</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>William K</given_name>
                     <surname>Hatt</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Ernest J</given_name>
                     <surname>Russell</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Rudolph P</given_name>
                     <surname>Miller</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Joseph R</given_name>
                     <surname>Worcester</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Frank P</given_name>
                     <surname>Cartwright</surname>
                  </person_name>
               </contributors>
               <titles>
                  <title>Recommended minimum requirements for small dwelling construction :</title>
                  <subtitle>report of Building Code Committee July 20, 1922</subtitle>
               </titles>
               <edition_number>0</edition_number>
               <publication_date media_type="online">
                  <year>1923</year>
               </publication_date>
               <publisher>
                  <publisher_name>National Bureau of Standards</publisher_name>
                  <publisher_place>Gaithersburg, MD</publisher_place>
               </publisher>
               <institution>
                  <institution_name>National Bureau of Standards</institution_name>
                  <institution_acronym>NBS</institution_acronym>
                  <institution_place>Gaithersburg, MD</institution_place>
               </institution>
               <publisher_item>
                  <item_number item_number_type="report-number">NBS BH 1</item_number>
               </publisher_item>
               <doi_data>
                  <doi>10.6028/NBS.BH.1</doi>
                  <resource>https://nvlpubs.nist.gov/nistpubs/Legacy/BH/nbsbuildinghousing1.pdf</resource>
               </doi_data>
            </report-paper_metadata>
         </report-paper>
      </doi_record>
   </query>
...

@ronaldtse
Contributor Author

@andrew2net the proper citation document identifier is "NBS BH 1" in this case.

NBS is the predecessor of NIST, so:

  • "NIST NBS BH 1" is incorrect
  • "NBS BH 1" is correct.

We can actually take a hint from this:

      <doi type="report-paper_title">10.6028/NBS.BH.1</doi>

The IDs that look like integers are clearly machine-generated and probably not meant for human citation.
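
A minimal Python sketch of that rule, assuming the record layout shown in the excerpt above: take the report-number as the citation identifier and ignore the numeric citation-id. The function name `citation_identifier` and the DOI fallback are illustrative assumptions, not part of relaton-nist.

```python
import xml.etree.ElementTree as ET

# Trimmed-down record that reuses the element names from the excerpt above.
SAMPLE = """
<query key="BH">
  <doi type="report-paper_title">10.6028/NBS.BH.1</doi>
  <crm-item name="citation-id" type="number">78696207</crm-item>
  <doi_record>
    <report-paper>
      <report-paper_metadata language="en">
        <publisher_item>
          <item_number item_number_type="report-number">NBS BH 1</item_number>
        </publisher_item>
      </report-paper_metadata>
    </report-paper>
  </doi_record>
</query>
"""


def citation_identifier(query: ET.Element) -> str:
    """Prefer the report-number; otherwise derive an ID from the DOI suffix."""
    item = query.find(".//item_number[@item_number_type='report-number']")
    if item is not None and item.text:
        return item.text.strip()
    # Assumed fallback (not confirmed in the thread): "10.6028/NBS.BH.1" -> "NBS BH 1".
    doi = query.findtext("doi", default="")
    return doi.split("/", 1)[-1].replace(".", " ")


print(citation_identifier(ET.fromstring(SAMPLE)))  # -> NBS BH 1
```

The numeric citation-id is never consulted, which matches the point that it is not meant for human citation.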

@andrew2net
Contributor

@ronaldtse NBS IR 87-363 contains "error:" values. Maybe NIST should know about it?

               <publisher>
                  <publisher_name>error:</publisher_name>
                  <publisher_place>Gaithersburg, MD</publisher_place>
               </publisher>
               <institution>
                  <institution_name>error:</institution_name>
                  <institution_acronym>error:</institution_acronym>
                  <institution_place>Gaithersburg, MD</institution_place>
               </institution>
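
Records like this could be flagged automatically before they reach the dataset. The following is a small Python sketch, assuming the literal text "error:" is the only marker to look for; `fields_with_error` is a hypothetical helper, and the sample wraps the two elements above in their `report-paper_metadata` parent.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<report-paper_metadata>
  <publisher>
    <publisher_name>error:</publisher_name>
    <publisher_place>Gaithersburg, MD</publisher_place>
  </publisher>
  <institution>
    <institution_name>error:</institution_name>
    <institution_acronym>error:</institution_acronym>
    <institution_place>Gaithersburg, MD</institution_place>
  </institution>
</report-paper_metadata>
"""


def fields_with_error(metadata: ET.Element) -> list:
    """Return the tags of all elements whose text is the placeholder "error:"."""
    return [el.tag for el in metadata.iter() if (el.text or "").strip() == "error:"]


print(fields_with_error(ET.fromstring(FRAGMENT)))
# -> ['publisher_name', 'institution_name', 'institution_acronym']
```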

@ronaldtse
Contributor Author

Yes! @andrew2net can you file a new issue here?

@andrew2net
Contributor

andrew2net commented Jul 20, 2021

@ronaldtse the source contains relations with DOI-type identifiers. Can we use the DOI ID as a formattedref?

<related_item>
  <intra_work_relation relationship-type="replaces" identifier-type="doi">10.6028/NIST.SP.1108r3</intra_work_relation>
</related_item>
<related_item>
  <intra_work_relation relationship-type="isVersionOf" identifier-type="doi">10.6028/NIST.SP.1108</intra_work_relation>
</related_item>
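
For illustration, these relations can be read out as (relationship type, DOI) pairs with a few lines of Python. This is only a sketch: the `<wrapper>` root element and the `doi_relations` name are assumptions added to make the fragment parseable, not names from the source file.

```python
import xml.etree.ElementTree as ET

# <wrapper> is a hypothetical root; the related_item elements are from the excerpt above.
FRAGMENT = """
<wrapper>
  <related_item>
    <intra_work_relation relationship-type="replaces" identifier-type="doi">10.6028/NIST.SP.1108r3</intra_work_relation>
  </related_item>
  <related_item>
    <intra_work_relation relationship-type="isVersionOf" identifier-type="doi">10.6028/NIST.SP.1108</intra_work_relation>
  </related_item>
</wrapper>
"""


def doi_relations(parent: ET.Element) -> list:
    """Collect (relationship-type, DOI) pairs for relations identified by DOI."""
    return [
        (rel.get("relationship-type"), (rel.text or "").strip())
        for rel in parent.iter("intra_work_relation")
        if rel.get("identifier-type") == "doi"
    ]


for relation_type, doi in doi_relations(ET.fromstring(FRAGMENT)):
    print(relation_type, doi)
```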

@ronaldtse
Contributor Author

  1. We can use the doi ID as input to formattedref.
  2. doi is not the formattedref.

Metanorma already implements the new NIST PubID scheme, which has defined transforms from machine-readable IDs to:

  • human readable IDs (the formattedref)
  • DOI IDs

And we need to parse these old DOIs back to PubID.

So we need to extract that code out from metanorma-nist:
metanorma/pubid-nist#1

Then we can re-use that in relaton-nist.

@andrew2net
Contributor

@ronaldtse there are documents like NBS.BMS.140e2. It looks like it's a second edition but the document contains

<edition_number>0</edition_number>

should we ignore the edition_number tag if there is an edition in ID?
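
The rule proposed in this question could look like the Python sketch below. The thread does not confirm this behaviour; `edition_from_id` is a hypothetical helper, and treating a trailing "e" plus digits as the edition is an assumption based only on the NBS.BMS.140e2 example.

```python
import re
from typing import Optional


def edition_from_id(doi_suffix: str, edition_number: str) -> Optional[str]:
    """Prefer an edition encoded in the ID; otherwise use a non-zero edition_number."""
    match = re.search(r"e(\d+)$", doi_suffix)
    if match:
        return match.group(1)
    # Assumption: "0" (as in the record above) means "no edition recorded".
    return edition_number if edition_number not in ("", "0") else None


print(edition_from_id("NBS.BMS.140e2", "0"))  # -> 2
print(edition_from_id("NBS.BH.1", "0"))       # -> None
```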

@ronaldtse
Contributor Author

@andrew2net usnistgov/NIST-Tech-Pubs#1 has been fixed, can you help update the location of the XML file? Thanks.

@ronaldtse
Contributor Author

Issue #53 (comment) is posted in #55.

Can we close this ticket?

@andrew2net
Contributor

@ronaldtse no, the relaton-data-nist isn't ready. It needs to convert DOI IDs to PubIDs to be able to reference the documents. But the DOI IDs in the source aren't the same as MR IDs. I have many questions about how to map parts of DOI IDs to PubIDs. I'll ask you later. Have a lot of other tasks to finish.

@andrew2net
Contributor

Also, we need to move documents from the https://csrc.nist.gov/CSRC/media/feeds/metanorma/pubs-export.zip file to this repo to solve a problem similar to relaton/relaton-calconnect#11

@ronaldtse
Contributor Author

@andrew2net sure, let's merge the bibdata from CSRC into this collection.

@andrew2net
Contributor

@ronaldtse the source has some DOI identifiers that need clarification on how they should be mapped to PubID:

  1. NBS.CIRC.15-April1909 - is this docnumber 15 and update-date April 1909?
  2. NBS.CIRC.25insert - what does the insert mean in this reference? How should it be mapped to PubID?
  3. NBS.CIRC.25sup-1924, NBS.CIRC.398sup1937, NBS.CIRC.154suprev, NBS.HB.28supp1949 - What is the sup? Is supp the same as sup?
  4. NBS.CIRC.488sec1 - How should the sec be mapped to PubID?
  5. NBS.CIRC.54index, NBS.NSRDS.63indx - index and indx?
  6. NBS.CIRC.74errata - errata?
  7. NBS.CRPL.1-2_3-1, NBS.CRPL.1-2_3-1A, NBS.CRPL.4-m-5, NBS.CRPL.c4-4 - Are the 1-2_3-1, 1-2_3-1A, 4-m-5, c4-4 docnumbers or docnumbers with parts?
  8. NBS.FIPS.100-1-1991 - is this part 1 and update-date 1991?
  9. NIST.IR.6867es - es?
  10. NIST.IR.7297c - c?
  11. NIST.IR.8115chi - chi?
  12. NIST.IR.8115viet - viet?
  13. NIST.IR.8178port - port?
  14. NIST.NCSTAR.1-1av1, NCSTAR.1-1cv1, NIST.NCSTAR.1-2bv1 - av, cv, bv?
  15. NIST.SP.1011-I-2.0 - is 1011-I-2.0 a docnumber?
  16. NIST.SP.1075-NCNR - NCNR?
  17. NIST.SP.800-131Ar1 - Ar?
  18. NIST.SP.800-28ver2 - Is ver a version? How should it be mapped to PubID?
  19. NIST.SP.800-38a-add - add?
  20. NIST.SP.800-57pt1r4 - pt?
  21. NIST.SP.801-errata - errata?
  22. NIST.SP.955.Suppl - Suppl?
  23. NIST.AMS.300-8r1/upd, NIST.IR.8115r1-upd - upd?

@ronaldtse
Contributor Author

ronaldtse commented Aug 17, 2021

  1. NBS.CIRC.15-April1909 - is this docnumber 15 and update-date April 1909?

https://nvlpubs.nist.gov/nistpubs/Legacy/circ/nbscircular15-April1909.pdf


This is NBS CIRC ("Circular") No. 15. Yes docnumber=15, series CIRC/Circular, date=1909-04.

  • The original long citation form is "Circ. Nat. Bur. Std, No. 15"
  • In PubID:
    • Full form "National Bureau of Standards Bureau Circular No. 15."
    • Abbreviated form "Nat. Bur. Std. Circ. No. 15."
    • Short form "NBS CIRC 15"
    • MR form "NBS.CIRC.15"
  2. NBS.CIRC.25insert - what does the insert mean in this reference? How should it be mapped to PubID?

I think insert means that it's an "included document" inside another document.

In this case, it is an "insert" of NBS CIRC 25. The "ins" part can be treated as being in the same category as "supplement": just as we can have "Supplement 1", we can have "Insert 1".

https://www.govinfo.gov/app/details/GOVPUB-C13-45974defbd2f3d7ab324bcd3506831b7


  • In PubID:
    • Full form "National Bureau of Standards Bureau Circular No. 25. Insert"
    • Abbreviated form "Nat. Bur. Std. Circ. No. 25. Ins."
    • Short form "NBS CIRC 25 Ins"
    • MR form "NBS.CIRC.25.i1"
  3. NBS.CIRC.25sup-1924, NBS.CIRC.398sup1937, NBS.CIRC.154suprev, NBS.HB.28supp1949 - What is the sup? Is supp the same as sup?

"sup" and "supp" probably mean Supplement. Supplement is a supported type.

  4. NBS.CIRC.488sec1 - How should the sec be mapped to PubID?

"sec" is Section. Treat it as similar to "Part", where we can have "Part 1" (pt1), we can have "Section 1" (sec1).

  • In PubID:
    • Full form "National Bureau of Standards Bureau Circular No. 488 Section 1"
    • Abbreviated form "Nat. Bur. Std. Circ. No. 488. Section 1"
    • Short form "NBS CIRC 488 Section 1"
    • MR form "NBS.CIRC.488.sec1"
  5. NBS.CIRC.54index, NBS.NSRDS.63indx - index and indx?

Both mean "index". Treat it like Supplement and Insert.

  • In PubID:
    • Full form "National Bureau of Standards Bureau Circular No. 54 Index"
    • Abbreviated form "Nat. Bur. Std. Circ. No. 54. Index"
    • Short form "NBS CIRC 54 Index"
    • MR form "NBS.CIRC.54.index"
  6. NBS.CIRC.74errata - errata?

Errata. Treat it like Supplement and Insert.

  • In PubID:
    • Full form "National Bureau of Standards Bureau Circular No. 74 Errata"
    • Abbreviated form "Nat. Bur. Std. Circ. No. 74. Errata"
    • Short form "NBS CIRC 74 Errata"
    • MR form "NBS.CIRC.74.errata"
  7. NBS.CRPL.1-2_3-1, NBS.CRPL.1-2_3-1A, NBS.CRPL.4-m-5, NBS.CRPL.c4-4 - Are the 1-2_3-1, 1-2_3-1A, 4-m-5, c4-4 docnumbers or docnumbers with parts?
  • CRPL means "CENTRAL RADIO PROPAGATION LABORATORY". So we treat this as a series.
  • 1-2_3-1 means "1-2, 3-1"
  • 1-2_3-1A means "Supplement to report CRPL-1-2, 3-1"
  • 4-m-5 was "CRPL-4-M-5"

Let's treat them as docnumbers, yes. But did you notice these entries have assigned numbers? Then we don't need to parse the DOIs for them. See this: https://pages.nist.gov/NIST-Tech-Pubs/CRPL.html .

https://nvlpubs.nist.gov/nistpubs/Legacy/crpl/crpl-1-2_3-1.pdf

  8. NBS.FIPS.100-1-1991 - is this part 1 and update-date 1991?

Yes.

  9. NIST.IR.6867es - es?

es means Spanish. This is the language, which PubID supports.

https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir6867es.pdf

  10. NIST.IR.7297c - c?

Part C.

https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir7297c.pdf

  11. NIST.IR.8115chi - chi?

Language: Chinese.

  12. NIST.IR.8115viet - viet?

Language: Vietnamese.

  13. NIST.IR.8178port - port?

Language: Portuguese.

  14. NIST.NCSTAR.1-1av1, NCSTAR.1-1cv1, NIST.NCSTAR.1-2bv1 - av, cv, bv?
  • Part 1A, version 1.
  • Part 1B, version 1.
  • Part 1C, version 1.

https://nvlpubs.nist.gov/nistpubs/Legacy/NCSTAR/ncstar1-1av1.pdf

  15. NIST.SP.1011-I-2.0 - is 1011-I-2.0 a docnumber?

Docnumber is 1011. Volume is 1. Version is 2.0.

https://www.nist.gov/system/files/documents/el/isd/ks/NISTSP_1011-I-2-0.pdf

  16. NIST.SP.1075-NCNR - NCNR?

NCNR is the "NIST Center for Neutron Research".

This is very funny -- this is a case of a "duplicated" SP 1075!!

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication1075-NCNR.pdf

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication1075-PML.pdf


So we need to find a way to resolve this... argh.

In this case, "1075-NCNR" is the docnumber.

Will report this to NIST.

  17. NIST.SP.800-131Ar1 - Ar?

This means Part A, Revision 1.

https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar1.pdf

  18. NIST.SP.800-28ver2 - Is ver a version? How should it be mapped to PubID?

"Version" is a supported element just like "Revision".

  19. NIST.SP.800-38a-add - add?

Addendum to SP 800-38 Part A.

  20. NIST.SP.800-57pt1r4 - pt?

Part 1.

  21. NIST.SP.801-errata - errata?

As above.

  22. NIST.SP.955.Suppl - Suppl?

Supplement.

  23. NIST.AMS.300-8r1/upd, NIST.IR.8115r1-upd - upd?

https://nvlpubs.nist.gov/nistpubs/ams/NIST.AMS.300-8r1.pdf

https://nvlpubs.nist.gov/nistpubs/ams/NIST.AMS.300-8r1-upd.pdf

"INCLUDES UPDATES AS OF 02-08-2021".

This is an "errata update". From https://github.com/metanorma/nist-pubid/blob/master/README.adoc#4-machine-readable-form , this applies:

If a superseding edition is just an errata update, we can use the update date from the title page (“includes updates as of…”) to uniquely identify this edition. Preferably use -yyyymmdd format.
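
As a small illustration of the "-yyyymmdd" convention, the Python sketch below turns the title-page note into that form. It assumes the date on the title page is written month-day-year (so 02-08-2021 is 8 February 2021), which the thread does not state; `update_suffix` is a hypothetical name.

```python
import re
from datetime import datetime


def update_suffix(title_page_note: str) -> str:
    """Extract the "updates as of" date and format it as yyyymmdd."""
    match = re.search(r"updates as of (\d{2}-\d{2}-\d{4})", title_page_note, re.IGNORECASE)
    if match is None:
        raise ValueError("no update date found in title page note")
    # Assumption: the title page uses MM-DD-YYYY order.
    return datetime.strptime(match.group(1), "%m-%d-%Y").strftime("%Y%m%d")


print(update_suffix("INCLUDES UPDATES AS OF 02-08-2021"))  # -> 20210208
```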

@ronaldtse
Contributor Author

ronaldtse commented Aug 17, 2021

@andrew2net I've updated nist-pubid's README to reflect these element changes, please check.

UPDATE: I actually went through the full set of documents for all series (see metanorma/pubid-nist#4), so the PubID scheme should work.
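
To make the suffix mappings from the answers above concrete (pt = Part, r = Revision, ver = Version, sec = Section, language codes, errata, index, insert, add, upd), here is a rough Python sketch that peels known suffixes off the end of a DOI-style ID. It is not the pubid-nist implementation: `SUFFIXES`, `LANGUAGES` and `parse_doi_id` are hypothetical names, and supplements, dates, volumes and letter parts (such as the "A" in 800-131A) are deliberately left unparsed in the docnumber.

```python
import re

LANGUAGES = {"es": "Spanish", "chi": "Chinese", "viet": "Vietnamese", "port": "Portuguese"}

# (element name, pattern anchored at the end of the remaining string).
# "version" must be tried before "revision" so that "ver2" is not read as "r2".
SUFFIXES = [
    ("update", r"[-/]upd$"),
    ("version", r"ver(\d+)$"),
    ("revision", r"r(\d+)$"),
    ("part", r"pt(\d+)$"),
    ("section", r"sec(\d+)$"),
    ("language", r"(es|chi|viet|port)$"),
    ("errata", r"-?errata$"),
    ("index", r"(?:index|indx)$"),
    ("insert", r"insert$"),
    ("addendum", r"-add$"),
]


def parse_doi_id(doi_id: str) -> dict:
    """Split e.g. "NIST.SP.800-57pt1r4" into publisher, series, docnumber and suffix elements."""
    publisher, series, rest = doi_id.split(".", 2)
    elements = {"publisher": publisher, "series": series}
    matched = True
    while matched:
        matched = False
        for name, pattern in SUFFIXES:
            found = re.search(pattern, rest)
            if found:
                # Patterns without a capture group are simple flags.
                value = found.group(1) if found.groups() else True
                elements[name] = LANGUAGES.get(value, value)
                rest = rest[: found.start()]
                matched = True
                break
    elements["docnumber"] = rest
    return elements


print(parse_doi_id("NIST.SP.800-57pt1r4"))
print(parse_doi_id("NIST.IR.8115r1-upd"))
```

A suffix-peeling loop keeps the patterns independent of each other, but it will misread any docnumber that happens to end in one of these tokens, so a real parser would need the per-series rules from the PubID scheme.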

@andrew2net
Contributor

Let's treat them as docnumbers, yes. But did you notice these entries have assigned numbers? Then we don't need to parse the DOIs for them.

@ronaldtse I've tried to use the assigned numbers but some of them are duplicated. For example: NBS CIRC 46e2, NIST HB 105-1-1990, NBS HB 67suppJune1965 ...

@ronaldtse
Contributor Author

@andrew2net do you mean that NBS CIRC 46e2 has the same assigned number as NBS CIRC 46?

@andrew2net
Contributor

andrew2net commented Aug 17, 2021

@ronaldtse I found that NBS.CIRC.36e2 and NBS.CIRC.46e2 both have the item number NBS CIRC 46e2, which looks like a mistake.
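
A check along these lines could surface such collisions. This is only a Python sketch: `duplicated_item_numbers` is a hypothetical helper, and the input is assumed to be (DOI suffix, item number) pairs already extracted from the XML.

```python
from collections import defaultdict


def duplicated_item_numbers(records):
    """Map each item number used by more than one DOI to the sorted list of those DOIs."""
    dois_by_number = defaultdict(set)
    for doi, item_number in records:
        dois_by_number[item_number].add(doi)
    return {
        number: sorted(dois)
        for number, dois in dois_by_number.items()
        if len(dois) > 1
    }


records = [
    ("NBS.CIRC.36e2", "NBS CIRC 46e2"),
    ("NBS.CIRC.46e2", "NBS CIRC 46e2"),
    ("NBS.BH.1", "NBS BH 1"),
]
print(duplicated_item_numbers(records))
# -> {'NBS CIRC 46e2': ['NBS.CIRC.36e2', 'NBS.CIRC.46e2']}
```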

UPDATE:
Here are all duplicates:

["NBS CIRC 46e2",
 "NIST HB 105-1-1990",
 "NBS HB 67suppJune1965",
 "NIST IR 89-4220",
 "NBS TN 789-1",
 "NIST HB 150-10",
 "NIST IR 8115",
 "NIST IR 8117",
 "NIST IR 8119",
 "NIST IR 8178",
 "NIST TN 1648"]

@ronaldtse
Contributor Author

@andrew2net in this case can you create an issue at nist-pubid about that mistake? Thanks.

@andrew2net
Contributor

andrew2net commented Aug 18, 2021

@ronaldtse These references NBS.CIRC.sup, NBS.CIRC.supJun1925-Jun1926, NBS.CIRC.supJun1925-Jun1927 don't have a docnumber. Is it possible to have a PubID without a docnumber?
Another question is: how to handle 2 dates in the last couple of references?

UPDATE
There are also references like NBS.RPT.Apr-Jun1948.

@ronaldtse
Contributor Author

@andrew2net I've moved your last comment to a new issue. Let's not stack up the requests in this issue 😉

@andrew2net
Contributor

@ronaldtse there are DOIs that include a language, and the documents with those DOIs have translated titles. It seems PubID doesn't support languages. Instead we have a language attribute on titles in our data model. So we need to collect all the title translations into one document, don't we?
The Chinese documents don't have translated titles. However, the Chinese documents (and other non-English documents) have links to translated PDF files. But we don't have a language attribute for TypedUri in the data model. Do we need to collect all these links? Maybe we need to add a language attribute to the TypedUri element. What do you think?

@ronaldtse
Contributor Author

@andrew2net we do not need to parse the set perfectly right now.

Let’s make sure we have most done and then file additional issues. Relationships between translated documents are not important right now.

We are in a hurry to have the first cut.

@andrew2net
Contributor

  • documents from the NIST CSRC (NIST SP 800, etc), should still come from the NIST Metanorma endpoint (which is much richer in information and updated daily)

@ronaldtse now we have 3 sources for NIST documents:

  1. https://csrc.nist.gov/CSRC/media/feeds/metanorma/pubs-export.zip
  2. https://csrc.nist.gov/search
  3. https://raw.githubusercontent.com/usnistgov/NIST-Tech-Pubs/nist-pages/xml/allrecords.xml

Is there a way to detect which source should be used for a given reference?

@ronaldtse
Contributor Author

We will only use 1 and 3 from now on. They will already represent the full information of all NIST publications. For a reference we will prioritize the information of 1 over 3.
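
In code, that priority could be as simple as the following Python sketch. The dictionaries stand in for indexes built from source 1 (the CSRC export) and source 3 (allrecords.xml); `find_bibdata`, the index keys and the placeholder records are all hypothetical.

```python
def find_bibdata(reference, csrc_export, tech_pubs):
    """Return (source, record) for a reference, preferring source 1 over source 3."""
    if reference in csrc_export:
        return "csrc-export", csrc_export[reference]
    if reference in tech_pubs:
        return "tech-pubs", tech_pubs[reference]
    return None


# Hypothetical in-memory indexes keyed by document identifier.
csrc_export = {"NIST.SP.800-57pt1r4": {"origin": "pubs-export.zip"}}
tech_pubs = {
    "NIST.SP.800-57pt1r4": {"origin": "allrecords.xml"},
    "NBS.BH.1": {"origin": "allrecords.xml"},
}

print(find_bibdata("NIST.SP.800-57pt1r4", csrc_export, tech_pubs))
print(find_bibdata("NBS.BH.1", csrc_export, tech_pubs))
```

A document present in both sources resolves to the CSRC record, while a legacy NBS document falls through to the Tech Pubs record.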

@andrew2net
Contributor

andrew2net commented Aug 29, 2021

@ronaldtse it seems sources 1 and 3 don't cover everything. For example, SP 800-55 Rev. 2 (Draft) is only in https://csrc.nist.gov/search.

@ronaldtse
Contributor Author

@andrew2net interesting! In this case we should consider this a bug in 1. The results from 1 and 2 are supposed to be identical. I will report and revert.

@ronaldtse
Contributor Author

In any case, we will migrate to a full-data approach with NIST instead of using dynamic scraping. Please help proceed.

andrew2net added a commit that referenced this issue Aug 30, 2021
Andrei Kislichenko automation moved this from In progress to Done Aug 30, 2021
@ronaldtse
Contributor Author

The results from 1 and 2 are supposed to be identical. I will report and revert.

NIST CSRC responded that endpoint 1 is now fixed. Thanks guys!
