Parse BIPM Metrologia data from rawdata-bipm-metrologia #28

Open
ronaldtse opened this issue Jun 14, 2022 · 18 comments
Labels
enhancement New feature or request

@ronaldtse
Contributor

ronaldtse commented Jun 14, 2022

As described in relaton/relaton-data-bipm#17.

This task supersedes #2, which implemented support for retrieving Metrologia bibliographic data from IOP but was unsatisfactory due to remote performance issues.

BIPM has now provided the full bibliographic data set of Metrologia, and we have an agreement in place with IOP Publishing, the publisher. The dataset is now at https://github.com/relaton/rawdata-bipm-metrologia (private access).

The work here is to parse that dataset into the relaton-data-bipm Relaton repository.

(The following information is also provided in README.adoc of the repository but included here for clarity)

The full set of bibliographic data comes in a zipped format in the following structure:

2022-04-05T10_55_52_content/         - data archive level
  0026-1394/                         - ISSN of Metrologia (physical version)
    0026-1394_1/                     - Volume 1 of Metrologia
      0026-1394_1_1/                 - Issue 1 of Volume 1
        0026-1394_1_1_1/             - Article 1 of Issue 1 of Volume 1
          metv1i1p1.xml              - Bibliographic data of Article 1

    0026-1394_2/                     - Volume 2 of Metrologia
      0026-1394_2_1/                 - Issue 1 of Volume 2
        0026-1394_2_1_1/             - Article 1 of Issue 1 of Volume 2 (1, 6, 11 are page numbers)
        0026-1394_2_1_6/             - Article 6 of Issue 1 of Volume 2 (1, 6, 11 are page numbers)
        0026-1394_2_1_11/            - Article 11 of Issue 1 of Volume 2 (1, 6, 11 are page numbers)
      
    0026-1394_59/                    - Volume 59 of Metrologia
      0026-1394_59_1A/               - Issue 1A of Volume 59
        0026-1394_59_1A_01001/       - Article 01001 of Issue 1A of Volume 59
        ...                          
        0026-1394_59_1A_08005/       - Article 08005 of Issue 1A of Volume 59
          0026-1394_59_1A_08005.xml  - Bibliographic data of Article 08005
      0026-1394_59_2/                - Issue 2 of Volume 59
        0026-1394_59_2_022001/       - Article 022001 of Issue 2 of Volume 59
          met_59_2_022001.xml        - Bibliographic data of Article 022001

Subsequent updates will also be provided in the same archive format.

The update archives have the same structure:

2022-06-02T03_01_55_content/         - data archive level
  0026-1394/                         - ISSN of Metrologia (physical version)
    0026-1394_59/                    - Volume 59 of Metrologia
      0026-1394_59_3/                - Issue 3 of Volume 59
        0026-1394_59_3_034002/       - Article 034002 of Issue 3 of Volume 59
          met_59_3_034002.xml        - Bibliographic data of Article 034002

We need to parse this archive into a Relaton dataset.

Notice in the folder/file structure:

  • Article numbers are localized: they are not unique across the dataset
  • Article numbers can be sequential numbers or page numbers
  • Issue numbers are not always sequential, e.g. "Issue 1A of Volume 59"
  • Filenames of XML files do not follow a consistent pattern; the pattern has changed over time and the names probably carry little meaning.
  • All the information provided in the folder names and file names is available from the actual XML files, with samples shown below.
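
Since the folder and file names are unreliable, the identifying fields can be read from the XML itself. The following is a minimal sketch in Python, not the parser used by this project. The function name `read_article_ids` and the returned dictionary keys are hypothetical, and the element paths are taken only from the two samples in this issue.

```python
import xml.etree.ElementTree as ET


def read_article_ids(xml_text):
    """Collect the identifying fields of one JATS article from its XML."""
    meta = ET.fromstring(xml_text).find("front/article-meta")

    def text(path):
        # findtext returns None for a missing element and "" for an empty one
        value = meta.findtext(path)
        return value.strip() if value else None

    return {
        "doi": text("article-id[@pub-id-type='doi']"),
        "manuscript": text("article-id[@pub-id-type='manuscript']"),
        "volume": text("volume"),
        "issue": text("issue"),
        "fpage": text("fpage"),
        "artnum": text("elocation-id[@content-type='artnum']"),
    }


# Trimmed-down copy of the 0026-1394_59_1A_08005.xml sample
sample = """<article><front><article-meta>
  <article-id pub-id-type="doi">10.1088/0026-1394/59/1A/08005</article-id>
  <article-id pub-id-type="manuscript">met_59_1A_08005</article-id>
  <volume>59</volume>
  <issue>1A</issue>
  <elocation-id content-type="artnum">08005</elocation-id>
</article-meta></front></article>"""

print(read_article_ids(sample))
```

The sketch returns every candidate for the article number (`fpage`, `artnum`, `manuscript`) without choosing between them, because the samples show that older and newer files populate different elements.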

Contents of metv1i1p1.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "JATS-journalpublishing1.dtd">
<article 
  xmlns:mml="http://www.w3.org/1998/Math/MathML" 
  xmlns:xlink="http://www.w3.org/1999/xlink" 
  article-type="editorial" 
  dtd-version="1.1" 
  xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">met</journal-id>
      <journal-title-group>
        <journal-title>Metrologia</journal-title>
        <abbrev-journal-title abbrev-type="IOP">met</abbrev-journal-title>
        <abbrev-journal-title abbrev-type="publisher">Metrologia</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="ppub">0026-1394</issn>
      <issn pub-type="epub">1681-7575</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">0026-1394__</article-id>
      <article-id pub-id-type="doi">10.1088/0026-1394/1/1/001</article-id>
      <article-id pub-id-type="manuscript">001</article-id>
      <article-categories>
        <subj-group subj-group-type="display-article-type">
          <subject/>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title xml:lang="en">The Role and Policy of  <italic>Metrologia</italic></article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" xlink:type="simple">
          <name>
            <surname>L E Howlett</surname>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1"><label>1</label>Editor, Ottawa, Canada</aff>
      </contrib-group>
      <pub-date pub-type="ppub">
        <day>01</day>
        <month>01</month>
        <year>1965</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <fpage>1</fpage>
      <lpage>1</lpage>
      <permissions>
        <copyright-statement>Published under licence by IOP Publishing Ltd</copyright-statement>
        <copyright-year>1965</copyright-year>
      </permissions>
      <self-uri content-type="pdf" xlink:href="metv1i1p1.pdf" xlink:type="simple"/>
      <abstract xml:lang="en">
        <p>Today it is often said ... After much study ...<italic>Metrologia</italic> ... <italic>Metrologia</italic> ...</p>
        <p>This journal will ...</p>
        <p>Preference will ...</p>
        <p>Review articles will ...</p>
        <p>The journal will ...</p>
        <p>There will be a...</p>
        <p>Ability to measure ...</p>
      </abstract>
    </article-meta>
  </front>
</article>

Contents of 0026-1394_59_1A_08005.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "JATS-journalpublishing1.dtd">
<article 
  xmlns:mml="http://www.w3.org/1998/Math/MathML" 
  xmlns:xlink="http://www.w3.org/1999/xlink" 
  article-type="note" 
  dtd-version="1.1" 
  xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">met</journal-id>
      <journal-id journal-id-type="coden">MTRGAU</journal-id>
      <journal-title-group>
        <journal-title xml:lang="en">Metrologia</journal-title>
        <abbrev-journal-title abbrev-type="IOP" xml:lang="en">MET</abbrev-journal-title>
        <abbrev-journal-title abbrev-type="publisher" xml:lang="en">Metrologia</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="ppub">0026-1394</issn>
      <issn pub-type="epub">1681-7575</issn>
      <publisher>
        <publisher-name>IOP Publishing</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">met_59_1A_08005</article-id>
      <article-id pub-id-type="doi">10.1088/0026-1394/59/1A/08005</article-id>
      <article-id pub-id-type="manuscript">met_59_1A_08005</article-id>
      <article-categories>
        <subj-group subj-group-type="display-article-type">
          <subject>PILOT STUDY</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Final report on pilot study CCQM-P211: carbon isotope delta measurements of vanillin</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0003-3398-7246</contrib-id>
          <name name-style="western">
            <surname>Chartrand</surname>
            <given-names>Michelle M G</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation01">1</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-4126-3515</contrib-id>
          <name name-style="western">
            <surname>Kai</surname>
            <given-names>Fuu Ming</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation02">2</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0001-8744-5632</contrib-id>
          <name name-style="western">
            <surname>Meijer</surname>
            <given-names>Harro A J</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation03">3</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0003-4768-2603</contrib-id>
          <name name-style="western">
            <surname>Moossen</surname>
            <given-names>Heiko</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation04">4</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-8339-744X</contrib-id>
          <name name-style="western">
            <surname>Qi</surname>
            <given-names>Haiping</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation05">5</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-7227-791X</contrib-id>
          <name name-style="western">
            <surname>Aerts-Bijma</surname>
            <given-names>Anita T</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation03">3</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Cui</surname>
            <given-names>Yuxi</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation02">2</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Geilmann</surname>
            <given-names>Heike</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation04">4</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-2377-2615</contrib-id>
          <name name-style="western">
            <surname>Mester</surname>
            <given-names>Zoltan</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation01">1</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-3349-5535</contrib-id>
          <name name-style="western">
            <surname>Meija</surname>
            <given-names>Juris</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation01">1</xref>
        </contrib>
        <aff id="affiliation01"><label>1</label>
National Research Council Canada, Metrology, 1200 Montreal Rd., Ottawa, K1A 0R6, Canada</aff>
        <aff id="affiliation02"><label>2</label>
National Metrology Centre, Agency for Science, Technology and Research, 8 Cleantech Loop, 637145, Singapore</aff>
        <aff id="affiliation03"><label>3</label>
Centre for Isotope Research, University of Groningen, Nijenborgh 6, 9747 AG Groningen, The Netherlands</aff>
        <aff id="affiliation04"><label>4</label>
Stable Isotope Laboratory, Max Planck Institute for Biogeochemistry, Hans-Knoell-St. 10, 07745, Jena, Germany</aff>
        <aff id="affiliation05"><label>5</label>
US Geological Survey, Reston, VA 20192, USA</aff>
      </contrib-group>
      <pub-date pub-type="ppub">
        <day>01</day>
        <month>1</month>
        <year>2022</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>18</day>
        <month>2</month>
        <year>2022</year>
      </pub-date>
      <volume>59</volume>
      <issue>1A</issue>
      <elocation-id content-type="artnum">08005</elocation-id>
      <permissions>
        <copyright-statement>© 2022 BIPM &amp; IOP Publishing Ltd</copyright-statement>
        <copyright-year>2022</copyright-year>
        <license license-type="iop-standard" xlink:href="https://publishingsupport.iopscience.iop.org/iop-standard/v1">
          <license-p>This article is available under the terms of the <ext-link ext-link-type="uri">IOP-Standard License</ext-link>.</license-p>
        </license>
      </permissions>
      <abstract>
        <title>Main text</title>
        <p>This pilot study was ...</p>
        <p>To reach the main text of this paper, click on <ext-link xlink:href="https://www.bipm.org/documents/20126/67196226/CCQM-P211.pdf/03820c42-15b0-6849-3cde-aa6a1a105b42" xlink:type="simple">Final Report</ext-link>.</p>
        <p>The final report has been peer-reviewed and approved for publication by the CCQM.</p>
      </abstract>
    </article-meta>
  </front>
</article>
@ronaldtse
Contributor Author

ronaldtse commented Feb 7, 2023

We need to act on this issue ASAP at BIPM's request.

The corresponding data sync work has been done by @CAMOBAP at:

@andrew2net
Contributor

@ronaldtse there are two date types in the source:

      <pub-date pub-type="ppub">
        <day>01</day>
        <month>1</month>
        <year>2022</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>18</day>
        <month>2</month>
        <year>2022</year>
      </pub-date>

Nick's suggestion is to treat the "epub" date as a relation:

<relation type="hasManifestation">
  <bibitem>
    <title>(same)</title>
    <date>2022-02-18</date>
    <medium><carrier>online resource</carrier></medium>
  </bibitem>
</relation>

@ronaldtse
Contributor Author

This is a good point.

We should take the earlier of the ppub and epub dates as the date of publication.

I think that even the original "ppub" (which stands for "print publication", according to JATS) should also be encoded as a new manifestation:

<relation type="hasManifestation">
  <bibitem>
    <title>(same)</title>
    <date>2022-01-01</date>
    <medium><carrier>print</carrier></medium>
  </bibitem>
</relation>
<relation type="hasManifestation">
  <bibitem>
    <title>(same)</title>
    <date>2022-02-18</date>
    <medium><carrier>traditional</carrier></medium>
  </bibitem>
</relation>
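
The date rule above (use whichever of the ppub and epub dates comes first) could be sketched as follows. This is an illustration in Python rather than the project's implementation. The function names are hypothetical, and defaulting a missing month or day to 1 is an assumption that the thread does not discuss.

```python
import datetime
import xml.etree.ElementTree as ET


def publication_dates(article_meta):
    """Map each pub-type ("ppub", "epub") to the date in its <pub-date>."""
    dates = {}
    for pub_date in article_meta.findall("pub-date"):
        year = int(pub_date.findtext("year"))
        # Assumption: a missing month or day defaults to 1.
        month = int(pub_date.findtext("month") or 1)
        day = int(pub_date.findtext("day") or 1)
        dates[pub_date.get("pub-type")] = datetime.date(year, month, day)
    return dates


def earliest_date(article_meta):
    """Return the earliest of the available publication dates."""
    return min(publication_dates(article_meta).values())


# The two pub-date elements quoted earlier in this thread
sample = """<article-meta>
  <pub-date pub-type="ppub"><day>01</day><month>1</month><year>2022</year></pub-date>
  <pub-date pub-type="epub"><day>18</day><month>2</month><year>2022</year></pub-date>
</article-meta>"""

meta = ET.fromstring(sample)
print(earliest_date(meta))  # 2022-01-01
```

Keeping the per-type dates in a dictionary means the same data can feed both the main publication date and one manifestation per pub-type.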

@andrew2net
Contributor

@ronaldtse it seems the data source doesn't provide URLs.

@ronaldtse
Contributor Author

Then we don't need to provide a URL. We do have DOIs, so that is sufficient.

@andrew2net
Contributor

andrew2net commented Feb 12, 2023

@ronaldtse yes, we do have DOIs for articles. But we also need to create issue documents with article relations, volume documents with issue relations, and root "Metrologia" documents with volume relations. Can we have these documents without URLs?

@ronaldtse
Contributor Author

I think so for the moment. Let me ask BIPM/IOPP to provide URLs for these entries.

@ronaldtse
Contributor Author

I have asked BIPM for URLs. For the moment, let's continue without URLs and file a ticket to keep track.

@ronaldtse
Contributor Author

ronaldtse commented Feb 14, 2023

BIPM's Janet Miles says we should use the DOI as the URL for articles. For volumes and issues, there are no DOIs.

Let's use these URLs instead:

@andrew2net
Contributor

andrew2net commented Feb 15, 2023

@ronaldtse the source file rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_37/0026-1394_37_5/0026-1394_37_5_68/me0568.xml is missing the page (article) number. It has the title "Index of Contributors", so it should have page 68: https://iopscience.iop.org/article/10.1088/0026-1394/37/5/68. Is it BIPM's mistake?

Update: the same applies to:
rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_40/0026-1394_40_1/0026-1394_40_1_001/0026-1394_40_1_001.xml https://iopscience.iop.org/article/10.1088/0026-1394/40/1/001

@ronaldtse
Contributor Author

@andrew2net have you re-pulled from this repo? The data path is different now.

I can see in the first file:
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_37/0026-1394_37_5/0026-1394_37_5_68/me0568.xml

        <article-id pub-id-type="manuscript">
          68
        </article-id>
        <title-group>
          <article-title xml:lang="en">
            Index of Contributors
          </article-title>
        </title-group>

The number 68 is present.

In the second file:
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_40/0026-1394_40_1/0026-1394_40_1_001/0026-1394_40_1_001.xml

        <article-id pub-id-type="manuscript">
          001
        </article-id>
        <title-group>
          <article-title xml:lang="en">
            Editorial
          </article-title>
        </title-group>

The 001 is also present.

@andrew2net
Contributor

andrew2net commented Feb 16, 2023

@ronaldtse indeed. You are right about these documents, but most documents have an fpage element. For example the

rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_29/0026-1394_29_6/0026-1394_29_6_373/metv29i6p373.xml

has <fpage>373</fpage>, and <article-id pub-id-type="manuscript">001</article-id>. So it seems that if there is an fpage we should use it as the article number, otherwise use article-id[@pub-id-type="manuscript"]. Am I right?
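
The fallback rule proposed in this comment could look like the sketch below. It is written in Python for illustration only. The function name `article_number` is hypothetical, and the two sample fragments are cut down from the files quoted in this thread.

```python
import xml.etree.ElementTree as ET


def article_number(article_meta):
    """Prefer <fpage>; fall back to the manuscript article-id."""
    fpage = (article_meta.findtext("fpage") or "").strip()
    if fpage:
        return fpage
    manuscript = article_meta.findtext("article-id[@pub-id-type='manuscript']")
    # Values may be surrounded by whitespace, as in me0568.xml
    return (manuscript or "").strip() or None


with_fpage = ET.fromstring(
    "<article-meta>"
    '<article-id pub-id-type="manuscript">001</article-id>'
    "<fpage>373</fpage>"
    "</article-meta>"
)
without_fpage = ET.fromstring(
    "<article-meta>"
    '<article-id pub-id-type="manuscript">\n  68\n</article-id>'
    "</article-meta>"
)

print(article_number(with_fpage))
print(article_number(without_fpage))
```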

@ronaldtse
Contributor Author

It seems so. What a strange encoding.

@ronaldtse
Contributor Author

Can you document this strange behavior in the README? Thanks.

@andrew2net
Contributor

@ronaldtse if we use fpage as the article number, we get duplicate document IDs. So I currently use article-id[@pub-id-type="manuscript"], but this gives us different article numbers now.

andrew2net added a commit that referenced this issue Feb 16, 2023
andrew2net added a commit that referenced this issue Feb 16, 2023
use rawdata-bipm-metrologia #28
@andrew2net
Contributor

@ronaldtse here are duplicates in the source dataset:

"Metrologia 59 1A 06011"
rawdata-bipm-metrologia/data/2022-05-28T03_01_55_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_06011/0026-1394_59_1A_06011.xml
rawdata-bipm-metrologia/data/2022-06-29T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_06011/0026-1394_59_1A_06011.xml
"Metrologia 59 4 ac7687"
rawdata-bipm-metrologia/data/2022-07-07T03_01_47_content/0026-1394/0026-1394_59/0026-1394_59_4/0026-1394_59_4_045007/met_59_4_045007.xml
rawdata-bipm-metrologia/data/2022-10-15T03_01_48_content/0026-1394/0026-1394_59/0026-1394_59_4/0026-1394_59_4_045007/met_59_4_045007.xml
"Metrologia 59 1A 08013"
rawdata-bipm-metrologia/data/2022-09-03T03_01_53_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_08013/0026-1394_59_1A_08013.xml
rawdata-bipm-metrologia/data/2022-09-14T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_08013/0026-1394_59_1A_08013.xml
"Metrologia 59 6 ac98cb"
rawdata-bipm-metrologia/data/2022-10-29T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
rawdata-bipm-metrologia/data/2022-11-17T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
rawdata-bipm-metrologia/data/2022-11-24T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
"Metrologia 59 1A 07020"
rawdata-bipm-metrologia/data/2022-11-18T03_01_53_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_07020/0026-1394_59_1A_07020.xml
rawdata-bipm-metrologia/data/2022-11-26T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_07020/0026-1394_59_1A_07020.xml
"Metrologia 60 1A 01001"
rawdata-bipm-metrologia/data/2023-01-05T03_01_46_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml
rawdata-bipm-metrologia/data/2023-01-06T03_01_49_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml
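
A list like the one above can be produced by grouping files on their path below the dated archive folder. The following Python sketch shows one way to do that. It is not the code used to build the list, the function name `find_duplicates` is hypothetical, and it assumes that the first path component ending in `_content` is the archive folder.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_duplicates(paths):
    """Return {path below the archive folder: [archive folders]} for repeated files."""
    archives_by_file = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        # Assumption: the first component ending in "_content" is the
        # dated archive folder, e.g. "2022-05-28T03_01_55_content".
        index = next(i for i, part in enumerate(parts) if part.endswith("_content"))
        archives_by_file["/".join(parts[index + 1:])].append(parts[index])
    return {
        name: sorted(archives)
        for name, archives in archives_by_file.items()
        if len(archives) > 1
    }


article = "0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_06011/0026-1394_59_1A_06011.xml"
paths = [
    f"rawdata-bipm-metrologia/data/2022-05-28T03_01_55_content/{article}",
    f"rawdata-bipm-metrologia/data/2022-06-29T03_01_46_content/{article}",
]
duplicates = find_duplicates(paths)
for name, archives in duplicates.items():
    print(name, archives)
```

Because the archive folder names start with a timestamp, sorting them also orders the copies from oldest to newest delivery.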

@ronaldtse
Contributor Author

@andrew2net sorry to get back late here. For these source duplications:

  1. Are the contents identical?
  2. If not, do we need to merge them?
  3. Can we just take the newest copy? (if the newer ones are corrections)

Thanks!

@andrew2net
Contributor

@ronaldtse

@andrew2net sorry to get back late here. For these source duplications:

  1. Are the contents identical?

In cases 1, 5, and 6, the docs differ in their contributors: one doc has extra contributors.

In case 2 the docs look identical, but one of them has a back element after front:

  ...
  </front>
  <back>
    <ref-list content-type="numerical">
      <title>References</title>
      <ref id="metac7687bib1">
        <label>1</label>
        <element-citation publication-type="journal" xlink:type="simple">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Petit</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Jiang</surname>
              <given-names>Z</given-names>
            </name>
          </person-group>
          <year>2008</year>
          <source>Int. J. Navig. Obs.</source>
          <volume>2008</volume>
          <fpage>1</fpage>
          <lpage>8</lpage>
          <page-range>1–8</page-range>
          <pub-id pub-id-type="doi">10.1155/2008/562878</pub-id>
        </element-citation>
      </ref>
      <ref id="metac7687bib2">
        <label>2</label>
       ...

It looks like relations. Shouldn't we parse the relations?

In cases 3 and 4 the docs look identical.

  2. If not, do we need to merge them?

I think we should merge them.

  3. Can we just take the newest copy? (if the newer ones are corrections)

In these cases the dates are identical.
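
To see how two copies of the same article differ in their contributors, as in cases 1, 5, and 6, the name lists can be compared directly. The sketch below is in Python and is only an illustration. Both function names are hypothetical, and the two inline documents are made-up minimal fragments that reuse names from the sample at the top of this issue.

```python
import xml.etree.ElementTree as ET


def contributor_names(xml_text):
    """Return (surname, given-names) pairs in document order."""
    root = ET.fromstring(xml_text)
    names = []
    for name in root.iterfind("front/article-meta/contrib-group/contrib/name"):
        surname = (name.findtext("surname") or "").strip()
        given = (name.findtext("given-names") or "").strip()
        names.append((surname, given))
    return names


def extra_contributors(first_xml, second_xml):
    """Contributors present in the second copy but not in the first."""
    known = set(contributor_names(first_xml))
    return [name for name in contributor_names(second_xml) if name not in known]


def make_article(*people):
    # Helper for this example only: builds a minimal JATS-like document.
    contribs = "".join(
        f"<contrib><name><surname>{s}</surname><given-names>{g}</given-names></name></contrib>"
        for s, g in people
    )
    return f"<article><front><article-meta><contrib-group>{contribs}</contrib-group></article-meta></front></article>"


copy_a = make_article(("Chartrand", "Michelle M G"))
copy_b = make_article(("Chartrand", "Michelle M G"), ("Kai", "Fuu Ming"))

print(extra_contributors(copy_a, copy_b))
```

Running the comparison in both directions shows which copy is the superset, which is useful when deciding whether to merge the copies or keep only one.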
