# COMET Comparison

The [COMET](https://www.cometadata.org/) project is an effort to build tools for collaboratively improving scholarly metadata. It is predicated on the idea that academic institutions can help improve metadata that is being made available by Crossref as part of DOI registration by publishers.

This notebook explores how our data, which aggregates information from Dimensions, Web of Science, OpenAlex and PubMed, differs from the metadata that is available from Crossref.

Since we are storing the verbatim metadata from each source, we can look for where this data differs substantially from Crossref. The goal is to find 10 records that illustrate how we could provide enhanced metadata to COMET.

In [73]:
import os
import dotenv
import pandas
import sqlalchemy

Set some variables and use them to connect to the database after you have [established an SSH tunnel](https://github.com/sul-dlss/rialto-airflow/wiki/Querying-with-pgAdmin) to the database server.

In [74]:
dotenv.load_dotenv()

db_password = os.environ.get("DB_PASSWORD_STAGE")
db_name = 'rialto_20250711222454'
db_user = 'analyst'

engine = sqlalchemy.create_engine(f'postgresql://{db_user}:{db_password}@localhost:9999/{db_name}')

In [75]:
pandas.set_option('display.max_colwidth', 0)
pandas.set_option('display.max_rows', 1000)

pandas.read_sql('select doi from publication where doi is not null limit 15', con=engine)

Unnamed: 0,doi
0,10.1001/archneur.1981.00510110050006
1,10.1001/archneur.1982.00510160028005
2,10.1001/archneur.1983.04050020055011
3,10.1001/archneur.1985.04060020033010
4,10.1001/jamanetworkopen.2019.11519
5,10.1001/jamanetworkopen.2020.29540
6,10.1001/jamanetworkopen.2023.3640
7,10.1001/jamanetworkopen.2023.3646
8,10.1001/jamanetworkopen.2024.1828
9,10.1001/jamaoncol.2017.5257


## Title

Compare the title we've extracted from SulPub, Dimensions, OpenAlex and WoS with what is in Crossref.

In [76]:
pandas.read_sql("""
  SELECT doi, title, crossref_title
  FROM (
      SELECT
          doi,
          title,
          crossref_json -> 'title' ->> 0 AS crossref_title
      FROM publication
      WHERE crossref_json IS NOT NULL
  ) AS titles
  WHERE
      RTRIM(LOWER(title), '.') != RTRIM(LOWER(crossref_title), '.')
      AND crossref_title !~ '<'
  LIMIT 25
""", con=engine)

Unnamed: 0,doi,title,crossref_title
0,10.1016/j.echo.2019.04.002,2019 ACC/AHA/ASE Advanced Training Statement on Echocardiography (Revision of the 2003 ACC/AHA Clinical Competence Statement on Echocardiography): A Report of the ACC Competency Management Committee,2019 ACC/AHA/ASE Advanced Training Statement on Echocardiography (Revision of the 2003 ACC/AHA Clinical Competence Statement on Echocardiography): A Report of the ACC Competency Management Committee
1,10.1002/cncr.25881,High-Intensity Focused Ultrasound (HIFU) Is Not Indicated for Treatment of Primary Bone Sarcomas,High‐intensity focused ultrasound (HIFU) is not indicated for treatment of primary bone sarcomas
2,10.1002/ski2.235,Development of a digital tool for home‐based monitoring of skin disease for older adults,Development of a Digital Tool for Home-Based Monitoring of Skin Disease for Older Adults
3,10.1027/1618-3169/a000578,Does Watching Videos With Natural Scenery Restore Attentional Resources?,Does Watching Videos With Natural Scenery Restore Attentional Resources?
4,10.1158/1078-0432.ccr-24-2814,"First-in-Human Clinical Trial of a Small-Molecule EBNA1 Inhibitor, VK-2019, in Patients with Epstein-Barr-Positive Nasopharyngeal Cancer, with Pharmacokinetic and Pharmacodynamic Studies.","First-in-Human Clinical Trial of a Small-Molecule EBNA1 Inhibitor, VK-2019, in Patients with Epstein-Barr–Positive Nasopharyngeal Cancer, with Pharmacokinetic and Pharmacodynamic Studies"
5,10.1161/circulationaha.112.000343,Sustained Release of Engineered Stromal Cell–Derived Factor 1-&agr; From Injectable Hydrogels Effectively Recruits Endothelial Progenitor Cells and Preserves Ventricular Function After Myocardial Infarction,Sustained Release of Engineered Stromal Cell–Derived Factor 1-α From Injectable Hydrogels Effectively Recruits Endothelial Progenitor Cells and Preserves Ventricular Function After Myocardial Infarction
6,10.1016/j.jacc.2021.02.023,COVID-19 in Adults With Congenital Heart Disease,COVID-19 in Adults With Congenital Heart Disease
7,10.1016/j.healun.2024.02.1179,(534) Failure to Rescue in Heart Lung Transplantation: Progress Over 30 Years,Failure to Rescue in Heart Lung Transplantation: Progress Over 30 Years
8,10.2147/pgpm.s201276,Pharmacogenetics of Pediatric Asthma: Current Perspectives,&lt;p&gt;Pharmacogenetics of Pediatric Asthma: Current Perspectives&lt;/p&gt;
9,10.1021/jm00110a010,"Topographically designed analogues of [D-Pen,D-Pen5]enkephalin.","Topographically designed analogs of [cyclic] [D-Pen2,D-Pen5]enkephalin"


The order of preference for title is:

* SulPub
* Dimensions
* OpenAlex
* Web of Science
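
The preference order above can be sketched as a small helper that returns the first non-empty title. This is only an illustration: the source keys and the example values are hypothetical, and the actual workflow may select the title differently.

```python
# Hypothetical source keys, listed in the order of preference given above.
TITLE_PREFERENCE = ["sulpub", "dimensions", "openalex", "wos"]


def pick_title(titles_by_source):
    """Return (source, title) for the first source with a non-empty title."""
    for source in TITLE_PREFERENCE:
        title = titles_by_source.get(source)
        if title and title.strip():
            return source, title.strip()
    return None, None


# Illustrative values only, not taken from the database.
example = {
    "sulpub": None,
    "dimensions": "",
    "openalex": "An Example Title",
    "wos": "AN EXAMPLE TITLE",
}
print(pick_title(example))
```

Returning the source alongside the title makes it possible to tell which provider a differing title came from when comparing against Crossref.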

Maybe title isn't the greatest place to start comparing since Crossref actually seems to be better in places. For example:

> (534) Failure to Rescue in Heart Lung Transplantation: Progress Over 30 Years

compared with Crossref:

> Failure to Rescue in Heart Lung Transplantation: Progress Over 30 Years

The published [article](https://doi.org/10.1016/j.healun.2024.02.1179) title doesn't have "(534)" in it.

> Original Mutations in latent membrane protein 1 of Epstein-Barr virus are associated with increased risk of posttransplant lymphoproliferative disorder in children

compared with Crossref:

> Mutations in latent membrane protein 1 of Epstein-Barr virus are associated with increased risk of posttransplant lymphoproliferative disorder in children

The published [article](https://doi.org/10.1016/j.ajt.2023.02.014) title doesn't have "Original" in it.
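
Many of the mismatches in the table are cosmetic (letter case, Unicode hyphens, escaped markup), while prefixes such as "(534)" or "Original" are substantive. A rough sketch of a normalizer that could separate the two cases follows. The function name `normalize_title` and the set of dash characters are assumptions for illustration, not part of the existing workflow.

```python
import html
import re
import unicodedata

# Hyphen and dash code points that are treated here as a plain "-".
DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def normalize_title(title):
    """Reduce a title so that only substantive differences remain."""
    text = html.unescape(title)              # &lt;p&gt; becomes <p>
    text = re.sub(r"<[^>]+>", "", text)      # drop markup such as <p> or <i>
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(DASHES)            # unify hyphen and dash variants
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(".").casefold()       # ignore case and a trailing period


# Cosmetic difference: case and a U+2010 hyphen.
ours = "High-Intensity Focused Ultrasound (HIFU) Is Not Indicated"
theirs = "High\u2010intensity focused ultrasound (HIFU) is not indicated"
print(normalize_title(ours) == normalize_title(theirs))  # True

# Substantive difference: the "(534)" prefix survives normalization.
ours = "(534) Failure to Rescue in Heart Lung Transplantation"
theirs = "Failure to Rescue in Heart Lung Transplantation"
print(normalize_title(ours) == normalize_title(theirs))  # False
```

Non-standard entities such as `&agr;` are not decoded by `html.unescape` and would likely need their own mapping.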
    
It might be better to focus on areas where data we've collected isn't available from Crossref.

## Crossref Data

Examining a Crossref record like [this journal article](https://api.crossref.org/works/10.1098/rsos.150114) we can see that it includes metadata for:

* DOI
* title
* short title
* subtitle
* publisher
* journal title
* volume
* issue
* pages
* abstract
* publisher website URL
* article URL
* PDF URL
* publication date
* publication type
* record creation date
* update
* reference count
* author names
* author affiliations: e.g. `From the Departments of Cardiothoracic Surgery (A.N.S., J.W.M., Y.J.W.) and Bioengineering (A.N.S., Y.J.W.), Stanford University, CA.`
* references as DOIs
* language
* deposit date
* issued date
* ISSNs

As part of our workflow processing we are extracting a few pieces of information that are not available here:

* Article Processing Charge (APC)
* Open Access Classification
* Author ORCID
* Author email
* ROR for school
* Funder
* Federal Funder
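
Crossref records can carry author ORCIDs and funders when the publisher deposits them, so whether a field counts as an enhancement probably depends on the individual record. The sketch below checks this per record. The local field names (`author_orcid`, `funder`) and the example values are hypothetical; the `author`, `ORCID` and `funder` keys follow the Crossref works JSON format.

```python
def crossref_has_orcid(message):
    """True if any author in the Crossref record has an ORCID."""
    return any("ORCID" in author for author in message.get("author", []))


def crossref_has_funder(message):
    """True if the Crossref record lists at least one funder."""
    return bool(message.get("funder"))


# Hypothetical local field names mapped to a check on the Crossref record.
CHECKS = {
    "author_orcid": crossref_has_orcid,
    "funder": crossref_has_funder,
}


def enhancements(local, message):
    """List local fields that hold a value the Crossref record lacks."""
    return sorted(
        field
        for field, crossref_has in CHECKS.items()
        if local.get(field) and not crossref_has(message)
    )


# Illustrative records only: a made-up DOI and the ORCID documentation example iD.
message = {
    "DOI": "10.0000/example",
    "author": [{"given": "A.", "family": "Example"}],
}
local = {"author_orcid": ["0000-0002-1825-0097"], "funder": []}
print(enhancements(local, message))  # ['author_orcid']
```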

Maybe that's enough? We can serialize the records we have as JSON for sharing.

## APC

In [77]:
pandas.read_sql("""
  SELECT doi, apc 
  FROM publication
  WHERE doi IS NOT NULL
    AND apc IS NOT NULL
""", con=engine)

Unnamed: 0,doi,apc
0,10.1002/pbc.24328,4330
1,10.1038/s41586-022-04489-4,11690
2,10.1002/ange.202101644,0
3,10.1016/j.jim.2020.112936,0
4,10.1002/art.42099,4940
...,...,...
1384,10.7554/elife.93013,2450
1385,10.7189/jogh.10.020380,2450
1386,10.7189/jogh.12.04014,1200
1387,10.7759/cureus.42051,2450


Here's a function that will output a publication with its authors and funders as JSON.

In [128]:
import json

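# there's probably a better way to do this, but this will work for these purposes
def write_json(doi, filename):
    """Write a publication, with its authors and funders, to data/comet/<filename>."""
    pubs = pandas.read_sql(
        'SELECT * FROM publication WHERE doi = %(doi)s',
        params={'doi': doi},
        con=engine).to_dict(orient='records')
    if not pubs:
        raise ValueError(f'no publication found for DOI {doi}')
    pub = pubs[0]

    # pass the publication id as a bound parameter instead of interpolating it
    params = {'pub_id': int(pub['id'])}

    pub['authors'] = pandas.read_sql(
        '''
        SELECT *
        FROM author, pub_author_association
        WHERE author.id = pub_author_association.author_id
          AND pub_author_association.publication_id = %(pub_id)s
        ''',
        params=params,
        con=engine).to_dict(orient='records')

    pub['funders'] = pandas.read_sql(
        '''
        SELECT *
        FROM funder, pub_funder_association
        WHERE funder.id = pub_funder_association.funder_id
          AND pub_funder_association.publication_id = %(pub_id)s
        ''',
        params=params,
        con=engine).to_dict(orient='records')

    # default=str serializes dates and other non-JSON types as strings
    os.makedirs('data/comet', exist_ok=True)
    with open(f'data/comet/{filename}', 'w') as output:
        json.dump(pub, output, indent=2, sort_keys=True, default=str)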

write_json('10.1038/s41586-022-04489-4', 'apc.json')
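
The same assembly pattern can be tried without a database connection. The following is a minimal sketch against an in-memory SQLite database, using a made-up DOI and a cut-down, hypothetical version of the schema (only `publication`, `funder` and `pub_funder_association`, with a few columns each). It is not the real table layout.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dicts by column name

# Hypothetical, simplified tables with one illustrative record.
conn.executescript("""
    CREATE TABLE publication (id INTEGER PRIMARY KEY, doi TEXT, apc INTEGER);
    CREATE TABLE funder (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE pub_funder_association (publication_id INTEGER, funder_id INTEGER);
    INSERT INTO publication VALUES (1, '10.0000/example', 1200);
    INSERT INTO funder VALUES (10, 'Example Foundation');
    INSERT INTO pub_funder_association VALUES (1, 10);
""")


def publication_as_dict(doi):
    """Return the publication with a nested list of funders, or None."""
    pub = conn.execute(
        "SELECT * FROM publication WHERE doi = ?", (doi,)
    ).fetchone()
    if pub is None:
        return None
    record = dict(pub)
    funders = conn.execute(
        """
        SELECT funder.*
        FROM funder
        JOIN pub_funder_association AS pfa ON pfa.funder_id = funder.id
        WHERE pfa.publication_id = ?
        """,
        (record["id"],),
    ).fetchall()
    record["funders"] = [dict(row) for row in funders]
    return record


print(json.dumps(publication_as_dict("10.0000/example"), indent=2, sort_keys=True))
```

Selecting `funder.*` rather than `*` keeps the association table's columns out of the nested records, which avoids carrying the join keys into the shared JSON.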
    

## Open Access

In [125]:
pandas.read_sql("""
  SELECT doi, open_access 
  FROM publication
  WHERE doi IS NOT NULL
    AND open_access IS NOT NULL
  LIMIT 100
""", con=engine)

Unnamed: 0,doi,open_access
0,10.1002/pbc.24328,green
1,10.1016/j.jamcollsurg.2012.06.133,closed
2,10.1021/acsabm.0c00348,closed
3,10.1038/s41586-022-04489-4,hybrid
4,10.1021/acsabm.9b00769,closed
5,10.1002/ange.202101644,closed
6,10.1016/j.jim.2020.112936,closed
7,10.1021/acs.nanolett.3c04172,green
8,10.1002/art.42099,bronze
9,10.1002/aet2.11043,bronze


In [129]:
write_json('10.1016/j.jaccas.2024.102715', 'openaccess.json')

## Authors

All our publications are associated with Authors where we know:

* department
* school
* orcid
* sunet/email

In [132]:
write_json('10.1038/s41586-023-06669-2', 'author.json')

## Funders

We know funder information from OpenAlex and Dimensions.

In [140]:
pandas.read_sql("""
  SELECT publication.doi, COUNT(*) AS funders_count
  FROM publication, pub_funder_association, funder
  WHERE publication.id = pub_funder_association.publication_id
    AND pub_funder_association.funder_id = funder.id
  GROUP BY publication.doi
  HAVING COUNT(*) > 2
  ORDER BY funders_count DESC
  LIMIT 100
""", con=engine)

Unnamed: 0,doi,funders_count
0,10.1117/1.nph.12.2.027801,28
1,10.3847/1538-4365/ac78eb,24
2,10.3847/1538-4357/acdd78,23
3,10.1038/s41467-020-18849-z,20
4,10.1111/all.16490,19
5,10.3847/1538-4357/ac6e65,18
6,10.3847/1538-4357/ad07d0,17
7,10.1038/s41591-024-03425-5,16
8,10.1111/pai.13802,16
9,10.1016/s1470-2045(16)30214-5,16


In [141]:
write_json('10.1038/s41467-025-60308-0', 'funders.json')