This notebook establishes a baseline for connecting USGS Series Reports in the USGS Publications Warehouse with the xDD (GeoDeepDive) infrastructure. xDD has been picking up and processing USGS reports for some time, as one of many collections of scientific articles that run through the xDD AI pipelines. They sit in a collection identified by "USGS" as the publisher. We need to connect the various places and pieces of infrastructure where we document and work with these reports to the xDD "digestion" of their contents, so that we can better exploit what the xDD pipelines have produced. To do that, we need to work out the linkage between our own base identifiers and the "gddid" for each article, which is our key to tapping the various xDD APIs and working up specific NLP processing.

There are some challenges to work through here because of a change, made some time back, in how the xDD framework fetches USGS report metadata. Fetching started off via ScienceBase and its representation of USGS report metadata, at a time when the ScienceBase API provided the better route to operate against. Once the Pubs Warehouse web services became more mature, we switched to that route. However, the switch introduced some differences in how the "articles" are represented in xDD and in the information available to link through. Part of the work in this notebook is to hash out that issue, baseline the connection from "xDD articles" to Pubs Warehouse entities, and point out where we can make improvements to bring cohesion and interoperability moving forward. On the USGS side, we have a metadata element in the Pubs Warehouse Catalog schema to accommodate the "gddid" from xDD, and through this work we'll be getting that in place and working out how to keep it maintained.

In [1]:
import requests
import pandas as pd
from urllib.parse import urlparse, parse_qs

Here we use the "scan-and-scroll" method supported by the xDD articles API to fetch all basic item metadata for the USGS publisher. We can then go through that metadata to develop the linkage to our Pubs Warehouse base records and to identify where the metadata in xDD could be improved for better alignment with our own systems.

In [2]:
publisher = "USGS"
xdd_articles_api = f"https://geodeepdive.org/api/articles?publisher={publisher}&full_results"

# Follow the "next_page" scroll links until the API stops returning them.
all_records = []
next_page = xdd_articles_api
page_num = 1
while next_page:
    response = requests.get(next_page).json()
    if "success" in response and "data" in response["success"]:
        print("PROCESSING", len(response["success"]["data"]), "FROM PAGE NUMBER", page_num, "RECORDS OF", response["success"]["hits"])
        all_records.extend(response["success"]["data"])
        if "next_page" in response["success"]:
            next_page = response["success"]["next_page"]
            page_num += 1
        else:
            # No further scroll link, so this was the last page.
            break
    else:
        # Error or unexpected payload: stop rather than keep requesting.
        break


PROCESSING 500 FROM PAGE NUMBER 1 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 2 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 3 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 4 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 5 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 6 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 7 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 8 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 9 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 10 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 11 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 12 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 13 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 14 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 15 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 16 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 17 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 18 RECORDS OF 63976
PROCESSING 500 FROM PAGE NUMBER 19 RECORDS OF 63976
PROCESSING 500 FROM P
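
The scroll loop above can also be written as a small generator, which makes it easy to exercise without touching the network. This is only a sketch: `iter_scroll_pages`, the `fetch` callable, and the `fake_pages` payloads are hypothetical names and data made up for illustration, and the response shape (`success` / `data` / `next_page`) is assumed to match what the loop above reads. Unlike that loop, the sketch also stops when a page comes back empty.

```python
from typing import Callable, Iterator, List


def iter_scroll_pages(first_url: str, fetch: Callable[[str], dict]) -> Iterator[List[dict]]:
    """Yield each page of records, following "next_page" until it is absent or empty."""
    next_page = first_url
    while next_page:
        success = fetch(next_page).get("success", {})
        data = success.get("data")
        if not data:
            # Error payload or empty page: stop scrolling.
            break
        yield data
        next_page = success.get("next_page")


# Hypothetical stand-in for the API: two "pages" keyed by a fake URL.
fake_pages = {
    "page-1": {"success": {"data": [{"_gddid": "a"}, {"_gddid": "b"}], "next_page": "page-2"}},
    "page-2": {"success": {"data": [{"_gddid": "c"}]}},
}

records = [rec for page in iter_scroll_pages("page-1", fake_pages.__getitem__) for rec in page]
print(len(records), "records collected")
```

Against the live API, `fetch` could be something like `lambda url: requests.get(url, timeout=60).json()`, with the starting URL set to `xdd_articles_api`.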

In [3]:
df_usgs_xdd_collection = pd.DataFrame(all_records)

In [5]:
df_usgs_xdd_collection.head()

Unnamed: 0,type,_gddid,title,volume,journal,link,publisher,abstract,author,pages,number,identifier,year,cited_by
0,fulltext,5a8dbacfcf58f1815e19f708,Grover Heinrichs mining collection: Copper Cre...,,,[{'url': 'https://www.sciencebase.gov/catalog/...,USGS,,[],,,[],2016,
1,fulltext,5a8dbad0cf58f1815e19f70e,Grover Heinrichs mining collection: Exxon Company,,,[{'url': 'https://www.sciencebase.gov/catalog/...,USGS,,[],,,[],2016,
2,fulltext,5a8dbac9cf58f1815e19f700,Grover Heinrichs mining collection: Leadville ...,,,[{'url': 'https://www.sciencebase.gov/catalog/...,USGS,,[],,,[],2016,
3,fulltext,5a8dbad7cf58f1815e19f717,Vulture: Various Cost sheets,,,[{'url': 'https://www.sciencebase.gov/catalog/...,USGS,,[],,,[],2015,
4,fulltext,559dcd86e138230816dad9c0,Program disk to accompany MAGMIX (Computers &a...,,Open-File Report,[{'url': 'https://pubs.er.usgs.gov/publication...,United States Geological Survey,,"[{'name': 'Frost, T.P.'}, {'name': 'Lindsay, J...",,,"[{'type': 'doi', 'id': '10.3133/ofr88501'}]",1988,


In [33]:
xdd_article_links = df_usgs_xdd_collection[df_usgs_xdd_collection.link.str.len() > 0][["_gddid","link"]].explode("link")
xdd_article_links["url"] = xdd_article_links["link"].apply(lambda x: x["url"])
xdd_article_links.drop(columns="link", inplace=True)

xdd_article_links['domain'] = xdd_article_links['url'].apply(lambda x: urlparse(x).netloc)
xdd_article_links['path'] = xdd_article_links['url'].apply(lambda x: urlparse(x).path)
xdd_article_links['scheme'] = xdd_article_links['url'].apply(lambda x: urlparse(x).scheme)
xdd_article_links['params'] = xdd_article_links['url'].apply(lambda x: parse_qs(urlparse(x).query))
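
The cell above calls `urlparse` four times per URL. As a sketch of the same decomposition done in one pass, the helper below returns all four pieces together. `split_url` is a hypothetical helper name and the URL is a made-up example, not one taken from the xDD records.

```python
from urllib.parse import urlparse, parse_qs


def split_url(url: str) -> dict:
    """Break a URL into the same four pieces the cell above stores as columns."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.netloc,
        "path": parts.path,
        "params": parse_qs(parts.query),  # each value is a list, since keys may repeat
    }


example = split_url("https://example.org/catalog/item?id=abc&format=json")
print(example)
```

One way to apply this in the notebook would be `xdd_article_links["url"].apply(split_url).apply(pd.Series)`, which should yield the four columns at once.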


In [28]:
xdd_article_links.domain.unique()

array(['www.sciencebase.gov', 'pubs.er.usgs.gov', 'pubs.usgs.gov',
       'ngmdb.usgs.gov', 'ks.water.usgs.gov', 'pubs.water.usgs.gov',
       'www.dggs.dnr.state.ak.us', 'www.archive.org', 'wi.water.usgs.gov',
       'or.water.usgs.gov', 'water.usgs.gov', 'il.water.usgs.gov',
       'www.webharvest.gov', 'sofia.usgs.gov', 'ca.water.usgs.gov',
       'mo.water.usgs.gov', 'ky.water.usgs.gov', 'smig.usgs.gov',
       'nwql.usgs.gov', 'repository.agic.umn.edu', 'md.water.usgs.gov',
       'ny.water.usgs.gov', 'www.usgs.gov', 'mn.water.usgs.gov',
       'vulcan.wr.usgs.gov', 'nd.water.usgs.gov', 'ne.water.usgs.gov',
       'wwwsd.cr.usgs.gov', 'fl.water.usgs.gov', 'geology.er.usgs.gov',
       'pa.water.usgs.gov', 'pr.water.usgs.gov', 'www.nps.gov',
       'gulfsci.usgs.gov', 'quake.wr.usgs.gov', 'nsmp.wr.usgs.gov',
       'wa.water.usgs.gov', 'mi.water.usgs.gov', 'toxics.usgs.gov',
       'dggs.alaska.gov', 'coastal.er.usgs.gov', 'astrogeology.usgs.gov',
       'bqs.usgs.gov', 'erg.usgs.g

In [6]:
xdd_article_identifiers = df_usgs_xdd_collection[df_usgs_xdd_collection.identifier.str.len() > 0][["_gddid","identifier"]].explode("identifier")
xdd_article_identifiers["doi"] = xdd_article_identifiers["identifier"].apply(lambda x: x["id"])
xdd_article_identifiers.drop(columns="identifier", inplace=True)

We've got some duplicate DOIs in the xDD system, which I would not necessarily expect. This may be due to some of the ways we've used DOIs to resolve to collections rather than individual articles. I'll have to dig into these further, but for now we can only work with the DOIs that appear exactly once.

In [7]:
ids = xdd_article_identifiers["doi"]
duplicate_dois = xdd_article_identifiers[ids.isin(ids[ids.duplicated()])].sort_values("doi")
unique_dois = xdd_article_identifiers.drop_duplicates(subset="doi", keep=False)
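
To make the split above concrete, here is a toy version with made-up identifiers (the `g1`..`g4` and `10.0000/...` values are placeholders, not real records). It shows that `keep=False` drops every row whose DOI occurs more than once, rather than keeping a first occurrence, and that `duplicated(keep=False)` selects the same rows as the `isin` expression used above.

```python
import pandas as pd

# Placeholder data: one DOI ("10.0000/b") is attached to two different gddids.
toy = pd.DataFrame({
    "_gddid": ["g1", "g2", "g3", "g4"],
    "doi": ["10.0000/a", "10.0000/b", "10.0000/b", "10.0000/c"],
})

# Every row involved in a duplicated DOI.
dupes = toy[toy["doi"].duplicated(keep=False)]

# Only DOIs that appear exactly once survive.
uniques = toy.drop_duplicates(subset="doi", keep=False)

# How many distinct gddids share each duplicated DOI.
gddids_per_doi = dupes.groupby("doi")["_gddid"].nunique()

print(dupes)
print(uniques)
print(gddids_per_doi)
```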

# Interim Tangent

I'm shortcutting the process toward a comprehensive linkage between all Pubs Warehouse Catalog entities and their associated xDD entities to see if we can get some immediate connections for the handful of reports we have represented in the GeoKB and get those properties in place. Even though this is a tangent here, the same basic process should work to gather the lion's share of linkages from USGS reports held in the Pubs Warehouse to their xDD representations. Given truly unique, single-item DOIs (mostly in the 10.3133 identifier space) from xDD, we should be able to pull GDDIDs into the Pubs Warehouse catalog as a baseline. We'll then need to work out a process to pick up new identifiers once processing is complete, on whatever schedule the xDD pipelines are operating. We might, perhaps, look toward something old school like an RSS feed, though to my knowledge that doesn't exist yet on the xDD side.
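
Since the linkage hinges on matching DOI strings exactly, a small normalization step may help before joining. The sketch below rests on assumptions I have not checked against the data: that DOIs from the two systems might differ in letter case or carry a resolver prefix. `normalize_doi` and `has_usgs_prefix` are hypothetical helper names, and the last sample DOI is a made-up non-USGS placeholder. DOIs are case-insensitive, so lower-casing should not merge distinct identifiers.

```python
RESOLVER_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value: str) -> str:
    """Lower-case a DOI and strip a leading resolver prefix, if present."""
    doi = value.strip().lower()
    for prefix in RESOLVER_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def has_usgs_prefix(value: str) -> bool:
    """True when the DOI falls in the 10.3133 identifier space."""
    return normalize_doi(value).startswith("10.3133/")


samples = ["10.3133/mf2198E", "https://doi.org/10.3133/CIR718", "10.0000/not-usgs"]
flags = [has_usgs_prefix(s) for s in samples]
print(list(zip(samples, flags)))
```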

In [8]:
from wbmaker import WikibaseConnection

In [9]:
geokb = WikibaseConnection("GEOKB_CLOUD")

In [12]:
geokb_usgs_reports = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Fdoi%20%3Fgddid%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%2Fwdt%3AP2*%20wd%3AQ11%20.%0A%20%20%3Fitem%20wdt%3AP74%20%3Fdoi%20.%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3Fitem%20wdt%3AP93%20%3Fgddid%20.%0A%20%20%7D%0A%7D",
    output_format="dataframe"
)
geokb_usgs_reports["qid"] = geokb_usgs_reports["item"].apply(lambda x: x.split("/")[-1])
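
The percent-encoded URL above is hard to read, so here is the same query written out and encoded programmatically. The query text is simply the decoded form of the `query` parameter in that URL. Judging by the variable names, `P74` holds the DOI and `P93` the GDDID; what `P1`, `P2`, and `Q11` stand for is not stated here, so I leave them unlabeled. Passing `safe="*"` to `quote` is my choice to keep the `*` unescaped, as it is in the URL above.

```python
from urllib.parse import quote, urlparse, parse_qs

SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

query = """PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?doi ?gddid
WHERE {
  ?item wdt:P1/wdt:P2* wd:Q11 .
  ?item wdt:P74 ?doi .
  OPTIONAL {
    ?item wdt:P93 ?gddid .
  }
}"""

# Percent-encode everything except "*" so the query can travel as a URL parameter.
sparql_url = f"{SPARQL_ENDPOINT}?query={quote(query, safe='*')}"

# Decoding the parameter again should give back the readable query.
round_trip = parse_qs(urlparse(sparql_url).query)["query"][0]
print(round_trip == query)
```

Keeping the readable query in the notebook and encoding it at call time would make later changes to the selected properties easier to review.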

In [13]:
geokb_doi_to_gddid = pd.merge(
    left=geokb_usgs_reports[geokb_usgs_reports.doi.isin(unique_dois.doi)],
    right=unique_dois,
    how="left",
    on="doi"
)

In [15]:
geokb_doi_to_gddid.head(10)

Unnamed: 0,item,doi,gddid,qid,_gddid
0,https://geokb.wikibase.cloud/entity/Q54997,10.3133/mf2198E,5d2411d30b45c76caf91efe9,Q54997,5d2411d30b45c76caf91efe9
1,https://geokb.wikibase.cloud/entity/Q54999,10.3133/mf2198D,5d2411d30b45c76caf91efe3,Q54999,5d2411d30b45c76caf91efe3
2,https://geokb.wikibase.cloud/entity/Q55001,10.3133/mf2198C,5d2411d30b45c76caf91efea,Q55001,5d2411d30b45c76caf91efea
3,https://geokb.wikibase.cloud/entity/Q55033,10.3133/mf1985B,5da285b1998e17af825362e8,Q55033,5da285b1998e17af825362e8
4,https://geokb.wikibase.cloud/entity/Q55053,10.3133/mf2021F,5d4ea46e0b45c76cafa444df,Q55053,5d4ea46e0b45c76cafa444df
5,https://geokb.wikibase.cloud/entity/Q55069,10.3133/mf1996A,557f9100e13823bc80ba2755,Q55069,557f9100e13823bc80ba2755
6,https://geokb.wikibase.cloud/entity/Q55183,10.3133/cir718,55b8969ce13823bd29ba8893,Q55183,55b8969ce13823bd29ba8893
7,https://geokb.wikibase.cloud/entity/Q55188,10.3133/cir930D,,Q55188,5d24fa060b45c76caf9212c0
8,https://geokb.wikibase.cloud/entity/Q55190,10.3133/cir901,,Q55190,55b8f208e13823bd29ba8b0a
9,https://geokb.wikibase.cloud/entity/Q55192,10.3133/cir930A,,Q55192,55b8f0c4e13823bd29ba8b01
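
In the table above, `gddid` is what GeoKB already holds and `_gddid` is what the DOI match against xDD produced. Before writing anything back, it may be worth separating rows that are missing a GDDID from rows where the two values disagree. The sketch below uses two rows copied from the output above plus one invented conflicting row (`Q0`, with placeholder values) purely to show the second case.

```python
import pandas as pd

toy = pd.DataFrame({
    "qid": ["Q54997", "Q55188", "Q0"],
    # Value already stored in GeoKB (None where nothing is stored yet).
    "gddid": ["5d2411d30b45c76caf91efe9", None, "placeholder-old"],
    # Value found through the DOI match against xDD.
    "_gddid": ["5d2411d30b45c76caf91efe9", "5d24fa060b45c76caf9212c0", "placeholder-new"],
})

# Candidates for a new GDDID claim.
missing = toy[toy["gddid"].isnull()]

# Existing claims that disagree with xDD and need a manual look.
conflicts = toy[toy["gddid"].notnull() & (toy["gddid"] != toy["_gddid"])]

print(missing["qid"].tolist())
print(conflicts["qid"].tolist())
```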


In [None]:
# Only touch items that do not yet carry a GDDID in GeoKB.
for index, row in geokb_doi_to_gddid[geokb_doi_to_gddid["gddid"].isnull()].iterrows():
    item = geokb.wbi.item.get(row["qid"])

    # Reference the xDD articles API query for this DOI as the source of the claim.
    api_ref = geokb.models.References()
    api_ref.add(
        geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup["reference URL"],
            value=f"https://geodeepdive.org/api/articles?doi={row['doi']}"
        )
    )

    # Add the matched xDD identifier as an external ID claim.
    item.claims.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup["GDDID"],
            value=row["_gddid"],
            references=api_ref
        )
    )

    # Write the updated item back to GeoKB.
    response = item.write(
        summary="Added GDDID based on DOI match via xDD articles API"
    )
    print(response.id)
