After talking about the model catalog today, I got curious about the DOIs that have been minted in the USGS ID space that claim to be models. This notebook takes a look. A quick look at the DataCite works API shows that there are quite a few DOIs minted, with the majority coming from work I did to register all those GAP species habitat models. But there seem to be some other interesting things in there. I put in a process to pull all of those together by paginating through the API results and then stash them in a Pickle file for future reference.

In [1]:
import requests
import pandas as pd

recordset = list()

In [None]:
# Function to create lookup URL for USGS DOIs where resource type is model
def get_url(page_number):
    return f"https://api.datacite.org/works?data-center-id=usgs.prod&resource-type-id=model&page[number]={page_number}"


In [None]:
%%time
# Get the first page of results
page_number = 1
result = requests.get(get_url(page_number)).json()

In [None]:
%%time
# Loop through and build out the full result set (the lazy way)
while len(result["data"]) > 0:
    recordset.extend(result["data"])
    page_number += 1
    result = requests.get(get_url(page_number)).json()
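
The loop above keeps asking for the next page until one comes back empty. Here is a minimal sketch of that same pattern with the fetching step passed in as a function, so it can be tried without hitting the API. The names `collect_pages`, `fake_fetch`, and `max_pages` are my own placeholders, and the fake pages are invented data, not DataCite output.

```python
from typing import Callable, Dict, List


def collect_pages(fetch_page: Callable[[int], Dict], max_pages: int = 1000) -> List[Dict]:
    """Gather the "data" items from successive pages until an empty page comes back."""
    records: List[Dict] = []
    # max_pages is an assumed safety cap so a misbehaving API cannot loop forever
    for page_number in range(1, max_pages + 1):
        page = fetch_page(page_number)
        items = page.get("data", [])
        if not items:
            break
        records.extend(items)
    return records


def fake_fetch(page_number: int) -> Dict:
    """Stand-in for requests.get(get_url(n)).json(): three invented pages, then empty."""
    fake_pages = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}
    return {"data": [{"id": i} for i in fake_pages.get(page_number, [])]}


all_records = collect_pages(fake_fetch)
print(len(all_records))
```

In the notebook itself, something like `collect_pages(lambda n: requests.get(get_url(n)).json())` would play the role of the loop, assuming the API keeps returning an empty `data` list past the last page.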

If we ran the process above to get the latest set of model-type DOIs, we build a dataframe from the attributes structure in each record and stash it. If we didn't run it, we pull in the dataframe from the stashed Pickle file.

In [2]:
if len(recordset) == 0:
    df_models = pd.read_pickle("ModelsStash")
else:
    df_models = pd.DataFrame([i["attributes"] for i in recordset])
    df_models.to_pickle("ModelsStash")
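
For what it's worth, the same stash-or-load idea can be sketched with just the standard library. This is only an illustration under my own assumptions: `load_or_build` is a made-up helper name, the DOI string is a placeholder, and it writes to a temporary directory rather than the real `ModelsStash` file. It also fails loudly when there are no fresh records and no stash, which the cell above does not do.

```python
import os
import pickle
import tempfile


def load_or_build(stash_path, fresh_records):
    """Stash and return fresh records if there are any, otherwise read the stash."""
    if fresh_records:
        with open(stash_path, "wb") as f:
            pickle.dump(fresh_records, f)
        return fresh_records
    if not os.path.exists(stash_path):
        # Nothing fetched and nothing stashed: say so instead of failing obscurely
        raise FileNotFoundError(f"No fresh records and no stash at {stash_path}")
    with open(stash_path, "rb") as f:
        return pickle.load(f)


# Temporary location so the sketch never touches the real stash file
stash_path = os.path.join(tempfile.mkdtemp(), "ModelsStash")
first = load_or_build(stash_path, [{"doi": "10.5066/placeholder"}])  # placeholder DOI
second = load_or_build(stash_path, [])  # nothing fetched this time, so read the stash
print(second == first)
```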


In [3]:
# See what the data look like in a dataframe
df_models.head(5)

Unnamed: 0,doi,identifier,url,author,title,container-title,description,resource-type-subtype,data-center-id,member-id,...,view-count,views-over-time,download-count,downloads-over-time,published,registered,checked,updated,media,xml
0,10.5066/p9hiyvg2,https://doi.org/10.5066/p9hiyvg2,https://water.usgs.gov/GIS/metadata/usgswrd/XM...,"[{'given': 'Tracie R', 'family': 'Jackson'}]",MODFLOW-2005 and PEST models used to character...,U.S. Geological Survey,,Model,usgs.prod,usgs,...,0,[],0,[],2020,2020-03-05T18:30:50.000Z,,2020-03-05T18:30:50.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
1,10.5066/f76t0jpb,https://doi.org/10.5066/f76t0jpb,https://ca.water.usgs.gov/projects/reg_hydro/b...,"[{'given': 'Lorraine E.', 'family': 'Flint'}, ...",California Basin Characterization Model: A Dat...,U.S. Geological Survey,"The Basin Characterization Model (BCM), can tr...",Model,usgs.prod,usgs,...,0,[],0,[],2014,2014-09-17T00:57:18.000Z,,2020-03-02T21:34:34.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
2,10.5066/f798853m,https://doi.org/10.5066/f798853m,http://water.usgs.gov/GIS/metadata/usgswrd/XML...,"[{'given': 'J Hal', 'family': 'Davis'}, {'give...",MODFLOW 2000 and MT3DMS models of potentiometr...,U.S. Geological Survey,This model is a preliminary characterization o...,Model,usgs.prod,usgs,...,0,[],0,[],2016,2016-05-05T18:15:15.000Z,,2020-03-02T21:18:55.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
3,10.5066/f76w98jb,https://doi.org/10.5066/f76w98jb,https://water.usgs.gov/GIS/metadata/usgswrd/XM...,"[{'given': 'Alex R', 'family': 'Fiore'}, {'giv...",MODFLOW-2005 model used to evaluate the potent...,U.S. Geological Survey,"A three-dimensional groundwater flow model, MO...",Model,usgs.prod,usgs,...,0,[],0,[],2018,2018-03-13T14:32:08.000Z,,2020-03-02T21:18:05.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
4,10.5066/f7h99392,https://doi.org/10.5066/f7h99392,http://water.usgs.gov/GIS/metadata/usgswrd/XML...,"[{'given': 'Stephen J', 'family': 'Cauller'}, ...",MODFLOW2005 model used to simulate the effects...,U.S. Geological Survey,A three-dimensional groundwater flow model was...,Model,usgs.prod,usgs,...,0,[],0,[],2016,2016-07-26T16:40:24.000Z,,2020-03-02T21:16:07.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...


In [None]:
# Set the option to be able to look at everything in columns for convenience
pd.set_option('display.max_colwidth', None)

One of the things I was interested in is where these DOIs actually point in terms of their dereferencing URLs. To examine this, I create a new column with just the domain part of the URL so that I can group and look at things.

In [None]:
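# Split each URL on "/" and keep the third piece, which is the host
# (e.g. "https://host/path" -> ["https:", "", "host", "path"])
df_models['url_domain'] = df_models.apply(lambda row: row.url.split("/")[2], axis=1)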
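
Splitting on "/" works as long as every record has a well-formed URL. A slightly more defensive version, sketched here with the standard library's `urlparse`, returns `None` for missing or odd values instead of raising. The helper name `url_domain` is my own, and I'm assuming missing URLs show up as `None` or NaN.

```python
from urllib.parse import urlparse


def url_domain(url):
    """Return the lowercase host of a URL, or None when it is missing or has no host."""
    # Non-string values (None, NaN) and blank strings have no domain to report
    if not isinstance(url, str) or not url.strip():
        return None
    return urlparse(url.strip()).hostname


print(url_domain("https://water.usgs.gov/GIS/metadata/"))
print(url_domain(None))
```

This could be applied to the dataframe with something like `df_models["url"].map(url_domain)`. Note that it also lowercases the host and drops any port number, which the split approach does not.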

In [None]:
df_models[["doi","url_domain"]].groupby('url_domain').count()

That's kind of an interesting spread. Almost all of the 1761 "models" in ScienceBase are the GAP habitat maps, which are a model output of sorts from a habitat affinity-based species distribution modeling method that involves human intervention. But it looks like we have some other interesting things to look at and think about in terms of what they actually represent for a future model catalog. In the sections below, I output particular records, take a look to see what's on the other end, and provide some notes.

In [None]:
# Helper function to filter the dataframe for a particular URL domain
def get_domain_items(domain):
    return df_models.loc[df_models['url_domain'] == domain][["doi","url","title","description"]]

In [None]:
get_domain_items("usgs.gov")

Well, those are disappointing. Looks like someone needs to do some cleanup of what actually got registered with DataCite, if that's even possible.

In [None]:
get_domain_items("axiomdatascience.com")

This one from a USGS partner in Alaska that I know pretty well looks interesting. The DOI de-references to a web application run by Axiom Data Science that provides an interactive system to visualize and download the cached results (model output data) for a wave height and wind speed model that looks at both historic conditions and projected futures under several scenarios for coastal zones. The abstract and the web site list other interesting parts of the picture, including a USGS Open File Report describing the methods and what is possibly an "off the books" web page that purportedly accompanied a data release but is the kind of thing that might otherwise have ended up in a USGS Data Series or something similar with a journal. This is a pretty interesting case that has just about all the components for a model catalog except that I couldn't find the source code with a cursory look at least.

In [None]:
get_domain_items("code.usgs.gov")

I know a little bit about the COAWST model coupling system as I looked at it as an example for EarthMAP. The DOI reference here points to a fine enough source repository, but it does present some challenges in terms of interfacing toward the model catalog idea. You really have to dig into the user manual in a Word document to get at deeper level information about the model, which means writing code to crawl and put pieces together here would be pretty challenging.

I'd heard about the Sagebrush Hurdle Model but hadn't looked at the source code before. This is an interesting case where there is some useful metadata embedded in the README, namely a reference to a journal article that the model was used in. The repo also has a CSV file that contains the input data for the model, which probably happens a fair bit depending on the size of the necessary data. It's not a bad practice, in some ways, to stash input data with the model. However, in this case, there's no real documentation in or with the dataset (species occurrence points and basic environmental conditions) to understand where it came from without some sleuthing. So, it gets points for transparency, but some demerits for re-usability.

In [None]:
get_domain_items("github.com")

This is an interesting case of a minted DOI pointing to a provisional code repo sitting in a personal space in GitHub. I guess that's okay under policy, but it opens a few questions. I would tend to not mint a DOI for something that was not going to eventually be production code. Richie Erickson does provide a nice disclaimer up front in the README indicating that he doesn't expect someone to build on it. There's a title reference to an article that's now [published](https://www.ncbi.nlm.nih.gov/pubmed/31188494), and it would be good to have some guidance about following up to put links like that into some type of code metadata that supports the reference.

In [None]:
get_domain_items("my.usgs.gov")

These three cases are pointing to the Atlassian BitBucket instance that we've now decommissioned on myUSGS. Someone probably should follow up to see if these repos got moved somewhere else and then change the dereferencing URL in the DOI system. The lack of a description in the DataCite metadata here (and elsewhere) should also be a flag.
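
To get a feel for how widespread the missing descriptions are, something like the following could tally them by URL host. It's a rough sketch over plain dictionaries shaped like the `attributes` part of each record; `missing_description_counts` is a name I made up, and the sample records are invented placeholders rather than real DOI metadata.

```python
from collections import Counter
from urllib.parse import urlparse


def missing_description_counts(records):
    """Count records with an empty or absent description, grouped by URL host."""
    counts = Counter()
    for attributes in records:
        description = attributes.get("description")
        if isinstance(description, str) and description.strip():
            continue  # this record has a usable description
        host = urlparse(attributes.get("url") or "").hostname or "(no url)"
        counts[host] += 1
    return counts


# Invented sample records, only to exercise the function
sample = [
    {"url": "https://my.usgs.gov/placeholder-a", "description": None},
    {"url": "https://my.usgs.gov/placeholder-b", "description": "  "},
    {"url": "https://code.usgs.gov/placeholder-c", "description": "Has text"},
    {"url": None, "description": ""},
]
counts = missing_description_counts(sample)
print(counts)
```

With the real data, `[i["attributes"] for i in recordset]` would be the natural thing to pass in.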

In [None]:
get_domain_items("regclim.coas.oregonstate.edu")

This is a web site that provides access to data distribution services for NetCDF files from a climate model downscaling process. It's essentially model output data or perhaps more precisely, model derivative data. This process has been done in a number of different places using a variety of downscaling methods appropriate to different uses. The most notable framework in USGS for distributing these data with some value-added services for statistical summarization is through the GeoDataPortal. In this case, the DOI might actually be miscategorized as a model but it's related to modeling, indicating that some guidance and parameters are probably needed in USGS policy.

In [None]:
get_domain_items("dx.doi.org")

This is a curious one that dereferences to another DOI that is not active at the moment. Something weird happened with this. It seems like we should have some safeguards in our toolset to keep this kind of thing from happening.
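
One simple safeguard along these lines would be to flag any DOI whose target URL is itself on a DOI resolver host before it gets registered. The sketch below is just that check and nothing more. The function name and the list of resolver hosts are my assumptions, and the URLs in the example calls are placeholders.

```python
from urllib.parse import urlparse

# Assumed set of DOI resolver hosts; extend as needed
DOI_RESOLVER_HOSTS = {"doi.org", "dx.doi.org", "www.doi.org"}


def points_to_doi_resolver(target_url):
    """Return True when a DOI's target URL is hosted on a DOI resolver."""
    host = urlparse(target_url or "").hostname
    return host in DOI_RESOLVER_HOSTS


print(points_to_doi_resolver("http://dx.doi.org/10.5066/placeholder"))
print(points_to_doi_resolver("https://www.sciencebase.gov/placeholder"))
```

A check like this would not catch a target that is simply dead, only the DOI-pointing-at-a-DOI case seen here.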

The description does present an interesting class of model "stuff" that will end up in our catalog somewhere - inputs and outputs from hypothetical model runs made to critically examine aspects of model performance or provide some level of calibration. These are important artifacts in a modeling lifecycle that we should come up with a way to handle, but they should be somehow distinct from other types of model "packages."

In [None]:
# Provide an index on a key substring identifying GAP habitat maps and then show the rest
sciencebase_items = get_domain_items("www.sciencebase.gov")
sciencebase_items["hab_map"] = sciencebase_items["title"].str.find("_2001v1",2)
sciencebase_items.loc[sciencebase_items["hab_map"] < 0]

I already know about the GAP habitat maps, but I was curious about what other "models" are distributed via ScienceBase. It looks like there's some cool stuff online. The bunch of missing descriptions is problematic and something that should be easily fixed in the DOI system. Unfortunately, ScienceBase seems to be taking some much-needed time off this evening, so I can't pull up any of these. The descriptions look like model outputs, and I would guess that the ScienceBase Items might actually be classed as data releases (or at least they could have been). The nice thing about using ScienceBase as the backend landing system for these DOIs would be the reasonably simple structured metadata capability. I'd be looking for things like links to source code, links to associated publications, and links or relationships to input data that might have been structured into the items, making the production of a linked catalog more feasible.