After talking about the model catalog today, I got curious about the DOIs minted in the USGS ID space that claim to be models. This notebook takes a look. A quick look at the DataCite works API shows that quite a few DOIs have been minted, with the majority coming from work I did to register all those GAP species habitat models. But there seem to be some other interesting things in there. I put in a process to pull them all together by paginating through the API results and then stash them in a Pickle file for future reference.

In [1]:
import requests
import pandas as pd

recordset = list()

In [2]:
# Function to create lookup URL for USGS DOIs where resource type is model
def get_url(page_number):
    return f"https://api.datacite.org/works?data-center-id=usgs.prod&resource-type-id=model&page[number]={page_number}"


In [3]:
%%time
# Get the first page of results and set up a structure to contain everything
page_number = 1
result = requests.get(get_url(page_number)).json()

CPU times: user 26.7 ms, sys: 7.96 ms, total: 34.6 ms
Wall time: 1.69 s


In [4]:
%%time
# Loop through and build out the full result set (the lazy way)
while len(result["data"]) > 0:
    recordset.extend(result["data"])
    page_number += 1
    result = requests.get(get_url(page_number)).json()

CPU times: user 1.33 s, sys: 104 ms, total: 1.44 s
Wall time: 5min 57s
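
The loop in the cell above keeps requesting pages until one comes back with an empty "data" list. A minimal sketch of that same pattern as a reusable function follows. The names `collect_pages`, `fetch_page`, and `max_pages` are hypothetical and introduced here only for illustration; they are not part of the DataCite API. The fetcher is passed in so the loop can be tried without any network call.

```python
def collect_pages(fetch_page, start_page=1, max_pages=1000):
    """Gather the "data" items from successive pages until one is empty.

    fetch_page is a hypothetical callable: page number -> parsed JSON dict.
    max_pages is an assumed safety cap so a misbehaving API cannot loop forever.
    """
    records = []
    for page_number in range(start_page, start_page + max_pages):
        page = fetch_page(page_number)
        items = page.get("data", [])
        if not items:
            break
        records.extend(items)
    return records


# Stand-in fetcher with three records spread over two pages
_fake_pages = {1: {"data": ["a", "b"]}, 2: {"data": ["c"]}}


def fake_fetch(page_number):
    return _fake_pages.get(page_number, {"data": []})


demo_records = collect_pages(fake_fetch)
```

In this notebook the fetcher would presumably be something like `lambda n: requests.get(get_url(n)).json()`, which keeps the HTTP details separate from the stop condition.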


If we ran the process to get the latest set of model-type DOIs, we build a dataframe from the attributes structure in each record and stash it. If we didn't run it, we pull in the dataframe from the stashed Pickle file.

In [7]:
if len(recordset) == 0:
    df_models = pd.read_pickle("ModelsStash")
else:
    df_models = pd.DataFrame([i["attributes"] for i in recordset])
    df_models.to_pickle("ModelsStash")
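
The branch above assumes the stash file already exists whenever the harvest was skipped. Here is a minimal sketch of a slightly more defensive version of the same idea. It uses the standard-library `pickle` module on plain lists rather than the notebook's `pd.read_pickle` and `to_pickle` calls, and `load_or_stash` is a hypothetical helper name.

```python
import os
import pickle
import tempfile


def load_or_stash(records, path):
    """Return fresh records (stashing them) or fall back to the stash file.

    Hypothetical helper: works on plain Python lists, not the notebook's DataFrame.
    """
    if records:
        with open(path, "wb") as handle:
            pickle.dump(records, handle)
        return records
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    raise FileNotFoundError(f"No fresh records and no stash at {path}")


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    stash_path = os.path.join(tmp, "ModelsStash")
    first = load_or_stash([{"doi": "10.5066/p9hiyvg2"}], stash_path)
    second = load_or_stash([], stash_path)  # empty harvest: read the stash back
```

Raising an explicit error when neither source is available makes a first run with a failed harvest easier to diagnose.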


In [8]:
# See what the data look like in a dataframe
df_models.head(5)

Unnamed: 0,doi,identifier,url,author,title,container-title,description,resource-type-subtype,data-center-id,member-id,...,view-count,views-over-time,download-count,downloads-over-time,published,registered,checked,updated,media,xml
0,10.5066/p9hiyvg2,https://doi.org/10.5066/p9hiyvg2,https://water.usgs.gov/GIS/metadata/usgswrd/XM...,"[{'given': 'Tracie R', 'family': 'Jackson'}]",MODFLOW-2005 and PEST models used to character...,U.S. Geological Survey,,Model,usgs.prod,usgs,...,0,[],0,[],2020,2020-03-05T18:30:50.000Z,,2020-03-05T18:30:50.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
1,10.5066/f76t0jpb,https://doi.org/10.5066/f76t0jpb,https://ca.water.usgs.gov/projects/reg_hydro/b...,"[{'given': 'Lorraine E.', 'family': 'Flint'}, ...",California Basin Characterization Model: A Dat...,U.S. Geological Survey,"The Basin Characterization Model (BCM), can tr...",Model,usgs.prod,usgs,...,0,[],0,[],2014,2014-09-17T00:57:18.000Z,,2020-03-02T21:34:34.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
2,10.5066/f798853m,https://doi.org/10.5066/f798853m,http://water.usgs.gov/GIS/metadata/usgswrd/XML...,"[{'given': 'J Hal', 'family': 'Davis'}, {'give...",MODFLOW 2000 and MT3DMS models of potentiometr...,U.S. Geological Survey,This model is a preliminary characterization o...,Model,usgs.prod,usgs,...,0,[],0,[],2016,2016-05-05T18:15:15.000Z,,2020-03-02T21:18:55.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
3,10.5066/f76w98jb,https://doi.org/10.5066/f76w98jb,https://water.usgs.gov/GIS/metadata/usgswrd/XM...,"[{'given': 'Alex R', 'family': 'Fiore'}, {'giv...",MODFLOW-2005 model used to evaluate the potent...,U.S. Geological Survey,"A three-dimensional groundwater flow model, MO...",Model,usgs.prod,usgs,...,0,[],0,[],2018,2018-03-13T14:32:08.000Z,,2020-03-02T21:18:05.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
4,10.5066/f7h99392,https://doi.org/10.5066/f7h99392,http://water.usgs.gov/GIS/metadata/usgswrd/XML...,"[{'given': 'Stephen J', 'family': 'Cauller'}, ...",MODFLOW2005 model used to simulate the effects...,U.S. Geological Survey,A three-dimensional groundwater flow model was...,Model,usgs.prod,usgs,...,0,[],0,[],2016,2016-07-26T16:40:24.000Z,,2020-03-02T21:16:07.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...


In [9]:
# Set the option to be able to look at everything in columns for convenience
pd.set_option('display.max_colwidth', None)

One of the things I was interested in is where these DOIs actually point in terms of their dereferencing URLs. To examine this, I create a new column with just the domain part of the URL so that I can group and look at things.

In [10]:
# Take the host portion of each URL (the third "/"-separated piece)
df_models['url_domain'] = df_models.apply(lambda row: row.url.split("/")[2], axis=1)
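
Splitting on "/" works when every record has a well-formed absolute URL. As a hedged alternative, the sketch below pulls the host with the standard library's `urllib.parse.urlparse` and tallies the results with `collections.Counter`, with no pandas involved. `url_domain` is a hypothetical helper, and the `None` entry stands in for an assumed record with no URL.

```python
from collections import Counter
from urllib.parse import urlparse


def url_domain(url):
    """Return the host part of a URL, or None if there is none."""
    if not isinstance(url, str):
        return None
    return urlparse(url).hostname


sample_urls = [
    "https://code.usgs.gov/coawstmodel/COAWST",
    "https://code.usgs.gov/ecosystems/FRESC/sagebrush_hurdle_model",
    "http://axiomdatascience.com/maps/integrated/?portal_id=51#",
    None,  # hypothetical record with no URL
]
domain_counts = Counter(url_domain(u) for u in sample_urls)
```

Note that `hostname` lowercases the host and drops any port, so it can differ slightly from the split-based value for URLs that carry a port.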

In [11]:
df_models[["doi","url_domain"]].groupby('url_domain').count()

Unnamed: 0_level_0,doi
url_domain,Unnamed: 1_level_1
alaska.usgs.gov,1
axiomdatascience.com,1
ca.water.usgs.gov,1
coastal.er.usgs.gov,1
code.usgs.gov,2
dx.doi.org,1
github.com,1
my.usgs.gov,3
nrtwq.usgs.gov,3
regclim.coas.oregonstate.edu,1


That's kind of an interesting spread. Almost all of the 1761 "models" in ScienceBase are the GAP habitat maps, which are a model output of sorts from a habitat affinity-based species distribution modeling method that involves human intervention. But it looks like we have some other interesting things to look at and think about in terms of what they actually represent for a future model catalog. In the sections below, I output particular records, take a look to see what's on the other end, and provide some notes.

In [13]:
df_models.loc[df_models['url_domain'] == "usgs.gov"][["doi","url","title","description"]]

Unnamed: 0,doi,url,title,description
1874,10.5066/f78050nh,http://usgs.gov,Title,test111222
1875,10.5066/f7cr5rd0,https://usgs.gov,Title,description of the dataset


Well, those are disappointing. Looks like someone needs to do some cleanup of what actually got registered with DataCite, if that's even possible.
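
One way to flag records like these for cleanup is to look for DOIs whose target URL is only a bare domain with nothing after it. This is a rough sketch that assumes the heuristic is good enough for a first pass; `is_bare_domain` is a hypothetical helper, not something from the notebook.

```python
from urllib.parse import urlparse


def is_bare_domain(url):
    """True when the URL has a host but no path, query, or fragment."""
    parts = urlparse(url)
    has_no_path = parts.path in ("", "/")
    return bool(parts.netloc) and has_no_path and not parts.query and not parts.fragment


# URLs taken from the records shown in this notebook
flags = {
    u: is_bare_domain(u)
    for u in [
        "http://usgs.gov",
        "https://usgs.gov",
        "https://code.usgs.gov/coawstmodel/COAWST",
    ]
}
```

Applied to the dataframe, a filter along the lines of `df_models[df_models["url"].map(is_bare_domain)]` would list candidates, which would still need a human check.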

In [12]:
df_models.loc[df_models['url_domain'] == "axiomdatascience.com"][["doi","url","title","description"]]

Unnamed: 0,doi,url,title,description
113,10.5066/f72b8w3t,http://axiomdatascience.com/maps/integrated/?portal_id=51#,"Wave and Wind projections for United States Coasts; Mainland, Pacific Islands, and United States-Affiliated Pacific Islands","Coastal managers and ocean engineers rely heavily on projected average and extreme wave conditions for planning and design purposes, but when working on a local or regional scale, are faced with much uncertainty as changes in the global climate impart spatially-varying trends. Future storm conditions are likely to evolve in a fashion that is unlike past conditions and is ultimately dependent on the complicated interaction between the Earth?s atmosphere and ocean systems. Despite a lack of available data and tools to address future impacts, consideration of climate change is increasingly becoming a requirement for organizations considering future nearshore and coastal vulnerabilities. To address this need, the USGS used winds from four different atmosphere-ocean coupled general circulation models (AOGCMs) or Global Climate Models (GCMs) and the WaveWatchIII numerical wave model to compute historical and future wave conditions under the influence of two climate scenarios. The GCMs respond to specified, time-varying concentrations of various atmospheric constituents (such as greenhouse gases) and include an interactive representation of the atmosphere, ocean, land, and sea ice. The two climate scenarios are derived from the Coupled Model Inter-Comparison Project, Phase 5 (CMIP5; World Climate Research Programme, 2013) and represent one medium-emission mitigation scenario (Representative Concentration Pathways (RCP4.5) and one high-emissions scenario (RCP8.5). The historical time-period spans the years 1976 through 2005, whereas the two future time-periods encompass the mid (years 2026 through 2045) and end of the 21st century (years 2081 through 2099/2100). 
Continuous time-series of dynamically-downscaled hourly wave parameters (significant wave heights, peak wave periods, and wave directions) and three-hourly winds (wind speed and wind direction) are available for download at discrete deep-water locations along four U.S. coastal regions: ? Pacific Islands [this should be hyperlinked] ? West Coast [this should be hyperlinked] ? East Coast [this should be hyperlinked] ? Alaska Coasts [this should be hyperlinked] The data and cursory overviews of changing conditions along the coasts are summarized in (make these into links.. or have them as a box on the right-hand-side as is currently the case.. if so, then change the text here to read ?? along the coasts are summarized in the documents provided in the right-hand-side box.?) Storlazzi, C.D., Shope, J.B., Erikson, L.H., Hegermiller, C.A., and Barnard, P.L., 2015. Future wave and wind projections for United States and United States-affiliated Pacific Islands: U.S. Geological Survey Open-File Report 2015?1001, 426 p., http://dx.doi.org/10.3133/ofr20151001. Erikson, L.H., Hegermiller, C.E., Barnard, P.L., and Storlazzi, C. 2016. Wave projections for United States mainland coasts. U.S. Geological Survey pamphlet to accompany data set, http://dx.doi.org/10.5066/F7D798GR The time-series data cursory overviews provide information on trends and variability of geophysical variables that are expected to respond to changes in global-scale forcing. The data are being used for and are made available for further evaluation of trends and variability in offshore conditions, and as boundary conditions for regiona-l and local-scale coastal hazard models. Because winds and waves are the key processes driving extreme water levels and wave-driven flooding, the data are expected to be crucial for projecting future transient sea-level extremes on coasts and for defining areas that might be vulnerable to changing wind and wave conditions."


This one, from a USGS partner in Alaska that I know pretty well, looks interesting. The DOI dereferences to a web application run by Axiom Data Science that provides an interactive system to visualize and download the cached results (model output data) for a wave height and wind speed model covering both historic conditions and projected futures under two climate scenarios (RCP4.5 and RCP8.5) for coastal zones. The abstract and the web site list other interesting parts of the picture, including a USGS Open-File Report describing the methods and what is possibly an "off the books" web page that purportedly accompanied a data release but is the kind of thing that might otherwise have ended up in a USGS Data Series or something similar with a journal. This is a pretty interesting case that has just about all the components for a model catalog, except that I couldn't find the source code, at least with a cursory look.

In [14]:
df_models.loc[df_models['url_domain'] == "code.usgs.gov"][["doi","url","title","description"]]

Unnamed: 0,doi,url,title,description
344,10.5066/p9nquaow,https://code.usgs.gov/coawstmodel/COAWST,COAWST Modeling System v3.4,Coupled ocean atmosphere wave sediment transport modeling system
481,10.5066/p9nqnh41,https://code.usgs.gov/ecosystems/FRESC/sagebrush_hurdle_model,sagebrush_hurdle_model,
