This notebook works through a process for building a collection of model descriptor items in ScienceBase from a source spreadsheet and then getting that information back out of ScienceBase as a spreadsheet. Once the items are in ScienceBase, you could also use the built-in CSV export for a ScienceBase catalog search result, but doing it in code gives you full control over what goes into the spreadsheet.

ScienceBase also has a feature that lets you attach a spreadsheet to an item and provide a configuration snippet that is used to generate child items from the rows in the spreadsheet. I didn't immediately turn up documentation on how to do that, and I've forgotten the specifics.

Personally, I would use the spreadsheet-to-ScienceBase method once and then do everything else in ScienceBase from that point on. Otherwise, you would need a process to first look up the existing item in ScienceBase and then update it with whatever you changed in the spreadsheet. I did not build that kind of "upsert" method into this code.
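For anyone who does want to go the upsert route, here is a minimal sketch of just the decision step, not something this notebook uses. It assumes existing items can be matched on title, and it assumes you have already built a title-to-ID lookup from ScienceBase; `plan_upsert`, the lookup dictionary, and the item ID shown are all hypothetical names and values for illustration.

```python
def plan_upsert(spreadsheet_rows, existing_ids_by_title):
    """Split spreadsheet rows into items to create and items to update.

    Assumption: a row and an existing item are "the same" when titles match.
    """
    to_create, to_update = [], []
    for row in spreadsheet_rows:
        item = {"title": row["Model Name"]}
        existing_id = existing_ids_by_title.get(row["Model Name"])
        if existing_id is None:
            to_create.append(item)
        else:
            # Carry the existing ID so the item is updated, not duplicated.
            item["id"] = existing_id
            to_update.append(item)
    return to_create, to_update


# Illustrative data only; the ID below is a made-up placeholder.
rows = [{"Model Name": "1DTempPro"}, {"Model Name": "BBS"}]
existing = {"BBS": "hypothetical-item-id"}
to_create, to_update = plan_upsert(rows, existing)
print(to_create)
print(to_update)
```

The two lists would then be handed to whatever create and update calls you are using. Matching on title is fragile if model names get edited, which is part of why I would rather manage things in ScienceBase after the first load.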

In [1]:
import pandas as pd
from sciencebasepy import SbSession
import requests
import math

You need to establish an authenticated session with ScienceBase in order to write any items to the system. Insert your email address here, and when you execute the cell, you will be prompted for your Active Directory password.

In [2]:
sb = SbSession().loginc("sbristol@usgs.gov")

········


You will need to put the top-level "USGS Model Catalog" item somewhere to serve as the container/collection for the model descriptive items. I would personally put it at the very root of ScienceBase eventually, but for now it can go anywhere you want to house it in the near term. It just needs to be somewhere that you or someone else can open up for public access.

Here, I am getting my own "My Items" ID using a function in sciencebasepy. You can replace this with whatever ScienceBase Item ID (UUID) value you want to use or put it in your My Items space. We just want to be careful not to proliferate too many of these, and to clean up after ourselves.

In [3]:
parent_id_for_catalog = sb.get_my_items_id()

We need to create a top-level item in ScienceBase to house the collection. You don't need any information other than a title and parent ID to start an item, so that's all I've used here. The ScienceBase edit interface can be used readily enough to add descriptive information and things like organizational contacts for who is responsible for the work. I would get with Viv or Leslie to help craft some language that describes the work-in-progress USGS Model Catalog in a way that answers the obvious questions folks will have about it.

Note: If you end up fiddling with this process, it would be worthwhile to flesh out the JSON structure below for the "model_catalog_item" with at least the text you want to use for the body so you don't have to do that multiple times in the ScienceBase user interface.
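A minimal sketch of what that fleshed-out structure might look like. This assumes ScienceBase takes the descriptive text under a "body" key; the helper name, the parent ID, and the body text are placeholders, not final wording.

```python
def build_catalog_item(parent_id, body_text=None):
    """Assemble the JSON for the top-level catalog item."""
    item = {
        "title": "USGS Model Catalog",
        "parentId": parent_id,
    }
    # Assumption: the item description lives under the "body" key.
    if body_text:
        item["body"] = body_text
    return item


# Placeholder parent ID and text, for illustration only.
example_item = build_catalog_item(
    "hypothetical-parent-id",
    body_text="Placeholder description of the work-in-progress USGS Model Catalog.",
)
print(example_item)
```

Keeping the text in code this way means rerunning the notebook recreates the same description instead of retyping it in the user interface.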

In [4]:
model_catalog_item = {
    'title': 'USGS Model Catalog',
    'parentId': parent_id_for_catalog
}

model_catalog = sb.create_item(model_catalog_item)

Here we can look at the full item document that we just created. The ID for this item now becomes the parent ID that we'll use in creating the actual individual model descriptor items from the spreadsheet.

In [5]:
model_catalog

{'link': {'rel': 'self',
  'url': 'https://www.sciencebase.gov/catalog/item/5e8cfc9082cee42d13465f82'},
 'relatedItems': {'link': {'url': 'https://www.sciencebase.gov/catalog/itemLinks?itemId=5e8cfc9082cee42d13465f82',
   'rel': 'related'}},
 'id': '5e8cfc9082cee42d13465f82',
 'title': 'USGS Model Catalog',
 'provenance': {'dateCreated': '2020-04-07T22:20:00Z',
  'lastUpdated': '2020-04-07T22:20:00Z',
  'lastUpdatedBy': 'sbristol@usgs.gov',
  'createdBy': 'sbristol@usgs.gov'},
 'hasChildren': False,
 'parentId': '4f4f863be4b0c2aeb78a9e3f',
 'permissions': {'read': {'acl': [],
   'inherited': True,
   'inheritsFromId': '4f4f863be4b0c2aeb78a9e3f'},
  'write': {'acl': ['USER:sbristol@usgs.gov'],
   'inherited': True,
   'inheritsFromId': '4f4f863be4b0c2aeb78a9e3f'}},
 'distributionLinks': [],
 'locked': False}

There are lots of ways of working with spreadsheets of different kinds, but Pandas is pretty simple and convenient. Here we read the latest snapshot of the Excel file that I put in the code repo into a Pandas dataframe and show what it looks like.

In [6]:
usgs_models = pd.read_excel("USGS_models_named_models.xlsx")
usgs_models["output_links"] = usgs_models[["Output","Output.1","Output.2","Output.3","Output.4"]].values.tolist()
usgs_models.head()

Unnamed: 0,Model Name,Link,Contact(s),Output,Output.1,Output.2,Output.3,Output.4,Unnamed: 8,output_links
0,1DTempPro,https://water.usgs.gov/ogw/bgas/1dtemppro/,edswain@usgs.gov,https://doi.org/10.5066/P9Q8JGAO,,,,,,"[https://doi.org/10.5066/P9Q8JGAO, , nan, nan..."
1,BBS,https://www.mbr-pwrc.usgs.gov/bbs/,sbeliew@usgs.gov,https://doi.org/10.5066/F7JS9NHH,,,,,,"[https://doi.org/10.5066/F7JS9NHH, , nan, nan..."
2,BEWARE,https://catalog.data.gov/dataset/beware-databa...,aallwardt@usgs.gov,https://doi.org/10.5066/F7T43S20,,,,,,"[https://doi.org/10.5066/F7T43S20, , nan, nan..."
3,BISECT,https://pubs.er.usgs.gov/publication/sir20195045,edswain@usgs.gov,,,,,,,"[ , nan, nan, nan, nan]"
4,California Basin Characterization Model,https://ca.water.usgs.gov/projects/reg_hydro/b...,lflint@usgs.gov,,,,,,,"[nan, nan, nan, nan, nan]"


Now that we have a container to put items in and data to build items from, we can assemble a list of new ScienceBase Items and load them all at once with the create_items function. Looking at the data, we need to do a couple of things:

* Split contacts on semicolons for cases where there is more than one email address
* Grab all the output links from separate columns so we can add them as links

We'll check to make sure that any output links aren't already the same as the info link so we don't duplicate those unnecessarily in ScienceBase.
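Those steps can be sketched on their own, without pandas or ScienceBase. The preview above shows that the output cells hold a mix of URLs, NaN, and a single blank space, so this sketch also drops blank strings. The helper names are mine, and the two-address contact cell is made up for illustration.

```python
def split_contacts(cell):
    """Return the email addresses in a Contact(s) cell, split on semicolons."""
    if not isinstance(cell, str):  # empty cells come through pandas as NaN (a float)
        return []
    return [part.strip() for part in cell.split(";") if part.strip()]


def output_links(cells, reference_link):
    """Return usable output links: non-blank strings, no repeats, not the reference link."""
    seen = {reference_link}
    links = []
    for cell in cells:
        if not isinstance(cell, str):  # skip NaN
            continue
        link = cell.strip()
        if link and link not in seen:  # skip blanks and duplicates
            seen.add(link)
            links.append(link)
    return links


nan = float("nan")
contacts = split_contacts("edswain@usgs.gov; lflint@usgs.gov")  # illustrative cell
links = output_links(
    ["https://doi.org/10.5066/P9Q8JGAO", " ", nan, nan, nan],
    "https://water.usgs.gov/ogw/bgas/1dtemppro/",
)
print(contacts)  # ['edswain@usgs.gov', 'lflint@usgs.gov']
print(links)     # ['https://doi.org/10.5066/P9Q8JGAO']
```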

In this process, we need to make some choices about how we are going to deal with laying the information out.

* With the email addresses, we have a good source to use in finding the full person record for contacts we want to add. We can't necessarily assume these are "authors" at this point, so we'll put them in as simple "Point of Contact" type contacts for this context.
* We could do some sleuthing to classify the links in ways that would be really useful for the eventual catalog. For now, we can give the "Link" links a title like "Model Reference Link" to put them into a common context here. We'll title the others "Model Output Data" for the time being.

Some really interesting things could start to spool out from this just by gathering the links together like this. Some of the links represent machine-readable end points where code could be written to gather structured metadata from those sources and use it to build out a more complete picture of the models. We can explore what that might look like down the road.

I created a couple of helper functions here that make the processing workflow simpler. The first function builds the information content that ScienceBase needs to link a contact to the ScienceBase Directory. Really, this is something that should operate within the ScienceBase API itself where you simply send in a known contact and have the system connect the dots for you. This makes things kind of slow having to run all the queries against the ScienceBase Directory. I also got lazy in the process here in terms of not dealing with the search results thoroughly. It should work for this purpose, but I'd need more time to make it something real.

The second function just builds out the web links.

In [7]:
def sb_party_to_contact(search_term):
    """Look up a person in the ScienceBase Directory and build a contact dict.

    Returns None unless the search matches exactly one person.
    """
    # Pass the query through params so the search term gets URL-encoded.
    search_result = requests.get(
        "https://www.sciencebase.gov/directory/people",
        params={"q": search_term, "format": "json", "dataset": "all", "max": 10}
    ).json()

    # Zero or multiple matches are ambiguous, so give up on those.
    if search_result["total"] != 1:
        return None

    person_record = search_result["people"][0]
    person_ext = person_record["extensions"]["personExtension"]
    location = person_record["primaryLocation"]
    address = location["streetAddress"]

    sb_contact = {
        "name": person_record["displayName"],
        "oldPartyId": person_record["id"],
        "contactType": person_record["type"],
        "onlineResource": f"https://my.usgs.gov/catalog/Global/catalogParty/show/{person_record['id']}",
        "email": person_record["email"],
        "active": person_record["active"],
        "jobTitle": person_ext["jobTitle"],
        "firstName": person_ext["firstName"],
        "lastName": person_ext["lastName"],
        "cellPhone": person_ext["cellPhone"],
        "organization": {
            "displayText": person_ext["organizationDisplayText"]
        },
        "primaryLocation": {
            "name": location["name"],
            "building": location["building"],
            "buildingCode": location["buildingCode"],
            "officePhone": location["phone"],
            "faxPhone": location["faxPhone"],
            "streetAddress": {
                "line1": address["line1"],
                "city": address["city"],
                "state": address["state"],
                "zip": address["zip"]
            }
        },
    }

    # Not every person record carries an ORCID.
    if "orcId" in person_record:
        sb_contact["orcId"] = person_record["orcId"]

    return sb_contact


def sb_web_link(url, title="Model Reference Link"):
    """Build the ScienceBase web link structure for a URL."""
    return {
        "type": "webLink",
        "typeLabel": "Web Link",
        "uri": url,
        "rel": "related",
        "title": title,
        "hidden": False
    }
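Since the same email address can show up on more than one row (edswain@usgs.gov appears twice in the preview above), one way to cut down on slow directory queries is to cache lookups. This is a sketch using `functools.lru_cache` from the standard library; `make_cached_lookup` is a name I made up, and `fake_lookup` stands in for `sb_party_to_contact` so the example runs offline.

```python
from functools import lru_cache


def make_cached_lookup(lookup):
    """Wrap a one-argument lookup so each distinct term is queried only once."""
    @lru_cache(maxsize=None)
    def cached(term):
        return lookup(term)
    return cached


# Stand-in for sb_party_to_contact that records each call it receives.
calls = []

def fake_lookup(email):
    calls.append(email)
    return {"email": email}


cached_lookup = make_cached_lookup(fake_lookup)
for email in ["edswain@usgs.gov", "sbeliew@usgs.gov", "edswain@usgs.gov"]:
    cached_lookup(email)

print(len(calls))  # 2: the repeated address was served from the cache
```

One caveat: a cached lookup returns the same dictionary object each time, so copy it before changing anything on a per-item basis.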


This workflow builds out the items that need to be submitted to the collection. It creates a simple, bare-bones item with a title, contacts, and links, but that's a good start.

In [8]:
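model_descriptors = []

for _, record in usgs_models.iterrows():
    new_model_item = {
        "parentId": model_catalog["id"],
        "title": record["Model Name"],
        "webLinks": [sb_web_link(record["Link"])]
    }

    # Contact(s) holds one or more semicolon-separated emails, or NaN if the cell is empty.
    if isinstance(record["Contact(s)"], str):
        emails = [e.strip() for e in record["Contact(s)"].split(";") if e.strip()]
        # sb_party_to_contact returns None when the lookup is ambiguous; leave those out.
        contacts = [c for c in (sb_party_to_contact(e) for e in emails) if c is not None]
        if contacts:
            new_model_item["contacts"] = contacts

    # Add output links that are strings (NaN cells are floats) and differ from the reference link.
    reference_uris = [i["uri"] for i in new_model_item["webLinks"]]
    for link in record["output_links"]:
        if isinstance(link, str) and link not in reference_uris:
            new_model_item["webLinks"].append(sb_web_link(link, "Model Output Data"))

    model_descriptors.append(new_model_item)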

# To Do
I need to finish this off by creating the items in the model catalog collection and then showing how to get things back out and put into a spreadsheet.
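Until that is written up properly, here is a rough sketch of the "back out to a spreadsheet" half, using only the standard library. It assumes the items carry the same title, contacts, and webLinks structure built in this notebook; in practice they would be full item documents fetched from ScienceBase (I believe sciencebasepy has child ID and item retrieval functions for that, but check its documentation). The function names and the "Output 1" style column names are my own choices.

```python
import csv
import io


def item_to_row(item, max_outputs=5):
    """Flatten one item into a spreadsheet row keyed by column name."""
    links = item.get("webLinks", [])
    reference = [l["uri"] for l in links if l.get("title") == "Model Reference Link"]
    outputs = [l["uri"] for l in links if l.get("title") == "Model Output Data"]
    emails = [c["email"] for c in item.get("contacts", []) if c and c.get("email")]
    row = {
        "Model Name": item.get("title", ""),
        "Link": reference[0] if reference else "",
        "Contact(s)": ";".join(emails),
    }
    # Spread output links across a fixed number of columns.
    for n in range(max_outputs):
        row[f"Output {n + 1}"] = outputs[n] if n < len(outputs) else ""
    return row


def items_to_csv(items, max_outputs=5):
    """Return CSV text with one row per item."""
    fieldnames = ["Model Name", "Link", "Contact(s)"]
    fieldnames += [f"Output {n + 1}" for n in range(max_outputs)]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(item_to_row(i, max_outputs) for i in items)
    return buffer.getvalue()


# Sample item trimmed down from the structure this notebook builds.
sample_items = [{
    "title": "1DTempPro",
    "webLinks": [
        {"uri": "https://water.usgs.gov/ogw/bgas/1dtemppro/", "title": "Model Reference Link"},
        {"uri": "https://doi.org/10.5066/P9Q8JGAO", "title": "Model Output Data"},
    ],
    "contacts": [{"email": "edswain@usgs.gov"}],
}]
print(items_to_csv(sample_items))
```

Writing the result to disk is then just a matter of saving the returned text, or swapping the string buffer for an open file.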

In [9]:
model_descriptors

[{'parentId': '5e8cfc9082cee42d13465f82',
  'title': '1DTempPro',
  'webLinks': [{'type': 'webLink',
    'typeLabel': 'Web Link',
    'uri': 'https://water.usgs.gov/ogw/bgas/1dtemppro/',
    'rel': 'related',
    'title': 'Model Reference Link',
    'hidden': False},
   {'type': 'webLink',
    'typeLabel': 'Web Link',
    'uri': 'https://doi.org/10.5066/P9Q8JGAO',
    'rel': 'related',
    'title': 'Model Output Data',
    'hidden': False},
   {'type': 'webLink',
    'typeLabel': 'Web Link',
    'uri': ' ',
    'rel': 'related',
    'title': 'Model Output Data',
    'hidden': False}],
  'contacts': [{'name': 'Eric D Swain',
    'oldPartyId': 10133,
    'contactType': 'person',
    'onlineResource': 'https://my.usgs.gov/catalog/Global/catalogParty/show/10133',
    'email': 'edswain@usgs.gov',
    'active': True,
    'jobTitle': 'Hydrologist (RGE)',
    'firstName': 'Eric',
    'lastName': 'Swain',
    'cellPhone': None,
    'organization': {'displayText': 'Caribbean-Florida Water Scienc