This is a rough first pass at a processing algorithm for ScienceBase Items, focused on the CASC research projects. It broke at a number of points along the way, so it is quite inelegant. I'll come back and build out a more generalized processor that handles more use cases for content originating in ScienceBase. Some things that are seriously a pain:

* Too many people have injected meaning into the item hierarchy (aka "Folders"). This means we have to look at the hierarchy and work out what it translates into, instead of having all that information encoded at the item level.
* There is no real notion of controlled semantics in ScienceBase. The structure is there for tags to be part of something larger, but each community has gone its own way to suit its own needs. Cases like this one show a lot of inconsistency, so we have a hard time making real use of tags without a ton of interpretation.
* The ScienceBase API remains a bugaboo: at times it is simply unresponsive, and at other times it hums along just fine.

In [55]:
import os
import requests
from wbmaker import WikibaseConnection
import pandas as pd
import pickle

geokb = WikibaseConnection('GEOKB_CLOUD')


# ScienceBase Content

In [3]:
cache_path = '../data/casc_projects.pkl'

if os.path.exists(cache_path):
    # Reuse the cached API results when they are available
    with open(cache_path, 'rb') as f:
        casc_projects = pickle.load(f)
else:
    casc_project_query = "https://www.sciencebase.gov/catalog/items?max=100&folderId=4f4e476ae4b07f02db47e13b&format=json&fields=title,body,contacts,tags,parentId,dates,browseTypes,systemTypes,spatial,facets,distributionLinks&filter0=browseCategory%3DProject"
    casc_projects = []
    # Page through the results by following 'nextlink' until it runs out
    while True:
        r = requests.get(casc_project_query).json()
        if 'items' in r:
            print(len(r['items']), r['items'][-1]['title'])
            casc_projects.extend(r['items'])
        if 'nextlink' in r:
            casc_project_query = r['nextlink']['url']
        else:
            break

    # Cache the results so the API does not have to be hit again
    with open(cache_path, 'wb') as f:
        pickle.dump(casc_projects, f)

df_casc_projects = pd.DataFrame(casc_projects)
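
Given how often the ScienceBase API stalls, the paging loop above could be wrapped in a retry with exponential backoff. The following is only a sketch: `fetch_with_retry` and `collect_items` are hypothetical helper names, the retry counts and delays are arbitrary, and the `fetch` callable is assumed to return parsed JSON or raise on failure.

```python
import time


def fetch_with_retry(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or the attempts run out.

    fetch is any callable that returns parsed JSON or raises on failure.
    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)


def collect_items(fetch, first_url, **retry_kwargs):
    """Follow 'nextlink' paging (as in the loop above) and gather all 'items'."""
    items, url = [], first_url
    while url:
        page = fetch_with_retry(fetch, url, **retry_kwargs)
        items.extend(page.get('items', []))
        url = page.get('nextlink', {}).get('url')
    return items


# Possible use with requests (assumed wiring, not part of the original run):
# def sb_get(url):
#     r = requests.get(url, timeout=60)
#     r.raise_for_status()
#     return r.json()
# casc_projects = collect_items(sb_get, casc_project_query)
```

Passing the fetch function in keeps the retry logic independent of `requests`, which also makes it easy to exercise without touching the network.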

# Tags

In [4]:
type_classification = {
    'theme': [
        'Water, Coasts and Ice', 
        'Science Tools for Managers',
        'Wildlife and Plants',
        'Indigenous Peoples',
        'Landscapes',
        'Drought, Fire and Extreme Weather',
        'Coastal Science',
        'Education, Modeling and Tools'
    ],
    'keyword': [
        'keywords',
        'Label'
    ],
    'place': [
        'Place',
        'Location',
        'SWCSC States',
        'SWCSC Geographic Area'
    ],
    'categorization': [
        'CMS Themes',
        'CMS Topics',
        'Science Themes',
        'SWCSC Habitat Types',
    ],
    'organizations': [
        'Organization',
        'Climate and Land Use Change Mission Area'
    ],
    'time': [
        'Fiscal Year'
    ],
    'drop_types': [
        'Categories',
        'Community',
        'SWCSC Taxon',
        'Taxon',
        'CMS Status',
        'Research Topics',
        'SWCSC Funding Category'
    ]
}

casc_project_tags = df_casc_projects[['id','tags']].explode('tags')
casc_project_tags = pd.concat([casc_project_tags['id'], casc_project_tags['tags'].apply(lambda x: pd.Series(x))], axis=1)

casc_project_tags = casc_project_tags[~casc_project_tags['type'].isin(type_classification['drop_types'])]
casc_project_tags['term'] = casc_project_tags['name'].str.lower()

tag_type_corrections = {
    'Water, Coasts and Ice': [
        'Water, Coasts and Ice', 
        'Water, Coasts, and Ice',
    ],
    'Science Tools for Managers': [
        'Science Tools for Managers',
        'Science Tools For Managers', 
        'Science Tools for Manager',
    ],
    'Wildlife and Plants': [
        'Wildlife and Plants',
        'Wildlife and plants',
    ],
    'Indigenous Peoples': [
        'Indigenous Peoples',
        'Indigenous People',
    ],
    'keywords': [
        'Keywords',
        'Keyword'
    ],
    'Science Themes': [
        'Science Themes',
        'CASC Science Themes',
        'NCCWSC Science Themes',
        'Theme',
    ]
}

for k,v in tag_type_corrections.items():
    casc_project_tags['type'] = casc_project_tags['type'].apply(lambda x: k if x in v else x)
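
The correction loop above rescans the whole column once per canonical type. A possible alternative is to invert the corrections into a single variant lookup and apply it in one pass. This is a sketch under assumptions: `build_variant_lookup` and `normalize_tag_type` are hypothetical names, and matching is made case-insensitive and whitespace-tolerant here, which is looser than the exact matching in the original cell.

```python
def build_variant_lookup(corrections):
    """Invert {canonical: [variants]} into {lowercased variant: canonical}."""
    lookup = {}
    for canonical, variants in corrections.items():
        for variant in variants:
            lookup[variant.strip().lower()] = canonical
    return lookup


def normalize_tag_type(tag_type, lookup):
    """Return the canonical type for a tag, or the input unchanged if unknown."""
    if not isinstance(tag_type, str):
        return tag_type  # missing types (None/NaN) pass through untouched
    return lookup.get(tag_type.strip().lower(), tag_type)


# Small subset of the corrections above, for illustration only
corrections = {
    'Water, Coasts and Ice': ['Water, Coasts and Ice', 'Water, Coasts, and Ice'],
    'keywords': ['Keywords', 'Keyword'],
}
lookup = build_variant_lookup(corrections)
```

With the full `tag_type_corrections` dictionary, this could be applied as `casc_project_tags['type'].map(lambda t: normalize_tag_type(t, lookup))`.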

## Themes

In [5]:
def derived_theme(tag_type, tag_name):
    for theme in type_classification['theme']:
        if tag_type == theme:
            return theme
        if tag_type == theme.lower():
            return theme
        if tag_name == theme:
            return theme
        if tag_name == theme.lower():
            return theme
    return None

casc_project_tags['derived_theme'] = casc_project_tags.apply(lambda x: derived_theme(x['type'], x['name']), axis=1)

casc_themes = casc_project_tags[['id','derived_theme']].dropna().drop_duplicates().reset_index(drop=True)
casc_project_themes = casc_themes.groupby('id', as_index=False).agg(list)

## Future Experimentation

In [None]:
for type_name, type_list in type_classification.items():
    if type_name != 'drop_types':
        for type_item in type_list:
            unique_tags = list(casc_project_tags[casc_project_tags['type'] == type_item]['term'].unique())
            print(type_name, ':', type_item)
            print(unique_tags)
            print()

no_type_tags = list(casc_project_tags[casc_project_tags['type'].isnull()]['term'].unique())
print('No Type')
print(no_type_tags)


# Person Contacts

In [6]:
# Isolate contacts for processing and filter to those that can be identified (email or ORCID)
casc_project_contacts = df_casc_projects[['id','contacts']].explode('contacts')
casc_project_contacts = pd.concat([casc_project_contacts['id'], casc_project_contacts['contacts'].apply(lambda x: pd.Series(x))], axis=1)
# regex=False matches 'usgs.gov' literally (an unescaped '.' would match any character);
# na=False treats missing emails as non-matches
identifiable_contacts = casc_project_contacts[
    casc_project_contacts['email'].str.contains('usgs.gov', regex=False, na=False)
    |
    casc_project_contacts['orcId'].notnull()
][['id','type','name','email','orcId']]

# Get the QIDs and identifiers for people from the GeoKB
person_query = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?email ?orcid
WHERE {
  ?item wdt:P1 wd:Q3 .
  OPTIONAL {
    ?item wdt:P109 ?email .
  }
  OPTIONAL {
    ?item wdt:P106 ?orcid .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

geokb_persons = geokb.sparql_query(person_query)
geokb_persons['person_qid'] = geokb_persons['item'].apply(lambda x: x.split('/')[-1])

# Remove people that are not in the GeoKB
remove_person_qids = [
    'Q45116'
]
geokb_persons = geokb_persons[~geokb_persons['person_qid'].isin(remove_person_qids)] 

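# Strip the mailto: prefix; the isinstance check also guards against NaN, which is truthy and has no .split
geokb_persons['email'] = geokb_persons['email'].apply(lambda x: x.split('mailto:')[-1] if isinstance(x, str) else None)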

# Link contacts to GeoKB people
orcid_matches = identifiable_contacts[identifiable_contacts['orcId'].notnull() & identifiable_contacts['orcId'].isin(geokb_persons['orcid'])]
email_matches = identifiable_contacts[
    ~identifiable_contacts['orcId'].isin(orcid_matches['orcId'])
    &
    identifiable_contacts['email'].notnull()
    &
    identifiable_contacts['email'].isin(geokb_persons['email'])
]

linkable_contacts = pd.concat([
    pd.merge(
        left=orcid_matches[['id','type','orcId']].rename(columns={'orcId':'orcid'}),
        right=geokb_persons[['person_qid','orcid']],
        how='left',
        on='orcid'
    ),
    pd.merge(
        left=email_matches[['id','type','email']],
        right=geokb_persons[['person_qid','email']],
        how='left',
        on='email'
    )    
])

linkable_contacts['predicate'] = linkable_contacts['type'].apply(lambda x: geokb.prop_lookup['principal investigator'] if x == 'Principal Investigator' else geokb.prop_lookup['investigator'])
linkable_contacts.drop(columns=['type','orcid','email'], inplace=True)
linkable_contacts.head()

|  | id | person_qid | predicate |
|---|---|---|---|
| 0 | 55195ee5e4b0323842782fd0 | Q44959 | P156 |
| 1 | 5012ab04e4b05140039e02f8 | Q49678 | P155 |
| 2 | 4f83459ce4b0e84f608680f1 | Q44554 | P156 |
| 3 | 5006f498e4b0abf7ce733f92 | Q138438 | P155 |
| 4 | 544a7a4fe4b03653c63f88e6 | Q48298 | P155 |
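
The matching rule used above (ORCID first, email as a fallback) can also be written without the two-step merge. The sketch below uses plain dictionaries; `index_persons` and `resolve_contact` are hypothetical names, and lowercasing emails before comparison is an added assumption that the original cell does not make.

```python
def _clean(value, lower=False):
    """Return a stripped string, or None for missing or empty values."""
    if not isinstance(value, str):
        return None
    value = value.strip()
    if lower:
        value = value.lower()
    return value or None


def index_persons(persons):
    """Build ORCID and email indexes from [{'qid', 'orcid', 'email'}, ...]."""
    by_orcid, by_email = {}, {}
    for p in persons:
        orcid = _clean(p.get('orcid'))
        email = _clean(p.get('email'), lower=True)
        if orcid:
            by_orcid[orcid] = p['qid']
        if email:
            by_email[email] = p['qid']
    return by_orcid, by_email


def resolve_contact(contact, by_orcid, by_email):
    """Return the QID for a contact, preferring ORCID over email; None if unmatched."""
    orcid = _clean(contact.get('orcId'))
    if orcid in by_orcid:
        return by_orcid[orcid]
    return by_email.get(_clean(contact.get('email'), lower=True))
```

A contact whose ORCID is unknown to the GeoKB still falls through to the email lookup, which mirrors how `email_matches` is built above.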


# Org Contacts
We can derive the connection from a project to the specific (regional) CASC it belongs to by looking at the folder structure in ScienceBase. This is another case where ScienceBase usage has drifted far from its original design intent: we essentially have to look outside the metadata model itself and examine the parentId in order to recover a fundamental piece of information. The whole folder thing in ScienceBase is really whack!
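
As a rough illustration of what "reading meaning out of the hierarchy" involves, the sketch below walks up `parentId` links until it reaches a folder known to belong to a CASC. It is more general than what this notebook does (which only checks a project's direct parent), and the function name, the input dictionaries, and the `max_depth` cutoff are all assumptions made for the example.

```python
def nearest_casc(item_id, parent_of, casc_by_folder, max_depth=10):
    """Walk up parentId links until a folder that maps to a CASC is found.

    parent_of:      {item_id: parentId} for every item and folder of interest
    casc_by_folder: {folder_id: org_qid} for folders known to belong to a CASC
    Returns the org QID, or None if no ancestor within max_depth maps to one.
    """
    current = parent_of.get(item_id)
    for _ in range(max_depth):
        if current is None:
            return None
        if current in casc_by_folder:
            return casc_by_folder[current]
        current = parent_of.get(current)
    return None  # depth limit reached (also protects against cycles)


# Made-up identifiers, for illustration only
parent_of = {'proj': 'fy2019', 'fy2019': 'ak_root', 'orphan': 'elsewhere'}
casc_by_folder = {'ak_root': 'Q_ALASKA'}
print(nearest_casc('proj', parent_of, casc_by_folder))
```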

In [7]:
query_casc_org = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?sciencebase_id
WHERE {
  ?item wdt:P1 wd:Q138341 ;
        wdt:P124 ?sciencebase_id .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

geokb_orgs = geokb.sparql_query(query_casc_org)
geokb_orgs['org_qid'] = geokb_orgs['item'].apply(lambda x: x.split('/')[-1])

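def project_containers(sb_id):
    # List the folders directly under a CASC's ScienceBase item; projects sit inside these
    url = f"https://www.sciencebase.gov/catalog/items?max=1000&format=json&fields=id&filter0=systemType%3DFolder&folderId={sb_id}"
    sb_r = requests.get(url).json()
    # A response without 'items' yields an empty list instead of a KeyError
    return [x['id'] for x in sb_r.get('items', [])]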

geokb_orgs['project_containers'] = geokb_orgs['sciencebase_id'].apply(project_containers)

project_container_lookup = geokb_orgs[['org_qid','sciencebase_id']].set_index('sciencebase_id')['org_qid'].to_dict()
project_container_lookup.update(
    geokb_orgs[['org_qid','project_containers']].explode('project_containers').set_index('project_containers')['org_qid'].to_dict()
)

project_orgs = df_casc_projects[['id','title','parentId']].dropna().drop_duplicates().reset_index(drop=True)
project_orgs['casc_org_qid'] = project_orgs['parentId'].apply(lambda x: project_container_lookup[x] if x in project_container_lookup else None)
project_orgs.dropna(subset=['casc_org_qid'], inplace=True)

project_orgs.head()

|  | id | title | parentId | casc_org_qid |
|---|---|---|---|---|
| 0 | 5a0b360ee4b09af898cb6f97 | Prioritizing Stream Temperature Data Collectio... | 55130bb3e4b02e76d75c0748 | Q138342 |
| 1 | 600a4cfcd34e162231fb2635 | Pacific Islands Climate Adaptation Science Cen... | 5ffdfff4d34e52c3b3d9e70e | Q138348 |
| 2 | 604906a4d34eb120311a9f64 | Regional Collaborations | 600a4cfcd34e162231fb2635 | Q138348 |
| 3 | 5d8e59b2e4b0c4f70d0ccdbc | A Synthesis of Recent Links Between Climate Ch... | 5bc8d791e4b0fc368ebfe3e7 | Q138344 |
| 4 | 5a68f114e4b06e28e9c7234f | Migration Mismatch: Bird Migration and Phenolo... | 5952991de4b062508e3c770b | Q138344 |


# Core project records 

In [8]:
casc_projects_to_geokb = pd.merge(
    left=project_orgs,
    right=geokb_orgs[['org_qid','itemLabel']].rename(columns={'org_qid': 'casc_org_qid', 'itemLabel':'casc_org_name'}),
    how='left',
    on='casc_org_qid'
)

casc_projects_to_geokb = pd.merge(
    left=casc_projects_to_geokb,
    right=casc_project_themes,
    how='left',
    on='id'
)

def project_description(casc_name, casc_themes):
    desc = f"research project from the {casc_name}"
    if isinstance(casc_themes, list):
        desc += f" focused on {', '.join(casc_themes)}"
    return desc

casc_projects_to_geokb['description'] = casc_projects_to_geokb.apply(lambda x: project_description(x['casc_org_name'], x['derived_theme']), axis=1)

casc_projects_to_geokb.head()

|  | id | title | parentId | casc_org_qid | casc_org_name | derived_theme | description |
|---|---|---|---|---|---|---|---|
| 0 | 5a0b360ee4b09af898cb6f97 | Prioritizing Stream Temperature Data Collectio... | 55130bb3e4b02e76d75c0748 | Q138342 | Alaska Climate Adaptation Science Center | [Water, Coasts and Ice] | research project from the Alaska Climate Adapt... |
| 1 | 600a4cfcd34e162231fb2635 | Pacific Islands Climate Adaptation Science Cen... | 5ffdfff4d34e52c3b3d9e70e | Q138348 | Pacific Islands Climate Adaptation Science Center |  | research project from the Pacific Islands Clim... |
| 2 | 604906a4d34eb120311a9f64 | Regional Collaborations | 600a4cfcd34e162231fb2635 | Q138348 | Pacific Islands Climate Adaptation Science Center |  | research project from the Pacific Islands Clim... |
| 3 | 5d8e59b2e4b0c4f70d0ccdbc | A Synthesis of Recent Links Between Climate Ch... | 5bc8d791e4b0fc368ebfe3e7 | Q138344 | National Climate Adaptation Science Center | [Science Tools for Managers] | research project from the National Climate Ada... |
| 4 | 5a68f114e4b06e28e9c7234f | Migration Mismatch: Bird Migration and Phenolo... | 5952991de4b062508e3c770b | Q138344 | National Climate Adaptation Science Center | [Wildlife and Plants] | research project from the National Climate Ada... |


# Dates

In [9]:
casc_project_dates = df_casc_projects[['id','dates']].explode('dates')
casc_project_dates = pd.concat([casc_project_dates['id'], casc_project_dates['dates'].apply(lambda x: pd.Series(x))], axis=1)
casc_project_dates = casc_project_dates[casc_project_dates['type'].isin(['Start','End'])][['id','type','dateString']]
casc_project_dates['date'] = pd.to_datetime(casc_project_dates['dateString'])
casc_project_dates['date'] = casc_project_dates['date'].apply(lambda x: x.strftime('+%Y-%m-%dT00:00:00Z'))
casc_project_dates['predicate'] = casc_project_dates['type'].apply(lambda x: geokb.prop_lookup['start time'] if x == 'Start' else geokb.prop_lookup['end time'])
casc_project_dates['precision'] = casc_project_dates['dateString'].apply(lambda x: 'year' if len(x) == 4 else 'day')
casc_project_dates.drop(columns=['type','dateString'], inplace=True)
casc_project_dates.head()

|  | id | date | predicate | precision |
|---|---|---|---|---|
| 0 | 5a0b360ee4b09af898cb6f97 | +2015-01-01T00:00:00Z | P60 | year |
| 0 | 5a0b360ee4b09af898cb6f97 | +2020-01-01T00:00:00Z | P61 | year |
| 1 | 600a4cfcd34e162231fb2635 | +2019-10-01T00:00:00Z | P60 | day |
| 1 | 600a4cfcd34e162231fb2635 | +2024-09-30T00:00:00Z | P61 | day |
| 3 | 5d8e59b2e4b0c4f70d0ccdbc | +2018-02-01T00:00:00Z | P60 | day |
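
The cell above treats any `dateString` that is not four characters long as day precision, so a year-month value such as `2018-02` (if ScienceBase ever supplies one) would be recorded as a full day. A sketch of a stricter conversion follows. `to_wikibase_time` is a hypothetical helper; the precision codes 9, 10 and 11 are the standard Wikibase values for year, month and day; padding missing parts with `01` follows what the original cell produces. The helper checks the shape of the string only, not whether the month and day are valid calendar values.

```python
import re

# Wikibase time precision codes: 9 = year, 10 = month, 11 = day
PRECISION = {'year': 9, 'month': 10, 'day': 11}


def to_wikibase_time(date_string):
    """Convert 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' to (time, precision_code).

    Returns None for anything else, so odd values can be reviewed by hand.
    """
    m = re.fullmatch(r'(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?', date_string.strip())
    if not m:
        return None
    year, month, day = m.groups()
    if day:
        level = 'day'
    elif month:
        level = 'month'
    else:
        level = 'year'
    # Missing month/day are padded with 01, matching the output of the cell above
    time = f"+{year}-{month or '01'}-{day or '01'}T00:00:00Z"
    return time, PRECISION[level]
```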


# Location

In [10]:
casc_project_locations = df_casc_projects[df_casc_projects['spatial'].notnull()][['id','spatial']].reset_index(drop=True)
casc_project_locations['representational_point'] = casc_project_locations['spatial'].apply(lambda x: x['representationalPoint'] if x is not None and 'representationalPoint' in x else None)
casc_project_locations.dropna(inplace=True)
casc_project_locations['longitude'] = casc_project_locations['representational_point'].apply(lambda x: str(x[0]))
casc_project_locations['latitude'] = casc_project_locations['representational_point'].apply(lambda x: str(x[1]))
casc_project_locations.drop(columns=['spatial','representational_point'], inplace=True)
casc_project_locations.head()

|  | id | longitude | latitude |
|---|---|---|---|
| 0 | 5a0b360ee4b09af898cb6f97 | -154.11775899999998 | 61.48683616700001 |
| 1 | 600a4cfcd34e162231fb2635 | 0.5398999999999887 | 3.8402 |
| 2 | 604906a4d34eb120311a9f64 | 0.5398999999999887 | 3.8402 |
| 3 | 5d8e59b2e4b0c4f70d0ccdbc | 1.3997691894473974e-11 | 13.95200538023445 |
| 4 | 5a68f114e4b06e28e9c7234f | -70.17980944074814 | 36.31291226880885 |


# Build Project Entities

In [37]:
geokb_projects_query = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P1 wd:Q15 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

geokb_projects = geokb.sparql_query(geokb_projects_query)
geokb_projects['qid'] = geokb_projects['item'].apply(lambda x: x.split('/')[-1])

In [43]:
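# NOTE: the cells in this section were run out of order. missing_sparql_items is built
# in the In [32] cell further down, which in turn needs old_item_id_reports from the write loop.
item_body_texts = pd.concat([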
    pd.merge(
        left=df_casc_projects[['id','title','body']],
        right=geokb_projects[['qid','itemLabel']].rename(columns={'itemLabel':'title'}),
        how='inner',
        on='title'
    ),
    pd.merge(
        left=df_casc_projects[['id','title','body']],
        right=missing_sparql_items[['qid','sb_id']].rename(columns={'sb_id':'id'}),
        how='inner',
        on='id'
    )
]).drop_duplicates().reset_index(drop=True)

In [53]:
from bs4 import BeautifulSoup
# Strip HTML and keep the non-empty lines; rows without a body get an empty list so the join below cannot fail on None
item_body_texts['body_text'] = item_body_texts['body'].apply(lambda x: [i.replace('\xa0', ' ') for i in BeautifulSoup(x, 'html.parser').get_text().split('\n') if len(i.strip()) > 0] if isinstance(x, str) else [])
item_body_texts['item_talk_text'] = item_body_texts['body_text'].apply(lambda x: '\n'.join(x))
item_body_texts['item_talk_text'] = item_body_texts.apply(lambda x: f"= {x['title']} =\n\n{x['item_talk_text']}", axis=1)

In [32]:
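# NOTE: as written in the write loop, old_item_id_reports entries only hold 'sb_id' and 'error';
# a 'qid' key has to be added to them before the 'qid' column used elsewhere in this notebook exists.
missing_sparql_items = pd.DataFrame([{k:v for k,v in i.items() if k in ['sb_id','qid']} for i in old_item_id_reports])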

missing_projects = casc_projects_to_geokb[
    ~casc_projects_to_geokb['title'].isin(geokb_projects['itemLabel'])
    &
    ~casc_projects_to_geokb['id'].isin(missing_sparql_items['sb_id'])
]

# casc_projects_to_geokb = pd.merge(
#     left=casc_projects_to_geokb,
#     right=df_existing_geokb_items,
#     how='left',
#     on='title'
# )

In [None]:
from wikibaseintegrator.wbi_enums import WikibaseDatePrecision

new_item_id_map = []
old_item_id_reports = []

for index, row in missing_projects.iterrows():
    casc_project_ref = geokb.datatypes.ExternalID(
        prop_nr=geokb.prop_lookup['ScienceBase Item'],
        value=row['id']
    )

    item = geokb.wbi.item.new()

    item.labels.set('en', row['title'])
    item.descriptions.set('en', row['description'])

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['instance of'],
            value='Q15',
            references=[casc_project_ref]   
        ),
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['funder'],
            value=row['casc_org_qid'],
            references=[casc_project_ref]
        ),
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )

    project_dates = casc_project_dates[casc_project_dates['id'] == row['id']]
    for i, r in project_dates.iterrows():
        item.claims.add(
            geokb.datatypes.Time(
                prop_nr=r['predicate'],
                time=r['date'],
                precision=WikibaseDatePrecision.YEAR if r['precision'] == 'year' else WikibaseDatePrecision.DAY,
                references=[casc_project_ref]
            ),
            action_if_exists=geokb.action_if_exists.REPLACE_ALL
        )

    project_location = casc_project_locations[casc_project_locations['id'] == row['id']]
    for i, r in project_location.iterrows():
        item.claims.add(
            geokb.datatypes.GlobeCoordinate(
                prop_nr=geokb.prop_lookup['coordinate location'],
                latitude=float(r['latitude']),
                longitude=float(r['longitude']),
                references=[casc_project_ref]
            ),
            action_if_exists=geokb.action_if_exists.REPLACE_ALL
        )

    project_contacts = linkable_contacts[linkable_contacts['id'] == row['id']]
    contact_claims = []
    for i, r in project_contacts.iterrows():
        contact_claims.append(
            geokb.datatypes.Item(
                prop_nr=r['predicate'],
                value=r['person_qid'],
                references=[casc_project_ref]
            )
        )
    item.claims.add(
        contact_claims,
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )

    # display(item.get_json())
    try:
        response = item.write(
            summary='Importing CASC projects from ScienceBase'
        )
        new_item_id_map.append({
            'qid': response.id,
            'sb_id': row['id']
        })
        print(response.id, row['title'])
    except Exception as e:
        old_item_id_reports.append({
            'sb_id': row['id'],
            'error': str(e)
        })
        print(row['id'], row['title'], str(e))


# Full body content to Item Talk pages

In [56]:
item_body_texts.head()

|  | id | title | body | qid | body_text | item_talk_text |
|---|---|---|---|---|---|---|
| 0 | 5a0b360ee4b09af898cb6f97 | Prioritizing Stream Temperature Data Collectio... | Changes in stream temperature can have signifi... | Q160067 | [Changes in stream temperature can have signif... | = Prioritizing Stream Temperature Data Collect... |
| 1 | 600a4cfcd34e162231fb2635 | Pacific Islands Climate Adaptation Science Cen... | The Pacific Islands Climate Adaptation Science... | Q160072 | [The Pacific Islands Climate Adaptation Scienc... | = Pacific Islands Climate Adaptation Science C... |
| 2 | 604906a4d34eb120311a9f64 | Regional Collaborations | PI-CASC regularly interacts with a diverse and... | Q160073 | [PI-CASC regularly interacts with a diverse an... | = Regional Collaborations =\n\nPI-CASC regular... |
| 3 | 5d8e59b2e4b0c4f70d0ccdbc | A Synthesis of Recent Links Between Climate Ch... | Climate change is already affecting and will c... | Q160074 | [Climate change is already affecting and will ... | = A Synthesis of Recent Links Between Climate ... |
| 4 | 5a68f114e4b06e28e9c7234f | Migration Mismatch: Bird Migration and Phenolo... | There are approximately 2,000 species of migra... | Q160075 | [There are approximately 2,000 species of migr... | = Migration Mismatch: Bird Migration and Pheno... |


In [57]:
for index, row in item_body_texts.iterrows():
    talk_page = geokb.mw_site.pages[f"Item_talk:{row['qid']}"]
    talk_page.save(row['item_talk_text'], summary='Cached full body text from ScienceBase for reference')
    print(row['qid'])

Q160067
Q160072
Q160073
Q160074
Q160075
Q160076
Q160077
Q160078
Q160079
Q160080
Q160081
Q160082
Q160083
Q160084
Q160085
Q160086
Q160087
Q160088
Q160089
Q160090
Q160091
Q160092
Q160093
Q160094
Q160095
Q160096
Q160097
Q160098
Q160099
Q160100
Q160101
Q160102
Q160103
Q160104
Q160105
Q160106
Q160107
Q160108
Q160109
Q160110
Q160111
Q160112
Q160113
Q160114
Q160115
Q160116
Q160117
Q160118
Q160119
Q160120
Q160121
Q160122
Q160123
Q160124
Q160125
Q160126
Q160127
Q160128
Q160129
Q160130
Q160131
Q160132
Q160133
Q160134
Q160135
Q160136
Q160137
Q160138
Q160139
Q160140
Q160141
Q160142
Q160143
Q160144
Q160157
Q160158
Q160159
Q160160
Q160161
Q160162
Q160163
Q160164
Q160165
Q160166
Q160167
Q160168
Q160169
Q160170
Q160171
Q160172
Q160173
Q160174
Q160175
Q160176
Q160177
Q160178
Q160179
Q160180
Q160181
Q160182
Q160183
Q160184
Q160185
Q160186
Q160187
Q160188
Q160189
Q160190
Q160191
Q160192
Q160193
Q160194
Q160195
Q160196
Q160197
Q160198
Q160199
Q160200
Q160201
Q160202
Q160203
Q160204
Q160205
Q160206
Q160207
