This notebook begins exploring how to connect the institutes and organizations represented in OBIS with the monitoring programs that responded to a survey for the EOV effort. Metadata for contributing institutions is stored in the OBIS database and available through the API. The notebook uses fuzzy string matching to look for connections, but further digging will be needed, both into the datasets themselves and into other ways of finding connections. I pulled candidate names and titles for searching from institutes, dataset titles, and dataset contact organizations, and this route yields only 6 reasonable links.

Several terms in the Darwin Core standard itself may also describe datasets in ways that connect to the names of monitoring programs in regular use:

* catalogNumber (sometimes also used as occurrenceID)
* collectionCode
* locality
* institutionCode (not necessarily connected to the institutions information from metadata)
* datasetName and datasetID (not necessarily the same as dataset titles from metadata)
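
One way to survey these terms is to gather the distinct values that occurrence records carry for each of them, which gives extra name pools for the fuzzy matching used later in this notebook. The sketch below is a minimal illustration under stated assumptions: records are plain dictionaries keyed by Darwin Core term, `collect_term_values` is a hypothetical helper, and the sample records are invented placeholders rather than OBIS data.

```python
from collections import defaultdict

# Darwin Core terms from the list above that may carry program-like names.
CANDIDATE_TERMS = [
    "catalogNumber",
    "collectionCode",
    "locality",
    "institutionCode",
    "datasetName",
    "datasetID",
]


def collect_term_values(records, terms=CANDIDATE_TERMS):
    """Return the distinct, non-empty string values seen for each term."""
    values = defaultdict(set)
    for record in records:
        for term in terms:
            value = record.get(term)
            # Skip missing, null, non-string, and whitespace-only values.
            if isinstance(value, str) and value.strip():
                values[term].add(value.strip())
    return {term: sorted(found) for term, found in values.items()}


# Hypothetical records, for illustration only.
sample_records = [
    {"institutionCode": "EXAMPLE_INST", "collectionCode": "EX_COLL", "locality": None},
    {"institutionCode": "EXAMPLE_INST", "datasetName": "Example monitoring dataset"},
]

term_values = collect_term_values(sample_records)
```

Each list in `term_values` could then be passed as the choices argument to the same matching step used for institute names and dataset titles.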

More distant possibilities include analyzing the email addresses of dataset contacts in the metadata, or even recordedBy values in the data. The global community is not that large, so these could generate leads in some fashion. Such links would be relatively tenuous, though, and perhaps not worth the extra effort.
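
As a rough illustration of the email idea, datasets could be grouped by the domain of their contacts' addresses, on the theory that a shared domain hints at a shared organization. This is only a sketch: it assumes each contact is a dictionary with an `email` key, which has not been verified against the API response, and both helper names and all sample values are hypothetical.

```python
from collections import defaultdict


def email_domain(address):
    """Return the lower-cased domain of an email address, or None if unusable."""
    if not isinstance(address, str) or "@" not in address:
        return None
    domain = address.rsplit("@", 1)[1].strip().lower()
    return domain or None


def datasets_by_contact_domain(datasets):
    """Group dataset ids by the email domains of their contacts."""
    grouped = defaultdict(set)
    for dataset in datasets:
        # 'contacts' may be missing or null, so fall back to an empty list.
        for contact in dataset.get("contacts") or []:
            # ASSUMPTION: contacts expose an 'email' key.
            domain = email_domain(contact.get("email"))
            if domain:
                grouped[domain].add(dataset["id"])
    return {domain: sorted(ids) for domain, ids in grouped.items()}


# Hypothetical datasets, for illustration only.
sample_datasets = [
    {"id": "ds-1", "contacts": [{"email": "A.Person@Example.ORG"}]},
    {"id": "ds-2", "contacts": [{"email": "b.person@example.org"}, {"email": None}]},
    {"id": "ds-3", "contacts": None},
]

domain_groups = datasets_by_contact_domain(sample_datasets)
```

Generic mail providers would need to be filtered out before treating a shared domain as evidence of a shared institution.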

In [1]:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import requests

In [2]:
eovsByPrograms = pd.read_csv('EOVsByProgram.csv')

In [3]:
eovsByPrograms

Unnamed: 0,Program,Full name,Microb,Phyto,Zoo,allFish,TBM,BentInv,Macroalg,Seagrass,Mangrove,Coral
0,AIMS_LTMP,AIMS Long-term Monitoring Program,,,,,,,,,,
1,AMBON,Arctic Marine Biodiversity Observing Network,,1.0,1.0,,1.0,1.0,,,,
2,AMT,Atlantic Meridional Transect,1.0,1.0,1.0,,,,,,,
3,Antares,Antares Marine Monitoring Network,,1.0,,,,,,,,
4,AOOS,Alaska Ocean Observing System,,,,,,,,,,
5,AWI_LTO,AWI long-term observations,,1.0,1.0,,1.0,1.0,,,,
6,BalearesOMN,Balearic Islands ocean observing and monitorin...,,1.0,,1.0,1.0,,,,,
7,BASEcosystems,BAS Ecosystems,,,1.0,,1.0,,,,,
8,BioArgo,Biogeochemical Argo,,1.0,,,,,,,,
9,BioRaTS,Rothera Biology Monitor,,,,,,,,,,


In [4]:
obis_institutes = requests.get("https://api.obis.org/Institute").json()

In [5]:
# Map institute names to OBIS ids, including any child institutes.
institute_dict = {}
for institute in obis_institutes['results']:
    institute_dict[institute['name']] = institute['id']
    if institute['children'] is not None:
        for child_institute in institute['children']:
            institute_dict[child_institute['name']] = child_institute['id']

# The unique institute names form the first pool to match against.
institute_name_list = list(institute_dict)

In [6]:
obis_datasets = requests.get("https://api.obis.org/Dataset").json()

In [7]:
# Map dataset titles to ids and collect the unique contact organization names.
dataset_dict = {}
organization_name_list = []
for dataset in obis_datasets['results']:
    dataset_dict[dataset['title']] = dataset['id']
    if dataset['contacts'] is not None:
        for contact in dataset['contacts']:
            org_name = contact['organization']
            if org_name is not None and org_name not in organization_name_list:
                organization_name_list.append(org_name)

# The unique dataset titles form the second pool to match against.
dataset_title_list = list(dataset_dict)

In [8]:
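# For each program, record the best (name, score) match for its full and
# short names in each of the three name pools.
matching_test = []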
for index,row in eovsByPrograms.iterrows():
    this_test = {
        "program_full_name": row["Full name"],
        "program_short_name": row["Program"]
    }
    this_test["possible_match_institute_full"] = process.extractOne(row["Full name"], institute_name_list)
    this_test["possible_match_dataset_full"] = process.extractOne(row["Full name"], dataset_title_list)
    this_test["possible_match_orgcontact_full"] = process.extractOne(row["Full name"], organization_name_list)

    this_test["possible_match_institute_short"] = process.extractOne(row["Program"], institute_name_list)
    this_test["possible_match_dataset_short"] = process.extractOne(row["Program"], dataset_title_list)
    this_test["possible_match_orgcontact_short"] = process.extractOne(row["Program"], organization_name_list)

    matching_test.append(this_test)


The following keeps any program whose best match in at least one of the name/title pools scores above a threshold, to see whether anything looks like a reasonable match. Even matches with a score of 90 are mostly false positives using this method.

In [11]:
match_threshold = 90
any_matches = [i for i in matching_test 
               if i['possible_match_institute_full'][-1] > match_threshold 
               or i['possible_match_dataset_full'][-1] > match_threshold 
               or i['possible_match_orgcontact_full'][-1] > match_threshold 
               or i['possible_match_institute_short'][-1] > match_threshold 
               or i['possible_match_dataset_short'][-1] > match_threshold 
               or i['possible_match_orgcontact_short'][-1] > match_threshold
              ]
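
The six `or` conditions above could also be expressed generically, which makes it easier to add name pools or try other thresholds. The sketch below assumes each entry has the shape built in the matching loop, with `possible_match_*` keys holding `(name, score)` tuples. `best_scores` and `matches_above` are hypothetical helper names, and skipping `None` is a defensive guard in case a lookup returns no match. The first sample entry reuses two values from the output shown in this notebook; the second is invented.

```python
def best_scores(test):
    """Return {key: score} for every possible_match_* entry that has a match."""
    return {
        key: match[-1]
        for key, match in test.items()
        if key.startswith("possible_match_") and match is not None
    }


def matches_above(tests, threshold):
    """Keep entries where any name pool scored strictly above the threshold."""
    return [
        test
        for test in tests
        if any(score > threshold for score in best_scores(test).values())
    ]


sample_tests = [
    {
        "program_short_name": "CBMP",
        "possible_match_institute_full": ("Circumpolar Biodiversity Monitoring Programme", 100),
        "possible_match_dataset_short": ("OBM", 57),
    },
    {
        # Hypothetical low-scoring entry, for illustration only.
        "program_short_name": "XYZ",
        "possible_match_institute_full": ("Some Institute", 40),
        "possible_match_dataset_short": None,
    },
]

strong_matches = matches_above(sample_tests, 90)
```

Because `best_scores` also reports which pool produced each score, it could help separate exact full-name hits from the noisier short-name matches.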

In [12]:
print(len(any_matches))
display(any_matches)

6


[{'possible_match_dataset_full': ('Victorian Biodiversity Atlas, Victoria, Australia (1900-2017) - marine records',
   86),
  'possible_match_dataset_short': ('OBM', 57),
  'possible_match_institute_full': ('Circumpolar Biodiversity Monitoring Programme',
   100),
  'possible_match_institute_short': ('University of Tasmania, School of Aquaculture, Launceston Campus',
   45),
  'possible_match_orgcontact_full': ('Global Biodiversity Information Facility Netherlands Biodiversity Information Facility (GBIF-NLBIF)',
   86),
  'possible_match_orgcontact_short': ('Helen  Campbell', 68),
  'program_full_name': 'Circumpolar Biodiversity Monitoring Programme',
  'program_short_name': 'CBMP'},
 {'possible_match_dataset_full': ('World Ocean Database 2009', 86),
  'possible_match_dataset_short': ('NCOS', 67),
  'possible_match_institute_full': ('Gulf of Mexico Coastal Ocean Observing System',
   100),
  'possible_match_institute_short': ('U.S. Geological Survey HQ', 54),
  'possible_match_orgconta