The graph representation of entities and relationships involved in USGS work highlights several key entities that we often want to call up and examine in various ways as part of assessing scientific capacity or sharing our work. This notebook explores the use of the graph to build a simplified summary of key entities and their characteristics in a [faceted search](https://en.wikipedia.org/wiki/Faceted_search) context. By entities, we mean people, projects, publications, datasets, and models: five specific types of "scientific assets" that our graph organizes and exposes in more holistic and integrated ways. In the graph, the facets associated with these entities are also housed as graph entities with different labels such as "DefinedSubjectMatter" and "JobTitle." In this context, we treat those as classifying characteristics of the entities we are interested in exploring.

The graph representation of these entities allows us to infer characteristics (facet values) about entities based on indirect relationships. For instance, we infer that a person has something to do with a subject matter associated with the publications or datasets that person has authored, even though the particular term might not be included in the list of words the person has used to describe themselves. In a more sophisticated system we might want to use weighting factors to weigh direct relationships more heavily than indirect relationships or use frequency of occurrence as a qualifier. In our current case, we do not yet have comprehensive enough coverage within the overall information system to make that a particularly valuable pursuit.
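
As a rough illustration of what such weighting could look like, the sketch below scores each term by summing an inverse-distance weight over every path that reaches it, so direct relationships count more and repeated occurrences accumulate. The function name, the `(term, hops)` input shape, the sample terms, and the `1 / hops` weighting are all assumptions made for illustration; none of this is part of the current process.

```python
from collections import defaultdict


def score_facet_terms(observations):
    """Score terms from (term, hops) pairs, where hops is the path length to the term."""
    scores = defaultdict(float)
    for term, hops in observations:
        # Hypothetical weighting: a direct (1-hop) link counts 1.0, a 2-hop link 0.5, etc.
        scores[term] += 1.0 / hops
    # Highest-scoring terms first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


observations = [
    ("hydrology", 1),    # term the person listed directly
    ("hydrology", 2),    # same term reached through an authored dataset
    ("volcanology", 2),  # term reached only through a publication
]
ranked = score_facet_terms(observations)
```

A threshold on the resulting score could then decide which inferred terms are strong enough to show as facet values.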

This notebook runs a series of queries against the graph built through the graph-building codes within this part of the repository to synthesize a set of entities and their classifying characteristics. These results are structured into documents (dicts/json objects) with a few high-level identifying properties and lists/arrays of terms that are themselves named entities in the graph. The process is controlled through a configuration document that determines how we look for relationships in the graph and what level of distance we are willing to accept for inferred relationships.
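
To give a sense of the target shape, here is a hypothetical example of one synthesized document. The property names follow the ones produced later in this notebook, but every value is an invented placeholder, not real data from the graph.

```python
import json

# Hypothetical example of one synthesized document; all values are invented placeholders.
example_document = {
    "identifier_string": "jdoe@example.gov",
    "Entity Type": "Person",
    "name": "Jane Doe",
    "url": "https://example.gov/staff/jdoe",
    "USGS Science Topics": ["groundwater", "water quality"],
    "Locations Addressed": ["Colorado"],
}

# Facet values are plain lists of term names, so the document serializes directly to JSON.
serialized = json.dumps(example_document, indent=2)
facet_fields = [key for key, value in example_document.items() if isinstance(value, list)]
```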

Simplifying the properties themselves is an important part of this process as it allows us to put all entities into a single index with mostly common properties across entity types. This property synthesis process is built into the data building process where we take source information and organize records for incorporation into a common graph. Our focus here is to provide clarity and simplicity on what might otherwise be complex information, making sure that each entity and characteristic brought into the index can be traced to its origin.

Because we are building a search index from this synthesis, we bring in larger descriptive text where it is available and add it to the documents. This allows for that content to be indexed for full text search in addition to the filtering characteristics. A search pattern might start with a set of terms offered for full text search with further refinement through the facets.
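
The search pattern described above can be sketched with a naive in-memory stand-in for a search engine: a text match first, then refinement by facet values. The function `search_documents`, the fields it searches, and the sample documents are assumptions for illustration only; a real search platform handles tokenization, ranking, and facet counts far more capably.

```python
def search_documents(documents, text_query, facet_filters=None):
    """Naive in-memory stand-in for a search engine: text match, then facet refinement."""
    facet_filters = facet_filters or {}
    query = text_query.lower()
    results = []
    for doc in documents:
        # Full text step: look for the query in the name and description fields
        haystack = " ".join(str(doc.get(field) or "") for field in ("name", "description"))
        if query not in haystack.lower():
            continue
        # Facet step: every requested facet value must be in the document's list
        if all(value in (doc.get(facet) or []) for facet, value in facet_filters.items()):
            results.append(doc)
    return results


# Invented sample documents
documents = [
    {"name": "Stream gage network", "description": "Surface water monitoring",
     "Entity Type": "Dataset", "USGS Science Topics": ["surface water"]},
    {"name": "Aquifer study", "description": "Groundwater and surface water exchange",
     "Entity Type": "Project", "USGS Science Topics": ["groundwater"]},
]
hits = search_documents(documents, "surface water", {"USGS Science Topics": "groundwater"})
```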

For the facets themselves, we are focused mostly on terms that we have been able to align with defined linkable references. Much of the focus is on terms from the USGS Thesaurus, and we are using the top-level categorization scheme as our logical grouping of terms. These are generally from cases where an underlying data source points to the USGS Thesaurus explicitly as its source, and our processing has determined those assertions to be valid. In other cases, we are developing the characteristics through a combination of rule-based named entity recognition to build training data for machine learning named entity recognition followed by confirmed entity linking to create DefinedSubjectMatter relationships in the graph. These dynamics are described in other notebooks in this documentation set.

In [1]:
import isaid_helpers
import pandas as pd
import hashlib
import numpy as np
from pylinkedcmd import utilities
import click

The following is a configuration document that controls the relationship or facet gathering part of the process of querying the graph. It specifies the entity we are gathering information for, the node on the other end of the relationship, and the level of distance we allow for relationships to be sought. We also need to specify the particular identifier property that will be used to merge the lists of related entities/terms back to the base level entity. Because it's important to explain what the characteristics mean and where the information comes from to develop inferred relationships, we include a set of descriptive texts pulled in from markdown files for each facet.

In [2]:
facet_gatherer = [
    {
        "facet_name": "mission_areas_person",
        "facet_title": "USGS Mission Areas",
        "entity_type": "Person",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "USGS Mission Areas",
        "relationship_depth": "3",
        "identifier_field": "email",
        "description": isaid_helpers.get_facet_doc("mission_areas_person")
    },
    {
        "facet_name": "mission_areas_project",
        "facet_title": "USGS Mission Areas",
        "entity_type": "Project",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "USGS Mission Areas",
        "relationship_depth": "2",
        "identifier_field": "project_id",
        "description": isaid_helpers.get_facet_doc("mission_areas_project")
    },
    {
        "facet_name": "mission_areas_dataset",
        "facet_title": "USGS Mission Areas",
        "entity_type": "Dataset",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "USGS Mission Areas",
        "relationship_depth": "3",
        "identifier_field": "sdc_internal_id",
        "description": isaid_helpers.get_facet_doc("mission_areas_dataset")
    },
    {
        "facet_name": "regions_person",
        "facet_title": "USGS Regions",
        "entity_type": "Person",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "Interior Unified Regions",
        "relationship_depth": "4",
        "identifier_field": "email"
    },
    {
        "facet_name": "regions_project",
        "facet_title": "USGS Regions",
        "entity_type": "Project",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "Interior Unified Regions",
        "relationship_depth": "2",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "regions_dataset",
        "facet_title": "USGS Regions",
        "entity_type": "Dataset",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "Interior Unified Regions",
        "relationship_depth": "2",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "cost_centers_person",
        "facet_title": "USGS Science Centers",
        "entity_type": "Person",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "USGS Cost Center",
        "relationship_depth": "1",
        "identifier_field": "email"
    },
    {
        "facet_name": "cost_centers_projects",
        "facet_title": "USGS Science Centers",
        "entity_type": "Project",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "USGS Cost Center",
        "relationship_depth": "1",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "cost_centers_dataset",
        "facet_title": "USGS Science Centers",
        "entity_type": "Dataset",
        "related_entity": "Organization",
        "related_entity_filter_prop": "category",
        "related_entity_filter_value": "USGS Cost Center",
        "relationship_depth": "1",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "usgs_science_topics_person",
        "facet_title": "USGS Science Topics",
        "entity_type": "Person",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_SCIENCE_TOPICS",
        "relationship_depth": "2",
        "identifier_field": "email"
    },
    {
        "facet_name": "usgs_science_topics_project",
        "facet_title": "USGS Science Topics",
        "entity_type": "Project",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_SCIENCE_TOPICS",
        "relationship_depth": "1",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "usgs_science_topics_dataset",
        "facet_title": "USGS Science Topics",
        "entity_type": "Dataset",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_SCIENCE_TOPICS",
        "relationship_depth": "1",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "climate_change_terms_person",
        "facet_title": "Climate Change Terms",
        "entity_type": "Person",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "source",
        "related_entity_filter_value": "EPA Climate Change Glossary",
        "relationship_depth": "2",
        "identifier_field": "email"
    },
    {
        "facet_name": "climate_change_terms_project",
        "facet_title": "Climate Change Terms",
        "entity_type": "Project",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "source",
        "related_entity_filter_value": "EPA Climate Change Glossary",
        "relationship_depth": "1",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "climate_change_terms_dataset",
        "facet_title": "Climate Change Terms",
        "entity_type": "Dataset",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "source",
        "related_entity_filter_value": "EPA Climate Change Glossary",
        "relationship_depth": "1",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "science_disciplines_person",
        "facet_title": "Science Disciplines",
        "entity_type": "Person",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_SCIENTIFIC_DISCIPLINES",
        "relationship_depth": "2",
        "identifier_field": "email"
    },
    {
        "facet_name": "science_disciplines_project",
        "facet_title": "Science Disciplines",
        "entity_type": "Project",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_SCIENTIFIC_DISCIPLINES",
        "relationship_depth": "1",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "science_disciplines_dataset",
        "facet_title": "Science Disciplines",
        "entity_type": "Dataset",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_SCIENTIFIC_DISCIPLINES",
        "relationship_depth": "1",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "geologic_time_periods_person",
        "facet_title": "Geologic Time Periods",
        "entity_type": "Person",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_GEOLOGIC_TIME_PERIODS",
        "relationship_depth": "2",
        "identifier_field": "email"
    },
    {
        "facet_name": "geologic_time_periods_project",
        "facet_title": "Geologic Time Periods",
        "entity_type": "Project",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_GEOLOGIC_TIME_PERIODS",
        "relationship_depth": "1",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "geologic_time_periods_dataset",
        "facet_title": "Geologic Time Periods",
        "entity_type": "Dataset",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_GEOLOGIC_TIME_PERIODS",
        "relationship_depth": "1",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "usgs_institutional_activities_person",
        "facet_title": "USGS Institutional Structures",
        "entity_type": "Person",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_INSTITUTIONAL_STRUCTURES_ACTIVITIES",
        "relationship_depth": "2",
        "identifier_field": "email"
    },
    {
        "facet_name": "usgs_institutional_activities_project",
        "facet_title": "USGS Institutional Structures",
        "entity_type": "Project",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_INSTITUTIONAL_STRUCTURES_ACTIVITIES",
        "relationship_depth": "1",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "usgs_institutional_activities_dataset",
        "facet_title": "USGS Institutional Structures",
        "entity_type": "Dataset",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_INSTITUTIONAL_STRUCTURES_ACTIVITIES",
        "relationship_depth": "1",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "usgs_business_categories_person",
        "facet_title": "USGS Business Categories",
        "entity_type": "Person",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_BUSINESS_CATEGORIES",
        "relationship_depth": "2",
        "identifier_field": "email"
    },
    {
        "facet_name": "usgs_business_categories_project",
        "facet_title": "USGS Business Categories",
        "entity_type": "Project",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_BUSINESS_CATEGORIES",
        "relationship_depth": "1",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "usgs_business_categories_dataset",
        "facet_title": "USGS Business Categories",
        "entity_type": "Dataset",
        "related_entity": "DefinedSubjectMatter",
        "related_entity_filter_prop": "concept_label",
        "related_entity_filter_value": "USGS_BUSINESS_CATEGORIES",
        "relationship_depth": "1",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "locations_addressed_person",
        "facet_title": "Locations Addressed",
        "entity_type": "Person",
        "related_entity": "Location",
        "related_entity_filter_prop": None,
        "related_entity_filter_value": None,
        "relationship_depth": "2",
        "identifier_field": "email"
    },
    {
        "facet_name": "locations_addressed_project",
        "facet_title": "Locations Addressed",
        "entity_type": "Project",
        "related_entity": "Location",
        "related_entity_filter_prop": None,
        "related_entity_filter_value": None,
        "relationship_depth": "1",
        "identifier_field": "project_id"
    },
    {
        "facet_name": "locations_addressed_dataset",
        "facet_title": "Locations Addressed",
        "entity_type": "Dataset",
        "related_entity": "Location",
        "related_entity_filter_prop": None,
        "related_entity_filter_value": None,
        "relationship_depth": "1",
        "identifier_field": "sdc_internal_id"
    },
    {
        "facet_name": "job_titles",
        "facet_title": "USGS Job Titles",
        "entity_type": "Person",
        "related_entity": "JobTitle",
        "related_entity_filter_prop": None,
        "related_entity_filter_value": None,
        "relationship_depth": "1",
        "identifier_field": "email"
    },
]


There are likely more sophisticated queries that could gather data from our graph more efficiently, but we are balancing explainability with efficiency in this exercise. We start here by gathering up base level properties for each of the entity types we are bringing into the index. Simplification of properties has already occurred in the graph, and we are pulling most of what has been included for our entities here. We've already made choices at the data building and graph building stages of our process on things like how to specify a URL for as many entities as possible. This gives us something actionable that we can put out for just about every entity we bring together in a synthesis like this. We do make a choice on the string value that we can use as a logical identifier for each entity type and set that as a common property across the entities. We then loop through the facet_gatherer configuration to collect each set of related characteristics tied to that same logical identifier so we can put all the data together.
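
To make the templating step concrete, the sketch below renders the Cypher text for one configuration entry without connecting to the graph. The entry is copied from `facet_gatherer` (minus the description), and the template is a single-line version of the filtered query used in the loop. The variable names `template` and `rendered_query` are just for illustration.

```python
# One entry copied from the facet_gatherer configuration (description omitted)
facet_config = {
    "facet_name": "usgs_science_topics_person",
    "entity_type": "Person",
    "related_entity": "DefinedSubjectMatter",
    "related_entity_filter_prop": "concept_label",
    "related_entity_filter_value": "USGS_SCIENCE_TOPICS",
    "relationship_depth": "2",
    "identifier_field": "email",
}

# Same %-style template used for filtered facets, flattened onto one line for printing
template = (
    "MATCH (n1:%(entity_type)s)-[*..%(relationship_depth)s]-"
    '(n2:%(related_entity)s {%(related_entity_filter_prop)s: "%(related_entity_filter_value)s"}) '
    "WHERE NOT n1.%(identifier_field)s IS NULL "
    "RETURN n1.%(identifier_field)s AS identifier_string, "
    "collect(distinct(n2.name)) AS %(facet_name)s"
)
rendered_query = template % facet_config
print(rendered_query)
```

Printing the rendered text for an entry is a quick way to check a new facet definition before running it against the graph. Because the query is assembled by string substitution, the configuration values should only come from a trusted source such as this notebook.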

In [3]:
%%time
facets = dict()
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    facets["people_master"] = session.run("""
    MATCH (p:Person {active: "True"})
    RETURN labels(p)[0] AS entity_type, 
        p.email AS identifier_string, 
        p.name AS name, 
        p.url AS url, 
        p.description AS description, 
        p.email AS email, 
        p.orcid AS orcid,
        p.source AS source,
        p.image AS image,
        p.active AS active
    """).data()

    facets["projects_master"] = session.run("""
    MATCH (pr:Project)
    RETURN labels(pr)[0] AS entity_type, 
        pr.project_id AS identifier_string, 
        pr.name AS name, 
        pr.descriptive_texts AS descriptive_texts,
        pr.type AS type,
        pr.url AS url,
        pr.status AS status,
        pr.source AS source,
        pr.id_basis_project AS basis_project_number,
        pr.basis_task_number AS basis_task_number,
        pr.basis_subtask_number AS basis_subtask_number
    """).data()

    facets["datasets_master"] = session.run("""
    MATCH (d:Dataset)
    WHERE NOT d.sdc_internal_id IS NULL
    RETURN labels(d)[0] AS entity_type, 
        d.sdc_internal_id AS identifier_string, 
        d.name AS name, 
        d.description AS description, 
        d.url AS url,
        d.year_published AS year_published,
        d.doi AS doi,
        d.source AS source
    """).data()

    for facet_config in facet_gatherer:
        if facet_config["related_entity_filter_prop"] is None:
            facet_search = """
            MATCH (n1:%(entity_type)s)-[*..%(relationship_depth)s]-(n2:%(related_entity)s)
            WHERE NOT n1.%(identifier_field)s IS NULL
            RETURN n1.%(identifier_field)s AS identifier_string, collect(distinct(n2.name)) AS %(facet_name)s
            """ % facet_config
        else:
            facet_search = """
            MATCH (n1:%(entity_type)s)-[*..%(relationship_depth)s]-(n2:%(related_entity)s {%(related_entity_filter_prop)s: "%(related_entity_filter_value)s"})
            WHERE NOT n1.%(identifier_field)s IS NULL
            RETURN n1.%(identifier_field)s AS identifier_string, collect(distinct(n2.name)) AS %(facet_name)s
            """ % facet_config
        
        facet_data = session.run(facet_search).data()

        if facet_config["facet_name"] in facets:
            facets[facet_config["facet_name"]].extend(facet_data)
        else:
            facets[facet_config["facet_name"]] = facet_data
            

CPU times: user 9.96 s, sys: 676 ms, total: 10.6 s
Wall time: 15.8 s


We use Pandas as a reasonable vehicle to efficiently handle merging the results from our graph queries all together and then output that back as a list of dictionaries. Renaming the facet lists to logical titles and adding in a hashed identifier are conveniences used for the particular indexing technology we are feeding the final data into.
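
For readers less familiar with Pandas merges, the following pure-Python sketch shows the same idea on a tiny scale: keep every base record, attach facet lists where the identifier matches, then add the hashed identifier and the renamed properties. The records and values are invented placeholders, and this stands in for the Pandas-based cell only as an explanation of the logic.

```python
import hashlib

# Invented base records and facet query results, keyed by the same logical identifier
base_records = [
    {"entity_type": "Person", "identifier_string": "jdoe@example.gov"},
    {"entity_type": "Project", "identifier_string": "PRJ-001"},
]
facet_rows = [{"identifier_string": "jdoe@example.gov", "job_titles": ["Hydrologist"]}]
facet_lookup = {row["identifier_string"]: row["job_titles"] for row in facet_rows}

documents = []
for record in base_records:
    doc = dict(record)
    # Left-join behavior: keep every base record, use None where no facet values exist
    doc["USGS Job Titles"] = facet_lookup.get(doc["identifier_string"])
    # Stable hashed identifier derived from the logical identifier string
    doc["identifier"] = hashlib.md5(doc["identifier_string"].encode("utf-8")).hexdigest()
    doc["Entity Type"] = doc.pop("entity_type")
    documents.append(doc)
```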

In [4]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat(
    [pd.DataFrame(facets[key]) for key in ("people_master", "projects_master", "datasets_master")],
    ignore_index=True
)

for facet_config in facet_gatherer:
    merge_subset = [
        {
            "identifier_string": i["identifier_string"], 
            facet_config["facet_name"]: i[facet_config["facet_name"]]
        } for i in facets[facet_config["facet_name"]] 
        if "identifier_string" in i and i["identifier_string"] is not None
    ]
    if merge_subset:
        df_merge_in = pd.DataFrame(merge_subset)

        df = df.merge(
            df_merge_in,
            how="left",
            on="identifier_string"
        )

df = df.where(pd.notnull(df), None)
d = df.to_dict(orient="records")
for entity in d:
    entity.update({"identifier": hashlib.md5(entity["identifier_string"].encode('utf-8')).hexdigest()})
    for facet_config in facet_gatherer:
        if facet_config["facet_name"] in entity:
            entity[facet_config["facet_title"]] = entity.pop(facet_config["facet_name"])
    entity["Entity Type"] = entity.pop("entity_type")


I'm currently using an open source search engine technology called Meilisearch that handles reasonably large chunks of documents very efficiently and has a very simple implementation of the faceting concept. Its Python API is also very accessible. However, these final, simplified documents could be sent to many other search platforms such as Elasticsearch for efficient processing and use. I use a utility built for chunking the data up here to send batches of documents to the search client.
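
The chunking idea itself is simple. The generator below is a minimal sketch of the general technique, written under the assumption that the input is a list; the actual `utilities.chunks` function in pylinkedcmd may be implemented differently.

```python
def chunks(items, size):
    """Yield successive slices of at most `size` items from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Twelve placeholder items split into batches of up to five
batches = list(chunks(list(range(12)), 5))
```

Sending documents in batches keeps each request to the search client at a manageable size instead of posting the whole index in one call.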

In [6]:
%%time
search_client = isaid_helpers.get_search_client()

if click.confirm('Do you want to delete existing documents in the index first?', default=False):
    search_client.get_index('entities').delete_all_documents()

for batch in utilities.chunks(d, 5000):
    search_client.get_index('entities').update_documents(batch)

Do you want to delete existing documents in the index first? [y/N]: 
CPU times: user 1.51 s, sys: 1.21 s, total: 2.71 s
Wall time: 1min 29s
