Much of the information we are gathering in the iSAID exercise either has unverified or uncertain relationships to named entities that can be linked to a source of definition and meaning, or consists of unstructured chunks of text that need some form of natural language processing before we can identify characteristics of the entities being described. To accomplish these tasks, we are combining rule-based named entity recognition (NER) with machine learning NER. The former is a fairly simplistic process of finding a known named entity within a piece of text, while the latter uses statistical methods with a trained model to identify entities based on where and how they show up in text. Machine learning helps us identify concepts that we don't explicitly start with as known entities. We are bringing these approaches together in experiments toward machine-learning-based entity linking, where our goal is to characterize entities in our knowledge graph with links to other defined entities in our graph.
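To make the rule-based half concrete, here is a minimal sketch using only the standard library. The function name `find_terms`, the example text, and the example terms are all illustrative assumptions and not part of the iSAID code. It matches terms literally, case-insensitively, and on word boundaries, which is somewhat stricter than the raw pattern search used later in this notebook.

```python
import re

def find_terms(text, terms):
    """Return sorted (start, end, term) spans for each known term found in text."""
    spans = []
    for term in terms:
        # Escape the term so it is matched literally; \b limits matches to whole words.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), term))
    return sorted(spans)

# Illustrative inputs only
example_text = "Rising sea level and greenhouse gas emissions affect coastal aquifers."
example_terms = ["sea level", "greenhouse gas", "glacier"]
found = find_terms(example_text, example_terms)
print(found)
```

Note that the `\b` anchors assume each term starts and ends with a word character; terms ending in punctuation such as a closing parenthesis would need different handling.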

We are still in the early phases of this development, examining several different technologies and methods to find the best results to carry into an operational system. In this notebook, we are simply teeing up one example of a training dataset for processing with the Python spaCy NLP package. We build a full corpus of potentially relevant texts that we've already assembled in our graph, tied to specific entities (people, projects, datasets, etc.), and run a simple rule-based process to identify where terms from our reference vocabulary show up in those texts. We then dump the result and process it via the command line to train a very specific NER model that we can run against these and other texts to identify potential entity relationships. We can then confirm those entity relationships against our reference vocabulary and feed the relationships we want to use into the graph.
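The confirmation step is not shown in this notebook, so the following is only a sketch of one way it might look. It assumes vocabulary entries are dictionaries with `label` and `source` keys, as used in the cells below; the function name `confirm_entities` and the sample entries are hypothetical.

```python
def confirm_entities(predicted, vocabulary):
    """Split predicted entity strings into vocabulary matches and unconfirmed leftovers."""
    # Index the vocabulary by lowercased label for case-insensitive lookup.
    index = {}
    for entry in vocabulary:
        index.setdefault(entry["label"].lower(), []).append(entry)

    confirmed, unconfirmed = [], []
    for surface_text in predicted:
        matches = index.get(surface_text.strip().lower())
        if matches:
            confirmed.append((surface_text, matches))
        else:
            unconfirmed.append(surface_text)
    return confirmed, unconfirmed

# Illustrative vocabulary entries and model predictions (not real iSAID data)
sample_vocabulary = [
    {"label": "Greenhouse Gas", "source": "EPA Climate Change Glossary"},
    {"label": "Aerosols", "source": "EPA Climate Change Glossary"},
]
confirmed, unconfirmed = confirm_entities(["greenhouse gas", "permafrost thaw"], sample_vocabulary)
print(confirmed)
print(unconfirmed)
```

Exact label matching is deliberately conservative. Unconfirmed strings are kept rather than discarded, since they may point to concepts missing from the vocabulary.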

We are also experimenting with more robust ways of accomplishing this same task. The most promising is to develop an actual knowledge graph model within spaCy and to build word vector training data with Gensim for entity linking models. Early results from jumpstarting the process with this fairly crude method, using the small EPA Climate Change Glossary, have let us show the potential of the overall NER approach. We are also examining the Amazon Web Services approach to NER using their built-in technologies. Those could offer some operational advantages, as our raw materials will be spun up in other parts of the AWS stack. However, we would like to maintain a certain separation of concerns in the architecture so that key steps like training data selection and processing, training data versioning, model building parameters, individual model-building runs, and entity-linking steps can all be individually maintained and operated in a way that will stand the test of time.
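One lightweight way to keep training data versions and model-building parameters traceable, independent of any particular platform, is to write a small manifest alongside each run. This is a sketch under our own assumptions and not an established part of the iSAID architecture: the function name `build_run_manifest`, the manifest fields, and the parameter names are hypothetical.

```python
import hashlib
import json

def build_run_manifest(training_records, model_params):
    """Describe one model-building run by a content hash of its training data and its parameters."""
    # A stable serialization, so identical data always yields the same hash.
    serialized = json.dumps(training_records, sort_keys=True).encode("utf-8")
    return {
        "training_data_sha256": hashlib.sha256(serialized).hexdigest(),
        "record_count": len(training_records),
        "model_params": model_params,
    }

# Illustrative record in the [text, {"entities": [...]}] shape built later in this notebook
sample_records = [["Greenhouse gas emissions rose.", {"entities": [[0, 14, "Greenhouse Gas"]]}]]
manifest = build_run_manifest(sample_records, {"iterations": 20, "dropout": 0.2})
print(json.dumps(manifest, indent=2))
```

Because the hash depends only on the data content, two runs can be compared by manifest to see whether a change in results came from the data or from the parameters.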

In [1]:
import isaid_helpers
import pickle
from collections import Counter
import ast
import re
import multiprocessing
from joblib import Parallel, delayed
from tqdm import tqdm

In [2]:
with open(isaid_helpers.f_ner_reference, "rb") as f:
    reference_vocabulary = pickle.load(f)
display(Counter(i['source'] for i in reference_vocabulary))

Counter({'Wikidata Mineral Species': 10314,
         'Wikidata Chemical Elements': 667,
         'Wikidata Sedimentary Rocks': 91,
         'Wikidata Clastic Sediments': 7,
         'Wikidata Sovereign States': 1409,
         'Wikidata US States': 50,
         'Wikidata Global Seas and Oceans': 258,
         'Wikidata Global Faults': 3102,
         'Wikidata Global Volcanos': 1548,
         'Wikidata Global Earthquakes': 1500,
         'Wikidata US National Parks': 106,
         'Wikidata US National Monuments': 184,
         'Wikidata US National Forests': 221,
         'Wikidata US Wild and Scenic Rivers': 50,
         'Wikidata Geologic Formations': 9299,
         'Wikidata Aquifers': 27,
         'Wikidata Fields of Science': 457,
         'Wikidata Additional Commodities': 10,
         'Wikidata US Territories': 38,
         'Wikidata US Counties': 3108,
         'EPA Climate Change Glossary': 123,
         'Common geographic areas': 66096,
         'USGS Thesaurus': 1151,
       

In [3]:
%%time
entity_collections = dict()

with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (pr:Project)
    WHERE NOT pr.descriptive_texts IS NULL
    RETURN pr.name AS name, pr.descriptive_texts AS descriptive_texts
    """)
    entity_collections["projects"] = results.data()

with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (w:CreativeWork)
    WHERE NOT w.description IS NULL
    RETURN w.name AS name, w.description AS description
    """)
    entity_collections["creative_works"] = results.data()

with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (d:Dataset)
    WHERE NOT d.description IS NULL
    RETURN d.name AS name, d.description AS description
    """)
    entity_collections["datasets"] = results.data()

with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (t:UndefinedSubjectMatter)
    RETURN t.name AS name, t.description AS description
    """)
    entity_collections["undefined_terms"] = results.data()

with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (l:Location)
    RETURN l.name AS name, l.description AS description
    """)
    entity_collections["locations"] = results.data()

with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (jt:JobTitle)
    RETURN jt.name AS name, jt.description AS description
    """)
    entity_collections["job_titles"] = results.data()

for item in entity_collections.keys():
    print(item, len(entity_collections[item]))

projects 17119
creative_works 20793
datasets 21853
undefined_terms 1524
locations 1468
job_titles 1954
CPU times: user 2.85 s, sys: 321 ms, total: 3.17 s
Wall time: 4.59 s


In [4]:
all_texts = list()
bad_texts = list()

for collection_name, collection_items in entity_collections.items():
    for item in collection_items:
        all_texts.append(item["name"])
        if "description" in item and item["description"]:
            all_texts.append(item["description"])
        if "descriptive_texts" in item and item["descriptive_texts"]:
            try:
                descriptive_texts_list = ast.literal_eval(item["descriptive_texts"])
                if isinstance(descriptive_texts_list, list):
                    all_texts.extend(descriptive_texts_list)
                else:
                    print(type(descriptive_texts_list))
            except Exception:  # text could not be parsed as a Python literal
                bad_texts.append(item["descriptive_texts"])

unique_texts = list()
for text in list(set(all_texts)):
    if not isinstance(text, str):
        continue
    if len(text) < 20:
        continue
    unique_texts.append(text)
    
with open("data/training_texts.p", "wb") as f:
    pickle.dump(unique_texts, f)
print("UNIQUE TEXTS:", len(unique_texts))
print("BAD TEXTS:", len(bad_texts))

UNIQUE TEXTS: 174470
BAD TEXTS: 3


In [5]:
bad_refs = list()

def train_data(text, references, reference_sources=None):
    entities = list()
    refs_found = list()
    
    if reference_sources is not None:
        references = [i for i in references if i["source"] in reference_sources]
    
    for ref in references:
        if ref not in bad_refs:
            start_positions = list()
            try:
                # Escape the label so regex metacharacters match literally and span lengths stay correct
                start_positions.extend([m.start() for m in re.finditer(re.escape(ref["label"]), text)])
            except Exception:
                bad_refs.append(ref)
                continue

            if start_positions:
                refs_found.append(ref)
                for pos in start_positions:
                    entities.append(
                        [
                            pos,
                            pos+len(ref["label"]),
                            ref["concept_label"]
                        ]
                    )
    if entities:
        trainer = [
            text,
            {
                "entities": entities
            }
        ]
        return trainer, refs_found
    else:
        return list(), list()

    
training_data = list()
refs_in_data = list()

def accumulator(text, reference_sources=None):
    training_record, refs_found = train_data(text, reference_vocabulary, reference_sources)
    if training_record:
        training_data.append(training_record)
    if refs_found:
        refs_in_data.extend(refs_found) 
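
The rule-based pass above can emit overlapping spans, for example when one vocabulary label is contained in another. spaCy's NER training expects entity spans within a text not to overlap, so some filtering is likely needed before training. The following is a minimal sketch of one possible policy (keep the longest span, drop anything that overlaps it); the function name `drop_overlapping_spans` and the sample spans are hypothetical.

```python
def drop_overlapping_spans(entities):
    """Keep the longest spans first and discard any span that overlaps one already kept."""
    kept = []
    # Sort by descending length, then by start position for a stable tie-break.
    for start, end, label in sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append([start, end, label])
    return sorted(kept)

# Illustrative spans in the [start, end, label] shape produced by train_data
sample_entities = [
    [0, 7, "Climate"],
    [0, 14, "Climate Change"],
    [20, 23, "Gas"],
]
filtered = drop_overlapping_spans(sample_entities)
print(filtered)
```

Preferring the longest match is only one choice; preferring a particular vocabulary source would be another reasonable policy.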
            

In [6]:
Parallel(n_jobs=8, prefer="threads")(
    delayed(accumulator)
    (
        i, reference_sources=['EPA Climate Change Glossary']
    ) for i in tqdm(unique_texts)
)


100%|██████████| 174470/174470 [20:04<00:00, 144.88it/s]



In [8]:
with open("data/ner_trainer_climate_change_terms.p", "wb") as f:
    pickle.dump(training_data, f)
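
Before handing these records to a training run, a quick sanity check of the spans and label distribution may be useful. This is a sketch only, assuming the `[text, {"entities": [[start, end, label], ...]}]` record shape built above; the function name `summarize_training_data` and the sample records are hypothetical.

```python
from collections import Counter

def summarize_training_data(records):
    """Count entity labels and collect any spans that fall outside their text."""
    label_counts = Counter()
    invalid = []
    for text, annotations in records:
        for start, end, label in annotations["entities"]:
            if 0 <= start < end <= len(text):
                label_counts[label] += 1
            else:
                invalid.append((text, start, end, label))
    return label_counts, invalid

# Illustrative records; the second contains a deliberately out-of-range span
sample_records = [
    ["Greenhouse gas emissions rose.", {"entities": [[0, 14, "Greenhouse Gas"]]}],
    ["Short text", {"entities": [[5, 50, "Weather"]]}],
]
label_counts, invalid = summarize_training_data(sample_records)
print(label_counts)
print(invalid)
```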