# OpenIE graph generation
In this notebook, we'll use OpenIE to generate triples from our dataset, then use the DyGIE++-derived named entities to filter down the overwhelming number of triples that OpenIE produces.

In [7]:
import jsonlines
import networkx as nx
from tqdm import tqdm

## Reading in triples
The stanza servers can be a little finicky, so we generated the triples with a script run in an interactive job.

In [2]:
with jsonlines.open('../../data/kg/openie_triples_08May23024.jsonl') as reader:
    triples = [obj for obj in reader]

In [3]:
print(f'There are {len(triples)} triples extracted from the dataset by OpenIE.')

There are 644175 triples extracted from the dataset by OpenIE.


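A quick sample of the triples shows that the quality is low: a single sentence yields several near-duplicate triples that differ only in how much of the object phrase they include, and the subject is the generic 'Studies':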

In [4]:
triples[:5]

[{'subject': 'Studies', 'relation': 'elucidate', 'object': 'basis'},
 {'subject': 'Studies',
  'relation': 'elucidate',
  'object': 'biochemical basis of survival of excised axes of seeds'},
 {'subject': 'Studies',
  'relation': 'elucidate',
  'object': 'biochemical basis of survival of embryonic axes of recalcitrant seeds of Trichilia dregeana'},
 {'subject': 'Studies',
  'relation': 'basis of',
  'object': 'survival of embryonic axes of recalcitrant seeds'},
 {'subject': 'Studies',
  'relation': 'elucidate',
  'object': 'biochemical basis of survival of excised embryonic axes of seeds of Trichilia dregeana'}]
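
One way to put a number on the redundancy visible above is to count how many triples share the same subject and relation. The snippet below is a minimal sketch that runs on the five triples printed above, copied into a hypothetical `sample` list; running the same count over the full `triples` list would be the obvious next step, but we have not done that here.

```python
from collections import Counter

# The five triples shown above, copied verbatim (`sample` is a hypothetical name)
sample = [
    {'subject': 'Studies', 'relation': 'elucidate', 'object': 'basis'},
    {'subject': 'Studies', 'relation': 'elucidate',
     'object': 'biochemical basis of survival of excised axes of seeds'},
    {'subject': 'Studies', 'relation': 'elucidate',
     'object': 'biochemical basis of survival of embryonic axes of recalcitrant seeds of Trichilia dregeana'},
    {'subject': 'Studies', 'relation': 'basis of',
     'object': 'survival of embryonic axes of recalcitrant seeds'},
    {'subject': 'Studies', 'relation': 'elucidate',
     'object': 'biochemical basis of survival of excised embryonic axes of seeds of Trichilia dregeana'},
]

# Count triples per (subject, relation) pair; large counts suggest that
# OpenIE emitted many object-span variants of one underlying statement.
pair_counts = Counter((t['subject'], t['relation']) for t in sample)

for (subj, rel), n in pair_counts.most_common():
    print(f'{subj!r} --{rel}--> : {n} object variants')
```

In this sample, four of the five triples share the pair `('Studies', 'elucidate')`.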

## Filter triples by DyGIE++ entities
To improve the quality of the OpenIE extraction, we'll only keep triples that have a DyGIE++ entity as both subject and object. We'll start by requiring an exact match; if this proves too stringent, we can use fuzzy matching instead.
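
If exact matching does turn out to be too stringent, one possible fallback is sketched below with the standard library's `difflib.get_close_matches`. This is an illustration under assumptions rather than what this notebook runs: the helper `match_entity`, the `cutoff` value of 0.85, and the toy entity set are all hypothetical.

```python
from difflib import get_close_matches

def match_entity(phrase, entities, cutoff=0.85):
    """Return the entity closest to `phrase`, or None if none clears `cutoff`.

    Hypothetical helper: tries an exact (lowercased) match first and only
    falls back to difflib's similarity ratio when that fails.
    """
    phrase = phrase.lower()
    if phrase in entities:
        return phrase
    candidates = get_close_matches(phrase, sorted(entities), n=1, cutoff=cutoff)
    return candidates[0] if candidates else None

# Toy entity set for illustration only (not the real DyGIE++ entities)
toy_ents = {'trichilia dregeana', 'embryonic axes', 'recalcitrant seeds'}

print(match_entity('Embryonic axis', toy_ents))  # near miss on the plural
print(match_entity('Studies', toy_ents))         # nothing similar enough
```

Note that `get_close_matches` compares the phrase against every candidate, so applying it to each of the triples above against the full entity list would likely be slow; it may be worth fuzzy-matching only the phrases that fail the exact test.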

In [5]:
dygie_co_graph = nx.read_graphml('../../data/kg/all_drought_dt_co_occurrence_graph_02May2024.graphml')
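# Use a set: the filter below does two membership tests per triple,
# and each test on a list scans the whole list
dygiepp_ents = set(dygie_co_graph.nodes())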

In [None]:
openie_nodes = set()
openie_edges = set()
for trip in tqdm(triples):
    # Lowercase once per triple rather than on every use
    subj, obj = trip['subject'].lower(), trip['object'].lower()
    # Keep the triple only if both endpoints exactly match a DyGIE++ entity
    if subj in dygiepp_ents and obj in dygiepp_ents:
        openie_nodes.update((subj, obj))
        openie_edges.add((subj, obj))
openie_nodes = list(openie_nodes)
openie_edges = list(openie_edges)

 16%|█▌        | 102721/644175 [28:49<2:18:16, 65.26it/s]

In [None]:
print(f'There are {len(openie_nodes)} nodes and {len(openie_edges)} edges after filtering by entity.')

In [None]:
openie_graph = nx.Graph()
_ = openie_graph.add_nodes_from(openie_nodes)
_ = openie_graph.add_edges_from(openie_edges)