# OpenIE graph generation
In this notebook, we'll use OpenIE to generate triples from our dataset, and use the DyGIE++-derived named entities to filter down the overwhelming number of triples that OpenIE produces.

In [1]:
import jsonlines
import networkx as nx

## Reading in triples
The Stanza servers can be a little finicky, so we used an interactive job with a script to generate the triples. Because there are a large number of triples and filtering them is time-consuming, we also filtered the dataset down to DyGIE++ entities as part of the same job. We'll read in and analyze both the raw triples and the filtered graph here.

In [2]:
with jsonlines.open('../../data/kg/openie_triples_08May23024.jsonl') as reader:
    all_triples = [obj for obj in reader]

In [3]:
print(f'There were {len(all_triples)} triples initially extracted from the dataset by OpenIE.')

There were 644175 triples initially extracted from the dataset by OpenIE.


In [4]:
ents = []
for trip in all_triples:
    ents.append(trip['subject'].lower())
    ents.append(trip['object'].lower())
unique_ents = list(set(ents))
print(f'There were {len(unique_ents)} unique entities in the initial OpenIE extraction.')

There were 323233 unique entities in the initial OpenIE extraction.


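A quick sample of the triples and entities shows us that the quality is extremely low: many triples are near-duplicates that differ only in how much of the object phrase is kept, and many "entities" are sentence fragments rather than named entities: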

In [5]:
all_triples[:5]

[{'subject': 'Studies', 'relation': 'elucidate', 'object': 'basis'},
 {'subject': 'Studies',
  'relation': 'elucidate',
  'object': 'biochemical basis of survival of excised axes of seeds'},
 {'subject': 'Studies',
  'relation': 'elucidate',
  'object': 'biochemical basis of survival of embryonic axes of recalcitrant seeds of Trichilia dregeana'},
 {'subject': 'Studies',
  'relation': 'basis of',
  'object': 'survival of embryonic axes of recalcitrant seeds'},
 {'subject': 'Studies',
  'relation': 'elucidate',
  'object': 'biochemical basis of survival of excised embryonic axes of seeds of Trichilia dregeana'}]
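One rough way to see the redundancy in this sample is to count how many triples share the same subject and relation. The sketch below copies the five triples printed above into a hypothetical `sample` list so that it runs on its own; in the notebook, `all_triples[:5]` would play the same role.

```python
from collections import Counter

# The five triples shown in the output above, copied here so the sketch is self-contained.
sample = [
    {'subject': 'Studies', 'relation': 'elucidate', 'object': 'basis'},
    {'subject': 'Studies', 'relation': 'elucidate',
     'object': 'biochemical basis of survival of excised axes of seeds'},
    {'subject': 'Studies', 'relation': 'elucidate',
     'object': 'biochemical basis of survival of embryonic axes of recalcitrant seeds of Trichilia dregeana'},
    {'subject': 'Studies', 'relation': 'basis of',
     'object': 'survival of embryonic axes of recalcitrant seeds'},
    {'subject': 'Studies', 'relation': 'elucidate',
     'object': 'biochemical basis of survival of excised embryonic axes of seeds of Trichilia dregeana'},
]

# Count triples per (subject, relation) pair; high counts suggest near-duplicate extractions.
pair_counts = Counter((t['subject'], t['relation']) for t in sample)
print(pair_counts.most_common())
```

Here four of the five triples share the pair `('Studies', 'elucidate')` and differ only in the object phrase, which is the pattern that inflates the raw triple count.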

In [6]:
unique_ents[:10]

['study on desiccation related protein in plant floral nectar',
 'percentage',
 'extended to march beyond normal growing seasons',
 'while yeasts depend',
 'maa structure',
 'regulatory interactions at level for expression of network',
 'nodaensis',
 'clearly important driver of bryophyte assemblages in coniferous forests',
 'detach',
 'by associating with different symbionts recognised as mediators with acquisition of traits possible by uptake of novel symbionts']

In [7]:
openie_filtered_graph = nx.read_graphml('../../data/kg/openIE_filtered_graph_08May2024.graphml')

In [8]:
print(f'After exact-match filtering based on DyGIE++ entities, there are {openie_filtered_graph.number_of_edges()} triples in the OpenIE graph.')
print(f'There are {openie_filtered_graph.number_of_nodes()} nodes in the filtered graph.')

After exact-match filtering based on DyGIE++ entities, there are 8156 triples in the OpenIE graph.
There are 6195 nodes in the filtered graph.
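The filtering itself ran in the separate job script, which isn't shown in this notebook. As a minimal sketch of what exact-match filtering could look like, the code below keeps a triple only when both its subject and its object exactly match an allowed entity after lowercasing. The function name `build_filtered_graph`, the both-ends-must-match rule, the lowercasing, the choice of `nx.MultiDiGraph`, and the toy data are all assumptions made for illustration, not the script's actual logic.

```python
import networkx as nx


def build_filtered_graph(triples, allowed_entities):
    """Keep triples whose subject and object both exactly match an allowed entity.

    Hypothetical helper: the matching rule and graph type are assumptions.
    """
    allowed = {ent.lower() for ent in allowed_entities}
    graph = nx.MultiDiGraph()
    for trip in triples:
        subj = trip['subject'].lower()
        obj = trip['object'].lower()
        # Exact match only: substrings and partial overlaps are rejected.
        if subj in allowed and obj in allowed:
            graph.add_edge(subj, obj, relation=trip['relation'])
    return graph


# Made-up toy data, not taken from the dataset.
toy_triples = [
    {'subject': 'Studies', 'relation': 'elucidate', 'object': 'basis'},
    {'subject': 'Trichilia dregeana', 'relation': 'produces', 'object': 'recalcitrant seeds'},
]
toy_entities = ['Trichilia dregeana', 'recalcitrant seeds']

toy_graph = build_filtered_graph(toy_triples, toy_entities)
print(toy_graph.number_of_edges(), toy_graph.number_of_nodes())
```

Under this rule a triple such as `Studies / elucidate / basis` is dropped because neither end is a recognized entity, which is consistent with the large reduction from the raw triple count to the filtered graph.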
