# OntoGPT graph generation
We ran OntoGPT on the desiccation subset of our dataset; here, we'll post-process the output and format it as a graph.

In [1]:
import pandas as pd
from collections import Counter
import networkx as nx
from tqdm import tqdm

## Read in converted OntoGPT output

### Entities

In [2]:
initial_ent_df = pd.read_csv('../../data/kg/ontogpt_slim_ent_df_20May2024.csv')
initial_ent_df.head()

Unnamed: 0,id,category,name,provided_by
0,AUTO:N/A,Protein,,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
1,AUTO:lipid%20peroxidation,Molecule,lipid peroxidation,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
2,CHEBI:16856,Molecule,glutathione,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
3,CHEBI:29191,Molecule,hydroxyl radical,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
4,AUTO:Trichilia%20dregeana,Organism,Trichilia dregeana,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6


There was a bug in OntoGPT at the time of running that caused entities to be needlessly duplicated in the output file. While the bug has since been fixed, it takes several days to run on a dataset of this size, so let's see if we can determine where the entities truly belong with confidence.

The bug causes all entities from the previous document to be prepended to the list of entities in the current document. Because that previous list already contains the entities of every document before it, each document ends up carrying the entities of all earlier documents, so the output grows cumulatively (roughly quadratically in the number of documents). To remedy this, let's drop duplicates, keeping only the first occurrence, which should preserve the correct `provided_by` ID.
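To make the reasoning concrete, here is a minimal sketch of the bug and the fix on toy data. The document IDs and per-document entity lists are made up for illustration, and the simulation of the bug is an assumption based on the description above, not OntoGPT's actual code.

```python
import pandas as pd

# Hypothetical per-document entity lists, in processing order.
true_ents = {
    "doc1": ["CHEBI:16856", "AUTO:lipid%20peroxidation"],
    "doc2": ["CHEBI:22152"],
    "doc3": [],  # a document with no new entities
}

# Simulate the bug: each document's output is prefixed with everything
# emitted for the previous document, so the lists accumulate.
rows, carried = [], []
for doc_id, ents in true_ents.items():
    carried = carried + ents
    rows.extend({"id": e, "provided_by": doc_id} for e in carried)
buggy_df = pd.DataFrame(rows)

# Keeping the first occurrence recovers the document that introduced each entity.
deduped = buggy_df.drop_duplicates(subset=["id"], keep="first", ignore_index=True)
print(len(buggy_df), "rows before,", len(deduped), "rows after")
print(deduped)
```

In this toy case the 8 buggy rows collapse to 3, and `doc3` disappears from `provided_by` because it contributed nothing new. One caveat of this approach: an entity that genuinely recurs in a later document is attributed only to the first document that mentions it.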

In [3]:
print(f'Initial entity df shape was: {initial_ent_df.shape}. There are {len(initial_ent_df.provided_by.unique())} unique document identifiers in the dataset.')

Initial entity df shape was: (36380420, 4). There are 5237 unique document identifiers in the dataset.


In [4]:
ent_df = initial_ent_df.drop_duplicates(subset=['id', 'category', 'name'], keep='first', ignore_index=True)
ent_df.head()

Unnamed: 0,id,category,name,provided_by
0,AUTO:N/A,Protein,,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
1,AUTO:lipid%20peroxidation,Molecule,lipid peroxidation,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
2,CHEBI:16856,Molecule,glutathione,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
3,CHEBI:29191,Molecule,hydroxyl radical,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
4,AUTO:Trichilia%20dregeana,Organism,Trichilia dregeana,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6


In [5]:
print(f'Cleaned entity df shape is: {ent_df.shape}')

Cleaned entity df shape is: (12508, 4)


Check whether we still have the same number of unique `provided_by` identifiers:

In [6]:
print(f'There are {len(ent_df.provided_by.unique())} unique document identifiers in the dataset.')

There are 4144 unique document identifiers in the dataset.


It's reasonable to think that the missing document identifiers are the result of the model not having extracted any new entities for those documents; however, we need to confirm this. Let's get the IDs that don't have entities after deduplication and check that their entities are identical to those of the previous document in the set:

In [7]:
all_ids = list(initial_ent_df.provided_by.unique())

In [8]:
kept_ids = set(ent_df.provided_by.unique())  # build once; set membership is O(1)
missing_ids = [x for x in all_ids if x not in kept_ids]

In [9]:
missing_ids[:5]

['00287dc5-4c7b-4e9b-ae1c-c05e7a86c1ad',
 '636772eb-eb24-4f4f-bfe1-ae89e08fb80f',
 'a2bd0277-2183-42e5-98c0-08b20474cd00',
 'df986ea9-cb5f-4122-aec4-f90d578d7af7',
 '11ade7df-60ad-449d-a0ab-cd754d163cc3']

Note: The following cell is intensive and takes about an hour and a half to run.

In [10]:
is_ident_to_prev = []
for missing_id in tqdm(missing_ids):
    # Position of this document in processing order
    idx = all_ids.index(missing_id)
    current_doc_ents = initial_ent_df[initial_ent_df['provided_by'] == missing_id]
    prev_doc_ents = initial_ent_df[initial_ent_df['provided_by'] == all_ids[idx-1]]
    # Compare the two entity lists, ignoring provided_by and the original row index
    is_eq = current_doc_ents.drop(columns='provided_by').reset_index(drop=True).equals(prev_doc_ents.drop(columns='provided_by').reset_index(drop=True))
    is_ident_to_prev.append(is_eq)

100%|██████████| 1093/1093 [1:21:46<00:00,  4.49s/it]
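Most of the runtime above comes from filtering the full dataframe twice per missing document. A possible faster variant, sketched here on toy data rather than timed on the real dataset, groups the frame once and compares each document's rows to those of the document before it. The toy frame and the names `ents_by_doc` and `same_as_prev` are illustrative only; missing names are filled with empty strings so that they compare as equal, as they do under `DataFrame.equals`.

```python
import pandas as pd

# Toy stand-in for initial_ent_df; the real frame has the same columns.
toy_df = pd.DataFrame({
    "id": ["A", "A", "B", "A", "B"],
    "category": ["Gene", "Gene", "Molecule", "Gene", "Molecule"],
    "name": [None, None, "b", None, "b"],
    "provided_by": ["doc1", "doc2", "doc2", "doc3", "doc3"],
})

cols = ["id", "category", "name"]

# One pass over the frame: collect each document's rows as a list of tuples.
# sort=False keeps documents in order of first appearance.
ents_by_doc = {
    doc_id: list(group[cols].fillna("").itertuples(index=False, name=None))
    for doc_id, group in toy_df.groupby("provided_by", sort=False)
}

# Compare every document with the one processed immediately before it.
doc_order = list(ents_by_doc)
same_as_prev = {
    doc_id: ents_by_doc[doc_id] == ents_by_doc[prev_id]
    for prev_id, doc_id in zip(doc_order, doc_order[1:])
}
print(same_as_prev)
```

With the real data, the check would then reduce to something like `all(same_as_prev[d] for d in missing_ids)`, assuming no missing ID is the very first document.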


In [11]:
print(f'All missing IDs have 0 new entities compared to their previous document: {all(is_ident_to_prev)}')

All missing IDs have 0 new entities compared to their previous document: True


Yay! It worked! We can safely drop the extra entities. All we have to do is make sure to add back the missing document IDs when we calculate the per-doc ratios, as we've been giving docs with no entities or relations a 0 in those analyses.
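One way to add the missing documents back is to reindex the per-document counts against the full list of IDs. This is a minimal sketch on toy data; `all_doc_ids` and `deduped` are hypothetical stand-ins for `all_ids` and `ent_df`, and the downstream ratio calculation itself is not shown in this notebook.

```python
import pandas as pd

# Hypothetical inputs: all document IDs, and the deduplicated entity rows.
all_doc_ids = ["doc1", "doc2", "doc3"]
deduped = pd.DataFrame({
    "id": ["CHEBI:16856", "AUTO:foo", "CHEBI:22152"],
    "provided_by": ["doc1", "doc1", "doc2"],
})

# Count entities per document, then reindex against the full ID list so that
# documents with no surviving entities show up with a count of 0.
ents_per_doc = (
    deduped.groupby("provided_by").size().reindex(all_doc_ids, fill_value=0)
)
print(ents_per_doc.to_dict())
```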

Problematic entities by ID that we can observe just by scrolling through the list: `AUTO:N/A`, `AUTO:Not%20provided`, `AUTO:Not%20mentioned`, `AUTO:None`, `AUTO:Not%20found`, `AUTO:Unknown`, etc. Quantify prevalence and remove:

In [12]:
problem_ents = ['AUTO:N/A', 'AUTO:Not%20provided', 'AUTO:not%20provided', 'AUTO:Not%20mentioned', 'AUTO:None', 'AUTO:Not%20found', 'AUTO:Unknown',
               'AUTO:Not%20mentioned%20in%20the%20text.', 'AUTO:unknown', 'AUTO:none', 'AUTO:Not%20applicable',
               'AUTO:not%20available', 'AUTO:Not%20mentioned.', 'AUTO:No%20information%20provided.',
               'AUTO:Not%20provided%20in%20the%20text.', 'AUTO:Not%20specified', 'AUTO:not%20specified', 'AUTO:unspecified',
               'AUTO:n/a', 'AUTO:not%20applicable', 'AUTO:%5Bnot%20mentioned%5D']
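A hand-collected list like this is sensitive to case and punctuation variants. As an alternative sketch (not what this notebook does), the IDs could be URL-decoded and normalized before matching against a smaller set of phrases. The `PLACEHOLDERS` set and the `is_placeholder` helper are hypothetical names introduced for illustration.

```python
import re
from urllib.parse import unquote

# Hypothetical set of normalized placeholder phrases; extend as needed.
PLACEHOLDERS = {
    "n/a", "none", "unknown", "unspecified",
    "not provided", "not mentioned", "not found", "not applicable",
    "not available", "not specified", "no information provided",
    "not mentioned in the text", "not provided in the text",
}

def is_placeholder(ent_id: str) -> bool:
    """Return True if an AUTO: ID decodes to a known placeholder phrase."""
    if not ent_id.startswith("AUTO:"):
        return False
    label = unquote(ent_id[len("AUTO:"):]).lower()
    # Strip leading brackets/whitespace and trailing brackets/periods/whitespace.
    label = re.sub(r"^[\[\s]+|[\]\.\s]+$", "", label)
    return label in PLACEHOLDERS

for example in ["AUTO:%5Bnot%20mentioned%5D", "AUTO:Not%20provided%20in%20the%20text.",
                "AUTO:lipid%20peroxidation", "CHEBI:16856"]:
    print(example, is_placeholder(example))
```

This flags all 21 IDs in `problem_ents` while leaving real entities alone, but it would still miss phrasings that are not in the set, so scanning the list by eye remains useful.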

In [13]:
entity_counts = Counter(ent_df.id.tolist())
print('Prevalence of problematic entities:')
for ent in problem_ents:
    print(ent, '--', entity_counts[ent])
print(f'Of a total of {len(ent_df)} entities.')

Prevalence of problematic entities:
AUTO:N/A -- 1
AUTO:Not%20provided -- 1
AUTO:not%20provided -- 1
AUTO:Not%20mentioned -- 1
AUTO:None -- 1
AUTO:Not%20found -- 1
AUTO:Unknown -- 1
AUTO:Not%20mentioned%20in%20the%20text. -- 1
AUTO:unknown -- 1
AUTO:none -- 1
AUTO:Not%20applicable -- 1
AUTO:not%20available -- 1
AUTO:Not%20mentioned. -- 1
AUTO:No%20information%20provided. -- 1
AUTO:Not%20provided%20in%20the%20text. -- 1
AUTO:Not%20specified -- 1
AUTO:not%20specified -- 1
AUTO:unspecified -- 1
AUTO:n/a -- 1
AUTO:not%20applicable -- 1
AUTO:%5Bnot%20mentioned%5D -- 1
Of a total of 12508 entities.


Not horrible in absolute numbers, but worse than we might like. Note that each placeholder ID shows a count of 1 here only because duplicates were dropped above; in the dataset before deduplication, all of these hallucinated entities have at least 2,000 instances, even something as seemingly strange as `AUTO:%5Bnot%20mentioned%5D`. Let's clean them:

In [14]:
ent_df_cleaned = ent_df[~ent_df['id'].isin(problem_ents)]
ent_df_cleaned.head()

Unnamed: 0,id,category,name,provided_by
1,AUTO:lipid%20peroxidation,Molecule,lipid peroxidation,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
2,CHEBI:16856,Molecule,glutathione,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
3,CHEBI:29191,Molecule,hydroxyl radical,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
4,AUTO:Trichilia%20dregeana,Organism,Trichilia dregeana,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6
5,CHEBI:22152,Molecule,abscisic acid,623dce16-704e-41a7-b549-22982835bff9


In [15]:
print(f'{len(ent_df) - len(ent_df_cleaned)} total entities were dropped on cleaning ({100*(len(ent_df) - len(ent_df_cleaned))/len(ent_df):.2f}%)')

21 total entities were dropped on cleaning (0.17%)


In [16]:
unique_ents = ent_df.id.unique()
unique_after_cleaning = ent_df_cleaned.id.unique()
unique_dropped = problem_ents
assert len(unique_dropped) == len(unique_ents) - len(unique_after_cleaning)
print(f'This is {len(unique_ents) - len(unique_after_cleaning)} of {len(unique_ents)} unique entities lost ({100*(len(unique_ents) - len(unique_after_cleaning))/len(unique_ents):.2f}%)')

This is 21 of 12508 unique entities lost (0.17%)


Check how many are truly grounded:

In [17]:
clean_ent_counts = Counter(ent_df_cleaned.id.tolist())
# An ID counts as grounded when its prefix is an ontology namespace rather than AUTO
true_grounded = sum(v for k, v in clean_ent_counts.items() if not k.startswith('AUTO:'))
print(f'{100*true_grounded/sum(clean_ent_counts.values()):.2f}% of entities have a true grounding.')

20.99% of entities have a true grounding.
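To see which ontologies the grounded entities come from, the same counting idea can be applied to the ID prefix. This is a small sketch on a hand-picked sample of IDs that appear elsewhere in this notebook, not a result for the full dataset.

```python
from collections import Counter

# Small sample of entity IDs taken from the tables in this notebook.
sample_ids = ["CHEBI:16856", "CHEBI:29191", "AUTO:Trichilia%20dregeana",
              "NCBITaxon:81913", "AUTO:CsGolS1"]

# The prefix before the first colon names the source ontology (or AUTO).
prefix_counts = Counter(ent_id.split(":", 1)[0] for ent_id in sample_ids)
grounded = sum(n for prefix, n in prefix_counts.items() if prefix != "AUTO")

print(prefix_counts.most_common())
print(f"{100 * grounded / len(sample_ids):.2f}% of sample entities are grounded")
```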


### Relations

In [18]:
rel_df = pd.read_csv('../../data/kg/ontogpt_slim_rel_df_20May2024.csv')
rel_df.head()

Unnamed: 0,id,category,provided_by,predicate,subject,object
0,f40835d2-e716-4073-88c5-27eac763fe43,GeneGeneInteraction,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6,GeneGeneInteraction,AUTO:N/A,AUTO:N/A
1,c20331ec-e5aa-4af8-9483-618678e9d6ca,GeneOrganismRelationship,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6,GeneOrganismRelationship,AUTO:N/A,AUTO:N/A
2,d4dd7d59-1118-4577-a870-9a85d7f89f17,ProteinProteinInteraction,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6,ProteinProteinInteraction,AUTO:N/A,AUTO:N/A
3,09f0659d-9ea2-42d6-a331-d076507d012c,ProteinOrganismRelationship,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6,ProteinOrganismRelationship,AUTO:N/A,AUTO:N/A
4,af2466f5-44d4-4224-bf3e-befef9553aa3,GeneMoleculeInteraction,1811d3fb-5fb9-4cbe-aa8b-84b4d4716af6,GeneMoleculeInteraction,AUTO:N/A,AUTO:N/A


We need to get rid of any relations that rely on the problem entities listed above, and quantify the prevalence of both relations with two problem entities and relations with one problem entity.

In [19]:
rel_df_cleaned_just_both = rel_df[~((rel_df['subject'].isin(problem_ents)) & (rel_df['object'].isin(problem_ents)))]
print(f'{len(rel_df) - len(rel_df_cleaned_just_both)} relations were dropped for having both entities be problem entities ({100*(len(rel_df) - len(rel_df_cleaned_just_both))/len(rel_df):.2f}% of initial relations)')

3568 relations were dropped for having both entities be problem entities (44.68% of initial relations)


In [20]:
rel_df_cleaned_problems = rel_df_cleaned_just_both[~((rel_df_cleaned_just_both['subject'].isin(problem_ents)) | (rel_df_cleaned_just_both['object'].isin(problem_ents)))]
print(f'{len(rel_df_cleaned_just_both) - len(rel_df_cleaned_problems)} relations were dropped for having exactly one problem entity ({100*(len(rel_df_cleaned_just_both) - len(rel_df_cleaned_problems))/len(rel_df):.2f}% of initial relations)')

292 relations were dropped for having exactly one problem entity (3.66% of initial relations)


We've also noticed that some of the relations are trivial because they relate an entity to itself; we also want to drop and quantify those.

In [21]:
rel_df_cleaned = rel_df_cleaned_problems[rel_df_cleaned_problems['subject'] != rel_df_cleaned_problems['object']]
print(f'{len(rel_df_cleaned_problems) - len(rel_df_cleaned)} relations were dropped for relating an entity to itself ({100*(len(rel_df_cleaned_problems) - len(rel_df_cleaned))/len(rel_df):.2f}% of initial relations)')

426 relations were dropped for relating an entity to itself (5.34% of initial relations)
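The three filtering steps above can also be tallied in a single pass by counting how many endpoints of each relation are problem entities. This is a sketch on toy relations; `problem_ids` stands in for `problem_ents`, and the rows are invented for illustration.

```python
import pandas as pd

problem_ids = {"AUTO:N/A", "AUTO:None"}  # stand-in for problem_ents
toy_rels = pd.DataFrame({
    "subject": ["AUTO:N/A", "AUTO:PAO", "AUTO:None", "AUTO:PAO", "AUTO:PAO"],
    "object": ["AUTO:N/A", "CHEBI:32875", "CHEBI:17505", "AUTO:PAO", "AUTO:N/A"],
})

# Number of placeholder endpoints per relation: 0, 1 or 2.
n_problem = (
    toy_rels["subject"].isin(problem_ids).astype(int)
    + toy_rels["object"].isin(problem_ids).astype(int)
)
# Self-relations are only counted among relations with no placeholder endpoint,
# mirroring the order of the filters above.
is_self = toy_rels["subject"] == toy_rels["object"]

summary = {
    "both_problem": int((n_problem == 2).sum()),
    "one_problem": int((n_problem == 1).sum()),
    "self_relation": int(((n_problem == 0) & is_self).sum()),
    "kept": int(((n_problem == 0) & ~is_self).sum()),
}
print(summary)
```

The four counts always sum to the number of input relations, which makes for an easy sanity check on the percentages reported above.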


In [22]:
rel_df_cleaned.head()

Unnamed: 0,id,category,provided_by,predicate,subject,object
23,ed9c4fc6-0721-429b-a130-bea4988c41b6,GeneMoleculeInteraction,a1fa005c-4606-4a90-aca3-65f02a03de6e,GeneMoleculeInteraction,AUTO:CsGolS1,CHEBI:17505
25,d4bf6963-8019-4ec4-b452-5e10e692b758,GeneMoleculeInteraction,effe7e16-b4ae-40b1-931c-d7abca0007bd,GeneMoleculeInteraction,AUTO:PAO,CHEBI:32875
26,74517f99-53ca-4e6d-82a8-f92017929a34,ProteinOrganismRelationship,009bf68f-6bb1-49bb-8e16-932db901c55b,ProteinOrganismRelationship,AUTO:phytoferritin,NCBITaxon:81913
33,165766ff-cf5c-432f-a940-d617a8c4b12f,GeneProteinInteraction,419b6f1d-32f7-42cb-86dc-584bba04d1e3,GeneProteinInteraction,AUTO:AtGAMT1,PR:000008052
34,ab84aa56-0da6-4c0e-b28a-d78b43cd7ec9,GeneOrganismRelationship,419b6f1d-32f7-42cb-86dc-584bba04d1e3,GeneOrganismRelationship,AUTO:AtGAMT1,NCBITaxon:4081


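Just over 48% of relations were dropped because they were hallucinated around non-existent entities (44.68% with two placeholder entities plus 3.66% with one), and dropping self-relations brings the total removed to about 54%. Not a great look.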

## Format into a networkx graph
For the sake of keeping it computationally simple, since we know we're not going to use this graph downstream, I'm going to ignore attributes like provided_by.

In [23]:
ents_for_nodes = ent_df_cleaned.drop(columns='provided_by').drop_duplicates().set_index('id').rename(columns={'category': 'ent_type', 'name': 'semantic_name'})

In [24]:
nodes = list(ents_for_nodes.to_dict('index').items())  # (node_id, attribute_dict) pairs

In [25]:
rels_for_edges = rel_df_cleaned.drop(columns=['id', 'provided_by', 'category']).drop_duplicates()

In [26]:
edges = []
for i, row in rels_for_edges.iterrows():
    edge = (row.subject, row.object, {'rel_type': row.predicate})
    edges.append(edge)

In [27]:
ontogpt_graph = nx.Graph()
_ = ontogpt_graph.add_nodes_from(nodes)
_ = ontogpt_graph.add_edges_from(edges)
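One thing to keep in mind with `nx.Graph` is that it is undirected and stores a single edge per pair of nodes, so if two different predicates link the same pair, only the last `rel_type` written is kept. Whether that happens in this dataset is not checked here. The sketch below illustrates the behavior on two invented relations and contrasts it with `nx.MultiDiGraph`.

```python
import networkx as nx

# Two hypothetical relations between the same pair of entities.
toy_edges = [
    ("AUTO:geneA", "AUTO:proteinB", {"rel_type": "GeneProteinInteraction"}),
    ("AUTO:proteinB", "AUTO:geneA", {"rel_type": "ProteinProteinInteraction"}),
]

# Undirected simple graph: the second edge updates the first one's attributes.
simple = nx.Graph()
simple.add_edges_from(toy_edges)

# Directed multigraph: both relations are kept as separate edges.
multi = nx.MultiDiGraph()
multi.add_edges_from(toy_edges)

print(simple.number_of_edges(), multi.number_of_edges())
print(simple["AUTO:geneA"]["AUTO:proteinB"]["rel_type"])
```

Since this graph is not used downstream, the simpler `nx.Graph` is a reasonable choice; a multigraph would only matter if direction or parallel relation types were needed later.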

In [28]:
nx.write_graphml(ontogpt_graph, '../../data/kg/ontogpt_cleaned_graph_22May2024.graphml')

Also save the cleaned entity and relation dataframes:

In [29]:
ent_df_cleaned.to_csv('../../data/kg/ontogpt_slim_ent_df_CLEANED_20May2024.csv', index=False)
rel_df_cleaned.to_csv('../../data/kg/ontogpt_slim_rel_df_CLEANED_20May2024.csv', index=False)