# Schema iterative evals
In this notebook I perform heuristic evaluations of OntoGPT's output as I iterate on the schema.

In [1]:
from ontogpt.io.csv_wrapper import parse_yaml_predictions
import pandas as pd
import jsonlines
from os import listdir
from collections import defaultdict
import regex



## Read in data
### Gold standard annotations

In [2]:
with jsonlines.open('../data/gold_standards/pickle_all.jsonl') as reader:
    pickle = [obj for obj in reader]

In [3]:
# Get only the docs that we applied OntoGPT on
evaled_ids = [fname.split('.')[0] for fname in listdir('../data/ontogpt_input/pickle_5/')]
gold = [doc for doc in pickle if doc['doc_key'] in evaled_ids]

### OntoGPT output

In [4]:
to_eval_1 = '../data/ontogpt_output/pickle_5/output.txt'
to_eval_2 = '../data/ontogpt_output/pickle_5_iter2/output.txt'
to_eval_3 = '../data/ontogpt_output/pickle_5_iter3/output.txt'
schema_path = '../knowledge_graph/schema/desiccation.yaml'

## Format OntoGPT output as CSV

In [5]:
ent_df_1, rel_df_1 = parse_yaml_predictions(to_eval_1, schema_path)
ent_df_2, rel_df_2 = parse_yaml_predictions(to_eval_2, schema_path)
ent_df_3, rel_df_3 = parse_yaml_predictions(to_eval_3, schema_path)

100%|██████████| 5/5 [00:00<00:00, 17985.87it/s]
100%|██████████| 5/5 [00:00<00:00, 17008.53it/s]
100%|██████████| 5/5 [00:00<00:00, 24105.20it/s]


In [7]:
ent_df_3.head()

| | id | category | name | provided_by |
|---|---|---|---|---|
| 0 | GO:0004707 | genes | MAPK | 765bee80-fe70-4a64-998c-16412276b343 |
| 1 | AUTO:SIPK | genes | SIPK | 765bee80-fe70-4a64-998c-16412276b343 |
| 2 | AUTO:WIPK | genes | WIPK | 765bee80-fe70-4a64-998c-16412276b343 |
| 3 | AUTO:NahG | genes | NahG | 765bee80-fe70-4a64-998c-16412276b343 |
| 4 | AUTO:salicylic%20acid-induced%20protein%20kinase | proteins | salicylic acid-induced protein kinase | 765bee80-fe70-4a64-998c-16412276b343 |


We need to map the original filenames to the IDs assigned by OntoGPT. Luckily, the order appears to be preserved when the random IDs are assigned, so we can pair the two by position.

In [8]:
def map_doc_keys(ent_df, rel_df, input_dir):
    id_map = {rid: orig_id.split('.')[0] for rid, orig_id in zip(ent_df.provided_by.unique(), listdir(input_dir))}
    ent_df['doc_key'] = ent_df['provided_by'].map(id_map)
    rel_df['doc_key'] = rel_df['provided_by'].map(id_map)
    return ent_df, rel_df
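
Because this mapping relies purely on position, a quick sanity check is worth having. The sketch below uses a hypothetical helper, `build_id_map`, on plain lists with made-up IDs and filenames. It sorts the filenames so the order is deterministic (`os.listdir` returns entries in arbitrary order) and fails loudly if the counts differ. Whether sorted order matches the order in which OntoGPT processed the files is an assumption that would need to be verified against the actual run.

```python
def build_id_map(run_ids, filenames):
    """Pair OntoGPT run IDs with input filenames by position.

    Hypothetical helper, not part of OntoGPT. Assumes run_ids are in
    processing order and that sorting the filenames reproduces it.
    """
    filenames = sorted(filenames)
    if len(run_ids) != len(filenames):
        raise ValueError(
            f"{len(run_ids)} run IDs but {len(filenames)} input files"
        )
    # Drop the extension to recover the doc_key, as in map_doc_keys
    return {rid: fname.split('.')[0] for rid, fname in zip(run_ids, filenames)}


# Made-up IDs and filenames, for illustration only
id_map = build_id_map(
    ['uuid-a', 'uuid-b'],
    ['PMID2_abstract.txt', 'PMID1_abstract.txt'],
)
print(id_map)
# {'uuid-a': 'PMID1_abstract', 'uuid-b': 'PMID2_abstract'}
```

A mismatch in counts usually means a document produced no entities at all, in which case a positional pairing would silently shift every later document.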

In [9]:
ent_df_1, rel_df_1 = map_doc_keys(ent_df_1, rel_df_1, '../data/ontogpt_input/pickle_5/')
ent_df_2, rel_df_2 = map_doc_keys(ent_df_2, rel_df_2, '../data/ontogpt_input/pickle_5/')
ent_df_3, rel_df_3 = map_doc_keys(ent_df_3, rel_df_3, '../data/ontogpt_input/pickle_5/')

## Evaluate entities
We'll look at the predictions and the true entities side by side, and assign a manual heuristic score for each abstract as a way to track improvement.

In [10]:
types_to_keep = [
    'Amino_acid_monomer',
    'Peptide',
    'Protein',
    'Nucleotide',
    'Polynucleotide',
    'DNA',
    'RNA',
    'Organic_compound_other',
    'Inorganic_compound_other',
    'Element',
    'Multicellular_organism',
    'Unicellular_organism',
    'Virus'
]

In [11]:
target_entities = defaultdict(list)
for doc in gold:
    doc_key = doc['doc_key']
    sent_toks = [tok for sent in doc['sentences'] for tok in sent]
    for sent in doc['ner']:
        for ent in sent:
            if ent[2] in types_to_keep:
                ent_text = ' '.join(sent_toks[ent[0]:ent[1]+1])
                target_entities[doc_key].append((ent_text, ent[2]))
target_entities = {k: list(set(v)) for k, v in target_entities.items()}
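
To make the span indexing concrete, here is a minimal sketch on a made-up document laid out like the gold standard above: token lists per sentence, and entity spans given as `[start, end, type]` with document-level, end-inclusive token indices. The content of `toy_doc` is invented for illustration.

```python
# Toy document in the same layout as the gold standard (invented content)
toy_doc = {
    'doc_key': 'toy_abstract',
    'sentences': [
        ['NO', 'activated', 'SIPK', '.'],
        ['Tobacco', 'mosaic', 'virus', 'was', 'used', '.'],
    ],
    'ner': [
        [[0, 0, 'Inorganic_compound_other'], [2, 2, 'Protein']],
        [[4, 6, 'Virus']],
    ],
}

# Flatten the sentences so document-level indices can be used directly
toks = [tok for sent in toy_doc['sentences'] for tok in sent]

# End index is inclusive, hence the +1 when slicing
spans = [
    (' '.join(toks[start:end + 1]), label)
    for sent in toy_doc['ner']
    for start, end, label in sent
]
print(spans)
# [('NO', 'Inorganic_compound_other'), ('SIPK', 'Protein'), ('Tobacco mosaic virus', 'Virus')]
```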

There is [a bug](https://github.com/monarch-initiative/ontogpt/issues/351) in OntoGPT that carries entities from previous documents over into all following documents; let's quickly pre-process here to remove the duplicates.

In [12]:
def process_preds(evaled_ids, ent_df):
    processed_preds = {}
    for i, doc in enumerate(evaled_ids):
        if i == 0:
            processed_preds[doc] = ent_df[ent_df['doc_key'] == doc].name.tolist()
        else:
            current_list = ent_df[ent_df['doc_key'] == doc].name.tolist()
            previous_ents = [
                ent for prev_doc in evaled_ids[:i] for ent in processed_preds[prev_doc]
            ]
            processed_list = current_list[len(previous_ents):]
            processed_preds[doc] = processed_list
    return processed_preds
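
The same slicing logic can be seen on plain lists. This is a minimal sketch with invented data and a hypothetical function name, `strip_carryover`; it assumes, as `process_preds` does, that the carried-over entities always appear as a prefix of each later document's list, in the original order.

```python
def strip_carryover(doc_order, raw_preds):
    """Drop the leading entities that were carried over from earlier docs.

    Hypothetical helper mirroring process_preds on plain lists.
    """
    cleaned = {}
    n_seen = 0  # number of genuine entities found in all earlier docs
    for doc in doc_order:
        cleaned[doc] = raw_preds[doc][n_seen:]
        n_seen += len(cleaned[doc])
    return cleaned


# Invented example: each list repeats everything from the docs before it
raw = {
    'doc_a': ['MAPK', 'SIPK'],
    'doc_b': ['MAPK', 'SIPK', 'NtPat1'],
    'doc_c': ['MAPK', 'SIPK', 'NtPat1', 'patatin', 'PLA2'],
}
cleaned = strip_carryover(['doc_a', 'doc_b', 'doc_c'], raw)
print(cleaned)
# {'doc_a': ['MAPK', 'SIPK'], 'doc_b': ['NtPat1'], 'doc_c': ['patatin', 'PLA2']}
```

Slicing by count rather than filtering by name keeps an entity that legitimately recurs in a later document, which a name-based filter would drop.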

In [13]:
def print_ent_comparison(evaled_ids, processed_preds, target_entities):
    for doc in evaled_ids:
        ents = processed_preds[doc]
        gold_ents = [e[0] for e in target_entities[doc]]
        print('\n\n\nFor doc ', doc, ':')
        print('-----------------------------------------')
        print('PREDICTIONS:')
        print('\n'.join(ents))
        print('-------------------')
        print('GOLD STANDARD:')
        print('\n'.join(gold_ents))
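
As a rough complement to reading the lists side by side, a loose overlap count can flag large changes between iterations. The sketch below is not the scoring used in this notebook; `loose_overlap` is a hypothetical helper that counts a gold entity as matched when it and some prediction contain one another, ignoring case. Substring matching can over-count with very short strings, so treat the number as a hint only.

```python
def loose_overlap(preds, gold_ents):
    """Return (number of gold entities loosely matched, total gold entities).

    Hypothetical heuristic: a gold entity matches if it contains, or is
    contained in, any non-empty prediction (case-insensitive).
    """
    preds_lc = [p.lower() for p in preds if p]
    matched = [
        g for g in gold_ents
        if any(p in g.lower() or g.lower() in p for p in preds_lc)
    ]
    return len(matched), len(gold_ents)


# Small example using a few strings from the first abstract
preds = ['SIPK', 'nitric oxide', 'ethylene']
gold_ents = ['SIPK', 'NO', 'transgenic NahG tobacco', 'nitric oxide']
print(loose_overlap(preds, gold_ents))
# (2, 4)
```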

In [14]:
processed_preds_1 = process_preds(evaled_ids, ent_df_1)
print_ent_comparison(evaled_ids, processed_preds_1, target_entities)




For doc  PMID10707361_abstract :
-----------------------------------------
PREDICTIONS:
MAPK
SIPK
WIPK
NahG
salicylic acid-induced protein kinase
wounding-induced protein kinase
nitric oxide
salicylic acid
ethylene
jasmonic acid
tobacco
-------------------
GOLD STANDARD:
nitric oxide
salicylic acid (SA)-induced protein kinase
NO
tobacco
SIPK
transgenic NahG tobacco
wounding-induced protein kinase
kinases
WIPK
mitogen-activated protein ( MAP ) kinases



For doc  PMID10972869_abstract :
-----------------------------------------
PREDICTIONS:
NtPat1
NtPat2
NtPat3
phospholipase A2 (PLA2)
patatin
12-oxophytodienoic acid
tobacco mosaic virus
NtPat
virus-infected leaves
-------------------
GOLD STANDARD:
NtPat genes
oxylipins
patatin-like phospholipases
PLA2 isoforms
tobacco patatin-like cDNAs
PLA2
NtPat proteins
tobacco mosaic virus
NtPat3
Recombinant NtPat1 and NtPat3 enzymes
membrane lipids
phospholipase A2
NtPat1
unsaturated fatty acids
NtPat2
phosphatidylcholine
patatin



For doc  PM

In [15]:
processed_preds_2 = process_preds(evaled_ids, ent_df_2)
print_ent_comparison(evaled_ids, processed_preds_2, target_entities)




For doc  PMID10707361_abstract :
-----------------------------------------
PREDICTIONS:
MAPK
SIPK
WIPK
NahG
salicylic acid-induced protein kinase
wounding-induced protein kinase
nitric oxide
salicylic acid
ethylene
jasmonic acid
tobacco
-------------------
GOLD STANDARD:
nitric oxide
salicylic acid (SA)-induced protein kinase
NO
tobacco
SIPK
transgenic NahG tobacco
wounding-induced protein kinase
kinases
WIPK
mitogen-activated protein ( MAP ) kinases



For doc  PMID10972869_abstract :
-----------------------------------------
PREDICTIONS:
NtPat1
NtPat2
NtPat3
phospholipase A2 (PLA2)
patatin
12-oxophytodienoic acid
unsaturated fatty acids
phosphatidylcholine
potato
NtPat
patatin-like proteins
plant
-------------------
GOLD STANDARD:
NtPat genes
oxylipins
patatin-like phospholipases
PLA2 isoforms
tobacco patatin-like cDNAs
PLA2
NtPat proteins
tobacco mosaic virus
NtPat3
Recombinant NtPat1 and NtPat3 enzymes
membrane lipids
phospholipase A2
NtPat1
unsaturated fatty acids
NtPat2
phosph

In [16]:
processed_preds_3 = process_preds(evaled_ids, ent_df_3)
print_ent_comparison(evaled_ids, processed_preds_3, target_entities)




For doc  PMID10707361_abstract :
-----------------------------------------
PREDICTIONS:
MAPK
SIPK
WIPK
NahG
salicylic acid-induced protein kinase
wounding-induced protein kinase
nitric oxide
salicylic acid
ethylene
jasmonic acid
tobacco
-------------------
GOLD STANDARD:
nitric oxide
salicylic acid (SA)-induced protein kinase
NO
tobacco
SIPK
transgenic NahG tobacco
wounding-induced protein kinase
kinases
WIPK
mitogen-activated protein ( MAP ) kinases



For doc  PMID10972869_abstract :
-----------------------------------------
PREDICTIONS:
NtPat1
NtPat2
NtPat3
phospholipase A2 (PLA2)
patatin
12-oxophytodienoic acid
unsaturated fatty acids
phosphatidylcholine
potato
NtPat
patatin-like proteins
plant
-------------------
GOLD STANDARD:
NtPat genes
oxylipins
patatin-like phospholipases
PLA2 isoforms
tobacco patatin-like cDNAs
PLA2
NtPat proteins
tobacco mosaic virus
NtPat3
Recombinant NtPat1 and NtPat3 enzymes
membrane lipids
phospholipase A2
NtPat1
unsaturated fatty acids
NtPat2
phosph

Direct comparison of the entities pulled from each of the three versions:

In [22]:
for doc in evaled_ids:
    print('\nOn document', doc)
    print('-----------------------------------------------')
    for n, preds in {'1': processed_preds_1, '2': processed_preds_2, '3': processed_preds_3}.items():
        print('Iteration ' + n + ' entities:')
        print(preds[doc])


On document PMID10707361_abstract
-----------------------------------------------
Iteration 1 entities:
['MAPK', 'SIPK', 'WIPK', 'NahG', 'salicylic acid-induced protein kinase', 'wounding-induced protein kinase', 'nitric oxide', 'salicylic acid', 'ethylene', 'jasmonic acid', 'tobacco']
Iteration 2 entities:
['MAPK', 'SIPK', 'WIPK', 'NahG', 'salicylic acid-induced protein kinase', 'wounding-induced protein kinase', 'nitric oxide', 'salicylic acid', 'ethylene', 'jasmonic acid', 'tobacco']
Iteration 3 entities:
['MAPK', 'SIPK', 'WIPK', 'NahG', 'salicylic acid-induced protein kinase', 'wounding-induced protein kinase', 'nitric oxide', 'salicylic acid', 'ethylene', 'jasmonic acid', 'tobacco']

On document PMID10972869_abstract
-----------------------------------------------
Iteration 1 entities:
['NtPat1', 'NtPat2', 'NtPat3', 'phospholipase A2 (PLA2)', 'patatin', '12-oxophytodienoic acid', 'tobacco mosaic virus', 'NtPat', 'virus-infected leaves']
Iteration 2 entities:
['NtPat1', 'NtPat2', 'NtPa

## Evaluating relations
The relation types we extracted with OntoGPT don't map well onto the PICKLE relations, so we'll just print the text of the abstract here along with the relations from both PICKLE and the predictions to get an idea of how it's doing.

In [17]:
ent_to_name_1 = ent_df_1[['id', 'name']].set_index('id').to_dict()['name']
ent_to_name_2 = ent_df_2[['id', 'name']].set_index('id').to_dict()['name']
ent_to_name_3 = ent_df_3[['id', 'name']].set_index('id').to_dict()['name']

In [18]:
def print_rel_comparisons(gold, ent_to_name, rel_df):
    for doc in gold:
        toks = [tok for sent in doc['sentences'] for tok in sent]
        text = ' '.join(toks)
        gold_rels = []
        for sent in doc['relations']:
            for rel in sent:
                subj = ' '.join(toks[rel[0]:rel[1]+1])
                pred = rel[4]
                obj = ' '.join(toks[rel[2]: rel[3]+1])
                gold_rels.append((subj, pred, obj))
        rels = [(ent_to_name[row.subject], row.predicate, ent_to_name[row.object]) for i, row in rel_df[rel_df['doc_key'] == doc['doc_key']].iterrows()]
        print('\n\nAbstract', doc['doc_key'])
        print(text)
        print('\nGold relations:')
        for rel in gold_rels:
            print(rel)
        print('\nPredicted relations:')
        for rel in rels:
            print(rel)
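
One fragile spot in the lookup above is that `ent_to_name[row.subject]` raises a `KeyError` if a relation refers to an ID that is missing from the entity table. A minimal sketch of a more forgiving lookup, using a hypothetical helper `resolve_triples` that falls back to the raw ID:

```python
def resolve_triples(triples, ent_to_name):
    """Replace subject and object IDs with names, keeping the ID if unknown.

    Hypothetical helper; triples are (subject_id, predicate, object_id).
    """
    return [
        (ent_to_name.get(subj, subj), pred, ent_to_name.get(obj, obj))
        for subj, pred, obj in triples
    ]


# The object ID is deliberately absent from the mapping
names = {'AUTO:SIPK': 'SIPK'}
triples = [('AUTO:SIPK', 'GeneMoleculeInteraction', 'CHEBI:16480')]
print(resolve_triples(triples, names))
# [('SIPK', 'GeneMoleculeInteraction', 'CHEBI:16480')]
```

Leaving the unresolved ID visible in the printout also makes it easy to spot which entities never made it into the entity table.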

In [19]:
print_rel_comparisons(gold, ent_to_name_1, rel_df_1)



Abstract PMID10707361_abstract
In tobacco , two mitogen-activated protein ( MAP ) kinases , designated salicylic acid (SA)-induced protein kinase ( SIPK ) and wounding-induced protein kinase ( WIPK ) are activated in a disease resistance-specific manner following pathogen infection or elicitor treatment . To investigate whether nitric oxide ( NO ) , SA , ethylene , or jasmonic acid ( JA ) are involved in this phenomenon , the ability of these defense signals to activate these kinases was assessed . Both NO and SA activated SIPK ; however , they did not activate WIPK . Additional analyses with transgenic NahG tobacco revealed that SA is required for the NO-mediated induction of SIPK . Neither JA nor ethylene activated SIPK or WIPK . Thus , SIPK may function downstream of SA in the NO signaling pathway for defense responses , while the signals responsible for resistance-associated WIPK activation have yet to be determined .

Gold relations:
('salicylic acid (SA)-induced protein kinase'

In [20]:
print_rel_comparisons(gold, ent_to_name_2, rel_df_2)



Abstract PMID10707361_abstract
In tobacco , two mitogen-activated protein ( MAP ) kinases , designated salicylic acid (SA)-induced protein kinase ( SIPK ) and wounding-induced protein kinase ( WIPK ) are activated in a disease resistance-specific manner following pathogen infection or elicitor treatment . To investigate whether nitric oxide ( NO ) , SA , ethylene , or jasmonic acid ( JA ) are involved in this phenomenon , the ability of these defense signals to activate these kinases was assessed . Both NO and SA activated SIPK ; however , they did not activate WIPK . Additional analyses with transgenic NahG tobacco revealed that SA is required for the NO-mediated induction of SIPK . Neither JA nor ethylene activated SIPK or WIPK . Thus , SIPK may function downstream of SA in the NO signaling pathway for defense responses , while the signals responsible for resistance-associated WIPK activation have yet to be determined .

Gold relations:
('salicylic acid (SA)-induced protein kinase'

In [21]:
print_rel_comparisons(gold, ent_to_name_3, rel_df_3)



Abstract PMID10707361_abstract
In tobacco , two mitogen-activated protein ( MAP ) kinases , designated salicylic acid (SA)-induced protein kinase ( SIPK ) and wounding-induced protein kinase ( WIPK ) are activated in a disease resistance-specific manner following pathogen infection or elicitor treatment . To investigate whether nitric oxide ( NO ) , SA , ethylene , or jasmonic acid ( JA ) are involved in this phenomenon , the ability of these defense signals to activate these kinases was assessed . Both NO and SA activated SIPK ; however , they did not activate WIPK . Additional analyses with transgenic NahG tobacco revealed that SA is required for the NO-mediated induction of SIPK . Neither JA nor ethylene activated SIPK or WIPK . Thus , SIPK may function downstream of SA in the NO signaling pathway for defense responses , while the signals responsible for resistance-associated WIPK activation have yet to be determined .

Gold relations:
('salicylic acid (SA)-induced protein kinase'

Direct comparison of the relations pulled from each of the three versions:

In [24]:
for doc in evaled_ids:
    print('\nOn document', doc)
    print('-----------------------------------------------')
    for n, rel_df in {'1': rel_df_1, '2': rel_df_2, '3': rel_df_3}.items():
        rel_df = rel_df[rel_df['doc_key'] == doc]
        triples = [(row.subject, row.predicate, row.object) for i, row in rel_df.iterrows()]
        print('Iteration ' + n + ' relations:')
        for trip in triples:
            print(trip)


On document PMID10707361_abstract
-----------------------------------------------
Iteration 1 relations:
Iteration 2 relations:
('AUTO:SIPK', 'GeneMoleculeInteraction', 'CHEBI:16480')
('AUTO:SIPK', 'GeneMoleculeInteraction', 'CHEBI:16914')
Iteration 3 relations:
('AUTO:SIPK', 'GeneMoleculeInteraction', 'CHEBI:16480')
('AUTO:SIPK', 'GeneMoleculeInteraction', 'CHEBI:16914')

On document PMID10972869_abstract
-----------------------------------------------
Iteration 1 relations:
('AUTO:NtPat2', 'GeneProteinInteraction', 'PR:000012798')
('AUTO:NtPat', 'GeneOrganismRelationship', 'AUTO:virus-infected%20leaves')
Iteration 2 relations:
('AUTO:NtPat2', 'GeneProteinInteraction', 'PR:000012798')
('AUTO:NtPat', 'GeneOrganismRelationship', 'NCBITaxon:4097')
('AUTO:patatin-like%20proteins', 'ProteinOrganismRelationship', 'AUTO:plant')
Iteration 3 relations:
('AUTO:NtPat2', 'GeneProteinInteraction', 'PR:000012798')
('AUTO:NtPat', 'GeneOrganismRelationship', 'NCBITaxon:4097')
('AUTO:patatin-like%20protein