# PubTator Networks
In this notebook we'll explore the creation of networks using grounded PubTator entity annotations.

In [1]:
import pandas as pd
import networkx as nx
from itertools import combinations
from collections import defaultdict
import numpy as np
import taxoniq

## Gene-only co-occurrence network

In [2]:
genes = pd.read_csv('../data/pubtator/gene_only_partial_20Nov2023.csv')
genes.head()

Unnamed: 0,paperId,ann_text,ann_type,db_grounding
0,7bff04897ec52618c4adacb0122cddc455b255e3,Aphelenchus avenae,Species,70226
1,a8713d21e63e63ae733f656067032413b39d41e3,Artemia franciscana,Species,6661
2,a8713d21e63e63ae733f656067032413b39d41e3,Artemia franciscana,Species,6661
3,c2a1a6318d1902752bd59724cb3b6d55eaa842bb,mammalian,Species,9606
4,415502b49e3e0e412408fdf9a4653754ffb68ece,Escherichia coli,Species,562


In [3]:
genes.shape[0], len(genes.db_grounding.unique())

(1284, 189)

In [4]:
genes.ann_type.unique()

array(['Species', 'Genus', 'Strain', 'CellLine'], dtype=object)

We requested `Gene` annotations but got back only `Species`, `Genus`, `Strain`, and `CellLine`, so something is clearly wrong upstream. I'm going to continue with network building using these types anyway. I want database-grounded names for the node labels, and since the `db_grounding` values are NCBI Taxonomy IDs, I'll use taxoniq to look them up.
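
To make the mismatch between requested and returned types explicit, a small check along these lines could be run before building anything. This is only a sketch: `summarize_ann_types` is a hypothetical helper, not part of the notebook, and the values below are toy data rather than the real file. It accepts any iterable of strings, so it could also be called on `genes['ann_type']`.

```python
from collections import Counter


def summarize_ann_types(ann_types, requested="Gene"):
    """Count annotation types and report whether the requested type appears.

    Hypothetical helper for illustration; `requested` defaults to the type
    this notebook asked PubTator for.
    """
    counts = Counter(ann_types)
    return counts, requested in counts


# Toy values only; in the notebook this would be genes['ann_type'].
counts, found = summarize_ann_types(["Species", "Species", "Genus", "CellLine"])
print(counts)
print("requested type present:", found)
```

If `found` is false, as it would be for the types listed above, the problem lies in how the annotations were requested or filtered rather than in the network code.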

In [8]:
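def get_scientific_name(db_ground_num):
    """Return the scientific name for a taxonomy ID, or 'No_grounding' if the lookup fails."""
    try:
        t = taxoniq.Taxon(db_ground_num)
        return t.scientific_name
    except KeyError:
        # IDs that taxoniq cannot resolve get a placeholder label
        return 'No_grounding'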

In [9]:
genes['grounded_name'] = genes['db_grounding'].apply(get_scientific_name)

In [11]:
genes.head()

Unnamed: 0,paperId,ann_text,ann_type,db_grounding,grounded_name
0,7bff04897ec52618c4adacb0122cddc455b255e3,Aphelenchus avenae,Species,70226,Aphelenchus avenae
1,a8713d21e63e63ae733f656067032413b39d41e3,Artemia franciscana,Species,6661,Artemia franciscana
2,a8713d21e63e63ae733f656067032413b39d41e3,Artemia franciscana,Species,6661,Artemia franciscana
3,c2a1a6318d1902752bd59724cb3b6d55eaa842bb,mammalian,Species,9606,Homo sapiens
4,415502b49e3e0e412408fdf9a4653754ffb68ece,Escherichia coli,Species,562,Escherichia coli


In [13]:
genes.groupby('grounded_name').count().loc['No_grounding']

paperId         20
ann_text        20
ann_type        20
db_grounding    20
Name: No_grounding, dtype: int64

In [16]:
genes.groupby('grounded_name').count().head()

grounded_name,paperId,ann_text,ann_type,db_grounding
Acer platanoides,2,2,2,2
Acer pseudoplatanus,2,2,2,2
Acinetobacter baumannii,219,219,219,219
Acinetobacter baumannii A118,2,2,2,2
Acinetobacter baumannii AB5075,17,17,17,17


Only 20 of the 1284 annotations have no grounding, which is encouraging. I'm very suspicious that Acinetobacter baumannii is the most-mentioned species, though. There is clearly something wrong with the annotations here, so this may be an artefact of the bug.
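
One thing worth checking is whether a high count reflects many papers or many repeated mentions in a few papers, since the counts above are per annotation row and the same paper can contribute the same annotation more than once (rows 1 and 2 of `genes.head()` do exactly that). A minimal sketch of counting distinct papers per grounded name follows. `papers_per_name` is a hypothetical helper and the pairs are toy data; on the real table it could be fed `zip(genes.paperId, genes.grounded_name)`.

```python
from collections import defaultdict


def papers_per_name(pairs):
    """Rank names by the number of distinct papers that mention them.

    `pairs` is an iterable of (paper_id, name) tuples. Hypothetical helper
    for illustration only.
    """
    papers = defaultdict(set)
    for paper_id, name in pairs:
        papers[name].add(paper_id)
    # Sort by descending paper count, then alphabetically for stable output
    return sorted(
        ((name, len(ids)) for name, ids in papers.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Toy data: the repeated mention in paper "p1" is counted only once.
toy_pairs = [
    ("p1", "Artemia franciscana"),
    ("p1", "Artemia franciscana"),
    ("p2", "Escherichia coli"),
    ("p3", "Escherichia coli"),
]
ranking = papers_per_name(toy_pairs)
print(ranking)
```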

In [17]:
# A MultiGraph allows parallel edges, so each shared paper gets its own edge
genes_graph = nx.MultiGraph()

# Collect the attribute values for each grounded entity
nodes = {db_ground: defaultdict(list) for db_ground in genes.db_grounding.unique()}
for _, row in genes.iterrows():
    # Use the grounded name rather than the raw annotation text as the label
    nodes[row['db_grounding']]['ann_text'].append(row.grounded_name)
    nodes[row['db_grounding']]['ann_type'].append(row.ann_type)
    nodes[row['db_grounding']]['paperId'].append(row.paperId)

# Collapse each attribute list into one ' | '-separated string of unique values
for node, attrs in nodes.items():
    for name, attr in attrs.items():
        try:
            nodes[node][name] = ' | '.join(set(attr))
        except TypeError:
            # Non-string values cannot be joined, so replace them with 'None'
            changed = [v if isinstance(v, str) else 'None' for v in set(attr)]
            nodes[node][name] = ' | '.join(changed)
nodes_to_add = [(n, attrs) for n, attrs in nodes.items()]
genes_graph.add_nodes_from(nodes_to_add)

# Connect every pair of entities mentioned in the same paper
edge_id = 0
for paper in genes.paperId.unique():
    to_join = genes[genes['paperId'] == paper]
    edges_with_ids = []
    for u, v in combinations(to_join.db_grounding.unique(), 2):
        # The third element is the MultiGraph edge key, unique across the graph
        edges_with_ids.append((u, v, edge_id))
        edge_id += 1
    genes_graph.add_edges_from(edges_with_ids, label=paper)

In [18]:
nx.write_graphml(genes_graph, '../data/kg/genes_but_species_750_co_occurrence_20Nov2023.graphml')
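
The MultiGraph above keeps one edge per paper for each entity pair. If a simple weighted graph turns out to be easier to work with, the same co-occurrence pairs can be collapsed into counts. The sketch below shows that idea with the standard library only. `cooccurrence_weights` is a hypothetical helper and the IDs are toy data, not the notebook's results.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_weights(pairs):
    """Count, for each entity pair, how many papers mention both entities.

    `pairs` is an iterable of (paper_id, entity_id) tuples. Hypothetical
    helper for illustration only.
    """
    entities = defaultdict(set)
    for paper_id, entity in pairs:
        entities[paper_id].add(entity)

    weights = Counter()
    for ents in entities.values():
        # Sort so that (a, b) and (b, a) are counted as the same pair;
        # key=str keeps this working if IDs have mixed types.
        for a, b in combinations(sorted(ents, key=str), 2):
            weights[(a, b)] += 1
    return weights


# Toy data: entities 562 and 9606 co-occur in two papers.
toy = [("p1", 562), ("p1", 9606), ("p2", 562), ("p2", 9606), ("p2", 6661), ("p3", 562)]
weights = cooccurrence_weights(toy)
print(weights)
```

The resulting `(u, v) -> count` mapping could then be loaded into an `nx.Graph` with `add_weighted_edges_from`, at the cost of losing the per-paper edge labels that the MultiGraph preserves.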