# Unclassified exploration
We want to understand why so many nodes don't receive a classification. Let's reverse engineer our graph and see what species names, if any, the unclassified abstracts have in common.

In [21]:
import networkx as nx
from tqdm import tqdm
import requests
from math import ceil
from taxonerd import TaxoNERD

In [2]:
graph = nx.read_graphml('../data/citation_network/des_tol_100_classified_07Sept2023.graphml')

In [8]:
noclass_node_ids = []
for n, attrs in graph.nodes(data=True):
    if attrs['study_system'] == 'NOCLASS':
        noclass_node_ids.append(n)

I didn't save the abstracts in the graph, so we re-request them through the Semantic Scholar API:

In [4]:
# Import API key. This must be requested from https://www.semanticscholar.org/product/api#api-key; we save ours in an untracked file in data and import here
import sys
sys.path.append('../data/')
from semantic_scholar_API_key import API_KEY
header = {'x-api-key': API_KEY}

In [19]:
noclass_papers = {}
lost_refs = 0
top_num = int(ceil(len(noclass_node_ids) / 500.0)) * 500
for i in tqdm(range(0, top_num, 500)):
    ids = noclass_node_ids[i: i+500]
    # Pass the API key header defined above; it was built but never sent
    search = requests.post('https://api.semanticscholar.org/graph/v1/paper/batch',
                           params={'fields': 'title,abstract'},
                           headers=header,
                           json={'ids': ids}).json()
    for r in search:
        try:
            noclass_papers[r['paperId']] = r
        except TypeError:
            lost_refs += 1

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:07<00:00,  1.00it/s]
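The 500-ID batching in the cell above can also be expressed as a small helper, which avoids the `ceil` arithmetic. This is only a sketch: `chunked` and `example_ids` are hypothetical names that are not part of the notebook, and the IDs are placeholders standing in for `noclass_node_ids`.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Placeholder IDs standing in for noclass_node_ids (hypothetical data)
example_ids = [f"paper_{i}" for i in range(1203)]
batch_sizes = [len(batch) for batch in chunked(example_ids)]
print(batch_sizes)
```

Each yielded batch could then be passed as the `ids` value in the request body, one request per batch.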


I want to get a sense of where the problem lies. My current hypothesis is that the failure point is entity linking: we do identify entities in these papers, but they get lost when we try to get an NCBI Taxonomy ID for them. Let's test that hypothesis:
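To make the hypothesis concrete, each paper can fall into one of three stages: no entity mentions found at all, mentions found but none linked to a taxonomy ID, or at least one mention linked. The sketch below tallies those stages. It assumes we already have, per paper, a list of mentions and a list of linked IDs; the function name `failure_stage`, the stage labels, and the example records are all hypothetical and only illustrate the bookkeeping, not actual results.

```python
from collections import Counter
from typing import List


def failure_stage(mentions: List[str], linked_ids: List[str]) -> str:
    """Label where a paper drops out of the classification pipeline."""
    if not mentions:
        return "no_mentions"      # NER found nothing to link
    if not linked_ids:
        return "linking_failed"   # mentions exist, but none got an ID
    return "linked"


# Hypothetical records: paper ID -> (NER mentions, taxonomy IDs linked to them)
example_results = {
    "paper_a": (["Arabidopsis thaliana"], ["3702"]),
    "paper_b": (["resurrection plants"], []),
    "paper_c": ([], []),
    "paper_d": (["tardigrades"], []),
}

stage_counts = Counter(
    failure_stage(mentions, ids) for mentions, ids in example_results.values()
)
print(stage_counts)
```

If the hypothesis holds, most unclassified papers should land in the `linking_failed` bucket rather than `no_mentions`.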

In [22]:
taxonerd = TaxoNERD()
nlp = taxonerd.load("en_core_eco_biobert")

In [25]:
classified = []
not_classified = []
for paper, paperdata in tqdm(noclass_papers.items()):
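    # Some papers come back with no abstract (None); fall back to the title alone
    try:
        text = paperdata['title'] + ' ' + paperdata['abstract']
    except TypeError:
        text = paperdata['title']
    ent_df = taxonerd.find_in_text(text)
    # An empty DataFrame means no taxonomic entities were identified
    if ent_df.shape[0] == 0:
        not_classified.append(paper)
    else:
        classified.append(paper)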
print(f'Of the total of {len(noclass_papers)}, {len(classified)} contained identifiable entities, while {len(not_classified)} had no entities identified.')

 19%|████████████████████████████▋                                                                                                                        | 678/3516 [06:36<27:39,  1.71it/s]


KeyboardInterrupt: 