# Citation network quality control

Quality checks:
* Make sure all nodes are unique (use papers with overlapping citations as a check)
* Manually crosscheck nodes with no connections to make sure they actually have no citations
* Manually observe random samples of NOCLASS nodes to identify potential routes for coverage improvement

In [1]:
import jsonlines
from collections import defaultdict, Counter
import networkx as nx
import requests
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import json
import regex

In [2]:
# Import API key. This must be requested from https://www.semanticscholar.org/product/api#api-key; we save ours in an untracked file in data and import here
import sys
sys.path.append('../data/')
from semantic_scholar_API_key import API_KEY
header = {'x-api-key': API_KEY}

## Read in Data

In [3]:
with jsonlines.open('../data/semantic_scholar/desiccation_tolerance_10000_with_fixed_rev_cit_23Oct2023.jsonl') as reader:
    papers = []
    for obj in reader:
        papers.append(obj)

In [4]:
graph = nx.read_graphml('../data/citation_network/full_10000_with_classification_19Sep2023.graphml')

## Assert that all nodes are unique
### Check that there are no identical paperIds:

In [5]:
all_paperIds = []
for paper in papers:
    all_paperIds.append(paper['paperId'])
    for ref in paper['references']:
        all_paperIds.append(ref['paperId'])

In [6]:
len(all_paperIds), len(set(all_paperIds))

(451060, 193243)

That is a good first indication: the number of nodes in our network corresponds to the number of unique paperIds, meaning we have no duplicates. I've been assuming paperIds are unique; in case they for some reason aren't, let's do the same thing for paper titles:

In [7]:
all_titles = []
for paper in papers:
    all_titles.append(paper['title'])
    for ref in paper['references']:
        all_titles.append(ref['title'])

By counting how many times a given title appears, we can estimate how many times a given paper was cited (a paper that is also one of the main search results contributes its own title once):

In [8]:
cite_count = Counter(all_titles)

In [9]:
max(cite_count, key=cite_count.get), max(cite_count.values())

('THE MOLECULAR BASIS OF DEHYDRATION TOLERANCE IN PLANTS.', 355)

Let's make sure that there is only the number of unique titles we expect:

In [10]:
len(all_titles), len(set(all_titles))

(451060, 219505)

There are more unique titles than paperIds, which seems counterintuitive. Let's take a look at paperIds that have multiple unique titles:

In [11]:
paperId_to_titles = defaultdict(list)
for paper in papers:
    paperId_to_titles[paper['paperId']].append(paper['title'])
    for ref in paper['references']:
        paperId_to_titles[ref['paperId']].append(ref['title'])

In [12]:
len(paperId_to_titles)

193243

In [13]:
more_than_one = {pid: v for pid, v in paperId_to_titles.items() if len(set(v)) > 1}

In [14]:
len(more_than_one)

1

There is only one paperId that has more than one unique title; what's going on there?

In [15]:
more_than_one[None]

['Role of sucrose and raffinose in the desiccation sensitivity of seeds',
 'Desiccation tolerance in resurrection plants: new insights from transcriptome, proteome and metabolome analysis',
 'The glassy state and survival of anhydrous biological systems',
 'Intertwined signatures of desiccaw',
 '2014.Klebsormidium flaccidum genome',
 '2002.An oxidative burst of hydrogen peroxide during rehydration',
 '1982.Desiccation tolerance. In Encyclopedia of Plant Physiology: Physiological Plant seed biology',
 'Disruption of photosynthesis and respiration during desiccation and rehydration generates reactive oxygen species (ROS) that are the target of several protection mechanisms in tolerant cells',
 'Metabolic responses to desiccation provide protection from both mechanical and chemical stresses',
 'Biomolecules 8, 1–14',
 'Mechanisms of Desiccation',
 'Dauer. WormBook 1–19',
 'Regulation of embryonic diapause',
 'The use of the brine shrimp Artemia in aquaculture',
 'What do LEA proteins do? 

In [16]:
extras = len(more_than_one[None])

In [17]:
extras + 192506

223834

It looks like these are all the misformatted citations that become their own elements and don't have a paperId, which makes a lot of sense. There are slightly more of these than the difference between the number of unique paperIds and the number of unique titles; however, I would expect some of them to match actual paper titles in the dataset, in which case they are already counted.
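One way to make this check explicit is to separate entries that carry a paperId from those that do not before counting. The following is a minimal sketch, assuming each record has the `paperId`, `title`, and `references` keys used above; `split_by_id` is a hypothetical helper name and the records are toy data, not entries from the dataset.

```python
from collections import defaultdict


def split_by_id(papers):
    """Group titles by paperId and collect titles whose paperId is None."""
    titles_by_id = defaultdict(set)
    unresolved_titles = []
    for paper in papers:
        # Each record and each of its references is one (paperId, title) pair
        for entry in [paper] + list(paper.get('references', [])):
            if entry['paperId'] is None:
                unresolved_titles.append(entry['title'])
            else:
                titles_by_id[entry['paperId']].add(entry['title'])
    return titles_by_id, unresolved_titles


# Toy records in the same shape as the JSONL entries (illustrative only)
toy_papers = [
    {'paperId': 'a', 'title': 'Paper A', 'references': [
        {'paperId': 'b', 'title': 'Paper B'},
        {'paperId': None, 'title': 'Unparsed citation 1'},
    ]},
    {'paperId': 'b', 'title': 'Paper B', 'references': [
        {'paperId': None, 'title': 'Unparsed citation 2'},
    ]},
]

titles_by_id, unresolved_titles = split_by_id(toy_papers)
# paperIds that map to more than one distinct title, with None excluded
conflicting_ids = {pid: t for pid, t in titles_by_id.items() if len(t) > 1}
```

With the `None` entries set aside, any paperId left in `conflicting_ids` would be a genuine conflict rather than a side effect of unparsed citations.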

### Use papers with overlapping citations to make sure nodes coalesce
Paper 1: [Desiccation Tolerance: Avoiding Cellular Damage During Drying and Rehydration](https://www.annualreviews.org/doi/10.1146/annurev-arplant-071219-105542?url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&rfr_dat=cr_pub++0pubmed#_i20)

Paper 2: [Unexplored dimensions of variability in vegetative desiccation tolerance](https://bsapubs.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/ajb2.1588)

Overlapping citations (manually identified):
```
Alpert P, Oliver MJ. 2002. Drying without dying. In Desiccation and Survival in Plants: Drying Without Dying, ed. M Black, HW Pritchard, pp. 3–43. Wallingford, UK: CABI


Bewley JD. 1979. Physiological aspects of desiccation tolerance. Annu. Rev. Plant Physiol. 30:195–238


Costa M-CD, Artur MAS, Maia J, Jonkheer E, Derks MF, et al. 2017. A footprint of desiccation tolerance in the genome of Xerophyta viscosa. Nat. Plants 3:17038


Farrant JM, Lehner A, Cooper K, Wiswedel S. 2009. Desiccation tolerance in the vegetative tissues of the fern Mohria caffrorum is seasonally regulated. Plant J. 57:65–79


Gaff DF, Oliver MJ. 2013. The evolution of desiccation tolerance in angiosperm plants: a rare yet common phenomenon. Funct. Plant Biol. 40:315–28

Khandelwal A, Cho SH, Marella H, Sakata Y, Perroud P-F, et al. 2010. Role of ABA and ABI3 in desiccation tolerance. Science 327:546


Liu X, Challabathula D, Quan W, Bartels D. 2019. Transcriptional and metabolic changes in the desiccation tolerant plant Craterostigma plantagineum during recurrent exposures to dehydration. Planta 249:1017–35

Moore JP, Farrant JM. 2012. A systems‐based molecular biology analysis of resurrection plants for crop and forage improvement in arid environments. In Improving Crop Resistance to Abiotic Stress, ed. N Tuteja, SS Gill, AF Tiburcio, R Tuteja, pp. 399–418. Hoboken, NJ: Wiley

Oliver MJ, Guo L, Alexander DC, Ryals JA, Wone BWM, Cushman JC. 2011. A sister group contrast using untargeted global metabolomic analysis delineates the biochemical regulation underlying desiccation tolerance in Sporobolus stapfianus. Plant Cell 23:1231–48


Oliver MJ, O'Mahony P, Wood AJ. 1998. “To dryness and beyond” – preparation for the dried state and rehydration in vegetative desiccation-tolerant plants. Plant Growth Regul. 24:193–201


Oliver MJ, Velten J, Mishler BD. 2005. Desiccation tolerance in bryophytes: a reflection of the primitive strategy for plant survival in dehydrating habitats? Integr. Comp. Biol. 45:788–99


Proctor MCF, Oliver MJ, Wood AJ, Alpert P, Stark LR, et al. 2007. Desiccation-tolerance in bryophytes: a review. Bryologist 110:595–622 A comprehensive look at desiccation tolerance in bryophytes.


Stark LR. 2017. Ecology of desiccation tolerance in bryophytes: a conceptual framework and methodology. Bryologist 120:130–66

Suguiyama VF, da Silva EA, Meirelles ST, Centeno DDC, Braga MR. 2014. Leaf metabolite profile of the Brazilian resurrection plant Barbacenia purpurea Hook. (Velloziaceae) shows two time-dependent responses during desiccation and recovering. Front. Plant Sci. 5:96

VanBuren R, Pardo J, Wai CM, Evans S, Bartels D. 2019. Massive tandem proliferation of ELIPs supports convergent evolution of desiccation tolerance across land plants. Plant Physiol. 179:1040–49

VanBuren R, Wai CM, Pardo J, Giarola V, Ambrosini S, et al. 2018. Desiccation tolerance evolved through gene duplication and network rewiring in Lindernia. Plant Cell 30:2943–58

Williams B, Njaci I, Moghaddam L, Long H, Dickman MB, et al. 2015. Trehalose accumulation triggers autophagy during plant desiccation. PLOS Genet. 11:e1005705

Yobi A, Schlauch KA, Tillet RL, Yim WC, Espinoza C, et al. 2017. Sporobolus stapfianus: insights into desiccation tolerance in the resurrection grasses from linking transcriptomics to metabolomics. BMC Plant Biol. 17:67

Yobi A, Wone BWM, Xu W, Alexander DC, Guo L, et al. 2012. Comparative metabolic profiling between desiccation‐sensitive and desiccation‐tolerant species of Selaginella reveals insights into the resurrection trait. Plant J. 72:983–99

Zhang Q, Bartels D. 2018. Molecular responses to dehydration and desiccation in desiccation-tolerant angiosperm plants. J. Exp. Bot. 69:3211–22


Zhu Y, Wang B, Phillips J, Zhang ZN, Du H, et al. 2015. Global transcriptome analysis reveals acclimation-primed processes involved in the acquisition of desiccation tolerance in Boea hygrometrica. Plant Cell Physiol. 56:1429–41
```
There are 21 overlapping citations, so the total number of unique citations across these two papers should be the sum of their reference counts minus 21.
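The expectation is inclusion-exclusion on two sets: the size of the union equals the two sizes added together minus the size of the intersection. A small sketch with toy identifiers (not real paperIds) shows the arithmetic; `expected_unique` is a hypothetical helper.

```python
def expected_unique(n_refs_1, n_refs_2, n_shared):
    """Size of the union of two reference lists that share n_shared entries."""
    return n_refs_1 + n_refs_2 - n_shared


# Toy reference sets standing in for the two papers' reference lists
refs_1 = {'r1', 'r2', 'r3', 'r4'}
refs_2 = {'r3', 'r4', 'r5'}

shared = refs_1 & refs_2             # references cited by both papers
observed_unique = len(refs_1 | refs_2)
predicted_unique = expected_unique(len(refs_1), len(refs_2), len(shared))
```

If the observed and predicted counts disagree on the real data, either some shared references did not coalesce onto one paperId or some references are missing from one of the lists.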

In [18]:
paper1_title = 'Desiccation Tolerance: Avoiding Cellular Damage During Drying and Rehydration'
paper2_title = 'Unexplored dimensions of variability in vegetative desiccation tolerance.'

In [19]:
for paper in papers:
    if paper1_title in paper['title']:
        paper1 = paper
    elif paper2_title in paper['title']:
        paper2 = paper

In [20]:
paper1

{'paperId': 'c9695c29c051499f52754f4657d9d559b9898d1a',
 'title': 'Desiccation Tolerance: Avoiding Cellular Damage During Drying and Rehydration.',
 'abstract': 'Desiccation of plants is often lethal but is tolerated by the majority of seeds and by vegetative tissues of only a small number of land plants. Desiccation tolerance is an ancient trait, lost from vegetative tissues following the appearance of tracheids but reappearing in several lineages when selection pressures favored its evolution. Cells of all desiccation-tolerant plants and seeds must possess a core set of mechanisms to protect them from desiccation- and rehydration-induced damage. This review explores how desiccation generates cell damage and how tolerant cells assuage the complex array of mechanical, structural, metabolic, and chemical stresses and survive. Likewise, the stress of rehydration requires appropriate mitigating cellular responses. We also explore what comparative genomics, both structural and responsive, 

In [21]:
paper2

{'paperId': '393cc126bd647a8435072e788a2a033561c6fa97',
 'title': 'Unexplored dimensions of variability in vegetative desiccation tolerance.',
 'abstract': 'Desiccation tolerance has evolved recurrently across diverse land plant lineages as an adaptation for survival in regions where seasonal rainfall drives periodic drying of vegetative tissues. Growing interest in this phenomenon has fueled recent physiological, biochemical, and genomic insights into the mechanistic basis of desiccation tolerance. Although, desiccation tolerance is often viewed as binary and monolithic, substantial variation exists in the phenotype and underlying mechanisms across diverse lineages, heterogeneous populations, and throughout the development of individual plants. Most studies have focused on conserved responses in a subset desiccation-tolerant plants under laboratory conditions. Consequently, the variability and natural diversity of desiccation-tolerant phenotypes remains largely uncharacterized. Here, 

In [22]:
reference_titles = [
    'Drying without dying',
    'Physiological aspects of desiccation tolerance',
    'A footprint of desiccation tolerance in the genome of Xerophyta viscosa',
    'Desiccation tolerance in the vegetative tissues of the fern Mohria caffrorum is seasonally regulated',
    'The evolution of desiccation tolerance in angiosperm plants: a rare yet common phenomenon',
    'Role of ABA and ABI3 in desiccation tolerance',
    'Transcriptional and metabolic changes in the desiccation tolerant plant Craterostigma plantagineum during recurrent exposures to dehydration',
    'A systems‐based molecular biology analysis of resurrection plants for crop and forage improvement in arid environments',
    'A sister group contrast using untargeted global metabolomic analysis delineates the biochemical regulation underlying desiccation tolerance in Sporobolus stapfianus',
    '“To dryness and beyond” – preparation for the dried state and rehydration in vegetative desiccation-tolerant plants',
    'Desiccation tolerance in bryophytes: a reflection of the primitive strategy for plant survival in dehydrating habitats?',
    'Desiccation-tolerance in bryophytes: a review',
    'Ecology of desiccation tolerance in bryophytes: a conceptual framework and methodology',
    'Leaf metabolite profile of the Brazilian resurrection plant Barbacenia purpurea Hook',
    'Massive tandem proliferation of ELIPs supports convergent evolution of desiccation tolerance across land plants',
    'Desiccation tolerance evolved through gene duplication and network rewiring in Lindernia',
    'Trehalose accumulation triggers autophagy during plant desiccation',
    'Sporobolus stapfianus: insights into desiccation tolerance in the resurrection grasses from linking transcriptomics to metabolomics',
    'Comparative metabolic profiling between desiccation‐sensitive and desiccation‐tolerant species of Selaginella reveals insights into the resurrection trait',
    'Molecular responses to dehydration and desiccation in desiccation-tolerant angiosperm plants',
    'Global transcriptome analysis reveals acclimation-primed processes involved in the acquisition of desiccation tolerance in Boea hygrometrica'
]

In [23]:
total_references_paperIds = [p['paperId'] for p in paper1['references'] + paper2['references']]

In [24]:
sum(1 for i in total_references_paperIds if i is None)

6

In [25]:
len(total_references_paperIds), len(set(total_references_paperIds))

(173, 166)

In [26]:
total_references_titles = [p['title'] for p in paper1['references'] + paper2['references']]

In [27]:
len(total_references_titles), len(set(total_references_titles))

(173, 171)

In [28]:
p1_ref_titles = [p['title'] for p in paper1['references']]
p2_ref_titles = [p['title'] for p in paper2['references']]

In [29]:
# 1 if the manually identified title is a substring of any retrieved title, else 0
# (also yields 0, rather than nothing, when a reference list is empty)
p1_pa = [int(any(ref_title in auto_title for auto_title in p1_ref_titles)) for ref_title in reference_titles]
p2_pa = [int(any(ref_title in auto_title for auto_title in p2_ref_titles)) for ref_title in reference_titles]

In [30]:
title_df = pd.DataFrame({'title': reference_titles, 'paper1_pres_abs': p1_pa, 'paper2_pres_abs': p2_pa})

In [31]:
title_df

| | title | paper1_pres_abs | paper2_pres_abs |
|---|---|---|---|
| 0 | Drying without dying | 1 | 0 |
| 1 | Physiological aspects of desiccation tolerance | 0 | 0 |
| 2 | A footprint of desiccation tolerance in the ge... | 1 | 0 |
| 3 | Desiccation tolerance in the vegetative tissue... | 1 | 0 |
| 4 | The evolution of desiccation tolerance in angi... | 1 | 0 |
| 5 | Role of ABA and ABI3 in desiccation tolerance | 0 | 0 |
| 6 | Transcriptional and metabolic changes in the d... | 1 | 0 |
| 7 | A systems‐based molecular biology analysis of ... | 0 | 0 |
| 8 | A sister group contrast using untargeted globa... | 0 | 0 |
| 9 | “To dryness and beyond” – preparation for the ... | 0 | 0 |


I think something is seriously wrong here. Let's reproduce this directly from the API so I can open a bug report:

In [32]:
for t in p1_ref_titles:
    if 'Physiological' in t:
        print(t)

Perspectives on Structural, Physiological, Cellular, and Molecular Responses to Desiccation in Resurrection Plants
Physiological Aspects of Desiccation Tolerance
1982.Desiccation tolerance. In Encyclopedia of Plant Physiology: Physiological Plant seed biology
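The printed titles show that 'Physiological Aspects of Desiccation Tolerance' is present in paper 1's references but capitalized differently from the manually typed entry, and `in` on Python strings is case-sensitive. Some of the zeros may therefore come from the comparison rather than from the data. The sketch below normalizes case, whitespace, and hyphen characters before matching; `normalize_title` and `presence_absence` are hypothetical helpers, and it has not been run against the full reference lists.

```python
import unicodedata

# Map common Unicode hyphens and dashes to the ASCII hyphen
_DASHES = dict.fromkeys(map(ord, '\u2010\u2011\u2012\u2013\u2014'), '-')


def normalize_title(title):
    """Case-fold a title, unify dashes, collapse whitespace; None becomes ''."""
    if title is None:
        return ''
    text = unicodedata.normalize('NFKC', title).translate(_DASHES)
    return ' '.join(text.casefold().split())


def presence_absence(expected_titles, retrieved_titles):
    """Return 1/0 per expected title, using normalized substring matching."""
    retrieved = [normalize_title(t) for t in retrieved_titles]
    return [
        int(any(normalize_title(expected) in r for r in retrieved))
        for expected in expected_titles
    ]


# Two of the titles printed above, and two entries from the manual list
retrieved_example = [
    'Physiological Aspects of Desiccation Tolerance',
    'Perspectives on Structural, Physiological, Cellular, and Molecular '
    'Responses to Desiccation in Resurrection Plants',
]
expected_example = [
    'Physiological aspects of desiccation tolerance',
    'Role of ABA and ABI3 in desiccation tolerance',
]
flags = presence_absence(expected_example, retrieved_example)
```

Rerunning the presence/absence table with this kind of comparison would help separate mismatches caused by formatting from references that are truly absent from the API response.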


In [33]:
r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'corpusId,title,abstract,references'},
    json={"ids": ["c9695c29c051499f52754f4657d9d559b9898d1a", "393cc126bd647a8435072e788a2a033561c6fa97"]},
    headers=header
).json()

In [34]:
paper1, paper2 = r[0], r[1]

In [35]:
paper1

{'paperId': 'c9695c29c051499f52754f4657d9d559b9898d1a',
 'corpusId': 211078498,
 'title': 'Desiccation Tolerance: Avoiding Cellular Damage During Drying and Rehydration.',
 'abstract': 'Desiccation of plants is often lethal but is tolerated by the majority of seeds and by vegetative tissues of only a small number of land plants. Desiccation tolerance is an ancient trait, lost from vegetative tissues following the appearance of tracheids but reappearing in several lineages when selection pressures favored its evolution. Cells of all desiccation-tolerant plants and seeds must possess a core set of mechanisms to protect them from desiccation- and rehydration-induced damage. This review explores how desiccation generates cell damage and how tolerant cells assuage the complex array of mechanical, structural, metabolic, and chemical stresses and survive. Likewise, the stress of rehydration requires appropriate mitigating cellular responses. We also explore what comparative genomics, both str

In [36]:
## Insert manual list of retrieved references and reference title list here

In [37]:
p1_ref_titles = [p['title'] for p in paper1['references']]
p2_ref_titles = [p['title'] for p in paper2['references']]
# 1 if the manually identified title is a substring of any retrieved title, else 0
# (also yields 0, rather than nothing, when a reference list is empty)
p1_pa = [int(any(ref_title in auto_title for auto_title in p1_ref_titles)) for ref_title in reference_titles]
p2_pa = [int(any(ref_title in auto_title for auto_title in p2_ref_titles)) for ref_title in reference_titles]
title_df = pd.DataFrame({'title': reference_titles, 'paper1_pres_abs': p1_pa, 'paper2_pres_abs': p2_pa})
title_df

| | title | paper1_pres_abs | paper2_pres_abs |
|---|---|---|---|
| 0 | Drying without dying | 1 | 0 |
| 1 | Physiological aspects of desiccation tolerance | 0 | 0 |
| 2 | A footprint of desiccation tolerance in the ge... | 1 | 0 |
| 3 | Desiccation tolerance in the vegetative tissue... | 1 | 0 |
| 4 | The evolution of desiccation tolerance in angi... | 1 | 0 |
| 5 | Role of ABA and ABI3 in desiccation tolerance | 0 | 0 |
| 6 | Transcriptional and metabolic changes in the d... | 1 | 0 |
| 7 | A systems‐based molecular biology analysis of ... | 0 | 0 |
| 8 | A sister group contrast using untargeted globa... | 0 | 0 |
| 9 | “To dryness and beyond” – preparation for the ... | 0 | 0 |


In [38]:
paper2['paperId']

'393cc126bd647a8435072e788a2a033561c6fa97'

In [39]:
len(paper1['references'])

160

In [40]:
len(paper2['references'])

14

## Crosschecking nodes with no citations
I suspect that the nodes with no citations are papers that do have citations but return none from the Semantic Scholar API because of the bug. I'll pull these and start manually crosschecking them to see if that is true.

In [41]:
isolates = [node for node in graph.nodes if (graph.out_degree(node) + graph.in_degree(node)) == 0]

In [42]:
isolate_titles = [v['title'] for k, v in graph.nodes(data=True) if (graph.out_degree(k) + graph.in_degree(k)) == 0]

In [43]:
len(isolate_titles)

984

In [44]:
isolates[:10]

['50e881d06365e088683c126eadd0a196b5fb636f',
 '7fa03e4b9ca20d4dbd6b16d8581b4f4ad0f96d49',
 '5c3907ccca7227db46b55a00a4415f9a5c434e70',
 '35bf9c0a324fe978388319319bd8855d26e09042',
 'a191938837d5aea2f34dc694897ac502a93e00e9',
 '9f9024901fb0915ce50139ad91666bca4838b995',
 'd34f89a23739ccfc708be7140904f3c0a904864e',
 '6b25a6237a920eaac5b21e910b1146d7eb9cfcd8',
 '31dc199e451df35474d50f188ce420ecf2ca3a8e',
 '2c12216e868a5981f17fe738d0f13ad673ca72f0']

### Notes on crosschecks
* 50e881d06365e088683c126eadd0a196b5fb636f [doesn't have a paper link](https://www.semanticscholar.org/paper/Plant-desiccation-tolerance-Renshaw/50e881d06365e088683c126eadd0a196b5fb636f) in the GUI, is just a title and an author
* 7fa03e4b9ca20d4dbd6b16d8581b4f4ad0f96d49 [is a book review](https://eds-p-ebscohost-com.proxy2.cl.msu.edu/eds/detail/detail?vid=0&sid=1ffecbeb-e542-4be4-8112-66253bee93c8%40redis&bdata=JnNpdGU9ZWRzLWxpdmU%3d#AN=37287816&db=a9h) and appears to not have any citations
* 5c3907ccca7227db46b55a00a4415f9a5c434e70 also has [no paper link](https://www.semanticscholar.org/paper/Desiccation-tolerance-in-Romanomermis-culicivorax-Saeed-Khan/5c3907ccca7227db46b55a00a4415f9a5c434e70); however, I could find it on Google Scholar, and it does have citations, but is old (1897)
* 35bf9c0a324fe978388319319bd8855d26e09042 is a [book chapter](https://onlinelibrary.wiley.com/doi/10.1002/9780470988862.ch15), and has a bunch of citations in the link that it's connected to on the Semantic Scholar GUI
* a191938837d5aea2f34dc694897ac502a93e00e9 has [no paper link available](https://www.semanticscholar.org/paper/Protein-stability-and-desiccation-tolerance.-Hoekstra-Wolkers/a191938837d5aea2f34dc694897ac502a93e00e9), and I can't find it in Google Scholar either
* 9f9024901fb0915ce50139ad91666bca4838b995 has a [paper link](https://www.sciencedirect.com/science/article/pii/S1049964496900936?via%3Dihub), and has citations, but only in the PDF, not on the website
* d34f89a23739ccfc708be7140904f3c0a904864e has a [link](https://link.springer.com/chapter/10.1007/978-1-4899-0271-9_21), is a book chapter, but there are no references in the chapter itself
* 6b25a6237a920eaac5b21e910b1146d7eb9cfcd8 has a link, references are not on the webpage but there are references in the PDF
* Same for 31dc199e451df35474d50f188ce420ecf2ca3a8e and 2c12216e868a5981f17fe738d0f13ad673ca72f0


Almost all of these papers are from before 2000, with some exceptions. What do the date distributions look like for papers with and without references? Can this simply be attributed to paper age?

In [45]:
isolate_set = set(isolates)  # set membership avoids rescanning the list for every node
non_isolates = [node for node in graph.nodes if node not in isolate_set]

In [46]:
len(non_isolates)

191521

In [47]:
384*500

192000

In [48]:
non_isolate_years = []
for i in range(384):
    ids = non_isolates[i*500:(i+1)*500]
    r = requests.post(
        'https://api.semanticscholar.org/graph/v1/paper/batch',
        headers=header,
        params={'fields': 'year'},
        json={"ids": ids}
    ).json()
    non_isolate_years.extend(r)

In [49]:
isolate_years = []
for i in range(2):
    ids = isolates[i*500:(i+1)*500]
    r = requests.post(
        'https://api.semanticscholar.org/graph/v1/paper/batch',
        params={'fields': 'year'},
        json={"ids": ids},
        headers=header
    ).json()
    isolate_years.extend(r)    
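When a batch request fails, the decoded response may be an error object (a dict) rather than a list of records, and `list.extend` on a dict adds its string keys to the list. A minimal sketch of a guarded collection loop follows; `collect_batches` and `fetch_batch` are hypothetical names, with `fetch_batch` standing in for the `requests.post(...).json()` call, and `fake_fetch` simulating the API so the logic can be checked offline. Deriving the batches from the list length also avoids hard-coding the number of iterations.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_batches(ids, fetch_batch, batch_size=500):
    """Fetch `ids` in batches; keep dict records and set aside anything else.

    `fetch_batch` is any callable that takes a list of IDs and returns the
    decoded JSON response (a hypothetical stand-in for the API request).
    """
    records, problems = [], []
    for batch in chunked(ids, batch_size):
        response = fetch_batch(batch)
        if isinstance(response, list):
            for item in response:
                # Entries that are not dicts (for example None) are set aside
                (records if isinstance(item, dict) else problems).append(item)
        else:
            # Anything that is not a list is treated as a failed batch
            problems.append(response)
    return records, problems


def fake_fetch(batch):
    """Toy stand-in for the API: fails on one batch, returns years otherwise."""
    if 'bad' in batch:
        return {'message': 'simulated failure'}
    return [{'paperId': pid, 'year': 2000} for pid in batch]


records, problems = collect_batches(['a', 'b', 'bad', 'c'], fake_fetch, batch_size=2)
```

Inspecting `problems` after a run shows which batches need to be retried instead of letting stray strings surface later as indexing errors.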

In [50]:
# Keep only dict records with a year. A failed batch adds string keys to the
# list (extend() on an error dict), and some entries may be None
non_isolate_ys = [p['year'] for p in non_isolate_years if isinstance(p, dict) and p.get('year') is not None]
isolate_ys = [p['year'] for p in isolate_years if isinstance(p, dict) and p.get('year') is not None]

In [None]:
plt.hist(non_isolate_ys, alpha=0.5, color='red', density=True, label='Papers with citations')
plt.hist(isolate_ys, alpha=0.5, color='blue', density=True, label='Papers without citations')
plt.legend()

Those don't look substantially different: some older papers do have citations associated with them, so paper age is not the driving cause of papers lacking citations. I should potentially just leave out papers with no citations when it comes time to do the analysis, but I would also like to see what Web of Science does for these same papers. I also want to manually crosscheck some of the newer papers with no links, as these may pertain to the issue I opened with the Semantic Scholar team.
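A numeric summary can complement the visual comparison of the two histograms. The sketch below is only illustrative: `summarize_years` is a hypothetical helper, the cutoff of 2000 mirrors the informal observation above, and the year lists are toy values rather than the retrieved data.

```python
from statistics import median


def summarize_years(years, cutoff=2000):
    """Return count, median year, and fraction of years before `cutoff`."""
    years = [y for y in years if y is not None]
    if not years:
        return {'n': 0, 'median': None, 'frac_before_cutoff': None}
    before = sum(1 for y in years if y < cutoff)
    return {
        'n': len(years),
        'median': median(years),
        'frac_before_cutoff': before / len(years),
    }


# Toy year lists standing in for the two groups of papers
toy_with_citations = [1995, 2005, 2015, None]
toy_without_citations = [1980, 1990, 2012]

summary_with = summarize_years(toy_with_citations)
summary_without = summarize_years(toy_without_citations)
```

Comparing the medians and the pre-cutoff fractions of the two groups gives a quick check on whether the overlap seen in the plot holds up numerically.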

In [None]:
isolates_no_NONE = [p for p in isolate_years if isinstance(p, dict) and p.get('year') is not None]
isolates_new = [p['paperId'] for p in isolates_no_NONE if p['year'] > 2010]

In [None]:
isolates_new[:10]

### Notes on new isolates
* 31bf4fe6b5434952f0aaa3f0da85a1061acac2b0 published 2013, no paper link
* f7a1dcaec950b7b78ca9b6fe11a23016d130656f published 2015, has a Semantic Reader link, actually somehow has no references but is a review??
* b41211bb1e917a956c6ab2a3af6f4c58182c86e5 published 2021, has a PubMed link, and has a ton of citations embedded in a webpage
* 4115c2a082f1bcc4599108e763621ee3c97a6665 2012, has a paper link, to a website that contains a PDF, citations are only in the PDF
* c87ee0a26653e31988be44c5fcb10d088adb7f8a 2020, has a link, but goes to a page that says it's under maintenance

## Exploring NOCLASS nodes
We want to look at the titles and abstracts of some of the classless nodes to get an idea of what's going on.

In [52]:
noclass_ids = [node for node, attrs in graph.nodes(data=True) if attrs['study_system'] == 'NOCLASS']

In [53]:
r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    headers=header,
    params={'fields': 'title,abstract,references'},
    json={"ids": noclass_ids[:100]}
).json() 

In [54]:
abstracts = [p['abstract'] for p in r]

In [55]:
abstracts

[None,
 'The desiccation-tolerant (DT) plants are poikilohydric and capable of surviving the loss of 90-95% of their cell water content. DT plants are important constituents of many ecosystems from the arctic (lichen and moss vegetations) to the tropics (lichen, moss and flower vegetations). Desiccation-tolerant (DT) plants may be subdivided into homoiochlorophyllous (HDT) and poikilochlorophyllous (PDT) types. The HDTs retain their chlorophyll on desiccation, whereas in PDTs desiccation results in the loss of chlorophyll, which must be resynthetized fol- lowing remoistening. The PDT strategy can be seen an the evolutionarily different new strategy. It is based on the dismantling of internal chloroplast structure by an ordered deconstruction process during drying, and its resynthesis upon rehydration. The selective advantage of poikilo- chlorophylly is in minimising photo-oxidative damage and not having to maintain an intact photosynthetic system through long (5-10 months) inactive per

In [56]:
titles_for_no_abstrs = [p['title'] for p in r if p['abstract'] is None]

In [57]:
titles_for_no_abstrs

['Multi‐Omics Insights into the Evolution of Angiosperm Resurrection Plants',
 'Ecophysiology of Homoiochlorophyllous and Poikilochlorophyllous Desiccation-Tolerant Plants and Vegetations',
 'DNADamage',
 'Developmental Biochemistry',
 'Late Embryogenesis Abundant (LEA) proteins confer water stress tolerance to mammalian somatic cells.',
 'An Archaeometric Characterization of Ecuadorian Pottery',
 'Molecular mechanisms driving transcriptional stress responses',
 'Why Can’t Vertebrates Synthesize Trehalose?',
 'An abundant LEA protein in the anhydrobiotic midge, PvLEA4, acts as a molecular shield by limiting growth of aggregating protein particles.',
 'Mechanisms associated with cellular desiccation tolerance of Artemia encysted embryos from locations around the world.',
 'Tracts On The Nature Of Animals And Vegetables',
 'Small Heat Shock Proteins',
 'The continuing conundrum of the LEA proteins',
 'Oxidative stress and its effects during dehydration.',
 'Dehydration-induced expression

Certainly, some of the papers that have no abstracts also don't have titles that allow them to be classified. However, some of these titles should certainly have been captured by my generic terms list. That, along with the fact that some of the abstracts also look classifiable, suggests that something may be wrong in my code. Let's read in the generic terms and try to recapitulate the fuzzy matching code here to see if we can detect any errors.

In [58]:
# Directly from classify_papers.py
def fuzzy_match_kingdoms(paper_dict, generic_dict):
    """
    Find fuzzy matches for generic names in the title and abstracts of papers
    not classified by other means.
    parameters:
        paper_dict, dict: keys are "title" and "abstract"
        generic_dict, dict: keys are generic terms, values are kingdom names
    returns:
        classes, list of str: kingdoms identified
    """
    classes = []
    for term, king in generic_dict.items():

        # Combine text to search
        if paper_dict['abstract'] is not None:
            text = paper_dict['title'] + ' ' + paper_dict['abstract']
        else:
            text = paper_dict['title']

        # Build the regex string
        sub_len = len(term)//3
        spacejoined = r"\s+".join(term.split())
        reg = fr'\b({spacejoined}){{e<={sub_len}}}\b'

        # Search in paper
        matches = regex.search(reg, text, flags=regex.IGNORECASE)

        # Map to classes
        if matches is not None:
            classes.append(king)

    return classes

In [59]:
with open('../citation_network/maps/term_map.json') as myf:
    generic_dict = json.load(myf)

In [60]:
generic_dict

{'vegetative': 'Plant',
 'seeds': 'Plant',
 'plants': 'Plant',
 'Barley': 'Plant',
 'Cassava': 'Plant',
 'Corn': 'Plant',
 'Maize': 'Plant',
 'Oats': 'Plant',
 'Potato': 'Plant',
 'Sorghum': 'Plant',
 'Soybean': 'Plant',
 'Wheat': 'Plant',
 'Tardigrade': 'Animal',
 'Nematode': 'Animal',
 'Roundworm': 'Animal',
 'Brine shrimp': 'Animal',
 'Rotifers': 'Animal',
 'Crustacean': 'Animal',
 'Arthropods': 'Animal',
 'Bacterial': 'Microbe',
 'Archeal': 'Microbe',
 'Microbial': 'Microbe',
 'Prokaryote': 'Microbe',
 'Cyanobacteria': 'Microbe',
 'Blastospores': 'Fungi',
 'Propagule': 'Fungi',
 'Fungal': 'Fungi',
 'Lichen': 'Fungi'}
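The matching function allows up to `len(term)//3` errors per term through the `{e<=n}` fuzzy constraint, so the tolerance grows with term length. Tabulating that budget for a few of the terms above makes it easier to judge which ones are likely to match loosely; `error_budget` is a hypothetical helper written only for this check.

```python
def error_budget(terms):
    """Number of errors tolerated per term under the len(term)//3 rule."""
    return {term: len(term) // 3 for term in terms}


# A few terms taken from the generic term map shown above
sample_terms = ['Corn', 'plants', 'Brine shrimp', 'Cyanobacteria']
budgets = error_budget(sample_terms)

for term, allowed in budgets.items():
    print(f'{term!r}: up to {allowed} error(s) in {len(term)} characters')
```

Short terms such as 'Corn' tolerate one error in four characters, which is worth keeping in mind when looking at unexpected matches.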

In [61]:
papers_with_both = [p for p in r if p['abstract'] is not None]

In [62]:
for p in papers_with_both:
    print(fuzzy_match_kingdoms(p, generic_dict))

['Plant', 'Plant', 'Fungi']
[]
[]
[]
['Animal']
['Microbe']
[]
['Plant', 'Microbe']
[]
[]
['Plant']
[]
[]
[]
[]
[]
[]
['Plant', 'Plant', 'Microbe']
['Animal']
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
['Animal']
[]
[]
[]
[]
[]
['Plant', 'Plant']
['Plant', 'Plant']
['Plant']
['Plant', 'Plant', 'Plant']
[]
['Plant', 'Plant']
['Plant', 'Plant']


So there are some that do get classified -- something may be wrong in my use of this function. However, what about the ones that aren't getting classifications?

In [63]:
noclass = []
for p in papers_with_both:
    classes = fuzzy_match_kingdoms(p, generic_dict)
    if classes == []:
        noclass.append({'title': p['title'], 'abstract': p['abstract']})

In [64]:
noclass

[{'title': 'Preferential adsorption to air–water interfaces: a novel cryoprotective mechanism for LEA proteins',
  'abstract': 'Late embryogenesis abundant (LEA) proteins comprise a diverse family whose members play a key role in abiotic stress tolerance. As intrinsically disordered proteins, LEA proteins are highly hydrophilic and inherently stress tolerant. They have been shown to stabilise multiple client proteins under a variety of stresses, but current hypotheses do not fully explain how such broad range stabilisation is achieved. Here, using neutron reflection and surface tension experiments, we examine in detail the mechanism by which model LEA proteins, AavLEA1 and ERD10, protect the enzyme citrate synthase (CS) from aggregation during freeze–thaw. We find that a major contributing factor to CS aggregation is the formation of air bubbles during the freeze–thaw process. This greatly increases the air–water interfacial area, which is known to be detrimental to folded protein stab

I am noticing a few patterns here:

* Many of these do not actually mention a system; many are in vitro studies of proteins involved in desiccation tolerance that don't name the organism of origin in the title or abstract.
* Some are studies about humans that don't seem to be about desiccation tolerance, so maybe I shouldn't worry about those.
* A few look like reviews across all of the kingdoms; they talk about "eukaryotic cells" and the like. If we were properly recognizing the class of every paper that can be classified, we could identify these as unclassified nodes whose connections span a diversity of kingdoms.
* At least one plant paper, about cottonseed, doesn't get recognized; I don't know if cotton is common enough that I should add it to my list.
* There are also a bunch of protein structure databases in here. I'm not sure why those would come up in the search, but I do think that at more than 10,000 results we run the risk of picking up too many totally random papers.

In summary, I don't think anything is wrong with the fuzzy matching itself; it is working as expected. The problem is more likely at the integration level, where I use the function within the code, which I don't believe I tested.
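The idea of spotting cross-kingdom reviews from the kingdoms of their neighbors could be sketched as follows. This is only an illustration on a toy edge list: `neighbor_kingdoms` and `looks_cross_kingdom` are hypothetical helpers, and the threshold of two kingdoms is an assumption rather than a tested choice.

```python
from collections import Counter


def neighbor_kingdoms(node, edges, classes):
    """Count the kingdoms of classified nodes connected to `node`."""
    counts = Counter()
    for source, target in edges:
        if source == node:
            other = target
        elif target == node:
            other = source
        else:
            continue
        label = classes.get(other, 'NOCLASS')
        if label != 'NOCLASS':
            counts[label] += 1
    return counts


def looks_cross_kingdom(node, edges, classes, min_kingdoms=2):
    """True if the node's classified neighbors span at least `min_kingdoms`."""
    return len(neighbor_kingdoms(node, edges, classes)) >= min_kingdoms


# Toy citation edges (citing, cited) and node classes, for illustration only
toy_edges = [('review', 'p1'), ('review', 'a1'), ('m1', 'review'), ('solo', 'p1')]
toy_classes = {
    'review': 'NOCLASS', 'solo': 'NOCLASS',
    'p1': 'Plant', 'a1': 'Animal', 'm1': 'Microbe',
}

review_counts = neighbor_kingdoms('review', toy_edges, toy_classes)
review_flag = looks_cross_kingdom('review', toy_edges, toy_classes)
solo_flag = looks_cross_kingdom('solo', toy_edges, toy_classes)
```

On the real graph the same counts could be gathered from each NOCLASS node's predecessors and successors, once the classification of the remaining nodes is trusted.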

Does this phenomenon of not catching generic classifications hold for my smaller test graphs? I want to be able to debug this without having to wait 3 days every time I change something.

In [100]:
minigraph = nx.read_graphml('../data/citation_network/10_papers_with_extra_class.graphml')

In [101]:
noclass_ids = [node for node, attrs in minigraph.nodes(data=True) if attrs['study_system'] == 'NOCLASS']

In [102]:
noclass_ids

['2ee8adf45800831679c9ee226cf108b34197108b',
 '9405079823b201c061eec9b85f184702289f0472',
 'c7937572a01f822d61d7f1b3eda5dfca10b06915',
 '009f1d4b6fc58cb3452838cde0606e96ba4e9a53',
 'a8a1176af94deee7ca30f1798fb4ce2e9fea154e',
 '6d3384c3e7a9dd49ff1baed9a25f8172762ea939',
 '4123d4d0036c29cbc51ea4909d284a9aa9dcf949',
 '9335a8bc9f8dd18cdf598d3856aefc48dcc81eab',
 '78f1bf2a439beef61f2e0ee5ae9dc9d48b8d2577',
 'f3efd8924a04c44b244b26b8523c80fa154c9dd3',
 '374b348c41a68570a921733bfae75adc6f81150e',
 'bc6970a444cc731fb029ab68c55417a1c8b2b4fc',
 '392219ddc650779a44bd954318cc287ef753898d',
 '04f37893fbca894b476f1051248f4474dd482449',
 '6ca40139e7484fd4d14be150688b082f7749124f',
 'eb5f8b662b19bb61800bc13b2e4dbb076e329bff',
 '9db3e52a3d78fcf5c2def98e1255cce4f32d23bf',
 '1d785f8168c256cd5e5bf48af477fadc5015f7a8',
 'd5bd1a6a7241d4622caffcf3119e3c15184233c7',
 '23c0593e3a136b8a07143e69e5beef7d231a4416',
 '9eacd58f54f43598f336d5c421e5e015fffca4c5',
 'a976a141155fa0680556f4a09c37a5875e9386a3',
 '694d8c86

In [103]:
r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    headers=header,
    params={'fields': 'title,abstract,references'},
    json={"ids": noclass_ids}
).json() 

In [104]:
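# Keep only papers that came back with an abstract (also skip any null entries for IDs the API couldn't resolve)
papers_with_both = [p for p in r if p is not None and p['abstract'] is not None]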

In [105]:
len(r), len(papers_with_both)

(220, 103)

In [106]:
all_matches = []
ids_for_miss_class = []
for p in papers_with_both:
    matches = fuzzy_match_kingdoms(p, generic_dict)
    all_matches.append(matches)
    if len(matches) != 0:
        ids_for_miss_class.append(p['paperId'])

In [107]:
ids_for_miss_class ## This is the list I used for debugging

['9405079823b201c061eec9b85f184702289f0472',
 '374b348c41a68570a921733bfae75adc6f81150e',
 '4e9755c1228fba28359ee0770389f9e7ef5bf092',
 '04f37893fbca894b476f1051248f4474dd482449',
 '9db3e52a3d78fcf5c2def98e1255cce4f32d23bf',
 'a976a141155fa0680556f4a09c37a5875e9386a3',
 '548fe290e72bdbb228f5579e854ffa1024c003a0',
 '140ca9b5e5f4e991b67248fc21c57fbee2fe5f69',
 '674d7dfbc1b325e59e239abe2b5ec6cb1e62de2e',
 '86921b60eef6858a801a31f4964b2a7803e3cc51',
 '55cd38f2f690ac315bfdd47aea44d0f652f8159e',
 'd9ed336823e8c2cca78fc16a58703f431a6efa4c',
 '22b02f0470db36984e024ae086f5295294236565',
 'e6147302bb27a9324855cb36a6990538722e0d17',
 'a6847b7223e5e4867f8bd769f2f1b4df6447fe2f',
 'caf402dcc612a39fdbcbf8254ecc34b45195325c',
 '2b78a2dfaa70619d7aa3933e0404a41349bffe7e',
 'b90219933f7e1865ef230cdaa4db8fa7efbf00ae',
 '9dcd5a1d23a22f10e45e14a3681fc0a63550e937',
 '6b7c902b2a788b52684c05301d7839bf62869a5f',
 'e45279ec7ac4f960e6801629d5bdf3ef49036b75',
 '0c9ec5f323b841b4d84c0633d9001aac505c7ccc',
 '5f928041

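Hmm, it certainly looks like we have a bug! These papers are all NOCLASS in the graph, yet fuzzy matching finds a kingdom for them when I run it directly. Now I'll work on chasing it down with the inputs for this graph.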

Debugging strategy: I defined a `verboseprint` function that only prints my debugging statements if the paper ID is in the list above. At each stage of the `map_paper_species` function, I checked the value of the `classes` variable for those papers.
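A minimal sketch of what such an ID-gated print helper might look like is below. The real `verboseprint` isn't shown in this notebook, so the factory `make_verboseprint`, its `emit` parameter, and the message format are all assumptions made for illustration.

```python
def make_verboseprint(debug_ids, emit=print):
    """Build a print-like function that only fires for selected paper IDs.

    Hypothetical helper: `emit` is injectable so the output can be captured
    in a list instead of going to stdout.
    """
    debug_ids = set(debug_ids)

    def verboseprint(paper_id, *parts):
        # Stay silent for every paper that isn't being debugged
        if paper_id in debug_ids:
            emit(f'[{paper_id}] ' + ' '.join(str(p) for p in parts))

    return verboseprint


# Usage: only the first call produces output
verboseprint = make_verboseprint(['9405079823b201c061eec9b85f184702289f0472'])
verboseprint('9405079823b201c061eec9b85f184702289f0472', 'classes after linking:', [])
verboseprint('some-other-id', 'this line is skipped')
```

Gating on the paper ID keeps the output readable when the function runs over thousands of papers, since only the handful under investigation print anything.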

After looking at the output, I found the issue: sometimes TaxoNERD tags words that never get linked back to a kingdom, like "anhydrobiotes". In that case the paper skips fuzzy classification, because it has a name in the species list, and then ends up with no class at all. I fixed this by running fuzzy matching once more at the end of the function whenever there are still no classes. Let's see if this worked:
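For reference, the shape of that fix can be sketched as below. This is not the actual body of `map_paper_species`, which isn't shown here: `classify_paper`, `link_to_kingdom`, and the toy lookup and matcher are hypothetical stand-ins, and the single-argument `fuzzy_match` callable is assumed to wrap something like `fuzzy_match_kingdoms(p, generic_dict)`.

```python
def classify_paper(paper, species_names, link_to_kingdom, fuzzy_match):
    """Hypothetical stand-in for the classification step.

    species_names: names tagged by the NER step (may be empty).
    link_to_kingdom: callable returning a kingdom, or None if unlinkable.
    fuzzy_match: callable returning a list of kingdoms for a paper.
    """
    # Try to link every tagged name; unlinkable names are dropped
    classes = [k for k in map(link_to_kingdom, species_names) if k is not None]

    # The fix: fall back to fuzzy matching whenever linking produced nothing,
    # not only when the species list was empty to begin with
    if not classes:
        classes = list(fuzzy_match(paper))
    return classes


# Toy stand-ins for the linker and the fuzzy matcher
toy_links = {'Arabidopsis thaliana': 'Plant'}


def toy_fuzzy(paper):
    text = ((paper.get('title') or '') + ' ' + (paper.get('abstract') or '')).lower()
    return ['Animal'] if 'tardigrade' in text else []


paper = {'title': 'Anhydrobiotes and tardigrade survival', 'abstract': None}
# "anhydrobiotes" is tagged but unlinkable, so the fuzzy fallback kicks in
print(classify_paper(paper, ['anhydrobiotes'], toy_links.get, toy_fuzzy))
```

The small graph was re-run with the real fix in place, and the result is loaded next.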

In [108]:
fixedgraph = nx.read_graphml('../data/citation_network/10_papers_for_generic_debug.graphml')
noclass_ids_fixed = [node for node, attrs in fixedgraph.nodes(data=True) if attrs['study_system'] == 'NOCLASS']

First check: Do we have fewer NOCLASS nodes than before?

In [112]:
noclass_ids == noclass_ids_fixed

False

In [117]:
len(noclass_ids), len(noclass_ids_fixed), len(set(noclass_ids).intersection(set(noclass_ids_fixed)))

(220, 138, 138)

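Yes! We classified 82 more papers this way (220 - 138), and all 138 remaining NOCLASS nodes were also NOCLASS before, so the fix didn't un-classify anything. Now we want to check that fuzzy matching doesn't return anything for the papers that are still NOCLASS: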

In [109]:
r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    headers=header,
    params={'fields': 'title,abstract,references'},
    json={"ids": noclass_ids_fixed}
).json() 

In [110]:
papers_with_both_fixed = [p for p in r if p is not None and p['abstract'] is not None]

In [111]:
# These are the papers with NOCLASS that have both a title and abstract
all_matches_fixed = []
ids_for_miss_class_fixed = []
for p in papers_with_both_fixed:
    matches = fuzzy_match_kingdoms(p, generic_dict)
    all_matches_fixed.append(matches)
    if len(matches) != 0:
        ids_for_miss_class_fixed.append(p['paperId'])

In [120]:
ids_for_miss_class_fixed

[]

In [119]:
all_matches_fixed

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 []]

There are no fuzzy matches in there, so that solves it! Obviously there are still other issues impacting our ability to classify papers, but those seem to be out of my control as a programmer (for example, no classification being indicated in the abstract, or the study being done on "eukaryotes").

## Exploration of NOCLASS papers
After debugging the above issue, I re-ran the classification on the network; however, 27% of nodes still have no classification. To look into this, I wrote a script that presents me with a random sample of unclassified abstracts so that I could classify them manually and get an idea of the distribution of true classifications within the NOCLASS category. I did this for about 50 abstracts (roughly 0.1% of the papers with no classification, so really too small, but I wanted to start here). Let's take a look at the results.
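The sampling script itself isn't included in this notebook; a rough sketch of the approach, under assumptions, might look like the following. The function names, the injectable `ask` callable, and the seed are all hypothetical, and the label codes are simply the ones that appear in the results CSV below (their meanings are not asserted here).

```python
import random

# Single-letter codes as they appear in the manual classification CSV
ALLOWED_LABELS = {'P', 'A', 'F', 'M', 'U'}


def sample_ids(ids, k, seed=0):
    """Draw a reproducible random sample of paper IDs (hypothetical helper)."""
    rng = random.Random(seed)
    ordered = sorted(ids)  # fixed order so the same seed gives the same sample
    return rng.sample(ordered, min(k, len(ordered)))


def collect_labels(sample, abstracts, ask=input):
    """Show each abstract and record a label, re-asking on invalid input."""
    labels = {}
    for paper_id in sample:
        text = abstracts.get(paper_id, '(no abstract available)')
        answer = ask(f'{text}\nLabel {sorted(ALLOWED_LABELS)}: ').strip().upper()
        while answer not in ALLOWED_LABELS:
            answer = ask(f'Please enter one of {sorted(ALLOWED_LABELS)}: ').strip().upper()
        labels[paper_id] = answer
    return labels


# Non-interactive demo: canned answers stand in for typing at the prompt
demo_ids = ['id_a', 'id_b', 'id_c', 'id_d']
demo_sample = sample_ids(demo_ids, k=2, seed=42)
canned = iter(['x', 'p', 'U'])  # 'x' is rejected, then 'p' and 'U' are accepted
demo_labels = collect_labels(demo_sample, {}, ask=lambda prompt: next(canned))
print(demo_labels)
```

In interactive use, `ask` would be left at its default of `input`. The manually labelled results from the real run are loaded next.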

In [9]:
classed = pd.read_csv('../data/citation_network/full_from_20Nov_manual_classification_21Nov2023.csv', index_col=0)
classed = classed.rename(columns={'0': 'classification'})
classed.head()

Unnamed: 0,classification
fb4f0c3b540dc58db7064ddc971e3bbaefa1c910,P
02bf75659dab1090df2fd0fe29b2a4923d662d3e,U
c07a8511bb6873f7ba74979f019268c7f57e6989,A
b4cc6ef56df2c469f5f8cf2eb08aac5143307b57,U
90a69e58d067848a9009afb41276c8dadcb9b831,A


In [11]:
classed.shape

(52, 1)

In [12]:
classed.value_counts()

classification
U                 22
P                 14
A                 12
F                  2
M                  2
Name: count, dtype: int64

In [19]:
classed.value_counts()/classed.shape[0]

classification
U                 0.423077
P                 0.269231
A                 0.230769
F                 0.038462
M                 0.038462
Name: count, dtype: float64

In this first sample, fewer than half of the NOCLASS papers (the 42% labelled U) truly had no classification; the rest could be assigned one by hand. Let's do this for a larger sample size and see if the same trends hold.

In [14]:
classed_520 = pd.read_csv('../data/citation_network/full_from_20Nov_manual_classification_21Nov2023_500_docs.csv', index_col=0)
classed_520 = classed_520.rename(columns={'0': 'classification'})
classed_520.head()

Unnamed: 0,classification
9888964448274971590cff46a944aec4e584441b,P
363579488ea250c3cce424563f0b91f44c976d07,U
1c16b960dc336416bc9f58953b385b03a8bdbe11,A
f8de13ac6595177952416abb510b1dd76a976541,A
cbc2d041f899873d2e6f724e4f17f002eb8e34f6,A


In [16]:
classed_520.shape

(520, 1)

In [17]:
classed_520.value_counts()

classification
U                 278
A                 118
P                  83
M                  29
F                  12
Name: count, dtype: int64

In [18]:
classed_520.value_counts()/classed_520.shape[0]

classification
U                 0.534615
A                 0.226923
P                 0.159615
M                 0.055769
F                 0.023077
Name: count, dtype: float64

In the larger sample, about half (53%) are genuinely not classifiable, which is still not that many, all things considered. Going through the abstracts manually like that was very informative, both for other keywords I could consider using and for TaxoNERD's weaknesses. I assume that even when some of the more out-there species names were identified initially, they weren't linked and therefore got dropped; however, that affected only a small proportion of the documents.

There were also a ton of soil science articles, as well as medical articles (many of which were about contact lenses, strangely), that weren't relevant to what we really want to be looking at. This was a good lesson in refining search terms and why it's important to be picky when generating the initial dataset.

There were also a fair number of malformed documents that didn't appear to correspond to real papers; this may be useful information for the Semantic Scholar folks.
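To get a feel for how far the two manual samples can be trusted, one option is to put a confidence interval on the share of U labels. The sketch below uses a Wilson score interval with the counts from the outputs above (22 of 52 and 278 of 520). The helper name `wilson_interval` and the 95% level (z = 1.96) are choices made for illustration, not part of the original analysis, and the calculation assumes the abstracts were sampled independently at random.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Share of U labels in the two manual samples
small = wilson_interval(22, 52)
large = wilson_interval(278, 520)
print(f'n=52:  {small[0]:.2f} to {small[1]:.2f}')
print(f'n=520: {large[0]:.2f} to {large[1]:.2f}')
```

The interval from the 52-abstract sample is considerably wider than the one from the 520-abstract sample, which is one way to see why the first pass was "really too small" to pin down the proportions.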