# Citation network integrity check
We want to run a sanity check to make sure that no node is repeated in the graph. We'll start with the file generated by `pull_papers.py`, since it contains all of the information we pulled about each paper.

In [11]:
import jsonlines
from collections import defaultdict

In [3]:
with jsonlines.open('../data/semantic_scholar/desiccation_tolerance_10000_with_reference_abstracts_19Sep2023.jsonl') as reader:
    papers = list(reader)  # one dict per paper

In [4]:
len(papers)

9999

In [5]:
papers[0]

{'paperId': '393cc126bd647a8435072e788a2a033561c6fa97',
 'title': 'Unexplored dimensions of variability in vegetative desiccation tolerance.',
 'abstract': 'Desiccation tolerance has evolved recurrently across diverse land plant lineages as an adaptation for survival in regions where seasonal rainfall drives periodic drying of vegetative tissues. Growing interest in this phenomenon has fueled recent physiological, biochemical, and genomic insights into the mechanistic basis of desiccation tolerance. Although, desiccation tolerance is often viewed as binary and monolithic, substantial variation exists in the phenotype and underlying mechanisms across diverse lineages, heterogeneous populations, and throughout the development of individual plants. Most studies have focused on conserved responses in a subset desiccation-tolerant plants under laboratory conditions. Consequently, the variability and natural diversity of desiccation-tolerant phenotypes remains largely uncharacterized. Here, 

Check that there are no duplicated paperIds:

In [7]:
all_paperIds = []
for paper in papers:
    all_paperIds.append(paper['paperId'])
    for ref in paper['references']:
        all_paperIds.append(ref['paperId'])

In [8]:
len(all_paperIds), len(set(all_paperIds))

(468217, 192506)

That is a good first indication! The list holds 468,217 paperIds in total because a paper is counted every time it is cited, but only 192,506 of them are distinct. That number corresponds with the number of nodes in our network, meaning no paper appears as more than one node. I've been assuming paperIds are unique; in case they for some reason aren't, let's do the same thing for paper titles:
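To make the reasoning behind this check concrete, here is a minimal sketch on toy data. The records in `toy_papers` and the helper `collect_ids` are hypothetical and only mirror the structure of `papers`; they are not part of the real dataset.

```python
from collections import Counter

# Hypothetical toy records with the same shape as the entries in `papers`
toy_papers = [
    {"paperId": "a", "title": "A",
     "references": [{"paperId": "b", "title": "B"},
                    {"paperId": "c", "title": "C"}]},
    {"paperId": "b", "title": "B",
     "references": [{"paperId": "c", "title": "C"}]},
]

def collect_ids(records):
    """Return every paperId mention: each paper plus each of its references."""
    ids = []
    for paper in records:
        ids.append(paper["paperId"])
        ids.extend(ref["paperId"] for ref in paper["references"])
    return ids

toy_ids = collect_ids(toy_papers)
id_counts = Counter(toy_ids)   # mentions per paperId

n_mentions = len(toy_ids)      # every appearance, repeats included
n_unique = len(id_counts)      # distinct paperIds, i.e. expected node count
print(n_mentions, n_unique)
```

The total is larger than the unique count whenever a paper is cited more than once, so the number to compare against the node count of the graph is the unique count.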

In [9]:
all_titles = []
for paper in papers:
    all_titles.append(paper['title'])
    for ref in paper['references']:
        all_titles.append(ref['title'])

In [10]:
len(all_titles), len(set(all_titles))

(468217, 225113)

There are more unique titles (225,113) than unique paperIds (192,506), which seems counterintuitive: at least some paperIds must be associated with more than one title. Let's take a look at paperIds that have multiple titles:

In [17]:
paperId_to_titles = defaultdict(list)
for paper in papers:
    paperId_to_titles[paper['paperId']].append(paper['title'])
    for ref in paper['references']:
        paperId_to_titles[ref['paperId']].append(ref['title'])

In [18]:
len(paperId_to_titles)

192506

In [22]:
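# Caveat: the lists keep repeated mentions, so this counts IDs seen more than once, not IDs with distinct titles
more_than_one = {pid: len(v) for pid, v in paperId_to_titles.items() if len(v) > 1}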

In [23]:
len(more_than_one)

67030
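Because `paperId_to_titles` stores every mention in a list, a paperId that is cited twice under the same title also has `len(v) > 1`. A minimal sketch of counting distinct titles per paperId instead, using sets; the `toy_mentions` pairs and the other names here are hypothetical and stand in for the real data.

```python
from collections import defaultdict

# Hypothetical (paperId, title) mentions; "b" appears under two spellings
toy_mentions = [
    ("a", "Title A"),
    ("a", "Title A"),    # same paper cited twice, identical title
    ("b", "Title B"),
    ("b", "Title B."),   # same ID, slightly different title string
    ("c", "Title C"),
]

# A set per ID collapses repeated identical titles
id_to_distinct_titles = defaultdict(set)
for pid, title in toy_mentions:
    id_to_distinct_titles[pid].add(title)

# Keep only IDs that really map to more than one distinct title
truly_multiple = {
    pid: titles
    for pid, titles in id_to_distinct_titles.items()
    if len(titles) > 1
}
print(truly_multiple)
```

In the toy data only `"b"` survives the filter, while `"a"` is dropped even though it was mentioned twice.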

In [25]:
max(more_than_one.values())

39849
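A single key with 39849 entries is more than the 9999 papers in the file, so it cannot come from each paper citing one reference once; it is worth checking which key that is. The sketch below shows one way to find it on toy data. The `toy_counts` values are hypothetical, and the `None` key is only an assumed placeholder for references without an ID; whether that is what happens in the real file has not been checked here.

```python
# Hypothetical mention counts per key, shaped like `more_than_one`
toy_counts = {"a": 2, "b": 5, None: 40}

# Key with the largest count; comparing values avoids ordering mixed-type keys
top_id = max(toy_counts, key=toy_counts.get)
top_count = toy_counts[top_id]
print(repr(top_id), top_count)
```

On the real data, the same call on `more_than_one` returns the key behind the maximum, and `paperId_to_titles[top_id][:10]` would show a sample of the titles stored under it.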