# Network descriptive statistics
In this notebook, I'll explore descriptive statistics and analyses of my citation network(s). This will include things like topic modelling in addition to simpler statistics. For the moment, I'm working with the Semantic Scholar dataset, so statistics about citations should be taken with a grain of salt.

In [3]:
import jsonlines
import requests
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import networkx as nx

In [2]:
# Import API key. This must be requested from https://www.semanticscholar.org/product/api#api-key; we save ours in an untracked file in data and import here
import sys
sys.path.append('../data/')
from semantic_scholar_API_key import API_KEY
header = {'x-api-key': API_KEY}

## Reading in the dataset

In [37]:
with jsonlines.open('../data/semantic_scholar/desiccation_tolerance_10000_with_reference_abstracts_19Sep2023.jsonl') as reader:
    papers = []
    for obj in reader:
        papers.append(obj)

In [40]:
# Wrangling into a flattened object with paperID as indexer to eliminate redundant papers for analyses where we don't
# care about the connectivity of the network
flattened_papers = {}
for p in papers:
    try:
        flattened_papers[p['paperId']] = {'title': p['title'], 'abstract': p['abstract']}
    except KeyError:
        flattened_papers[p['paperId']] = {'title': p['title']}
    for r in p['references']:
        try:
            flattened_papers[r['paperId']] = {'title': r['title'], 'abstract': r['abstract']}
        except KeyError:
            flattened_papers[r['paperId']] = {'title': r['title']}

In [43]:
# Using old classification from before generic bugfix for the moment
classified = nx.read_graphml('../data/citation_network/full_10000_with_classification_gen_terms_debug_15Nov2023.graphml')

In [46]:
paper_classifications = {k: v['study_system'] for k, v in classified.nodes(data=True)}

## Simple descriptive statistics
### Number of papers

In [None]:
print(f'There are {len(flattened_papers)} unique papers in the dataset')

In [None]:
# Check the overlap between both sets
len(set(paper_classifications)), len(set(flattened_papers.keys()).intersection(set(paper_classifications)))

### Number of papers per year
I didn't grab the years in my initial retrieval, so I'll retrieve them now.

In [None]:
# Ceiling division: an exact multiple of 500 shouldn't produce an extra, empty batch
num_batches = -(-len(flattened_papers) // 500)
num_batches, num_batches*500

In [None]:
to_retrieve = list(flattened_papers.keys())

In [None]:
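import time

papers_with_years = []
for i in range(num_batches):
    # Request the papers in slices of 500 IDs
    ids = to_retrieve[i*500:(i+1)*500]
    succeeded = False
    while not succeeded:
        r = requests.post(
            'https://api.semanticscholar.org/graph/v1/paper/batch',
            params={'fields': 'year'},
            json={"ids": ids},
            headers=header
        ).json()
        # A successful call returns a list of records; anything else is treated as a failure
        if isinstance(r, list):
            succeeded = True
        else:
            print(f'Request number {i} failed, trying again')
            time.sleep(1)  # brief pause so repeated failures don't hammer the API
    papers_with_years.extend(r)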
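
The retry loop above keeps trying indefinitely if a batch never succeeds. Below is a minimal sketch of the same batching pattern with a bounded number of attempts and an increasing delay between them. The names `chunked`, `fetch_with_retries` and `fake_fetch` are hypothetical, and `fake_fetch` is only a stand-in for the real `requests.post(...).json()` call so that the example runs without network access.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of at most `size` items; never yields an empty slice."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_with_retries(fetch, ids, max_attempts=5, base_delay=1.0):
    """Call fetch(ids) until it returns a list, waiting longer after each failure."""
    for attempt in range(max_attempts):
        result = fetch(ids)
        if isinstance(result, list):
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f'Batch still failing after {max_attempts} attempts')


# Stand-in for the real request (hypothetical): fails on the first call,
# then returns one record per requested ID
calls = {'n': 0}


def fake_fetch(ids):
    calls['n'] += 1
    if calls['n'] == 1:
        return {'message': 'simulated failure'}
    return [{'paperId': i, 'year': None} for i in ids]


batches = list(chunked(list('abcde'), 2))
results = []
for batch in batches:
    results.extend(fetch_with_retries(fake_fetch, batch, base_delay=0))
```

Raising after `max_attempts` makes a persistent failure, such as an invalid API key, visible instead of leaving the cell running forever.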

In [None]:
len(papers_with_years)

In [None]:
len([p for p in papers_with_years if p is None])

In [None]:
paper_years = [p['year'] for p in papers_with_years if (p is not None) and (p['year'] is not None)]

In [None]:
len(paper_years)

In [None]:
_ = plt.hist(paper_years, bins=100)
plt.title('New publications per year for search term "desiccation tolerance"')
plt.xlabel('Publication Year')
plt.ylabel('Count')

Here we can see that research on desiccation tolerance really started to take off around 1950. What does the total number of publications over time look like?

In [None]:
counts_per_year = Counter(paper_years)

In [None]:
ordered_years = sorted(counts_per_year.keys())

In [None]:
# Running total of the per-year counts (not of the years themselves), in thousands
cumulative_years = {y: sum(counts_per_year[x] for x in ordered_years[:i + 1])/1000 for i, y in enumerate(ordered_years)}

In [None]:
plt.scatter(cumulative_years.keys(), cumulative_years.values())
plt.title('Cumulative publications over time for search term "desiccation tolerance"')
plt.ylabel('Total publications (thousands)')
plt.xlabel('Year')
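
A cumulative total is a running sum of the per-year counts: the value for a given year is the number of papers published in or before that year. Below is a minimal sketch of that calculation using `itertools.accumulate`. The helper name `cumulative_counts` and the toy years are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(years):
    """Map each year to the number of papers published in or before that year."""
    counts = Counter(years)
    ordered = sorted(counts)
    # accumulate yields the running sum of the per-year counts, in year order
    running = accumulate(counts[y] for y in ordered)
    return dict(zip(ordered, running))


toy_years = [1990, 1990, 1992, 1995, 1995, 1995]
cumulative = cumulative_counts(toy_years)
```

A quick sanity check for real data is that the value for the last year equals the total number of papers with a year.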

What does this look like if we subset by the classification that we have?

In [None]:
len(set(paper_classifications.keys()).intersection(set([p['paperId'] for p in papers_with_years if p is not None])))

There are papers in the year dataset that don't match up with any `paperId` in the classification set. I believe this is due to [paper merging](https://github.com/allenai/s2-folks/issues/157); however, I am unaware of a way to get back the `paperId`s that I started with. For now, I'm going to move on with the intersection, with the awareness that ~6,000 papers are missing.
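
One possible way to check the merging hypothesis is to pair each requested ID with the record returned in the same position. This is only a sketch and rests on an assumption I haven't verified here: that the batch endpoint returns one result per requested ID, in request order, with `None` for IDs it can't resolve. If that holds, `to_retrieve` and `papers_with_years` line up element by element. The helper name and the toy data are hypothetical.

```python
def pair_requested_with_returned(requested_ids, responses):
    """Pair each requested ID with the paperId returned in the same position."""
    if len(requested_ids) != len(responses):
        raise ValueError('Expected one response slot per requested ID')
    id_map = {}
    for requested, record in zip(requested_ids, responses):
        # None marks an ID that could not be resolved at all
        id_map[requested] = None if record is None else record['paperId']
    return id_map


# Toy data (hypothetical): 'old2' now resolves to 'new2', and 'old3' was not found
requested = ['id1', 'old2', 'old3']
responses = [{'paperId': 'id1', 'year': 2001}, {'paperId': 'new2', 'year': 1998}, None]
id_map = pair_requested_with_returned(requested, responses)

# IDs whose returned paperId differs from the one that was requested
merged = {old: new for old, new in id_map.items() if new is not None and new != old}
```

If `merged` turned out to be non-empty on the real data, the classifications could be looked up by the requested ID instead of the returned one.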

In [None]:
# Add the classifications to the dataset
for p in papers_with_years:
    if p is not None:
        try:
            p['classification'] = paper_classifications[p['paperId']]
        except KeyError:
            # p['classification'] = 'missing_in_new'
            continue

In [None]:
# Separate by classification
years_per_class = defaultdict(list)
for p in papers_with_years:
    if p is not None:
        if p['year'] is not None:
            try:
                years_per_class[p['classification']].append(p['year'])
            except KeyError:
                continue

In [None]:
years_per_class.keys()

We want to look at both the per-year and the cumulative versions:

In [None]:
colors = {'Plant': '#E69F00', 'Animal': '#56B4E9', 'Microbe': '#009E73', 'Fungi': '#F0E442', 'NOCLASS': '#CC79A7', 'missing_in_new': '#C7C7C7'}

In [None]:
fig, ax = plt.subplots(1, 1)
for cls, yrs in years_per_class.items():
    _ = ax.hist(yrs, bins=100, color=colors[cls], label=cls, alpha=0.5)
plt.legend()
plt.title('New publications per year by study system for search term "desiccation tolerance"')
plt.xlabel('Publication Year')
plt.ylabel('Count')
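
When `bins=100` is passed to each `hist` call separately, every class gets its own bin edges based on its own range of years, so the overlaid bars aren't directly comparable. Below is a minimal sketch of computing one shared set of edges to pass to every call. The helper name `shared_year_bins` and the toy data are hypothetical.

```python
def shared_year_bins(years_by_class, width=1):
    """Return bin edges spanning every class, so all histograms use the same bins."""
    all_years = [y for yrs in years_by_class.values() for y in yrs]
    first, last = min(all_years), max(all_years)
    # Extend past the last year so that it falls inside the final bin
    return list(range(first, last + width + 1, width))


toy_years_per_class = {'Plant': [1990, 1995], 'Animal': [1992]}
bins = shared_year_bins(toy_years_per_class)

# In the plotting cell, the shared edges would replace bins=100:
#     ax.hist(yrs, bins=bins, color=colors[cls], label=cls, alpha=0.5)
```

With `width=1` each bar corresponds to exactly one publication year, which also makes the counts easier to read off.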

The distributions for the different study systems have a similar shape, which is great! Now let's look at the cumulative totals:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(24,6))
for cls, yrs in years_per_class.items():
    yr_counts = Counter(yrs)
    ordered_yrs = sorted(yr_counts.keys())
    # Running total of the per-year counts (not of the years themselves), in thousands
    cumulative_years = {y: sum(yr_counts[x] for x in ordered_yrs[:i + 1])/1000 for i, y in enumerate(ordered_yrs)}
    ax.scatter(cumulative_years.keys(), cumulative_years.values(), color=colors[cls], alpha=0.5, label=cls)
plt.legend()
plt.title('Cumulative publications over time for search term "desiccation tolerance"')
plt.ylabel('Total publications (thousands)')
plt.xlabel('Year')

I wouldn't have expected plant and animal to have such similar numbers! Again, `NOCLASS` and `missing_in_new` account for such a large portion of the papers that we have to take these results with a grain of salt, but I still think this is promising.