# Semantic Scholar Network Generation
In this notebook, we'll explore building a citation network from the results of a specific search using the Semantic Scholar API.

In [58]:
import requests
from tqdm import tqdm
import jsonlines
import networkx as nx
from statistics import mean, median

## Requesting Data

In [1]:
# Import API key. This must be requested from https://www.semanticscholar.org/product/api#api-key; we save ours in an untracked file in data and import here
import sys
sys.path.append('../data/')
from semantic_scholar_API_key import API_KEY
header = {'x-api-key': API_KEY}

We want to look at the citations of papers that are returned when we search "desiccation tolerance" and "anhydrobiosis". We then want to use our previously written TaxoNERD code to get the study organisms and their kingdoms from the titles and abstracts. To do this, we need to build a query that will return some number of papers with title, abstract, and citations. A full reference of the properties we can request is found [here](https://api.semanticscholar.org/api-docs/#tag/Paper-Data/operation/get_graph_get_paper_search).

Using the API, only 99 papers can be requested at a time (maximum value of the `limit` parameter). However, we want to get thousands of search results (the maximum depth we can request into the search results is 10,000). We can do this by combining the `limit` and `offset` parameters; the `offset` parameter tells the request how far down the search results to start retrieving our 99 items.

In [3]:
search_results = []
for offset in tqdm(range(0, 10000, 100)):
    query = f'http://api.semanticscholar.org/graph/v1/paper/search?query=desiccation+tolerance&offset={offset}&limit=99&fields=title,abstract,references'
    search = requests.get(query, headers=header).json()
    search_results.append(search)

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [09:00<00:00,  5.41s/it]
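
The paging arithmetic can be checked offline before spending several minutes on requests. The sketch below builds the request URLs with the standard library only, stepping the offset by the page size so that consecutive pages are contiguous. `build_search_urls` is a hypothetical helper, not part of the notebook or of any API client; the `https` scheme and the clamping of the final page are assumptions, while the endpoint path and parameter names are taken from the cell above.

```python
from urllib.parse import urlencode

# Endpoint path as used in the cell above; the https scheme is an assumption
BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_urls(query, fields, page_size=99, max_depth=10000):
    """Return one URL per page, with contiguous, non-overlapping offsets."""
    urls = []
    for offset in range(0, max_depth, page_size):
        # Shrink the final page so offset + limit never exceeds max_depth
        limit = min(page_size, max_depth - offset)
        params = {"query": query, "offset": offset, "limit": limit, "fields": fields}
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls

# Small max_depth so the printed list stays short
urls = build_search_urls("desiccation tolerance", "title,abstract,references",
                         page_size=99, max_depth=300)
for url in urls:
    print(url)
```

Each URL could then be passed to `requests.get(url, headers=header)` exactly as in the loop above.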


Let's save our results as a JSON Lines file so we can come back to them without having to repeat the requests:

In [6]:
with jsonlines.open('../data/semantic_scholar/desiccation-tolerance_10000_24Aug2023.jsonl', 'w') as writer:
    writer.write_all(search_results)
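
To come back to the saved file later, the pages can be read back one per line. This is a minimal sketch using only the standard library: JSON Lines stores one JSON document per line, so `json.loads` on each line is enough (the `jsonlines` reader would work equally well). `read_jsonl` is a hypothetical helper, and the file written here is a throwaway temporary file with made-up content, not the notebook's data.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Read a JSON Lines file into a list, one parsed object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Toy stand-in for two pages of search results (made-up IDs)
pages = [{"offset": 0, "data": [{"paperId": "a"}]},
         {"offset": 99, "data": [{"paperId": "b"}]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_results.jsonl")
    # Write one JSON document per line
    with open(path, "w", encoding="utf-8") as f:
        for page in pages:
            f.write(json.dumps(page) + "\n")
    reloaded = read_jsonl(path)

print(reloaded == pages)
```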

Now we can experiment around and see if it's possible to build a citation network!

## Building citation network
We need to generate pairs of documents connected by citations. Conveniently, each paper has a unique ID generated by Semantic Scholar, which makes our lives somewhat easier.

To get the paper ID for a single search result:

In [17]:
search_results[0]['data'][0]['paperId']

'393cc126bd647a8435072e788a2a033561c6fa97'

To get the paper IDs of its references:

In [18]:
[p['paperId'] for p in search_results[0]['data'][0]['references']]

['18d8e54e9f384361d3f6af634642e82d2479bef2',
 '2ee8adf45800831679c9ee226cf108b34197108b',
 'af0eb5272536c8b1db53e671b77d9ea61e206ff5',
 '2b32eec751dbe123a7532cfac4fb3c8ce5295a37',
 '5f2084b6eb9286ee6930e2cc0d9e3c4f581def29',
 'efcbec4baa59b1cbe9acada9bbc266b020b1560f',
 '8f8622842a9d31cb7fff1ff4433c28bfd10133df',
 '9405079823b201c061eec9b85f184702289f0472',
 'c7937572a01f822d61d7f1b3eda5dfca10b06915',
 '12b1a3e2df680cd5a94ff89a356ef513f817cba9',
 '282bd0c59d247c35a1a0cbfd3708cd27bfcb9cbe',
 'b812c9a1370d125b91af265dbd2f417b4329a9df',
 'd79eedab7bae594f34c4d355510e57e2f21aecbe',
 'a52b0768702008d78d474fc431b95b2a0288fc65']

Characterizing the number of references (outgoing citations) per paper:

In [66]:
def characterize_citations(search_results):
    """
    Get statistics about the number of citations per paper.
    
    parameters:
        search_results, list of dict: query results
    """
    num_cites = []
    for result in search_results:
        for paper in result['data']:
            num_cites.append(len(paper['references']))
    print(f'Average number of citations per paper: {mean(num_cites): .2f}')
    print(f'Median number of citations per paper: {median(num_cites)}')
    print(f'Maximum number of citations per paper: {max(num_cites)}')
    print(f'Minimum number of citations per paper: {min(num_cites)}')

In [67]:
characterize_citations(search_results)

Average number of citations per paper:  45.32
Median number of citations per paper: 36.0
Maximum number of citations per paper: 1000
Minimum number of citations per paper: 0
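
The function above assumes that every page has a `data` key and that every paper has a list under `references`. A page lacking `data`, as an error response might, would raise a `KeyError`. Below is a hedged variant that skips such entries and returns the counts instead of printing them. `reference_counts` is a hypothetical helper, and whether the real responses contain such entries is an assumption that is not verified here.

```python
from statistics import mean, median

def reference_counts(search_results):
    """Collect the number of references per paper, skipping malformed entries."""
    counts = []
    for result in search_results:
        # .get() tolerates pages with no 'data' key
        for paper in result.get("data", []):
            refs = paper.get("references")
            if refs is not None:
                counts.append(len(refs))
    return counts

# Invented input: two valid papers, one error-like page, one paper without a list
toy_results = [
    {"data": [{"paperId": "a", "references": [{"paperId": "b"}, {"paperId": "c"}]},
              {"paperId": "d", "references": []}]},
    {"message": "error page with no data"},
    {"data": [{"paperId": "e", "references": None}]},
]

counts = reference_counts(toy_results)
summary = {"mean": mean(counts), "median": median(counts),
           "max": max(counts), "min": min(counts)}
print(summary)
```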


Define a function to generate links:

In [48]:
def generate_links(search_results):
    """
    Generate a list of edges by paper ID from the results of a Semantic Scholar query.
    
    parameters:
        search_results, list of dict: query results
        
    returns:
        nodes, list of two-tuple: the paper ID and an attribute dictionary containing the paper's title
        edges, list of two-tuple: the paper IDs of the citing and cited paper
    """
    nodes, edges = [], []
    for result in search_results:
        for paper in result['data']:
            citing = (paper['paperId'], {'title': paper['title']})
            cited = [(p['paperId'], {'title': p['title']}) for p in paper['references']]   
            nodes.append(citing)
            nodes.extend(cited)
            edges.extend([(citing[0], p[0]) for p in cited])
    return nodes, edges
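
To see the shape of what comes back, here is a condensed restatement of the same logic run on a hand-made result. All IDs and titles are invented, and `links_from_results` is a stand-in name used only for this sketch. Note that a paper cited by several others is appended to `nodes` once per citation, so the list contains duplicates, and that a reference without an ID produces a node and an edge keyed by `None`.

```python
def links_from_results(search_results):
    """Condensed restatement: (id, attrs) nodes and (citing, cited) edges."""
    nodes, edges = [], []
    for result in search_results:
        for paper in result["data"]:
            nodes.append((paper["paperId"], {"title": paper["title"]}))
            for ref in paper["references"]:
                nodes.append((ref["paperId"], {"title": ref["title"]}))
                edges.append((paper["paperId"], ref["paperId"]))
    return nodes, edges

# Invented data: p3 is cited twice, and one reference has no paper ID
toy_results = [{"data": [
    {"paperId": "p1", "title": "Paper one",
     "references": [{"paperId": "p3", "title": "Paper three"}]},
    {"paperId": "p2", "title": "Paper two",
     "references": [{"paperId": "p3", "title": "Paper three"},
                    {"paperId": None, "title": "fragment of a citation"}]},
]}]

toy_nodes, toy_edges = links_from_results(toy_results)
unique_ids = {node_id for node_id, _ in toy_nodes}
print(len(toy_nodes), len(unique_ids))
print(toy_edges)
```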

In [49]:
nodes, edges = generate_links(search_results)

Some citations appear to have been improperly formatted somewhere along the line, and result in having no paper ID, and a title that's just part of a full citation (not the actual title of the paper being cited). How many nodes of the network does this comprise?

In [57]:
print(f'{(sum([1 for n in nodes if n[0] is None])/len(nodes))*100: .2f}% of the network\'s nodes are malformed')

 8.96% of the network's nodes are malformed


This is a relatively small percentage, so let's drop them for now; we can come back and troubleshoot later. We also now want to add the taxonomic classification as attributes to nodes.

In [76]:
import taxoniq

def classify_orgs(ents, defs):
    """
    Get organism classifications from a list of NCBI Taxonomy IDs

    parameters:
        ents, list of int: NCBI Taxonomy IDs
        defs, dict: keys are lineage categories, values are the final
            kingdom classification for those categories

    returns:
        kings, list of str: unique kingdom classifications
    """
    kings = []
    for i in ents:
        try:
            t1 = taxoniq.Taxon(i)
            lineage = [t.scientific_name for t in t1.ranked_lineage]
            # The last lineage entry is the domain
            if lineage[-1] in ('Bacteria', 'Archaea'):
                kings.append(defs[lineage[-1]])
            elif lineage[-1] == 'Eukaryota':
                # For eukaryotes, the second-to-last entry is expected to be
                # the kingdom; a KeyError here means it is not in defs
                kings.append(defs[lineage[-2]])
        except KeyError:
            # Unknown taxon ID or unmapped category: skip this organism
            continue

    kings = list(set(kings))
    return kings
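
The branching in `classify_orgs` is easier to follow without the taxonomy lookup. The sketch below applies the same rule to hand-written lineages, ordered from the most specific rank to the domain, as the indexing above implies. The lineages are abbreviated and illustrative, the helper name `kingdom_from_lineage` is hypothetical, and the domain is spelled `Archaea` here.

```python
KINGDOMS = {
    "Metazoa": "Animal",
    "Viridiplantae": "Plant",
    "Bacteria": "Microbe",
    "Archaea": "Microbe",
}

def kingdom_from_lineage(lineage, defs=KINGDOMS):
    """Map one lineage (most specific rank first, domain last) to a label, or None."""
    if not lineage:
        return None
    domain = lineage[-1]
    if domain in ("Bacteria", "Archaea"):
        return defs.get(domain)
    if domain == "Eukaryota" and len(lineage) > 1:
        # For eukaryotes the rank just below the domain decides the label
        return defs.get(lineage[-2])
    return None

# Hand-written, abbreviated lineages for illustration only
examples = {
    "tardigrade": ["Ramazzottius varieornatus", "Tardigrada", "Metazoa", "Eukaryota"],
    "moss": ["Physcomitrium patens", "Streptophyta", "Viridiplantae", "Eukaryota"],
    "yeast": ["Saccharomyces cerevisiae", "Ascomycota", "Fungi", "Eukaryota"],
    "bacterium": ["Deinococcus radiodurans", "Deinococcota", "Bacteria"],
}
labels = {name: kingdom_from_lineage(lineage) for name, lineage in examples.items()}
print(labels)
```

Returning `None` for unmapped groups such as fungi makes the gap explicit, rather than silently dropping the organism.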

In [75]:
def classify_title(title):
    """
    Gets the Kingdom classification of a paper title.
    
    parameters:
        title, str: title of the paper
    
    returns:
        king, str or None: kingdom of the paper, or None if no organism
            in the title could be classified
    """
    # Do TaxoNERD classification; `taxonerd` is assumed to be an
    # already-initialized TaxoNERD instance defined elsewhere
    title_df = taxonerd.find_in_text(title)
    
    # Get the unique organisms
    title_ents = list(set([title_df['entity'][j][0][0].split("NCBI:")[1] for j in
            range(len(title_df))]))
    
    # Set up definitions for kingdom classification
    defs = {
            'Metazoa': 'Animal',
            'Viridiplantae': 'Plant', # Consider adding algae
            'Bacteria': 'Microbe',
            'Archaea': 'Microbe'
            }
    
    # Classify unique organisms
    title_classes = classify_orgs(title_ents, defs)
    
    # Get the kingdom; if several were found, the first in the list is used
    king = title_classes[0] if title_classes else None
    
    return king

In [68]:
def generate_links_with_classification(search_results):
    """
    Generate a list of edges by paper ID from the results of a Semantic Scholar query. Removes malformed
    citations with no paperID. Classification of nodes by the organisms in their titles is not yet applied here.
    
    parameters:
        search_results, list of dict: query results
        
    returns:
        nodes, list of two-tuple: the paper ID and an attribute dictionary containing the paper's title
        edges, list of two-tuple: the paper IDs of the citing and cited paper
    """
    nodes, edges = [], []
    for result in search_results:
        for paper in result['data']:
            citing = (paper['paperId'], {'title': paper['title']})
            cited = [(p['paperId'], {'title': p['title']}) for p in paper['references'] if p['paperId'] is not None]
            nodes.append(citing)
            nodes.extend(cited)
            edges.extend([(citing[0], p[0]) for p in cited])
    return nodes, edges

In [69]:
nodes, edges = generate_links_with_classification(search_results)

In [73]:
citenet = nx.MultiDiGraph()
_ = citenet.add_nodes_from(nodes)
_ = citenet.add_edges_from(edges)
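
Once the graph is built, a few standard NetworkX calls give a first look at it. The sketch below uses a tiny invented edge list rather than the real data. It shows that a `MultiDiGraph` keeps a repeated citing-cited pair as parallel edges (which can happen if the same paper appears on more than one results page), whereas a `DiGraph` collapses them, and it uses in-degree to count how often each paper is cited within the network.

```python
import networkx as nx

# Invented edges: (citing, cited); ('p1', 'p3') is deliberately repeated
toy_edges = [("p1", "p3"), ("p2", "p3"), ("p1", "p3"), ("p2", "p4")]

multi = nx.MultiDiGraph()
multi.add_edges_from(toy_edges)

simple = nx.DiGraph()
simple.add_edges_from(toy_edges)

# Parallel edges are kept in the multigraph and merged in the simple graph
print(multi.number_of_edges(), simple.number_of_edges())

# In-degree on the simple graph counts distinct citing papers
most_cited = sorted(simple.in_degree(), key=lambda pair: pair[1], reverse=True)
print(most_cited[0])
```

Whether parallel edges are wanted depends on the question being asked; for counting distinct citing papers, the collapsed `DiGraph` is the simpler choice.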