Import the required packages and libraries.

In [1]:
import networkx as nx
import json
from tqdm import tqdm

Read the metadata JSON file to build a dictionary and assign each article a unique identifier (separate from the DOI, to make the network easier to manage).

- metadata_dict -> contains all the articles and their data
- nodes -> dictionary mapping each DOI to a tuple (node_id, journal_title)
- reverse_nodes -> dictionary mapping each node_id to a tuple (DOI, journal_title, journal_id)
- journals_dict -> dictionary mapping each journal title to the unique identifier of the journal

In [2]:
# Read metadata JSON file in order to build a dictionary
# The "with" block closes the file once the JSON has been loaded
with open("../Data/metadata.json") as metadata:
    metadata_dict = json.load(metadata)

# Create a dict of pairs "doi: (node_id, journal_title)"
nodes = dict()
reverse_nodes = dict()

# Create a dict of pairs "Journal: unique_identifier"
journals_dict = {}

# Assign a unique numeric identifier to each paper and to each journal.
# reverse_nodes maps "node_id: (doi, journal_title, journal_id)"
for i, paper in enumerate(metadata_dict):
    paper["node_id"] = i
    journal_title = paper['source_title']
    nodes[paper['id']] = (i, journal_title)
    # A journal seen for the first time gets the next free identifier
    if journal_title not in journals_dict:
        journals_dict[journal_title] = len(journals_dict)
    reverse_nodes[i] = (paper['id'], journal_title, journals_dict[journal_title])
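
To make the mappings concrete, here is a minimal sketch on three made-up records. The DOIs, journal titles and `toy_*` names are hypothetical and not taken from the dataset; the sketch mirrors the numbering scheme of the cell above rather than replacing it.

```python
# Hypothetical records, shaped like the entries of metadata.json
toy_metadata = [
    {"id": "10.1000/a", "source_title": "Journal X"},
    {"id": "10.1000/b", "source_title": "Journal Y"},
    {"id": "10.1000/c", "source_title": "Journal X"},
]

toy_nodes = {}          # doi -> (node_id, journal_title)
toy_reverse_nodes = {}  # node_id -> (doi, journal_title, journal_id)
toy_journals = {}       # journal_title -> journal_id

for node_id, paper in enumerate(toy_metadata):
    title = paper["source_title"]
    # setdefault stores the next free id only the first time a journal is seen
    journal_id = toy_journals.setdefault(title, len(toy_journals))
    toy_nodes[paper["id"]] = (node_id, title)
    toy_reverse_nodes[node_id] = (paper["id"], title, journal_id)

print(toy_nodes["10.1000/c"])   # (2, 'Journal X')
print(toy_reverse_nodes[1])     # ('10.1000/b', 'Journal Y', 1)
```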

Build the first network, whose nodes are the articles.<br>
Also build its undirected version to analyze the structure.

In [3]:
undirected_papers_network = nx.Graph()
papers_network = nx.DiGraph()

Read citations JSON file in order to build a dictionary.

In [4]:
# The "with" block closes the file once the JSON has been loaded
with open('../Data/citations.json') as citations:
    citations_dict = json.load(citations)

Iterate over citations_dict to build the papers' citation network.

In [5]:
for citation_obj in tqdm(citations_dict):
    source = citation_obj['source']
    target = citation_obj['target']
    if source in nodes:
        if target in nodes:
            source_article_id = nodes[source][0]
            target_article_id = nodes[target][0]
            undirected_papers_network.add_edge(source_article_id, target_article_id)
            papers_network.add_edge(source_article_id, target_article_id)

100%|██████████| 189697/189697 [00:00<00:00, 219057.99it/s]
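
As a small illustration of why two graphs are kept, the sketch below uses plain Python sets as stand-ins for `nx.DiGraph` and `nx.Graph` (an assumption made only to keep the example dependency-free). The DOIs are hypothetical. A reciprocal citation produces two directed edges but a single undirected one, and citations pointing outside the metadata are skipped.

```python
# Hypothetical DOI -> node_id mapping and citation pairs
toy_ids = {"doi:a": 0, "doi:b": 1, "doi:c": 2}
toy_citations = [
    {"source": "doi:a", "target": "doi:b"},
    {"source": "doi:b", "target": "doi:a"},   # reciprocal citation
    {"source": "doi:c", "target": "doi:x"},   # target missing from the metadata
]

directed_edges = set()
undirected_edges = set()
for c in toy_citations:
    # Keep only citations whose endpoints are both known articles
    if c["source"] in toy_ids and c["target"] in toy_ids:
        s, t = toy_ids[c["source"]], toy_ids[c["target"]]
        directed_edges.add((s, t))
        undirected_edges.add(frozenset((s, t)))

print(len(directed_edges), len(undirected_edges))  # 2 1
```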


Save the undirected papers' network for the structural analysis.

In [6]:
nx.write_gml(undirected_papers_network, "../gml format networks/undirected_papers_network.gml")

Compute the <i>PageRank</i> value of the nodes of our network.

In [7]:
pr = nx.pagerank(papers_network, alpha=0.85)
pr_list = sorted(pr.items(), key=lambda item: item[1], reverse=True)
pr_list

[(39264, 0.003958045367325669),
 (17440, 0.0034456482660907505),
 (21306, 0.0031866534313306017),
 (25837, 0.002695082402101789),
 (12204, 0.002360879446452903),
 (26129, 0.002305909452775167),
 (12343, 0.002106388438053257),
 (7812, 0.0020691617187695984),
 (40899, 0.002012513218906165),
 (8523, 0.001937907687581662),
 (7573, 0.0018446429959122319),
 (13497, 0.0017858247145916646),
 (26347, 0.0013872477547294958),
 (4212, 0.0013400330698184852),
 (14321, 0.0013368983145352557),
 (40607, 0.0012087719493754636),
 (2883, 0.0011787700327177831),
 (1309, 0.0011633113415966461),
 (35740, 0.0010887784139091451),
 (8690, 0.001038583108748586),
 (25492, 0.001033796150810318),
 (19253, 0.000978599763147506),
 (2750, 0.0009722408998811178),
 (37156, 0.0009716697702568396),
 (2608, 0.000909994171845777),
 (7631, 0.0009043413795882051),
 (24293, 0.0008885521021966464),
 (46699, 0.0008813576389040784),
 (24651, 0.0008655335134827691),
 (20861, 0.0008453928658172722),
 (18807, 0.0008194679992225698)
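
For intuition about what the `alpha=0.85` damping factor does, here is a minimal power-iteration sketch of PageRank in plain Python. It is not NetworkX's implementation: the function name, the fixed iteration count and the uniform handling of nodes without outgoing links are assumptions of this sketch, and the toy edges are hypothetical.

```python
def pagerank_sketch(edges, alpha=0.85, iterations=100):
    """Power iteration on a list of distinct (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for s, t in edges:
        out_links[s].append(t)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - alpha + alpha * dangling) / len(nodes) for n in nodes}
        for s in nodes:
            for t in out_links[s]:
                # Each node shares its damped rank equally among the papers it cites
                new_rank[t] += alpha * rank[s] / len(out_links[s])
        rank = new_rank
    return rank

# Hypothetical citations: papers 0 and 1 both cite paper 2
toy_rank = pagerank_sketch([(0, 2), (1, 2)])
print(max(toy_rank, key=toy_rank.get))  # 2
```

The scores always sum to 1, which is why the individual values in a network with tens of thousands of nodes are so small.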

Check which articles are the most important at this point according to the <i>Eigenvector Centrality</i> measure.

In [9]:
pr = nx.eigenvector_centrality(papers_network, max_iter=1000)
pr_list = sorted(pr.items(), key=lambda item: item[1], reverse=True)
for el in pr_list[:3]:
    print(el)
    print(reverse_nodes[el[0]])

(34948, 0.22629370522315476)
('10.1093/infdis/120.5.576', 'Journal Of Infectious Diseases', 414)
(26230, 0.19000624260331436)
('10.1016/s0140-6736(75)93176-1', 'The Lancet', 82)
(3907, 0.18100263575061926)
('10.1177/030098587301000105', 'Veterinary Pathology', 493)


-----------

In order:
- Read the JSON file containing the citation pairs;
- Create a dictionary called "journal_citations" to store the citations from journal to journal. Its structure is "citing_journal_id: list_of_cited_journal_ids" (a cited journal is repeated in the list once for every citation that reaches it from the citing journal);
- Populate the network as described above. This is done with a temporary "memo" dict that counts the citations to each target journal and is re-initialized every time the source journal changes;
- Populate the "weights" dictionary, which holds the weight of each journal-to-journal edge and is used to assign edge attributes to the network;
- article_citations contains pairs of "source article: [list of cited articles]".

<span style="color:red">To retrieve the importance of the edges between journals:
- $\tau_j$ = eigenvector value of article $j$
- $\Gamma_J$ = importance of journal $J$
- $j$ = article
- $n_J$ = # of articles in journal $J$
- $n_{c_{AB}}$ = # of citations from journal $A$ to journal $B$
$$\Gamma_J = \dfrac{\sum_{j \in J} \tau_j}{n_J}$$
<br>

$$\omega_{AB} = \dfrac{1}{\Gamma_A \cdot n_{c_{AB}}}$$

</span>
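
The two formulas above can be sketched as plain Python functions. The function names and the numbers are hypothetical and chosen only to make the arithmetic easy to follow; the importance used in the edge weight is that of the citing journal.

```python
def journal_importance(centralities):
    """Gamma_J: mean eigenvector value of the articles in journal J."""
    return sum(centralities) / len(centralities)

def edge_weight(gamma_source, n_citations):
    """omega_AB: reciprocal of (source journal importance * citation count)."""
    return 1 / (gamma_source * n_citations)

# Hypothetical journal A with three articles, citing journal B four times
gamma_a = journal_importance([0.2, 0.1, 0.3])   # about 0.2
omega_ab = edge_weight(gamma_a, 4)              # about 1 / (0.2 * 4) = 1.25
print(gamma_a, omega_ab)
```

A more important citing journal or a larger number of citations both make the weight smaller.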

Compute the importance of each journal.

In [10]:
journal_weights = dict()

for paper in pr_list:
    publication_id = paper[0]
    node_centrality = paper[1]
    if publication_id in reverse_nodes:
        if reverse_nodes[publication_id][2] not in journal_weights:
            journal_weights[reverse_nodes[publication_id][2]] = [0,0]
        journal_weights[reverse_nodes[publication_id][2]][0] += node_centrality
        journal_weights[reverse_nodes[publication_id][2]][1] += 1

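Store the importance value of each journal in <i>journal_weights</i>. Note that the division by the number of articles is commented out in the cell below, so the stored value is the sum of the centralities of the journal's articles rather than their average.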

In [11]:
for journal in journal_weights:
    journal_weights[journal] = journal_weights[journal][0] #/ journal_weights[journal][1]
journal_weights

{414: 0.8652759846732455,
 82: 0.9120825066474676,
 493: 0.5352185837609509,
 2002: 0.17324819920751405,
 849: 0.46269695348856504,
 613: 0.6078417028038179,
 52: 2.025035652234723,
 612: 0.41302308557036477,
 590: 0.8326874052483758,
 346: 0.23780360301577966,
 53: 0.5296245544996079,
 66: 1.8719098500951836,
 1330: 0.22148842935735397,
 196: 0.17642604157000508,
 20: 0.25338275736768606,
 275: 0.0915833693925136,
 95: 1.3155371638500886,
 789: 0.15084994139168928,
 35: 0.6313981999355879,
 135: 0.2926266832105985,
 1221: 0.12980276369292587,
 112: 0.28635700347101345,
 769: 0.1031380693347162,
 44: 0.34709208662517554,
 129: 0.3060896633664373,
 2382: 0.11142555701154713,
 481: 0.05652341596041446,
 369: 0.15315073576534158,
 24: 1.4935516292141486,
 132: 0.25861433793882016,
 59: 0.050398214892811224,
 2539: 0.04720528228249901,
 124: 0.14124961745268041,
 1487: 0.09294818877958284,
 4918: 0.0537526299310552,
 43: 0.252396978492849,
 323: 0.042282193441967576,
 2420: 0.0570406259720

Retrieve citations between journals.

In [12]:
journal_citations = dict()
article_citations = dict()

# Iterate over citations_dict to build a journals citations' network
for citation_obj in tqdm(citations_dict):
    source = citation_obj['source']
    target = citation_obj['target']
    if source in nodes:
        if target in nodes:
            source_article = nodes[source][0]
            target_article = nodes[target][0]
            if source_article != target_article:
                if source_article not in article_citations:
                    article_citations[source_article] = list()
                article_citations[source_article].append(target_article)
                source_journal = nodes[source][1]
                target_journal = nodes[target][1]
                if source_journal in journals_dict:
                    if target_journal in journals_dict:
                        journal_source_id = journals_dict[source_journal]
                        journal_target_id = journals_dict[target_journal]
                        if journal_source_id not in journal_citations:
                            journal_citations[journal_source_id] = list()
                        journal_citations[journal_source_id].append(journal_target_id)

100%|██████████| 189697/189697 [00:00<00:00, 496834.95it/s]


Build the second network:
- journals_network -> the nodes of this network are the journals; each edge from journal A to journal B is weighted with the reciprocal of the product between the importance of journal A and the number of citations that go from articles of A to articles of B. Target journals that receive no citations from a given source are not connected to it at all, which avoids having to define a normalization constant (which would otherwise have been needed to avoid 0-weights in paths).


<br>
Also in this case, we build the undirected version of the network, which is then used to analyze its structure.

In [13]:
# Build the citations graph
undirected_journals_network = nx.Graph()
journals_network = nx.DiGraph()

Populate the networks by adding nodes and edges.

In [14]:
weights = dict()

for source_id in journal_citations:
    memo = dict()
    for target_id in journal_citations[source_id]:
        if target_id not in memo:
            memo[target_id] = 0
        memo[target_id] += 1
    for cited_journal in memo:
        weights[(source_id, cited_journal)] = 1/(journal_weights[source_id]*memo[cited_journal])
        undirected_journals_network.add_edge(source_id, cited_journal)
        journals_network.add_edge(source_id, cited_journal)
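
As an alternative way to read the cell above, `collections.Counter` from the standard library can play the role of the temporary "memo" dictionary. The sketch below uses hypothetical toy data and `toy_*` names, and only computes the weights; it does not touch the networks.

```python
from collections import Counter

# Hypothetical data: journal 0 cites journal 1 three times and journal 2 once
toy_journal_citations = {0: [1, 1, 2, 1]}
toy_journal_weights = {0: 0.5}   # importance of the citing journal

toy_weights = {}
for source_id, targets in toy_journal_citations.items():
    # Counter tallies how many times each target journal is cited
    for cited_journal, n_citations in Counter(targets).items():
        toy_weights[(source_id, cited_journal)] = 1 / (
            toy_journal_weights[source_id] * n_citations
        )

print(toy_weights)
```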

Save the undirected version of journals' network.

In [15]:
nx.write_gml(undirected_journals_network, "../gml format networks/undirected_journals_network.gml")

Assign edge_attributes to the network, according to the previously computed weights.

In [16]:
nx.set_edge_attributes(journals_network, weights, "relative_weights")

Compute the <i>Betweenness Centrality</i> measure to retrieve the most important journals. The "weight" parameter names the edge attribute assigned to the network in the previous snippet.<br>
The "normalized=True" argument rescales the values; for a directed network such as this one, NetworkX divides them by $(n-1)(n-2)$, where $n$ is the number of nodes.

In [17]:
journals_weighted_betweennes = nx.betweenness_centrality(journals_network, k=None, normalized=True, weight='relative_weights', endpoints=False, seed=None)
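
Weighted betweenness treats the edge weight as a distance when it searches for shortest paths, which is why the weights are reciprocals: heavily cited links become short. The sketch below shows this with a small Dijkstra search in plain Python. The `shortest_path` helper and the three journals are hypothetical, and the journal importance factor is left out to keep the numbers simple.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra on a dict {node: {neighbor: weight}}; returns (cost, path)."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical journals: A cites B once, while A->C and C->B carry 10 citations each
toy_graph = {
    "A": {"B": 1 / 1, "C": 1 / 10},
    "C": {"B": 1 / 10},
}
cost, path = shortest_path(toy_graph, "A", "B")
print(path)  # ['A', 'C', 'B']
```

The shortest route from A to B passes through C, so C is the journal that would collect betweenness for this pair.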

Print the 100 most influential journals.

In [18]:
journals_influence = sorted(journals_weighted_betweennes.items(), key=lambda item: item[1], reverse=True)
journals_influence[:100]

[(693, 1.72548669183168e+134),
 (44, 6.577037645768983e+130),
 (24, 1.1386961127535488e+130),
 (465, 4.6571635476505805e+129),
 (112, 1.6709107790970638e+129),
 (171, 4.77888761397174e+128),
 (30, 2.233057630586854e+128),
 (74, 1.7364692343688723e+128),
 (452, 1.6005067694990448e+128),
 (379, 7.441433957476939e+127),
 (1545, 5.392019842314429e+127),
 (53, 4.727808818376524e+127),
 (109, 3.073954563010692e+127),
 (179, 2.988485049928943e+127),
 (457, 2.9354773680173918e+127),
 (970, 2.3293881442243825e+127),
 (9, 2.0082178293013168e+127),
 (590, 1.7801948617850336e+127),
 (129, 1.5022061176882433e+127),
 (477, 7.120124906021158e+126),
 (113, 1.646887352619353e+126),
 (137, 1.6204511988927166e+126),
 (138, 1.5677302139928626e+126),
 (110, 1.3705441870737992e+126),
 (265, 9.655866037624276e+125),
 (2468, 2.3430943082217023e+125),
 (1199, 1.8645488213606365e+125),
 (95, 1.6821411668900837e+125),
 (464, 1.1544787666138944e+125),
 (131, 9.682942745956128e+124),
 (684, 6.683980536929425e+124)

Extract the title of the most influential journal.

In [19]:
for journal_title in journals_dict:
    if journals_dict[journal_title] == journals_influence[0][0]:
        most_influential_journal = journal_title
        break
most_influential_journal

'Philosophical Transactions Of The Royal Society Of London. Series B: Biological Sciences'
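
When more than one title has to be looked up, inverting the mapping once avoids scanning `journals_dict` for every identifier. A minimal sketch with hypothetical journal titles:

```python
# Hypothetical excerpt of journals_dict ("journal_title: journal_id")
toy_journals_dict = {"Journal X": 0, "Journal Y": 1, "Journal Z": 2}

# Invert the mapping once; this works because the journal ids are unique
toy_titles_by_id = {
    journal_id: title for title, journal_id in toy_journals_dict.items()
}

print(toy_titles_by_id[1])  # Journal Y
```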

Count the number of outgoing edges from each article in the dataset.

In [20]:
# Raw count of how many articles each specific article cites
article_citations_tot = dict()

for citation in citations_dict:
    if citation['source'] in nodes:
        source_article_id = nodes[citation['source']][0]
        if citation['target'] in nodes:
            target_article_id = nodes[citation['target']][0]
            if source_article_id != target_article_id:
                if source_article_id not in article_citations_tot:
                    article_citations_tot[source_article_id] = 0
                article_citations_tot[source_article_id] += 1

Build a "journal_influences" dictionary, containing pairs "journal_id: journal_influence", taken from the betweenness centrality dictionary computed above.

In [21]:
journal_influences = journals_weighted_betweennes

<span style="color:red">The following snippet assigns a weight to the citations between articles.<br>
The weight is computed as follows:
- $n$ is the raw count of outgoing citations from a given article;
- $\alpha$ is the influence of the journal containing the citing article (computed with the betweenness centrality measure);
- $\lambda$ is a constant ($\lambda = 0.1$) that keeps the weight from being $0$ when $\alpha = 0$;
<br>
Following a flow of information that goes from the source article to the cited one, the relative weight ($\Phi_{AB}$) of the connection between "article $A$" and "article $B$" is computed as follows:<br>

$$\Phi_{AB} = \dfrac{\alpha + \lambda}{n}$$
<br>

The idea behind this computation is to distribute the importance of an article equally among all the articles it cites: the more articles it cites, the smaller the importance passed to each of them. Note that the code cell that builds the weights further below computes $\dfrac{\alpha}{n}$ multiplied by the eigenvector centrality of the citing article, without adding $\lambda$.</span>
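
As a worked example of the formula $\Phi_{AB} = \dfrac{\alpha + \lambda}{n}$, here is a minimal sketch. The function name `citation_weight` and the input values are hypothetical.

```python
def citation_weight(journal_influence, n_outgoing, lam=0.1):
    """Phi_AB = (alpha + lambda) / n for one citing article."""
    if n_outgoing <= 0:
        raise ValueError("the citing article must have at least one citation")
    return (journal_influence + lam) / n_outgoing

# Hypothetical values: alpha = 0.4 and an article citing 5 papers
print(citation_weight(0.4, 5))   # about 0.1
# With alpha = 0 the constant lambda keeps the weight positive
print(citation_weight(0.0, 5))   # about 0.02
```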


In [22]:
paper_weights = dict()

for paper in pr_list:
    paper_id = paper[0]
    paper_weight = paper[1]
    paper_weights[paper_id] = paper_weight
paper_weights

{34948: 0.22629370522315476,
 26230: 0.19000624260331436,
 3907: 0.18100263575061926,
 11156: 0.17289612871374369,
 23638: 0.16595863496464636,
 17015: 0.1644623028002411,
 4553: 0.1492088280751363,
 38019: 0.14674225337814517,
 22743: 0.1426157058405683,
 37554: 0.1410142599703751,
 26716: 0.13412167351633084,
 18056: 0.13400094688527261,
 3483: 0.12666547802518993,
 9898: 0.12651026908365995,
 7872: 0.12372028319689193,
 34995: 0.11748282849496841,
 39719: 0.11249320631270593,
 3120: 0.11238973578866455,
 24734: 0.11145682986366869,
 41900: 0.11028414931483103,
 28213: 0.1080005306252672,
 27931: 0.10721778139734106,
 39177: 0.10049927047979786,
 48638: 0.09991394853111782,
 14763: 0.09940694356144113,
 8026: 0.09572354309683594,
 15640: 0.09116799257089264,
 20037: 0.09101056087693334,
 24451: 0.08869251846391477,
 38825: 0.08672817171584098,
 6113: 0.08490467124649469,
 1990: 0.08455611322108271,
 5966: 0.08364714293847343,
 20996: 0.08084970244769114,
 7977: 0.07770010771666361,
 

Build a new network, that is the citation network of publications contained within the most influential journal.

In [23]:
publications_network = nx.DiGraph()

Add edges to the network and save the weights of these connections.

In [24]:
# articles_weights contains pairs of "(tuple source-target): weight of the connection"
articles_weights = dict()

for citation in tqdm(citations_dict):
    found_all = False
    if citation['source'] in nodes:
        source_article_id = nodes[citation['source']][0]
        source_journal = nodes[citation['source']][1]
        if source_journal in journals_dict:
            source_journal_id = journals_dict[source_journal]
            if source_journal_id in journal_influences:
                if source_article_id in article_citations_tot:
                    article_distributed_weight = ((journal_influences[source_journal_id]/article_citations_tot[source_article_id])*paper_weights[source_article_id])
                    found_all = True
    if found_all:
        if source_article_id in article_citations:
            for cited_article_id in article_citations[source_article_id]:
                if source_article_id != cited_article_id:
                    publications_network.add_edge(source_article_id, cited_article_id)
                    articles_weights[(source_article_id, cited_article_id)] = article_distributed_weight

100%|██████████| 189697/189697 [00:08<00:00, 22888.95it/s]


Print the total number of citations made by each article.

In [25]:
article_citations_tot

{17591: 91,
 10952: 11,
 7137: 17,
 41280: 308,
 7382: 2,
 35349: 41,
 1376: 31,
 4039: 11,
 31271: 1,
 8434: 65,
 25637: 1,
 1509: 55,
 25106: 170,
 35818: 118,
 39126: 60,
 27986: 99,
 25571: 5,
 26419: 62,
 11160: 1,
 34012: 93,
 19082: 60,
 27898: 36,
 8000: 26,
 32617: 181,
 32429: 136,
 18951: 1,
 24617: 133,
 28960: 52,
 685: 99,
 27936: 39,
 28799: 47,
 37916: 66,
 30619: 288,
 36760: 108,
 27920: 53,
 16171: 42,
 5043: 174,
 11619: 1,
 3121: 165,
 25633: 139,
 22460: 68,
 38818: 64,
 36248: 104,
 12395: 19,
 36293: 3,
 20607: 76,
 40882: 1,
 16614: 75,
 20830: 4,
 6079: 47,
 10032: 1,
 22288: 7,
 27803: 85,
 28217: 146,
 3305: 126,
 27385: 79,
 5933: 31,
 12811: 46,
 11675: 18,
 608: 1,
 31595: 67,
 20584: 84,
 2738: 177,
 23959: 210,
 27386: 62,
 18856: 56,
 44501: 7,
 44535: 66,
 47901: 15,
 48352: 9,
 42874: 21,
 1645: 24,
 34579: 203,
 7074: 353,
 29093: 44,
 38805: 19,
 25880: 77,
 34484: 2,
 10168: 107,
 8665: 8,
 30266: 12,
 37473: 236,
 16107: 11,
 13960: 108,
 10986: 

Set the weights of edges within the most influential journal citations network.

In [26]:
nx.set_edge_attributes(publications_network, articles_weights, "relative_new_nodes_weights")

Finally, compute the <i>Eigenvector Centrality</i> measure in order to find which publications can be identified as key publications within the reference context.

In [27]:
key_papers = nx.eigenvector_centrality(publications_network, max_iter=1000, weight='relative_new_nodes_weights')

In [28]:
three_key_papers = sorted(key_papers.items(), key=lambda item: item[1], reverse=True)[:3]
three_key_papers

[(25492, 0.9999999999983394),
 (29578, 6.706955640859817e-07),
 (37156, 6.706890639619544e-07)]

Compare these 3 articles with the 3 articles found at the beginning (that is, before assigning weights on the basis of the journals of provenance), to see whether our process led to different results.

In [34]:
# Old key papers printed next to the new ones
print("    Old key papers", "      ---------     " "New key papers")
for old_paper, new_paper in zip(pr_list[:3], three_key_papers):
    print(old_paper, "--", new_paper)

    Old key papers       ---------     New key papers
(34948, 0.22629370522315476) -- (25492, 0.9999999999983394)
(26230, 0.19000624260331436) -- (29578, 6.706955640859817e-07)
(3907, 0.18100263575061926) -- (37156, 6.706890639619544e-07)


So the two rankings differ: none of the three original key papers appears among the new ones.

------

Finally retrieve metadata about these new key papers.

In [35]:
for paper in metadata_dict:
    if paper['node_id'] == three_key_papers[0][0]:
        key_paper_1 = paper
    if paper['node_id'] == three_key_papers[1][0]:
        key_paper_2 = paper
    if paper['node_id'] == three_key_papers[2][0]:
        key_paper_3 = paper
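
The same lookup can be done without scanning the whole metadata list for each paper, by indexing the records by `node_id` once. A minimal sketch on hypothetical records (the DOIs, scores and `toy_*` names are made up):

```python
# Hypothetical records that already carry a "node_id" field
toy_metadata = [
    {"id": "10.1000/a", "node_id": 0},
    {"id": "10.1000/b", "node_id": 1},
    {"id": "10.1000/c", "node_id": 2},
]
toy_key_papers = [(2, 0.9), (0, 0.05)]   # (node_id, centrality) pairs

# Index the records once, then look each key paper up directly
by_node_id = {paper["node_id"]: paper for paper in toy_metadata}
key_records = [by_node_id[node_id] for node_id, _ in toy_key_papers]

print([record["id"] for record in key_records])  # ['10.1000/c', '10.1000/a']
```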

In [36]:
key_paper_1

{'id': '10.1111/j.1532-5415.1997.tb01474.x',
 'author': 'Falsey, Mccann, Hall, Criddle, Formica, Wycoff, Kolassa',
 'year': '1997',
 'title': 'The “Common Cold” In Frail Older Persons: Impact Of Rhinovirus And Coronavirus In A Senior Daycare Center',
 'source_title': 'Journal Of The American Geriatrics Society',
 'node_id': 25492}

In [37]:
key_paper_2

{'id': '10.1056/nejmoa030666',
 'author': 'Tsang, Ho, Ooi, Yee, Wang, Chan-Yeung, Lam, Seto, Yam, Cheung, Wong, Lam, Ip, Chan, Yuen, Lai',
 'year': '2003',
 'title': 'A Cluster Of Cases Of Severe Acute Respiratory Syndrome In Hong Kong',
 'source_title': 'New England Journal Of Medicine',
 'node_id': 29578}

In [38]:
key_paper_3

{'id': '10.1056/nejmoa030634',
 'author': 'Poutanen, Low, Henry, Finkelstein, Rose, Green, Tellier, Draker, Adachi, Ayers, Chan, Skowronski, Salit, Simor, Slutsky, Doyle, Krajden, Petric, Brunham, Mcgeer',
 'year': '2003',
 'title': 'Identification Of Severe Acute Respiratory Syndrome In Canada',
 'source_title': 'New England Journal Of Medicine',
 'node_id': 37156}

Save the networks built during the entire process.

In [40]:
nx.write_gml(papers_network, "../gml format networks/directed_first_papers_network.gml")
nx.write_gml(journals_network, "../gml format networks/directed_journals_network.gml")
nx.write_gml(publications_network, "../gml format networks/directed_final_papers_network.gml")