# CORD-19 Citation Network with Title Deduping

The CORD-19 citation network contains papers as nodes and paper citations as edges. We construct this using the bibliography section, which contains metadata related to cited papers. One challenge is the lack of a stable identifier for creating nodes in the network. This notebook outlines a method for constructing the citation network through approximate deduplication of titles.

## Available datasets

This notebook can be added as a data-source for immediate access to the citation network and its intermediate datasets.

dataset | output type | description 
-|-|-
`references.parquet` | parquet | the flattened citations from the bibliography
`titles.parquet` | parquet | list of all titles seen across CORD-19 JSON documents
`corrections/t{threshold}_citation_index.parquet` | parquet | the index for mapping `sha1` of titles to their `citation_id`
`corrections/t{threshold}_ranked_titles.csv` | csv | A list of titles, cluster size, and rank by cluster size.
`corrections/t{threshold}_approx_citation.csv` | csv | The top 50 near-duplicate titles by edit distance for the 20 largest clusters
`corrections/t{threshold}_edit_distance_stats.csv` | csv | Intracluster edit-distance statistics
`edit_distance_stats.csv` | csv | Aggregated intracluster edit-distance statistics from all thresholds
`citation_index.parquet` | parquet | The index of `title_sha` to `citation_id` used for network construction
`citations.parquet` | parquet | The CORD-19 citation network with deduplication, threshold $t=0.3$
`citation_pagerank.csv` | csv | A list of citation_ids to pagerank

Here are some datasets that have been obsoleted:

dataset | output type | description 
-|-|-
`hashed_titles/`  | partitioned parquet | titles that have been shingled and minhashed using `pyspark.ml.feature.MinHashLSH`
`ann_titles/` | partitioned parquet | a self approximate-similarity join of hashed titles with a Jaccard distance threshold of 0.8

### Changelog

#### 2020-03-23 V24
* Update title to include deduping
* removed intermediate spark files to save on space and rely on caching mechanism instead
* edit distance stats keep top 50 results for the 20 largest clusters. 100 rows is sufficient for quality checks
* renamed `cord19_citations` to citations
* include `title_sha` and `num_cited` to titles, for convenience

## Construction

Our goal is to create a `citation_id` by looking at titles in the flattened bibliography. The `sha1` hash of the title alone is not a high-quality identifier because titles often contain errors in capitalization or punctuation. We also do not have access to a consistent identifier, such as a DOI, for all of the citations. If we do not correct for these errors, we will miss edges when looking at the number of common neighbors between papers. Proper network construction is important to measures like the [co-citation similarity](https://en.wikipedia.org/wiki/Co-citation).

Inspecting the titles in the bibliography reveals the questionable quality of using them as nodes in a graph:


```
Control of communicable disease manual
Control of communicable diseases manual
Control of Communicable Disease Manual
```

Sometimes, papers are referenced in preprint where the title changes after publication. There may also be typesetting errors in the bibliography. We'll cluster titles that are similar and assign them to a single identifier. The ideal choice of identifier would be the most frequent title in the cluster or the title in its most cited form, but we'll settle for the `max(sha1(title))` of the cluster.

We could calculate the edit distance (Levenshtein distance) between all pairs of titles to determine clusters. Sets of similar titles would then be assigned a `citation_id`. However, calculating distances across all pairs would be prohibitively expensive (n=1,042,000). We'll accomplish this efficiently using a class of [Locality Sensitive Hashing (LSH)](http://www.mmds.org/mmds/v2.1/ch03-lsh.pdf) techniques for approximate similarity joins. We deduplicate titles using the [pyspark.ml.feature.MinHashLSH module in Spark](https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance) and generate a lookup table that can be used against the flattened references.

Finally, we'll construct the CORD-19 citation network (`citations`).
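
To give a feel for the exhaustive approach mentioned above, here is a minimal pure-Python sketch of the Levenshtein distance, applied to the three example titles, together with the number of pairs an all-pairs comparison would require. The `levenshtein` helper and `example_titles` list are illustrative names, not part of the pipeline (which relies on Spark's built-in functions).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute (free if equal)
                )
            )
        previous = current
    return previous[-1]


example_titles = [
    "Control of communicable disease manual",
    "Control of communicable diseases manual",
    "Control of Communicable Disease Manual",
]
for other in example_titles[1:]:
    print(levenshtein(example_titles[0], other), other)

# Number of unordered pairs an exhaustive comparison would need
n = 1_042_000
num_pairs = n * (n - 1) // 2
print(f"{num_pairs:,} pairs")
```

Each comparison is itself quadratic in the title length, which is why an approximate join is attractive at this scale.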


In [None]:
! pip install pyspark pyhash networkx
! ls /kaggle/input/

from pyspark.sql import SparkSession, functions as F
import matplotlib.pyplot as plt

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "12g")
    .getOrCreate()
)
spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism*2)
spark.conf.get("spark.driver.memory")

We're going to reuse some of the results from [Parquet and BigQuery dataset for CORD-19](https://www.kaggle.com/acmiyaguchi/parquet-and-bigquery-dataset-for-cord-19), which can be found by searching for [`CORD-19 parquet`](https://www.kaggle.com/acmiyaguchi/cord19-parquet) under `+ Add data`. A refresher of PySpark and SQL can be found in [PySpark DataFrame Preprocessing for CORD-19](https://www.kaggle.com/acmiyaguchi/pyspark-dataframe-preprocessing-for-cord-19).

In [None]:
input_path = "/kaggle/input/cord19-parquet/cord19.parquet"
df = spark.read.parquet(input_path)
df.createOrReplaceTempView("cord19")
df.printSchema()

input_path = "/kaggle/input/cord19-parquet/metadata.parquet"
df = spark.read.parquet(input_path)
df.createOrReplaceTempView("metadata")
df.printSchema()
df.show(vertical=True, n=1)

# Exploring the Flattened Bibliography

The first thing we need to do is create a flattened table of all the citations in the bibliography. We register this as our `references` table.

In [None]:
query = """
select
    paper_id,
    metadata.title as paper,
    value.title as citation
from
    cord19
lateral view
    explode(bib_entries) as key, value
"""
references = spark.sql(query)

print("writing references to parquet")
# A single file reduces parallelism, but is more efficient for output
%time references.toPandas().to_parquet("references.parquet")
references = spark.read.parquet("references.parquet")
references.createOrReplaceTempView("references")
references.show(n=5)
print(f"references has {references.count()} rows")

We can use these references to find the most frequently cited papers. We'll also visualize this as the network's degree distribution. A node's degree is defined as the number of edges containing that node.

In [None]:
query = """
select
    citation,
    count(distinct paper) as num_citations
from
    references
group by
    citation
order by
    num_citations desc
"""
spark.sql(query).show(n=10, truncate=60)

spark.sql(query).groupBy("num_citations").count().toPandas().plot.scatter("num_citations", "count")
plt.xscale("log")
plt.yscale("log")
plt.title("degree distribution of citation network w/o deduplication")
plt.show()
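
A roughly straight line on log-log axes is the usual visual hint of a power law. As a rough sketch (not a formal goodness-of-fit test), the slope of that line can be estimated with ordinary least squares on the logged values. The data below is synthetic and only for illustration; `loglog_slope` is a hypothetical helper.

```python
import math


def loglog_slope(degrees, counts):
    """Least-squares slope and intercept of log10(count) against log10(degree)."""
    xs = [math.log10(d) for d in degrees]
    ys = [math.log10(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data for illustration only: count = 1e6 * degree ** -2.5
synthetic_degrees = [1, 2, 4, 8, 16, 32, 64]
synthetic_counts = [1e6 * d ** -2.5 for d in synthetic_degrees]
slope, intercept = loglog_slope(synthetic_degrees, synthetic_counts)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On real data the tail is noisy, so a fit like this is only suggestive of the exponent.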

The degree distribution appears to follow a power law, since it is roughly linear on a log-log scale. Now let's create a list of all of the titles and look at the distribution of title lengths. There are about $10^6$ distinct titles.

In [None]:
query = """
with titles as (
    select paper as title from references
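    -- union all keeps repeated titles, so count(*) below counts occurrences
    -- (a plain union removes duplicates and every count would be 1)
    union all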
    select citation as title from references
)
select
    title,
    sha1(title) as title_sha,
    count(*) as num_cited
from
    titles
where
    length(title) > 0
group by
    title,
    title_sha
"""
titles = spark.sql(query)

print("writing titles to parquet")
%time titles.toPandas().to_parquet("titles.parquet")

titles = spark.read.parquet("titles.parquet")
print(f"there are {titles.count()} distinct titles")
titles.show(n=5, truncate=60)

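# Note: groupBy("length") keeps one row per distinct length, so this histogram
# is over distinct length values and is not weighted by the number of titles.
(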
    titles
    .withColumn("length", F.expr("log2(length(title))"))
    .groupBy("length")
    .count()
    .select("length")
).toPandas().hist("length", density=1)
plt.title("histogram of log_2(length)")
plt.show()

# Deduping Citations with MinHash

We use MinHash to cluster approximately-duplicate citation titles. Each cluster is assigned a `citation_id`, which becomes a node of the citation network.

Refer to [Chapter 3 of Mining Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) for an overview of the techniques used here.


## A note on hashing

[MurmurHash3](https://en.wikipedia.org/wiki/MurmurHash) is a fast, non-cryptographic hash. We want the hashing algorithm to be fast since we are digesting quite a bit of text. Here, we use a 4-byte hash to help reduce the size of our characteristic matrix. Since most entries are empty, we'll use a sparse representation, which has natural applications in map-reduce settings.

In [None]:
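import pyhash

hasher = pyhash.murmur3_32()
value = "Control of Communicable Diseases Manual"
hashed = hasher(value)
# murmur3_32 produces a 32-bit integer. sys.getsizeof would report the size
# of the Python int object in bytes, not the width of the hash in bits.
print(f"hashed '{value}' into a 32-bit value: {hashed}")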

## Hashing k-shingles

A k-shingle is a window of k characters in a piece of text. A set of k-shingles is formed by sliding a window over the entire length of the text. The idea is that a near-duplicate title will share a large proportion of these k-shingles. The exact measurement we will be using is the [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index), which is one minus the normalized overlap (the Jaccard index) of two sets.

$$
d_{J}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}
$$
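
As a small illustration of the formula, the following sketch computes character k-shingles and the Jaccard distance in plain Python. The helper names `shingles` and `jaccard_distance` are assumptions for this example, and treating two empty sets as distance 0 is a convention chosen here, not something taken from Spark.

```python
def shingles(text: str, k: int = 9) -> set:
    """Set of lower-cased character k-shingles of `text`."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


base = shingles("Control of communicable disease manual")
variant = shingles("Control of communicable diseases manual")
unrelated = shingles("Biology of natural killer cells")
print(f"near duplicate: {jaccard_distance(base, variant):.3f}")
print(f"unrelated:      {jaccard_distance(base, unrelated):.3f}")
```

Note that a single inserted character changes every shingle that spans it, so short titles are more sensitive to small edits than long ones.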

Looking at the histogram of title lengths above (computed over distinct length values), the mode appears around $2^9$, or 512 characters. We use a shingle size of 9, which should give us good enough coverage to reduce the number of false positives. We do not have access to labeled data, so we will make a judgement call on what "good enough" is.


The shingles are hashed into a fixed number of buckets, where the number of buckets is determined by our hashing function. Unfortunately, there are limits to the size of a `SparseVector` in the linalg library, so we reduce the range of the hash function by taking the modulo of the output. The documentation for Spark's `FeatureHasher` (not to be confused with this application of hashing) notes that a simple modulo of the hashed value is only evenly distributed among buckets if the number of buckets is a power of 2. Here, we've chosen $2^{24}$ to create a 3-byte hash function. The extra byte in the 32-bit integer remains unused.
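
The bucketing step can be sketched with the standard library alone. This example uses `zlib.crc32` as a stand-in 32-bit hash (the notebook itself uses MurmurHash3 via `pyhash`); `bucket` and `NUM_BUCKETS` are illustrative names. With a power-of-two bucket count, the modulo is the same as keeping the low 24 bits of the hash.

```python
import zlib

NUM_BUCKETS = 2 ** 24  # power of two, as chosen in the text


def bucket(shingle: str) -> int:
    """Map a shingle to a bucket index using a stand-in 32-bit hash (CRC32)."""
    hashed = zlib.crc32(shingle.encode("utf-8"))  # unsigned 32-bit integer
    return hashed % NUM_BUCKETS


example_shingle = "control o"
raw_hash = zlib.crc32(example_shingle.encode("utf-8"))
# For a power-of-two bucket count, the modulo equals masking the low 24 bits.
assert raw_hash % NUM_BUCKETS == raw_hash & (NUM_BUCKETS - 1)
print(bucket(example_shingle))
```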

In [None]:
from pyspark.ml.linalg import Vectors, VectorUDT
import pyhash


@F.udf(VectorUDT())
def hashed_shingles(text, k=9):
    """Generates a set of hashed k-shingles as a sparse vector.

    Text is lower cased before it is shingled. Punctuation is
    assumed to be significant.

    The max input dimension is 2147483647, or about 30.99 bits
    (log2). This bounds the number of buckets. To see the empirical
    limits of a SparseVector, set `num_buckets` to 2**32-1 and
    refer to the resulting exception message.
    """
    num_buckets = 2**24

    # A standard library alternative is to use
    # a check summing routine. This will be slower.
    # import zlib; hasher = zlib.adler32
    hasher = pyhash.murmur3_32()

    shingles = (text[i:i+k].lower() for i in range(len(text)-k+1))
    # (index, value) pairs; duplicates collapse because the value is always 1.0
    entries = {(hasher(s) % num_buckets, 1.0) for s in shingles}

    # pass the pairs as a list sorted by index
    return Vectors.sparse(num_buckets, sorted(entries))

## MinHashing citation titles

Running the MinHash algorithm is simple, and it can be incorporated into a model pipeline. We have to ensure that titles are at least as long as the shingle size; otherwise the resulting vector will be empty, and `MinHashLSH` requires every input vector to have at least one non-zero entry.

In [None]:
from pyspark.ml.feature import MinHashLSH

# shingle size k; should match the default `k` of hashed_shingles
num_shingles = 9
num_hash_tables = 5
minhasher = MinHashLSH(
    inputCol="features", 
    outputCol="hashes", 
    numHashTables=num_hash_tables
)

prepared_titles = (
    titles
    .withColumn("length", F.expr("length(title)"))
    .where(f"length(title) >= {num_shingles}")
    .withColumn("features", hashed_shingles("title"))
)

model = minhasher.fit(prepared_titles)
hashed_titles = model.transform(prepared_titles)
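
For intuition about what the MinHash step computes, here is a minimal pure-Python sketch. It is not Spark's implementation: the hash family `(a * x + b) % PRIME`, the CRC32 stand-in hash, and all helper names are assumptions made for this example. The fraction of signature positions on which two sets agree estimates their Jaccard similarity.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, larger than any shingle id below


def make_hash_functions(num_hashes: int, seed: int = 0):
    """Random (a, b) coefficients for hash functions h(x) = (a * x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def shingle_ids(text: str, k: int = 9) -> set:
    """Integer ids of the lower-cased character k-shingles (CRC32 as a stand-in hash)."""
    text = text.lower()
    return {zlib.crc32(text[i:i + k].encode("utf-8")) % 2 ** 24 for i in range(len(text) - k + 1)}


def minhash_signature(items: set, hash_functions) -> list:
    """For each hash function, keep the minimum hash value over the set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_functions]


def estimated_similarity(sig_a, sig_b) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


hash_functions = make_hash_functions(num_hashes=128)
set_a = shingle_ids("Control of communicable disease manual")
set_b = shingle_ids("Control of communicable diseases manual")
sig_a = minhash_signature(set_a, hash_functions)
sig_b = minhash_signature(set_b, hash_functions)
exact = len(set_a & set_b) / len(set_a | set_b)
print(f"exact={exact:.3f} estimated={estimated_similarity(sig_a, sig_b):.3f}")
```

More hash functions give a tighter estimate at the cost of more computation; the notebook's `numHashTables=5` trades accuracy for speed in a similar way.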

## Approximate nearest-neighbor search of titles

Here, we perform a self-join on the hashed titles with a threshold $t=0.8$, the maximum Jaccard distance between joined pairs. We do this so that we can analyze the effect of the threshold on the quality of the clusters by filtering the result at lower values. In practice, we should set this to a reasonable value, say $t=0.3$. Increasing the threshold will increase the runtime of this approximate join.

In [None]:
threshold = 0.8
hashed_titles.cache()

ann_titles = (
    model.approxSimilarityJoin(
        hashed_titles, 
        hashed_titles, 
        threshold,
        distCol="jaccard_distance"
    )
    .where("datasetA.title_sha <> datasetB.title_sha")
    .orderBy("jaccard_distance")
)

# Deduping Quality

Now we take a look at clustering titles using the results of the neighborhood search (`ann_titles`).


In [None]:
edgelist = (
    ann_titles
    .select(
        F.col("datasetA.title_sha").alias("src"), 
        F.col("datasetB.title_sha").alias("dst"), 
        "jaccard_distance"
    )
)
edgelist.cache()

%time print(f"there are {edgelist.count()} edges")

In [None]:
import networkx as nx
import pandas as pd
from tqdm import tqdm


cc = []
for i in tqdm(range(0, 85, 10)):
    graph = nx.from_pandas_edgelist(
        edgelist.where(f"jaccard_distance <= {i/100}").toPandas(),
        source="src",
        target="dst",
        edge_attr="jaccard_distance",
    )
    cc.append(
        dict(
            threshold=i/100,
            nodes=len(graph.nodes),
            edges=len(graph.edges),
            connected_components=nx.number_connected_components(graph),
            max_connected_component=max(map(len, nx.connected_components(graph))),
            average_clustering=nx.average_clustering(graph),
        )
    )
df = pd.DataFrame(cc)
df

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
df.plot("threshold", "nodes", ax=axes[0][0])
df.plot("threshold", "edges", ax=axes[0][1])
df.plot("threshold", "connected_components", ax=axes[1][0])
df.plot("threshold", "max_connected_component", ax=axes[1][1])
fig.suptitle("effect of threshold on network construction")
plt.show()

The results are well worth the 20-minute runtime. The first observation is the sharp increase in the size of the largest connected component at $t=0.7$. The larger the maximum connected component, the more titles with similar shingles are corrected to a single identifier. The number of edges also increases precipitously with the size of the maximum connected component. The two are related: if a connected component with $k$ nodes is a clique, it contains $\frac{k(k-1)}{2}$ edges, so the edge count grows quadratically with component size. As the threshold passes $0.75$, the number of components starts to decrease. This is the point at which the graph starts to snowball and become an indistinguishable mass. With a threshold of $1.0$, we would expect to see every possible pair of titles, which would consume $O(n^2)$ space.

## Effect of Jaccard distance threshold on title clustering

Now we'll take a look at the quality of the approximate title matching by manually inspecting the results of the largest cliques and a few random ones. We repeat this while varying the threshold from 0.0 to 0.8 in steps of 0.1.

### Generating the citation index and diagnostics

In [None]:
from pyspark.sql import Window
from pyspark.sql import types as T


def index_with_dedupe_threshold(
    threshold, edgelist, title_sha, output_path, n=10
):
    output_prefix = f"{output_path}/t{round(threshold*100):02d}"
    graph = nx.from_pandas_edgelist(
        edgelist.where(f"jaccard_distance <= {threshold}").toPandas(),
        source="src",
        target="dst",
        edge_attr="jaccard_distance",
    )
    
    # index of approximate citation to the canonical citation
    title_index = spark.sparkContext.parallelize(
        [(max(nodes), list(nodes)) for nodes in nx.connected_components(graph)]
    ).toDF(
        schema=T.StructType(
            [
                T.StructField("citation_id", T.StringType()),
                T.StructField("approx_citation_ids", T.ArrayType(T.StringType())),
            ]
        )
    )
    citation_index = (
        title_index.withColumn("near_duplicates", F.size("approx_citation_ids"))
        .withColumn("rank", F.row_number().over(Window.orderBy(F.desc("near_duplicates"))))
    )
    citation_index.cache()
    output = f"{output_prefix}_citation_index.parquet"
    citation_index.toPandas().to_parquet(output)

    # map citation_ids to titles
    indexed_titles = (
        citation_index
        .withColumn("approx_citation_id", F.explode("approx_citation_ids"))
        .drop("approx_citation_ids")
        .join(
            title_sha.selectExpr("title as approx_citation", "title_sha as approx_citation_id"),
            on="approx_citation_id",
        )
        .join(
            title_sha.selectExpr(
                "title as citation", "title_sha as citation_id"
            ),
            on="citation_id",
        )
        .withColumn(
            "edit_distance", F.levenshtein("citation", "approx_citation")
        )
        .orderBy("rank", F.desc("edit_distance"))
    )
    indexed_titles.cache()
    
    # write out the titles and their edit distances
    output = f"{output_prefix}_approx_citation.csv"
    (
        indexed_titles
        .withColumn("secondary_rank", F.row_number().over(
                # near_duplicates is constant within a cluster, so order by edit distance
                Window.partitionBy("rank").orderBy(F.desc("edit_distance"))
            )
        )
        # keep at most 199 titles for each of the 50 largest clusters
        .where("secondary_rank < 200")
        .where("rank <= 50")
        .select("rank", "edit_distance", "approx_citation")
        .orderBy("rank", F.desc("edit_distance"))
        .toPandas()
        .to_csv(output)
    )
    
    # write out all of the titles by rank
    output = f"{output_prefix}_ranked_titles.csv"
    (
        indexed_titles
        .groupBy("rank")
        .agg(F.count("*").alias("near_duplications"), F.max("citation").alias("citation"))
        .orderBy("rank")
        .toPandas()
        .to_csv(output)
    )
    
    # write out stats for each cluster
    output = f"{output_prefix}_edit_distance_stats.csv"
    stats = (
        indexed_titles
        .withColumn("x", F.col("edit_distance"))
        .groupBy("rank")
        .agg(
            F.count("x").alias("count"),
            F.mean("x").alias("mean"),
            F.stddev("x").alias("stddev"),
            F.min("x").alias("min"),
            F.expr("percentile(x, array(0.5))[0] as p50"),
            F.expr("percentile(x, array(0.9))[0] as p90"),
            F.max("x").alias("max"),
        )
        .orderBy("rank")
    )
    rounded = stats.select([F.round(c, 2).alias(c) for c in stats.columns])
    rounded.toPandas().to_csv(output)
    
    citation_index.unpersist()
    indexed_titles.unpersist()
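
The heart of `index_with_dedupe_threshold` is mapping every title hash in a connected component to the component's maximum hash. The sketch below shows the same idea without NetworkX, using a small union-find structure. The `canonical_ids` helper and the short ids are hypothetical stand-ins for the `sha1` digests.

```python
def canonical_ids(edges):
    """Map every node to the maximum node of its connected component (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # keep the larger id as the root, so a root is always its component's maximum
            parent[min(root_src, root_dst)] = max(root_src, root_dst)

    return {node: find(node) for node in parent}


# Hypothetical short ids standing in for sha1 digests of titles
example_edges = [("a1", "b2"), ("b2", "c3"), ("d4", "e5")]
example_index = canonical_ids(example_edges)
print(example_index)
```

Titles that never appear in an edge are absent from this mapping, which is why the network construction later falls back to the title's own hash.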

In [None]:
# ensure output folder exists
! mkdir -p corrections

for i in tqdm(range(0, 85, 10)):
    index_with_dedupe_threshold(i/100, edgelist, titles, "corrections", 10)

### Statistics about average edit distance within clusters.

In [None]:
df = (
    spark.read.csv("corrections/*_edit_distance_stats.csv", header=True)
    .withColumn("threshold",
        F.regexp_extract(F.input_file_name(), r".*t(\d+)_edit_distance_stats\.csv", 1)
        .astype("float") / 100
    )
)
df.show(n=5,truncate=False)
output = "edit_distance_stats.csv"
print(f"writing out {output}")
df.drop("_c0").toPandas().to_csv(output)

(
    df.groupBy("threshold")
    .agg(F.expr("mean(cast(p50 as float)) as mean_p50"))
    .orderBy("threshold")
).toPandas().plot("threshold", "mean_p50")
plt.show()

In [None]:
for threshold in [0.2, 0.5, 0.8]:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    fig.suptitle(f"threshold={threshold}")
    (
        df
        .where(f"threshold={threshold}")
        .selectExpr("log2(cast(p50 as float)) as median_edit_distance")
        .toPandas().hist("median_edit_distance", bins=10, ax=axes[0], density=1.0)
    )
    (
        df
        .where(f"threshold={threshold}")
        .selectExpr("cast(count as float) as degree")
        .groupBy("degree")
        .count()
        .toPandas().plot.scatter("degree", "count", ax=axes[1], title="degree distribution")
    )
    plt.xscale("log")
    plt.yscale("log")
    plt.show()

We observe that the average of the median edit distances increases as the threshold increases. We also note that the degree distribution at each threshold roughly follows a power law, although obvious outliers are visible. It might be useful to measure the goodness of fit to a power law. We would expect the line fit to the $t=0.8$ plot on a log-log scale to show a significant shift (p=0.05) due to the outlier.

The distribution of the median edit distances shifts to the right as the threshold increases. A divergence measure may be a good way to determine when it changes.

## Notes on title clustering

Reading through the results of `corrections/t{threshold}_approx_citation.csv` gives a sense of how well the title clustering is performing. Ideally, titles assigned to the same cluster should reference the same paper, differing only by minor typos.

At $t=0.8$, the largest component has 115k nodes, while the second largest has 242. The second component is strange: its titles look like somewhat random sequences built from variations of the following characters.

```
GlcNAcb1-4Galb1-4GlcNAcb-Sp8
Galb1-4GlcNAcb1-6(Fuca1-2Galb1-3GlcNAcb1-3)Galb1-4Glc-Sp21
GalNAca1-3(Fuca1-2)Galb1-3GalNAcb1-3Gala1-4Galb1-4Glc-Sp21
Mana1-3)Mana1-6(Mana1-3)Manb1-4GlcNAcb1-4GlcNAcb-Sp12
GalNAcb1-4(Neu5Aca2-8Neu5Aca2-8Neu5Aca2-3)Galb1-4Glcb-Sp0
Fuca1-2Galb1-4GlcNAcb1-2Mana-Sp0
```

Looking at the 3rd and 4th ranked clusters, the threshold has certainly not been set properly for this to be actionable.


At $t=0.7$, the quality of the largest component at 16.9k nodes is still poor. Looking at the second component with 932 nodes:

```
Rapid and sensitive detection of Taura syndrome virus by reverse ...
North American and European porcine reproductive and respiratory synd...
Porcine reproductive and respiratory syndrome virus antagonizes JAK/ST...
Porcine reproductive-respiratory syndrome virus infection predisposes pigs...
```

At $t=0.6$, the difference between the sizes of the rank 1 and rank 2 components becomes much smaller. However, the tolerance is still too high, causing mismatches between titles. From the largest component:

```
Staphylococcus aureus
Meticillin-resistant Staphylococcus aureus (MRSA): screening and decolonisation
Methicillin resistant Staphylococcus aureus colonization in pigs and pig farmers
```


$t=0.3$ is the first threshold at which the titles look consistent within their components. A shortcoming of this technique becomes apparent when scrolling through the results: titles that differ only in the year or the volume become indistinguishable.

```
"The European Union Summary Report on Trends and Sources of Zoonoses, Zoonotic Agents and Food-borne Outbreaks in 2010"
"The European Union Summary Report on Trends and Sources of Zoonoses, Zoonotic Agents and Food-borne Outbreaks in 2013"
"The European Union Summary Report on Trends and Sources of Zoonoses, Zoonotic Agents and Food-borne Outbreaks in 2012"
"The European Union summary report on trends and sources of zoonoses, zoonotic agents and food-borne outbreaks in (2014)"
"The European Union summary report on trends and sources of zoonoses, zoonotic agents and food-borne outbreaks in (2013)"
```

# Constructing the citation network

Now we build the actual citation network.

In [None]:
! cp corrections/t30_citation_index.parquet citation_index.parquet
citation_index = spark.read.parquet("citation_index.parquet")

print(f"references:\t{references}")
print(f"citation_index:\t{citation_index}")

Let's double check whether the `paper_id` is really the `sha1` of the paper title.

In [None]:
query = """
with papers as (
    select distinct
        paper_id,
        paper
    from
        references
)
select 
    paper_id = sha1(paper) sha_matches_paper_id,
    count(*) as num_seen
from
    papers
group by
    sha_matches_paper_id
"""
spark.sql(query).show(truncate=False)

query = """
select distinct
    paper_id,
    sha1(paper)
from
    references
limit 5
"""
spark.sql(query).show(truncate=False)


It turns out that `sha1(paper)` doesn't match the `paper_id`. This doesn't matter, since we've generated our own `citation_id` for this purpose, derived from the 20-byte `sha1` digest of the title.

## Indexing papers and citations with `citation_index`

We create the `citations` table.

In [None]:
index = citation_index.select("citation_id", F.explode("approx_citation_ids").alias("approx_citation_id"))
citations = (
    references
    .withColumn("paper_sha", F.sha1("paper"))
    .withColumn("citation_sha", F.sha1("citation"))
    # left join, since the citation index only contains duplicates. If the title is unique, we'll use
    # the title hash as the id for the node.
    .join(
        index.selectExpr("citation_id as src_id", "approx_citation_id as paper_sha"),
        on="paper_sha",
        how="left",
    )
    .withColumn("src_citation_id", F.coalesce("src_id", "paper_sha"))
    .join(
        index.selectExpr("citation_id as dst_id", "approx_citation_id as citation_sha"),
        on="citation_sha",
        how="left",
    )
    .withColumn("dst_citation_id", F.coalesce("dst_id", "citation_sha"))
    .drop("paper_sha", "citation_sha")
    .select("paper_id", "src_citation_id", "dst_citation_id", "paper", "citation")
    # remove self-edges and duplicate edges
    .where("src_citation_id <> dst_citation_id")
    .distinct()
    .orderBy("paper_id", "src_citation_id")
)

print("writing out citations.parquet")
%time citations.toPandas().to_parquet("citations.parquet")

citations.limit(5).toPandas()
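
The left join followed by `coalesce` amounts to a dictionary lookup with a fallback: titles that belong to a cluster get the cluster's `citation_id`, and unique titles keep their own hash. The following sketch shows that logic in plain Python. The `resolve` helper, the `lookup` dictionary, and the example titles are illustrative assumptions, and `hashlib.sha1(...).hexdigest()` is assumed to be comparable to Spark's `sha1` on UTF-8 strings.

```python
import hashlib


def sha1_hex(title: str) -> str:
    """Hex sha1 digest of a title."""
    return hashlib.sha1(title.encode("utf-8")).hexdigest()


def resolve(title: str, lookup: dict) -> str:
    """Return the cluster's citation_id if the title was deduplicated, else its own hash."""
    title_sha = sha1_hex(title)
    return lookup.get(title_sha, title_sha)


# Hypothetical index: two spellings of one title share a citation_id
spellings = [
    "Control of communicable disease manual",
    "Control of Communicable Disease Manual",
]
cluster_id = max(sha1_hex(t) for t in spellings)
lookup = {sha1_hex(t): cluster_id for t in spellings}

unique_title = "Biology of natural killer cells"
print(resolve(spellings[0], lookup) == resolve(spellings[1], lookup))
print(resolve(unique_title, lookup) == sha1_hex(unique_title))
```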

In [None]:
# We're done with Spark, so let's try to free up memory
spark.stop()

# kill java to make memory immediately available
! killall -9 java

# Also free up any dangling objects
import gc
gc.collect() 

# Network analysis at a glance

This will be brief, and will just show off how to use the dataset.

In [None]:
import pandas as pd
import networkx as nx

citations = pd.read_parquet("citations.parquet")
citations.head()

## Citation network from titles

In [None]:
pd.set_option('display.max_colwidth', 120)

G = nx.from_pandas_edgelist(citations, source="paper", target="citation")
# nx.info was removed in NetworkX 3.0; report the graph size directly
print(f"nodes: {G.number_of_nodes()}, edges: {G.number_of_edges()}")
%time pr = nx.pagerank(G)
(
    pd.DataFrame(pr.items(), columns=["citation_id", "pagerank"])
    .sort_values(by="pagerank", ascending=False)
    .head(20)
)

## Citation network from deduped titles

In [None]:
pd.set_option('display.max_colwidth', 120)

G = nx.from_pandas_edgelist(citations, source="src_citation_id", target="dst_citation_id")
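# nx.info was removed in NetworkX 3.0; report the graph size directly
print(f"nodes: {G.number_of_nodes()}, edges: {G.number_of_edges()}")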
%time pr = nx.pagerank(G)
pr_df = pd.DataFrame(pr.items(), columns=["citation_id", "pagerank"]).set_index("citation_id")
pr_df.to_csv("citation_pagerank.csv")
(
    citations[["dst_citation_id", "citation"]]
    .rename(columns={"dst_citation_id": "citation_id"})
    .groupby("citation_id")
    .first()
    .join(pr_df, on="citation_id", how="inner")
    .sort_values(by="pagerank", ascending=False)
    .reset_index()
    .head(20)[["citation", "pagerank"]]
)
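
For readers unfamiliar with PageRank, the following is a minimal power-iteration sketch on a toy undirected graph. It is not NetworkX's implementation: it assumes every node has at least one neighbor, uses a fixed number of iterations instead of a convergence test, and the `toy_pagerank` helper and toy graph are made up for illustration.

```python
def toy_pagerank(neighbors: dict, damping: float = 0.85, iterations: int = 100) -> dict:
    """Power-iteration PageRank on an undirected graph given as adjacency sets.

    Assumes every node has at least one neighbor (no dangling nodes).
    """
    nodes = list(neighbors)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # every node receives the teleportation share first
        updated = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # each node splits its damped rank evenly among its neighbors
            share = damping * rank[node] / len(neighbors[node])
            for neighbor in neighbors[node]:
                updated[neighbor] += share
        rank = updated
    return rank


# Toy star graph: three papers all connected to "hub"
toy_graph = {
    "hub": {"p1", "p2", "p3"},
    "p1": {"hub"},
    "p2": {"hub"},
    "p3": {"hub"},
}
toy_rank = toy_pagerank(toy_graph)
print(max(toy_rank, key=toy_rank.get))
```

In the deduplicated network, merging near-duplicate titles concentrates edges on a single node, which is what lets frequently cited works rise in a ranking like this.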

## Discussion

The quality of the citation network when we apply deduplication is subjectively better than the naive citation network. The results of the naive citation network are not very relevant, whereas the results of the deduplicated network are useful to the task at hand: finding relevant knowledge in the literature about the ongoing COVID-19 pandemic.

0. [In Knobil and Neill's Physiology of Reproduction](https://www.sciencedirect.com/book/9780123971753/knobil-and-neills-physiology-of-reproduction#book-description)
1. [Guideline for isolation precautions: preventing transmission of infectious agents in health care settings](https://www.cdc.gov/infectioncontrol/pdf/guidelines/isolation-guidelines-H.pdf)
2. [Biology of natural killer cells](https://www.sciencedirect.com/science/article/pii/S0065277608606641)
3. [The Molecular Biology of Coronaviruses](https://www.sciencedirect.com/science/article/pii/S0065352706660053)
4. [A novel coronavirus associated with severe acute respiratory syndrome.](https://www.ncbi.nlm.nih.gov/pubmed/12690092)

The ranking of these articles reflects the improved connectivity of the network. 

# Next Steps

There are plenty of interesting directions to go with this now. One idea for a future notebook would be to create a citation embedding using a technique such as [node2vec](http://snap.stanford.edu/node2vec/) or [Graph Convolutional Networks](https://tkipf.github.io/graph-convolutional-networks/). Having an embedding would make it easier to incorporate into a recommendation engine. Yet another idea would be to create an author network using LSH for deduplication. The shingling length would likely need to be modified, and perhaps other information like affiliation would need to be taken into consideration.

There are also quality issues with the network that have not been addressed, such as the inability to distinguish between volumes of journals or small variations in the title. There are also some titles like _"Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough ..."_ that make it into the dataset. However, reducing the citation network above into a co-citation network via an embedding would likely reduce the impact of anomalies like this.

Let me know if this notebook has been helpful to you. I hope this notebook can be of use for addressing the tasks at hand.