# PubGraph: What should I read next?

This notebook shows how KGTK supports different tasks built on top of PubGraph.

We highlight how KGTK facilitates:

1. Training knowledge graph embeddings to make recommendations
2. Identifying communities to find similar papers
3. Running PageRank to find influential papers

In [1]:
import faiss
import numpy as np
import pandas as pd
from kgtk.functions import kgtk

In [2]:
dataset_path = "./datasets/sample_10k.tsv"
tmp_dir = "./tmp"

## Dataset Statistics

We start by counting the entities of each type (`P31`, instance of) in the sample:

In [3]:
%%time
kgtk(f"query -i {dataset_path} --match '(instance)-[:P31]->(class)' --return 'class as Class, count(class) as count'")

CPU times: user 17.4 ms, sys: 11.1 ms, total: 28.5 ms
Wall time: 5.27 s


Unnamed: 0,Class,count
0,Q17350442,2834
1,Q386724,10000
2,Q5,25694


## Training Knowledge Graph Embeddings

As the first step, we need to filter the links that we want from all the available metadata.
For this tutorial we are going to use the following properties:
-   P50 (Author)
-   P1433 (Venue)
-   P2860 (Cites)
-   P577 (Publication Date)

In [4]:
%%time
kgtk(f"filter -i {dataset_path} --word-separator '|'  -p '; P50|P1433|P2860 ;' -o {tmp_dir}/filtered_graph.tsv")

CPU times: user 10.8 ms, sys: 5.36 ms, total: 16.2 ms
Wall time: 4.61 s
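
To make the effect of the filter pattern concrete, here is a minimal pure-Python sketch of the same kind of selection on an in-memory edge list. The helper `filter_by_label` and the toy edges are illustrative assumptions, not part of KGTK; the real command also supports patterns on `node1` and `node2`.

```python
# Toy edges in (node1, label, node2) form; the identifiers are made up for illustration.
edges = [
    ("W1", "P50", "A1"),            # author
    ("W1", "P1433", "V1"),          # venue
    ("W1", "P2860", "W2"),          # cites
    ("W1", "P1476", "Some title"),  # title, not one of the wanted labels
    ("W2", "P577", "2001-01-01"),   # publication date, not one of the wanted labels
]


def filter_by_label(edges, pattern, separator="|"):
    """Keep edges whose label is one of the alternatives listed in `pattern`."""
    wanted = set(pattern.split(separator))
    return [edge for edge in edges if edge[1] in wanted]


kept = filter_by_label(edges, "P50|P1433|P2860")
for edge in kept:
    print(edge)
```

Only the author, venue, and citation edges survive, which is the set of links passed to the embedding step.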


We can now calculate the graph embeddings. Here, we use the TransE method with 100 dimensions, train for 100 epochs, and save the learned representations.

In [5]:
%%time
kgtk(f"""graph-embeddings
    -i {tmp_dir}/filtered_graph.tsv
    --log {tmp_dir}/kgtk_graph_embedding.log
    --temporary_directory {tmp_dir}
    --output_format kgtk
    --retain_temporary_data True
    --comparator squared_l2
    --operator TransE
    --num_epochs 100
    --dimension 100
    --dynamic_relations False
    -o {tmp_dir}/embeddings.tsv
    """)

In Processing, Please go to tmp/kgtk_graph_embedding.log to check details
Processed Finished.

CPU times: user 171 ms, sys: 77.5 ms, total: 248 ms
Wall time: 4min 50s
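
For intuition about what is being learned: TransE treats a relation as a translation in embedding space, so for a true triple (head, relation, tail) the vector `head + relation` should land close to `tail`. The sketch below scores triples with a squared L2 distance, matching the `squared_l2` comparator chosen above. The function name and the three-dimensional vectors are hypothetical and hand-picked; this is not KGTK's training code.

```python
def transe_squared_l2(head, relation, tail):
    """Squared L2 distance between (head + relation) and tail; lower means more plausible."""
    return sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))


# Hypothetical 3-dimensional embeddings, chosen by hand for illustration.
paper = [0.1, 0.4, -0.2]
cites = [0.3, -0.1, 0.5]            # relation vector
cited_paper = [0.4, 0.3, 0.3]       # approximately paper + cites
unrelated_paper = [-0.9, 0.8, 0.0]

good = transe_squared_l2(paper, cites, cited_paper)
bad = transe_squared_l2(paper, cites, unrelated_paper)
print(f"plausible triple: {good:.4f}, implausible triple: {bad:.4f}")
```

Training pushes the score of observed triples down and the score of corrupted (negative) triples up, which is why nearby paper embeddings can later be used as recommendations.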


To make recommendations, we need to load the embeddings and prepare them to be loaded into a Faiss index.

In [6]:
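embedding_df = pd.read_csv(f"{tmp_dir}/embeddings.tsv", sep="\t")
# Keep only works (IDs starting with "W"); copy so the assignment below does not trigger SettingWithCopyWarning.
embedding_df = embedding_df[embedding_df["node1"].str.startswith("W")].copy()
# Parse each comma-separated embedding string into a list of floats.
embedding_df["node2"] = embedding_df["node2"].apply(lambda x: list(map(float, x.split(","))))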

In [7]:
embedding_df.head(n=5)

Unnamed: 0,node1,label,node2
0,W1976848519,graph_embeddings,"[-0.292098314, -0.513507247, -0.574698508, 0.4..."
9,W2123886363,graph_embeddings,"[-0.294932693, -0.777636349, -0.787626445, 0.5..."
11,W2039574702,graph_embeddings,"[-0.427890688, -0.709110439, -0.641926527, 0.6..."
12,W2044733003,graph_embeddings,"[-0.227729842, -0.558198154, -0.610058725, 0.5..."
15,W2048297434,graph_embeddings,"[-0.312093318, -0.658000588, -0.668402135, 0.6..."


Given the embeddings, we create a Faiss index based on Euclidean (L2) distance and load our learned embeddings into it. Note that `IndexFlatL2` reports squared distances.

In [8]:
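index = faiss.IndexFlatL2(100)  # must match the --dimension used for training
# Faiss operates on float32 arrays, so convert explicitly.
embeddings = np.array(embedding_df["node2"].to_list(), dtype="float32")
index.add(embeddings)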
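
`IndexFlatL2` performs an exhaustive search and returns squared Euclidean distances. As a rough mental model, the following dependency-free sketch does the same thing for a single query; `flat_l2_search` and the toy vectors are hypothetical names and data used only for illustration, and Faiss itself is far more efficient.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive k-nearest-neighbour search using squared L2 distance."""
    scored = [
        (sum((v - q) ** 2 for v, q in zip(vector, query)), position)
        for position, vector in enumerate(vectors)
    ]
    scored.sort()  # smallest distance first; ties broken by position
    top = scored[:k]
    return [distance for distance, _ in top], [position for _, position in top]


# Toy 2-dimensional "embeddings".
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_neighbors = flat_l2_search(toy_vectors, [0.0, 0.0], k=3)
print(toy_distances, toy_neighbors)
```

The returned positions are row numbers in the indexed array, which is why a separate list of paper IDs in the same order is needed to map results back to papers.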

In [9]:
%%time
titles = kgtk(f"filter -i {dataset_path} -p '; P1476 ;'")

CPU times: user 113 ms, sys: 41.7 ms, total: 155 ms
Wall time: 11.9 s


In [10]:
titles.head(n=5)

Unnamed: 0,id,node1,label,node2
0,W2138810412-V00_118161,W2138810412,P1476,Content Based Rate Estimation Using Lazy Membe...
1,W1546627792-V00_118162,W1546627792,P1476,Probability Approximations via the Poisson Clu...
2,W2594445009-V00_118163,W2594445009,P1476,A Study on Massive Open Online Courses in High...
3,W2092634194-V00_118164,W2092634194,P1476,Massive Open Online Courses on Health and Medi...
4,W4211094263-V00_118165,W4211094263,P1476,Preconditioning@en


In [11]:
papers_idx = embedding_df["node1"].to_list()
titles_idx = dict(titles[["node1", "node2"]].values.tolist())

Now, let's find similar papers to "VECTOR VALUED REPRODUCING KERNEL HILBERT SPACES OF INTEGRABLE FUNCTIONS AND MERCER THEOREM."

First we need to find the embedding of this paper.

In [12]:
candidate = "W2165867509"
candidate_title = titles_idx[candidate]
candidate_embedding = np.array(embedding_df[embedding_df['node1'] == candidate]["node2"].to_numpy()[0], dtype="float32").reshape(1, -1)

Then, we can retrieve the closest papers:

In [13]:
distances, neighbors = index.search(candidate_embedding, 6)

As a sanity check, the closest paper should be the query paper itself, at distance zero:

In [14]:
assert candidate_title == titles_idx[papers_idx[neighbors[0][0]]]
assert distances[0][0] == 0.0

In [15]:
for distance, neighbor in zip(distances[0], neighbors[0]):
    print(f"{papers_idx[neighbor]} --> {titles_idx[papers_idx[neighbor]]} ({distance})")

W2165867509 --> VECTOR VALUED REPRODUCING KERNEL HILBERT SPACES OF INTEGRABLE FUNCTIONS AND MERCER THEOREM@en (0.0)
W3176077280 --> Bounding convex relaxations of process models from below by tractable black-box sampling@en (0.18699485063552856)
W2963744345 --> Analysis of a Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards@en (0.21961277723312378)
W2964170750 --> Reproducing kernel Hilbert spaces and variable metric algorithms in PDE-constrained shape optimization@en (0.22445255517959595)
W2048625569 --> Cryptanalysis of a remote login authentication scheme@en (0.22557495534420013)
W2562979639 --> THEORIES, STYLES AND STRATEGIES OF LEARNING IN E-LEARNING@en (0.23612035810947418)
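
For presentation, the query paper can be dropped from its own recommendation list and the `@en` language tag stripped from titles. The helper below is a minimal sketch under those assumptions; `format_recommendations` is a hypothetical name, and the sample inputs reuse the first two rows printed above.

```python
def format_recommendations(query_id, neighbor_ids, neighbor_distances, titles):
    """Drop the query paper and strip a trailing language tag such as '@en' from titles."""
    results = []
    for paper_id, distance in zip(neighbor_ids, neighbor_distances):
        if paper_id == query_id:
            continue  # the query is always its own nearest neighbour
        title = titles.get(paper_id, "<unknown title>")
        base, sep, tag = title.rpartition("@")
        # Only treat the suffix as a language tag if it looks like one (e.g. "en").
        if sep and tag.isalpha() and len(tag) <= 3:
            title = base
        results.append((paper_id, title, round(float(distance), 4)))
    return results


sample_titles = {
    "W2165867509": "VECTOR VALUED REPRODUCING KERNEL HILBERT SPACES OF INTEGRABLE FUNCTIONS AND MERCER THEOREM@en",
    "W3176077280": "Bounding convex relaxations of process models from below by tractable black-box sampling@en",
}
recommendations = format_recommendations(
    "W2165867509",
    ["W2165867509", "W3176077280"],
    [0.0, 0.18699485063552856],
    sample_titles,
)
for paper_id, title, distance in recommendations:
    print(f"{paper_id} --> {title} ({distance})")
```

In the notebook, the ID list would be built as `[papers_idx[n] for n in neighbors[0]]` and the distances taken from `distances[0]`.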


## Identifying Communities

First we need to extract the citation graph. We keep only citations where both the citing and the cited paper have a publication date (P577):

In [16]:
%%time
kgtk(f"""query -i {dataset_path}
    --match '(node1)-[:P2860]->(node2), (node1)-[:P577]->(node3), (node2)-[:P577]->(node4)'
    --return 'node1, "P2860" as label, node2'
    -o {tmp_dir}/citation_graph_10k.tsv
    """)

CPU times: user 7.67 ms, sys: 14.4 ms, total: 22 ms
Wall time: 6.44 s


Now, let's run community detection with the `blockmodel` method:

In [17]:
%%time
kgtk(f"""community-detection -i {tmp_dir}/citation_graph_10k.tsv
    --method blockmodel
    -o {tmp_dir}/communities_10k.tsv
    """)

CPU times: user 15.2 ms, sys: 25.6 ms, total: 40.8 ms
Wall time: 25.4 s


Let's check the number of communities!

In [18]:
%%time
kgtk(f"query -i {tmp_dir}/communities_10k.tsv --match '()-[]->(class)' --return 'count(distinct class) as count'")

CPU times: user 13.1 ms, sys: 19.9 ms, total: 33 ms
Wall time: 5.78 s


Unnamed: 0,count
0,45
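
Beyond the number of communities, their size distribution is often informative. The sketch below counts members per community with the standard library; the assignments are toy data with made-up community names, standing in for the `node1` and `node2` columns that would be read from `communities_10k.tsv`.

```python
from collections import Counter

# Toy (paper, community) assignments; the names are invented for illustration.
assignments = [
    ("W1", "c0"),
    ("W2", "c0"),
    ("W3", "c1"),
    ("W4", "c0"),
    ("W5", "c2"),
    ("W6", "c1"),
]

community_sizes = Counter(community for _, community in assignments)

print(f"{len(community_sizes)} communities")
for community, size in community_sizes.most_common():
    print(f"{community}: {size} papers")
```

A heavily skewed distribution (one very large community and many singletons) would suggest that the sample's citation graph is too sparse for the detected communities to be meaningful.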


### Visualization

First we need to create a node file containing the community of each node.
A cool feature here is that we can chain commands together without needing to save into intermediary files!

In [25]:
%%time
kgtk(f"""query -i {dataset_path} --as sample
    -i {tmp_dir}/communities_10k.tsv --as communities
    --match 'communities: (work)-[]->(comm), sample: (work)-[:P1476]->(title)'
    --return 'work as id, kgtk_unstringify(title) as label, comm as community_id'

    /

    calc
    --do substring
    --columns community_id
    --into 'community_id'
    --values 8
    -o {tmp_dir}/viz_communities_10k.tsv
    """)

CPU times: user 12.7 ms, sys: 10.5 ms, total: 23.2 ms
Wall time: 9.94 s


Now we can generate the visualization.

In [26]:
%%time
%%capture
kgtk(f"""visualize-graph -i  {tmp_dir}/citation_graph_10k.tsv 
    --node-file {tmp_dir}/viz_communities_10k.tsv
    --node-color-column community_id
    --node-color-style d3.interpolateRainbow
    --edge-color-default "#010101"
    --edge-width-default 0.10
    --tooltip-column label
    -o {tmp_dir}/viz.html
    """)

CPU times: user 47.8 ms, sys: 17.3 ms, total: 65.2 ms
Wall time: 5.23 s


In [27]:
from IPython.display import IFrame

IFrame(src=f"{tmp_dir}/viz.html", width=1920, height=1080)

## PageRank

To calculate PageRank, we extract the citation edges and then run the algorithm on them:

In [28]:
%%time
kgtk(f"""
    query -i {dataset_path}
    --match '(node1)-[:P2860]->(node2)'
    --return 'node1, "P2860" as label, node2'
    
    /
    
    add-id --id-style wikidata
    
    /

    graph-statistics --compute-pagerank True 
    --compute-hits False 
    --page-rank-property node_pagerank 
    --vertex-in-degree-property node_in_degree
    --vertex-out-degree-property node_out_degree
    --output-degrees True 
    --output-pagerank True 
    --output-hits False
    --output-statistics-only 
    --undirected False
    --log-file {tmp_dir}/metadata.pagerank.undirected.summary_10k.txt
    -o {tmp_dir}/metadata.pagerank.undirected_10k.tsv
    """)

CPU times: user 18.8 ms, sys: 20.2 ms, total: 39 ms
Wall time: 27.6 s


In [31]:
%%time
kgtk(f"""query -i {dataset_path} --as sample
    -i {tmp_dir}/metadata.pagerank.undirected_10k.tsv --as pageranks
    --match 'sample: (node1)-[:P1476]->(node2), pageranks: (node1)-[:node_pagerank]->(node3)'
    --return 'node1 as work, node2 as title, node3 as pagerank'
    --order-by 'cast(pagerank, float) desc'
    -o {tmp_dir}/papers_pagerank.tsv
    """)

CPU times: user 8.41 ms, sys: 21.2 ms, total: 29.6 ms
Wall time: 6.44 s


In [32]:
kgtk(f"head -i {tmp_dir}/papers_pagerank.tsv")

Unnamed: 0,work,title,pagerank
0,W1991760834,Algorithm for Diagonalization of Large Matrice...,0.005319
1,W1988027329,Ground state of the fermion one-component plas...,0.004861
2,W2026907619,Ground State of the Electron Gas by a Stochast...,0.004293
3,W2065916059,Modification of Nesbet's algorithm for the ite...,0.004054
4,W2056760934,Equation of State Calculations by Fast Computi...,0.003971
5,W2033943395,User Acceptance of Computer Technology: A Comp...,0.00376
6,W2247216366,The Evaluation of the Collision Matrix@en,0.003315
7,W1983523064,The description of collective motions in terms...,0.003145
8,W1791587663,"Perceived Usefulness, Perceived Ease of Use, a...",0.003094
9,W2009596258,The iterative calculation of a few of the lowe...,0.002858
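
For readers who want to see what the PageRank scores represent, here is a minimal power-iteration sketch on a toy citation graph. It is not the implementation behind `graph-statistics`; the function name, the damping factor of 0.85, the fixed iteration count, and the toy edges are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by papers that cite nothing is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping + damping * dangling) / len(nodes) for node in nodes
        }
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Toy citations: B, C and D cite A; D also cites B.
toy_citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
scores = pagerank(toy_citations)
for paper, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{paper}: {score:.4f}")
```

Paper A, which every other paper cites, ends up with the highest score, mirroring how heavily cited works rise to the top of the table above.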
