# Link Prediction w/ n2v

In [1]:
# !pip install arxiv

You can install the arxiv package in Python with the following command:  
`pip install arxiv`  
or follow the instructions here : https://pypi.org/project/arxiv/  

In [14]:
import networkx as nx
import scipy as sp
import pandas as pd
import numpy as np
import arxiv

from node2vec import Node2Vec as n2v

In [3]:
# constants
queries = [
    'automl', 'machinelearning', 'data', 'phyiscs','mathematics', 'recommendation system', 'nlp', 'neural networks'
]

# Fetch Data

We want to hit th Arxiv API to gather some information about the latest research papers based on the queries we've identified above. This will allow us to then create a network from this research paper data and then we can try to predict links on that network. For the purposes of this article, I will search for a maximum of 1000 results per query, but you don't have to set yourself to the same constraints. The Arxiv API allows users to hit up to 300,000 results per query. The function outlined below will generate a CSV fetching the following information :   
```'title', 'date', 'article_id', 'url', 'main_topic', 'all_topics', 'authors', 'year'```   
You are able to fetch more information like the `links, summary, article` but I decided not to since those features won't really be used for the purposes of this analysis and tutorial.

For reference to the Arxiv API, you can find their detailed documentation here : https://arxiv.org/help/api/user-manual

In [4]:
def search_arxiv(queries, max_results = 1000):
    '''
    This function will search arxiv associated to a set of queries and store
    the latest 10000 (max_results) associated to that search.
    
    params:
        queries (List -> Str) : A list of strings containing keywords you want
                                to search on Arxiv
        max_results (Int) : The maximum number of results you want to see associated
                            to your search. Default value is 1000, capped at 300000
                            
    returns:
        This function will return a DataFrame holding the following columns associated
        to the queries the user has passed. 
            `title`, `date`, `article_id`, `url`, `main_topic`, `all_topics`
    
    example:
        research_df = search_arxiv(
            queries = ['automl', 'recommender system', 'nlp', 'data science'],
            max_results = 10000
        )
    '''
    d = []
    searches = []
    # hitting the API
    for query in queries:
        search = arxiv.Search(
          query = query,
          max_results = max_results,
          sort_by = arxiv.SortCriterion.SubmittedDate,
          sort_order = arxiv.SortOrder.Descending
        )
        searches.append(search)
    
    # Converting search result into df
    for search in searches:
        for res in search.results():
            data = {
                'title' : res.title,
                'date' : res.published,
                'article_id' : res.entry_id,
                'url' : res.pdf_url,
                'main_topic' : res.primary_category,
                'all_topics' : res.categories,
                'authors' : res.authors
            }
            d.append(data)
        
    d = pd.DataFrame(d)
    d['year'] = pd.DatetimeIndex(d['date']).year
    
    # change article id from url to integer
    unique_article_ids = d.article_id.unique()
    article_mapping = {art:idx for idx,art in enumerate(unique_article_ids)}
    d['article_id'] = d['article_id'].map(article_mapping)
    return d

In [5]:
%%time
research_df = search_arxiv(
    queries = queries,
    max_results = 100
)
research_df.shape

CPU times: user 1.04 s, sys: 63 ms, total: 1.11 s
Wall time: 11.2 s


(646, 8)

If you're having trouble querying the data, for reproducibility purposes, the CSV I used for the analysis conducted in this article was uploaded to my GitHub which you can find here. https://github.com/vatsal220/medium_articles/blob/main/link_prediction/data/arxiv_data.csv

## Generate Network

Now that we've fetched the data using the Arxiv API, we can generate a network. The network will have the following structure, nodes will be the article_ids and the edges will be all topics connecting a pair of articles. For example, article_id 1 with the following topics `astro-physics, and stats` can be connected to article_id 10 with the topic `stats` and article_id 7 with the topics `astro-physics, math`. This will be a multi-edge network where each edge will hold a weight of 1.

In [6]:
r_df = research_df.explode('all_topics').copy()

In [7]:
def generate_network(df, node_col = 'article_id', edge_col = 'main_topic'):
    '''
    This function will generate a article to article network given an input DataFrame.
    It will do so by creating an edge_dictionary where each key is going to be a node
    referenced by unique values in node_col and the values will be a list of other nodes
    connected to the key through the edge_col.
    
    params:
        df (DataFrame) : The dataset which holds the node and edge columns
        node_col (String) : The column name associated to the nodes of the network
        edge_col (String) : The column name associated to the edges of the network
        
    returns:
        A networkx graph corresponding to the input dataset
        
    example:
        generate_network(
            research_df,
            node_col = 'article_id',
            edge_col = 'main_topic'
        )
    '''
    edge_dct = {}
    for i,g in df.groupby(node_col):
        topics = df[edge_col].unique()
        edge_df = df[(df[node_col] != i) & (df[edge_col].isin(topics))]
        edges = list(edge_df[node_col].unique())
        edge_dct[i] = edges
    
    # create nx network
    g = nx.Graph(edge_dct)
    return g

In [8]:
%time research_network = generate_network(research_df, node_col = 'article_id', edge_col = 'main_topic')

CPU times: user 724 ms, sys: 45.2 ms, total: 769 ms
Wall time: 789 ms


In [9]:
print(nx.info(research_network))

Name: 
Type: Graph
Number of nodes: 562
Number of edges: 157641
Average degree: 561.0000


In [10]:
%time all_topic_nx = generate_network(r_df, node_col = 'article_id', edge_col = 'all_topics')

CPU times: user 729 ms, sys: 19.1 ms, total: 748 ms
Wall time: 748 ms


In [11]:
print(nx.info(all_topic_nx))

Name: 
Type: Graph
Number of nodes: 562
Number of edges: 157641
Average degree: 561.0000


## Node2Vec

In [15]:
%time g_emb = n2v(research_network, dimensions=16)

Computing transition probabilities:   0%|          | 0/562 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 10/10 [00:25<00:00,  2.55s/it]

CPU times: user 5min 57s, sys: 4.73 s, total: 6min 2s
Wall time: 6min 3s





In [16]:
WINDOW = 1 # Node2Vec fit window
MIN_COUNT = 1 # Node2Vec min. count
BATCH_WORDS = 4 # Node2Vec batch words

In [17]:
mdl = g_emb.fit(
    window=WINDOW,
    min_count=MIN_COUNT,
    batch_words=BATCH_WORDS
)

In [18]:
input_node = '1'
for s in mdl.wv.most_similar(input_node, topn = 10):
    print(s)

('211', 0.9128776788711548)
('521', 0.911181628704071)
('40', 0.9052026271820068)
('124', 0.9023550152778625)
('287', 0.9022865891456604)
('132', 0.8979486227035522)
('138', 0.8978955745697021)
('467', 0.8966530561447144)
('170', 0.8965973854064941)
('531', 0.8943594098091125)


In [22]:
r_df[r_df.article_id == 1]

Unnamed: 0,title,date,article_id,url,main_topic,all_topics,authors,year
1,Hubble Asteroid Hunter: I. Identifying asteroi...,2022-02-01 06:56:20+00:00,1,http://arxiv.org/pdf/2202.00246v1,astro-ph.EP,astro-ph.EP,"[Sandor Kruk, Pablo García Martín, Marcel Pope...",2022
1,Hubble Asteroid Hunter: I. Identifying asteroi...,2022-02-01 06:56:20+00:00,1,http://arxiv.org/pdf/2202.00246v1,astro-ph.EP,astro-ph.IM,"[Sandor Kruk, Pablo García Martín, Marcel Pope...",2022


In [24]:
r_df[r_df.article_id == 521]

Unnamed: 0,title,date,article_id,url,main_topic,all_topics,authors,year
577,Topological Defects Induced High-Spin Quartet ...,2022-02-08 13:34:05+00:00,521,http://arxiv.org/pdf/2202.03853v1,cond-mat.mes-hall,cond-mat.mes-hall,"[Can Li, Yu Liu, Yufeng Liu, Fu-Hua Xue, Danda...",2022


## Generate Embeddings DataFrame

In [20]:
emb_df = (
    pd.DataFrame(
        [mdl.wv.get_vector(str(n)) for n in research_network.nodes()],
        index = research_network.nodes
    )
)

In [21]:
emb_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.128435,-0.049074,0.132459,-0.102469,0.844356,0.267155,0.225001,-0.419419,-0.135403,-0.002126,-0.086459,-0.163671,-0.00672,0.043489,0.237503,0.01943
1,0.096374,0.018799,0.035309,-0.023356,0.524462,0.257324,0.579559,-0.595028,-0.237835,-0.113283,0.068707,-0.420708,0.095696,-0.093512,0.159015,0.001616
2,-0.039072,-0.151964,0.139572,-0.417229,0.720359,0.410377,0.155613,-0.352993,0.041561,-0.395659,0.130601,-0.000469,-0.2237,-0.164071,0.044049,0.179982
3,0.163633,0.070856,-0.186758,-0.188014,0.689734,0.305346,0.171601,-0.403026,-0.155164,-0.409362,0.080607,-0.192482,0.092849,0.082888,0.109581,0.110443
4,0.094581,-0.13182,0.26545,-0.071456,0.753454,0.235651,0.430763,-0.240429,-0.041164,-0.16051,-0.171398,-0.306116,0.137343,-0.272153,0.346461,0.089041


## Train Model

## Model Performance

## Generate Predictions

---