# Link Prediction w/ n2v

In [1]:
# !pip install arxiv

You can install the arxiv package in Python with the following command:  
`pip install arxiv`  
or follow the instructions here : https://pypi.org/project/arxiv/  

In [2]:
import networkx as nx
import scipy as sp
import pandas as pd
import numpy as np
import arxiv

from node2vec import node2vec as n2v

In [3]:
# constants
queries = [
    'automl', 'machinelearning', 'data', 'phyiscs','mathematics', 'recommendation system', 'nlp', 'neural networks'
]

# Fetch Data

We want to hit th Arxiv API to gather some information about the latest research papers based on the queries we've identified above. This will allow us to then create a network from this research paper data and then we can try to predict links on that network. For the purposes of this article, I will search for a maximum of 1000 results per query, but you don't have to set yourself to the same constraints. The Arxiv API allows users to hit up to 300,000 results per query. The function outlined below will generate a CSV fetching the following information :   
```'title', 'date', 'article_id', 'url', 'main_topic', 'all_topics', 'authors', 'year'```   
You are able to fetch more information like the `links, summary, article` but I decided not to since those features won't really be used for the purposes of this analysis and tutorial.

For reference to the Arxiv API, you can find their detailed documentation here : https://arxiv.org/help/api/user-manual

In [4]:
def search_arxiv(queries, max_results = 1000):
    '''
    This function will search arxiv associated to a set of queries and store
    the latest 10000 (max_results) associated to that search.
    
    params:
        queries (List -> Str) : A list of strings containing keywords you want
                                to search on Arxiv
        max_results (Int) : The maximum number of results you want to see associated
                            to your search. Default value is 1000, capped at 300000
                            
    returns:
        This function will return a DataFrame holding the following columns associated
        to the queries the user has passed. 
            `title`, `date`, `article_id`, `url`, `main_topic`, `all_topics`
    
    example:
        research_df = search_arxiv(
            queries = ['automl', 'recommender system', 'nlp', 'data science'],
            max_results = 10000
        )
    '''
    d = []
    searches = []
    # hitting the API
    for query in queries:
        search = arxiv.Search(
          query = query,
          max_results = max_results,
          sort_by = arxiv.SortCriterion.SubmittedDate,
          sort_order = arxiv.SortOrder.Descending
        )
        searches.append(search)
    
    # Converting search result into df
    for search in searches:
        for res in search.results():
            data = {
                'title' : res.title,
                'date' : res.published,
                'article_id' : res.entry_id,
                'url' : res.pdf_url,
                'main_topic' : res.primary_category,
                'all_topics' : res.categories,
                'authors' : res.authors
            }
            d.append(data)
        
    d = pd.DataFrame(d)
    d['year'] = pd.DatetimeIndex(d['date']).year
    
    # change article id from url to integer
    unique_article_ids = d.article_id.unique()
    article_mapping = {art:idx for idx,art in enumerate(unique_article_ids)}
    d['article_id'] = d['article_id'].map(article_mapping)
    return d

In [5]:
%%time
research_df = search_arxiv(
    queries = queries,
    max_results = 1000
)
research_df.shape

CPU times: user 8.02 s, sys: 456 ms, total: 8.48 s
Wall time: 3min 39s


(5332, 8)

## Generate Network

Now that we've fetched the data using the Arxiv API, we can generate a network. The network will have the following structure, nodes will be the article_ids and the edges will be all topics connecting a pair of articles. For example, article_id 1 with the following topics `astro-physics, and stats` can be connected to article_id 10 with the topic `stats` and article_id 7 with the topics `astro-physics, math`. This will be a multi-edge network where each edge will hold a weight of 1.

In [7]:
research_df

Unnamed: 0,title,date,article_id,url,main_topic,all_topics,authors,year
0,Review of automated time series forecasting pi...,2022-02-03 17:26:27+00:00,0,http://arxiv.org/pdf/2202.01712v1,cs.LG,[cs.LG],"[Stefan Meisenbacher, Marian Turowski, Kaleb P...",2022
1,Hubble Asteroid Hunter: I. Identifying asteroi...,2022-02-01 06:56:20+00:00,1,http://arxiv.org/pdf/2202.00246v1,astro-ph.EP,"[astro-ph.EP, astro-ph.IM]","[Sandor Kruk, Pablo García Martín, Marcel Pope...",2022
2,NAS-Bench-Suite: NAS Evaluation is (Now) Surpr...,2022-01-31 18:02:09+00:00,2,http://arxiv.org/pdf/2201.13396v1,cs.LG,"[cs.LG, cs.AI, stat.ML]","[Yash Mehta, Colin White, Arber Zela, Arjun Kr...",2022
3,Online AutoML: An adaptive AutoML framework fo...,2022-01-24 15:37:20+00:00,3,http://arxiv.org/pdf/2201.09750v1,cs.LG,"[cs.LG, cs.AI]","[Bilge Celik, Prabhant Singh, Joaquin Vanschoren]",2022
4,Automated Reinforcement Learning (AutoRL): A S...,2022-01-11 12:41:43+00:00,4,http://arxiv.org/pdf/2201.03916v1,cs.LG,[cs.LG],"[Jack Parker-Holder, Raghu Rajan, Xingyou Song...",2022
...,...,...,...,...,...,...,...,...
5327,Reinforcement Learning-Based Deadline and Batt...,2022-01-25 14:42:29+00:00,4509,http://arxiv.org/pdf/2201.10361v2,cs.NI,[cs.NI],"[Anne Catherine Nguyen, Turgay Pamuklu, Aisha ...",2022
5328,Resource-efficient Deep Neural Networks for Au...,2022-01-25 14:41:08+00:00,4510,http://arxiv.org/pdf/2201.10360v1,eess.SP,"[eess.SP, cs.CV]","[Johanna Rock, Wolfgang Roth, Mate Toth, Paul ...",2022
5329,Ultra Low-Parameter Denoising: Trainable Bilat...,2022-01-25 14:33:56+00:00,4511,http://arxiv.org/pdf/2201.10345v1,eess.IV,"[eess.IV, cs.CV]","[Fabian Wagner, Mareike Thies, Mingxuan Gu, Yi...",2022
5330,Distributed Image Transmission using Deep Join...,2022-01-25 14:25:26+00:00,4512,http://arxiv.org/pdf/2201.10340v1,cs.IT,"[cs.IT, cs.LG, math.IT]","[Sixian Wang, Ke Yang, Jincheng Dai, Kai Niu]",2022
