# Link Prediction w/ n2v

In [1]:
# !pip install arxiv

You can install the arxiv package in Python with the following command:  
`pip install arxiv`  
or follow the instructions here : https://pypi.org/project/arxiv/  

## What is Link Prediction?



## Cold Start Problem in Recommendation Systems

In [68]:
import networkx as nx
import pandas as pd
import numpy as np
import arxiv

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef, confusion_matrix, classification_report
from itertools import product
from sklearn.metrics.pairwise import cosine_similarity
from node2vec import Node2Vec as n2v

In [3]:
# constants
queries = [
    'automl', 'machinelearning', 'data', 'phyiscs','mathematics', 'recommendation system', 'nlp', 'neural networks'
]

# Fetch Data

We want to hit th Arxiv API to gather some information about the latest research papers based on the queries we've identified above. This will allow us to then create a network from this research paper data and then we can try to predict links on that network. For the purposes of this article, I will search for a maximum of 1000 results per query, but you don't have to set yourself to the same constraints. The Arxiv API allows users to hit up to 300,000 results per query. The function outlined below will generate a CSV fetching the following information :   
```'title', 'date', 'article_id', 'url', 'main_topic', 'all_topics', 'authors', 'year'```   
You are able to fetch more information like the `links, summary, article` but I decided not to since those features won't really be used for the purposes of this analysis and tutorial.

For reference to the Arxiv API, you can find their detailed documentation here : https://arxiv.org/help/api/user-manual

In [4]:
def search_arxiv(queries, max_results = 1000):
    '''
    This function will search arxiv associated to a set of queries and store
    the latest 10000 (max_results) associated to that search.
    
    params:
        queries (List -> Str) : A list of strings containing keywords you want
                                to search on Arxiv
        max_results (Int) : The maximum number of results you want to see associated
                            to your search. Default value is 1000, capped at 300000
                            
    returns:
        This function will return a DataFrame holding the following columns associated
        to the queries the user has passed. 
            `title`, `date`, `article_id`, `url`, `main_topic`, `all_topics`
    
    example:
        research_df = search_arxiv(
            queries = ['automl', 'recommender system', 'nlp', 'data science'],
            max_results = 10000
        )
    '''
    d = []
    searches = []
    # hitting the API
    for query in queries:
        search = arxiv.Search(
          query = query,
          max_results = max_results,
          sort_by = arxiv.SortCriterion.SubmittedDate,
          sort_order = arxiv.SortOrder.Descending
        )
        searches.append(search)
    
    # Converting search result into df
    for search in searches:
        for res in search.results():
            data = {
                'title' : res.title,
                'date' : res.published,
                'article_id' : res.entry_id,
                'url' : res.pdf_url,
                'main_topic' : res.primary_category,
                'all_topics' : res.categories,
                'authors' : res.authors
            }
            d.append(data)
        
    d = pd.DataFrame(d)
    d['year'] = pd.DatetimeIndex(d['date']).year
    
    # change article id from url to integer
    unique_article_ids = d.article_id.unique()
    article_mapping = {art:idx for idx,art in enumerate(unique_article_ids)}
    d['article_id'] = d['article_id'].map(article_mapping)
    return d

In [5]:
%%time
research_df = search_arxiv(
    queries = queries,
    max_results = 100
)
research_df.shape

CPU times: user 987 ms, sys: 46.1 ms, total: 1.03 s
Wall time: 7.79 s


(646, 8)

If you're having trouble querying the data, for reproducibility purposes, the CSV I used for the analysis conducted in this article was uploaded to my GitHub which you can find here. https://github.com/vatsal220/medium_articles/blob/main/link_prediction/data/arxiv_data.csv

## Generate Network

Now that we've fetched the data using the Arxiv API, we can generate a network. The network will have the following structure, nodes will be the article_ids and the edges will be all topics connecting a pair of articles. For example, article_id 1 with the following topics `astro-physics, and stats` can be connected to article_id 10 with the topic `stats` and article_id 7 with the topics `astro-physics, math`. This will be a multi-edge network where each edge will hold a weight of 1.

In [6]:
def generate_network(df, node_col = 'article_id', edge_col = 'main_topic'):
    '''
    This function will generate a article to article network given an input DataFrame.
    It will do so by creating an edge_dictionary where each key is going to be a node
    referenced by unique values in node_col and the values will be a list of other nodes
    connected to the key through the edge_col.
    
    params:
        df (DataFrame) : The dataset which holds the node and edge columns
        node_col (String) : The column name associated to the nodes of the network
        edge_col (String) : The column name associated to the edges of the network
        
    returns:
        A networkx graph corresponding to the input dataset
        
    example:
        generate_network(
            research_df,
            node_col = 'article_id',
            edge_col = 'main_topic'
        )
    '''
    edge_dct = {}
    for i,g in df.groupby(node_col):
        topics = g[edge_col].unique()
        edge_df = df[(df[node_col] != i) & (df[edge_col].isin(topics))]
        edges = list(edge_df[node_col].unique())
        edge_dct[i] = edges
    
    # create nx network
    g = nx.Graph(edge_dct, create_using = nx.MultiGraph)
    return g

In [7]:
all_tp = research_df.explode('all_topics').copy()

In [8]:
%%time
tp_nx = generate_network(
    all_tp, 
    node_col = 'article_id', 
    edge_col = 'all_topics'
)

CPU times: user 443 ms, sys: 34.5 ms, total: 477 ms
Wall time: 490 ms


In [9]:
print(nx.info(tp_nx))

Name: 
Type: Graph
Number of nodes: 570
Number of edges: 26178
Average degree:  91.8526


In [10]:
%%time
research_nx = generate_network(
    research_df, 
    node_col = 'article_id', 
    edge_col = 'main_topic'
)

CPU times: user 351 ms, sys: 3.65 ms, total: 354 ms
Wall time: 353 ms


In [11]:
print(nx.info(research_nx))

Name: 
Type: Graph
Number of nodes: 570
Number of edges: 10698
Average degree:  37.5368


## Node2Vec

This component will cover running node2vec on the graph generated above and creating the associated node embeddings for that network. These embeddings will play a crucial role coming up as they're the main features necessary for building a link prediction model.

In [12]:
%time g_emb = n2v(research_nx, dimensions=16)

Computing transition probabilities:   0%|          | 0/570 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 10/10 [00:10<00:00,  1.01s/it]

CPU times: user 13.7 s, sys: 143 ms, total: 13.9 s
Wall time: 13.9 s





In [13]:
WINDOW = 1 # Node2Vec fit window
MIN_COUNT = 1 # Node2Vec min. count
BATCH_WORDS = 4 # Node2Vec batch words

In [14]:
mdl = g_emb.fit(
    window=WINDOW,
    min_count=MIN_COUNT,
    batch_words=BATCH_WORDS
)

In [15]:
input_node = '1'
for s in mdl.wv.most_similar(input_node, topn = 10):
    print(s)

('300', 0.6815548539161682)
('395', 0.6645826697349548)
('109', 0.6567216515541077)
('363', 0.6363925337791443)
('204', 0.6336174607276917)
('371', 0.6288920640945435)
('139', 0.6258803009986877)
('458', 0.6248643398284912)
('566', 0.6243730187416077)
('274', 0.6211979389190674)


## Generate Embeddings DataFrame

In [16]:
emb_df = (
    pd.DataFrame(
        [mdl.wv.get_vector(str(n)) for n in research_nx.nodes()],
        index = research_nx.nodes
    )
)

In [17]:
emb_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,-0.511957,-0.580585,-0.626385,-0.827639,0.904748,0.777848,0.182476,-1.217629,-0.396125,-1.085773,-0.119347,0.081139,-0.253311,0.879596,0.55627,0.284939
1,-1.5491,0.13405,-1.983533,-1.282142,0.762598,0.978624,0.186682,0.436884,-1.926421,-1.661931,-0.821661,-2.232859,-1.446645,-0.479345,1.687179,-1.335484
2,-0.48835,-0.587227,-0.625464,-0.782213,0.869252,0.836026,0.185684,-1.158214,-0.469872,-1.117828,-0.063733,0.220845,-0.207277,0.864076,0.567868,0.222027
3,-0.520829,-0.603254,-0.620381,-0.765303,0.90781,0.850281,0.182577,-1.161945,-0.394557,-1.037542,-0.090025,0.194334,-0.269508,0.866738,0.623747,0.283081
4,-0.536191,-0.615628,-0.488054,-0.758666,0.870371,0.816817,0.229364,-1.192694,-0.459576,-1.160274,-0.103331,0.272849,-0.131484,0.901864,0.59828,0.257405


## Recommendations w/ Distance Measures

Now that we have a embedding vector representing each node in the network, we can then use distance measures like cosine similarity, euclidean distance, manhattan distance, etc. to measure the amount of distance between nodes. The assumption that we're making by using these distance measures is that nodes in close proximity with each other should also have an edge connecting each other. This is a good assumption to make as node2vec tries to preserve the initial structure of the original input graph. Now we can essentially write out code to measure similarity levels between two vectors using cosine similarity (or a different distance measure) and identify pairs of nodes which don't currently have an edge between them but do have a large similarity should create an edge between them. This interpretation can be different for multi / weighted / directed graphs. Pick and use a similarity measure appropriate to the network and problem you're trying to solve. Also be aware that different measures have different interperations, for this problem you want to pick maximal cosine similarity scores whereas if you were to use something like euclidean distance, you would want to pick the minimal distance between two vectors.

On a side note, I do want to mention that the curse of dimensionality is rampant when solving these types of problems. It's especially problematic when using euclidean distance in particular to measure the distance between vectors in higher dimensions. The term higher dimensions is broad and open to interpretation, the threshold for a dimension to be "high" is not strictly defined and varies from problem to problem. Without going to deep into the mathematics behind things, euclidean distance is not a good measure to use for sparse or high dimensional vectors. You can reference this post on stack exchange which outlines the mathematical reasoning as to why this is the case. 

- https://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data-and-more-general-case
- https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions

For more on the curse of dimensionality, refer to this [paper](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) by the Computer Science department from the University of Washington.

In [18]:
def predict_links(G, df, article_id, N):
    '''
    This function will predict the top N links a node (article_id) should be connected with
    which it is not already connected with in G.
    
    params:
        G (Netowrkx Graph) : The network used to create the embeddings
        df (DataFrame) : The dataframe which has embeddings associated to each node
        article_id (Integer) : The article you're interested 
        N (Integer) : The number of recommended links you want to return
        
    returns:
        This function will return a list of nodes the input node should be connected with.
    '''
    
    # separate target article with all others
    article = df[df.index == article_id]
    
    # other articles are all articles which the current doesn't have an edge connecting
    all_nodes = G.nodes()
    other_nodes = [n for n in all_nodes if n not in list(G.adj[article_id]) + [article_id]]
    other_articles = df[df.index.isin(other_nodes)]
    
    # get similarity of current reader and all other readers
    sim = cosine_similarity(article, other_articles)[0].tolist()
    idx = other_articles.index.tolist()
    
    # create a similarity dictionary for this user w.r.t all other users
    idx_sim = dict(zip(idx, sim))
    idx_sim = sorted(idx_sim.items(), key=lambda x: x[1], reverse=True)
    
    similar_articles = idx_sim[:N]
    articles = [art[0] for art in similar_articles]
    return articles

In [19]:
predict_links(G = research_nx, df = emb_df, article_id = 1, N = 10)

[300, 395, 109, 363, 204, 371, 139, 458, 566, 274]

## Modelling Based Recommendations

In this section we're going to build a binary classifier to predict the probability of a pair of edges to be connected or not. To do this we first need to identify all pairs of nodes which can form an edge, and identify the subset of those pairs which have already have an edge between them in the original network. We can then combine the embeddings (through vector addition) associated to all possible edges and pass that into a train test split function to generate training and testing partitions to pass into a classification model. 

The model I'll be using for this tutorial will be a gradient boosting classifier, for your own problems & experiments I advise you to try out a variety of different classifiers and selected the overall best performing one. We are building a binary classifier, however in almost all situations regarldess of the input data / network there will be a class imbalance on this classifier. Since we're taking all possible cominbations of edges which can be formed in the network, that result is N^N (where N is the number of nodes in the network), this is exponentially large. The only way that this wouldn't be a binary classifier is if your graph is already almost fully connected.

In [20]:
unique_nodes = list(research_nx.nodes())

In [38]:
%time all_possible_edges = [(x,y) for (x,y) in product(unique_nodes, unique_nodes)]

CPU times: user 42.4 ms, sys: 11.8 ms, total: 54.3 ms
Wall time: 57.6 ms


In [37]:
len(all_possible_edges)

162165

In [39]:
len(all_possible_edges)

324900

In [48]:
%%time
edge_features = [
    (mdl.wv.get_vector(str(i)) + mdl.wv.get_vector(str(j))) for i,j in all_possible_edges
]

CPU times: user 734 ms, sys: 32.4 ms, total: 766 ms
Wall time: 780 ms


In [40]:
edges = list(research_nx.edges())

In [43]:
%time is_con = [1 if e in edges else 0 for e in all_possible_edges]

CPU times: user 1min 4s, sys: 227 ms, total: 1min 4s
Wall time: 1min 4s


In [44]:
sum(is_con)

10698

## Train Model

In [49]:
X = np.array(edge_features)
y = is_con

In [66]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [52]:
%%time
# classifier
clf = GradientBoostingClassifier()

# train the model
clf.fit(x_train, y_train)

CPU times: user 1min 57s, sys: 351 ms, total: 1min 57s
Wall time: 1min 57s


GradientBoostingClassifier()

## Model Performance

When dealing with class imbalances we can't look at traditional accuracy measures to evaluate the performance of the model. This is because if the model continously predicts the class which has the greater amount of data, then regardless of the model never predicting the other class, it will still achieve a high level of accuracy. The way to combat this is to use measures like Matthews Correlation Coefficient. 

```quote
The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.
```
- https://en.wikipedia.org/wiki/Phi_coefficient

In [55]:
y_pred = clf.predict(x_test)
y_true = y_test

In [57]:
y_pred = clf.predict(x_test)
x_pred = clf.predict(x_train)
test_acc = accuracy_score(y_test, y_pred)
train_acc = accuracy_score(y_train, x_pred)
print("Testing Accuracy : ", test_acc)
print("Training Accuracy : ", train_acc)

Testing Accuracy :  0.9658356417359187
Training Accuracy :  0.9673704729660408


In [56]:
matthews_corrcoef(y_true, y_pred)

0.059086223731077636

In [59]:
confusion_matrix(y_pred,y_test)

array([[78408,  2667],
       [  108,    42]])

In [69]:
print("Test Classification Report : ")
print(classification_report(y_test, clf.predict(x_test)))

Test Classification Report : 
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     94216
           1       0.02      0.00      0.00      3254

    accuracy                           0.96     97470
   macro avg       0.49      0.50      0.49     97470
weighted avg       0.93      0.96      0.95     97470



## Generate Predictions

In [64]:
pred_ft = [(mdl.wv.get_vector(str('42'))+mdl.wv.get_vector(str('210')))]
clf.predict(pred_ft)[0]

0

Clearly as you can see that this is a poor performing model for it's given task (reflected by the MCC, ft-score, recall and percision). But that's alright since this article was simply for an educational purpose. Not every approach you use to solve a given problem will pan out or work, often times its not at fault of the approach but rather the data. In the case of this tutorial, the data was definetly at fault since I was only using a small sample of a actual research network which would be available if I scraped Arxiv for more data. But doing that would also increase the computational complexity to a lot of different components of this article (like running node2vec, generating all possible pairs of edges, training the model, etc.). 

The rule of thumb most data scientists follow when solving problems is that if the simple solutions work, then the more complex solutions should also work. Simpler solutions in recommendation systems involving collaborative filtering / content based are often quite easy to implement and yield relatively informative results and indications whether or not a more complex solution would work for solving this problem as well. Its often not the best to throw a neural network at the problem (like node2vec) for a variety of reasons (like training time, interpretability, model inference time, etc.). 

For a guideline on how to implement the simpler solutions in recommendation systems, you can refer to another article I've written [here](). Furthermore, if you want to follow through in the jupyter notebook associated to this project, you can reference my [GitHub]().

---