# Node Classification

In [None]:
# !pip install arxiv

You can install the arxiv package in Python with the following command:   
`pip install arxiv`   
or follow the instructions here : https://pypi.org/project/arxiv/

## What is Node Classification?  
Node classification is a common application of machine learning on graphs. Generally, you train a classification model to learn which class a certain node is apart of. This approach is common for both binary and multiclass classification [1]. In binary classification you're dealing with two different classes wheras multiclass classificaiton you are dealing with more than 2 different classes. In the context of this tutorial, we are going to use node2vec to generate node embeddings of the network. Node2vec is designed to preserve the initial structure within the original graph.  

## Problem Statement   
Given the main topic of research papers published on arXiv, we will build a pipeline which will train a model to classify a research paper based on its main topic.  

## Solution Architecture  
We will being by creating a network with nodes as articles and edges connecting these nodes based on the main topic connecting a pair of articles. After creating this network we will use node2vec to generate node embeddings associated to each artile. Finally, we can map the node embeddings associated to each node to its associated topic. The embeddings can be passed on as features and the main topic as the target to train a classification model. 

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import arxiv

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef, confusion_matrix, classification_report
from node2vec import Node2Vec as n2v

In [2]:
# constants
queries = [
    'automl', 'machinelearning', 'data', 'phyiscs','mathematics', 'recommendation system', 'nlp', 'neural networks'
]

## Fetch Data

We want to hit th Arxiv API to gather some information about the latest research papers based on the queries we've identified above. This will allow us to then create a network from this research paper data and then we can try to classify nodes on that network. For the purposes of this article, I will search for a maximum of 1000 results per query, but you don't have to set yourself to the same constraints. The Arxiv API allows users to hit up to 300,000 results per query. The function outlined below will generate a CSV fetching the following information :
'title', 'date', 'article_id', 'url', 'main_topic', 'all_topics', 'authors', 'year'
You are able to fetch more information like the links, summary, article but I decided not to since those features won't really be used for the purposes of this analysis and tutorial.

For reference to the Arxiv API, you can find their detailed documentation here : https://arxiv.org/help/api/user-manual

In [3]:
def search_arxiv(queries, max_results = 100):
    '''
    This function will search arxiv associated to a set of queries and store
    the latest 10000 (max_results) associated to that search.
    
    params:
        queries (List -> Str) : A list of strings containing keywords you want
                                to search on Arxiv
        max_results (Int) : The maximum number of results you want to see associated
                            to your search. Default value is 1000, capped at 300000
                            
    returns:
        This function will return a DataFrame holding the following columns associated
        to the queries the user has passed. 
            `title`, `date`, `article_id`, `url`, `main_topic`, `all_topics`
    
    example:
        research_df = search_arxiv(
            queries = ['automl', 'recommender system', 'nlp', 'data science'],
            max_results = 10000
        )
    '''
    d = []
    searches = []
    # hitting the API
    for query in queries:
        search = arxiv.Search(
          query = query,
          max_results = max_results,
          sort_by = arxiv.SortCriterion.SubmittedDate,
          sort_order = arxiv.SortOrder.Descending
        )
        searches.append(search)
    
    # Converting search result into df
    for search in searches:
        for res in search.results():
            data = {
                'title' : res.title,
                'date' : res.published,
                'article_id' : res.entry_id,
                'url' : res.pdf_url,
                'main_topic' : res.primary_category,
                'all_topics' : res.categories,
                'authors' : res.authors
            }
            d.append(data)
        
    d = pd.DataFrame(d)
    d['year'] = pd.DatetimeIndex(d['date']).year
    
    # change article id from url to integer
    unique_article_ids = d.article_id.unique()
    article_mapping = {art:idx for idx,art in enumerate(unique_article_ids)}
    d['article_id'] = d['article_id'].map(article_mapping)
    return d

In [4]:
%%time
research_df = search_arxiv(
    queries = queries,
    max_results = 250
)
research_df.shape

CPU times: user 2.9 s, sys: 188 ms, total: 3.09 s
Wall time: 1min 9s


(1546, 8)

If you're having trouble querying the data, for reproducibility purposes, the CSV I used for the analysis conducted in this article was uploaded to my GitHub which you can find here. https://github.com/vatsal220/medium_articles/blob/main/link_prediction/data/arxiv_data.csv

## Generate Network

Now that we've fetched the data using the Arxiv API, we can generate a network. The network will have the following structure, nodes will be the article_ids and the edges will be all topics connecting a pair of articles. For example, article_id 1 with the following topics astro-physics, and `stats` can be connected to article_id 10 with the topic stats and article_id 7 with the topics `astro-physics`, `math`. This will be a multi-edge network where each edge will hold a weight of 1.

In [5]:
def generate_network(df, node_col, edge_col):
    '''
    This function will generate a article to article network given an input DataFrame.
    It will do so by creating an edge_dictionary where each key is going to be a node
    referenced by unique values in node_col and the values will be a list of other nodes
    connected to the key through the edge_col.
    
    params:
        df (DataFrame) : The dataset which holds the node and edge columns
        node_col (String) : The column name associated to the nodes of the network
        edge_col (String) : The column name associated to the edges of the network
        
    returns:
        A networkx graph corresponding to the input dataset
        
    example:
        generate_network(
            research_df,
            node_col = 'article_id',
            edge_col = 'main_topic'
        )
    '''
    edge_dct = {}
    for i,g in df.groupby(node_col):
        topics = g[edge_col].unique()
        edge_df = df[(df[node_col] != i) & (df[edge_col].isin(topics))]
        edges = list(edge_df[node_col].unique())
        edge_dct[i] = edges
    
    # create nx network
    g = nx.Graph(edge_dct, create_using = nx.MultiGraph)
    return g

In [6]:
%%time
tp_nx = generate_network(
    research_df, 
    node_col = 'article_id', 
    edge_col = 'main_topic'
)

CPU times: user 898 ms, sys: 16.4 ms, total: 914 ms
Wall time: 939 ms


In [7]:
print(nx.info(tp_nx))

Name: 
Type: Graph
Number of nodes: 1350
Number of edges: 66703
Average degree:  98.8193


## Apply Node2Vec

This component will cover running node2vec on the graph generated above and creating the associated node embeddings for that network. These embeddings will play a crucial role coming up as they're the main features necessary for building a node classification model.

In [8]:
%time g_emb = n2v(tp_nx, dimensions=16)

Computing transition probabilities:   0%|          | 0/1350 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 10/10 [00:27<00:00,  2.78s/it]

CPU times: user 1min 20s, sys: 686 ms, total: 1min 21s
Wall time: 1min 21s





In [9]:
WINDOW = 1 # Node2Vec fit window
MIN_COUNT = 1 # Node2Vec min. count
BATCH_WORDS = 4 # Node2Vec batch words

In [10]:
mdl = g_emb.fit(
    window=WINDOW,
    min_count=MIN_COUNT,
    batch_words=BATCH_WORDS
)

In [11]:
emb_df = (
    pd.DataFrame(
        [mdl.wv.get_vector(str(n)) for n in tp_nx.nodes()],
        index = tp_nx.nodes
    )
)

In [12]:
emb_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.982184,-0.418723,0.648704,0.777569,1.278544,-0.68207,-0.140218,-0.352046,-0.15425,-0.430749,0.095327,0.495061,0.123365,-0.055955,-0.018547,1.452183
1,1.009432,-0.445544,0.703062,0.828961,1.277971,-0.544692,-0.077958,-0.37426,-0.158336,-0.590008,0.082956,0.530169,0.090699,-0.133567,-0.039257,1.416739
2,0.964131,-0.389936,0.690117,0.866124,1.258624,-0.66062,-0.128217,-0.401386,-0.151005,-0.549149,0.070321,0.506534,0.066263,-0.155309,-0.049878,1.412636
3,0.948457,-0.382216,0.711337,0.833575,1.244138,-0.752523,-0.095578,-0.392389,-0.137476,-0.570683,0.031246,0.55727,0.073654,-0.155646,0.006648,1.400579
4,0.910574,-0.306636,0.748167,0.808214,1.259955,-0.739075,-0.202448,-0.436149,-0.051608,-0.510787,-0.043116,0.593685,0.033465,-0.159935,-0.001556,1.427669


In [13]:
emb_df = emb_df.merge(
    research_df[['article_id', 'main_topic']].set_index('article_id'),
    left_index = True,
    right_index = True
)

In [14]:
ft_cols = emb_df.drop(columns = ['main_topic']).columns.tolist()
target_col = 'main_topic'

In [15]:
# train test split
x = emb_df[ft_cols].values
y = emb_df[target_col].values

x_train, x_test, y_train, y_test = train_test_split(
    x, 
    y,
    test_size = 0.3
)

## Train Model

In [16]:
%%time
# GBC classifier
clf = GradientBoostingClassifier()

# train the model
clf.fit(x_train, y_train)

CPU times: user 38.5 s, sys: 177 ms, total: 38.7 s
Wall time: 38.8 s


GradientBoostingClassifier()

## Evaluate Model

In [17]:
def clf_eval(clf, x_test, y_test):
    '''
    This function will evaluate a sk-learn multi-class classification model based on its
    x_test and y_test values
    
    params:
        clf (Model) : The model you wish to evaluate the performance of
        x_test (Array) : Result of the train test split
        y_test (Array) : Result of the train test split
    
    returns:
        This function will return the following evaluation metrics:
            - Accuracy Score
            - Matthews Correlation Coefficient
            - Classification Report
            - Confusion Matrix
    
    example:
        clf_eval(
            clf,
            x_test,
            y_test
        )
    '''
    y_pred = clf.predict(x_test)
    y_true = y_test
    
    y_pred = clf.predict(x_test)
    x_pred = clf.predict(x_train)
    test_acc = accuracy_score(y_test, y_pred)
    print("Testing Accuracy : ", test_acc)
    
    print("MCC Score : ", matthews_corrcoef(y_true, y_pred))
    
    print("Classification Report : ")
    print(classification_report(y_test, clf.predict(x_test)))
    
    print(confusion_matrix(y_pred,y_test))

In [18]:
%%time
clf_eval(
    clf,
    x_test,
    y_test
)

Testing Accuracy :  0.9094827586206896
MCC Score :  0.903356554232795
Classification Report : 
                    precision    recall  f1-score   support

       astro-ph.CO       0.00      0.00      0.00         3
       astro-ph.EP       1.00      1.00      1.00         6
       astro-ph.GA       0.46      1.00      0.63         6
       astro-ph.HE       1.00      1.00      1.00         2
       astro-ph.IM       0.33      1.00      0.50         1
       astro-ph.SR       1.00      1.00      1.00         2
   cond-mat.dis-nn       1.00      1.00      1.00         1
 cond-mat.mes-hall       1.00      1.00      1.00         3
 cond-mat.mtrl-sci       1.00      1.00      1.00         6
cond-mat.quant-gas       1.00      1.00      1.00         1
     cond-mat.soft       0.00      0.00      0.00         0
cond-mat.stat-mech       0.75      1.00      0.86         3
   cond-mat.str-el       1.00      0.86      0.92         7
 cond-mat.supr-con       1.00      1.00      1.00         1
    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Predictions

In [19]:
pred_ft = [mdl.wv.get_vector(str('21'))]
clf.predict(pred_ft)[0]

'cs.LG'

## Concluding Remarks


## Resources
- [1] https://neo4j.com/docs/graph-data-science/current/algorithms/ml-models/node-classification/#:~:text=Node%20Classification%20is%20a%20common,classification%20problems%3A%20binary%20and%20multiclass.  

---