# Generate Labels for Node Classification Task

We use the venue of a paper as a proxy for the paper's general domain. To extract labels from the venue, we use a fine-tuned BERTopic model, tuned on arXiv papers in the fields of machine learning and computer science. Note that the system's initial retrieval step relies on abstract-based embeddings, so we wanted the labels used to train the GNN to depend on those embeddings as little as possible; hence the choice of the venue.

## Imports, Constants and Setup

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import torch
from torch.utils.data import DataLoader, Dataset
import numpy as np
from tqdm import tqdm
from bertopic import BERTopic


In [2]:
BATCH_SIZE = 256

In [3]:
acm_df = pd.read_csv('/home/student/FinalProject/PaperFeedback/Datasets/acm_citation_network_v8.csv')

In [4]:
acm_df['venue'] = acm_df['venue'].fillna('')
acm_df['venue'] = acm_df['venue'].astype(str)
acm_df['id'] = range(len(acm_df))

In [5]:
acm_df[['id', 'venue']]

Unnamed: 0,id,venue
0,0,INFORMS Journal on Computing
1,1,Theoretical Computer Science
2,2,Theoretical Computer Science
3,3,Graphics Interface 1990
4,4,Using program slicing in software maintenance
...,...,...
2381670,2381670,The QCP File Format and Media Types for Speech...
2381671,2381671,Multicast Source Discovery Protocol (MSDP)
2381672,2381672,RTP Control Protocol Extended Reports (RTCP XR)
2381673,2381673,Uniform Resource Identifier (URI) Scheme and A...


## Embedding Generation Process

BERTopic requires initial text embeddings for the documents, which the model then clusters. To embed efficiently, we use the SentenceTransformers model <code>all-MiniLM-L6-v2</code> and encode the documents in batches, using custom torch dataset and dataloader objects.

In [6]:
class DocDataset(Dataset):
    """Yields (id, text) pairs taken from one text column of a DataFrame."""

    def __init__(self, df, doc_field):
        self.df = df
        self.doc_field = doc_field

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        # Look the row up once instead of once per field.
        row = self.df.iloc[index]
        return row['id'], row[self.doc_field]

In [7]:
doc_dataset = DocDataset(acm_df, doc_field='venue')
doc_dataloader = DataLoader(doc_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=8, pin_memory=True)
if torch.cuda.is_available():
    print('using GPU')
    device = 'cuda'
else:
    print('using CPU')
    device = 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2')
model = model.to(device)

using GPU


In [8]:
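doc_embeddings = []
with torch.no_grad():
    # Each batch is (ids, venue strings); only the strings are encoded.
    for _, batch in tqdm(doc_dataloader):
        batch_embeddings = model.encode(batch, convert_to_tensor=True, show_progress_bar=False)
        # Move the batch to the CPU and append one embedding row per document.
        doc_embeddings.extend(batch_embeddings.cpu().numpy())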

100%|██████████| 9304/9304 [12:42<00:00, 12.21it/s]
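
The iteration count in the progress bar can be checked by hand: a `DataLoader` that keeps the final partial batch yields ceil(N / batch_size) batches. The sketch below assumes the frame has 2,381,674 rows (the last displayed index plus one). `num_batches` is a hypothetical helper used only for this illustration; it is not part of the notebook.

```python
import math


def num_batches(n_rows: int, batch_size: int) -> int:
    """Number of batches when the last, partial batch is kept."""
    return math.ceil(n_rows / batch_size)


N_ROWS = 2_381_674  # assumption: last displayed index (2381673) + 1
BATCH_SIZE = 256    # same value as the notebook constant

total = num_batches(N_ROWS, BATCH_SIZE)
# Size of the final batch, which is smaller than BATCH_SIZE here.
last_batch = N_ROWS - (total - 1) * BATCH_SIZE
print(total, last_batch)
```

Under that assumption the result is 9304 batches, matching the progress bar, with 106 documents in the last one.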


In [9]:
doc_embeddings = np.array(doc_embeddings)

In [14]:
docs = acm_df['venue'].to_list()
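
Because `docs` and `doc_embeddings` are built in separate cells, it may be worth confirming that they are row-aligned before passing both to the topic model. The following is a minimal sketch using toy stand-ins rather than the real data. `check_alignment` is a hypothetical helper, and the expected width of 384 is the output dimension of `all-MiniLM-L6-v2`.

```python
import numpy as np


def check_alignment(docs, embeddings, expected_dim=384):
    """Return the embedding shape, or raise ValueError on a mismatch."""
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {embeddings.ndim} dimension(s)")
    if embeddings.shape[0] != len(docs):
        raise ValueError(
            f"{len(docs)} documents but {embeddings.shape[0]} embedding rows"
        )
    if embeddings.shape[1] != expected_dim:
        raise ValueError(
            f"expected width {expected_dim}, got {embeddings.shape[1]}"
        )
    return embeddings.shape


# Toy stand-ins (assumptions); in the notebook these would be docs and doc_embeddings.
toy_docs = ["Theoretical Computer Science", "", "Graphics Interface 1990"]
toy_embeddings = np.zeros((3, 384), dtype=np.float32)

shape = check_alignment(toy_docs, toy_embeddings)
print(shape)
```

In the notebook itself the call would be `check_alignment(docs, doc_embeddings)`.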


## BERTopic Model Labeling

We chose <code><a href='https://huggingface.co/etanios/short-arxiv-bertopic'>short-arxiv-bertopic</a></code> as the labeling model. The model assigns each document to one of 38 topics from the domains of machine learning and computer science, which matches our corpus.

In [None]:
topic_model = BERTopic.load("etanios/short-arxiv-bertopic")

In [None]:
topics, probs = topic_model.transform(documents=docs, embeddings=doc_embeddings)

In [23]:
acm_df['topic'] = topics
acm_df.columns

Index(['index', 'title', 'authors', 'year', 'venue', 'references', 'abstract',
       'id', 'topic'],
      dtype='object')
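
Before the `topic` column is used as node labels, it may help to inspect how the labels are distributed. BERTopic reserves topic `-1` for documents it treats as outliers. Whether this model emits `-1` for venue strings is not shown in the notebook, so the sketch below runs on a toy frame with made-up values purely to illustrate the check.

```python
import pandas as pd

# Toy stand-in for acm_df[['id', 'topic']]; the values are invented for illustration.
toy = pd.DataFrame({"id": range(6), "topic": [3, -1, 3, 7, -1, 3]})

# Number of documents per topic, ordered by topic id.
counts = toy["topic"].value_counts().sort_index()

# Fraction of documents assigned to the outlier topic (-1).
outlier_share = (toy["topic"] == -1).mean()

# One option (an assumption, not the notebook's method): drop outliers before training.
labeled = toy[toy["topic"] != -1]

print(counts)
print(f"outlier share: {outlier_share:.2%}, labeled rows: {len(labeled)}")
```

A heavily skewed distribution or a large outlier share would be worth knowing about before training a node classifier on these labels.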

In [27]:
acm_df.to_csv('/home/student/FinalProject/PaperFeedback/Datasets/acm_citation_network_v8_labeled.csv')