# Training the paper recommendation model using the features Title, Abstract, Keywords and Scopes (TAKS)

This notebook trains a paper recommendation model on four features: Title, Abstract, Keywords, and Scopes (TAKS). It covers data preparation, feature selection, tokenization, and the creation of data loaders. The model consists of a sentence embedder and a classifier that takes the journals' Aims & Scopes embeddings as external features. Training uses the AdamW optimizer and evaluates the model on the validation set after every epoch. Finally, the model is evaluated on a separate test dataset and the final results are reported.

Outline:
1. Import necessary packages
2. Data preparation
3. Feature selection
4. Tokenization
5. Data loader
6. Model definition
    - Pooler layer
    - Sentence Embedder
    - Load fine-tuned LM
    - Model for downstream task
7. Training
    - Optimizer and Loss function
    - Training settings
    - Training loop
8. Testing
9. Final results

## Import necessary packages

In [1]:
import os # file management

import torch
from torch import nn, Tensor # neural network
from torch.nn import functional as F

# numerical matrix processing
import numpy as np 
from numpy import ndarray

# data/parameter loading
import pandas as pd 
import pickle

# visualization
from tqdm.notebook import trange, tqdm

# transformers
from transformers import AutoTokenizer, AutoModel
from transformers.modeling_outputs import BaseModelOutputWithPoolingAndCrossAttentions

# type hints
from typing import Union, List, Dict
# filter out warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
import torch

if torch.backends.mps.is_available():
    print("MPS (Apple GPU) is available.")
else:
    print("MPS (Apple GPU) is not available.")

MPS (Apple GPU) is available.


## Some useful functions

In [3]:
# Utils
def save_parameter(save_object, save_file):
    with open(save_file, 'wb') as f:
        pickle.dump(save_object, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_parameter(load_file):
    with open(load_file, 'rb') as f:
        output = pickle.load(f)
    return output

def sim_matrix(a, b, eps=1e-8):
    """
    Calculate cosine similarity between two matrices. 
    Note: added eps for numerical stability
    """
    a_n, b_n = a.norm(dim=1)[:, None], b.norm(dim=1)[:, None]
    a_norm = a / torch.clamp(a_n, min=eps)
    b_norm = b / torch.clamp(b_n, min=eps)
    sim_mt = torch.mm(a_norm, b_norm.transpose(0, 1))
    return sim_mt

def batch2device(batch, device):
    """
    Transfer batch of training to GPU/CPU
    Args:
        batch: Dict[str, Tensor], represent for transformer input (input_ids, attention_mask)
        device: torch.device, GPU or CPU
    """
    for key, value in batch.items():
        batch[key] = batch[key].to(device)
    return batch

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# GPU accelerator
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
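
As a sanity check on what `sim_matrix` returns, the sketch below restates it in plain Python for two small lists of vectors. The helper name `cosine_matrix` and the toy vectors are illustrative assumptions, not part of the notebook. The point is that the result has one row per vector in `a` and one column per vector in `b`, which is why the classifier later receives one similarity feature per journal for every paper.

```python
import math

def cosine_matrix(a, b, eps=1e-8):
    """Plain-Python restatement of sim_matrix: rows of `a` versus rows of `b`."""
    def unit(v):
        # Clamp the norm from below, mirroring torch.clamp(norm, min=eps)
        norm = max(math.sqrt(sum(x * x for x in v)), eps)
        return [x / norm for x in v]

    a_unit = [unit(v) for v in a]
    b_unit = [unit(v) for v in b]
    # Entry [i][j] is the dot product of the i-th unit row of a and the j-th unit row of b
    return [[sum(x * y for x, y in zip(u, w)) for w in b_unit] for u in a_unit]

# Toy data (hypothetical): 2 "paper" vectors and 3 "journal aims" vectors
papers = [[1.0, 0.0], [1.0, 1.0]]
aims = [[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]]

sims = cosine_matrix(papers, aims)
print(len(sims), len(sims[0]))          # rows = papers, columns = aims
print([round(s, 4) for s in sims[1]])   # similarities of the second paper
```

The all-zero vector shows what `eps` is for: without the clamp, its norm of zero would cause a division by zero.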


# Data preparation

In [4]:
# working dir
work_path = "./"  # current directory
checkpoint_path = "./checkpoint/"

In [7]:
data_train = pd.read_csv("./data/preprocess_data/biology_train.csv", encoding = "ISO-8859-1")
data_validate = pd.read_csv("./data/preprocess_data/biology_validate.csv", encoding = "ISO-8859-1")
data_test = pd.read_csv("./data/preprocess_data/biology_test.csv", encoding = "ISO-8859-1")
data_aims = pd.read_csv("./data/preprocess_data/biology_aims.csv", encoding = "ISO-8859-1")

data_train.fillna("", inplace=True)
data_validate.fillna("", inplace=True)
data_test.fillna("", inplace=True)
data_aims.fillna("", inplace=True)

n_classes = len(data_aims)

## Feature selection

In [8]:
X_train = (
    data_train['Title']  
    + " " + data_train['Abstract']
    + " " + data_train['Keywords']
    ).tolist()
X_valid = (
    data_validate['Title']  
    + " " + data_validate['Abstract']
    + " " + data_validate['Keywords']
    ).tolist()
X_test = (
    data_test['Title']  
    + " " + data_test['Abstract']
    + " " + data_test['Keywords']
    ).tolist()

X_aims = data_aims["Aims"].tolist()

Y_train = data_train['Label'].tolist()
Y_validate = data_validate['Label'].tolist()
Y_test = data_test['Label'].tolist()

## Tokenization

In [9]:
pretrain_model = 'BAAI/bge-small-en-v1.5'
pretrain_model_dimension = 384

In [10]:
tokenizer = AutoTokenizer.from_pretrained(pretrain_model)

In [11]:
train_encodings = tokenizer(
    X_train,
    truncation=True,
    padding="max_length",
    max_length=300,
    return_tensors="pt"
)
valid_encodings = tokenizer(
    X_valid,
    truncation=True,
    padding="max_length",
    max_length=300,
    return_tensors="pt"
)
test_encodings = tokenizer(
    X_test,
    truncation=True,
    padding="max_length",
    max_length=300,
    return_tensors="pt"
)

# save_parameter(train_encodings, checkpoint_path + "pickle/distilroberta_training_encodings.pickle")
# save_parameter(valid_encodings, checkpoint_path + "pickle/distilroberta_valid_encodings.pickle")
# save_parameter(test_encodings, checkpoint_path + "pickle/distilroberta_test_encodings.pickle")

# train_encodings = load_parameter(checkpoint_path + "pickle/distilroberta_training_encodings.pickle")
# valid_encodings = load_parameter(checkpoint_path + "pickle/distilroberta_valid_encodings.pickle")
# test_encodings = load_parameter(checkpoint_path + "pickle/distilroberta_test_encodings.pickle")
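
With `padding="max_length"` and `truncation=True`, every sample ends up with exactly `max_length` token ids plus an attention mask that marks the real tokens. The sketch below mimics that on a plain list of ids. `pad_or_truncate` is a hypothetical helper, and the pad id of 0 is taken from the `padding_idx=0` shown in the model printout later in this notebook. The real tokenizer also keeps the closing `[SEP]` token when it truncates, which this simplification ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]                   # truncation
    n_pad = max_length - len(kept)                  # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded, a long one is cut (toy ids, max_length=6 for readability)
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what the `avg` pooling variants use later to ignore padded positions.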

## Data loader

In [12]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

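    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them
        # directly; wrapping them in torch.tensor again copies the data and
        # emits a UserWarning.
        x = {key: val[idx] for key, val in self.encodings.items()}
        y = torch.tensor(self.labels[idx])
        return x, y

    def __len__(self):
        return len(self.labels)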

In [13]:
# Dataset
train_dataset = Dataset(train_encodings, Y_train)
valid_dataset = Dataset(valid_encodings, Y_validate)
test_dataset = Dataset(test_encodings, Y_test)

In [14]:
# Data loaders
train_loader = torch.utils.data.DataLoader(train_dataset,
                                         batch_size=64,
                                         shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset,
                                         batch_size=32,
                                         shuffle=False)  # order does not affect validation metrics
test_loader = torch.utils.data.DataLoader(test_dataset,
                                         batch_size=32,
                                         shuffle=False)

# Model definition

## Pooler layer

In [19]:
class Pooler(nn.Module):
    """
    Parameter-free poolers to get the sentence embedding.
    'cls': [CLS] representation of the last layer. No MLP pooler is applied in this
        implementation, so it is identical to 'cls_before_pooler'.
    'cls_before_pooler': [CLS] representation without the original MLP pooler.
    'avg': attention-masked average of the last layer's hidden states over all tokens.
    'avg_top2': attention-masked average of the last two layers.
    'avg_first_last': attention-masked average of the first and the last layers.
    """
    def __init__(self, pooler_type):
        super().__init__()
        self.pooler_type = pooler_type
        assert self.pooler_type in ["cls", "cls_before_pooler", "avg", "avg_top2", "avg_first_last"], "unrecognized pooling type %s" % self.pooler_type

    def forward(self, attention_mask, outputs):
        last_hidden = outputs.last_hidden_state
        hidden_states = outputs.hidden_states

        if self.pooler_type in ['cls_before_pooler', 'cls']:
            return last_hidden[:, 0]
        elif self.pooler_type == "avg":
            return ((last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1))
        elif self.pooler_type == "avg_first_last":
            first_hidden = hidden_states[0]
            last_hidden = hidden_states[-1]
            pooled_result = ((first_hidden + last_hidden) / 2.0 * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)
            return pooled_result
        elif self.pooler_type == "avg_top2":
            second_last_hidden = hidden_states[-2]
            last_hidden = hidden_states[-1]
            pooled_result = ((last_hidden + second_last_hidden) / 2.0 * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)
            return pooled_result
        else:
            raise NotImplementedError
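
The `avg` branch is a masked mean: hidden states at padded positions are zeroed out by the attention mask, and the sum is divided by the number of real tokens rather than by the padded length. Below is a minimal plain-Python sketch for a single sentence, using made-up numbers and a hypothetical helper named `masked_mean`.

```python
def masked_mean(hidden, mask):
    """Average the token vectors where mask == 1; hidden is [seq_len][dim]."""
    dim = len(hidden[0])
    n_real = sum(mask)  # number of non-padding tokens
    totals = [0.0] * dim
    for vec, m in zip(hidden, mask):
        for d in range(dim):
            totals[d] += vec[d] * m  # padded positions contribute nothing
    return [t / n_real for t in totals]

# Three token vectors; the last one is padding and must not affect the result
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean(hidden, mask)
print(pooled)  # [2.0, 3.0]
```

The notebook itself uses `cls_before_pooler`, which simply takes the first token's vector, so the mask only matters for the `avg` variants.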


## Sentence Embedder

In [20]:
class ModelForSE(nn.Module):
    def __init__(self, model_name_or_path, pooler_type):
        """Model for sentence embedding."""
        super(ModelForSE, self).__init__()
        self.bert = AutoModel.from_pretrained(model_name_or_path)
        self.pooler_type = pooler_type
        self.pooler = Pooler(self.pooler_type)
        
    def forward(self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        mlm_input_ids=None,
        mlm_labels=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=self.pooler_type in ['avg_top2', 'avg_first_last'],
            return_dict=return_dict,
        )
        if self.pooler_type in ["cls", "cls_before_pooler", "avg", "avg_top2", "avg_first_last"]:
            pooler_output = self.pooler(attention_mask, outputs)
        
        return BaseModelOutputWithPoolingAndCrossAttentions(
            pooler_output=pooler_output,
            last_hidden_state=outputs.last_hidden_state,
            hidden_states=outputs.hidden_states,
        )
    def encode(self, sentences: Union[str, List[str]],
               batch_size: int = 8,
               show_progress_bar: bool = None,
               convert_to_numpy: bool = True,
               convert_to_tensor: bool = False,
               device: str = None) -> Union[List[Tensor], ndarray, Tensor]:
        self.eval()

        if convert_to_tensor:
            convert_to_numpy = False

        input_was_string = False

        if isinstance(sentences, str) or not hasattr(sentences, '__len__'): #Cast an individual sentence to a list with length 1
            sentences = [sentences]
            input_was_string = True

        if device is None:
            device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

        self.to(device)

        all_embeddings = []
        for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
            sentence_batch = sentences[start_index: start_index+batch_size]
            features = tokenizer(sentence_batch,
                       padding='max_length', 
                       truncation=True, 
                       max_length=300,
                       return_tensors='pt').to(device)
            
            with torch.no_grad():
                out_features = self.forward(**features)
                embeddings = []
                # gather the embedding vectors
                for row in out_features.pooler_output:
                    embeddings.append(row.cpu())
                all_embeddings.extend(embeddings)
        if convert_to_tensor:
            all_embeddings = torch.vstack(all_embeddings)
        elif convert_to_numpy:
            all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])
        
        if input_was_string:
            all_embeddings = all_embeddings[0]
        return all_embeddings

## Load fine-tuned LM

In [21]:
# Fine-tuned LM checkpoint (by contrastive learning)
# checkpoint_cl = torch.load(checkpoint_path + "saved_model/Epoch:09 SupCL-DistilRoBERTa.pth")


# Baseline model of sentence embeddings
model_args = {
    "model_name_or_path": pretrain_model,
    "pooler_type": "cls_before_pooler"
}
base_model = ModelForSE(**model_args)
# base_model.load_state_dict(checkpoint_cl["model_state_dict"])

## Model for downstream task

In [22]:
class WithAim_Classifier(nn.Module):
    def __init__(self, base_model, num_classes):
        super(WithAim_Classifier, self).__init__()
        self.base_model = base_model
        self.linear1_1 = nn.Linear(pretrain_model_dimension, 512)
        self.act1_1 = nn.ReLU()
        self.drop1_1 = nn.Dropout(0.1)


        self.linear2_1 = nn.Linear(pretrain_model_dimension, 512)
        self.act2_1 = nn.ReLU()
        

        self.linear_main_1 = nn.Linear(512+num_classes, num_classes)
        self.act_main_1 = nn.LogSoftmax(dim=1)

    def forward(self, inputs_tak, inputs_aims):
        '''
        Args:
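            inputs_tak: (Dict) tokenized batch of TAK texts; each tensor has shape [bs, seq_len]
            inputs_aims: (Tensor) [CLS] embeddings of the journals' Aims & Scopes, shape [n_classes, hidden_size]
        Returns:
            Log-probabilities of shape [bs, n_classes], or the [bs, 512] paper features if inputs_aims is None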
        '''
        output_tak = self.base_model(**inputs_tak)
        last_hidden = output_tak.last_hidden_state[:,0,:] # cls tokens
        x = self.linear1_1(last_hidden)
        x = self.act1_1(x)
        x = self.drop1_1(x)
        
        if inputs_aims is not None: # Aims
            y = self.linear2_1(inputs_aims)
            y = self.act2_1(y)

            cosine_feats = sim_matrix(x, y)
            concat_feats = torch.cat((x, cosine_feats), dim=1)

            out = self.linear_main_1(concat_feats)
            out = self.act_main_1(out)

            return out
        else:
            return x
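
The head ends in `LogSoftmax`, so the model outputs log-probabilities. That is the input `nn.NLLLoss` expects, and the pair is equivalent to cross-entropy on raw scores. The sketch below works through the arithmetic for a single sample in plain Python; the scores are made up and the helper names are illustrative.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll(log_probs, label):
    """Negative log-likelihood of the true class for one sample."""
    return -log_probs[label]

scores = [2.0, 0.5, -1.0]  # hypothetical raw scores for 3 journals
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print([round(p, 3) for p in probs])        # probabilities sum to 1
print(round(nll(log_probs, label=0), 3))   # small loss: class 0 has the highest score
print(round(nll(log_probs, label=2), 3))   # larger loss: class 2 has the lowest score
```

This is also why the training loop applies `torch.exp` to the model output before ranking: it turns log-probabilities back into probabilities. Ranking the log-probabilities directly would give the same order, since `exp` is monotonic.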

In [23]:
model = WithAim_Classifier(base_model, n_classes)
model.to(device)

WithAim_Classifier(
  (base_model): ModelForSE(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 384, padding_idx=0)
        (position_embeddings): Embedding(512, 384)
        (token_type_embeddings): Embedding(2, 384)
        (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=384, out_features=384, bias=True)
                (key): Linear(in_features=384, out_features=384, bias=True)
                (value): Linear(in_features=384, out_features=384, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=384, out_features=384, bias=

In [24]:
print("Model summary:\n")
print(">> Total params: ", count_parameters(model))

Model summary:

>> Total params:  33890498


# Training

First, we encode the journals' Aims & Scopes into embeddings, which the classifier uses as external features during training.

In [37]:
# Computed once with the current encoder weights and kept fixed during training
aims_embeddings = base_model.encode(X_aims, show_progress_bar=True, convert_to_tensor=True)
aims_embeddings = aims_embeddings.to(device)

## Optimizer and Loss function

In [19]:
# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=0.96)

# Loss function
loss_fn = nn.NLLLoss().to(device)
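
`ExponentialLR` multiplies the learning rate by `gamma` each time `lr_scheduler.step()` is called, which in this notebook happens once per epoch. Assuming that calling pattern, the rate in effect during epoch `e` is `5e-5 * 0.96**e`. The few lines below tabulate it in plain Python.

```python
base_lr = 5e-5   # initial learning rate passed to AdamW
gamma = 0.96     # decay factor passed to ExponentialLR
max_epochs = 6   # value used in the training settings

# Learning rate in effect during each epoch, assuming one scheduler step per epoch
lrs = [base_lr * gamma ** epoch for epoch in range(max_epochs)]
for epoch, lr in enumerate(lrs):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```

With these settings the decay is gentle: after six epochs the rate is still above 80% of its starting value.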

## Training settings

In [20]:
max_epochs = 6
topks = [1, 3, 5, 10]
history = {
    "train_loss": [],
    "val_loss": [],
    "train_acc@k": [],
    "val_acc@k": [],
}
min_valid_loss = np.inf
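
The `topks` list sets the cut-offs for accuracy@k: a sample counts as correct at k when its true label is among the k highest-scoring classes. Here is a minimal plain-Python sketch of the metric with made-up scores; `topk_accuracy` is a hypothetical helper, not the notebook's implementation, which counts hits inside the training loop.

```python
def topk_accuracy(scores, labels, ks):
    """Fraction of samples whose true label is among the k best-scored classes."""
    hits = {k: 0 for k in ks}
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        for k in ks:
            if label in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(labels) for k in ks}

# Made-up scores for 3 samples over 4 classes
scores = [
    [0.1, 0.6, 0.2, 0.1],   # true label 1 is ranked first
    [0.5, 0.1, 0.3, 0.1],   # true label 2 is ranked second
    [0.4, 0.3, 0.2, 0.1],   # true label 3 is ranked last
]
labels = [1, 2, 3]
acc = topk_accuracy(scores, labels, ks=[1, 3])
print(acc)
```

Accuracy@k can only grow with k, which matches the pattern in the reported results.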

## Training loop

In [23]:
for epoch in range(max_epochs):
    train_loss = 0.0
    train_loop = tqdm(train_loader, leave=True)
    batch_train_accuracy = {k: 0 for k in topks}
    batch_valid_accuracy = {k: 0 for k in topks}
    num_correct_at_k = {
        "train": {k: 0 for k in topks},
        "val": {k: 0 for k in topks}
    }
    # Training
    model.train()

    for features, labels in train_loop:
                
        # Transfer the batch to the selected device
        features, labels = batch2device(features, device), labels.to(device)
        # forward pass
        logits = model(features, aims_embeddings)
        # Clear the gradients
        optimizer.zero_grad()
        # Find the Loss
        loss = loss_fn(logits, labels)
        # Calculate gradients
        loss.backward()
        # Update Weights
        optimizer.step()
        # Calculate accuracy
        probs_des = torch.argsort(torch.exp(logits), axis=1, descending=True)
        for k in topks:
            batch_num_correct = 0
            nPoints = len(labels)
            for i in range(nPoints):
                if labels[i] in probs_des[i, 0:k]:
                    batch_num_correct += 1
                    num_correct_at_k["train"][k] += 1  # running count of correct predictions at each k over the whole training set
            batch_train_accuracy[k] = batch_num_correct / nPoints
        # Calculate Loss
        train_loss += loss.item()
        train_loop.set_description('Epoch: {0} - lr: {1}, Training'.format(epoch, optimizer.param_groups[0]['lr']))
        train_loop.set_postfix(train_loss=loss.item(), 
                               top01=batch_train_accuracy[1], 
                               top03=batch_train_accuracy[3], 
                               top05=batch_train_accuracy[5],
                               top10=batch_train_accuracy[10])
    train_loss = train_loss/len(train_loader)
    history["train_loss"].append(train_loss)
    history["train_acc@k"].append(
        {k: val/len(X_train) for k, val in num_correct_at_k["train"].items()}
    )

    # Validation
    valid_loss = 0.0
    valid_loop = tqdm(valid_loader, leave=True)
    with torch.no_grad():
        model.eval()
        for features, labels in valid_loop:
            # Transfer the batch to the selected device
            features, labels = batch2device(features, device), labels.to(device)
            # Forward pass
            logits = model(features, aims_embeddings)
            
            # Find the Loss
            loss = loss_fn(logits, labels)
            # Calculate accuracy
            probs_des = torch.argsort(torch.exp(logits), axis=1, descending=True)
            for k in topks:
                num_correct = 0
                nPoints = len(labels)
                for i in range(nPoints):
                    if labels[i] in probs_des[i, 0:k]:
                        num_correct += 1
                        num_correct_at_k["val"][k] += 1 # globally counting number of correct at each k's for whole valid set
                batch_valid_accuracy[k] = num_correct / nPoints
            # Calculate Loss
            valid_loss += loss.item()
            valid_loop.set_description('Epoch: {0} - lr: {1}, Validating'.format(epoch, optimizer.param_groups[0]['lr']))
            valid_loop.set_postfix(val_loss=loss.item(), 
                                val_top01=batch_valid_accuracy[1], 
                                val_top03=batch_valid_accuracy[3], 
                                val_top05=batch_valid_accuracy[5],
                                val_top10=batch_valid_accuracy[10])
        valid_loss = valid_loss/len(valid_loader)
        history["val_loss"].append(valid_loss)
        history["val_acc@k"].append(
            {k: val/len(X_valid) for k, val in num_correct_at_k["val"].items()}
        )
        print(f'>> Epoch {epoch} \t\t Training Loss: {train_loss} \t\t Validation Loss: {valid_loss}')
        lr_scheduler.step()

        if min_valid_loss > valid_loss:
            print(f'Validation Loss Decreased({min_valid_loss:.6f}--->{valid_loss:.6f}) \t Saving The Model')
            min_valid_loss = valid_loss
            saved_path = checkpoint_path + "weight/"
            if not os.path.exists(saved_path):
                os.makedirs(saved_path)
            # Saving State Dict
            torch.save(
                {
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'history': history,
                    'epoch': epoch
                }, saved_path + "Epoch:{:0>2} DistilRoberta_TAKS.pth".format(epoch)
            )

>> Epoch 0 		 Training Loss: 1.7886731448344895 		 Validation Loss: 1.1183531758575476
Validation Loss Decreased(inf--->1.118353) 	 Saving The Model


>> Epoch 1 		 Training Loss: 0.9511508512278974 		 Validation Loss: 0.9060179531264102
Validation Loss Decreased(1.118353--->0.906018) 	 Saving The Model


>> Epoch 2 		 Training Loss: 0.7358793677365194 		 Validation Loss: 0.8766046859457906
Validation Loss Decreased(0.906018--->0.876605) 	 Saving The Model


>> Epoch 3 		 Training Loss: 0.5912459004872384 		 Validation Loss: 0.8286942860345967
Validation Loss Decreased(0.876605--->0.828694) 	 Saving The Model


KeyboardInterrupt: 

# Testing

In [30]:
# Load the model checkpoint for testing
checkpoint = torch.load("./checkpoint/recommendation_checkpoint/biology_recomendation_model/Epoch_02 bge_TAKS_biology.pth", map_location=device)

# Extract the model weights from the checkpoint
state_dict = checkpoint['model_state_dict']

model = WithAim_Classifier(base_model, n_classes)
model.load_state_dict(state_dict)
model.to(device)

history = checkpoint['history']

In [25]:
# Loss function
loss_fn = nn.NLLLoss().to(device)

# Test 
topks = [1, 3, 5, 10]
num_correct_at_k = {}
test_loop = tqdm(test_loader, leave=True)
num_correct_at_k["test"] = {k: 0 for k in topks}
batch_test_accuracy = {k: 0 for k in topks}
history["test_acc@k"] = []
history["test_loss"] = []
test_loss = 0.0

with torch.no_grad():
    model.eval() 
    for features, labels in test_loop:
        # Transfer the batch to the selected device
        features, labels = batch2device(features, device), labels.to(device)
        logits = model(features, aims_embeddings)
        # Find the Loss
        loss = loss_fn(logits, labels)
        # Calculate accuracy
        probs_des = torch.argsort(torch.exp(logits), axis=1, descending=True)
        for k in topks:
            num_correct = 0
            nPoints = len(labels)
            for i in range(nPoints):
                if labels[i] in probs_des[i, 0:k]:
                    num_correct += 1
                    num_correct_at_k["test"][k] += 1  # running count of correct predictions at each k over the whole test set
            batch_test_accuracy[k] = num_correct / nPoints
        # Calculate Loss
        test_loss += loss.item()
        test_loop.set_description('Testing...')
        test_loop.set_postfix(test_loss=loss.item(), 
                            test_top01=batch_test_accuracy[1], 
                            test_top03=batch_test_accuracy[3], 
                            test_top05=batch_test_accuracy[5],
                            test_top10=batch_test_accuracy[10])
    test_loss = test_loss/len(test_loader)
    history["test_loss"].append(test_loss)
    history["test_acc@k"].append(
        {k: val/len(X_test) for k, val in num_correct_at_k["test"].items()}
    )

# Final results

In [26]:
print(">> Final results (Best model): ")
print("\tTraining loss: {}".format(history["train_loss"][-1]))
print("\tValidating loss: {}".format(history["val_loss"][-1]))
print("\tTesting loss: {}".format(history["test_loss"][-1]))
print("\n")
for k in topks:
    print("\tTrain accuracy@{}: {}".format(k, history["train_acc@k"][-1][k]))
print("\n")
for k in topks:
    print("\tValidate accuracy@{}: {}".format(k, history["val_acc@k"][-1][k]))
print("\n")
for k in topks:
    print("\tTest accuracy@{}: {}".format(k, history["test_acc@k"][-1][k]))

>> Final results (Best model): 
	Training loss: 0.5912459004872384
	Validating loss: 0.8286942860345967
	Testing loss: 0.8385795818654117


	Train accuracy@1: 0.8326154877243472
	Train accuracy@3: 0.9397009688451721
	Train accuracy@5: 0.9661643390071062
	Train accuracy@10: 0.9866746828830782


	Validate accuracy@1: 0.7804553518628031
	Validate accuracy@3: 0.905972797161443
	Validate accuracy@5: 0.9425783560023655
	Validate accuracy@10: 0.9744825547013601


	Test accuracy@1: 0.7784808255226043
	Test accuracy@3: 0.9071582744448715
	Test accuracy@5: 0.9434375092398214
	Test accuracy@10: 0.9744833091866001


# Demo

In [98]:
print(data_test.iloc[5][['Title', 'Abstract', 'Keywords']])

Title          single cell proteogenomics immediate prospects
Abstract    recent technical advances genomic technology l...
Keywords    proteomics transcriptomics proteogenomics sing...
Name: 5, dtype: object


In [None]:

def infer_single_data_point(model, data_point, tokenizer, device, aims_embeddings):
    """
    Perform inference on a single data point.

    Args:
        model: The trained model.
        data_point: Text of one paper (title, abstract and keywords joined by spaces).
        tokenizer: The tokenizer used for preprocessing.
        device: The device to run the inference on (CPU or GPU).
        aims_embeddings: Tensor of Aims & Scopes embeddings, one row per journal.

    Returns:
        Tensor of shape [1, n_classes] with class indices sorted from most to least likely.
    """
    # Preprocess the data point
    inputs = tokenizer(data_point, return_tensors='pt', padding=True, truncation=True, max_length=300).to(device)

    # Set the model to evaluation mode
    model.eval()

    # Perform inference; the model returns log-probabilities
    with torch.no_grad():
        logits = model(inputs, aims_embeddings)

    # Sort class indices from most to least likely
    probs_des = torch.argsort(torch.exp(logits), dim=1, descending=True)

    return probs_des



In [111]:
# Example usage
# index = int(input("Enter index: "))
title = input("Enter paper title: ")
abstract = input("Enter paper abstract: ")
keywords = input("Enter paper keywords: ")
data_point = title + " " + abstract + " " + keywords
# data_point = X_test[index]
print("Paper information: ", data_point)
# print("Paper information: ", data_test.iloc[index][['Title', 'Abstract', 'Keywords']])

# Load the tokenizer and reuse the trained `model` loaded in the Testing section.
# Creating a new WithAim_Classifier here would reset the classifier head to random weights.
tokenizer = AutoTokenizer.from_pretrained(pretrain_model)

# Perform inference on a single data point
result = infer_single_data_point(model, data_point, tokenizer, device, aims_embeddings)

# Load journal data
journal_data = pd.read_csv("./data/biology/biology_journal.csv", encoding="ISO-8859-1")

# Get the top 5 journal recommendations
list_journal_recommend = result[0, 0:5].tolist()
recommendation = []
print('-----------------')
# Print and store the recommendations
print("Recommendation top 5 journal: ")
for i in list_journal_recommend:
    itr = data_test[data_test['Label'] == i]['itr'].unique()[0]
    journal_title = journal_data[journal_data['itr'] == itr]['journal_title'].unique()[0]
    journal_aims = journal_data[journal_data['itr'] == itr]['aims'].unique()[0]
    text = f"- Journal Title: {journal_title} \n Journal Aims: {journal_aims}"
    recommendation.append(text)
    print(text)

Paper information:  Single Cell Proteogenomics — Immediate Prospects Recent technical advances in genomic technology have led to the explosive growth of transcriptome-wide studies at the level of single cells. The review describes the first steps of the single cell proteomics that has originated soon after development of transcriptomics methods. The first studies on the shotgun proteomics of single cells that used liquid chromatography/mass spectrometry have been already published. In these works, the cells were separated by the methods used in transcriptomics studies (e.g., cell sorting) and analyzed by modified mass spectrometry with tandem mass tags. The new proteogenomics approach involving integration of single cell transcriptomics and proteomics data will provide better understanding of the mechanisms of cell interactions in normal development and disease. proteomics transcriptomics proteogenomics single cell analysis tandem mass tag (TMT) mass spectrometry
-----------------
Reco