# SNSF Grants Similarity: A Word Embeddings Approach

- Gabriel Okasa, Data Team, Swiss National Science Foundation

Outline:

1) download publicly available text data from the SNSF data portal: [data.snf.ch](https://data.snf.ch/)

2) pre-process the texts for the transformer model: English texts only, lower-casing, tokenization, string truncation

3) apply a pre-trained transformer model from [HuggingFace](https://huggingface.co/) and extract the embeddings via the CLS token

4) compute the cosine similarity between the text embeddings

5) rank the texts based on the similarity score

### Library Imports

First, we import the necessary libraries for data wrangling and natural language processing.

In [None]:
# import standard libraries
import os
import platform
import numpy as np
import pandas as pd

# import pytorch and transformers
import torch
from transformers import AutoTokenizer, AutoModel

# import similarity metrics from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
# and language detection
from langdetect import detect

### Setup GPU for faster computing

Running deep learning models, such as the transformer models in this notebook, is more efficient on a GPU. Below we check whether a GPU is available and, if so, set it as the primary device for the computations. Note that running PyTorch on an NVIDIA GPU requires a working CUDA setup; see https://docs.nvidia.com/cuda/ .

In [None]:
# use GPU if available
current_os = platform.system()
# active device based on OS
if current_os == 'Darwin' and torch.backends.mps.is_available():
    # specify device as mps for Mac, but only if the MPS backend is actually available
    device = torch.device('mps')
    print('MPS will be used as a device.')
elif torch.cuda.is_available():
    # check if gpu is available, if yes use cuda, if not stick to cpu
    # install CUDA here: https://pytorch.org/get-started/locally/
    # must be 'cuda:0', not just 'cuda', see: https://github.com/deepset-ai/haystack/issues/3160
    device = torch.device('cuda:0')
    print('GPU', torch.cuda.get_device_name(0), 'is available and will be used as a device.')
else:
    device = torch.device('cpu')
    print('No GPU available, CPU will be used as a device instead. '
          + 'Be aware that the computation time increases significantly.')

### Data Import and Pre-Processing

Data on SNSF grants is publicly available through the SNSF Data Portal: [data.snf.ch](https://data.snf.ch/), including text data, namely titles and abstracts of the grants.

In [None]:
# download the text data from the SNSF data portal
data = pd.read_csv('https://data.snf.ch/exportcsv/GrantWithAbstracts.csv', sep=';')

We first perform some data wrangling: we remove missing values and non-English texts, concatenate and lower-case the titles and abstracts, and restrict the dataset to grants from the year 2023 for demonstration purposes.

In [None]:
# clean data, remove missing values, subset data
data = data[['Title', 'Abstract', 'GrantNumber', 'CallDecisionYear']]
# drop NAs as some grants do not have an abstract, e.g. not yet started grants
data = data.dropna()

# for demonstration purposes reduce only to year 2023 (copy to avoid the pandas SettingWithCopyWarning below)
data = data[data['CallDecisionYear'] == 2023].copy()

# concatenate titles and abstracts
data['TitleAbstract'] = data.Title + '. ' + data.Abstract
# lower case
data['TitleAbstract'] = data.TitleAbstract.str.lower()
# detect language of titles and abstracts
data['Lang'] = data.TitleAbstract.apply(detect)
# keep only english texts
data = data[data.Lang == 'en']
# extract texts as a list
texts = data.TitleAbstract.tolist()

### Model and Tokenizer

To create a numerical representation of the texts, we rely on the word embeddings approach via transformers (Vaswani et al., 2017). In particular, we use the SPECTER model (Cohan et al., 2020), a BERT-type model (Devlin et al., 2018) that has been pre-trained specifically on scientific texts and further augmented by a citation graph.
In this word embedding approach, the numerical representation of the text in vector space is based on the full text sequence and takes the context into account thanks to the self-attention mechanism of transformers. The transformer vectorization results in high-dimensional dense vectors. To access pre-trained models such as SPECTER and use them for extracting the contextual embeddings, we rely on the HuggingFace platform: https://huggingface.co/ (Wolf et al., 2019). In what follows, we specify the model name and load the corresponding tokenizer and the model itself. Note that for transformer models, no particular text pre-processing is necessary, except lower-casing.

In [None]:
# specify the model name (SPECTER: BERT model pre-trained on scientific texts and augmented by a citation graph)
model_name = 'allenai/specter2_base'
# load the tokenizer and the model from HuggingFace and pass it to device
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
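
To illustrate how self-attention lets every token representation depend on the whole sequence, here is a minimal NumPy sketch of single-head scaled dot-product attention (Vaswani et al., 2017). It is a toy under stated assumptions: the weight matrices `W_q`, `W_k`, `W_v` are random rather than learned, and the sizes (4 tokens, 8 dimensions) are arbitrary. The actual model stacks many multi-head attention layers with trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention for X of shape [seq_len, d_model]."""
    # project the inputs to queries, keys and values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # attention weights: how strongly each token attends to every other token
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # each output row is a weighted average of all value vectors
    return weights @ V, weights

# toy inputs (assumption: random values instead of learned parameters)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

contextual, weights = self_attention(X, W_q, W_k, W_v)
print(contextual.shape)     # one contextual vector per token
print(weights.sum(axis=1))  # each row of attention weights sums to 1
```

Because each output row mixes the value vectors of all tokens, the representation of a word changes with its surrounding text, which is what "contextual embedding" refers to.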

We then apply the BERT tokenizer to the grant texts. Due to the limited context window of the pre-trained model, the texts are truncated to a maximum length of 512 tokens.

In [None]:
# storage for encoded texts
encoded_text = []
# tokenize the texts one by one
for text in texts:
    # tokenize and send the resulting tensors to device
    encoded_text.append(tokenizer(text,
                                  max_length=512, # for BERT models the context window is limited to 512 tokens
                                  truncation=True, # truncate the texts exceeding 512 tokens
                                  padding='max_length', # pad shorter texts to 512 tokens
                                  return_tensors="pt").to(device))
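
To make the truncation and padding step concrete, the following minimal sketch mimics what happens to a sequence of token ids. The helper `pad_or_truncate`, the example ids and the pad id of 0 are illustrative assumptions, not part of the HuggingFace API. The real tokenizer also handles special tokens such as CLS and SEP when truncating.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Hypothetical helper: return (input_ids, attention_mask), both of length max_length."""
    # truncate sequences that exceed the context window
    ids = list(token_ids)[:max_length]
    # real tokens get mask 1, padding positions get mask 0
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# a short sequence is padded, a long one is truncated (max_length=5 for readability)
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)
print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

The attention mask tells the model which positions hold real tokens, so that padding does not influence the embeddings.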

### Word Embeddings

We then pass the tokenized texts through the pre-trained model and extract the so-called CLS token from the last hidden layer. The CLS token is a special token that provides an aggregate sequence representation (Devlin et al., 2018; Cohan et al., 2020), which forms the text embedding. An alternative representation is the so-called mean pooling, i.e. averaging the embeddings of all tokens from the last hidden layer.

In [None]:
# extract the CLS token (first special token summarizing the sentence level embeddings in BERT models)
# storage for embeddings
embeddings = {}
# run the inputs through the model (sequentially to not overload CUDA memory in Jupyter)
for text_idx in range(len(encoded_text)):
    # first get the model output
    with torch.no_grad():
        output = model(**encoded_text[text_idx])
    # first element of the output contains all token embeddings (last hidden state)
    token_embeddings = output[0]
    # extract the first out of 512 tokens, i.e. the so-called CLS token (token_embeddings has shape [1, 512, 768])
    cls_token = token_embeddings[:,0,:]
    # normalize the CLS tokens with L2 norm to get similarity as dot product
    cls_token = torch.nn.functional.normalize(cls_token, p=2, dim=1)
    # pass back to CPU and convert to numpy array
    embeddings[data.GrantNumber.iloc[text_idx]] = cls_token.detach().to('cpu').numpy()[0]
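
The mean pooling alternative mentioned above needs some care here, because every text is padded to 512 tokens: a plain average over all positions would also include the padding. The following NumPy sketch shows a masked average on toy data. The function name `masked_mean_pool` and the toy arrays are assumptions for illustration. In the notebook, the mask would presumably come from the `attention_mask` entry returned by the tokenizer.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) positions.

    token_embeddings: array of shape [batch, seq_len, hidden]
    attention_mask:   array of shape [batch, seq_len], 1 = token, 0 = padding
    """
    # broadcast the mask over the hidden dimension
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    # sum only the embeddings of real tokens
    summed = (token_embeddings * mask).sum(axis=1)
    # number of real tokens per text (clipped to avoid division by zero)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# toy example: one text, four positions (the last two are padding), hidden size 2
toy_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
toy_mask = np.array([[1, 1, 0, 0]])
pooled = masked_mean_pool(toy_embeddings, toy_mask)
print(pooled)  # [[2. 3.]]
```

The padding values (9.0) do not affect the result, which is the average of the two real token vectors only.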


We extract the embeddings and reformat them into a standard pandas dataframe.

In [None]:
# and save as pandas dataframe
embeddings = pd.DataFrame(embeddings.items(), columns=['GrantNumber', 'TextEmbedding'])
# and explode the list of embeddings into separate columns
embeddings = pd.DataFrame(embeddings["TextEmbedding"].tolist(),
                          columns=list(range(token_embeddings.shape[2])),
                          index=embeddings["GrantNumber"].to_list())

### Similarity Metric

To compare the grants represented by the SPECTER vectors, we compute the cosine similarity between the vectors.

In [None]:
# compute the similarity matrix among all grant texts
text_similarity = pd.DataFrame(cosine_similarity(embeddings),
                               columns=data.GrantNumber).set_index(data.GrantNumber)
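
For reference, the cosine similarity of two vectors u and v is their dot product divided by the product of their L2 norms. For vectors that are already L2-normalized, as the CLS embeddings above are, it reduces to the plain dot product. The sketch below reproduces this with NumPy on toy vectors. The function name `cosine_similarity_matrix` and the data are illustrative assumptions, and the notebook itself uses the scikit-learn implementation.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    # scale every row to unit L2 norm (clipped to avoid division by zero)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.clip(norms, 1e-12, None)
    # for unit vectors, the dot product equals the cosine similarity
    return X_unit @ X_unit.T

# toy vectors: rows 0 and 1 point in the same direction, row 2 is orthogonal to them
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

Vectors pointing in the same direction get a similarity of 1 regardless of their length, and orthogonal vectors get 0.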

### Ranking

To retrieve the most similar grants relative to a target grant of interest, we rank-order the grants according to their cosine similarity. In what follows, let us review the text of an example grant:

In [None]:
# give grant number for which the best matches should be found
grant_number = 221462
# print the text of this grant
print("Text of the grant n. " + str(grant_number) + ": "
      + str(data.TitleAbstract[data.GrantNumber == grant_number].to_list()[0]))

We sort the similarity scores of all the other grants and return the top 5 most similar grants with their corresponding texts of titles and abstracts.

In [None]:
# search top 5 most similar grants (1st one is the grant itself)
top5_grants = text_similarity[grant_number].sort_values(ascending=False)[1:6].index.to_list()
# and print the texts
print("Top 5 most similar texts of grants are the following: " + "\n")
# loop through top5
for grant_idx in range(len(top5_grants)):
    print("Top " + str(grant_idx+1) + " most similar text is of the grant n. " + str(top5_grants[grant_idx]) + ": "
          + str(data.TitleAbstract[data.GrantNumber == top5_grants[grant_idx]].to_list()[0]))

### Summary

- the HuggingFace platform enables easy text pre-processing and embedding extraction for transformer approaches such as BERT-type models
- the SPECTER model captures the context of the grant texts thanks to its specific pre-training on scientific texts and a citation prediction task
- the end-to-end pipeline for grant similarity retrieval can be implemented with just a few lines of code

### References

- Cohan, A., Feldman, S., Beltagy, I., Downey, D., & Weld, D. S. (2020). SPECTER: Document-level representation learning using citation-informed transformers. arXiv preprint arXiv:2004.07180.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
- Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... & Rush, A. M. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.