# SNSF Grants Similarity: A Bag-of-Words Approach

- Gabriel Okasa, Data Team, Swiss National Science Foundation

Outline:

1) download publicly available text data from the SNSF data portal: [data.snf.ch](https://data.snf.ch/)

2) pre-process the texts for the tf-idf model: English texts only, lowercasing, stop word and punctuation removal, stemming, n-grams

3) apply the tf-idf weighting model and extract the tf-idf vectors

4) compute the cosine similarity between the tf-idf vectors

5) rank the texts based on the similarity score

### Library Imports

First, we import the necessary libraries for data wrangling and natural language processing.

In [None]:
# import standard libraries
import numpy as np
import pandas as pd

# import NLP/text libraries
import nltk
import string

# import tfidf vectorizer and similarity metrics from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# and language detection
from langdetect import detect

### Stop Words Dictionary Setup

In order to filter out stop words from the texts, we need to download the dictionary of stopwords available in the 'nltk' package (Bird et al., 2009).

In [None]:
# download / update stopwords dictionary
nltk.download('stopwords')

### Data Import and Pre-Processing

Data on SNSF grants is publicly available through the SNSF Data Portal: [data.snf.ch](https://data.snf.ch/), including text data, namely titles and abstracts of the grants.

In [None]:
# download the text data from the SNSF data portal
data = pd.read_csv('https://data.snf.ch/exportcsv/GrantWithAbstracts.csv', sep=';')

We first perform some data wrangling: we remove grants with missing values, restrict the dataset to grants from the year 2023 for demonstration purposes, concatenate and lower-case the titles and abstracts, and keep only English texts.

In [None]:
# clean data, remove missing values, subset data
data = data[['Title', 'Abstract', 'GrantNumber', 'CallDecisionYear']]
# drop NAs as some grants do not have an abstract, e.g. not yet started grants
data = data.dropna()

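# for demonstration purposes reduce only to year 2023
# (take a copy so that the column assignments below do not act on a view of the original data)
data = data[data['CallDecisionYear'] == 2023].copy()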

# concatenate titles and abstracts
data['TitleAbstract'] = data.Title + '. ' + data.Abstract
# lower case
data['TitleAbstract'] = data.TitleAbstract.str.lower()
# detect language of titles and abstracts
data['Lang'] = data.TitleAbstract.apply(detect)
# keep only english texts
data = data[data.Lang == 'en']
# extract texts as a list
texts = data.TitleAbstract.tolist()

### Text Processing and Tokenizer

We begin the text pre-processing by removing the punctuation.

In [None]:
# remove punctuation (string.punctuation: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
for text_idx in range(len(texts)):
   texts[text_idx] = texts[text_idx].translate(str.maketrans('', '', string.punctuation))
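
To see what this step does to a concrete string, consider the small made-up sentence below. Deleting punctuation merges hyphenated words into a single token, whereas replacing punctuation with spaces would split them. The second variant is shown only for comparison and is not what the notebook uses.

```python
import string

sample = "state-of-the-art models (e.g. tf-idf) don't need gpus."

# variant used in the notebook: delete every ASCII punctuation character
delete_table = str.maketrans('', '', string.punctuation)
deleted = sample.translate(delete_table)

# alternative for comparison (not the notebook's choice): replace punctuation by spaces
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = ' '.join(sample.translate(space_table).split())

print(deleted)  # stateoftheart models eg tfidf dont need gpus
print(spaced)   # state of the art models e g tf idf don t need gpus
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes or dashes in the abstracts would remain in the text.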

Next, we create the so-called unigrams by splitting each text into separate words (tokens), removing stop words, and stemming the remaining words.

In [None]:
# create tokens as unigrams while removing stop words, use nltk english stopwords list
# load the stop words once into a set (fast membership tests, no reloading per word)
stop_words = set(nltk.corpus.stopwords.words('english'))
# create empty list for storage
tokens_unigrams = list()
# and loop through all texts
for text_idx in range(len(texts)):
   tokens_unigrams.append([word for word in texts[text_idx].split() if word not in stop_words])
# perform stemming on unigrams
# load the Porter stemmer: https://www.nltk.org/api/nltk.stem.porter.html
ps = nltk.stem.PorterStemmer()
# and loop through all texts
for token_idx in range(len(tokens_unigrams)):
   # use again list comprehension to perform stemming word by word
   tokens_unigrams[token_idx] = [ps.stem(word) for word in tokens_unigrams[token_idx]]
   # keep only unigrams that have at least 2 characters, as otherwise the unigrams are not informative
   tokens_unigrams[token_idx] = [word for word in tokens_unigrams[token_idx] if len(word) > 1 ]
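
As a quick illustration of what stemming does, the sketch below applies the same Porter stemmer to a few example words chosen by us. Stems are truncated word forms rather than dictionary words, which helps when reading the TF-IDF output later on.

```python
import nltk

# the Porter stemmer is purely rule-based, so no additional download is needed
ps = nltk.stem.PorterStemmer()

# example words (our own choice, not taken from the grant texts)
words = ['grants', 'running', 'studies']
stems = [ps.stem(word) for word in words]

# print each word next to its stem
for word, stem in zip(words, stems):
    print(word, '->', stem)
```

Different inflections of a word are thus mapped to a common token, for example 'studies' becomes 'studi', which reduces the size of the vocabulary.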

In addition to the unigrams, we create short word combinations called n-grams, here bigrams and trigrams (n = 2 and n = 3). As with the unigrams, we apply stemming, but we keep the stop words.

In [None]:
# tokenize the text into n-grams as well, n is a tuning parameter (do not remove stop words here)
n_grams = 3
# only if n-grams are desired
if n_grams > 1:
   # create empty list for storage
   tokens_ngrams = list()
   # perform stemming first for all texts
   for text_idx in range(len(texts)):
      # stem words in text
      tokens_ngrams.append([ps.stem(word) for word in texts[text_idx].split()])
      # create n-grams from 2 up to n
      tokens_ngrams[text_idx] = nltk.everygrams(tokens_ngrams[text_idx], 2, n_grams)
      # and join each n-gram tuple into a single string to get a list of strings
      tokens_ngrams[text_idx] = [' '.join(ngram) for ngram in tokens_ngrams[text_idx]]
else:
   # otherwise return just an empty list
   tokens_ngrams = list()
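
To make the output of this step concrete, here is a dependency-free sketch that builds the same kind of n-grams for a short list of stemmed words. The helper `ngrams_up_to` is hypothetical and only mimics what `nltk.everygrams` is used for above; the order of the returned n-grams may differ from NLTK's, which does not matter for a bag-of-words model.

```python
def ngrams_up_to(words, min_n=2, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n as strings."""
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of length n over the word sequence
        for start in range(len(words) - n + 1):
            grams.append(' '.join(words[start:start + n]))
    return grams

# made-up example of an already stemmed text
example = ['cosin', 'similar', 'of', 'grant']
example_ngrams = ngrams_up_to(example)
print(example_ngrams)
# ['cosin similar', 'similar of', 'of grant', 'cosin similar of', 'similar of grant']
```

A text with m words yields m-1 bigrams and m-2 trigrams, so the n-grams considerably enlarge the vocabulary.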

Finally, we concatenate the unigrams with n-grams to complete the tokenization process.

In [None]:
# concatenate unigrams with n-grams to create the final vector of tokens
tokens = [(list_unigrams + list_ngrams) for list_unigrams, list_ngrams in zip(tokens_unigrams, tokens_ngrams)]

### TF-IDF Model

In order to create a numerical representation of the tokens, we apply the so-called TF-IDF (Term Frequency – Inverse Document Frequency) weighting (Sparck Jones, 1972). TF-IDF is a bag-of-words approach: each text is represented in vector space by the tokens it contains, ignoring the order in which they appear. This corresponds to the tokenization procedure conducted above. The TF-IDF weighting then puts a higher weight on tokens that appear frequently within one document but rarely across documents. The resulting vectors are high-dimensional and sparse, with one dimension per distinct token in the corpus. Despite its simplicity, TF-IDF vectorization has proven to be very effective in text similarity tasks (see, e.g., Shahmirzadi et al., 2019).

In [None]:
# compute tfidf via scikit-learn
# initiate the vectorizer (identity function for tokenizer and preprocessor as we already tokenized the texts)
# token_pattern=None avoids a warning, as the pattern is unused when a tokenizer is supplied
# specify l2 norm to get cosine similarity as dot product
tfidf = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x, token_pattern=None, use_idf=True, norm='l2')
# compute the tf-idf vector
tfidf_vector = tfidf.fit_transform(tokens)

We can inspect the most important words based on the TF-IDF weighting:

In [None]:
# check the first element of the tf-idf vector for the tokens with largest weight
tfidf_df = pd.DataFrame(tfidf_vector[0].T.todense(), columns=["tf-idf"], index = tfidf.get_feature_names_out())
# sort the values
tfidf_df = tfidf_df.sort_values('tf-idf', ascending=False)
# check top 10
tfidf_df.head(10)

### Similarity Metric

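In order to compare the grants represented by the TF-IDF vectors, we compute the cosine similarity between the vectors, i.e. the cosine of the angle between them. Since TF-IDF weights are non-negative, the score lies between 0 (no shared tokens) and 1 (identical direction).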

In [None]:
# compute the similarity matrix among all grant texts
text_similarity = pd.DataFrame(cosine_similarity(tfidf_vector),
                               columns=data.GrantNumber).set_index(data.GrantNumber)
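
The following sketch, using two made-up vectors, illustrates why the l2 normalisation in the vectorizer matters: for vectors of unit length, the cosine similarity reduces to a plain dot product.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# two made-up non-negative vectors standing in for tf-idf vectors
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 1.0, 1.0]])

# cosine similarity from its definition: dot product divided by the product of the norms
cos_manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# after l2 normalisation the plain dot product gives the same number
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
cos_dot = (a_unit @ b_unit.T).item()

# scikit-learn normalises internally, so it accepts the raw vectors
cos_sklearn = cosine_similarity(a, b)[0, 0]

print(np.isclose(cos_manual, cos_dot), np.isclose(cos_manual, cos_sklearn))
```

Because the TF-IDF vectors are already l2-normalised, a sparse product such as `tfidf_vector @ tfidf_vector.T` should give the same scores as `cosine_similarity(tfidf_vector)`, although we have not benchmarked the two here.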

### Ranking

To retrieve the most similar grants relative to a target grant of interest, we rank-order the grants according to their cosine similarity. In what follows, let us review the text of an example grant:

In [None]:
# give grant number for which the best matches should be found
grant_number = 221462
# print the text (title and abstract) of this grant
print("Text of the grant n. " + str(grant_number) + ": " +
      str(data.TitleAbstract[data.GrantNumber == grant_number].to_list()[0]))

We sort the similarity scores of all the other grants and return the top 5 most similar grants with their corresponding texts of titles and abstracts.

In [None]:
# search top 5 most similar grants (1st one is the grant itself)
top5_grants = text_similarity[grant_number].sort_values(ascending=False)[1:6].index.to_list()
# and print the texts
print("Top 5 most similar texts of grants are the following: " + "\n")
# loop through top5
for grant_idx in range(len(top5_grants)):
    print("Top " + str(grant_idx+1) + " most similar text is of the grant n. " + str(top5_grants[grant_idx]) + ": "
          + str(data.TitleAbstract[data.GrantNumber == top5_grants[grant_idx]].to_list()[0]))
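
Slicing with `[1:6]` relies on the target grant being ranked first. This may not hold if another grant has an identical text and hence also a similarity of 1. A slightly more defensive variant, sketched below on a made-up similarity matrix, drops the target explicitly before ranking; the helper `top_k_similar` and the toy grant numbers are hypothetical.

```python
import pandas as pd

# made-up similarity matrix for four hypothetical grant numbers
# (grants 101 and 104 have identical texts, hence a similarity of 1.0)
ids = [101, 102, 103, 104]
toy_similarity = pd.DataFrame(
    [[1.0, 0.2, 0.7, 1.0],
     [0.2, 1.0, 0.1, 0.2],
     [0.7, 0.1, 1.0, 0.7],
     [1.0, 0.2, 0.7, 1.0]],
    index=ids, columns=ids)

def top_k_similar(similarity, target, k=5):
    """Return the k highest scores for `target`, excluding the target itself."""
    return similarity[target].drop(index=target).nlargest(k)

top2 = top_k_similar(toy_similarity, 101, k=2)
print(top2)
```

Applied to the notebook's data, `top_k_similar(text_similarity, grant_number)` would return the five best matches together with their scores.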

### Summary

- data and text pre-processing is an integral part of bag-of-words approaches such as TF-IDF
- TF-IDF weighting applies a higher weight to words and combinations of words that appear often within a grant text but rarely across the grant texts
- the end-to-end pipeline for grant similarity retrieval can be implemented with just a few lines of code

### References

- Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.
- Shahmirzadi, O., Lugowski, A., & Younge, K. (2019, December). Text similarity in vector space models: a comparative study. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 659-666). IEEE.
- Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.