# SNSF Review Classification: Fine-tuned Hugging Face Models

- Gabriel Okasa, Data Team, Swiss National Science Foundation

Outline:

1) load text data from a grant peer review report

2) pre-process the texts for the transformer models: English-language check, lower-casing, sentence segmentation, tokenization

3) apply fine-tuned transformer models from [HuggingFace](https://huggingface.co/snsf-data) and classify sentences

4) aggregate the classified categories to the review level

5) compute the shares of each classified category in a review

### Library Imports

First, we import the necessary libraries for data wrangling and natural language processing.

In [None]:
# import standard libraries
import os
import platform
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# import pytorch and transformers with the relevant functions
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

# import text processing and language detection
import re
from langdetect import detect

### Setup GPU for faster computing

Running deep learning models, such as the transformer models in this notebook, is more efficient on a GPU. Below, we check whether a GPU is available and, if so, set it as the primary device for the computations. Note that running PyTorch on an NVIDIA GPU requires a CUDA-enabled setup; see https://docs.nvidia.com/cuda/ .

In [None]:
# use a GPU if available
current_os = platform.system()
# select the active device based on the OS
if current_os == 'Darwin' and torch.backends.mps.is_available():
    # use the Metal Performance Shaders (MPS) backend on Mac, if it is available
    device = torch.device('mps')
    print('MPS will be used as a device.')
elif torch.cuda.is_available():
    # CUDA-enabled PyTorch installation: https://pytorch.org/get-started/locally/
    # must be 'cuda:0', not just 'cuda', see: https://github.com/deepset-ai/haystack/issues/3160
    device = torch.device('cuda:0')
    print('GPU', torch.cuda.get_device_name(0), 'is available and will be used as a device.')
else:
    # neither MPS nor CUDA is available, fall back to the CPU
    device = torch.device('cpu')
    print('No GPU available, CPU will be used as a device instead. '
          + 'Be aware that the computation time increases significantly.')

### Data Import and Pre-Processing

For demonstration purposes and due to data privacy laws, we use a fictitious, AI-generated grant peer review report as an example. The following prompt was given to ChatGPT to generate the example review:

'You are a scientific expert in economics and have been requested by the Swiss National Science Foundation (SNSF) to review a grant proposal on the topic of causal inference in economics. Please, write a concise grant peer review report reflecting on the following SNSF's evaluation criteria: scientific relevance, topicality and originality; suitability of methods and feasibility; applicants' scientific track record and expertise. For each of the criteria, provide your assessment of strengths and weaknesses and a general comment. Please, provide the report in plain text. Thank you.'

The AI-generated report is provided below.

In [None]:
review_text = """

1. Scientific Relevance, Topicality and Originality

Strengths:
The proposal addresses a highly relevant topic in contemporary economics: causal inference in observational settings, with applications to policy evaluation. The research questions are well-defined and pertinent to current empirical challenges, particularly in microeconometrics and applied public economics. The proposal demonstrates awareness of recent advances, including machine learning integration into causal analysis, and lies at the intersection of economic theory and statistical innovation.

Weaknesses:
While the research questions are timely, parts of the proposal could further clarify the added value over existing approaches. Some stated contributions—such as improving robustness in difference-in-differences designs—are valuable but incremental, and the proposal would benefit from clearer articulation of novel theoretical insights or methodological leaps.

General Comment:
The proposal is of solid scientific relevance and topicality. It engages with ongoing developments in causal inference and seeks to contribute to applied economic analysis. Originality is moderate but acceptable given the empirical ambition and interdisciplinary approach.

2. Suitability of Methods and Feasibility

Strengths:
The proposed methodology is rigorous and appropriate for the stated objectives. The use of quasi-experimental designs, high-dimensional covariate control via machine learning, and sensitivity analyses demonstrates strong methodological awareness. The proposal also includes a realistic data access and management plan, which adds to feasibility.

Weaknesses:
The proposal could benefit from a more detailed discussion of potential identification challenges and how alternative specifications or falsification tests will be employed. There is also limited discussion on potential data limitations or ethical considerations in handling administrative microdata.

General Comment:
The methodological framework is sound and the proposed analysis plan is feasible within the timeline. A more comprehensive risk assessment would strengthen confidence in the empirical execution.

3. Applicants' Scientific Track Record and Expertise

Strengths:
The principal investigator (PI) has a strong publication record with many publications with high impact factor in top-tier journals in applied economics and econometrics. The PI has demonstrated prior success in projects involving causal inference and is well integrated in relevant research networks. Co-investigators and collaborators also bring complementary skills in statistics and computation.

Weaknesses:
The proposal would benefit from clearer delineation of roles among team members, particularly junior researchers, and how their expertise contributes to the project's success. In addition, evidence of prior experience managing large empirical projects could be elaborated.

General Comment:
The applicants have an excellent track record and relevant expertise, providing high confidence in their ability to deliver the proposed research.

Overall Assessment

This is a strong proposal that addresses an important and topical area in economics. The methods are robust and appropriate, the research team is highly competent, and the proposal is feasible as presented. Some improvements could be made in terms of clarifying the originality of the theoretical contribution and expanding the discussion on identification challenges. Nonetheless, the proposal is well-positioned to make a valuable contribution to the field of causal inference in economics.
"""

Given the generated example review, we proceed with text pre-processing as follows: we remove the numbered section titles and sub-headers, replace line breaks and tabs with spaces, collapse repeated whitespace, and strip leading and trailing spaces.

In [None]:
# Remove headers and section titles
review_text = re.sub(r'^\d+\.\s+[^\n]+', '', review_text, flags=re.MULTILINE)  # Matches whole lines such as "1. Something"
review_text = re.sub(r'(Strengths:|Weaknesses:|General Comment:|Overall Assessment)', '', review_text, flags=re.IGNORECASE)

# and remove special characters
review_text = review_text.replace('\n', ' ')  # Replace line breaks with space
review_text = review_text.replace('\t', ' ')  # Replace tabs with space
review_text = re.sub(r'\s+', ' ', review_text)  # Collapse multiple spaces into one
review_text = review_text.strip()  # Remove leading/trailing whitespace
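
The numbered-title pattern is sensitive to where it is allowed to match. The following minimal sketch, on a hypothetical two-line snippet that is not part of the example review, compares an unanchored pattern with one anchored to line starts via `re.MULTILINE`. It is meant as an illustration of the regular-expression behavior only.

```python
import re

# Hypothetical snippet: a numbered section title followed by a sentence ending in a year
sample = "1. Scientific Relevance\nData were collected until 2020. The sample is large."

# Unanchored: also matches "2020. The sample is large." in the middle of a line
unanchored = re.sub(r'\d+\.\s+[^\n]+', '', sample)
# Anchored: with re.MULTILINE, ^ only matches at the start of a line
anchored = re.sub(r'^\d+\.\s+[^\n]+', '', sample, flags=re.MULTILINE)

print(repr(unanchored.strip()))  # 'Data were collected until'
print(repr(anchored.strip()))    # 'Data were collected until 2020. The sample is large.'
```

In the example review no sentence ends with a number, so both variants give the same result there. The anchored form is the safer choice for reviews that mention years or other figures at the end of a sentence.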

Next, we check that the review is written in English, segment it into sentences and lower-case them. Sentence segmentation is important because the classification models were fine-tuned at the sentence level.

In [None]:
# detect language of the review
assert detect(review_text) == "en", "The review is not in English, models cannot be used."

In [None]:
# split the review into sentences at sentence-ending punctuation followed by whitespace and a capital letter
review_sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', review_text)
# and lower-case the sentences
review_sentences = [sentence.lower() for sentence in review_sentences]
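
To make the behavior of the splitting rule concrete, here is a small sketch on a hypothetical text (not taken from the example review). It uses the same regular expression and shows both a case the rule handles well and a known limitation of any rule based on punctuation and capitalization.

```python
import re

# Hypothetical cleaned text, used only for illustration
text = ("The design is sound, e.g. the controls are rich. "
        "Is the timeline realistic? "
        "The PI, Prof. Smith, is experienced.")

# Split at sentence-ending punctuation followed by whitespace and a capital letter
sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
# Lower-case each sentence after splitting, since the rule relies on capital letters
sentences = [sentence.lower() for sentence in sentences]

for sentence in sentences:
    print(sentence)
# the design is sound, e.g. the controls are rich.
# is the timeline realistic?
# the pi, prof.
# smith, is experienced.
```

The abbreviation "e.g." does not trigger a split because a lower-case letter follows it, whereas "Prof. Smith" is split into two fragments. Lower-casing has to happen after the split, as the rule depends on the capital letter.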

Now, the text inputs are ready to be fed into the classifiers.

### Load Tokenizer and Models

The base model for the SNSF fine-tuned models is [SPECTER2](https://huggingface.co/allenai/specter2_base) (Cohan et al., 2020), so we directly load the tokenizer of the base model from HuggingFace (Wolf et al., 2019).

In [None]:
# load tokenizer from specter2_base - the base model
tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")

We can now load the open-source fine-tuned models from the SNSF's [HuggingFace](https://huggingface.co/snsf-data) account. Specifically, we load all models except the one classifying 'Rationale', as its classification accuracy was notably low; any deployment of that model should therefore be approached with caution and thorough consideration. For details on the models' accuracy and the fine-tuning procedure, please refer to the respective model cards on HuggingFace.

In [None]:
# specify the names for the models to be applied in a dictionary
model_names = {'Track record': 'snsf-data/specter2-review-track-record',
               'Relevance, originality, topicality': 'snsf-data/specter2-review-relevance-originality-topicality',
               'Suitability': 'snsf-data/specter2-review-suitability',
               'Feasibility': 'snsf-data/specter2-review-feasibility',
               'Applicant': 'snsf-data/specter2-review-applicant',
               'Applicant Quantity': 'snsf-data/specter2-review-applicant-quantity',
               'Proposal': 'snsf-data/specter2-review-proposal',
               'Method': 'snsf-data/specter2-review-method',
               'Positive': 'snsf-data/specter2-review-positive',
               'Negative': 'snsf-data/specter2-review-negative',
               'Suggestion': 'snsf-data/specter2-review-suggestion'}

### Classify Review Sentences

Below, we apply the models in a loop to classify each sentence of the review report into the given categories that the models were fine-tuned for. We save the results in a dataframe, where each column indicates if the given category is present (True) or absent (False).

In [None]:
# initiate storage as an empty dictionary
review_classified = {}
# start the loop for each model
for model_name_idx, model_idx in model_names.items():

    # print the progress
    print('Currently applying ' + model_name_idx + ' model for classification.\n')

    # load the SNSF fine-tuned model for classification of review texts
    model = AutoModelForSequenceClassification.from_pretrained(model_idx).to(device)
    # set up the classification pipeline on the same device as the model
    classification_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=device)
    # feed in the review sentences
    review_sentences_classified = classification_pipeline(review_sentences)
    # and post-process the results: as dataframe and boolean
    review_classified[model_name_idx] = pd.DataFrame(review_sentences_classified)['label'] == 'yes'

# save as dataframe
review_classified = pd.DataFrame(review_classified)
# and prepend sentences
review_classified.insert(0, 'Sentence', review_sentences)
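
The post-processing step inside the loop can be illustrated without loading any model. A text classification pipeline returns one dictionary per input sentence with a `label` and a `score`. The sketch below uses a hypothetical, hand-written output for three sentences. The stricter variant with a minimum score is an assumption added for illustration and is not part of this notebook.

```python
# Hypothetical pipeline output for three sentences (values are made up)
mock_output = [
    {'label': 'yes', 'score': 0.97},
    {'label': 'no', 'score': 0.88},
    {'label': 'yes', 'score': 0.55},
]

# Same rule as in the loop above: the category is present if the label is 'yes'
is_present = [result['label'] == 'yes' for result in mock_output]

# Optional stricter variant (an assumption): additionally require a minimum score
min_score = 0.6  # hypothetical threshold, not taken from the models' documentation
is_confident = [result['label'] == 'yes' and result['score'] >= min_score
                for result in mock_output]

print(is_present)    # [True, False, True]
print(is_confident)  # [True, False, False]
```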

### Results

Below, we inspect the raw classification results on a sentence level and aggregate them to represent prevalence of each category within the review.

In [None]:
# inspect the results
review_classified

In [None]:
# compute the prevalences for the given review
review_classified_prevalence = review_classified.mean(numeric_only=True)
print(review_classified_prevalence)
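
The aggregation is a column-wise mean of boolean values, which equals the share of sentences in which a category is present. A minimal sketch in plain Python, with hypothetical classifications for a four-sentence review, shows the arithmetic:

```python
# Hypothetical sentence-level classifications for a four-sentence review
classified = {
    'Positive': [True, True, False, True],
    'Negative': [False, False, True, False],
    'Suggestion': [False, False, True, True],
}

# Share of sentences per category: True counts as 1, False as 0
prevalence = {category: sum(flags) / len(flags)
              for category, flags in classified.items()}

print(prevalence)  # {'Positive': 0.75, 'Negative': 0.25, 'Suggestion': 0.5}
```

Because each category has its own binary classifier, a sentence can belong to several categories at once, so the shares need not sum to 1.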

In [None]:
# visualize the results
plt.figure(figsize=(8, 5))
review_classified_prevalence.plot(kind='bar', color='skyblue')
# Add descriptions
plt.ylabel('Prevalence (share of sentences)')
plt.title('Prevalence of classified categories in a review')
plt.ylim(0, 1)
plt.xticks(rotation=75)
plt.tight_layout()
# Show plot
plt.show()

### Summary

- SNSF's fine-tuned models can be accessed and loaded directly from the [HuggingFace](https://huggingface.co/snsf-data) hub.
- The models can be applied directly to text data from review reports to classify review sentences into categories relevant to research funders.
- Classification results provide an aggregate overview of the contents of the grant peer review report.

### References

- Cohan, A., Feldman, S., Beltagy, I., Downey, D., & Weld, D. S. (2020). Specter: Document-level representation learning using citation-informed transformers. arXiv preprint arXiv:2004.07180.
- Okasa, G., de León, A., Strinzel, M., Jorstad, A., Milzow, K., Egger, M., & Müller, S. (2024). A Supervised Machine Learning Approach for Assessing Grant Peer Review Reports. arXiv preprint arXiv:2411.16662.
- Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... & Rush, A. M. (2019). Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.