# Training Toy SetFit Models for NSF Award Abstract Software Prediction

Quick notebook which uses only a sample of our data, merged in the current annotations from Lindsey and Richard, gets the NSF award abstract texts and then trains a model with SetFit.

Larger example to come soon^tm.

In [1]:
import pandas as pd

## Setup

1. Read a sample of the "NSF + GitHub Linked" data (output from Eva's script)
2. Read Lindsey's labelled GitHub Repos for Software Classification
3. Read Richard's labelled GitHub Repos for Software Classification
4. Join the datasets together and drop any NA

In [2]:
# Read nsf + github linked sample
linked_nsf_github_sample = pd.read_parquet(
    "linked-github-nsf-results.parquet",
)

In [3]:
# Read lindseys labelled github repos data and clean
lindsey_coded_repos = pd.read_csv(
    "all-github-search-results-duplicates-removed - Lindsey.csv",
)
lindsey_coded_repos = lindsey_coded_repos[["include/exclude", "link"]]
lindsey_coded_repos["annotator"] = "lindsey"

In [4]:
# Read richards labelled github repos data and clean
richard_coded_repos = pd.read_csv(
    "all-github-search-results-duplicates-removed - Richard.csv",
)
richard_coded_repos = richard_coded_repos[["include/exclude", "link"]]
richard_coded_repos["annotator"] = "richard"

In [5]:
# Join and clean
data_lindsey = linked_nsf_github_sample.join(
    lindsey_coded_repos.set_index("link"), on="github_link",
)
data_richard = linked_nsf_github_sample.join(
    richard_coded_repos.set_index("link"), on="github_link",
)
data = pd.concat([data_lindsey, data_richard])
data = data.dropna(
    subset=["include/exclude"],
)
data.head()

Unnamed: 0,github_link,nsf_award_id,nsf_link,from_template_repo,is_a_fork,include/exclude,annotator
1,https://github.com/MPEDS/mpeds-coder,1918342,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey
2,https://github.com/MPEDS/mpeds-coder,1423784,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey
3,https://github.com/kristen-johnson/STM-image-m...,1807225,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey
7,https://github.com/bloose/bias_correction_by_ML,1429940,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey
11,https://github.com/arasdar/BCI,1565962,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey


## Quick Value Counts

In [6]:
data.loc[
    data.annotator == "lindsey"
]["include/exclude"].value_counts()

exclude    48
include     5
Name: include/exclude, dtype: int64

In [7]:
data.loc[
    data.annotator == "richard"
]["include/exclude"].value_counts()

exclude    41
include    12
Name: include/exclude, dtype: int64

## Get NSF Award Abstracts

In [8]:
from typing import Dict, Union

import requests
from tqdm.contrib.concurrent import thread_map

from soft_search.constants import NSFFields

  from .autonotebook import tqdm as notebook_tqdm


In [9]:
def _get_abstract_text(award_id: int) -> Dict[str, Union[int, str]]:
    return {
        "award_id": award_id,
        "abstract_text": requests.get(
            f"https://api.nsf.gov/"
            f"services/v1/awards/{award_id}.json"
            f"?printFields={NSFFields.abstractText}"
        ).json()["response"]["award"][0][NSFFields.abstractText]
    }

abstract_texts = pd.DataFrame(
    thread_map(
        _get_abstract_text,
        data.nsf_award_id.unique(),
    )
)

data = data.join(abstract_texts.set_index("award_id"), on="nsf_award_id")
data.head()

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 68/68 [00:04<00:00, 16.91it/s]


Unnamed: 0,github_link,nsf_award_id,nsf_link,from_template_repo,is_a_fork,include/exclude,annotator,abstract_text
1,https://github.com/MPEDS/mpeds-coder,1918342,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey,Social movements reflect societal concerns wit...
2,https://github.com/MPEDS/mpeds-coder,1423784,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey,"This project builds, tests, and validates an o..."
3,https://github.com/kristen-johnson/STM-image-m...,1807225,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey,Electronic devices are playing an ever-increas...
7,https://github.com/bloose/bias_correction_by_ML,1429940,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey,Production of organic matter from carbon dioxi...
11,https://github.com/arasdar/BCI,1565962,https://www.nsf.gov/awardsearch/showAward?AWD_...,False,False,exclude,lindsey,Title: CRII: SCH: Brain-Body Sensor Fusion: Me...


## Prep Data for Training

In [10]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

In [11]:
# Set up data splits of train=0.6 test=0.2 valid=0.2

# select only the columns we need
subset_data = data[["annotator", "abstract_text", "include/exclude"]]

# set the labels to ints
subset_data = subset_data.replace({"exclude": 0, "include": 1})

# lindsey
lindsey_data = subset_data.loc[subset_data.annotator == "lindsey"].drop(
    columns=["annotator"]
)
lindsey_train, lindsey_test_and_valid = train_test_split(
    lindsey_data,
    test_size=0.6,
    stratify=lindsey_data["include/exclude"],
)
lindsey_test, lindsey_valid = train_test_split(
    lindsey_test_and_valid,
    test_size=0.5,
    stratify=lindsey_test_and_valid["include/exclude"],
)

# richard
richard_data = subset_data.loc[subset_data.annotator == "richard"].drop(
    columns=["annotator"]
)
richard_train, richard_test_and_valid = train_test_split(
    richard_data,
    test_size=0.6,
    stratify=richard_data["include/exclude"],
)
richard_test, richard_valid = train_test_split(
    richard_test_and_valid,
    test_size=0.5,
    stratify=richard_test_and_valid["include/exclude"],
)

In [12]:
# Convert to Huggingface Dataset objects
lindsey_train = Dataset.from_pandas(lindsey_train, preserve_index=False)
lindsey_test = Dataset.from_pandas(lindsey_test, preserve_index=False)
lindsey_valid = Dataset.from_pandas(lindsey_valid, preserve_index=False)
richard_train = Dataset.from_pandas(richard_train, preserve_index=False)
richard_test = Dataset.from_pandas(richard_test, preserve_index=False)
richard_valid = Dataset.from_pandas(richard_valid, preserve_index=False)

## Train Models for Each Person

In [13]:
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer
from sklearn.metrics import accuracy_score

In [14]:
# Load a SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [15]:
# Run model training and eval
models = {}
for ds_name, train_ds, test_ds in [
    ("lindsey", lindsey_train, lindsey_test),
    ("richard", richard_train, lindsey_test),
]:  
    # Create trainer
    trainer = SetFitTrainer(
        model=model,
        train_dataset=train_ds,
        eval_dataset=test_ds,
        loss_class=CosineSimilarityLoss,
        metric="accuracy",
        batch_size=2,
        num_iterations=20,
        num_epochs=1,
        column_mapping={"abstract_text": "text", "include/exclude": "label"},
    )

    # Train and evaluate
    trainer.train()
    metrics = trainer.evaluate()
    models[ds_name] = trainer.model
    
    # Print stats and predictions
    print(ds_name)
    print("training accuracy:", metrics["accuracy"])

Applying column mapping to training dataset
***** Running training *****
  Num examples = 840
  Num epochs = 1
  Total optimization steps = 420
  Total train batch size = 2
Epoch:   0%|                                                                                                                                                            | 0/1 [00:00<?, ?it/s]
Iteration:   0%|                                                                                                                                                      | 0/420 [00:00<?, ?it/s][A
Iteration:   0%|▎                                                                                                                                             | 1/420 [00:00<03:25,  2.04it/s][A
Iteration:   0%|▋                                                                                                                                             | 2/420 [00:00<03:23,  2.06it/s][A
Iteration:   1%|█                                     

lindsey
training accuracy: 0.875


Epoch:   0%|                                                                                                                                                            | 0/1 [00:00<?, ?it/s]
Iteration:   0%|                                                                                                                                                      | 0/420 [00:00<?, ?it/s][A
Iteration:   0%|▎                                                                                                                                             | 1/420 [00:00<02:55,  2.38it/s][A
Iteration:   0%|▋                                                                                                                                             | 2/420 [00:00<03:07,  2.23it/s][A
Iteration:   1%|█                                                                                                                                             | 3/420 [00:01<03:08,  2.21it/s][A
Iteration:   1%|█▎               

richard
training accuracy: 0.875


In [16]:
# Run validation accuracy
for ds_name, valid_ds in [
    ("lindsey", lindsey_valid),
    ("richard", richard_valid),
]:
    print(ds_name)
    model = models[ds_name]
    preds = model(valid_ds["abstract_text"])
    print("predictions from validation set:", preds)
    print("ground truth for validation set:", valid_ds["include/exclude"])
    print("validation accuracy:", accuracy_score(valid_ds["include/exclude"], preds))
    print("-" * 80)

lindsey
predictions from validation set: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
ground truth for validation set: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
validation accuracy: 0.875
--------------------------------------------------------------------------------
richard
predictions from validation set: [0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
ground truth for validation set: [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
validation accuracy: 0.625
--------------------------------------------------------------------------------
