# **Fine-tuning RoBERTa for named-entity recognition**

In [4]:
#!pip install seqeval

In [1]:
#!pip install tensorflow

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer, get_cosine_schedule_with_warmup
from seqeval.metrics import classification_report
import math
import os
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

  from .autonotebook import tqdm as notebook_tqdm
2023-09-03 22:30:45.685768: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


cuda


In [1]:
!nvidia-smi

Sun Sep  3 21:37:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 14%   27C    P0    56W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 13%   27C    P5    12W / 250W |      0MiB / 11264MiB |      0%      Default |
|       

In [11]:
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
EPOCHS = 10  # 5 for more than 1024, 10 for less
LEARNING_RATE = 5e-5  # 1e-05
MAX_GRAD_NORM = 10
MAX_LEN = 256


## Preprocessing the dataset

Named entity recognition (NER) uses an annotation scheme that is defined at the word level (at least for European languages). A widely used scheme is IOB tagging, short for Inside-Outside-Beginning: each tag indicates whether the corresponding word is inside, outside, or at the beginning of a named entity. The scheme is needed because named entities often span more than one word.

Let's have a look at an example. For the sentence "Barack Obama was born in Hawaii", the corresponding tags would be [B-PERS, I-PERS, O, O, O, B-GEO]. B-PERS means that the word "Barack" is the beginning of a person, I-PERS means that the word "Obama" is inside a person, O means that the word "was" is outside any named entity, and so on. A sentence therefore typically has exactly as many tags as it has words.
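To make the scheme concrete, here is a minimal sketch that groups IOB-tagged words back into entity spans. The helper name `extract_entities` is made up for this illustration and is not part of the notebook's pipeline; a stray `I-` tag with no open entity is simply dropped.

```python
def extract_entities(words, tags):
    """Group IOB-tagged words into (entity_type, text) spans."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_words, current_type = [], None
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


words = "Barack Obama was born in Hawaii".split()
tags = ["B-PERS", "I-PERS", "O", "O", "O", "B-GEO"]
entities = extract_entities(words, tags)
print(entities)  # [('PERS', 'Barack Obama'), ('GEO', 'Hawaii')]
```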

In [10]:
#df = pd.read_csv("cleaned_plain-text_labeled_term+combined_no_ref_no_cit_def_same_len_only.csv", delimiter=',')
#df = pd.read_csv("cleaned_plain-text_labeled_term+combined_no_ref_no_cit_def_same_len_only_must_conatin_B.csv", delimiter=',') # len = 13692
df = pd.read_csv("cleaned_plain-text_labeled_term+combined_no_ref_no_cit_def_same_len_only_must_conatin_B_less_than_500tokens.csv", delimiter=',')
len(df)

13653

In [12]:
all_data = df[['plain_text_def','labeled_def','plain_text_term']].copy()

all_data.rename(columns={"plain_text_def": "sentence", "labeled_def": "word_labels" }, inplace=True)

all_data['word_labels'] = all_data['word_labels'].str.replace('I_MATH_TERM','I-MATH_TERM')
all_data['word_labels'] = all_data['word_labels'].str.replace('B_MATH_TERM','B-MATH_TERM')


data = all_data[:-1024] # hold out the last 1024 rows; they become gen_data below
data

Unnamed: 0,sentence,word_labels,plain_text_term
0,\nLet G be a finite group and let X→ S be a fa...,O O O O O O O O O O O O O O O O O O O O O O O ...,admissible G-cover
1,"\n Let π∈_n and let x,y be a pair of rooks th...",O O O O O O O O O O O O O O O O O O O O O O B-...,light reduction pair;heavy reduction pair
2,"A local modification rule is a pair (A,T), w...",O B-MATH_TERM I-MATH_TERM I-MATH_TERM O O O O ...,Local modification rule
3,A vertex set is any set which is at most cou...,O O O O O O O O O O O O O O O O B-MATH_TERM O ...,Vertex sets
4,A finite palette is a sequence K = (K_j)_j=0^∞...,O O O O O O O O O O O O O O O O O O O O O O O ...,Palettes
...,...,...,...
12624,\n\tA plabic graph is an undirected graph G dr...,O B-MATH_TERM I-MATH_TERM O O O O O O O O O O ...,plabic graph
12625,"\n\tFor a plabic graph G, the trip π_G describ...",O O O O O O O O O O O O O B-MATH_TERM I-MATH_T...,decorated trip permutation
12626,\n\tThe matroid polytope Γ_M of the matroid M ...,O B-MATH_TERM I-MATH_TERM O O O O O O O O O O ...,matroid polytope;positroid polytope
12627,"[Maximal rooted rainbow tree] Given r≥ 3, let...",O O O O O O O O O O O O O O O O O O O O O O O ...,such that ; such that


In [13]:
gen_data = all_data[-1024:]
gen_data

Unnamed: 0,sentence,word_labels,plain_text_term
12629,Let P be a G-poset. The compatibility graph of...,O O O O O O B-MATH_TERM I-MATH_TERM O O O O O ...,Compatibility graph
12630,Let P be a G-poset. The strong compatibility g...,O O O O O O B-MATH_TERM I-MATH_TERM I-MATH_TER...,Strong compatibility graph
12631,\nLet p=p_1p_2⋯ p_n be a permutation. \nWe say...,O O O O O O O O O O O O O O O O O O O O B-MATH...,good pair;bad pair
12632,\nThe family of probability measures on partit...,O O O O O O O O O O B-MATH_TERM O O O O O O O ...,multiplicative
12633,\nWe call a family of measures μ^(n) ergodic i...,O O O O O O O B-MATH_TERM O O O O O O O O O O ...,ergodic;limit shape
...,...,...,...
13648,"Consider ⟨ G→ S, α→ A⟩, two adjacent vertices...",O O O O O O O O O O O O O O O O O O O O O O O ...,preferred direction;special claw corresponding...
13649,"\nAccording to Theorem <ref>, ^d has exactly d...",O O O O O O O O O O O O O O O O O O O O O O O ...,Cyclicity classes
13650,By Wielandt number we mean the following funct...,O B-MATH_TERM I-MATH_TERM O O O O O O O O O O O O,Wielandt number
13651,"\nThe girth of , denoted it by g(),\nis the sm...",O B-MATH_TERM O O O O O O O O O O O O O O O O,Girth


In [14]:
gen_data = gen_data[-1024:]
gen_data.to_csv('data/test_GPT+labels.csv', index=False)

# Preparing the dataset and dataloader

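Now that our data is preprocessed, we can turn it into PyTorch tensors that the model can consume. The hyperparameters defined earlier (`MAX_LEN`, the batch sizes, and so on) are used throughout the training/evaluation process. We start with a helper that tokenizes a sentence word by word, so that every subword keeps the label of the word it came from: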


In [6]:
def tokenize_and_preserve_labels(sentence, text_labels, tokenizer):
    """
    Subword tokenization makes it difficult to match word-level labels
    back up with individual subword tokens. This function tokenizes each
    word one at a time so that every subword inherits the label of the
    word it came from. This is slower than tokenizing the whole sentence
    at once, but it keeps tokens and labels aligned.
    """

    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence.split(), text_labels.split()):

        # Tokenize the word and count # of subwords the word is broken into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels
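
Because the function above copies a word's label to every subword, a word tagged `B-MATH_TERM` that splits into several pieces yields several `B-` tags in a row. One possible variant, sketched here and not what this notebook uses, keeps `B-` on the first piece only and turns the remaining pieces into `I-`. `ToyTokenizer` is a hypothetical stand-in that splits on hyphens so the example runs without downloading a model.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a subword tokenizer: splits words on hyphens."""

    def tokenize(self, word):
        pieces = []
        for i, part in enumerate(word.split("-")):
            if i > 0:
                pieces.append("-")
            if part:
                pieces.append(part)
        return pieces


def tokenize_with_iob_continuation(sentence, text_labels, tokenizer):
    """Like tokenize_and_preserve_labels, but only the first piece keeps a B- tag."""
    tokens, labels = [], []
    for word, label in zip(sentence.split(), text_labels.split()):
        pieces = tokenizer.tokenize(word)
        if not pieces:
            continue
        # Later pieces of a B- word are inside the entity, so they get I-.
        continuation = "I-" + label[2:] if label.startswith("B-") else label
        tokens.extend(pieces)
        labels.extend([label] + [continuation] * (len(pieces) - 1))
    return tokens, labels


tokens, labels = tokenize_with_iob_continuation(
    "a k-tensor with entries", "O B-MATH_TERM O O", ToyTokenizer()
)
for token, label in zip(tokens, labels):
    print(f"{token:10} {label}")
```

Whether this helps depends on how predictions are aggregated afterwards, so treat it as an option to evaluate rather than a recommendation.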

In [7]:
labels = ['B-MATH_TERM', 'I-MATH_TERM', 'O']

label2id = { label : labels.index(label) for label in labels}

id2label = { labels.index(label) : label for label in labels}

label2id

{'B-MATH_TERM': 0, 'I-MATH_TERM': 1, 'O': 2}

In [8]:
class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        # step 1: tokenize (and adapt corresponding labels)
        sentence = self.data.sentence[index]  
        word_labels = self.data.word_labels[index]  
        tokenized_sentence, labels = tokenize_and_preserve_labels(sentence, word_labels, self.tokenizer)
        
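        # step 2: add special tokens (and corresponding labels)
        # NOTE: the padded strings "<s> " and " </s>" are not vocabulary entries, so
        # convert_tokens_to_ids maps them to <unk>; self.tokenizer.bos_token and
        # self.tokenizer.eos_token give the real RoBERTa special tokens.
        tokenized_sentence = ["<s> "] + tokenized_sentence + [" </s>"]
        labels.insert(0, "O") # add outside label for the <s> token
        labels.append("O") # add outside label for the </s> token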

        # step 3: truncating/padding
        maxlen = self.max_len

        if (len(tokenized_sentence) > maxlen):
          # truncate
          tokenized_sentence = tokenized_sentence[:maxlen]
          labels = labels[:maxlen]
        else:
          # pad
          tokenized_sentence = tokenized_sentence + ['<pad>' for _ in range(maxlen - len(tokenized_sentence))]
          labels = labels + ["O" for _ in range(maxlen - len(labels))]

        # step 4: obtain the attention mask
        attn_mask = [1 if tok != '<pad>' else 0 for tok in tokenized_sentence] # modified according to https://huggingface.co/docs/transformers/v4.21.1/en/model_doc/camembert
        
        # step 5: convert tokens to input ids
        ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)

        label_ids = [label2id[label] for label in labels]
        # the following line is deprecated
        #label_ids = [label if label != 0 else -100 for label in label_ids]
        
        return {
              'ids': torch.tensor(ids, dtype=torch.long),
              'mask': torch.tensor(attn_mask, dtype=torch.long),
              #'token_type_ids': torch.tensor(token_ids, dtype=torch.long),
              'targets': torch.tensor(label_ids, dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len
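
The class above labels padding positions as `"O"`. If the installed `transformers` version computes the loss over every position whose label is not `-100` (the default `ignore_index` of `torch.nn.CrossEntropyLoss`), those padding positions count towards the loss. A possible alternative for step 3, sketched in plain Python with a hypothetical helper name, pads the label ids with `-100` instead:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_or_truncate(tokens, label_ids, max_len, pad_token="<pad>"):
    """Cut or pad tokens and label ids to max_len and build the attention mask."""
    tokens = tokens[:max_len]
    label_ids = label_ids[:max_len]
    n_pad = max_len - len(tokens)
    # Real tokens get mask 1, padding gets mask 0.
    attention_mask = [1] * len(tokens) + [0] * n_pad
    tokens = tokens + [pad_token] * n_pad
    # Padding labels are set to IGNORE_INDEX so the loss can skip them.
    label_ids = label_ids + [IGNORE_INDEX] * n_pad
    return tokens, label_ids, attention_mask


tokens, label_ids, attention_mask = pad_or_truncate(
    ["<s>", "mixed", "volume", "</s>"], [2, 0, 1, 2], max_len=6
)
print(tokens)
print(label_ids)
print(attention_mask)
```

As in the class above, truncation simply cuts the sequence, so an over-long sentence loses its closing `</s>` token.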

In [7]:
#tokenizer = AutoTokenizer.from_pretrained("InriaValda/cc_math_roberta_ep01", from_tf=True)

In [None]:
# split the dataset and save it (first run only)
"""train_size = 0.9
train_dataset = data.sample(frac=train_size,random_state=200)
val_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

#save the data sets
train_dataset.to_csv('data/train.csv', index=False)
val_dataset.to_csv('data/val.csv', index=False)
gen_data.to_csv('data/test.csv', index=False)

print("FULL Training Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("VALIDATION Dataset: {}".format(val_dataset.shape))"""

In [None]:
# 10-fold split of the dataset, saved to disk (first run only)
"""
from sklearn.model_selection import KFold

skf = KFold(n_splits=10)
n = 1                                                        
for train_index, val_index in skf.split(X=data['sentence'].to_numpy(), y=data['word_labels'].to_numpy()):
    train_set = data.iloc[train_index]
    val_set = data.iloc[val_index]
    train_file_name = 'data/10-fold/train_499_' + str(n) + '.csv'
    train_set.to_csv(train_file_name, index = False)
    val_file_name = 'data/10-fold/val_499_' + str(n) + '.csv'
    val_set.to_csv(val_file_name, index = False)
    n += 1
"""

## Verifying tokenization

In [11]:
model = AutoModelForTokenClassification.from_pretrained("InriaValda/cc_math_roberta_ep01",
                                                                    from_tf=True,
                                                                    num_labels=len(id2label),
                                                                    id2label=id2label,
                                                                    label2id=label2id)

2023-07-18 18:57:54.770841: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
All TF 2.0 model weights were used when initializing RobertaForTokenClassification.

All the weights of RobertaForTokenClassification were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForTokenClassification for predictions without further training.


In [12]:
validation_set = dataset(pd.read_csv('data/val.csv'), tokenizer, MAX_LEN)

validation_set[0]

{'ids': tensor([    3,    37,  4172,  1530,  2486, 17040,   309,    69,  8060,    12,
            37,    16,    56,   414,  2723,    37,   309,    69,  4270,  3539,
            16,   562,    56,    30,    22,    66,    37, 44419,    63,    22,
            65,    22, 44911,    20,    16,    21,    97,   309,    69,  8645,
          7599,  8838,   266,    37,  4163,    63,    22,    65,  1694,    95,
            20,    16,    21,  1472,  2723,    63,    22,    65, 13548,    95,
            21,    16,    22,  1472,  3014,   309,  4605,  2758,   761,   508,
          1694,  4372,  4165,   511,  1574,    21,   562,    22, 28521,    18,
         14602, 12663,  6404,  3539,    58,    16,  2776, 35767,    69,  1530,
          2486,  8645,    56,    66,    12,    58,  2213,    22,    66,    37,
         44419,    58,    22,  1004,    22,    66,    58,    22,  2879,  4555,
          5893,   314,   306,  4888,    12,    90,    67,    21,    16,    90,
            67,    22,    13,   274,    58,  

In [13]:
# print the first 50 tokens and corresponding labels
for token, label in zip(tokenizer.convert_ids_to_tokens(validation_set[15]["ids"][:50]), validation_set[15]["targets"][:50]):
  print('{0:15}  {1}'.format(token, id2label[label.item()]))

<unk>            O
For              O
k                O
âī¥              O
2                O
,                O
a                O
k                B-MATH_TERM
-                B-MATH_TERM
tensor           B-MATH_TERM
with             O
entries          O
in               O
is               O
a                O
function         O
T                O
:{               O
1                O
,...,            O
d                O
}                O
^                O
k                O
âŁ               O
¶                O
.                O
We               O
refer            O
to               O
the              O
number           O
k                O
as               O
the              O
order            O
of               O
the              O
tensor           O
T                O
.                O
We               O
denote           O
by               O
T                O
_                O
i                O
_                O
1                O
âĭ¯              O


In [14]:
# print the first 30 tokens and corresponding labels
for token, label in zip(tokenizer.convert_ids_to_tokens(validation_set[49]["ids"][:30]), validation_set[49]["targets"][:30]):
  print('{0:15}  {1}'.format(token, id2label[label.item()]))

<unk>            O
The              O
mixed            B-MATH_TERM
volume           I-MATH_TERM
(                O
P                O
_                O
1                O
,                O
âĢ               O
¦                O
,                O
P                O
_                O
n                O
)                O
is               O
the              O
coefficient      O
of               O
the              O
monomial         O
_                O
1                O
âĭ¯              O
_                O
n                O
in               O
the              O
polynomial       O


In [15]:
# sanity check: with 3 labels, a uniform prediction gives a loss of -ln(1/3) = 1.09861228867
ids = validation_set[0]["ids"].unsqueeze(0)
mask = validation_set[0]["mask"].unsqueeze(0)
targets = validation_set[0]["targets"].unsqueeze(0)

ids = ids.to(device)#, dtype = torch.long)
mask = mask.to(device)#, dtype = torch.long)
targets = targets.to(device)#, dtype = torch.long)
model.to(device)
outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
initial_loss = outputs[0]

print(f"initial loss = {initial_loss.item()}")
# the initial loss seems to grow with the number of pretraining epochs of cc-xxxbert

initial loss = 1.147135853767395
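
The reference value in the cell's first comment can be reproduced by hand: when all three logits are equal, the softmax assigns probability 1/3 to every label, and the cross-entropy is $-\ln(1/3) = \ln 3$. A small sketch in plain Python (the helper `cross_entropy` is written for this illustration only):

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one token, computed from raw logits."""
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[target]


num_labels = 3
# Equal logits mean a uniform prediction over the labels.
uniform_loss = cross_entropy([0.0] * num_labels, target=2)
print(uniform_loss)
```

A measured initial loss close to this value suggests the freshly initialized classification head is predicting roughly uniformly, which is what one would expect before fine-tuning.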


# Training

In [9]:
# Training function: runs one epoch over the training loader to fine-tune the RoBERTa model
def train(model, training_loader, optimizer, scheduler=None):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['ids'].to(device, dtype = torch.long)
        mask = batch['mask'].to(device, dtype = torch.long)
        targets = batch['targets'].to(device, dtype = torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
        loss, tr_logits = outputs.loss, outputs.logits
        '''
        loss, tr_logits  = model(input_ids=ids, attention_mask=mask, labels=targets)#temporary modification for transformer 3'''
        
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += targets.size(0)
        
        #if idx % 100==0:
        #    loss_step = tr_loss/nb_tr_steps
        #    print(f"Training loss per 100 training steps: {loss_step}")
           
        # compute training accuracy
        flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
        active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
        targets = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_preds.extend(predictions)
        tr_labels.extend(targets)
        
        tmp_tr_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # backward pass
        optimizer.zero_grad()
        loss.backward()

        # gradient clipping (must run after backward() so that it sees the fresh gradients)
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )

        optimizer.step()
        if scheduler:
            scheduler.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    #print(f"Trained {nb_tr_steps} steps")
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")
    

def valid(model, validation_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(validation_loader):
            
            ids = batch['ids'].to(device, dtype = torch.long)
            mask = batch['mask'].to(device, dtype = torch.long)
            targets = batch['targets'].to(device, dtype = torch.long)
            
           
            outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
            loss, eval_logits = outputs.loss, outputs.logits
            
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += targets.size(0)
        
            #if idx % 100==0:
            #    loss_step = eval_loss/nb_eval_steps
            #    print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # compute evaluation accuracy
            flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
            active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
            targets = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(targets)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy
    
    #print(eval_labels)
    #print(eval_preds)

    labels = [id2label[id.item()] for id in eval_labels]
    predictions = [id2label[id.item()] for id in eval_preds]

    #print(labels)
    #print(predictions)
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

def print_reports_to_csv(test_results, model_name, LEARNING_RATE, EPOCHS, trainset_num, report_type):
    test_reports = []
    for res in test_results:
        report = classification_report([res['labels']], [res['predictions']], output_dict=True)
        flattened_report = {str(k+'_'+v_k) : v_v for k,v in report.items() for v_k, v_v in v.items()  }
        flattened_report['trainset_size'] = res['trainset_size']
        flattened_report['model'] = res['model']
        flattened_report['trainset_num'] = trainset_num
        test_reports.append(flattened_report)
    
    df_test_reports = pd.DataFrame(test_reports)
    if '/' in model_name:
        model_name =  model_name.split('/')[1] 
    test_report_name = f'finetuning_results/{report_type}_{model_name}_{LEARNING_RATE}_16_{EPOCHS}.csv'
    df_test_reports.to_csv(test_report_name, mode='a', header=not os.path.exists(test_report_name),index=False)
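
The dictionary comprehension in `print_reports_to_csv` turns a nested report into one flat row per run. The sketch below isolates that step, assuming the report is a dict of dicts as the comprehension implies; the numbers are placeholders chosen for illustration, not results from this notebook.

```python
def flatten_report(report):
    """Flatten {group: {metric: value}} into {"group_metric": value}."""
    return {
        f"{group}_{metric}": value
        for group, metrics in report.items()
        for metric, value in metrics.items()
    }


# Placeholder values, only the shape matters here.
example_report = {
    "MATH_TERM": {"precision": 0.5, "recall": 0.25, "f1-score": 0.33, "support": 4},
    "micro avg": {"precision": 0.5, "recall": 0.25, "f1-score": 0.33, "support": 4},
}

flat = flatten_report(example_report)
for key, value in flat.items():
    print(key, value)
```

Each flat dictionary becomes one CSV row, so results for different training-set sizes and folds can be appended to the same file and compared column by column.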

In [None]:
%%time
#training
train_params = {'batch_size': TRAIN_BATCH_SIZE,
            'shuffle': True,
            'num_workers': 0
            }

val_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
             }


for trainset_num in range(3, 5):

    train_file_name = f'data/10-fold/train_499_{trainset_num}.csv'#'data/train.csv'
    val_file_name = f'data/10-fold/val_499_{trainset_num}.csv'#'data/val.csv'
    
    for model_name in ['InriaValda/cc_math_roberta_ep01']: #'InriaValda/cc_math_roberta_ep10',
        tokenizer = AutoTokenizer.from_pretrained(model_name, from_tf=True, model_max_length=MAX_LEN)
        
        test_generalizability_set = dataset(pd.read_csv('data/test_GPT+labels.csv'), tokenizer, MAX_LEN)
        
        validation_set = dataset(pd.read_csv(val_file_name), tokenizer, MAX_LEN)
        df_training_set = pd.read_csv(train_file_name)
        
        val_results = []
        test_results = []
        
        validation_loader = DataLoader(validation_set, **val_params)
        test_gen_loader = DataLoader(test_generalizability_set, **val_params)
        
        for trainsetsize in [2048]:  #[64,128,256,512,1024,2048,4096,8192,11401] are already done
            training_set = dataset(df_training_set[:trainsetsize], tokenizer, MAX_LEN)
        
            print("TRAIN Dataset: {}".format(training_set.data.shape))
            #train_params['batch_size'] =  int( trainsetsize / 32) if (trainsetsize < 1024) else 16
            training_loader = DataLoader(training_set, **train_params)
        
        
            num_training_steps = int(training_loader.dataset.len / train_params['batch_size'] * EPOCHS)
            print(f'tranining steps: {num_training_steps+1}')
        
            #Shrey uses TF model
            model = AutoModelForTokenClassification.from_pretrained(model_name,
                                                                    from_tf=True,
                                                                    num_labels=len(id2label),
                                                                    id2label=id2label,
                                                                    label2id=label2id)
            model.to(device)
        
            optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
            #scheduler = get_cosine_schedule_with_warmup(optimizer = optimizer, num_warmup_steps = 50, num_training_steps=num_training_steps)
            for epoch in range(EPOCHS):
            #for epoch in range(flex_epoch_nb): 
                print(f"Training epoch: {epoch + 1}")
                train(model, training_loader, optimizer)
                valid(model, validation_loader)
                #valid(model, test_gen_loader)
            labels, predictions = valid(model, validation_loader)     
            val_results.append({'trainset_size': trainsetsize, 'model': model_name, 'labels': labels, 'predictions': predictions})
        
            # test generalizability
            labels, predictions = valid(model, test_gen_loader)
            test_results.append({'trainset_size': trainsetsize, 'model': model_name, 'labels': labels, 'predictions': predictions})
            ner_model_name = f'ner_model/{model_name}_ft_{EPOCHS}ep_train_size_{trainsetsize}_trainset_{trainset_num}'
            model.save_pretrained(ner_model_name)
            tokenizer.save_pretrained(ner_model_name)
            # gpt_aligned_eval(model, tokenizer, ner_model_name) # too slow!
        
        print_reports_to_csv(val_results, model_name, LEARNING_RATE, EPOCHS, trainset_num, 'validation')
        print_reports_to_csv(test_results, model_name, LEARNING_RATE, EPOCHS, trainset_num, 'generalizability')


TRAIN Dataset: (2048, 3)
tranining steps: 1281


All TF 2.0 model weights were used when initializing RobertaForTokenClassification.

All the weights of RobertaForTokenClassification were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForTokenClassification for predictions without further training.


Training epoch: 1
Training loss epoch: 0.08286487784062047
Training accuracy epoch: 0.952549218477001
Validation Loss: 0.045093063005729565
Validation Accuracy: 0.9660932983587777
Training epoch: 2
Training loss epoch: 0.039080586153431796
Training accuracy epoch: 0.9687126276914014
Validation Loss: 0.03779354774027686
Validation Accuracy: 0.9714530980732353
Training epoch: 3
Training loss epoch: 0.029751134941761848
Training accuracy epoch: 0.9761024446076665
Validation Loss: 0.03717532100839705
Validation Accuracy: 0.9739979009855233
Training epoch: 4
Training loss epoch: 0.02324110506742727
Training accuracy epoch: 0.9809312977370916
Validation Loss: 0.04294008636682094
Validation Accuracy: 0.9707549650917959
Training epoch: 5
Training loss epoch: 0.017480263977631694
Training accuracy epoch: 0.9857224869288685
Validation Loss: 0.03787283281076558
Validation Accuracy: 0.9701824570633991
Training epoch: 6
Training loss epoch: 0.01421614603168564
Training accuracy epoch: 0.98871080591

All TF 2.0 model weights were used when initializing RobertaForTokenClassification.

All the weights of RobertaForTokenClassification were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForTokenClassification for predictions without further training.


Training epoch: 1
Training loss epoch: 0.08320464556163643
Training accuracy epoch: 0.9511886245709282
Validation Loss: 0.06663252182210548
Validation Accuracy: 0.942584903432888
Training epoch: 2
Training loss epoch: 0.03984326004865579
Training accuracy epoch: 0.9683081518191731
Validation Loss: 0.040605586965250066
Validation Accuracy: 0.9680821112118954
Training epoch: 3
Training loss epoch: 0.029163400067773182
Training accuracy epoch: 0.9766139774371558
Validation Loss: 0.03804644028644396
Validation Accuracy: 0.971415509613532
Training epoch: 4
Training loss epoch: 0.02224438277335139
Training accuracy epoch: 0.9817856261621576
Validation Loss: 0.039472024038999895
Validation Accuracy: 0.9718737103973145
Training epoch: 5
Training loss epoch: 0.01834763454462518
Training accuracy epoch: 0.9849037550854435
Validation Loss: 0.0450995744123489
Validation Accuracy: 0.9730561669943274
Training epoch: 6
Training loss epoch: 0.014575396045984235
Training accuracy epoch: 0.9883641054456

# Inference

The fun part is testing the model on new, unseen sentences. Here we use the `token-classification` pipeline with `aggregation_strategy="simple"`, which groups adjacent tokens that belong to the same predicted entity into a single span. Note that the function we used to prepare the training data (`tokenize_and_preserve_labels`) propagated each word's label to all of its word pieces, so you could also, for example, take a majority vote over the predicted labels of all word pieces of a word.

In other words, the code below does not reconcile cases where the predictions for different word pieces of the same word disagree.
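
The majority vote mentioned above could be sketched as follows. The `word_ids` list mimics the per-token word indices that fast tokenizers expose (with `None` for special tokens); here it is written by hand, and `majority_vote` is a hypothetical helper, not part of the pipeline used below.

```python
from collections import Counter


def majority_vote(word_ids, piece_labels):
    """Collapse per-piece labels into one label per word by majority vote."""
    votes = {}
    for word_id, label in zip(word_ids, piece_labels):
        if word_id is None:  # special tokens such as <s> and </s>
            continue
        votes.setdefault(word_id, []).append(label)
    # most_common keeps first-seen order among ties, so the earliest piece wins a tie.
    return [Counter(votes[w]).most_common(1)[0][0] for w in sorted(votes)]


word_ids = [None, 0, 1, 1, 1, 2, None]
piece_labels = ["O", "O", "B-MATH_TERM", "B-MATH_TERM", "O", "O", "O"]
word_labels = majority_vote(word_ids, piece_labels)
print(word_labels)  # ['O', 'B-MATH_TERM', 'O']
```

The pipeline's other aggregation strategies (`"first"`, `"average"`, `"max"`) are built-in ways of resolving the same disagreement at the word level.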

In [13]:
#model = AutoModelForTokenClassification.from_pretrained('ner_model/')
pipe = pipeline(task="token-classification", model=model.to('cpu'), tokenizer=tokenizer, aggregation_strategy="simple")
pipe("The Betti poset of a poset P is the subposet consisting of all homologically contributing elements, B(P)={q∈ P  | _i(Δ_q) ≠ 0  i}.")


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity_group': 'MATH_TERM',
  'score': 0.8791002,
  'word': ' Betti poset',
  'start': 4,
  'end': 15}]

In [15]:
%%time
pipe("A subskeleton (Γ_0,α_0,θ_0)⊆(Γ,α,θ) has trivial normal holonomy if the holonomy map K_γ^⊥ is trivial for all loops γ⊂Γ_0.")

CPU times: user 3.33 s, sys: 30 ms, total: 3.36 s
Wall time: 471 ms


[{'entity_group': 'MATH_TERM',
  'score': 0.8962922,
  'word': ' trivial normal holonomy',
  'start': 40,
  'end': 63}]

In [16]:
model_name = 'ner_model/InriaValda/cc_math_roberta_ep10_ft_3ep_train_size_11366/'

model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline(task="token-classification", model=model.to('cpu'), tokenizer=tokenizer, aggregation_strategy="simple")
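# NOTE: a bare `%time` line times an empty statement, so the figures printed below
# do not measure the pipe call; `%time pipe(...)` on a single line would.
%time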
pipe("The Betti poset of a poset P is the subposet consisting of all homologically contributing elements, B(P)={q∈ P  | _i(Δ_q) ≠ 0  i}.")


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 8.82 µs


[{'entity_group': 'MATH_TERM',
  'score': 0.8791002,
  'word': ' Betti poset',
  'start': 4,
  'end': 15}]