Suzan Verberne 17.10.2023 1
Assignment 2: sequence labelling
Text mining course
This is a hand-in assignment for groups of two students. Send in via Brightspace before or on
Tuesday November 7:
• Submit your report as PDF and your python code as separate file. Don’t upload a zip file
containing the PDF (the Python code might be zipped if it consists of multiple files).
• Your report should not be longer than 3 pages.
• Do not copy text from sources (other groups, web pages, generative models such as
chatGPT). Turnitin is enabled and a large overlap will be reported to the Board of Examiners.
Goals of this assignment
• You can pre-process existing annotated text data into the data structure that you need for
classifier learning.
• You can perform hyperparameter optimization.
• You can perform a sequence labelling task with annotated data in Huggingface.
• You can evaluate sequence labelling with the suitable evaluation metrics.
Preliminaries
• You have followed the Huggingface tutorial on token classification
https://huggingface.co/learn/nlp-course/chapter7/2 and its preliminaries (exercises week 6
and 7).
• You have all the required Python packages installed and Python 3.10.
We are going to train an NER classifier for the task “Emerging and Rare entity recognition” from the
Workshop on Noisy User-generated Text (W-NUT). The description of the task can be found at
https://noisy-text.github.io/2017/emerging-rare-entities.html (I put the data itself on Brightspace)
Tasks
1. Download W-NUT_data.zip from the Brightspace assignment and unzip the directory. It
contains 3 IOB files: wnut17train.conll (train), emerging.dev.conll (dev),
emerging.test.annotated (test).
2. Convert the IOB data to the correct data structure for token classification in Huggingface
(words and labels like the conll2023 data in the tutorial) and align the labels with the tokens.
Note that since you are working with a custom dataset, the data conversion is a necessary
step for using the Huggingface training function.
Suzan Verberne 17.10.2023 2
3. Set up the evaluation correctly for the W-NUT test set, following the tutorial.
4. Fine-tune a model with the default hyperparameter settings on the train set and evaluate
the model on the test set. These are your baseline results.
5. Set up hyperparameter optimization with the AdamW optimizer. During optimization, use
the dev set as validation. After the model has been optimized, evaluate the result on the test
set.
6. Extend the evaluation function so that it shows the Precision, Recall and F-score for each of
the entity types (person, location, etc.) on the test set. Include the metrics for the B-label of
the entity type, the I-label, and the full entities. Look up the definitions of macro- and micro-
average scores and compute the macro- and micro average F1 scores over all entities.
Write a report of at most 3 pages in which you:
• describe the task and the data (give a few statistics. What are the entity types?) and briefly
describe two challenges of the task and the data.
• show your results:
o a results table with both the baseline results and the results after hyperparameter
optimization (do not report results on the dev set, only on the test set): a table with
Precision, Recall, F-score for both settings.
o a table with the results after hyperparameter optimization for the different entity
types (Precision, Recall, F-score for B, I, and the full entities), and the macro- and
micro F1 scores.
• write brief conclusions. Address the following questions:
o What is the effect of hyperparameter optimization on the quality of the model?
o What does the difference between scores for different entity types tell you?
o Where does the difference between macro- and micro-averaged F1 scores come
from?
Grading
Maximum 2 points for each of the following criteria:
• General: length correct (2-3 pages) and proper writing + formatting
• Description of the task and the data, with description of 2 challenges
• Baseline results with default hyperparameter settings and results with optimized
hyperparameter settings
• Results after hyperparameter optimization for the different entity types
• Sensible conclusions, briefly addressing the questions listed above.

In [1]:
import datasets

In [2]:
names=[
    "O",
    "B-corporation",
    "I-corporation",
    "B-creative-work",
    "I-creative-work",
    "B-group",
    "I-group",
    "B-location",
    "I-location",
    "B-person",
    "I-person",
    "B-product",
    "I-product"
    ]

In [3]:
train, valid, test = 'wnut17train.conll', 'emerging.dev.conll', 'emerging.test.annotated'
path = 'W-NUT_data'

raw_datasets = {}
for file, name in zip([train, valid, test], ['train', 'validation', 'test']):
    id = 0
    raw_datasets[name] = {'id': [], 'tokens': [], 'ner_tags': []}
    with open(f'{path}/{file}', 'r') as f:
        tokens, ner_tags = [], []
        for line in f:
            try:
                token, ner_tag = line.split()
                tokens.append(token)
                ner_tags.append(names.index(ner_tag))
            except:
                raw_datasets[name]['id'].append([id for _ in range(len(tokens))])
                raw_datasets[name]['tokens'].append(tokens)
                raw_datasets[name]['ner_tags'].append(ner_tags)
                id += 1
                tokens, ner_tags = [], []
    raw_datasets[name] = datasets.Dataset.from_dict(raw_datasets[name])

raw_datasets = datasets.DatasetDict(raw_datasets)
display(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3394
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1287
    })
})

# Processing the data

In [4]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [5]:
tokenizer.is_fast

True

In [6]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)

In [7]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [8]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, -100]


In [9]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [10]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

Map:   0%|          | 0/3394 [00:00<?, ? examples/s]

Map:   0%|          | 0/1009 [00:00<?, ? examples/s]

Map:   0%|          | 0/1287 [00:00<?, ? examples/s]

# Fine-tuning the model with keras

## Data collation

In [11]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

## Metrics

In [12]:
import evaluate

metric = evaluate.load("seqeval")

In [13]:
import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

## Defining the model

In [14]:
id2label = {i: label for i, label in enumerate(names)}
label2id = {v: k for k, v in id2label.items()}

In [15]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##  Fine-tuning the model

In [20]:
# from huggingface_hub import interpreter_login

# interpreter_login()

from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [21]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

In [22]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

  0%|          | 0/1275 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


KeyboardInterrupt: 

In [None]:
trainer.push_to_hub(commit_message="Training complete")

'https://huggingface.co/rubinho/bert-finetuned-ner/tree/main/'

## Evaluating the model (baseline)

In [None]:
def print_tex_table(all_metrics):
    print("\\begin{table}[h]")
    print("\\centering")
    print("\\begin{tabular}{lrrrr}")
    print("\\toprule")
    print(" & \\textbf{Precision} & \\textbf{Recall} & \\textbf{F1} & \\textbf{Number} \\\\")
    print("\\midrule")
    for key, value in all_metrics.items():
        if "overall" not in key:
            print(f"{key.title()} & {value['precision']:.2f} & {value['recall']:.2f} & {value['f1']:.2f} & {value['number']} \\\\")
    print("\\midrule")
    print(" & \\textbf{Overall precision} & \\textbf{Overall recall} & \\textbf{Overall F1} & \\textbf{Overall accuracy} \\\\")
    print("\\midrule")
    print(f" & {all_metrics['overall_precision']:.2f} & {all_metrics['overall_recall']:.2f} & {all_metrics['overall_f1']:.2f} & {all_metrics['overall_accuracy']:.2f} \\\\")
    print("\\bottomrule")
    print("\\end{tabular}")
    print("\\caption{Results of the NER model on the test set.}")
    print("\\label{tab:ner-results}")
    print("\\end{table}")

In [None]:
def compute_all_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return all_metrics

In [None]:
predictions = trainer.predict(tokenized_datasets["test"])
eval_preds = predictions.predictions, predictions.label_ids
all_metrics = compute_all_metrics(eval_preds)
print_tex_table(all_metrics)

\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\toprule
 & \textbf{Precision} & \textbf{Recall} & \textbf{F1} & \textbf{Number} \\
\midrule
Corporation & 0.24 & 0.23 & 0.23 & 66 \\
Creative-Work & 0.40 & 0.18 & 0.25 & 142 \\
Group & 0.35 & 0.13 & 0.19 & 165 \\
Location & 0.55 & 0.43 & 0.49 & 150 \\
Person & 0.79 & 0.47 & 0.59 & 429 \\
Product & 0.16 & 0.09 & 0.11 & 127 \\
\midrule
 & \textbf{Overall precision} & \textbf{Overall recall} & \textbf{Overall F1} & \textbf{Overall accuracy} \\
\midrule
 & 0.54 & 0.31 & 0.40 & 0.94 \\
\bottomrule
\end{tabular}
\caption{Results of the NER model on the test set.}
\label{tab:ner-results}
\end{table}


## Hyperparameter optimization

In [None]:
def model_init(): return AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)
    
trainer = Trainer(
    # model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    model_init=model_init,
)

def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32, 64, 128]),
        "beta_1": trial.suggest_float("beta_1", 0.9, 0.999),
        "beta_2": trial.suggest_float("beta_2", 0.9, 0.999),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-9, 1e-7, log=True),
    }

best_trials = trainer.hyperparameter_search(
    direction="minimize",
    backend="optuna",
    hp_space=optuna_hp_space,
    n_trials=20
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2023-11-02 20:02:08,540] A new study created in memory with name: no-name-f84da73a-7a98-47ba-a27f-9249041b6451
Trying to set beta_1 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Trying to set beta_2 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/81 [00:00<?, ?it/s]

  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 2.01765513420105, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.7674605298367675, 'eval_runtime': 3.6061, 'eval_samples_per_second': 279.805, 'eval_steps_per_second': 35.218, 'epoch': 1.0}


  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 1.7563742399215698, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.8605833556328606, 'eval_runtime': 3.522, 'eval_samples_per_second': 286.488, 'eval_steps_per_second': 36.059, 'epoch': 2.0}


  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 1.661307454109192, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.8697886004816698, 'eval_runtime': 3.5877, 'eval_samples_per_second': 281.238, 'eval_steps_per_second': 35.399, 'epoch': 3.0}
{'train_runtime': 502.0391, 'train_samples_per_second': 20.281, 'train_steps_per_second': 0.161, 'train_loss': 1.7155686366705247, 'epoch': 3.0}


[I 2023-11-02 20:12:47,556] Trial 0 finished with value: 0.8697886004816698 and parameters: {'learning_rate': 1.6778346971618903e-06, 'per_device_train_batch_size': 128, 'beta_1': 0.9524204721092655, 'beta_2': 0.9898800495154644, 'weight_decay': 0.09107553811692248, 'adam_epsilon': 5.159802545809668e-09}. Best is trial 0 with value: 0.8697886004816698.
Trying to set beta_1 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Trying to set beta_2 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/81 [00:00<?, ?it/s]

  0%|          | 0/127 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 1.1917213201522827, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.8812416376772813, 'eval_runtime': 3.4826, 'eval_samples_per_second': 289.727, 'eval_steps_per_second': 36.467, 'epoch': 1.0}


  0%|          | 0/127 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.6802327632904053, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.8816697886004816, 'eval_runtime': 3.423, 'eval_samples_per_second': 294.77, 'eval_steps_per_second': 37.102, 'epoch': 2.0}


  0%|          | 0/127 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.6492249965667725, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.8816697886004816, 'eval_runtime': 3.4324, 'eval_samples_per_second': 293.962, 'eval_steps_per_second': 37.0, 'epoch': 3.0}
{'train_runtime': 539.6587, 'train_samples_per_second': 18.867, 'train_steps_per_second': 0.15, 'train_loss': 0.7623669188699604, 'epoch': 3.0}


[I 2023-11-02 20:23:18,206] Trial 1 finished with value: 0.8816697886004816 and parameters: {'learning_rate': 4.972493760089585e-06, 'per_device_train_batch_size': 128, 'beta_1': 0.9816505899477423, 'beta_2': 0.9944230938327104, 'weight_decay': 0.03753375740287284, 'adam_epsilon': 3.0993883769028073e-09}. Best is trial 0 with value: 0.8697886004816698.
Trying to set beta_1 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Trying to set beta_2 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/321 [00:00<?, ?it/s]

  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 0.34814760088920593, 'eval_precision': 0.5, 'eval_recall': 0.3026315789473684, 'eval_f1': 0.3770491803278689, 'eval_accuracy': 0.9126036928017126, 'eval_runtime': 3.407, 'eval_samples_per_second': 296.158, 'eval_steps_per_second': 37.277, 'epoch': 1.0}


  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 0.3662056624889374, 'eval_precision': 0.6199021207177814, 'eval_recall': 0.45454545454545453, 'eval_f1': 0.5244996549344375, 'eval_accuracy': 0.9248595129783248, 'eval_runtime': 3.4393, 'eval_samples_per_second': 293.372, 'eval_steps_per_second': 36.926, 'epoch': 2.0}


  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 0.3447449505329132, 'eval_precision': 0.5609418282548476, 'eval_recall': 0.48444976076555024, 'eval_f1': 0.5198973042362003, 'eval_accuracy': 0.9273213807867273, 'eval_runtime': 3.466, 'eval_samples_per_second': 291.116, 'eval_steps_per_second': 36.642, 'epoch': 3.0}
{'train_runtime': 257.7757, 'train_samples_per_second': 39.499, 'train_steps_per_second': 1.245, 'train_loss': 0.13035248373156397, 'epoch': 3.0}


[I 2023-11-02 20:27:43,898] Trial 2 finished with value: 2.4926102740433254 and parameters: {'learning_rate': 6.094419647666449e-05, 'per_device_train_batch_size': 32, 'beta_1': 0.9772254518432223, 'beta_2': 0.9376460245363676, 'weight_decay': 0.013520174622271952, 'adam_epsilon': 1.1611797544617362e-09}. Best is trial 0 with value: 0.8697886004816698.
Trying to set beta_1 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Trying to set beta_2 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/162 [00:00<?, ?it/s]

  0%|          | 0/127 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.40198978781700134, 'eval_precision': 0.5608695652173913, 'eval_recall': 0.30861244019138756, 'eval_f1': 0.39814814814814814, 'eval_accuracy': 0.9054321648381054, 'eval_runtime': 20.4979, 'eval_samples_per_second': 49.225, 'eval_steps_per_second': 6.196, 'epoch': 1.0}


  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 0.3615639805793762, 'eval_precision': 0.5843071786310517, 'eval_recall': 0.41866028708133973, 'eval_f1': 0.4878048780487805, 'eval_accuracy': 0.9201498528231201, 'eval_runtime': 19.886, 'eval_samples_per_second': 50.739, 'eval_steps_per_second': 6.386, 'epoch': 2.0}


  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 0.3354020416736603, 'eval_precision': 0.5041436464088398, 'eval_recall': 0.4366028708133971, 'eval_f1': 0.46794871794871795, 'eval_accuracy': 0.9228793149585229, 'eval_runtime': 20.4373, 'eval_samples_per_second': 49.37, 'eval_steps_per_second': 6.214, 'epoch': 3.0}
{'train_runtime': 505.0982, 'train_samples_per_second': 20.158, 'train_steps_per_second': 0.321, 'train_loss': 0.18717787000868055, 'epoch': 3.0}


[I 2023-11-02 20:37:26,280] Trial 3 finished with value: 2.331574550129478 and parameters: {'learning_rate': 5.358084222725486e-05, 'per_device_train_batch_size': 64, 'beta_1': 0.9613551164268322, 'beta_2': 0.9517801109094812, 'weight_decay': 0.002203737027811026, 'adam_epsilon': 2.942188939581708e-09}. Best is trial 0 with value: 0.8697886004816698.
Trying to set beta_1 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Trying to set beta_2 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/639 [00:00<?, ?it/s]

  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 0.33786246180534363, 'eval_precision': 0.4536652835408022, 'eval_recall': 0.3923444976076555, 'eval_f1': 0.4207825529185375, 'eval_accuracy': 0.9142627776291142, 'eval_runtime': 3.3792, 'eval_samples_per_second': 298.592, 'eval_steps_per_second': 37.583, 'epoch': 1.0}


  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 0.410776823759079, 'eval_precision': 0.620253164556962, 'eval_recall': 0.4102870813397129, 'eval_f1': 0.49388048956083513, 'eval_accuracy': 0.9195611453037196, 'eval_runtime': 3.3513, 'eval_samples_per_second': 301.077, 'eval_steps_per_second': 37.896, 'epoch': 2.0}
{'loss': 0.1374, 'learning_rate': 1.8536606030301622e-05, 'epoch': 2.35}


  0%|          | 0/127 [00:00<?, ?it/s]

{'eval_loss': 0.43329110741615295, 'eval_precision': 0.5471406491499228, 'eval_recall': 0.423444976076555, 'eval_f1': 0.4774106540795685, 'eval_accuracy': 0.9195611453037196, 'eval_runtime': 3.3072, 'eval_samples_per_second': 305.093, 'eval_steps_per_second': 38.401, 'epoch': 3.0}
{'train_runtime': 208.6314, 'train_samples_per_second': 48.804, 'train_steps_per_second': 3.063, 'train_loss': 0.1153560781702749, 'epoch': 3.0}


[I 2023-11-02 20:42:13,294] Trial 4 finished with value: 2.3675574246097657 and parameters: {'learning_rate': 8.521504498822112e-05, 'per_device_train_batch_size': 16, 'beta_1': 0.9584890540274765, 'beta_2': 0.9311913363721316, 'weight_decay': 0.0756085523016666, 'adam_epsilon': 1.4999601472510144e-08}. Best is trial 0 with value: 0.8697886004816698.
Trying to set beta_1 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Trying to set beta_2 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/639 [00:00<?, ?it/s]

  0%|          | 0/127 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.5929983854293823, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.8816697886004816, 'eval_runtime': 3.3099, 'eval_samples_per_second': 304.838, 'eval_steps_per_second': 38.369, 'epoch': 1.0}


  0%|          | 0/127 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.5345761775970459, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.8816697886004816, 'eval_runtime': 3.3407, 'eval_samples_per_second': 302.033, 'eval_steps_per_second': 38.016, 'epoch': 2.0}
{'loss': 0.447, 'learning_rate': 3.9903355548671103e-07, 'epoch': 2.35}


  0%|          | 0/127 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.5209224224090576, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_accuracy': 0.8816697886004816, 'eval_runtime': 3.3896, 'eval_samples_per_second': 297.671, 'eval_steps_per_second': 37.467, 'epoch': 3.0}
{'train_runtime': 274.4272, 'train_samples_per_second': 37.103, 'train_steps_per_second': 2.328, 'train_loss': 0.3992978963120629, 'epoch': 3.0}


[I 2023-11-02 20:46:50,580] Trial 5 finished with value: 0.8816697886004816 and parameters: {'learning_rate': 1.8344060572374702e-06, 'per_device_train_batch_size': 16, 'beta_1': 0.9462764214105747, 'beta_2': 0.9694473585920098, 'weight_decay': 0.07895287421462685, 'adam_epsilon': 3.4478980955036015e-08}. Best is trial 0 with value: 0.8697886004816698.
Trying to set beta_1 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Trying to set beta_2 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/639 [00:00<?, ?it/s]

  0%|          | 0/127 [00:00<?, ?it/s]

[I 2023-11-02 20:47:33,959] Trial 6 pruned. 


{'eval_loss': 0.327398419380188, 'eval_precision': 0.4412470023980815, 'eval_recall': 0.44019138755980863, 'eval_f1': 0.44071856287425154, 'eval_accuracy': 0.9183837302649184, 'eval_runtime': 3.3218, 'eval_samples_per_second': 303.751, 'eval_steps_per_second': 38.232, 'epoch': 1.0}


Trying to set beta_1 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Trying to set beta_2 in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/81 [00:00<?, ?it/s]

[W 2023-11-02 20:49:28,102] Trial 7 failed with parameters: {'learning_rate': 1.80445820308461e-05, 'per_device_train_batch_size': 128, 'beta_1': 0.9928035321205118, 'beta_2': 0.9012963314335979, 'weight_decay': 0.055559442722266, 'adam_epsilon': 1.2632327057249022e-09} because of the following error: RuntimeError('CUDA error: out of memory\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n').
Traceback (most recent call last):
  File "c:\Users\Ruben\miniconda3\envs\torch\Lib\site-packages\optuna\study\_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "c:\Users\Ruben\miniconda3\envs\torch\Lib\site-packages\transformers\integrations\integration_utils.py", line 199, in _objective
    trainer.train(resume_from_checkpoint=checkpoint,

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [None]:
print(best_trials)