# 0. GPU check

* 이 코드는 Nvidia GPU를 사용하는 컴퓨터에서, train / test 데이터가 분리되어있는 csv 파일을 사용하는 것을 전제로 작성됨

In [1]:
import torch

if torch.cuda.is_available():
    device_count = torch.cuda.device_count()
    print("device_count: {}".format(device_count))
    for device_num in range(device_count):
        print("device {} capability {}".format(
            device_num,
            torch.cuda.get_device_capability(device_num)))
        print("device {} name {}".format(
            device_num, 
            torch.cuda.get_device_name(device_num)))
else:
    print("no cuda device")

device_count: 1
device 0 capability (8, 6)
device 0 name NVIDIA GeForce RTX 3080


In [2]:
if torch.cuda.is_available() :
    device = torch.device("cuda:0")
else : 
    device = torch.device("cpu")

In [3]:
from pynvml import *

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()
    
print_gpu_utilization()

GPU memory occupied: 428 MB.


* 모델 훈련과정에서 GPU 메모리 용량 초과 시, 개발서버 콘솔에서 직접 `nvidia-smi` 명령어 실행 후 메모리를 점유하고 있는 process의 PID를 찾아 `sudo kill -9 {pid}` 로 프로세스 종료해주면 됨

# 1. Import packages

In [4]:
## Need to check if packages are compatible
# !pip install accelerate nvidia-ml-py3
# !pip install datasets==2.4.0
# !pip install huggingface_hub==0.9.1
# !pip install transformers==4.22.1 # bf16, tf32 등 사용하려면 4.2 이상 필요
# !pip install pyarrow==9.0.0

* huggingface_hub와 transformers 간 호환가능한 버전 확인 필요
* 만약 성능 테스트를 위해 datasets api를 사용할거라면 datasets 역시 호환 가능 버전 확인해야 함
* 세 가지 dependencies를 사용한다는 가정 하에, pyarrow 라이브러리도 필요.

In [5]:
import transformers
import datasets
import huggingface_hub
import pyarrow

print(transformers.__version__)
print(datasets.__version__)
print(huggingface_hub.__version__)
print(pyarrow.__version__)

# 4.22.1
# 2.4.0
# 0.9.1
# 9.0.0

4.22.1
2.4.0
0.9.1
9.0.0


In [6]:
import os
import re
import math
import numpy as np
import pandas as pd

# 'You can use tf32' if you are acessing Ampere hardware
import torch
torch.backends.cuda.matmul.allow_tf32 = True

from datasets import load_dataset, load_metric, ClassLabel
from sklearn.utils.class_weight import compute_class_weight
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.examples.pbt_transformers.utils import (
    download_data,
    build_compute_metrics_fn,
)
from ray.tune.schedulers import PopulationBasedTraining
from transformers import (
    glue_tasks_num_labels,
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    GlueDataset,
    GlueDataTrainingArguments,
    TrainingArguments,
)

In [29]:
from transformers import EarlyStoppingCallback

# 2. Import Data

* xxx_train.csv, xxx_test.csv 파일은 아래 형식으로 전처리된 csv 파일이어야 함 (column name: `text`, `label`)


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600", style="text-align:center">
    text
    </th>
    <th class="mdc-text-purple-600", style="text-align:center">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      Ok lar... Joking wif u oni...
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      U dun say so early hor... U c already then say...
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      Nah I don't think he goes to usf, he lives around here though
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
</table>

In [8]:
data_name = "IMDB" ## covid_articles / financial_news / IMDB / naver_movie_review / spam

dataset = load_dataset('csv', data_files={'train': f'../data_splited/{data_name}_train.csv',
                                          'test': f'../data_splited/{data_name}_test.csv'})
dataset

Using custom data configuration default-5e9b2acfce9f0b59
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-5e9b2acfce9f0b59/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 39999
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 9999
    })
})

# 3. Data Preprocessing

* load_dataset 함수로 불러온 데이터를 수정할 때는 수정 내용을 담은 함수를 만들고, 이를 map 함수로 각 원소에 적용함 ([링크](https://huggingface.co/docs/datasets/v1.4.0/processing.html#processing-data-row-by-row)에서 확인)

In [9]:
## remove specal characters

def remove_sp(example):
    example["text"]=re.sub(r'[^a-z|A-Z|0-9|ㄱ-ㅎ|ㅏ-ㅣ|가-힣| ]+', '', str(example["text"]))
    return example

dataset = dataset.map(remove_sp)

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-5e9b2acfce9f0b59/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a/cache-0c659aeae188f731.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-5e9b2acfce9f0b59/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a/cache-170d6a353647063a.arrow


In [10]:
## label encoding

labels = list(set(dataset["train"]["label"] + dataset["test"]["label"]))
num_labels = len(labels)

def encoding_label(example):
    str_to_int = ClassLabel(num_classes=num_labels, names=labels)
    example["label"]=str_to_int.str2int(example["label"])
    return example

if type(labels[0]) == str:
    dataset = dataset.map(encoding_label)
    
print(num_labels)

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-5e9b2acfce9f0b59/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a/cache-82b60351b1fd1890.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-5e9b2acfce9f0b59/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a/cache-0fbc2fb5dafc34c1.arrow


2


# 4. Load PLM & Tokenizing

In [11]:
model_name = "bert-base-cased"
# model_name = "bert-base-multilingual-cased"
# model_name = "xlm-roberta-base"

# model_name = "klue/bert-base"
# model_name = "klue/roberta-base"


In [12]:
# Download cache tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [13]:
def tokenize_function(examples):
    tokenized_batch = tokenizer(examples["text"], padding="max_length", truncation=True) # padding : ['longest', 'max_length', 'do_not_pad']
    return tokenized_batch

In [14]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-5e9b2acfce9f0b59/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a/cache-1b3db1ac7e2ce261.arrow


  0%|          | 0/10 [00:00<?, ?ba/s]

In [15]:
# train_dataset = tokenized_datasets["train"].shuffle(seed=1919).select(range(0,math.floor(len(tokenized_datasets["train"])*0.7)))
# eval_dataset = tokenized_datasets["train"].shuffle(seed=1919).select(range(math.floor(len(tokenized_datasets["train"])*0.7), len(tokenized_datasets["train"])))
# test_dataset = tokenized_datasets["test"]

In [16]:
# data for test
train_dataset = tokenized_datasets["train"].shuffle(seed=1919).select(range(1000))
eval_dataset = tokenized_datasets["train"].shuffle(seed=1919).select(range(1000))
test_dataset = tokenized_datasets["test"]

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/csv/default-5e9b2acfce9f0b59/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a/cache-e9e4b6fdb429ddad.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/csv/default-5e9b2acfce9f0b59/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a/cache-e9e4b6fdb429ddad.arrow


# 5. Check class weights

In [17]:
def class_weight(train_dataset) :
    
    train_labels = np.array(train_dataset["label"])
    class_weights = compute_class_weight(class_weight = 'balanced', classes = np.unique(train_labels), y = train_labels)
    
    weights = torch.tensor(class_weights, dtype = torch.float)
    
    return weights

In [18]:
weights = class_weight(train_dataset)
print(weights)

tensor([1.0225, 0.9785])


# 6. Modeling

In [24]:
## Customize training strategy

task_data_dir = "test-model"
gpus_per_trial = 1
cpus_per_trial = 16
n_trials = 5
metric = load_metric("f1") # atasets.list_metrics() 
seed = 818

Downloading builder script:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

In [25]:
# Download model and features

config = AutoConfig.from_pretrained(
    model_name, 
    num_labels=num_labels
)

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        config=config
        )

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



In [26]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions=np.argmax(logits, axis = -1)
    return metric.compute(predictions=predictions, references=labels)

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=1,   # batch size per device during training
    per_device_eval_batch_size=10,   # batch size for evaluation
    warmup_steps=1000,               # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=200,               # How often to print logs
    do_train=True,                   # Perform training
    do_eval=True,                    # Perform evaluation
    evaluation_strategy="epoch",     # evalute after each epoch
    gradient_accumulation_steps=64,  # total number of steps before back propagation
    fp16=True,                       # Use mixed precision
    fp16_opt_level="02",             # mixed precision mode
    run_name="ProBert-BFD-MS",       # experiment name
    seed=3                           # Seed for experiment reproducibility 3x3
)
```

In [30]:
training_args = TrainingArguments(
    output_dir=".",
    learning_rate=2e-5, # config
    do_train=True,
    do_eval=True,
    no_cuda=gpus_per_trial <= 0,
    evaluation_strategy="steps",
    save_strategy="steps",
    metric_for_best_model="accuracy",
    greater_is_better=True,
    load_best_model_at_end=True,
    num_train_epochs=2,  # config
    max_steps=-1,  # config
    per_device_train_batch_size=8,  # config
    per_device_eval_batch_size=8,  # config
    warmup_steps=0,
    warmup_ratio=0.1,  # config
    weight_decay=0.1,  # config
    logging_dir="./logs",
    skip_memory_metrics=True,
    report_to="none",
    fp16=True,
    # bf16=True,
    # tf32=True,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    seed=seed,
    eval_steps = 50
    )
    
# trainer = Trainer(
#     model_init=model_init,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
#     compute_metrics=compute_metrics,
#     )

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss
        weight = weights.to(device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weight)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
    
trainer = CustomTrainer(
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
    )

PyTorch: setting up devices
loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/pytorch_model.bin
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly i

In [None]:
# Hyperparameter tuning with ray tune

tune_config = {
#     "per_device_train_batch_size": tune.choice([2, 4, 8]),
    "num_train_epochs": tune.choice([2, 5, 10]),
#     "num_train_epochs": [x for x in range(2, 21)],
}

# PopulationBasedTraining
# worker might copy the model parameters from a better performing worker or explore new hyperparameters by changing the current values randomly
# cf. ASHAScheduler
scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_f1",
    mode="max",
    perturbation_interval=1,
    hyperparam_mutations={
#         "num_train_epochs": [x for x in range(2, 21)],
        "weight_decay": tune.uniform(0.0, 0.3), # tune.uniform(1, 10) == np.random.uniform(1, 10)
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "warmup_ratio": tune.uniform(0.0, 0.3),
#         # Perturb factor3 by changing it to an adjacent value, e.g.
#         # 10 -> 1 or 10 -> 100. Resampling will choose at random.
#         "factor_3": [1, 10, 100, 1000, 10000],
#         # Using tune.choice is NOT equivalent to the above.
#         # factor_4 is treated as a continuous hyperparameter.
#         "factor_4": tune.choice([1, 10, 100, 1000, 10000]),
    },
)


reporter = CLIReporter(
    parameter_columns={
        "weight_decay": "w_decay",
        "learning_rate": "lr",
        "per_device_train_batch_size": "train_bs/gpu",
        "num_train_epochs": "num_epochs",
    },
    metric_columns=["eval_f1", "eval_loss", "epoch", "training_iteration"],
)

result = trainer.hyperparameter_search(
    direction = "maximize",
    hp_space = lambda _: tune_config,
    backend="ray",
    n_trials=n_trials,
    resources_per_trial={"cpu": cpus_per_trial, "gpu": gpus_per_trial},
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    stop=None,
    progress_reporter=reporter,
    local_dir="./test-results",
    name="tune_transformer_pbt",
    log_to_file=True,
)

[2m[36m(pid=2901398)[0m 2022-10-14 09:15:39.436636: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Current time: 2022-10-14 09:15:37 (running for 00:00:00.28)
Memory usage on this node: 11.4/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

[2m[36m(_objective pid=2901398)[0m Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
[2m[36m(_objective pid=2901398)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(_objective pid=2901398)[0m - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassi

== Status ==
Current time: 2022-10-14 09:15:45 (running for 00:00:07.93)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

  3%|▎         | 2/62 [00:01<00:38,  1.54it/s]
  5%|▍         | 3/62 [00:01<00:38,  1.55it/s]
  6%|▋         | 4/62 [00:02<00:37,  1.55it/s]
  8%|▊         | 5/62 [00:03<00:36,  1.56it/s]
 10%|▉         | 6/62 [00:03<00:35,  1.56it/s]
 11%|█▏        | 7/62 [00:04<00:35,  1.56it/s]
 13%|█▎        | 8/62 [00:05<00:34,  1.56it/s]
 15%|█▍        | 9/62 [00:05<00:33,  1.56it/s]


== Status ==
Current time: 2022-10-14 09:15:50 (running for 00:00:12.93)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

 16%|█▌        | 10/62 [00:06<00:33,  1.56it/s]
 18%|█▊        | 11/62 [00:07<00:32,  1.56it/s]
 19%|█▉        | 12/62 [00:08<00:36,  1.36it/s]
 21%|██        | 13/62 [00:08<00:34,  1.41it/s]
 23%|██▎       | 14/62 [00:09<00:32,  1.45it/s]
 24%|██▍       | 15/62 [00:09<00:31,  1.48it/s]
 26%|██▌       | 16/62 [00:10<00:30,  1.51it/s]
 27%|██▋       | 17/62 [00:11<00:29,  1.52it/s]


== Status ==
Current time: 2022-10-14 09:15:55 (running for 00:00:17.94)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

 29%|██▉       | 18/62 [00:11<00:28,  1.53it/s]
 31%|███       | 19/62 [00:12<00:27,  1.54it/s]
 32%|███▏      | 20/62 [00:13<00:27,  1.55it/s]
 34%|███▍      | 21/62 [00:13<00:26,  1.55it/s]
 35%|███▌      | 22/62 [00:14<00:25,  1.55it/s]
 37%|███▋      | 23/62 [00:15<00:25,  1.55it/s]
 39%|███▊      | 24/62 [00:15<00:24,  1.55it/s]


== Status ==
Current time: 2022-10-14 09:16:00 (running for 00:00:22.94)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

 40%|████      | 25/62 [00:16<00:23,  1.55it/s]
 42%|████▏     | 26/62 [00:17<00:23,  1.56it/s]
 44%|████▎     | 27/62 [00:17<00:22,  1.56it/s]
 45%|████▌     | 28/62 [00:18<00:21,  1.56it/s]
 47%|████▋     | 29/62 [00:18<00:21,  1.56it/s]
 48%|████▊     | 30/62 [00:19<00:20,  1.56it/s]
 50%|█████     | 31/62 [00:20<00:19,  1.56it/s]
 52%|█████▏    | 32/62 [00:21<00:20,  1.45it/s]


== Status ==
Current time: 2022-10-14 09:16:05 (running for 00:00:27.94)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

 53%|█████▎    | 33/62 [00:21<00:19,  1.48it/s]
 55%|█████▍    | 34/62 [00:22<00:18,  1.50it/s]
 56%|█████▋    | 35/62 [00:22<00:17,  1.52it/s]
 58%|█████▊    | 36/62 [00:23<00:16,  1.53it/s]
 60%|█████▉    | 37/62 [00:24<00:16,  1.54it/s]
 61%|██████▏   | 38/62 [00:24<00:15,  1.54it/s]
 63%|██████▎   | 39/62 [00:25<00:14,  1.55it/s]
 65%|██████▍   | 40/62 [00:26<00:14,  1.55it/s]


== Status ==
Current time: 2022-10-14 09:16:10 (running for 00:00:32.94)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

 66%|██████▌   | 41/62 [00:26<00:13,  1.55it/s]
 68%|██████▊   | 42/62 [00:27<00:12,  1.56it/s]
 69%|██████▉   | 43/62 [00:28<00:12,  1.56it/s]
 71%|███████   | 44/62 [00:28<00:11,  1.56it/s]
 73%|███████▎  | 45/62 [00:29<00:10,  1.56it/s]
 74%|███████▍  | 46/62 [00:29<00:10,  1.56it/s]
 76%|███████▌  | 47/62 [00:30<00:09,  1.56it/s]


== Status ==
Current time: 2022-10-14 09:16:15 (running for 00:00:37.94)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

 77%|███████▋  | 48/62 [00:31<00:08,  1.56it/s]
 79%|███████▉  | 49/62 [00:31<00:08,  1.56it/s]
 81%|████████  | 50/62 [00:32<00:07,  1.56it/s]
[2m[36m(_objective pid=2901398)[0m 
  0%|          | 0/125 [00:00<?, ?it/s][A
[2m[36m(_objective pid=2901398)[0m 
  3%|▎         | 4/125 [00:00<00:03, 32.07it/s][A
[2m[36m(_objective pid=2901398)[0m 
  6%|▋         | 8/125 [00:00<00:04, 26.87it/s][A
[2m[36m(_objective pid=2901398)[0m 
  9%|▉         | 11/125 [00:00<00:04, 25.78it/s][A
[2m[36m(_objective pid=2901398)[0m 
 11%|█         | 14/125 [00:00<00:04, 25.14it/s][A
[2m[36m(_objective pid=2901398)[0m 
 14%|█▎        | 17/125 [00:00<00:04, 24.79it/s][A
[2m[36m(_objective pid=2901398)[0m 
 16%|█▌        | 20/125 [00:00<00:04, 24.57it/s][A
[2m[36m(_objective pid=2901398)[0m 
 18%|█▊        | 23/125 [00:00<00:04, 24.42it/s][A
[2m[36m(_objective pid=2901398)[0m 
 21%|██        | 26/125 [00:01<00:04, 24.34it/s][A
[2m[36m(_objective pid=2901398)[0m 
 23%|██▎ 

== Status ==
Current time: 2022-10-14 09:16:20 (running for 00:00:42.95)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------|
| _objective_c4104_00000 | RUNNING  | 172.17.0.3:2901398 | 0.0885716 | 4.40175e-05 |                |            2 |
| _objective_c4104_00001 | PENDING  |                    | 0.215004  | 1.48101e-05 |                |            5

[2m[36m(_objective pid=2901398)[0m 
 74%|███████▎  | 92/125 [00:03<00:01, 24.11it/s][A
[2m[36m(_objective pid=2901398)[0m 
 76%|███████▌  | 95/125 [00:03<00:01, 24.11it/s][A
[2m[36m(_objective pid=2901398)[0m 
 78%|███████▊  | 98/125 [00:04<00:01, 24.12it/s][A
[2m[36m(_objective pid=2901398)[0m 
 81%|████████  | 101/125 [00:04<00:00, 24.11it/s][A
[2m[36m(_objective pid=2901398)[0m 
 83%|████████▎ | 104/125 [00:04<00:00, 24.06it/s][A
[2m[36m(_objective pid=2901398)[0m 
 86%|████████▌ | 107/125 [00:04<00:00, 24.07it/s][A
[2m[36m(_objective pid=2901398)[0m 
 88%|████████▊ | 110/125 [00:04<00:00, 24.07it/s][A
[2m[36m(_objective pid=2901398)[0m 
 90%|█████████ | 113/125 [00:04<00:00, 24.11it/s][A
[2m[36m(_objective pid=2901398)[0m 
 93%|█████████▎| 116/125 [00:04<00:00, 24.11it/s][A
[2m[36m(_objective pid=2901398)[0m 
 95%|█████████▌| 119/125 [00:04<00:00, 24.11it/s][A
[2m[36m(_objective pid=2901398)[0m 
 98%|█████████▊| 122/125 [00:05<00:00, 13.53

Result for _objective_c4104_00000:
  date: 2022-10-14_09-16-22
  done: false
  epoch: 1.61
  eval_f1: 0.8470588235294118
  eval_loss: 0.5500108599662781
  eval_runtime: 5.5226
  eval_samples_per_second: 181.074
  eval_steps_per_second: 22.634
  experiment_id: 4b9631f24c0149ed85033ea5d5f57577
  hostname: 3481a8a2ae33
  iterations_since_restore: 1
  node_ip: 172.17.0.3
  objective: 0.8470588235294118
  pid: 2901398
  time_since_restore: 41.78634071350098
  time_this_iter_s: 41.78634071350098
  time_total_s: 41.78634071350098
  timestamp: 1665738982
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: c4104_00000
  warmup_time: 0.0027446746826171875
  
[2m[36m(_objective pid=2901398)[0m {'eval_loss': 0.5500108599662781, 'eval_f1': 0.8470588235294118, 'eval_runtime': 5.5226, 'eval_samples_per_second': 181.074, 'eval_steps_per_second': 22.634, 'epoch': 1.61}


 81%|████████  | 50/62 [00:38<00:09,  1.31it/s]
[2m[36m(pid=2901654)[0m 2022-10-14 09:16:24.114137: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Current time: 2022-10-14 09:16:27 (running for 00:00:50.36)
Memory usage on this node: 15.3/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (1 PAUSED, 3 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |   eval_f1 |   eval_loss |   epoch |   training_iteration |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_c4104_00001 | RUNNING  | 172.17

[2m[36m(_objective pid=2901654)[0m Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
[2m[36m(_objective pid=2901654)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(_objective pid=2901654)[0m - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassi

== Status ==
Current time: 2022-10-14 09:16:32 (running for 00:00:55.37)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (1 PAUSED, 3 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |   eval_f1 |   eval_loss |   epoch |   training_iteration |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_c4104_00001 | RUNNING  | 172.17

  5%|▍         | 7/155 [00:04<01:35,  1.56it/s]
  5%|▌         | 8/155 [00:05<01:34,  1.56it/s]
  6%|▌         | 9/155 [00:05<01:33,  1.56it/s]
  6%|▋         | 10/155 [00:06<01:33,  1.56it/s]
  7%|▋         | 11/155 [00:07<01:32,  1.56it/s]
  8%|▊         | 12/155 [00:07<01:31,  1.56it/s]
  8%|▊         | 13/155 [00:08<01:31,  1.56it/s]


== Status ==
Current time: 2022-10-14 09:16:37 (running for 00:01:00.37)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (1 PAUSED, 3 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |   eval_f1 |   eval_loss |   epoch |   training_iteration |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_c4104_00001 | RUNNING  | 172.17

  9%|▉         | 14/155 [00:09<01:30,  1.56it/s]
 10%|▉         | 15/155 [00:09<01:29,  1.56it/s]
 10%|█         | 16/155 [00:10<01:29,  1.55it/s]
 11%|█         | 17/155 [00:10<01:28,  1.56it/s]
 12%|█▏        | 18/155 [00:11<01:28,  1.56it/s]
 12%|█▏        | 19/155 [00:12<01:27,  1.56it/s]
 13%|█▎        | 20/155 [00:12<01:26,  1.56it/s]
 14%|█▎        | 21/155 [00:13<01:26,  1.56it/s]


== Status ==
Current time: 2022-10-14 09:16:42 (running for 00:01:05.37)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (1 PAUSED, 3 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |   eval_f1 |   eval_loss |   epoch |   training_iteration |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_c4104_00001 | RUNNING  | 172.17

 14%|█▍        | 22/155 [00:14<01:25,  1.56it/s]
 15%|█▍        | 23/155 [00:14<01:24,  1.56it/s]
 15%|█▌        | 24/155 [00:15<01:24,  1.56it/s]
 16%|█▌        | 25/155 [00:16<01:23,  1.56it/s]
 17%|█▋        | 26/155 [00:16<01:22,  1.56it/s]
 17%|█▋        | 27/155 [00:17<01:22,  1.56it/s]
 18%|█▊        | 28/155 [00:17<01:21,  1.56it/s]
 19%|█▊        | 29/155 [00:18<01:20,  1.56it/s]


== Status ==
Current time: 2022-10-14 09:16:47 (running for 00:01:10.37)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (1 PAUSED, 3 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |   eval_f1 |   eval_loss |   epoch |   training_iteration |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_c4104_00001 | RUNNING  | 172.17

 19%|█▉        | 30/155 [00:19<01:20,  1.56it/s]
 20%|██        | 31/155 [00:19<01:19,  1.56it/s]
 21%|██        | 32/155 [00:20<01:24,  1.45it/s]
 21%|██▏       | 33/155 [00:21<01:22,  1.48it/s]
 22%|██▏       | 34/155 [00:22<01:20,  1.50it/s]
 23%|██▎       | 35/155 [00:22<01:19,  1.52it/s]
 23%|██▎       | 36/155 [00:23<01:29,  1.33it/s]


== Status ==
Current time: 2022-10-14 09:16:52 (running for 00:01:15.47)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (1 PAUSED, 3 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |   eval_f1 |   eval_loss |   epoch |   training_iteration |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_c4104_00001 | RUNNING  | 172.17

 24%|██▍       | 37/155 [00:24<01:24,  1.39it/s]
 25%|██▍       | 38/155 [00:24<01:21,  1.44it/s]
 25%|██▌       | 39/155 [00:25<01:18,  1.47it/s]
 26%|██▌       | 40/155 [00:26<01:16,  1.50it/s]
 26%|██▋       | 41/155 [00:26<01:15,  1.51it/s]
 27%|██▋       | 42/155 [00:27<01:14,  1.53it/s]
 28%|██▊       | 43/155 [00:28<01:12,  1.54it/s]
 28%|██▊       | 44/155 [00:28<01:11,  1.54it/s]


== Status ==
Current time: 2022-10-14 09:16:57 (running for 00:01:20.47)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (1 PAUSED, 3 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |   eval_f1 |   eval_loss |   epoch |   training_iteration |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_c4104_00001 | RUNNING  | 172.17

 29%|██▉       | 45/155 [00:29<01:11,  1.55it/s]
 30%|██▉       | 46/155 [00:30<01:10,  1.55it/s]
 30%|███       | 47/155 [00:30<01:09,  1.55it/s]
 31%|███       | 48/155 [00:31<01:09,  1.55it/s]
 32%|███▏      | 49/155 [00:31<01:08,  1.55it/s]
 32%|███▏      | 50/155 [00:32<01:07,  1.55it/s]
[2m[36m(_objective pid=2901654)[0m 
  0%|          | 0/125 [00:00<?, ?it/s][A
[2m[36m(_objective pid=2901654)[0m 
  3%|▎         | 4/125 [00:00<00:03, 32.16it/s][A
[2m[36m(_objective pid=2901654)[0m 
  6%|▋         | 8/125 [00:00<00:04, 26.84it/s][A
[2m[36m(_objective pid=2901654)[0m 
  9%|▉         | 11/125 [00:00<00:04, 25.72it/s][A
[2m[36m(_objective pid=2901654)[0m 
 11%|█         | 14/125 [00:00<00:04, 25.10it/s][A
[2m[36m(_objective pid=2901654)[0m 
 14%|█▎        | 17/125 [00:00<00:04, 24.75it/s][A
[2m[36m(_objective pid=2901654)[0m 
 16%|█▌        | 20/125 [00:00<00:04, 24.52it/s][A
[2m[36m(_objective pid=2901654)[0m 
 18%|█▊        | 23/125 [00:00<00:04, 24.

== Status ==
Current time: 2022-10-14 09:17:02 (running for 00:01:25.47)
Memory usage on this node: 15.7/31.1 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/20 CPUs, 1.0/1 GPUs, 0.0/15.25 GiB heap, 0.0/7.62 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /workspace/syc/BERT_classification_binary/test-results/tune_transformer_pbt
Number of trials: 5/5 (1 PAUSED, 3 PENDING, 1 RUNNING)
+------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                |   w_decay |          lr | train_bs/gpu   |   num_epochs |   eval_f1 |   eval_loss |   epoch |   training_iteration |
|------------------------+----------+--------------------+-----------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_c4104_00001 | RUNNING  | 172.17

[2m[36m(_objective pid=2901654)[0m 
 30%|███       | 38/125 [00:01<00:03, 24.16it/s][A
[2m[36m(_objective pid=2901654)[0m 
 33%|███▎      | 41/125 [00:01<00:03, 24.12it/s][A
[2m[36m(_objective pid=2901654)[0m 
 35%|███▌      | 44/125 [00:01<00:03, 24.12it/s][A
[2m[36m(_objective pid=2901654)[0m 
 38%|███▊      | 47/125 [00:01<00:03, 24.10it/s][A
[2m[36m(_objective pid=2901654)[0m 
 40%|████      | 50/125 [00:02<00:03, 24.11it/s][A
[2m[36m(_objective pid=2901654)[0m 
 42%|████▏     | 53/125 [00:02<00:02, 24.11it/s][A
[2m[36m(_objective pid=2901654)[0m 
 45%|████▍     | 56/125 [00:02<00:02, 24.10it/s][A
[2m[36m(_objective pid=2901654)[0m 
 47%|████▋     | 59/125 [00:02<00:02, 24.06it/s][A
[2m[36m(_objective pid=2901654)[0m 
 50%|████▉     | 62/125 [00:02<00:02, 24.06it/s][A
[2m[36m(_objective pid=2901654)[0m 
 52%|█████▏    | 65/125 [00:02<00:02, 24.07it/s][A
[2m[36m(_objective pid=2901654)[0m 
 54%|█████▍    | 68/125 [00:02<00:02, 24.07it/s][A

Result for _objective_c4104_00001:
  date: 2022-10-14_09-17-06
  done: false
  epoch: 1.61
  eval_f1: 0.7258771929824562
  eval_loss: 0.5450262427330017
  eval_runtime: 5.2179
  eval_samples_per_second: 191.649
  eval_steps_per_second: 23.956
  experiment_id: a6f8c91886154a9681070e0400ff5285
  hostname: 3481a8a2ae33
  iterations_since_restore: 1
  node_ip: 172.17.0.3
  objective: 0.7258771929824562
  pid: 2901654
  time_since_restore: 41.554603576660156
  time_this_iter_s: 41.554603576660156
  time_total_s: 41.554603576660156
  timestamp: 1665739026
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: c4104_00001
  warmup_time: 0.0019211769104003906
  
[2m[36m(_objective pid=2901654)[0m {'eval_loss': 0.5450262427330017, 'eval_f1': 0.7258771929824562, 'eval_runtime': 5.2179, 'eval_samples_per_second': 191.649, 'eval_steps_per_second': 23.956, 'epoch': 1.61}


[2m[36m(_objective pid=2901654)[0m 
[2m[36m(_objective pid=2901654)[0m 100%|██████████| 125/125 [00:05<00:00, 23.42it/s][A                                                
[2m[36m(_objective pid=2901654)[0m                                                  [A 32%|███▏      | 50/155 [00:37<01:07,  1.55it/s]
[2m[36m(_objective pid=2901654)[0m 100%|██████████| 125/125 [00:05<00:00, 23.42it/s][Aearly stopping required metric_for_best_model, but did not find eval_accuracy so early stopping is disabled
[2m[36m(_objective pid=2901654)[0m 
[2m[36m(_objective pid=2901654)[0m                                                  [A
[2m[36m(_objective pid=2901654)[0m  32%|███▏      | 50/155 [00:37<01:19,  1.32it/s]


In [None]:
result

In [None]:
for n, v in result.hyperparameters.items():
    setattr(trainer.args, n, v)

In [None]:
# trainer.args

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
trainer.predict(test_dataset=test_dataset)

[2m[36m(_objective pid=2900563)[0m 
[2m[36m(_objective pid=2900563)[0m 100%|██████████| 125/125 [00:05<00:00, 24.21it/s][A                                                
[2m[36m(_objective pid=2900563)[0m                                                  [A 32%|███▏      | 50/155 [00:37<01:07,  1.56it/s]
[2m[36m(_objective pid=2900563)[0m 100%|██████████| 125/125 [00:05<00:00, 24.21it/s][Aearly stopping required metric_for_best_model, but did not find eval_accuracy so early stopping is disabled
[2m[36m(_objective pid=2900563)[0m 
[2m[36m(_objective pid=2900563)[0m                                                  [A


[2m[36m(_objective pid=2900563)[0m {'eval_loss': 0.6048755049705505, 'eval_f1': 0.6558891454965358, 'eval_runtime': 5.1791, 'eval_samples_per_second': 193.083, 'eval_steps_per_second': 24.135, 'epoch': 1.61}


In [None]:
# model_path = "test-model"
# trainer.model.save_pretrained(model_path)
# tokenizer.save_pretrained(model_path)

# Reference

https://bo-10000.tistory.com/154  
https://huggingface.co/blog/ray-tune  
https://docs.ray.io/en/latest/tune/examples/pbt_transformers.html  
https://wood-b.github.io/post/a-novices-guide-to-hyperparameter-optimization-at-scale/#schedulers-vs-search-algorithms  
https://docs.ray.io/en/latest/tune/api_docs/search_space.html  
https://docs.ray.io/en/latest/tune/tutorials/tune-advanced-tutorial.html  
https://docs.ray.io/en/latest/tune/api_docs/schedulers.html  
https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/  
https://docs.ray.io/en/latest/tune/faq.html  
https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#population-based-training-tune-schedulers-populationbasedtraining  
https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.hyperparameter_search  
https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#optuna-tune-search-optuna-optunasearch  