In [1]:
!nvidia-smi

Mon Apr 26 14:15:51 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# from urllib.request import urlopen
# exec(urlopen("http://colab-monitor.smankusors.com/track.py").read())
# _colabMonitor = ColabMonitor().start()

## Fine-tuning a model on a text classification task
This notebook shows how to fine-tune a 🤗 Transformers model on a text classification task from the [GLUE Benchmark](https://gluebenchmark.com/).
<br><br>
The **GLUE Benchmark** is a group of nine classification tasks on sentences or pairs of sentences:

- **CoLA** (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not. Each sentence in the dataset is labeled as acceptable or unacceptable.
- **MNLI** (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions: one where the validation and test sets come from the same distribution as the training set, and another, called mismatched, where the validation and test sets use out-of-domain data.)
- **MRPC** (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- **QNLI** (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- **QQP** (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- **RTE** (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- **SST-2** (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- **STS-B** (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- **WNLI** (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)
<br><br>

Among them, this notebook uses the **CoLA** dataset.

In [3]:
!pip install transformers
!pip install datasets
!pip install optuna

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
     |████████████████████████████████| 2.1MB 5.7MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
     |████████████████████████████████| 3.3MB 21.4MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
     |████████████████████████████████| 901kB 39.0MB/s 
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1
Collecting datasets
  Downloading https://files.pythonhosted

In [4]:
import datasets

import random
import pandas as pd
import numpy as np

from IPython.display import display, HTML

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# datasets : 1.5.0  |  pd : 1.1.5  |  np : 1.19.5  |  transformers : 4.5.1
print(f'datasets : {datasets.__version__}  |  pd : {pd.__version__}  |  np : {np.__version__}  |  transformers : {transformers.__version__}  |  torch : {torch.__version__}')
print('device :', device)

datasets : 1.6.1  |  pd : 1.1.5  |  np : 1.19.5  |  transformers : 4.5.1  |  torch : 1.8.1+cu101
device : cuda


We will see how to easily load the dataset for each one of those tasks and use the `Trainer` API to fine-tune a model on it. Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):

In [5]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
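
The naming convention described above (MNLI has one dataset configuration but two evaluation variants) can be captured in a small helper. This is only a sketch: `resolve_task` is a hypothetical name that is not used elsewhere in this notebook, and the split names `validation_matched` and `validation_mismatched` are the ones the training cell in section 3 refers to.

```python
def resolve_task(task):
    """Return (config_name, validation_split) for a GLUE task acronym."""
    # "mnli-mm" is not a separate dataset: it reuses the "mnli" configuration
    config_name = "mnli" if task == "mnli-mm" else task

    # only MNLI has matched / mismatched validation splits
    if task == "mnli-mm":
        validation_split = "validation_mismatched"
    elif task == "mnli":
        validation_split = "validation_matched"
    else:
        validation_split = "validation"
    return config_name, validation_split


print(resolve_task("cola"))
print(resolve_task("mnli-mm"))
```

Keeping both decisions in one place makes it harder for the dataset name and the evaluation split to drift apart when `task` is changed.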

## 1. Loading the dataset & metric
We will use the 🤗 Datasets library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.

In [6]:
actual_task = "mnli" if task == "mnli-mm" else task
dataset = datasets.load_dataset("glue", actual_task)
metric = datasets.load_metric('glue', actual_task)

print('\n>>> actual_task :', actual_task)
print('\n>>> type of metric :', type(metric))
print('\n>>> dataset object :', dataset)
print('\n>>> sample data :', dataset['train'][0])



Downloading and preparing dataset glue/cola (download: 368.14 KiB, generated: 596.73 KiB, post-processed: Unknown size, total: 964.86 KiB) to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...



Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.





>>> actual_task : cola

>>> type of metric : <class 'datasets_modules.metrics.glue.e4606ab9804a36bcd5a9cebb2cb65bb14b6ac78ee9e6d5981fa679a495dd55de.glue.Glue'>

>>> dataset object : DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

>>> sample data : {'idx': 0, 'label': 1, 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}


In [7]:
# show random sample of a dataset
def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = random.sample(range(len(dataset)), k=num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(dataset["train"])

Unnamed: 0,idx,label,sentence
0,307,unacceptable,"They failed to tell me which problem I'll beat the competition more easily, the sooner I solve."
1,3985,unacceptable,Is putting the book in the box.
2,6024,acceptable,The cat was leaving.
3,5906,acceptable,It is likely that Bill likes chocolate.
4,624,acceptable,The train reached the station fully.


In [8]:
# compute score with metric
fake_preds = np.random.randint(0, 2, size=(16,))
fake_labels = np.random.randint(0, 2, size=(16,))

fake_preds, fake_labels, metric.compute(predictions=fake_preds, references=fake_labels)

(array([0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]),
 array([0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0]),
 {'matthews_correlation': -0.2698412698412698})

The metric object computes only the metrics associated with the selected task:

- for **CoLA** : Matthews Correlation Coefficient
- for **MNLI** (matched or mismatched) : Accuracy
- for **MRPC** : Accuracy and F1 score
- for **QNLI** : Accuracy
- for **QQP** : Accuracy and F1 score
- for **RTE** : Accuracy
- for **SST-2** : Accuracy
- for **STS-B** : Pearson Correlation Coefficient and Spearman's Rank Correlation Coefficient
- for **WNLI** : Accuracy
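
For CoLA, the Matthews correlation coefficient can be reproduced by hand from the confusion matrix, which helps when reading the scores later in this notebook. The sketch below is a plain-Python re-implementation for the binary case, not the code the metric object runs. `matthews_corrcoef_binary` is a hypothetical name, and returning 0.0 when the denominator vanishes is an assumption, chosen to be consistent with the 0.0 scores that appear in the search logs when a model predicts a single class.

```python
import math


def matthews_corrcoef_binary(predictions, references):
    """Matthews correlation coefficient for 0/1 predictions and labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # true negatives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # assumption: treat the undefined case (e.g. only one class predicted) as 0.0
        return 0.0
    return (tp * tn - fp * fn) / denom


# the random arrays printed in the cell above, copied as plain lists
fake_preds = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
fake_labels = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0]

# here TP=4, TN=2, FP=5, FN=5, so the score is (8 - 25) / 63 = -17/63
print(matthews_corrcoef_binary(fake_preds, fake_labels))
```

The result, about -0.2698, agrees with the `matthews_correlation` value reported for the same arrays in the cell above. A score of 0 corresponds to chance-level prediction, and 1 to perfect agreement.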

## 2. Preprocessing the data


In [9]:
%%time
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)



CPU times: user 291 ms, sys: 39.8 ms, total: 331 ms
Wall time: 789 ms


In [10]:
# tokenize sample sentences
batch_encoded = tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

print(type(batch_encoded))
for k, v in batch_encoded.items():
  print(k, '-', v)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
input_ids - [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]
attention_mask - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [11]:
# generate dict for saving names of columns containing the sentences
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

# check if the dict works correctly
sent1_key, sent2_key = task_to_keys[task]
if sent2_key is None:
  print('Sentence :', dataset['train'][0][sent1_key])
else:
  print('Sentence 1 :', dataset['train'][0][sent1_key])
  print('Sentence 2 :', dataset['train'][0][sent2_key])

Sentence : Our friends won't buy this analysis, let alone the next one we propose.


In [12]:
# function to preprocess text - tokenize & truncate
def preprocess(examples):
  if sent2_key is None:
    return tokenizer(examples[sent1_key], truncation=True)
  else:
    return tokenizer(examples[sent1_key], examples[sent2_key], truncation=True)


# check if the preprocessor works properly
for k, v in preprocess(dataset['train'][:5]).items():
  print(f"'{k}' :")
  for lst in v:
    print('\t', lst)

'input_ids' :
	 [101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102]
	 [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102]
	 [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102]
	 [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102]
	 [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]
'attention_mask' :
	 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
	 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
	 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
	 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
	 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
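
The encoded sentences above have different lengths because `preprocess` truncates but does not pad. Padding can be deferred until a batch is assembled, so each batch is only as long as its longest sentence. The sketch below illustrates that idea in plain Python; it is not the library's collator. `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption based on `padding_idx=0` in the model summary printed in section 3.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# two of the encoded training sentences shown above (14 and 13 tokens)
batch = pad_batch([
    [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102],
    [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102],
])
print([len(ids) for ids in batch["input_ids"]])
print(batch["attention_mask"][1])
```

Padding per batch rather than to a global maximum length avoids spending compute on padding tokens when most sentences are short, as they are in CoLA.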


In [13]:
%%time
# apply preprocess function to all sentences in dataset
# set batched as True to encode the texts by batches together (about 5x faster)
dataset_encoded = dataset.map(preprocess, batched=True)

for k, v in dataset_encoded['train'][0].items():
  print(f'>>> {k} - {v}')



>>> attention_mask - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> idx - 0
>>> input_ids - [101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102]
>>> label - 1
>>> sentence - Our friends won't buy this analysis, let alone the next one we propose.
CPU times: user 877 ms, sys: 30.8 ms, total: 908 ms
Wall time: 456 ms


## 3. Fine-tuning the model

In [14]:
num_labels = 3 if task.startswith('mnli') else 1 if task=='stsb' else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels).to(device)

print()
model





Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi




DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [15]:
%%time
# define attributes to customize the training
metric_name = 'pearson' if task=='stsb' else 'matthews_correlation' if task=='cola' else 'accuracy'
args = TrainingArguments(
    output_dir='test-glue', evaluation_strategy='epoch', learning_rate=2e-5, 
    per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size,
    num_train_epochs=5, weight_decay=0.01, load_best_model_at_end=True, 
    metric_for_best_model=metric_name,
)

# define function to compute the metrics from the predictions
def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  if task=='stsb':
    predictions = predictions[:, 0]
  else:
    predictions = np.argmax(predictions, axis=1)
  return metric.compute(predictions=predictions, references=labels)


# generate trainer & train model
validation_key = 'validation_mismatched' if task=='mnli-mm' else 'validation_matched' if task=='mnli' else 'validation'
trainer = Trainer(model=model, args=args, 
                  train_dataset=dataset_encoded['train'], eval_dataset=dataset_encoded[validation_key],
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,0.5155,0.468441,0.462947,1.4009,744.497
2,0.3425,0.533065,0.490594,1.4619,713.476
3,0.2291,0.552669,0.52944,1.2789,815.515
4,0.1747,0.745263,0.529883,1.3968,746.719
5,0.127,0.85731,0.532073,1.3709,760.819


CPU times: user 5min 42s, sys: 2min 45s, total: 8min 27s
Wall time: 5min 3s
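
`compute_metrics` receives raw model outputs rather than class ids, so it has to convert them before calling the metric. The plain-Python sketch below mirrors that conversion on toy numbers: the index of the largest logit for classification, and the single output column for the STS-B regression task. `logits_to_predictions` is a hypothetical helper, and the logits are made up for illustration.

```python
def logits_to_predictions(logits, regression=False):
    """Turn per-example model outputs into values the metric can score."""
    if regression:
        # regression heads emit one value per example: keep column 0
        return [row[0] for row in logits]
    # classification: index of the largest logit (first index wins on ties)
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


toy_logits = [[0.1, 2.0], [1.5, -0.3], [-0.2, 0.4]]
print(logits_to_predictions(toy_logits))
print(logits_to_predictions([[3.7], [0.2]], regression=True))
```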


In [16]:
%%time
# evaluate model
trainer.evaluate()

CPU times: user 1.85 s, sys: 611 ms, total: 2.46 s
Wall time: 1.67 s


{'epoch': 5.0,
 'eval_loss': 0.85731041431427,
 'eval_matthews_correlation': 0.532072854687201,
 'eval_mem_cpu_alloc_delta': 163840,
 'eval_mem_cpu_peaked_delta': 98304,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 20080128,
 'eval_runtime': 1.4176,
 'eval_samples_per_second': 735.771}

## 4. Hyperparameter search

In [17]:
%%time
trainer = Trainer(
    model_init=lambda : AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels),
    args=args,
    train_dataset=dataset_encoded['train'].shard(num_shards=10, index=1),  # train with sample train dataset
    eval_dataset=dataset_encoded[validation_key],
    tokenizer=tokenizer, compute_metrics=compute_metrics
)

print('\n\n')
best_run = trainer.hyperparameter_search(n_trials=10, direction='maximize')

[I 2021-04-26 14:21:48,767] A new study created in memory with name: no-name-e4629b92-09dd-4415-af28-8a2f4ee40fed

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.616683,0.0,1.3344,781.623
2,No log,0.614283,0.231277,1.4327,728.015



invalid value encountered in double_scalars

[I 2021-04-26 14:22:44,566] Trial 0 finished with value: 0.23127725114885042 and parameters: {'learning_rate': 1.3559868845740137e-05, 'num_train_epochs': 2, 'seed': 35, 'per_device_train_batch_size': 4}. Best is trial 0 with value: 0.23127725114885042.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.592384,0.0,1.3093,796.599
2,No log,0.591522,0.299843,1.3448,775.58
3,No log,0.766018,0.298289,1.3742,758.974
4,No log,1.002066,0.308811,1.5554,670.55
5,No log,1.072381,0.315287,1.3931,748.683



invalid value encountered in double_scalars

[I 2021-04-26 14:23:36,893] Trial 1 finished with value: 0.3152873142455979 and parameters: {'learning_rate': 3.383507488759775e-05, 'num_train_epochs': 5, 'seed': 4, 'per_device_train_batch_size': 16}. Best is trial 1 with value: 0.3152873142455979.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.609606,0.0,1.3903,750.205
2,No log,0.60521,0.0,1.3646,764.347



invalid value encountered in double_scalars


invalid value encountered in double_scalars

[I 2021-04-26 14:23:56,007] Trial 2 finished with value: 0.0 and parameters: {'learning_rate': 2.026866005883341e-05, 'num_train_epochs': 2, 'seed': 26, 'per_device_train_batch_size': 32}. Best is trial 1 with value: 0.3152873142455979.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.594957,0.0,1.4021,743.874



invalid value encountered in double_scalars

[I 2021-04-26 14:24:10,959] Trial 3 finished with value: 0.0 and parameters: {'learning_rate': 4.038514298184836e-05, 'num_train_epochs': 1, 'seed': 30, 'per_device_train_batch_size': 16}. Best is trial 1 with value: 0.3152873142455979.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.617314,0.0,1.4041,742.832
2,No log,0.612706,0.0,1.4237,732.611
3,No log,0.61141,0.0,1.3892,750.782



invalid value encountered in double_scalars


invalid value encountered in double_scalars


invalid value encountered in double_scalars

[I 2021-04-26 14:25:00,598] Trial 4 finished with value: 0.0 and parameters: {'learning_rate': 3.170055509561951e-06, 'num_train_epochs': 3, 'seed': 37, 'per_device_train_batch_size': 8}. Best is trial 1 with value: 0.3152873142455979.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.615456,0.0,1.3736,759.296
2,No log,0.599778,0.0,1.3324,782.769



invalid value encountered in double_scalars


invalid value encountered in double_scalars

[I 2021-04-26 14:25:15,206] Trial 5 pruned.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.665971,0.0,2.9613,352.206
2,No log,0.657186,0.0,2.7535,378.793



invalid value encountered in double_scalars

[I 2021-04-26 14:25:41,523] Trial 6 pruned.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.614889,0.0,3.9618,263.264
2,No log,0.612168,0.0,3.9223,265.918



invalid value encountered in double_scalars

[I 2021-04-26 14:26:06,772] Trial 7 pruned.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.604207,-0.020703,4.2067,247.939


[I 2021-04-26 14:26:24,994] Trial 8 pruned.

Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.571527,0.0,5.0966,204.644
2,No log,0.594281,0.30034,4.885,213.51
3,No log,0.688633,0.321753,4.8487,215.11
4,No log,0.786366,0.334038,4.8479,215.146
5,No log,0.884434,0.326928,4.8596,214.629


[I 2021-04-26 14:27:45,869] Trial 9 finished with value: 0.32692809378284005 and parameters: {'learning_rate': 7.886872124558365e-05, 'num_train_epochs': 5, 'seed': 9, 'per_device_train_batch_size': 64}. Best is trial 9 with value: 0.32692809378284005.


CPU times: user 6min 32s, sys: 4min 10s, total: 10min 43s
Wall time: 6min 1s
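
The trial logs above show four sampled hyperparameters: `learning_rate`, `num_train_epochs`, `seed`, and `per_device_train_batch_size`. A custom search space over the same four can be written as a function of an Optuna trial and passed through the `hp_space` argument of `hyperparameter_search`. The sketch below is one way to do that. `my_hp_space` is a hypothetical name, and the ranges are assumptions picked to cover the values seen in the logs, not a statement of the library's defaults.

```python
def my_hp_space(trial):
    """Search space over the four hyperparameters seen in the trial logs."""
    return {
        # sampled on a log scale, since useful learning rates span orders of magnitude
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
        "seed": trial.suggest_int("seed", 1, 40),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [4, 8, 16, 32, 64]
        ),
    }


# Usage with the trainer defined in this section (not executed here):
# best_run = trainer.hyperparameter_search(
#     hp_space=my_hp_space, n_trials=10, direction="maximize"
# )
```

Narrowing the ranges, for example dropping the smallest learning rates that scored 0.0 in the logs, is a reasonable next step once a first search has shown where the useful region lies.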


In [18]:
%%time
# generate final trainer
trainer_final = Trainer(
    model_init=lambda : AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels),
    args=args,
    train_dataset=dataset_encoded['train'],  # train with full train dataset
    eval_dataset=dataset_encoded[validation_key],
    tokenizer=tokenizer, compute_metrics=compute_metrics
)

# train model with best hyper-parameters
print('< best_run >')
for n, v in best_run.hyperparameters.items():
  print(f'>>> {n} : {v}')
  setattr(trainer_final.args, n, v)

print('\n')
trainer_final.train()


< best_run >
>>> learning_rate : 7.886872124558365e-05
>>> num_train_epochs : 5
>>> seed : 9
>>> per_device_train_batch_size : 64





Epoch,Training Loss,Validation Loss,Matthews Correlation,Runtime,Samples Per Second
1,No log,0.528273,0.391621,1.3805,755.523
2,No log,0.463834,0.490077,1.4012,744.355
3,No log,0.641262,0.47175,1.3961,747.065
4,0.298200,0.714019,0.520706,1.3241,787.697
5,0.298200,0.879305,0.494445,1.4241,732.384


CPU times: user 2min 38s, sys: 1min 16s, total: 3min 55s
Wall time: 2min 17s


In [19]:
%%time
# evaluate model
trainer_final.evaluate()

CPU times: user 2 s, sys: 593 ms, total: 2.59 s
Wall time: 1.79 s


{'epoch': 5.0,
 'eval_loss': 0.7140189409255981,
 'eval_matthews_correlation': 0.5207058375145255,
 'eval_mem_cpu_alloc_delta': 208896,
 'eval_mem_cpu_peaked_delta': 0,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 19949056,
 'eval_runtime': 1.4471,
 'eval_samples_per_second': 720.766}