In [1]:
!pip install transformers
!pip install datasets
!pip install optuna


In [2]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-9bcc3c68-54fc-710d-9191-b4b085a86a4f)


## Fine-tuning a model on a text classification task
This notebook shows how to fine-tune one of the 🤗 Transformers models on a text classification task from the [GLUE Benchmark](https://gluebenchmark.com/).
<br><br>
The **GLUE Benchmark** is a group of nine classification tasks on sentences or pairs of sentences which are:

- **CoLA** (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not. The dataset contains sentences labeled as grammatically acceptable or unacceptable.
- **MNLI** (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions: one where the validation and test sets come from the same distribution as the training set, and another called mismatched where the validation and test sets use out-of-domain data.)
- **MRPC** (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- **QNLI** (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- **QQP** (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- **RTE** (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- **SST-2** (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- **STS-B** (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- **WNLI** (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)
<br><br>

Among them, this notebook uses the **RTE** dataset.

In [3]:
import datasets

import pandas as pd
import numpy as np
import random

from IPython.display import display, HTML

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# print library versions for reproducibility
print(f'datasets : {datasets.__version__}  |  pd : {pd.__version__}  |  np : {np.__version__}  |  transformers : {transformers.__version__}  |  torch : {torch.__version__}')
print('device :', device)

datasets : 1.6.1  |  pd : 1.1.5  |  np : 1.19.5  |  transformers : 4.5.1  |  torch : 1.8.1+cu101
device : cuda


In [4]:
# record the start time to measure the total runtime of the notebook
import time
s_time = time.time()

We will see how to easily load the dataset for each one of those tasks and use the `Trainer` API to fine-tune a model on it. Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):

In [5]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
task = "rte"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
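
The naming rule above can be sketched as a small helper. `resolve_task` is a hypothetical name introduced only for illustration; the split names `validation_matched` and `validation_mismatched` are the ones used later in the training cell of this notebook.

```python
def resolve_task(task):
    """Map a task name to (dataset config name, validation split name).

    Hypothetical helper: "mnli-mm" shares the "mnli" training data and
    differs only in which validation split is used.
    """
    config_name = "mnli" if task == "mnli-mm" else task
    if task == "mnli-mm":
        validation_split = "validation_mismatched"
    elif task == "mnli":
        validation_split = "validation_matched"
    else:
        validation_split = "validation"
    return config_name, validation_split


print(resolve_task("rte"))      # ('rte', 'validation')
print(resolve_task("mnli-mm"))  # ('mnli', 'validation_mismatched')
```

Keeping both decisions in one place makes it harder for the dataset name and the validation split to drift apart when the task is changed.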

## 1. Loading the dataset & metric
We will use the 🤗 Datasets library to download the data and to get the metric needed for evaluation (so that our model can be compared against the benchmark). Both are easily obtained with the functions `load_dataset` and `load_metric`.

In [6]:
actual_task = "mnli" if task == "mnli-mm" else task
dataset = datasets.load_dataset("glue", actual_task)
metric = datasets.load_metric('glue', actual_task)

print('\n>>> actual_task :', actual_task)
print('\n>>> type of metric :', type(metric))
print('\n>>> dataset object :', dataset)
print('\n>>> sample data :', dataset['train'][0])

Downloading and preparing dataset glue/rte (download: 680.81 KiB, generated: 1.83 MiB, post-processed: Unknown size, total: 2.49 MiB) to /root/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


>>> actual_task : rte

>>> type of metric : <class 'datasets_modules.metrics.glue.e4606ab9804a36bcd5a9cebb2cb65bb14b6ac78ee9e6d5981fa679a495dd55de.glue.Glue'>

>>> dataset object : DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2490
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 277
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3000
    })
})

>>> sample data : {'idx': 0, 'label': 1, 'sentence1': 'No Weapons of Mass Destruction Found in Iraq Yet.', 'sentence2': 'Weapons of Mass Destruction Found in Iraq.'}


In [7]:
# show random sample of a dataset
def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = random.sample(range(len(dataset)), k=num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(dataset["train"])

Unnamed: 0,idx,label,sentence1,sentence2
0,363,entailment,"President Alvaro Uribe was sworn into his second term of office in the Colombian capital of Bogota, Monday, pledging to improve the economy and make peace with FARC (Revolutionary Armed Forces of Colombia) rebels. Security was tight in order to prevent a repetition of the attack on his first inauguration in 2002 when FARC rebels fired mortars at the presidential palace killing 20. Police reported deactivating a car bomb outside of the capital on Monday.",Alvaro Uribe is the current President of Colombia.
1,383,entailment,"In-form Rooney's hot goalscoring streak of seven goals in his last four internationals saw him win the vote to be crowned England's Player of the Year for 2008. Manchester United striker Rooney, 23, is set for his 50th cap against Ukraine at Wembley tomorrow and is enjoying the best form of his England career after a heart-to-heart with Capello. England boss Capello ordered Rooney to work hard on his finishing, stop shooting from long range and start scoring tap-ins to help transform him into a prolific marksman.",Fabio Capello is the coach of the English team.
2,985,entailment,"Two men have been charged with starting the Zaca Fire, the second-largest wildfire in the history of California. The fire is continuing to burn through the Los Padres National Forest, and has consumed a total of 240,000 acres, having started on July 4. However, the fire is now under control, and is expected to be contained by September 4. Jose Jesus Cabrera of Santa Ynez, 38, Santiago Iniguez Cervantes of Santa Maria, 46, and the company of Rancho La Laguna, whom they worked for, have all been charged with six counts of felony in relation to starting the fire.",Cabrera and Cervantes are accused of having caused the Zaca Fire.
3,1080,not_entailment,"Tropical Storm Arthur is projected to weaken tonight, but it will likely regain strength after entering the Bay of Campeche on Sunday.The 2008 Atlantic hurricane season got off to an early start today when a tropical storm formed off the coast of Belize, one day before the season officially begins. Tropical Storm Arthur formed Saturday afternoon and quickly made landfall at the Yucatan Peninsula, near the border between Belize and Mexico. Both countries issued a tropical storm warning for the peninsula's eastern coastline. In the Mexican state of Quintana Roo, ports were closed to small boats, water sports were banned, and those living in coastal areas were encouraged to take precautions.",Tropical Storm Arthur is approaching the US coast.
4,401,not_entailment,"Of Irish descent, he was born in Brookline, Mass., on May 29, 1917. Graduating from Harvard in 1940, he entered the Navy. In 1943, when his PT boat was rammed and sunk by a Japanese destroyer, Kennedy, despite grave injuries, led the survivors through perilous waters to safety. Back from the war, he became a Democratic congressman from the Boston area, advancing in 1953 to the Senate. He married Jacqueline Bouvier on Sept. 12, 1953. In 1955, while recuperating from a back operation, he wrote ""Profiles in Courage,"" which won the Pulitzer Prize in history.",John Fitzgerald Kennedy was shot dead in 1963.


In [8]:
# compute score with metric
fake_preds = np.random.randint(0, 2, size=(16,))
fake_labels = np.random.randint(0, 2, size=(16,))

fake_preds, fake_labels, metric.compute(predictions=fake_preds, references=fake_labels)

(array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1]),
 array([0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]),
 {'accuracy': 0.6875})

The metric object computes only the metrics associated with the selected task, which are:

- for **CoLA** : Matthews Correlation Coefficient
- for **MNLI** (matched or mismatched) : Accuracy
- for **MRPC** : Accuracy and F1 score
- for **QNLI** : Accuracy
- for **QQP** : Accuracy and F1 score
- for **RTE** : Accuracy
- for **SST-2** : Accuracy
- for **STS-B** : Pearson Correlation Coefficient and Spearman's Rank Correlation Coefficient
- for **WNLI** : Accuracy
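
To make two of these metrics concrete, here is a minimal NumPy sketch of accuracy and of the Matthews correlation coefficient for binary labels. The function names are hypothetical and this is not the code behind `load_metric`; it only illustrates the definitions, using the fake predictions and labels printed above.

```python
import numpy as np


def accuracy(preds, labels):
    """Fraction of predictions that equal the reference labels."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float((preds == labels).mean())


def matthews_corrcoef_binary(preds, labels):
    """Matthews correlation coefficient for 0/1 labels (0.0 if undefined)."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return float((tp * tn - fp * fn) / denom) if denom else 0.0


# Values copied from the output of the cell above
fake_preds = [1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
fake_labels = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]

acc = accuracy(fake_preds, fake_labels)
mcc = matthews_corrcoef_binary(fake_preds, fake_labels)
print(acc)  # 0.6875, matching the metric output above
print(mcc)
```

Accuracy counts every example equally, while the Matthews coefficient also accounts for how errors are distributed across the two classes.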

## 2. Preprocessing the data


In [9]:
%%time
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)



CPU times: user 354 ms, sys: 50.6 ms, total: 405 ms
Wall time: 3.53 s


In [10]:
# tokenize sample sentences
batch_encoded = tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

print(type(batch_encoded))
for k, v in batch_encoded.items():
  print(k, '-', v)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
input_ids - [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]
attention_mask - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
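
The `input_ids` above follow the layout `[CLS] sentence 1 [SEP] sentence 2 [SEP]`, where `101` and `102` are the ids of `[CLS]` and `[SEP]` in this vocabulary. The sketch below rebuilds that layout from the word-piece ids in the output; `join_pair` is a hypothetical helper, not part of the tokenizer API.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids seen in the output above


def join_pair(ids_a, ids_b=None):
    """Assemble [CLS] a [SEP] (b [SEP]) and a matching attention mask."""
    ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    if ids_b is not None:
        ids += list(ids_b) + [SEP_ID]
    # No padding yet, so every position is a real token
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


# Word-piece ids of the two sample sentences, copied from the output above
first = [7592, 1010, 2023, 2028, 6251, 999]
second = [1998, 2023, 6251, 3632, 2007, 2009, 1012]

encoded = join_pair(first, second)
print(encoded["input_ids"])
print(len(encoded["attention_mask"]))  # 16
```

Note that the real output contains only `input_ids` and `attention_mask`: this checkpoint's tokenizer does not return segment ids for the sentence pair.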


In [11]:
# generate dict for saving names of columns containing the sentences
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

# check if the dict works correctly
sent1_key, sent2_key = task_to_keys[task]
if sent2_key is None:
  print('Sentence :', dataset['train'][0][sent1_key])
else:
  print('Sentence 1 :', dataset['train'][0][sent1_key])
  print('Sentence 2 :', dataset['train'][0][sent2_key])

Sentence 1 : No Weapons of Mass Destruction Found in Iraq Yet.
Sentence 2 : Weapons of Mass Destruction Found in Iraq.


In [12]:
# function to preprocess text - tokenize & truncate
def preprocess(examples):
  if sent2_key is None:
    return tokenizer(examples[sent1_key], truncation=True)
  else:
    return tokenizer(examples[sent1_key], examples[sent2_key], truncation=True)


# check if the preprocessor works properly
for k, v in preprocess(dataset['train'][:5]).items():
  print(f"'{k}' :")
  for lst in v:
    print('\t', lst)

'input_ids' :
	 [101, 2053, 4255, 1997, 3742, 6215, 2179, 1999, 5712, 2664, 1012, 102, 4255, 1997, 3742, 6215, 2179, 1999, 5712, 1012, 102]
	 [101, 1037, 2173, 1997, 14038, 1010, 2044, 4831, 2198, 2703, 2462, 2351, 1010, 2150, 1037, 2173, 1997, 7401, 1010, 2004, 3142, 3234, 11633, 5935, 1999, 5116, 3190, 2000, 2928, 1996, 8272, 1997, 2047, 4831, 12122, 16855, 1012, 102, 4831, 12122, 16855, 2003, 1996, 2047, 3003, 1997, 1996, 3142, 3234, 2277, 1012, 102]
	 [101, 2014, 3401, 13876, 2378, 2001, 2525, 4844, 2000, 7438, 1996, 5305, 4355, 7388, 4456, 5022, 1010, 1998, 1996, 2194, 2056, 1010, 6928, 1010, 2009, 2097, 6848, 2007, 2976, 25644, 1996, 6061, 1997, 3653, 11020, 3089, 10472, 1996, 4319, 2005, 2062, 7388, 4456, 5022, 1012, 102, 2014, 3401, 13876, 2378, 2064, 2022, 2109, 2000, 7438, 7388, 4456, 1012, 102]
	 [101, 18414, 10265, 13801, 1010, 2708, 3237, 2012, 20877, 2098, 5555, 1010, 1037, 2966, 2326, 2194, 2008, 7126, 15770, 1996, 1016, 1011, 2095, 1011, 2214, 5148, 2540, 2820, 1999, 75

In [13]:
%%time
# apply preprocess function to all sentences in dataset
# set batched as True to encode the texts by batches together (about 5x faster)
dataset_encoded = dataset.map(preprocess, batched=True)

for k, v in dataset_encoded['train'][0].items():
  print(f'>>> {k} - {v}')



>>> attention_mask - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> idx - 0
>>> input_ids - [101, 2053, 4255, 1997, 3742, 6215, 2179, 1999, 5712, 2664, 1012, 102, 4255, 1997, 3742, 6215, 2179, 1999, 5712, 1012, 102]
>>> label - 1
>>> sentence1 - No Weapons of Mass Destruction Found in Iraq Yet.
>>> sentence2 - Weapons of Mass Destruction Found in Iraq.
CPU times: user 2.32 s, sys: 55.3 ms, total: 2.38 s
Wall time: 887 ms
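
The preprocessing above truncates but does not pad, so the encoded examples have different lengths. They have to be padded to a common length when a batch is formed; as far as we understand, `Trainer` does this at collation time when it is given a tokenizer. The sketch below shows the idea in plain Python. `pad_batch` is a hypothetical name, and the pad id `0` is taken from `padding_idx=0` in the model printout further down.

```python
def pad_batch(features, pad_id=0):
    """Pad every example to the length of the longest one in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # Padded positions get mask 0 so attention ignores them
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 2053, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2053, 4255, 1997, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 2053, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation on a dataset with widely varying sentence lengths such as RTE.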


## 3. Fine-tuning the model

In [14]:
num_labels = 3 if task.startswith('mnli') else 1 if task=='stsb' else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels).to(device)

print()
model

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi




DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [15]:
%%time
# define attributes to customize the training
metric_name = 'pearson' if task=='stsb' else 'matthews_correlation' if task=='cola' else 'accuracy'
args = TrainingArguments(
    output_dir='test-glue', evaluation_strategy='epoch', learning_rate=2e-5, 
    per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size,
    num_train_epochs=5, weight_decay=0.01, load_best_model_at_end=True, 
    metric_for_best_model=metric_name,
)

# define function to compute the metrics from the predictions
def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  if task=='stsb':
    predictions = predictions[:, 0]
  else:
    predictions = np.argmax(predictions, axis=1)
  return metric.compute(predictions=predictions, references=labels)


# generate trainer & train model
validation_key = 'validation_mismatched' if task=='mnli-mm' else 'validation_matched' if task=='mnli' else 'validation'
trainer = Trainer(model=model, args=args, 
                  train_dataset=dataset_encoded['train'], eval_dataset=dataset_encoded[validation_key],
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.68351,0.606498,0.8281,334.507
2,No log,0.670175,0.606498,0.8243,336.047
3,No log,0.768653,0.606498,0.8245,335.963
4,0.600800,0.79041,0.624549,0.826,335.366
5,0.600800,0.80856,0.628159,0.8308,333.411


CPU times: user 3min 35s, sys: 1min 54s, total: 5min 29s
Wall time: 3min 2s
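
The `No log` entries in the Training Loss column for epochs 1 to 3 follow from step arithmetic. The sketch below assumes a single device and the default `logging_steps` of 500, which is not set explicitly in the arguments above.

```python
import math

train_rows = 2490      # size of the RTE training split (see dataset printout)
batch_size = 16        # per-device batch size used above
num_epochs = 5
logging_steps = 500    # assumption: the default value of TrainingArguments

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch)     # 156
print(total_steps)         # 780
print(first_logged_epoch)  # 4
```

Step 500 is reached during the fourth epoch and step 1000 is never reached, which is consistent with a single training-loss value appearing from epoch 4 onward.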


In [16]:
%%time
# evaluate model
trainer.evaluate()

CPU times: user 1.29 s, sys: 483 ms, total: 1.77 s
Wall time: 1.15 s


{'epoch': 5.0,
 'eval_accuracy': 0.628158844765343,
 'eval_loss': 0.808559775352478,
 'eval_mem_cpu_alloc_delta': 229376,
 'eval_mem_cpu_peaked_delta': 0,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 185443840,
 'eval_runtime': 0.8272,
 'eval_samples_per_second': 334.861}
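
For reference, this is how raw model outputs turn into the reported accuracy. The sketch mirrors the logic of `compute_metrics` on made-up logits; the helper name and the example values are hypothetical.

```python
import numpy as np


def logits_to_predictions(logits, is_regression=False):
    """Take column 0 for regression (STS-B), otherwise the argmax class."""
    logits = np.asarray(logits)
    if is_regression:
        return logits[:, 0]
    return np.argmax(logits, axis=1)


# Made-up logits for four examples and two classes
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.2, 1.1]])
labels = np.array([0, 1, 0, 0])

preds = logits_to_predictions(logits)
toy_accuracy = float((preds == labels).mean())
print(preds.tolist())  # [0, 1, 1, 0]
print(toy_accuracy)    # 0.75

# The eval_accuracy above corresponds to 174 correct out of 277 examples
print(round(174 / 277, 6))  # 0.628159
```

With only 277 validation examples, one additional correct prediction changes accuracy by about 0.36 percentage points, so small differences between runs should be read with care.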

## 4. Hyperparameter search

In [17]:
%%time
trainer = Trainer(
    model_init=lambda : AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels),
    args=args,
    train_dataset=dataset_encoded['train'].shard(num_shards=10, index=1),  # search on one tenth of the training set to keep trials fast
    eval_dataset=dataset_encoded[validation_key],
    tokenizer=tokenizer, compute_metrics=compute_metrics
)

print('\n\n')
best_run = trainer.hyperparameter_search(n_trials=10, direction='maximize')

[I 2021-04-28 14:52:13,241] A new study created in memory with name: no-name-d184e8bb-7b77-4430-8fb3-c810e070a81c

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.691103,0.559567,0.833,332.526


[I 2021-04-28 14:52:24,946] Trial 0 finished with value: 0.5595667870036101 and parameters: {'learning_rate': 1.9733973444095682e-05, 'num_train_epochs': 1, 'seed': 30, 'per_device_train_batch_size': 64}. Best is trial 0 with value: 0.5595667870036101.

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.690961,0.548736,0.824,336.184


[I 2021-04-28 14:52:37,481] Trial 1 finished with value: 0.5487364620938628 and parameters: {'learning_rate': 1.3449116163536331e-05, 'num_train_epochs': 1, 'seed': 35, 'per_device_train_batch_size': 16}. Best is trial 0 with value: 0.5595667870036101.

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.693749,0.472924,0.8284,334.397
2,No log,0.748053,0.501805,0.8367,331.06
3,No log,1.073804,0.509025,0.8293,334.026
4,No log,1.538683,0.527076,0.8193,338.091
5,No log,1.615997,0.516245,0.8189,338.26


[I 2021-04-28 14:53:33,839] Trial 2 finished with value: 0.516245487364621 and parameters: {'learning_rate': 3.2910330230077714e-05, 'num_train_epochs': 5, 'seed': 27, 'per_device_train_batch_size': 4}. Best is trial 0 with value: 0.5595667870036101.

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.692012,0.516245,0.8228,336.646
2,No log,0.692066,0.545126,0.8197,337.923


[I 2021-04-28 14:53:58,722] Trial 3 finished with value: 0.5451263537906137 and parameters: {'learning_rate': 3.465006924197396e-05, 'num_train_epochs': 2, 'seed': 28, 'per_device_train_batch_size': 4}. Best is trial 0 with value: 0.5595667870036101.

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.694624,0.501805,0.8184,338.469
2,No log,0.694515,0.490975,0.8174,338.863


[I 2021-04-28 14:54:16,338] Trial 4 finished with value: 0.49097472924187724 and parameters: {'learning_rate': 5.870909271181118e-06, 'num_train_epochs': 2, 'seed': 9, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 0.5595667870036101.

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.694497,0.494585,0.8186,338.396


[I 2021-04-28 14:54:26,180] Trial 5 pruned.

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.694067,0.505415,0.9708,285.328


[I 2021-04-28 14:54:34,058] Trial 6 pruned.

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.691054,0.530686,1.0852,255.257


[I 2021-04-28 14:54:50,557] Trial 7 finished with value: 0.5306859205776173 and parameters: {'learning_rate': 3.3050881279515635e-05, 'num_train_epochs': 1, 'seed': 27, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 0.5595667870036101.

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.690612,0.545126,0.8151,339.856


[32m[I 2021-04-28 14:55:03,843][0m Trial 8 finished with value: 0.5451263537906137 and parameters: {'learning_rate': 7.259277959846856e-06, 'num_train_epochs': 1, 'seed': 37, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 0.5595667870036101.[0m

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.694118,0.480144,0.8202,337.713


[I 2021-04-28 14:55:11,887] Trial 9 pruned.


CPU times: user 3min 41s, sys: 1min 44s, total: 5min 26s
Wall time: 3min 3s
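
The "Best is trial 0" bookkeeping in the log above can be pictured as a maximum over the trials that finished, with pruned trials contributing no value. The following is a minimal sketch in plain Python, assuming the study maximizes accuracy. `TrialRecord` and `best_trial` are hypothetical names used for illustration only and are not part of Optuna; the values are copied from the log excerpt, which does not show trials 1 to 6.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TrialRecord:
    """Hypothetical stand-in for one line of the trial log."""
    number: int
    value: Optional[float]  # None means the trial was pruned


def best_trial(trials: Sequence[TrialRecord]) -> TrialRecord:
    """Return the finished trial with the highest value (assumes maximization)."""
    finished = [t for t in trials if t.value is not None]
    if not finished:
        raise ValueError("no finished trials to choose from")
    return max(finished, key=lambda t: t.value)


# Values copied from the log excerpt; trials 1-6 are not visible there and are omitted.
trials = [
    TrialRecord(0, 0.5595667870036101),
    TrialRecord(7, 0.5306859205776173),
    TrialRecord(8, 0.5451263537906137),
    TrialRecord(9, None),  # pruned, so it has no final value
]

best = best_trial(trials)
print(f"best trial: {best.number} (accuracy {best.value:.4f})")
```

Among the trials listed here, trial 0 stays on top, which matches the "Best is trial 0" message repeated after trials 7 and 8.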


In [18]:
%%time
# generate final trainer
trainer_final = Trainer(
    model_init=lambda: AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels),
    args=args,
    train_dataset=dataset_encoded['train'],  # train on the full training set
    eval_dataset=dataset_encoded[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# train the model with the best hyperparameters found by the search
print('< best_run >')
for n, v in best_run.hyperparameters.items():
    print(f'>>> {n} : {v}')
    setattr(trainer_final.args, n, v)

print('\n')
trainer_final.train()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

< best_run >
>>> learning_rate : 1.9733973444095682e-05
>>> num_train_epochs : 1
>>> seed : 30
>>> per_device_train_batch_size : 64





Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,No log,0.690103,0.563177,0.9173,301.963


CPU times: user 55 s, sys: 46.9 s, total: 1min 41s
Wall time: 44.8 s
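
The cell above copies each tuned value onto `trainer_final.args` with `setattr`. One thing to keep in mind is that `setattr` on an ordinary Python object creates a new attribute when the name is misspelled, so a typo in a hyperparameter name may pass unnoticed. The sketch below shows one possible guard. `DemoArgs` is a hypothetical stand-in for the real arguments object (its default values are arbitrary placeholders), and `apply_hyperparameters` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DemoArgs:
    """Hypothetical stand-in for the training arguments; defaults are placeholders."""
    learning_rate: float = 5e-5
    num_train_epochs: int = 3
    seed: int = 42
    per_device_train_batch_size: int = 8


def apply_hyperparameters(args: Any, hyperparameters: Dict[str, Any]) -> None:
    """Copy each value onto args, refusing names that args does not already have."""
    for name, value in hyperparameters.items():
        if not hasattr(args, name):
            raise AttributeError(f"unknown hyperparameter: {name!r}")
        setattr(args, name, value)


# Values printed under "< best_run >" in the output above.
best_hyperparameters = {
    "learning_rate": 1.9733973444095682e-05,
    "num_train_epochs": 1,
    "seed": 30,
    "per_device_train_batch_size": 64,
}

demo_args = DemoArgs()
apply_hyperparameters(demo_args, best_hyperparameters)
print(demo_args)
```

The same check could be placed inside the notebook's loop before the `setattr` call if stricter validation is wanted.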


In [19]:
%%time
# evaluate model
trainer_final.evaluate()

CPU times: user 1.35 s, sys: 952 ms, total: 2.3 s
Wall time: 1.42 s


{'epoch': 1.0,
 'eval_accuracy': 0.5631768953068592,
 'eval_loss': 0.690102756023407,
 'eval_mem_cpu_alloc_delta': 12288,
 'eval_mem_cpu_peaked_delta': 0,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 185443840,
 'eval_runtime': 1.0619,
 'eval_samples_per_second': 260.857}
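
The figures in this dictionary can be cross-checked with a little arithmetic. The sketch below assumes that `eval_samples_per_second` is the number of evaluated samples divided by `eval_runtime`, and that the GPU memory delta is reported in bytes. Under those assumptions it recovers the validation-set size and the number of correctly classified examples.

```python
# Values copied from the evaluate() output above.
metrics = {
    "eval_accuracy": 0.5631768953068592,
    "eval_runtime": 1.0619,
    "eval_samples_per_second": 260.857,
    "eval_mem_gpu_peaked_delta": 185443840,
}

# Assumption: samples_per_second = n_samples / runtime, so the product gives n_samples.
n_samples = round(metrics["eval_runtime"] * metrics["eval_samples_per_second"])

# Accuracy is the fraction of correct predictions, so this gives the correct count.
n_correct = round(metrics["eval_accuracy"] * n_samples)

# Assumption: the memory delta is in bytes; convert it to MiB for readability.
gpu_peak_mib = metrics["eval_mem_gpu_peaked_delta"] / 2**20

print(f"{n_correct} of {n_samples} validation examples classified correctly")
print(f"peak GPU memory delta during evaluation: {gpu_peak_mib:.1f} MiB")
```

With these numbers the first line reports 156 of 277, and 156/277 reproduces the logged `eval_accuracy`.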

In [20]:
# check execution time for whole code
e_time = time.time()
time_elapsed = e_time - s_time
print(f'Total time elapsed : {int(time_elapsed//60)} min {int(time_elapsed%60)} sec')

Total time elapsed : 7 min 19 sec
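
The minutes-and-seconds split used in the last cell can also be written with `divmod`, which returns the quotient and remainder in one call. The sketch below is an optional variant, not the notebook's own code: `format_elapsed` is a hypothetical helper, and `time.perf_counter()` is used because it is a monotonic clock intended for measuring intervals.

```python
import time


def format_elapsed(seconds: float) -> str:
    """Format a duration in seconds as '<m> min <s> sec', dropping fractions."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs} sec"


start = time.perf_counter()
# ... the work being timed would run here ...
elapsed = time.perf_counter() - start
print(f"Total time elapsed : {format_elapsed(elapsed)}")

# A duration of 439.6 seconds formats the same way as the output above.
print(format_elapsed(439.6))  # 7 min 19 sec
```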
