# HW 2: Efficient Fine-Tuning with BitFit
**Due: March 13, 11:30 AM**

In this homework assignment, you will replicate [the BitFit experiments (Ben Zaken et al., 2022)](https://aclanthology.org/2022.acl-short.1/). You will first use the [🤗 Transformers framework](https://huggingface.co/docs/transformers/index) to fine-tune a [BERT$_\text{tiny}$ model](https://huggingface.co/prajjwal1/bert-tiny) ([Turc et al., 2019](https://arxiv.org/abs/1908.08962); [Bhargava et al., 2021](https://aclanthology.org/2021.insights-1.18/)) on the IMDb dataset. You will then fine-tune the same model, but with all parameters frozen other than the bias terms. You will compare the two models on the following metrics: (1) their accuracy on the IMDb test set and (2) the number of parameters trained during fine-tuning.

## Important: Read Before Starting

In the following exercises, you will need to implement functions defined in the `train_model.py` and `test_model.py` scripts. **Please write all your code in those files.** You should not submit this notebook with your solutions, and we will not grade it if you do. Please be aware that code written in a Jupyter notebook may run differently when copied into Python modules.

The outputs shown in this notebook are the outputs that you should get **when all problems have been completed correctly**. You may obtain different results if you attempt to run the code cells before you have completed the problem set, or if you have completed one or more problems incorrectly.

For part of this assignment, you will be asked to fine-tune a BERT$_\text{tiny}$ model on the IMDb dataset with hyperparameter tuning. **This will take several hours to run on a laptop with a CPU.** You may want to instead run your code on [Google Colaboratory](https://colab.research.google.com/) using a free GPU.

To begin, please run the following `import` statements.

In [1]:
! pip install datasets evaluate optuna --quiet # install datasets if it is not included in your environment

In [4]:
import torch
from collections.abc import Iterable
from datasets import load_dataset

# Model and tokenizer from 🤗 Transformers
from transformers import AutoModelForSequenceClassification, \
    BertForSequenceClassification, BertTokenizerFast

# Code you will write for this assignment
from train_model import init_model, preprocess_dataset, init_trainer
from test_model import init_tester

## Problem 1: Setup (30 Points in Total)

In this assignment, you will fine-tune a pre-trained Transformer model using libraries provided by [Hugging Face](https://huggingface.co/) (whose name is usually styled using the emoji 🤗). You have already been exposed to Hugging Face in lab, where you used the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the IMDb dataset and the [🤗 Transformers](https://huggingface.co/docs/transformers/index) library to load a pre-trained BERT$_\text{tiny}$ model. In the following problems, you will additionally use the [🤗 Evaluate](https://huggingface.co/docs/evaluate/index) library, which provides evaluation metrics such as accuracy and F1.

For several parts of this problem, you will need to refer to the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training) for guidance.

### Problem 1a: Understand the 🤗 Transformers Library (No Submission, 0 Points)

🤗 Transformers is imported into Python via the name `transformers`. Please find the import statements from 🤗 Transformers in the code cell above.

🤗 Transformers comes with a number of different Transformer architectures, as well as [the Model Hub, a repository of pre-trained model parameters](https://huggingface.co/models). A pre-trained model is loaded by calling the model architecture's `.from_pretrained` method.

In [4]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The code above loads a Transformer classifier consisting of a pre-trained BERT$_\text{base}$ encoder with case-insensitive vocabulary and a randomly initialized 2-layer MLP decoder with tanh activation. The choice of this particular set of pre-trained parameters is specified by the identifier `'bert-base-uncased'`, which is passed to the first parameter of `.from_pretrained`. Different pre-trained weights can be loaded by passing a different identifier to `.from_pretrained`. The following code loads the BERT$_\text{tiny}$ model from [Turc et al. (2019)](https://arxiv.org/abs/1908.08962) and [Bhargava et al. (2021)](https://aclanthology.org/2021.insights-1.18/), which you will be fine-tuning in this assignment. (The `/` indicates that this is a user-submitted model, uploaded by the user [`prajjwal1`](https://huggingface.co/prajjwal1).)

In [5]:
model = BertForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2)

config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model = BertForSequenceClassification.from_pretrained(
    "prajjwal1/bert-mini", num_labels=2)

loading configuration file config.json from cache at C:\Users\liaof/.cache\huggingface\hub\models--prajjwal1--bert-mini\snapshots\5e123abc2480f0c4b4cac186d3b3f09299c258fc\config.json
Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at C:\Users\liaof/.cache\huggingface\hub\models--prajjwal1--bert-mini\snapshots\5e123abc2480f0c4b4cac186d3b3f09299c258fc\pytorch_model.bin
Some weights of the model checkpoint at prajjwal1/bert-mini were not used when initializing 

In order to load a model using the code above, you would have to know that BERT$_{\text{tiny}}$'s architecture is implemented using the same class as BERT$_{\text{base}}$. This is not true in general, however. For instance, if you wanted to initialize a RoBERTa classifier instead of a BERT classifier, you would need to call `RobertaForSequenceClassification.from_pretrained` instead of `BertForSequenceClassification.from_pretrained`. When you don't know which class implements the architecture of the pre-trained model you want to load, you can use the `AutoModelForSequenceClassification` class ([and equivalent classes for other tasks](https://huggingface.co/docs/transformers/model_doc/auto)), which will figure out which class to instantiate based on the pre-trained weights you would like to load.
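The dispatch described above can be observed without downloading any weights. The sketch below is only an illustration: it builds two randomly initialized toy models from configuration objects via `from_config`, and the hyperparameter values in `toy_sizes` are made up for speed and do not correspond to any released checkpoint.

```python
from transformers import (AutoModelForSequenceClassification, BertConfig,
                          RobertaConfig)

# Made-up, deliberately tiny hyperparameters (an assumption for illustration
# only) so that both models are built almost instantly.
toy_sizes = dict(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                 num_attention_heads=2, intermediate_size=64, num_labels=2)

# from_config creates a randomly initialized model; nothing is downloaded.
bert_like = AutoModelForSequenceClassification.from_config(
    BertConfig(**toy_sizes))
roberta_like = AutoModelForSequenceClassification.from_config(
    RobertaConfig(**toy_sizes))

# The Auto class picks the concrete architecture class from the config type.
print(type(bert_like).__name__)
print(type(roberta_like).__name__)
```

With `.from_pretrained`, the same selection happens based on the configuration file stored alongside the pre-trained weights.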

In [5]:
# This code loads the same BERT-tiny classifier as the
# BertForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", ...) call above
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2)

loading configuration file config.json from cache at C:\Users\liaof/.cache\huggingface\hub\models--prajjwal1--bert-tiny\snapshots\6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837\config.json
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at C:\Users\liaof/.cache\huggingface\hub\models--prajjwal1--bert-tiny\snapshots\6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837\pytorch_model.bin
Some weights of the model checkpoint at prajjwal1/b

In [6]:
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-mini", num_labels=2)

loading configuration file config.json from cache at C:\Users\liaof/.cache\huggingface\hub\models--prajjwal1--bert-mini\snapshots\5e123abc2480f0c4b4cac186d3b3f09299c258fc\config.json
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-mini",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "

In addition to models, 🤗 Transformers also provides tokenizers that implement a full processing pipeline similar to what you implemented in HW 2. You can load the appropriate tokenizer for your model using a `.from_pretrained` method, just as you did with the model.

In [7]:
tokenizer = BertTokenizerFast.from_pretrained("prajjwal1/bert-tiny")

As we saw in lab, the tokenizer object can be called as a function. Doing so will return a fully processed input, ready to be passed to the model.

In [8]:
# Because 🤗 Transformers supports multiple deep learning libraries, you will
# need to use the keyword parameter return_tensors in order to indicate that
# you want your inputs to be returned in PyTorch format.
inputs = tokenizer(["Hello world!", "How are you?"], padding=True,
                   return_tensors="pt")
inputs

{'input_ids': tensor([[ 101, 7592, 2088,  999,  102,    0],
        [ 101, 2129, 2024, 2017, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1]])}

The inputs returned by the tokenizer are passed to the model via [dictionary unpacking](https://realpython.com/python-kwargs-and-args/). The output of the model is structured, with various kinds of information provided depending on keyword arguments passed to the model.

In [9]:
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

print(outputs, end="\n\n")

# Use the dot operator to access parts of the output
print(outputs.logits)

SequenceClassifierOutput(loss=None, logits=tensor([[0.1665, 0.1385],
        [0.1117, 0.2058]]), hidden_states=None, attentions=None)

tensor([[0.1665, 0.1385],
        [0.1117, 0.2058]])
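
To turn logits like these into class probabilities and predicted labels, one common approach is a softmax followed by an argmax. The sketch below copies the logits from the output above into a tensor by hand; because the classification layer is still randomly initialized at this point, the resulting predictions carry no meaning and only illustrate the mechanics.

```python
import torch

# Logits copied from the example output above: one row per input sentence,
# one column per class.
logits = torch.tensor([[0.1665, 0.1385],
                       [0.1117, 0.2058]])

# Softmax over the last dimension turns each row into a probability
# distribution over the two classes.
probs = torch.softmax(logits, dim=-1)

# The predicted class is the index of the largest logit in each row.
preds = logits.argmax(dim=-1)

print(probs)
print(preds.tolist())  # [0, 1]
```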


### Problem 1b: Understand BERT Inputs (Written, 10 Points)

Look at the tokenized inputs printed above. The `inputs` are represented as a dict with three keys: `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`. What does each of those three inputs represent? Please consult the [original BERT paper (Devlin et al., 2018)](https://arxiv.org/abs/1810.04805) for guidance.

### Problem 1c: Understand BERT Hyperparameters (Written, 10 Points)

For this assignment, you will perform hyperparameter tuning for the BERT$_\text{tiny}$ model using the same procedure as in the [original paper](https://arxiv.org/abs/1908.08962). Their hyperparameter tuning procedure is documented in the [official BERT GitHub repository](https://github.com/google-research/bert) under the heading "**\*\*\*\*\*New March 11th, 2020: Smaller BERT Models\*\*\*\*\***." Please read this documentation and describe how hyperparameter tuning was performed for the GLUE benchmark.

### Problem 1d: Prepare Dataset (Code, 10 Points)

As in lab, we will be using the IMDb dataset provided by 🤗 Datasets.

In [None]:
# Load IMDb dataset and create validation split
imdb = load_dataset("imdb")
split = imdb["train"].train_test_split(.2, seed=3463)
imdb["train"] = split["train"]
imdb["val"] = split["test"]
del imdb["unsupervised"]

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading and preparing dataset imdb/plain_text to PATH REDACTED...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to PATH REDACTED. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

The 🤗 Transformers fine-tuning API expects datasets to be pre-processed through the following steps.
- All input texts should be tokenized.
- BERT models have a maximum input length, and all inputs need to be truncated to this length.
- Inputs shorter than the maximum input length should be padded to this length.
- The pre-processed inputs do not need to be in the form of PyTorch tensors.

These steps are performed by the `preprocess_dataset` function in `train_model.py`, which you will implement for this problem.
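To make the truncation and padding steps concrete, here is a minimal pure-Python sketch on lists of token ids. The helper `pad_or_truncate` is hypothetical and not part of the assignment code, the maximum length of 8 is a toy value (not the one you are asked to look up), and the naive slicing ignores details such as keeping the final `[SEP]` token. In practice, the tokenizer's own `padding`, `truncation`, and `max_length` arguments are the usual way to do this.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Hypothetical helper for illustration only.
    """
    ids = list(token_ids[:max_length])   # truncate inputs that are too long
    mask = [1] * len(ids)                # 1 marks positions holding real tokens
    n_pad = max_length - len(ids)        # number of padding positions to add
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_ids = [101, 7592, 2088, 999, 102]             # 5 token ids
long_ids = [101] + list(range(2000, 2010)) + [102]  # 12 token ids

short_out, short_mask = pad_or_truncate(short_ids, max_length=8)
long_out, long_mask = pad_or_truncate(long_ids, max_length=8)

print(short_out)   # [101, 7592, 2088, 999, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(len(long_out), long_mask)  # 8 [1, 1, 1, 1, 1, 1, 1, 1]
```

Note that the outputs are plain Python lists, consistent with the last step in the list above: the pre-processed inputs do not need to be PyTorch tensors.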

In [None]:
imdb["train"] = preprocess_dataset(imdb["train"], tokenizer)
imdb["val"] = preprocess_dataset(imdb["val"], tokenizer)
imdb["test"] = preprocess_dataset(imdb["test"], tokenizer)

# Visualize the preprocessed dataset
for k, v in imdb["val"][:2].items():
    print("{}:\n{}\n{}\n".format(k, type(v),
                                 [item[:20] if isinstance(item, Iterable) else
                                 item for item in v[:5]]))

  0%|          | 0/20000 [00:00<?, ?ex/s]

  0%|          | 0/5000 [00:00<?, ?ex/s]

  0%|          | 0/25000 [00:00<?, ?ex/s]

text:
<class 'list'>
['As so many others ha', 'When converting a bo']

label:
<class 'list'>
[1, 0]

input_ids:
<class 'list'>
[[101, 2004, 2061, 2116, 2500, 2031, 2517, 1010, 2023, 2003, 1037, 6919, 4516, 1012, 2182, 2003, 1037, 2862, 1997, 1996], [101, 2043, 16401, 1037, 2338, 2000, 2143, 1010, 2009, 2003, 3227, 1037, 2204, 2801, 2000, 2562, 2012, 2560, 2070, 1997]]

token_type_ids:
<class 'list'>
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

attention_mask:
<class 'list'>
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]



Please base your implementation on the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training), and please consult [Appendix A.2 of the BERT paper](https://arxiv.org/abs/1810.04805) to find out what the maximum input length should be.

## Problem 2: Implement Experiment (50 Points in Total)
### Problem 2a: Freeze Non-Bias Weights (Code, 10 Points)

At the end of this assignment, you will be comparing a BERT$_{\text{tiny}}$ model fine-tuned using BitFit to a BERT$_{\text{tiny}}$ model fine-tuned _without_ BitFit. To run that experiment, you will need to support freezing all non-bias parameters of the model. To do this, please implement the `init_model` function, illustrated below. This function should load a pre-trained BERT classifier model from the Hugging Face Model Hub and optionally freeze non-bias parameters.

In [None]:
# The first parameter is unused; we just pass None to it
model = init_model(None, "prajjwal1/bert-tiny", use_bitfit=True)

# Check if weight matrix is frozen
print(model.bert.encoder.layer[0].attention.self.query.weight.requires_grad)

# Check if bias term is frozen
print(model.bert.encoder.layer[0].attention.self.query.bias.requires_grad)

Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initia

False
True


In [2]:
# The first parameter is unused; we just pass None to it
model = init_model(None, "prajjwal1/bert-mini", use_bitfit=True)

# Check if weight matrix is frozen
print(model.bert.encoder.layer[0].attention.self.query.weight.requires_grad)

# Check if bias term is frozen
print(model.bert.encoder.layer[0].attention.self.query.bias.requires_grad)

Some weights of the model checkpoint at prajjwal1/bert-mini were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initia

False
True


**Hint:** Please consult the [documentation for the function `nn.Module.named_parameters`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_parameters).
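As a rough illustration of what `named_parameters` yields, the sketch below uses a toy two-layer network rather than BERT. It marks only parameters whose name ends in `bias` as trainable and then counts trainable versus total parameters. Selecting by the `bias` suffix is an assumption made for this toy example; how to treat each parameter of the real model is for you to decide in `init_model`.

```python
import torch.nn as nn

# Toy stand-in model (an assumption for illustration; not BERT).
toy = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 2))

# named_parameters yields (name, parameter) pairs such as ("0.weight", ...).
for name, param in toy.named_parameters():
    # Only parameters with requires_grad=True are updated during training.
    param.requires_grad = name.endswith("bias")
    print(name, tuple(param.shape), param.requires_grad)

# Count how many scalar parameters are trainable versus present in total.
n_trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in toy.parameters())
print(n_trainable, n_total)  # 5 23
```

The same counting idiom can be used later to report the number of parameters trained with and without BitFit.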

### Problem 2b: Set Up Trainer and Tester (Code, 20 Points)

🤗 Transformers provides a [`Trainer` object](https://huggingface.co/docs/transformers/main_classes/trainer) that implements training and testing a neural network. For this problem, please implement the functions `init_trainer` in `train_model.py` and `init_tester` in `test_model.py`, which will set up the `Trainer`s used to train and test your model, respectively.
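A `Trainer` can report evaluation metrics through a function passed as its `compute_metrics` argument. The sketch below is one possible accuracy function written with NumPy only; the function body and the fake data are assumptions for illustration, and the 🤗 Evaluate library mentioned earlier could be used in place of the manual computation.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)   # predicted class for each example
    return {"accuracy": float((preds == labels).mean())}


# Made-up logits and labels, used only to exercise the function.
fake_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.8]])
fake_labels = np.array([0, 1, 0, 0])

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)  # {'accuracy': 0.75}
```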

In [None]:
# Creates a Trainer from a Hugging Face Model Hub identifier
trainer = init_trainer("prajjwal1/bert-tiny", imdb["train"], imdb["val"])

# Train using the trainer
trainer.train()

# Change this to whichever checkpoint you want to evaluate
eval_checkpoint_directory = "checkpoints/run-13/checkpoint-1252"

# Creates a Trainer to test a Hugging Face saved model
tester = init_tester(eval_checkpoint_directory)

loading configuration file config.json from cache at PATH REDACTED
Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at PATH REDACTED
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 

{'loss': 0.7331, 'learning_rate': 4.99e-05, 'epoch': 0.01}


  1%|          | 26/5000 [00:00<02:14, 37.06it/s]

{'loss': 0.6895, 'learning_rate': 4.9800000000000004e-05, 'epoch': 0.02}


  1%|          | 34/5000 [00:01<02:12, 37.43it/s]

{'loss': 0.6842, 'learning_rate': 4.97e-05, 'epoch': 0.02}


  1%|          | 43/5000 [00:01<02:07, 38.86it/s]

{'loss': 0.6867, 'learning_rate': 4.96e-05, 'epoch': 0.03}


  1%|          | 58/5000 [00:01<02:01, 40.58it/s]

{'loss': 0.6876, 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.04}


  1%|▏         | 68/5000 [00:01<02:00, 40.97it/s]

{'loss': 0.6897, 'learning_rate': 4.94e-05, 'epoch': 0.05}


  2%|▏         | 78/5000 [00:02<02:01, 40.59it/s]

{'loss': 0.6765, 'learning_rate': 4.93e-05, 'epoch': 0.06}


  2%|▏         | 88/5000 [00:02<02:01, 40.27it/s]

{'loss': 0.6773, 'learning_rate': 4.92e-05, 'epoch': 0.06}


  2%|▏         | 98/5000 [00:02<02:02, 39.97it/s]

{'loss': 0.6653, 'learning_rate': 4.91e-05, 'epoch': 0.07}


  2%|▏         | 108/5000 [00:02<02:02, 39.97it/s]

{'loss': 0.6626, 'learning_rate': 4.9e-05, 'epoch': 0.08}


  2%|▏         | 118/5000 [00:03<01:58, 41.05it/s]

{'loss': 0.6614, 'learning_rate': 4.89e-05, 'epoch': 0.09}


  3%|▎         | 128/5000 [00:03<01:59, 40.87it/s]

{'loss': 0.6628, 'learning_rate': 4.88e-05, 'epoch': 0.1}


  3%|▎         | 138/5000 [00:03<02:00, 40.32it/s]

{'loss': 0.6433, 'learning_rate': 4.87e-05, 'epoch': 0.1}


  3%|▎         | 148/5000 [00:03<01:58, 40.96it/s]

{'loss': 0.6375, 'learning_rate': 4.86e-05, 'epoch': 0.11}


  3%|▎         | 158/5000 [00:04<01:59, 40.45it/s]

{'loss': 0.6515, 'learning_rate': 4.85e-05, 'epoch': 0.12}


  3%|▎         | 168/5000 [00:04<02:00, 40.11it/s]

{'loss': 0.6376, 'learning_rate': 4.8400000000000004e-05, 'epoch': 0.13}


  4%|▎         | 178/5000 [00:04<01:58, 40.68it/s]

{'loss': 0.6331, 'learning_rate': 4.83e-05, 'epoch': 0.14}


  4%|▍         | 188/5000 [00:04<01:58, 40.60it/s]

{'loss': 0.6136, 'learning_rate': 4.82e-05, 'epoch': 0.14}


  4%|▍         | 198/5000 [00:05<02:00, 39.91it/s]

{'loss': 0.5789, 'learning_rate': 4.8100000000000004e-05, 'epoch': 0.15}


  4%|▍         | 207/5000 [00:05<01:58, 40.53it/s]

{'loss': 0.5932, 'learning_rate': 4.8e-05, 'epoch': 0.16}


  4%|▍         | 217/5000 [00:05<01:58, 40.27it/s]

{'loss': 0.5755, 'learning_rate': 4.79e-05, 'epoch': 0.17}


  5%|▍         | 227/5000 [00:05<01:56, 40.96it/s]

{'loss': 0.5879, 'learning_rate': 4.78e-05, 'epoch': 0.18}


  5%|▍         | 237/5000 [00:06<01:55, 41.35it/s]

{'loss': 0.6087, 'learning_rate': 4.77e-05, 'epoch': 0.18}


  5%|▍         | 247/5000 [00:06<01:57, 40.40it/s]

{'loss': 0.6016, 'learning_rate': 4.76e-05, 'epoch': 0.19}


  5%|▌         | 257/5000 [00:06<02:00, 39.40it/s]

{'loss': 0.5788, 'learning_rate': 4.75e-05, 'epoch': 0.2}


  5%|▌         | 267/5000 [00:06<01:57, 40.24it/s]

{'loss': 0.5572, 'learning_rate': 4.74e-05, 'epoch': 0.21}


  6%|▌         | 277/5000 [00:07<01:57, 40.34it/s]

{'loss': 0.5365, 'learning_rate': 4.73e-05, 'epoch': 0.22}


  6%|▌         | 287/5000 [00:07<01:56, 40.40it/s]

{'loss': 0.5179, 'learning_rate': 4.72e-05, 'epoch': 0.22}


  6%|▌         | 297/5000 [00:07<01:57, 40.09it/s]

{'loss': 0.5466, 'learning_rate': 4.71e-05, 'epoch': 0.23}


  6%|▌         | 307/5000 [00:07<01:56, 40.12it/s]

{'loss': 0.5232, 'learning_rate': 4.7e-05, 'epoch': 0.24}


  6%|▋         | 317/5000 [00:08<01:54, 40.84it/s]

{'loss': 0.5091, 'learning_rate': 4.69e-05, 'epoch': 0.25}


  7%|▋         | 327/5000 [00:08<01:55, 40.62it/s]

{'loss': 0.5166, 'learning_rate': 4.6800000000000006e-05, 'epoch': 0.26}


  7%|▋         | 337/5000 [00:08<01:55, 40.47it/s]

{'loss': 0.51, 'learning_rate': 4.6700000000000003e-05, 'epoch': 0.26}


  7%|▋         | 347/5000 [00:08<01:56, 40.03it/s]

{'loss': 0.5547, 'learning_rate': 4.660000000000001e-05, 'epoch': 0.27}


  7%|▋         | 357/5000 [00:09<01:53, 40.74it/s]

{'loss': 0.4979, 'learning_rate': 4.6500000000000005e-05, 'epoch': 0.28}


  7%|▋         | 367/5000 [00:09<01:54, 40.54it/s]

{'loss': 0.5535, 'learning_rate': 4.64e-05, 'epoch': 0.29}


  8%|▊         | 377/5000 [00:09<01:53, 40.57it/s]

{'loss': 0.486, 'learning_rate': 4.630000000000001e-05, 'epoch': 0.3}


  8%|▊         | 387/5000 [00:09<01:52, 41.03it/s]

{'loss': 0.5138, 'learning_rate': 4.6200000000000005e-05, 'epoch': 0.3}


  8%|▊         | 397/5000 [00:10<01:52, 41.00it/s]

{'loss': 0.494, 'learning_rate': 4.61e-05, 'epoch': 0.31}


  8%|▊         | 407/5000 [00:10<01:52, 40.92it/s]

{'loss': 0.4347, 'learning_rate': 4.600000000000001e-05, 'epoch': 0.32}


  8%|▊         | 417/5000 [00:10<01:54, 40.19it/s]

{'loss': 0.5109, 'learning_rate': 4.5900000000000004e-05, 'epoch': 0.33}


  9%|▊         | 427/5000 [00:10<01:53, 40.42it/s]

{'loss': 0.4648, 'learning_rate': 4.58e-05, 'epoch': 0.34}


  9%|▊         | 437/5000 [00:11<01:52, 40.72it/s]

{'loss': 0.5157, 'learning_rate': 4.5700000000000006e-05, 'epoch': 0.34}


  9%|▉         | 447/5000 [00:11<01:51, 40.86it/s]

{'loss': 0.4145, 'learning_rate': 4.5600000000000004e-05, 'epoch': 0.35}


  9%|▉         | 457/5000 [00:11<01:50, 41.20it/s]

{'loss': 0.4879, 'learning_rate': 4.55e-05, 'epoch': 0.36}


  9%|▉         | 467/5000 [00:11<01:53, 40.11it/s]

{'loss': 0.4556, 'learning_rate': 4.5400000000000006e-05, 'epoch': 0.37}


 10%|▉         | 477/5000 [00:12<01:50, 40.77it/s]

{'loss': 0.4461, 'learning_rate': 4.53e-05, 'epoch': 0.38}


 10%|▉         | 487/5000 [00:12<01:50, 41.01it/s]

{'loss': 0.4173, 'learning_rate': 4.52e-05, 'epoch': 0.38}


 10%|▉         | 497/5000 [00:12<01:50, 40.68it/s]

{'loss': 0.4951, 'learning_rate': 4.5100000000000005e-05, 'epoch': 0.39}


 10%|█         | 507/5000 [00:12<01:48, 41.36it/s]

{'loss': 0.4373, 'learning_rate': 4.5e-05, 'epoch': 0.4}


 10%|█         | 517/5000 [00:13<01:48, 41.26it/s]

{'loss': 0.388, 'learning_rate': 4.49e-05, 'epoch': 0.41}


 11%|█         | 527/5000 [00:13<01:48, 41.12it/s]

{'loss': 0.4315, 'learning_rate': 4.4800000000000005e-05, 'epoch': 0.42}


 11%|█         | 537/5000 [00:13<01:48, 41.05it/s]

{'loss': 0.4084, 'learning_rate': 4.47e-05, 'epoch': 0.42}


 11%|█         | 547/5000 [00:13<01:47, 41.23it/s]

{'loss': 0.4401, 'learning_rate': 4.46e-05, 'epoch': 0.43}


 11%|█         | 557/5000 [00:13<01:47, 41.19it/s]

{'loss': 0.3766, 'learning_rate': 4.4500000000000004e-05, 'epoch': 0.44}


 11%|█▏        | 567/5000 [00:14<01:47, 41.24it/s]

{'loss': 0.369, 'learning_rate': 4.44e-05, 'epoch': 0.45}


 12%|█▏        | 577/5000 [00:14<01:48, 40.64it/s]

{'loss': 0.4048, 'learning_rate': 4.43e-05, 'epoch': 0.46}


 12%|█▏        | 587/5000 [00:14<01:47, 41.01it/s]

{'loss': 0.4036, 'learning_rate': 4.4200000000000004e-05, 'epoch': 0.46}


 12%|█▏        | 597/5000 [00:14<01:48, 40.46it/s]

{'loss': 0.3861, 'learning_rate': 4.41e-05, 'epoch': 0.47}


 12%|█▏        | 607/5000 [00:15<01:48, 40.66it/s]

{'loss': 0.3624, 'learning_rate': 4.4000000000000006e-05, 'epoch': 0.48}


 12%|█▏        | 617/5000 [00:15<01:47, 40.85it/s]

{'loss': 0.3315, 'learning_rate': 4.39e-05, 'epoch': 0.49}


 13%|█▎        | 627/5000 [00:15<01:45, 41.32it/s]

{'loss': 0.3686, 'learning_rate': 4.38e-05, 'epoch': 0.5}


 13%|█▎        | 637/5000 [00:15<01:44, 41.92it/s]

{'loss': 0.3636, 'learning_rate': 4.3700000000000005e-05, 'epoch': 0.5}


 13%|█▎        | 647/5000 [00:16<01:42, 42.26it/s]

{'loss': 0.4222, 'learning_rate': 4.36e-05, 'epoch': 0.51}


 13%|█▎        | 657/5000 [00:16<01:45, 41.32it/s]

{'loss': 0.3633, 'learning_rate': 4.35e-05, 'epoch': 0.52}


 13%|█▎        | 667/5000 [00:16<01:44, 41.28it/s]

{'loss': 0.4044, 'learning_rate': 4.3400000000000005e-05, 'epoch': 0.53}


 14%|█▎        | 677/5000 [00:16<01:44, 41.39it/s]

{'loss': 0.3883, 'learning_rate': 4.33e-05, 'epoch': 0.54}


 14%|█▎        | 687/5000 [00:17<01:43, 41.68it/s]

{'loss': 0.4183, 'learning_rate': 4.32e-05, 'epoch': 0.54}


 14%|█▍        | 697/5000 [00:17<01:46, 40.27it/s]

{'loss': 0.4298, 'learning_rate': 4.3100000000000004e-05, 'epoch': 0.55}


 14%|█▍        | 707/5000 [00:17<01:46, 40.21it/s]

{'loss': 0.417, 'learning_rate': 4.3e-05, 'epoch': 0.56}


 14%|█▍        | 717/5000 [00:17<01:45, 40.76it/s]

{'loss': 0.3559, 'learning_rate': 4.29e-05, 'epoch': 0.57}


 15%|█▍        | 727/5000 [00:18<01:42, 41.63it/s]

{'loss': 0.3578, 'learning_rate': 4.2800000000000004e-05, 'epoch': 0.58}


 15%|█▍        | 737/5000 [00:18<01:43, 41.02it/s]

{'loss': 0.4019, 'learning_rate': 4.27e-05, 'epoch': 0.58}


 15%|█▍        | 747/5000 [00:18<01:44, 40.52it/s]

{'loss': 0.447, 'learning_rate': 4.26e-05, 'epoch': 0.59}


 15%|█▌        | 757/5000 [00:18<01:45, 40.34it/s]

{'loss': 0.3333, 'learning_rate': 4.25e-05, 'epoch': 0.6}


 15%|█▌        | 767/5000 [00:19<01:44, 40.62it/s]

{'loss': 0.4315, 'learning_rate': 4.24e-05, 'epoch': 0.61}


 16%|█▌        | 777/5000 [00:19<01:42, 41.10it/s]

{'loss': 0.382, 'learning_rate': 4.23e-05, 'epoch': 0.62}


 16%|█▌        | 787/5000 [00:19<01:43, 40.67it/s]

{'loss': 0.3897, 'learning_rate': 4.22e-05, 'epoch': 0.62}


 16%|█▌        | 797/5000 [00:19<01:41, 41.44it/s]

{'loss': 0.3965, 'learning_rate': 4.21e-05, 'epoch': 0.63}


 16%|█▌        | 807/5000 [00:20<01:41, 41.44it/s]

{'loss': 0.3669, 'learning_rate': 4.2e-05, 'epoch': 0.64}


 16%|█▋        | 817/5000 [00:20<01:42, 40.82it/s]

{'loss': 0.3565, 'learning_rate': 4.19e-05, 'epoch': 0.65}


 17%|█▋        | 827/5000 [00:20<01:40, 41.64it/s]

{'loss': 0.4038, 'learning_rate': 4.18e-05, 'epoch': 0.66}


 17%|█▋        | 837/5000 [00:20<01:40, 41.51it/s]

{'loss': 0.4347, 'learning_rate': 4.17e-05, 'epoch': 0.66}


 17%|█▋        | 847/5000 [00:21<01:41, 40.73it/s]

{'loss': 0.3079, 'learning_rate': 4.16e-05, 'epoch': 0.67}


 17%|█▋        | 857/5000 [00:21<01:42, 40.39it/s]

{'loss': 0.3268, 'learning_rate': 4.15e-05, 'epoch': 0.68}


 17%|█▋        | 867/5000 [00:21<01:38, 41.86it/s]

{'loss': 0.3255, 'learning_rate': 4.14e-05, 'epoch': 0.69}


 18%|█▊        | 877/5000 [00:21<01:41, 40.73it/s]

{'loss': 0.305, 'learning_rate': 4.13e-05, 'epoch': 0.7}


 18%|█▊        | 887/5000 [00:22<01:39, 41.15it/s]

{'loss': 0.317, 'learning_rate': 4.12e-05, 'epoch': 0.7}


 18%|█▊        | 897/5000 [00:22<01:39, 41.14it/s]

{'loss': 0.3298, 'learning_rate': 4.11e-05, 'epoch': 0.71}


 18%|█▊        | 907/5000 [00:22<01:39, 41.01it/s]

{'loss': 0.4028, 'learning_rate': 4.1e-05, 'epoch': 0.72}


 18%|█▊        | 917/5000 [00:22<01:38, 41.44it/s]

{'loss': 0.4488, 'learning_rate': 4.09e-05, 'epoch': 0.73}


 19%|█▊        | 927/5000 [00:23<01:39, 40.83it/s]

{'loss': 0.2689, 'learning_rate': 4.08e-05, 'epoch': 0.74}


 19%|█▊        | 937/5000 [00:23<01:41, 40.23it/s]

{'loss': 0.3406, 'learning_rate': 4.07e-05, 'epoch': 0.74}


 19%|█▉        | 947/5000 [00:23<01:40, 40.48it/s]

{'loss': 0.36, 'learning_rate': 4.0600000000000004e-05, 'epoch': 0.75}


 19%|█▉        | 957/5000 [00:23<01:37, 41.39it/s]

{'loss': 0.462, 'learning_rate': 4.05e-05, 'epoch': 0.76}


 19%|█▉        | 967/5000 [00:24<01:36, 41.81it/s]

{'loss': 0.394, 'learning_rate': 4.0400000000000006e-05, 'epoch': 0.77}


 20%|█▉        | 977/5000 [00:24<01:35, 41.94it/s]

{'loss': 0.361, 'learning_rate': 4.0300000000000004e-05, 'epoch': 0.78}


 20%|█▉        | 987/5000 [00:24<01:35, 42.13it/s]

{'loss': 0.3858, 'learning_rate': 4.02e-05, 'epoch': 0.78}


 20%|█▉        | 997/5000 [00:24<01:39, 40.37it/s]

{'loss': 0.3761, 'learning_rate': 4.0100000000000006e-05, 'epoch': 0.79}


 20%|██        | 1007/5000 [00:24<01:37, 41.15it/s]

{'loss': 0.329, 'learning_rate': 4e-05, 'epoch': 0.8}


 20%|██        | 1017/5000 [00:25<01:35, 41.69it/s]

{'loss': 0.3795, 'learning_rate': 3.99e-05, 'epoch': 0.81}


 21%|██        | 1027/5000 [00:25<01:35, 41.48it/s]

{'loss': 0.3722, 'learning_rate': 3.9800000000000005e-05, 'epoch': 0.82}


 21%|██        | 1037/5000 [00:25<01:33, 42.25it/s]

{'loss': 0.3554, 'learning_rate': 3.97e-05, 'epoch': 0.82}


 21%|██        | 1047/5000 [00:25<01:33, 42.19it/s]

{'loss': 0.3038, 'learning_rate': 3.960000000000001e-05, 'epoch': 0.83}


 21%|██        | 1057/5000 [00:26<01:34, 41.58it/s]

{'loss': 0.446, 'learning_rate': 3.9500000000000005e-05, 'epoch': 0.84}


 21%|██▏       | 1067/5000 [00:26<01:34, 41.51it/s]

{'loss': 0.34, 'learning_rate': 3.94e-05, 'epoch': 0.85}


 22%|██▏       | 1077/5000 [00:26<01:33, 42.01it/s]

{'loss': 0.3955, 'learning_rate': 3.9300000000000007e-05, 'epoch': 0.86}


 22%|██▏       | 1087/5000 [00:26<01:35, 41.00it/s]

{'loss': 0.3895, 'learning_rate': 3.9200000000000004e-05, 'epoch': 0.86}


 22%|██▏       | 1097/5000 [00:27<01:35, 40.74it/s]

{'loss': 0.3357, 'learning_rate': 3.91e-05, 'epoch': 0.87}


 22%|██▏       | 1107/5000 [00:27<01:37, 39.90it/s]

{'loss': 0.3258, 'learning_rate': 3.9000000000000006e-05, 'epoch': 0.88}


 22%|██▏       | 1116/5000 [00:27<01:36, 40.05it/s]

{'loss': 0.3776, 'learning_rate': 3.8900000000000004e-05, 'epoch': 0.89}


 23%|██▎       | 1126/5000 [00:27<01:34, 40.99it/s]

{'loss': 0.2786, 'learning_rate': 3.88e-05, 'epoch': 0.9}


 23%|██▎       | 1136/5000 [00:28<01:36, 40.07it/s]

{'loss': 0.3215, 'learning_rate': 3.8700000000000006e-05, 'epoch': 0.9}


 23%|██▎       | 1146/5000 [00:28<01:36, 39.77it/s]

{'loss': 0.4366, 'learning_rate': 3.86e-05, 'epoch': 0.91}


 23%|██▎       | 1155/5000 [00:28<01:37, 39.25it/s]

{'loss': 0.3086, 'learning_rate': 3.85e-05, 'epoch': 0.92}


 23%|██▎       | 1168/5000 [00:28<01:36, 39.80it/s]

{'loss': 0.3748, 'learning_rate': 3.8400000000000005e-05, 'epoch': 0.93}


 24%|██▎       | 1176/5000 [00:29<01:35, 39.90it/s]

{'loss': 0.4312, 'learning_rate': 3.83e-05, 'epoch': 0.94}


 24%|██▎       | 1186/5000 [00:29<01:34, 40.48it/s]

{'loss': 0.3942, 'learning_rate': 3.82e-05, 'epoch': 0.94}


 24%|██▍       | 1196/5000 [00:29<01:33, 40.71it/s]

{'loss': 0.2586, 'learning_rate': 3.8100000000000005e-05, 'epoch': 0.95}


 24%|██▍       | 1206/5000 [00:29<01:32, 41.10it/s]

{'loss': 0.3473, 'learning_rate': 3.8e-05, 'epoch': 0.96}


 24%|██▍       | 1216/5000 [00:30<01:31, 41.41it/s]

{'loss': 0.316, 'learning_rate': 3.79e-05, 'epoch': 0.97}


 25%|██▍       | 1226/5000 [00:30<01:30, 41.57it/s]

{'loss': 0.3153, 'learning_rate': 3.7800000000000004e-05, 'epoch': 0.98}


 25%|██▍       | 1236/5000 [00:30<01:32, 40.48it/s]

{'loss': 0.3073, 'learning_rate': 3.77e-05, 'epoch': 0.98}


 25%|██▍       | 1246/5000 [00:30<01:31, 40.86it/s]

{'loss': 0.4348, 'learning_rate': 3.76e-05, 'epoch': 0.99}


 25%|██▌       | 1250/5000 [00:30<01:31, 40.86it/s]
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16


{'loss': 0.3657, 'learning_rate': 3.7500000000000003e-05, 'epoch': 1.0}



 25%|██▌       | 1250/5000 [00:34<01:31, 40.86it/s]
Saving model checkpoint to checkpoints\checkpoint-1250
Configuration saved in checkpoints\checkpoint-1250\config.json
Model weights saved in checkpoints\checkpoint-1250\pytorch_model.bin
 25%|██▌       | 1256/5000 [00:34<10:43,  5.82it/s]

{'eval_loss': 0.3361194431781769, 'eval_accuracy': 0.8512, 'eval_runtime': 3.4496, 'eval_samples_per_second': 1449.439, 'eval_steps_per_second': 90.735, 'epoch': 1.0}


 25%|██▌       | 1266/5000 [00:34<05:59, 10.38it/s]

{'loss': 0.287, 'learning_rate': 3.74e-05, 'epoch': 1.01}


 26%|██▌       | 1275/5000 [00:35<03:51, 16.08it/s]

{'loss': 0.4047, 'learning_rate': 3.73e-05, 'epoch': 1.02}


 26%|██▌       | 1289/5000 [00:35<02:16, 27.13it/s]

{'loss': 0.3203, 'learning_rate': 3.72e-05, 'epoch': 1.02}


 26%|██▌       | 1294/5000 [00:35<02:02, 30.15it/s]

{'loss': 0.2636, 'learning_rate': 3.71e-05, 'epoch': 1.03}


 26%|██▌       | 1304/5000 [00:35<01:44, 35.20it/s]

{'loss': 0.3577, 'learning_rate': 3.7e-05, 'epoch': 1.04}


 26%|██▋       | 1314/5000 [00:36<01:39, 37.02it/s]

{'loss': 0.3563, 'learning_rate': 3.69e-05, 'epoch': 1.05}


 26%|██▋       | 1324/5000 [00:36<01:34, 38.70it/s]

{'loss': 0.3571, 'learning_rate': 3.68e-05, 'epoch': 1.06}


 27%|██▋       | 1338/5000 [00:36<01:32, 39.53it/s]

{'loss': 0.3207, 'learning_rate': 3.6700000000000004e-05, 'epoch': 1.06}


 27%|██▋       | 1348/5000 [00:36<01:30, 40.54it/s]

{'loss': 0.2349, 'learning_rate': 3.66e-05, 'epoch': 1.07}


 27%|██▋       | 1358/5000 [00:37<01:29, 40.91it/s]

{'loss': 0.3563, 'learning_rate': 3.65e-05, 'epoch': 1.08}


 27%|██▋       | 1368/5000 [00:37<01:27, 41.44it/s]

{'loss': 0.3496, 'learning_rate': 3.6400000000000004e-05, 'epoch': 1.09}


 28%|██▊       | 1378/5000 [00:37<01:29, 40.57it/s]

{'loss': 0.3223, 'learning_rate': 3.63e-05, 'epoch': 1.1}


 28%|██▊       | 1388/5000 [00:37<01:28, 40.89it/s]

{'loss': 0.3245, 'learning_rate': 3.62e-05, 'epoch': 1.1}


 28%|██▊       | 1398/5000 [00:38<01:27, 41.11it/s]

{'loss': 0.3644, 'learning_rate': 3.61e-05, 'epoch': 1.11}


 28%|██▊       | 1408/5000 [00:38<01:28, 40.77it/s]

{'loss': 0.3438, 'learning_rate': 3.6e-05, 'epoch': 1.12}


 28%|██▊       | 1418/5000 [00:38<01:28, 40.64it/s]

{'loss': 0.3196, 'learning_rate': 3.59e-05, 'epoch': 1.13}


 29%|██▊       | 1427/5000 [00:38<01:30, 39.51it/s]

{'loss': 0.2492, 'learning_rate': 3.58e-05, 'epoch': 1.14}


 29%|██▊       | 1437/5000 [00:39<01:27, 40.54it/s]

{'loss': 0.3063, 'learning_rate': 3.57e-05, 'epoch': 1.14}


 29%|██▉       | 1447/5000 [00:39<01:26, 40.92it/s]

{'loss': 0.3662, 'learning_rate': 3.56e-05, 'epoch': 1.15}


 29%|██▉       | 1457/5000 [00:39<01:24, 41.80it/s]

{'loss': 0.2686, 'learning_rate': 3.55e-05, 'epoch': 1.16}


 29%|██▉       | 1467/5000 [00:39<01:24, 41.87it/s]

{'loss': 0.3732, 'learning_rate': 3.54e-05, 'epoch': 1.17}


 30%|██▉       | 1477/5000 [00:40<01:26, 40.79it/s]

{'loss': 0.3534, 'learning_rate': 3.53e-05, 'epoch': 1.18}


 30%|██▉       | 1487/5000 [00:40<01:26, 40.72it/s]

{'loss': 0.316, 'learning_rate': 3.52e-05, 'epoch': 1.18}


 30%|██▉       | 1496/5000 [00:40<01:28, 39.65it/s]

{'loss': 0.3024, 'learning_rate': 3.51e-05, 'epoch': 1.19}


 30%|███       | 1506/5000 [00:40<01:26, 40.42it/s]

{'loss': 0.2656, 'learning_rate': 3.5e-05, 'epoch': 1.2}


 30%|███       | 1516/5000 [00:41<01:26, 40.13it/s]

{'loss': 0.3019, 'learning_rate': 3.49e-05, 'epoch': 1.21}


 31%|███       | 1526/5000 [00:41<01:25, 40.54it/s]

{'loss': 0.2941, 'learning_rate': 3.48e-05, 'epoch': 1.22}


 31%|███       | 1536/5000 [00:41<01:27, 39.80it/s]

{'loss': 0.4305, 'learning_rate': 3.4699999999999996e-05, 'epoch': 1.22}


 31%|███       | 1545/5000 [00:41<01:26, 40.11it/s]

{'loss': 0.3331, 'learning_rate': 3.46e-05, 'epoch': 1.23}


 31%|███       | 1555/5000 [00:42<01:24, 40.77it/s]

{'loss': 0.2984, 'learning_rate': 3.45e-05, 'epoch': 1.24}


 31%|███▏      | 1565/5000 [00:42<01:24, 40.51it/s]

{'loss': 0.3364, 'learning_rate': 3.4399999999999996e-05, 'epoch': 1.25}


 32%|███▏      | 1575/5000 [00:42<01:25, 40.04it/s]

{'loss': 0.316, 'learning_rate': 3.430000000000001e-05, 'epoch': 1.26}


 32%|███▏      | 1585/5000 [00:42<01:26, 39.60it/s]

{'loss': 0.3358, 'learning_rate': 3.4200000000000005e-05, 'epoch': 1.26}


 32%|███▏      | 1595/5000 [00:43<01:24, 40.33it/s]

{'loss': 0.2609, 'learning_rate': 3.41e-05, 'epoch': 1.27}


 32%|███▏      | 1605/5000 [00:43<01:23, 40.81it/s]

{'loss': 0.3794, 'learning_rate': 3.4000000000000007e-05, 'epoch': 1.28}


 32%|███▏      | 1615/5000 [00:43<01:21, 41.37it/s]

{'loss': 0.2412, 'learning_rate': 3.3900000000000004e-05, 'epoch': 1.29}


 32%|███▎      | 1625/5000 [00:43<01:21, 41.28it/s]

{'loss': 0.3675, 'learning_rate': 3.38e-05, 'epoch': 1.3}


 33%|███▎      | 1635/5000 [00:43<01:21, 41.06it/s]

{'loss': 0.3012, 'learning_rate': 3.3700000000000006e-05, 'epoch': 1.3}


 33%|███▎      | 1645/5000 [00:44<01:20, 41.83it/s]

{'loss': 0.3659, 'learning_rate': 3.3600000000000004e-05, 'epoch': 1.31}


 33%|███▎      | 1655/5000 [00:44<01:22, 40.50it/s]

{'loss': 0.266, 'learning_rate': 3.35e-05, 'epoch': 1.32}


 33%|███▎      | 1665/5000 [00:44<01:21, 40.81it/s]

{'loss': 0.2574, 'learning_rate': 3.3400000000000005e-05, 'epoch': 1.33}


 34%|███▎      | 1675/5000 [00:44<01:20, 41.27it/s]

{'loss': 0.3455, 'learning_rate': 3.33e-05, 'epoch': 1.34}


 34%|███▎      | 1685/5000 [00:45<01:23, 39.86it/s]

{'loss': 0.3268, 'learning_rate': 3.32e-05, 'epoch': 1.34}


 34%|███▍      | 1695/5000 [00:45<01:23, 39.82it/s]

{'loss': 0.2617, 'learning_rate': 3.3100000000000005e-05, 'epoch': 1.35}


 34%|███▍      | 1705/5000 [00:45<01:21, 40.47it/s]

{'loss': 0.3405, 'learning_rate': 3.3e-05, 'epoch': 1.36}


 34%|███▍      | 1715/5000 [00:45<01:19, 41.25it/s]

{'loss': 0.3316, 'learning_rate': 3.29e-05, 'epoch': 1.37}


 34%|███▍      | 1725/5000 [00:46<01:19, 41.11it/s]

{'loss': 0.434, 'learning_rate': 3.2800000000000004e-05, 'epoch': 1.38}


 35%|███▍      | 1735/5000 [00:46<01:19, 41.05it/s]

{'loss': 0.2702, 'learning_rate': 3.27e-05, 'epoch': 1.38}


 35%|███▍      | 1745/5000 [00:46<01:18, 41.39it/s]

{'loss': 0.3211, 'learning_rate': 3.26e-05, 'epoch': 1.39}


 35%|███▌      | 1755/5000 [00:46<01:21, 39.96it/s]

{'loss': 0.3159, 'learning_rate': 3.2500000000000004e-05, 'epoch': 1.4}


 35%|███▌      | 1765/5000 [00:47<01:20, 39.98it/s]

{'loss': 0.2686, 'learning_rate': 3.24e-05, 'epoch': 1.41}


 36%|███▌      | 1775/5000 [00:47<01:20, 40.07it/s]

{'loss': 0.2691, 'learning_rate': 3.2300000000000006e-05, 'epoch': 1.42}


 36%|███▌      | 1785/5000 [00:47<01:19, 40.30it/s]

{'loss': 0.2215, 'learning_rate': 3.2200000000000003e-05, 'epoch': 1.42}


 36%|███▌      | 1795/5000 [00:47<01:17, 41.20it/s]

{'loss': 0.3184, 'learning_rate': 3.21e-05, 'epoch': 1.43}


 36%|███▌      | 1805/5000 [00:48<01:19, 40.22it/s]

{'loss': 0.2121, 'learning_rate': 3.2000000000000005e-05, 'epoch': 1.44}


 36%|███▋      | 1815/5000 [00:48<01:18, 40.60it/s]

{'loss': 0.3699, 'learning_rate': 3.19e-05, 'epoch': 1.45}


 36%|███▋      | 1825/5000 [00:48<01:18, 40.49it/s]

{'loss': 0.2537, 'learning_rate': 3.18e-05, 'epoch': 1.46}


 37%|███▋      | 1835/5000 [00:48<01:18, 40.33it/s]

{'loss': 0.3785, 'learning_rate': 3.1700000000000005e-05, 'epoch': 1.46}


 37%|███▋      | 1845/5000 [00:49<01:17, 40.62it/s]

{'loss': 0.4027, 'learning_rate': 3.16e-05, 'epoch': 1.47}


 37%|███▋      | 1855/5000 [00:49<01:16, 40.99it/s]

{'loss': 0.3233, 'learning_rate': 3.15e-05, 'epoch': 1.48}


 37%|███▋      | 1865/5000 [00:49<01:17, 40.58it/s]

{'loss': 0.2523, 'learning_rate': 3.1400000000000004e-05, 'epoch': 1.49}


 38%|███▊      | 1875/5000 [00:49<01:17, 40.12it/s]

{'loss': 0.3127, 'learning_rate': 3.13e-05, 'epoch': 1.5}


 38%|███▊      | 1885/5000 [00:50<01:17, 40.43it/s]

{'loss': 0.2939, 'learning_rate': 3.12e-05, 'epoch': 1.5}


 38%|███▊      | 1895/5000 [00:50<01:15, 41.31it/s]

{'loss': 0.2506, 'learning_rate': 3.1100000000000004e-05, 'epoch': 1.51}


 38%|███▊      | 1905/5000 [00:50<01:16, 40.54it/s]

{'loss': 0.3164, 'learning_rate': 3.1e-05, 'epoch': 1.52}


 38%|███▊      | 1915/5000 [00:50<01:15, 40.94it/s]

{'loss': 0.2801, 'learning_rate': 3.09e-05, 'epoch': 1.53}


 38%|███▊      | 1925/5000 [00:51<01:14, 41.24it/s]

{'loss': 0.3773, 'learning_rate': 3.08e-05, 'epoch': 1.54}


 39%|███▊      | 1935/5000 [00:51<01:15, 40.47it/s]

{'loss': 0.3221, 'learning_rate': 3.07e-05, 'epoch': 1.54}


 39%|███▉      | 1945/5000 [00:51<01:15, 40.56it/s]

{'loss': 0.2977, 'learning_rate': 3.06e-05, 'epoch': 1.55}


 39%|███▉      | 1955/5000 [00:51<01:15, 40.09it/s]

{'loss': 0.3019, 'learning_rate': 3.05e-05, 'epoch': 1.56}


 39%|███▉      | 1965/5000 [00:52<01:14, 40.86it/s]

{'loss': 0.405, 'learning_rate': 3.04e-05, 'epoch': 1.57}


 40%|███▉      | 1975/5000 [00:52<01:12, 41.77it/s]

{'loss': 0.3867, 'learning_rate': 3.03e-05, 'epoch': 1.58}


 40%|███▉      | 1985/5000 [00:52<01:13, 40.75it/s]

{'loss': 0.3661, 'learning_rate': 3.02e-05, 'epoch': 1.58}


 40%|███▉      | 1995/5000 [00:52<01:12, 41.51it/s]

{'loss': 0.2239, 'learning_rate': 3.01e-05, 'epoch': 1.59}


 40%|████      | 2005/5000 [00:53<01:12, 41.34it/s]

{'loss': 0.2869, 'learning_rate': 3e-05, 'epoch': 1.6}


 40%|████      | 2015/5000 [00:53<01:11, 41.69it/s]

{'loss': 0.3226, 'learning_rate': 2.9900000000000002e-05, 'epoch': 1.61}


 40%|████      | 2025/5000 [00:53<01:12, 41.23it/s]

{'loss': 0.3317, 'learning_rate': 2.98e-05, 'epoch': 1.62}


 41%|████      | 2035/5000 [00:53<01:12, 40.76it/s]

{'loss': 0.2785, 'learning_rate': 2.97e-05, 'epoch': 1.62}


 41%|████      | 2045/5000 [00:54<01:12, 40.71it/s]

{'loss': 0.2564, 'learning_rate': 2.96e-05, 'epoch': 1.63}


 41%|████      | 2055/5000 [00:54<01:12, 40.44it/s]

{'loss': 0.2722, 'learning_rate': 2.95e-05, 'epoch': 1.64}


 41%|████▏     | 2065/5000 [00:54<01:11, 41.24it/s]

{'loss': 0.3026, 'learning_rate': 2.94e-05, 'epoch': 1.65}


 42%|████▏     | 2075/5000 [00:54<01:11, 41.15it/s]

{'loss': 0.2815, 'learning_rate': 2.93e-05, 'epoch': 1.66}


 42%|████▏     | 2085/5000 [00:54<01:09, 41.71it/s]

{'loss': 0.2942, 'learning_rate': 2.9199999999999998e-05, 'epoch': 1.66}


 42%|████▏     | 2095/5000 [00:55<01:08, 42.19it/s]

{'loss': 0.2903, 'learning_rate': 2.91e-05, 'epoch': 1.67}


 42%|████▏     | 2105/5000 [00:55<01:08, 42.13it/s]

{'loss': 0.3432, 'learning_rate': 2.9e-05, 'epoch': 1.68}


 42%|████▏     | 2115/5000 [00:55<01:08, 42.17it/s]

{'loss': 0.3211, 'learning_rate': 2.8899999999999998e-05, 'epoch': 1.69}


 42%|████▎     | 2125/5000 [00:55<01:08, 41.79it/s]

{'loss': 0.2897, 'learning_rate': 2.88e-05, 'epoch': 1.7}


 43%|████▎     | 2135/5000 [00:56<01:10, 40.88it/s]

{'loss': 0.2815, 'learning_rate': 2.87e-05, 'epoch': 1.7}


 43%|████▎     | 2145/5000 [00:56<01:08, 41.45it/s]

{'loss': 0.3069, 'learning_rate': 2.86e-05, 'epoch': 1.71}


 43%|████▎     | 2155/5000 [00:56<01:08, 41.57it/s]

{'loss': 0.2421, 'learning_rate': 2.8499999999999998e-05, 'epoch': 1.72}


 43%|████▎     | 2165/5000 [00:56<01:07, 42.06it/s]

{'loss': 0.3081, 'learning_rate': 2.84e-05, 'epoch': 1.73}


 44%|████▎     | 2175/5000 [00:57<01:06, 42.25it/s]

{'loss': 0.3199, 'learning_rate': 2.83e-05, 'epoch': 1.74}


 44%|████▎     | 2185/5000 [00:57<01:06, 42.15it/s]

{'loss': 0.3468, 'learning_rate': 2.8199999999999998e-05, 'epoch': 1.74}


 44%|████▍     | 2195/5000 [00:57<01:07, 41.37it/s]

{'loss': 0.3314, 'learning_rate': 2.8100000000000005e-05, 'epoch': 1.75}


 44%|████▍     | 2205/5000 [00:57<01:07, 41.41it/s]

{'loss': 0.1803, 'learning_rate': 2.8000000000000003e-05, 'epoch': 1.76}


 44%|████▍     | 2215/5000 [00:58<01:06, 41.76it/s]

{'loss': 0.3909, 'learning_rate': 2.7900000000000004e-05, 'epoch': 1.77}


 44%|████▍     | 2225/5000 [00:58<01:06, 41.89it/s]

{'loss': 0.3742, 'learning_rate': 2.7800000000000005e-05, 'epoch': 1.78}


 45%|████▍     | 2235/5000 [00:58<01:05, 41.97it/s]

{'loss': 0.2556, 'learning_rate': 2.7700000000000002e-05, 'epoch': 1.78}


 45%|████▍     | 2245/5000 [00:58<01:05, 42.18it/s]

{'loss': 0.2857, 'learning_rate': 2.7600000000000003e-05, 'epoch': 1.79}


 45%|████▌     | 2255/5000 [00:59<01:04, 42.24it/s]

{'loss': 0.3106, 'learning_rate': 2.7500000000000004e-05, 'epoch': 1.8}


 45%|████▌     | 2265/5000 [00:59<01:05, 41.79it/s]

{'loss': 0.3011, 'learning_rate': 2.7400000000000002e-05, 'epoch': 1.81}


 46%|████▌     | 2275/5000 [00:59<01:04, 41.94it/s]

{'loss': 0.3294, 'learning_rate': 2.7300000000000003e-05, 'epoch': 1.82}


 46%|████▌     | 2285/5000 [00:59<01:04, 41.81it/s]

{'loss': 0.2965, 'learning_rate': 2.7200000000000004e-05, 'epoch': 1.82}


 46%|████▌     | 2295/5000 [00:59<01:04, 41.66it/s]

{'loss': 0.3594, 'learning_rate': 2.7100000000000005e-05, 'epoch': 1.83}


 46%|████▌     | 2305/5000 [01:00<01:05, 40.99it/s]

{'loss': 0.3298, 'learning_rate': 2.7000000000000002e-05, 'epoch': 1.84}


 46%|████▋     | 2315/5000 [01:00<01:04, 41.40it/s]

{'loss': 0.3801, 'learning_rate': 2.6900000000000003e-05, 'epoch': 1.85}


 46%|████▋     | 2325/5000 [01:00<01:03, 41.87it/s]

{'loss': 0.3106, 'learning_rate': 2.6800000000000004e-05, 'epoch': 1.86}


 47%|████▋     | 2335/5000 [01:00<01:04, 41.64it/s]

{'loss': 0.2865, 'learning_rate': 2.6700000000000002e-05, 'epoch': 1.86}


 47%|████▋     | 2345/5000 [01:01<01:03, 41.55it/s]

{'loss': 0.3581, 'learning_rate': 2.6600000000000003e-05, 'epoch': 1.87}


 47%|████▋     | 2355/5000 [01:01<01:04, 40.97it/s]

{'loss': 0.2615, 'learning_rate': 2.6500000000000004e-05, 'epoch': 1.88}


 47%|████▋     | 2365/5000 [01:01<01:03, 41.64it/s]

{'loss': 0.289, 'learning_rate': 2.64e-05, 'epoch': 1.89}


 48%|████▊     | 2375/5000 [01:01<01:02, 41.69it/s]

{'loss': 0.2662, 'learning_rate': 2.6300000000000002e-05, 'epoch': 1.9}


 48%|████▊     | 2385/5000 [01:02<01:03, 41.39it/s]

{'loss': 0.3016, 'learning_rate': 2.6200000000000003e-05, 'epoch': 1.9}


 48%|████▊     | 2395/5000 [01:02<01:02, 41.61it/s]

{'loss': 0.2803, 'learning_rate': 2.61e-05, 'epoch': 1.91}


 48%|████▊     | 2405/5000 [01:02<01:02, 41.61it/s]

{'loss': 0.346, 'learning_rate': 2.6000000000000002e-05, 'epoch': 1.92}


 48%|████▊     | 2415/5000 [01:02<01:03, 40.88it/s]

{'loss': 0.2502, 'learning_rate': 2.5900000000000003e-05, 'epoch': 1.93}


 48%|████▊     | 2425/5000 [01:03<01:02, 41.20it/s]

{'loss': 0.3028, 'learning_rate': 2.58e-05, 'epoch': 1.94}


 49%|████▊     | 2435/5000 [01:03<01:01, 41.42it/s]

{'loss': 0.3055, 'learning_rate': 2.57e-05, 'epoch': 1.94}


 49%|████▉     | 2445/5000 [01:03<01:01, 41.77it/s]

{'loss': 0.3262, 'learning_rate': 2.5600000000000002e-05, 'epoch': 1.95}


 49%|████▉     | 2455/5000 [01:03<01:01, 41.57it/s]

{'loss': 0.245, 'learning_rate': 2.5500000000000003e-05, 'epoch': 1.96}


 49%|████▉     | 2465/5000 [01:04<01:01, 41.53it/s]

{'loss': 0.3653, 'learning_rate': 2.54e-05, 'epoch': 1.97}


 50%|████▉     | 2475/5000 [01:04<01:01, 41.27it/s]

{'loss': 0.2571, 'learning_rate': 2.5300000000000002e-05, 'epoch': 1.98}


 50%|████▉     | 2485/5000 [01:04<01:00, 41.46it/s]

{'loss': 0.2963, 'learning_rate': 2.5200000000000003e-05, 'epoch': 1.98}


 50%|████▉     | 2495/5000 [01:04<00:59, 41.88it/s]

{'loss': 0.2926, 'learning_rate': 2.51e-05, 'epoch': 1.99}


 50%|█████     | 2500/5000 [01:04<00:59, 42.30it/s]
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16


{'loss': 0.3248, 'learning_rate': 2.5e-05, 'epoch': 2.0}


                                                   
 50%|█████     | 2500/5000 [01:08<00:59, 42.30it/s]
Saving model checkpoint to checkpoints\checkpoint-2500
Configuration saved in checkpoints\checkpoint-2500\config.json
Model weights saved in checkpoints\checkpoint-2500\pytorch_model.bin
 50%|█████     | 2505/5000 [01:08<09:39,  4.31it/s]

{'eval_loss': 0.3163749575614929, 'eval_accuracy': 0.871, 'eval_runtime': 3.414, 'eval_samples_per_second': 1464.544, 'eval_steps_per_second': 91.68, 'epoch': 2.0}


 50%|█████     | 2515/5000 [01:08<05:12,  7.95it/s]

{'loss': 0.2478, 'learning_rate': 2.4900000000000002e-05, 'epoch': 2.01}


 50%|█████     | 2525/5000 [01:08<03:03, 13.50it/s]

{'loss': 0.3351, 'learning_rate': 2.48e-05, 'epoch': 2.02}


 51%|█████     | 2535/5000 [01:09<01:59, 20.71it/s]

{'loss': 0.2665, 'learning_rate': 2.47e-05, 'epoch': 2.02}


 51%|█████     | 2545/5000 [01:09<01:27, 27.98it/s]

{'loss': 0.2471, 'learning_rate': 2.46e-05, 'epoch': 2.03}


 51%|█████     | 2555/5000 [01:09<01:12, 33.70it/s]

{'loss': 0.2986, 'learning_rate': 2.45e-05, 'epoch': 2.04}


 51%|█████▏    | 2565/5000 [01:09<01:04, 37.55it/s]

{'loss': 0.2963, 'learning_rate': 2.44e-05, 'epoch': 2.05}


 52%|█████▏    | 2575/5000 [01:10<01:00, 39.86it/s]

{'loss': 0.3149, 'learning_rate': 2.43e-05, 'epoch': 2.06}


 52%|█████▏    | 2585/5000 [01:10<00:58, 41.28it/s]

{'loss': 0.2629, 'learning_rate': 2.4200000000000002e-05, 'epoch': 2.06}


 52%|█████▏    | 2595/5000 [01:10<00:57, 41.91it/s]

{'loss': 0.2334, 'learning_rate': 2.41e-05, 'epoch': 2.07}


 52%|█████▏    | 2605/5000 [01:10<00:56, 42.58it/s]

{'loss': 0.2145, 'learning_rate': 2.4e-05, 'epoch': 2.08}


 52%|█████▏    | 2615/5000 [01:11<00:56, 42.58it/s]

{'loss': 0.2609, 'learning_rate': 2.39e-05, 'epoch': 2.09}


 52%|█████▎    | 2625/5000 [01:11<00:55, 42.71it/s]

{'loss': 0.2923, 'learning_rate': 2.38e-05, 'epoch': 2.1}


 53%|█████▎    | 2635/5000 [01:11<00:55, 42.82it/s]

{'loss': 0.1868, 'learning_rate': 2.37e-05, 'epoch': 2.1}


 53%|█████▎    | 2645/5000 [01:11<00:55, 42.45it/s]

{'loss': 0.2071, 'learning_rate': 2.36e-05, 'epoch': 2.11}


 53%|█████▎    | 2655/5000 [01:12<00:54, 42.67it/s]

{'loss': 0.1811, 'learning_rate': 2.35e-05, 'epoch': 2.12}


 53%|█████▎    | 2665/5000 [01:12<00:55, 42.44it/s]

{'loss': 0.2915, 'learning_rate': 2.3400000000000003e-05, 'epoch': 2.13}


 54%|█████▎    | 2675/5000 [01:12<00:54, 42.44it/s]

{'loss': 0.2538, 'learning_rate': 2.3300000000000004e-05, 'epoch': 2.14}


 54%|█████▎    | 2685/5000 [01:12<00:53, 43.28it/s]

{'loss': 0.1638, 'learning_rate': 2.32e-05, 'epoch': 2.14}


 54%|█████▍    | 2695/5000 [01:12<00:54, 42.61it/s]

{'loss': 0.2644, 'learning_rate': 2.3100000000000002e-05, 'epoch': 2.15}


 54%|█████▍    | 2705/5000 [01:13<00:54, 42.26it/s]

{'loss': 0.2991, 'learning_rate': 2.3000000000000003e-05, 'epoch': 2.16}


 54%|█████▍    | 2715/5000 [01:13<00:52, 43.64it/s]

{'loss': 0.3369, 'learning_rate': 2.29e-05, 'epoch': 2.17}


 55%|█████▍    | 2725/5000 [01:13<00:53, 42.67it/s]

{'loss': 0.2388, 'learning_rate': 2.2800000000000002e-05, 'epoch': 2.18}


 55%|█████▍    | 2735/5000 [01:13<00:52, 42.97it/s]

{'loss': 0.2312, 'learning_rate': 2.2700000000000003e-05, 'epoch': 2.18}


 55%|█████▍    | 2745/5000 [01:14<00:51, 43.38it/s]

{'loss': 0.3439, 'learning_rate': 2.26e-05, 'epoch': 2.19}


 55%|█████▌    | 2755/5000 [01:14<00:52, 43.07it/s]

{'loss': 0.3558, 'learning_rate': 2.25e-05, 'epoch': 2.2}


 55%|█████▌    | 2765/5000 [01:14<00:52, 42.49it/s]

{'loss': 0.2249, 'learning_rate': 2.2400000000000002e-05, 'epoch': 2.21}


 56%|█████▌    | 2775/5000 [01:14<00:51, 43.11it/s]

{'loss': 0.3361, 'learning_rate': 2.23e-05, 'epoch': 2.22}


 56%|█████▌    | 2785/5000 [01:15<00:51, 42.68it/s]

{'loss': 0.2501, 'learning_rate': 2.22e-05, 'epoch': 2.22}


 56%|█████▌    | 2795/5000 [01:15<00:51, 42.62it/s]

{'loss': 0.273, 'learning_rate': 2.2100000000000002e-05, 'epoch': 2.23}


 56%|█████▌    | 2805/5000 [01:15<00:50, 43.28it/s]

{'loss': 0.1888, 'learning_rate': 2.2000000000000003e-05, 'epoch': 2.24}


 56%|█████▋    | 2815/5000 [01:15<00:51, 42.78it/s]

{'loss': 0.275, 'learning_rate': 2.19e-05, 'epoch': 2.25}


 56%|█████▋    | 2825/5000 [01:16<00:50, 42.77it/s]

{'loss': 0.2127, 'learning_rate': 2.18e-05, 'epoch': 2.26}


 57%|█████▋    | 2835/5000 [01:16<00:50, 43.21it/s]

{'loss': 0.2395, 'learning_rate': 2.1700000000000002e-05, 'epoch': 2.26}


 57%|█████▋    | 2845/5000 [01:16<00:50, 42.85it/s]

{'loss': 0.1822, 'learning_rate': 2.16e-05, 'epoch': 2.27}


 57%|█████▋    | 2855/5000 [01:16<00:50, 42.82it/s]

{'loss': 0.2486, 'learning_rate': 2.15e-05, 'epoch': 2.28}


 57%|█████▋    | 2865/5000 [01:16<00:49, 43.09it/s]

{'loss': 0.2464, 'learning_rate': 2.1400000000000002e-05, 'epoch': 2.29}


 57%|█████▊    | 2875/5000 [01:17<00:50, 42.37it/s]

{'loss': 0.2546, 'learning_rate': 2.13e-05, 'epoch': 2.3}


 58%|█████▊    | 2885/5000 [01:17<00:49, 42.85it/s]

{'loss': 0.2685, 'learning_rate': 2.12e-05, 'epoch': 2.3}


 58%|█████▊    | 2895/5000 [01:17<00:49, 42.91it/s]

{'loss': 0.2326, 'learning_rate': 2.11e-05, 'epoch': 2.31}


 58%|█████▊    | 2905/5000 [01:17<00:49, 42.59it/s]

{'loss': 0.2759, 'learning_rate': 2.1e-05, 'epoch': 2.32}


 58%|█████▊    | 2915/5000 [01:18<00:48, 42.99it/s]

{'loss': 0.341, 'learning_rate': 2.09e-05, 'epoch': 2.33}


 58%|█████▊    | 2925/5000 [01:18<00:48, 42.78it/s]

{'loss': 0.244, 'learning_rate': 2.08e-05, 'epoch': 2.34}


 59%|█████▊    | 2935/5000 [01:18<00:48, 42.71it/s]

{'loss': 0.2321, 'learning_rate': 2.07e-05, 'epoch': 2.34}


 59%|█████▉    | 2945/5000 [01:18<00:47, 43.02it/s]

{'loss': 0.1676, 'learning_rate': 2.06e-05, 'epoch': 2.35}


 59%|█████▉    | 2955/5000 [01:19<00:47, 42.95it/s]

{'loss': 0.2947, 'learning_rate': 2.05e-05, 'epoch': 2.36}


 59%|█████▉    | 2965/5000 [01:19<00:47, 42.57it/s]

{'loss': 0.2397, 'learning_rate': 2.04e-05, 'epoch': 2.37}


 60%|█████▉    | 2975/5000 [01:19<00:47, 42.19it/s]

{'loss': 0.2662, 'learning_rate': 2.0300000000000002e-05, 'epoch': 2.38}


 60%|█████▉    | 2985/5000 [01:19<00:47, 42.63it/s]

{'loss': 0.3742, 'learning_rate': 2.0200000000000003e-05, 'epoch': 2.38}


 60%|█████▉    | 2995/5000 [01:19<00:46, 43.12it/s]

{'loss': 0.2855, 'learning_rate': 2.01e-05, 'epoch': 2.39}


 60%|██████    | 3005/5000 [01:20<00:46, 42.61it/s]

{'loss': 0.2114, 'learning_rate': 2e-05, 'epoch': 2.4}


 60%|██████    | 3015/5000 [01:20<00:47, 42.09it/s]

{'loss': 0.2767, 'learning_rate': 1.9900000000000003e-05, 'epoch': 2.41}


 60%|██████    | 3025/5000 [01:20<00:46, 42.53it/s]

{'loss': 0.2318, 'learning_rate': 1.9800000000000004e-05, 'epoch': 2.42}


 61%|██████    | 3035/5000 [01:20<00:46, 42.41it/s]

{'loss': 0.2089, 'learning_rate': 1.97e-05, 'epoch': 2.42}


 61%|██████    | 3045/5000 [01:21<00:45, 42.67it/s]

{'loss': 0.289, 'learning_rate': 1.9600000000000002e-05, 'epoch': 2.43}


 61%|██████    | 3055/5000 [01:21<00:46, 41.78it/s]

{'loss': 0.2348, 'learning_rate': 1.9500000000000003e-05, 'epoch': 2.44}


 61%|██████▏   | 3065/5000 [01:21<00:45, 42.25it/s]

{'loss': 0.2825, 'learning_rate': 1.94e-05, 'epoch': 2.45}


 62%|██████▏   | 3075/5000 [01:21<00:45, 42.50it/s]

{'loss': 0.2846, 'learning_rate': 1.93e-05, 'epoch': 2.46}


 62%|██████▏   | 3085/5000 [01:22<00:44, 42.56it/s]

{'loss': 0.2163, 'learning_rate': 1.9200000000000003e-05, 'epoch': 2.46}


 62%|██████▏   | 3095/5000 [01:22<00:44, 42.78it/s]

{'loss': 0.2425, 'learning_rate': 1.91e-05, 'epoch': 2.47}


 62%|██████▏   | 3105/5000 [01:22<00:44, 42.78it/s]

{'loss': 0.2719, 'learning_rate': 1.9e-05, 'epoch': 2.48}


 62%|██████▏   | 3115/5000 [01:22<00:45, 41.74it/s]

{'loss': 0.2054, 'learning_rate': 1.8900000000000002e-05, 'epoch': 2.49}


 62%|██████▎   | 3125/5000 [01:23<00:44, 42.61it/s]

{'loss': 0.2596, 'learning_rate': 1.88e-05, 'epoch': 2.5}


 63%|██████▎   | 3135/5000 [01:23<00:44, 42.32it/s]

{'loss': 0.2201, 'learning_rate': 1.87e-05, 'epoch': 2.5}


 63%|██████▎   | 3145/5000 [01:23<00:44, 42.15it/s]

{'loss': 0.3625, 'learning_rate': 1.86e-05, 'epoch': 2.51}


 63%|██████▎   | 3155/5000 [01:23<00:43, 42.16it/s]

{'loss': 0.3228, 'learning_rate': 1.85e-05, 'epoch': 2.52}


 63%|██████▎   | 3165/5000 [01:23<00:43, 42.42it/s]

{'loss': 0.2157, 'learning_rate': 1.84e-05, 'epoch': 2.53}


 64%|██████▎   | 3175/5000 [01:24<00:43, 41.97it/s]

{'loss': 0.2376, 'learning_rate': 1.83e-05, 'epoch': 2.54}


 64%|██████▎   | 3185/5000 [01:24<00:43, 41.96it/s]

{'loss': 0.204, 'learning_rate': 1.8200000000000002e-05, 'epoch': 2.54}


 64%|██████▍   | 3195/5000 [01:24<00:43, 41.36it/s]

{'loss': 0.282, 'learning_rate': 1.81e-05, 'epoch': 2.55}


 64%|██████▍   | 3205/5000 [01:24<00:42, 41.88it/s]

{'loss': 0.2621, 'learning_rate': 1.8e-05, 'epoch': 2.56}


 64%|██████▍   | 3215/5000 [01:25<00:42, 41.82it/s]

{'loss': 0.3022, 'learning_rate': 1.79e-05, 'epoch': 2.57}


 64%|██████▍   | 3225/5000 [01:25<00:42, 41.46it/s]

{'loss': 0.2876, 'learning_rate': 1.78e-05, 'epoch': 2.58}


 65%|██████▍   | 3235/5000 [01:25<00:42, 41.30it/s]

{'loss': 0.2798, 'learning_rate': 1.77e-05, 'epoch': 2.58}


 65%|██████▍   | 3245/5000 [01:25<00:42, 41.22it/s]

{'loss': 0.2433, 'learning_rate': 1.76e-05, 'epoch': 2.59}


 65%|██████▌   | 3255/5000 [01:26<00:41, 41.83it/s]

{'loss': 0.2466, 'learning_rate': 1.75e-05, 'epoch': 2.6}


 65%|██████▌   | 3265/5000 [01:26<00:41, 41.51it/s]

{'loss': 0.1991, 'learning_rate': 1.74e-05, 'epoch': 2.61}


 66%|██████▌   | 3275/5000 [01:26<00:41, 41.95it/s]

{'loss': 0.1777, 'learning_rate': 1.73e-05, 'epoch': 2.62}


 66%|██████▌   | 3285/5000 [01:26<00:40, 42.21it/s]

{'loss': 0.3315, 'learning_rate': 1.7199999999999998e-05, 'epoch': 2.62}


 66%|██████▌   | 3295/5000 [01:27<00:40, 42.12it/s]

{'loss': 0.2503, 'learning_rate': 1.7100000000000002e-05, 'epoch': 2.63}


 66%|██████▌   | 3305/5000 [01:27<00:40, 42.25it/s]

{'loss': 0.3221, 'learning_rate': 1.7000000000000003e-05, 'epoch': 2.64}


 66%|██████▋   | 3315/5000 [01:27<00:40, 41.80it/s]

{'loss': 0.1862, 'learning_rate': 1.69e-05, 'epoch': 2.65}


 66%|██████▋   | 3325/5000 [01:27<00:39, 42.17it/s]

{'loss': 0.2553, 'learning_rate': 1.6800000000000002e-05, 'epoch': 2.66}


 67%|██████▋   | 3335/5000 [01:28<00:39, 42.30it/s]

{'loss': 0.2121, 'learning_rate': 1.6700000000000003e-05, 'epoch': 2.66}


 67%|██████▋   | 3345/5000 [01:28<00:38, 42.62it/s]

{'loss': 0.3272, 'learning_rate': 1.66e-05, 'epoch': 2.67}


 67%|██████▋   | 3355/5000 [01:28<00:38, 42.46it/s]

{'loss': 0.2448, 'learning_rate': 1.65e-05, 'epoch': 2.68}


 67%|██████▋   | 3365/5000 [01:28<00:38, 42.82it/s]

{'loss': 0.2468, 'learning_rate': 1.6400000000000002e-05, 'epoch': 2.69}


 68%|██████▊   | 3375/5000 [01:28<00:38, 42.73it/s]

{'loss': 0.2771, 'learning_rate': 1.63e-05, 'epoch': 2.7}


 68%|██████▊   | 3385/5000 [01:29<00:37, 42.78it/s]

{'loss': 0.3253, 'learning_rate': 1.62e-05, 'epoch': 2.7}


 68%|██████▊   | 3395/5000 [01:29<00:37, 42.81it/s]

{'loss': 0.2437, 'learning_rate': 1.6100000000000002e-05, 'epoch': 2.71}


 68%|██████▊   | 3405/5000 [01:29<00:37, 42.52it/s]

{'loss': 0.2291, 'learning_rate': 1.6000000000000003e-05, 'epoch': 2.72}


 68%|██████▊   | 3415/5000 [01:29<00:37, 42.79it/s]

{'loss': 0.2802, 'learning_rate': 1.59e-05, 'epoch': 2.73}


 68%|██████▊   | 3425/5000 [01:30<00:36, 42.72it/s]

{'loss': 0.3531, 'learning_rate': 1.58e-05, 'epoch': 2.74}


 69%|██████▊   | 3435/5000 [01:30<00:37, 42.02it/s]

{'loss': 0.2686, 'learning_rate': 1.5700000000000002e-05, 'epoch': 2.74}


 69%|██████▉   | 3445/5000 [01:30<00:36, 42.23it/s]

{'loss': 0.3152, 'learning_rate': 1.56e-05, 'epoch': 2.75}


 69%|██████▉   | 3455/5000 [01:30<00:36, 42.30it/s]

{'loss': 0.28, 'learning_rate': 1.55e-05, 'epoch': 2.76}


 69%|██████▉   | 3465/5000 [01:31<00:36, 42.62it/s]

{'loss': 0.2401, 'learning_rate': 1.54e-05, 'epoch': 2.77}


 70%|██████▉   | 3475/5000 [01:31<00:35, 42.58it/s]

{'loss': 0.2276, 'learning_rate': 1.53e-05, 'epoch': 2.78}


 70%|██████▉   | 3485/5000 [01:31<00:36, 41.93it/s]

{'loss': 0.2739, 'learning_rate': 1.52e-05, 'epoch': 2.78}


 70%|██████▉   | 3495/5000 [01:31<00:36, 41.61it/s]

{'loss': 0.1761, 'learning_rate': 1.51e-05, 'epoch': 2.79}


 70%|███████   | 3505/5000 [01:32<00:35, 42.01it/s]

{'loss': 0.3144, 'learning_rate': 1.5e-05, 'epoch': 2.8}


 70%|███████   | 3515/5000 [01:32<00:36, 40.71it/s]

{'loss': 0.1855, 'learning_rate': 1.49e-05, 'epoch': 2.81}


 70%|███████   | 3525/5000 [01:32<00:35, 41.08it/s]

{'loss': 0.2267, 'learning_rate': 1.48e-05, 'epoch': 2.82}


 71%|███████   | 3535/5000 [01:32<00:35, 41.37it/s]

{'loss': 0.3074, 'learning_rate': 1.47e-05, 'epoch': 2.82}


 71%|███████   | 3545/5000 [01:33<00:34, 41.87it/s]

{'loss': 0.2881, 'learning_rate': 1.4599999999999999e-05, 'epoch': 2.83}


 71%|███████   | 3555/5000 [01:33<00:33, 42.67it/s]

{'loss': 0.263, 'learning_rate': 1.45e-05, 'epoch': 2.84}


 71%|███████▏  | 3565/5000 [01:33<00:34, 41.98it/s]

{'loss': 0.3685, 'learning_rate': 1.44e-05, 'epoch': 2.85}


 72%|███████▏  | 3575/5000 [01:33<00:33, 41.97it/s]

{'loss': 0.2639, 'learning_rate': 1.43e-05, 'epoch': 2.86}


 72%|███████▏  | 3585/5000 [01:33<00:33, 41.94it/s]

{'loss': 0.2055, 'learning_rate': 1.42e-05, 'epoch': 2.86}


 72%|███████▏  | 3595/5000 [01:34<00:33, 42.35it/s]

{'loss': 0.1754, 'learning_rate': 1.4099999999999999e-05, 'epoch': 2.87}


 72%|███████▏  | 3605/5000 [01:34<00:33, 42.20it/s]

{'loss': 0.2737, 'learning_rate': 1.4000000000000001e-05, 'epoch': 2.88}


 72%|███████▏  | 3615/5000 [01:34<00:32, 42.00it/s]

{'loss': 0.2091, 'learning_rate': 1.3900000000000002e-05, 'epoch': 2.89}


 72%|███████▎  | 3625/5000 [01:34<00:33, 41.53it/s]

{'loss': 0.2488, 'learning_rate': 1.3800000000000002e-05, 'epoch': 2.9}


 73%|███████▎  | 3635/5000 [01:35<00:32, 42.24it/s]

{'loss': 0.3494, 'learning_rate': 1.3700000000000001e-05, 'epoch': 2.9}


 73%|███████▎  | 3645/5000 [01:35<00:32, 41.81it/s]

{'loss': 0.2006, 'learning_rate': 1.3600000000000002e-05, 'epoch': 2.91}


 73%|███████▎  | 3655/5000 [01:35<00:32, 41.47it/s]

{'loss': 0.375, 'learning_rate': 1.3500000000000001e-05, 'epoch': 2.92}


 73%|███████▎  | 3665/5000 [01:35<00:31, 41.72it/s]

{'loss': 0.266, 'learning_rate': 1.3400000000000002e-05, 'epoch': 2.93}


 74%|███████▎  | 3675/5000 [01:36<00:31, 41.56it/s]

{'loss': 0.2902, 'learning_rate': 1.3300000000000001e-05, 'epoch': 2.94}


 74%|███████▎  | 3685/5000 [01:36<00:31, 41.91it/s]

{'loss': 0.2744, 'learning_rate': 1.32e-05, 'epoch': 2.94}


 74%|███████▍  | 3695/5000 [01:36<00:31, 41.37it/s]

{'loss': 0.2741, 'learning_rate': 1.3100000000000002e-05, 'epoch': 2.95}


 74%|███████▍  | 3705/5000 [01:36<00:31, 41.27it/s]

{'loss': 0.1933, 'learning_rate': 1.3000000000000001e-05, 'epoch': 2.96}


 74%|███████▍  | 3715/5000 [01:37<00:30, 41.94it/s]

{'loss': 0.292, 'learning_rate': 1.29e-05, 'epoch': 2.97}


 74%|███████▍  | 3725/5000 [01:37<00:30, 41.82it/s]

{'loss': 0.3325, 'learning_rate': 1.2800000000000001e-05, 'epoch': 2.98}


 75%|███████▍  | 3735/5000 [01:37<00:30, 41.67it/s]

{'loss': 0.2984, 'learning_rate': 1.27e-05, 'epoch': 2.98}


 75%|███████▍  | 3745/5000 [01:37<00:30, 41.61it/s]

{'loss': 0.2803, 'learning_rate': 1.2600000000000001e-05, 'epoch': 2.99}


 75%|███████▌  | 3750/5000 [01:37<00:30, 41.63it/s]The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16


{'loss': 0.2531, 'learning_rate': 1.25e-05, 'epoch': 3.0}


                                                   
 75%|███████▌  | 3750/5000 [01:41<00:30, 41.63it/s]Saving model checkpoint to checkpoints\checkpoint-3750
Configuration saved in checkpoints\checkpoint-3750\config.json
Model weights saved in checkpoints\checkpoint-3750\pytorch_model.bin
 75%|███████▌  | 3755/5000 [01:41<04:51,  4.27it/s]

{'eval_loss': 0.3155679702758789, 'eval_accuracy': 0.8766, 'eval_runtime': 3.4401, 'eval_samples_per_second': 1453.443, 'eval_steps_per_second': 90.986, 'epoch': 3.0}


 75%|███████▌  | 3765/5000 [01:41<02:36,  7.88it/s]

{'loss': 0.2736, 'learning_rate': 1.24e-05, 'epoch': 3.01}


 76%|███████▌  | 3775/5000 [01:42<01:31, 13.44it/s]

{'loss': 0.2483, 'learning_rate': 1.23e-05, 'epoch': 3.02}


 76%|███████▌  | 3785/5000 [01:42<00:59, 20.52it/s]

{'loss': 0.189, 'learning_rate': 1.22e-05, 'epoch': 3.02}


 76%|███████▌  | 3795/5000 [01:42<00:43, 27.44it/s]

{'loss': 0.3123, 'learning_rate': 1.2100000000000001e-05, 'epoch': 3.03}


 76%|███████▌  | 3805/5000 [01:42<00:36, 33.17it/s]

{'loss': 0.2358, 'learning_rate': 1.2e-05, 'epoch': 3.04}


 76%|███████▋  | 3815/5000 [01:42<00:32, 36.76it/s]

{'loss': 0.2162, 'learning_rate': 1.19e-05, 'epoch': 3.05}


 76%|███████▋  | 3825/5000 [01:43<00:29, 39.31it/s]

{'loss': 0.2103, 'learning_rate': 1.18e-05, 'epoch': 3.06}


 77%|███████▋  | 3835/5000 [01:43<00:28, 40.18it/s]

{'loss': 0.2487, 'learning_rate': 1.1700000000000001e-05, 'epoch': 3.06}


 77%|███████▋  | 3845/5000 [01:43<00:27, 41.47it/s]

{'loss': 0.2095, 'learning_rate': 1.16e-05, 'epoch': 3.07}


 77%|███████▋  | 3855/5000 [01:43<00:27, 41.34it/s]

{'loss': 0.1857, 'learning_rate': 1.1500000000000002e-05, 'epoch': 3.08}


 77%|███████▋  | 3865/5000 [01:44<00:27, 42.03it/s]

{'loss': 0.1969, 'learning_rate': 1.1400000000000001e-05, 'epoch': 3.09}


 78%|███████▊  | 3875/5000 [01:44<00:26, 42.18it/s]

{'loss': 0.2172, 'learning_rate': 1.13e-05, 'epoch': 3.1}


 78%|███████▊  | 3885/5000 [01:44<00:26, 42.22it/s]

{'loss': 0.1738, 'learning_rate': 1.1200000000000001e-05, 'epoch': 3.1}


 78%|███████▊  | 3895/5000 [01:44<00:26, 42.09it/s]

{'loss': 0.2025, 'learning_rate': 1.11e-05, 'epoch': 3.11}


 78%|███████▊  | 3905/5000 [01:45<00:26, 41.31it/s]

{'loss': 0.1844, 'learning_rate': 1.1000000000000001e-05, 'epoch': 3.12}


 78%|███████▊  | 3915/5000 [01:45<00:26, 41.58it/s]

{'loss': 0.1894, 'learning_rate': 1.09e-05, 'epoch': 3.13}


 78%|███████▊  | 3925/5000 [01:45<00:25, 41.89it/s]

{'loss': 0.1919, 'learning_rate': 1.08e-05, 'epoch': 3.14}


 79%|███████▊  | 3935/5000 [01:45<00:25, 42.31it/s]

{'loss': 0.2749, 'learning_rate': 1.0700000000000001e-05, 'epoch': 3.14}


 79%|███████▉  | 3945/5000 [01:46<00:24, 42.45it/s]

{'loss': 0.1838, 'learning_rate': 1.06e-05, 'epoch': 3.15}


 79%|███████▉  | 3955/5000 [01:46<00:24, 42.32it/s]

{'loss': 0.2463, 'learning_rate': 1.05e-05, 'epoch': 3.16}


 79%|███████▉  | 3965/5000 [01:46<00:24, 42.42it/s]

{'loss': 0.2819, 'learning_rate': 1.04e-05, 'epoch': 3.17}


 80%|███████▉  | 3975/5000 [01:46<00:24, 41.75it/s]

{'loss': 0.2501, 'learning_rate': 1.03e-05, 'epoch': 3.18}


 80%|███████▉  | 3985/5000 [01:47<00:24, 41.77it/s]

{'loss': 0.1819, 'learning_rate': 1.02e-05, 'epoch': 3.18}


 80%|███████▉  | 3995/5000 [01:47<00:24, 41.56it/s]

{'loss': 0.3006, 'learning_rate': 1.0100000000000002e-05, 'epoch': 3.19}


 80%|████████  | 4005/5000 [01:47<00:23, 41.51it/s]

{'loss': 0.1584, 'learning_rate': 1e-05, 'epoch': 3.2}


 80%|████████  | 4015/5000 [01:47<00:23, 41.63it/s]

{'loss': 0.2336, 'learning_rate': 9.900000000000002e-06, 'epoch': 3.21}


 80%|████████  | 4025/5000 [01:47<00:23, 41.82it/s]

{'loss': 0.1931, 'learning_rate': 9.800000000000001e-06, 'epoch': 3.22}


 81%|████████  | 4035/5000 [01:48<00:22, 42.11it/s]

{'loss': 0.2099, 'learning_rate': 9.7e-06, 'epoch': 3.22}


 81%|████████  | 4045/5000 [01:48<00:23, 41.11it/s]

{'loss': 0.2158, 'learning_rate': 9.600000000000001e-06, 'epoch': 3.23}


 81%|████████  | 4055/5000 [01:48<00:22, 41.26it/s]

{'loss': 0.2047, 'learning_rate': 9.5e-06, 'epoch': 3.24}


 81%|████████▏ | 4065/5000 [01:48<00:22, 41.68it/s]

{'loss': 0.1853, 'learning_rate': 9.4e-06, 'epoch': 3.25}


 82%|████████▏ | 4075/5000 [01:49<00:21, 42.56it/s]

{'loss': 0.168, 'learning_rate': 9.3e-06, 'epoch': 3.26}


 82%|████████▏ | 4085/5000 [01:49<00:21, 41.87it/s]

{'loss': 0.2385, 'learning_rate': 9.2e-06, 'epoch': 3.26}


 82%|████████▏ | 4095/5000 [01:49<00:21, 41.87it/s]

{'loss': 0.1496, 'learning_rate': 9.100000000000001e-06, 'epoch': 3.27}


 82%|████████▏ | 4105/5000 [01:49<00:21, 41.98it/s]

{'loss': 0.2381, 'learning_rate': 9e-06, 'epoch': 3.28}


 82%|████████▏ | 4115/5000 [01:50<00:21, 41.97it/s]

{'loss': 0.1879, 'learning_rate': 8.9e-06, 'epoch': 3.29}


 82%|████████▎ | 4125/5000 [01:50<00:20, 42.08it/s]

{'loss': 0.1882, 'learning_rate': 8.8e-06, 'epoch': 3.3}


 83%|████████▎ | 4135/5000 [01:50<00:20, 41.38it/s]

{'loss': 0.2382, 'learning_rate': 8.7e-06, 'epoch': 3.3}


 83%|████████▎ | 4145/5000 [01:50<00:20, 41.93it/s]

{'loss': 0.263, 'learning_rate': 8.599999999999999e-06, 'epoch': 3.31}


 83%|████████▎ | 4155/5000 [01:51<00:19, 42.72it/s]

{'loss': 0.2973, 'learning_rate': 8.500000000000002e-06, 'epoch': 3.32}


 83%|████████▎ | 4165/5000 [01:51<00:19, 42.06it/s]

{'loss': 0.2888, 'learning_rate': 8.400000000000001e-06, 'epoch': 3.33}


 84%|████████▎ | 4175/5000 [01:51<00:19, 41.93it/s]

{'loss': 0.2183, 'learning_rate': 8.3e-06, 'epoch': 3.34}


 84%|████████▎ | 4185/5000 [01:51<00:19, 42.21it/s]

{'loss': 0.2657, 'learning_rate': 8.200000000000001e-06, 'epoch': 3.34}


 84%|████████▍ | 4195/5000 [01:52<00:19, 42.29it/s]

{'loss': 0.2753, 'learning_rate': 8.1e-06, 'epoch': 3.35}


 84%|████████▍ | 4205/5000 [01:52<00:19, 41.72it/s]

{'loss': 0.2185, 'learning_rate': 8.000000000000001e-06, 'epoch': 3.36}


 84%|████████▍ | 4215/5000 [01:52<00:19, 41.28it/s]

{'loss': 0.2523, 'learning_rate': 7.9e-06, 'epoch': 3.37}


 84%|████████▍ | 4225/5000 [01:52<00:18, 41.77it/s]

{'loss': 0.2005, 'learning_rate': 7.8e-06, 'epoch': 3.38}


 85%|████████▍ | 4235/5000 [01:52<00:18, 41.65it/s]

{'loss': 0.2051, 'learning_rate': 7.7e-06, 'epoch': 3.38}


 85%|████████▍ | 4245/5000 [01:53<00:17, 41.97it/s]

{'loss': 0.2491, 'learning_rate': 7.6e-06, 'epoch': 3.39}


 85%|████████▌ | 4255/5000 [01:53<00:17, 42.09it/s]

{'loss': 0.2077, 'learning_rate': 7.5e-06, 'epoch': 3.4}


 85%|████████▌ | 4265/5000 [01:53<00:17, 42.07it/s]

{'loss': 0.2546, 'learning_rate': 7.4e-06, 'epoch': 3.41}


 86%|████████▌ | 4275/5000 [01:53<00:17, 42.40it/s]

{'loss': 0.169, 'learning_rate': 7.2999999999999996e-06, 'epoch': 3.42}


 86%|████████▌ | 4285/5000 [01:54<00:17, 41.94it/s]

{'loss': 0.2565, 'learning_rate': 7.2e-06, 'epoch': 3.42}


 86%|████████▌ | 4295/5000 [01:54<00:16, 41.99it/s]

{'loss': 0.154, 'learning_rate': 7.1e-06, 'epoch': 3.43}


 86%|████████▌ | 4305/5000 [01:54<00:16, 41.84it/s]

{'loss': 0.2746, 'learning_rate': 7.000000000000001e-06, 'epoch': 3.44}


 86%|████████▋ | 4315/5000 [01:54<00:16, 42.48it/s]

{'loss': 0.2453, 'learning_rate': 6.900000000000001e-06, 'epoch': 3.45}


 86%|████████▋ | 4325/5000 [01:55<00:16, 41.73it/s]

{'loss': 0.28, 'learning_rate': 6.800000000000001e-06, 'epoch': 3.46}


 87%|████████▋ | 4335/5000 [01:55<00:15, 41.86it/s]

{'loss': 0.2167, 'learning_rate': 6.700000000000001e-06, 'epoch': 3.46}


 87%|████████▋ | 4345/5000 [01:55<00:15, 41.92it/s]

{'loss': 0.2337, 'learning_rate': 6.6e-06, 'epoch': 3.47}


 87%|████████▋ | 4355/5000 [01:55<00:15, 41.91it/s]

{'loss': 0.1854, 'learning_rate': 6.5000000000000004e-06, 'epoch': 3.48}


 87%|████████▋ | 4365/5000 [01:56<00:15, 41.93it/s]

{'loss': 0.3066, 'learning_rate': 6.4000000000000006e-06, 'epoch': 3.49}


 88%|████████▊ | 4375/5000 [01:56<00:15, 41.28it/s]

{'loss': 0.1793, 'learning_rate': 6.300000000000001e-06, 'epoch': 3.5}


 88%|████████▊ | 4385/5000 [01:56<00:14, 41.74it/s]

{'loss': 0.2503, 'learning_rate': 6.2e-06, 'epoch': 3.5}


 88%|████████▊ | 4395/5000 [01:56<00:14, 41.61it/s]

{'loss': 0.2667, 'learning_rate': 6.1e-06, 'epoch': 3.51}


 88%|████████▊ | 4405/5000 [01:57<00:13, 42.50it/s]

{'loss': 0.3795, 'learning_rate': 6e-06, 'epoch': 3.52}


 88%|████████▊ | 4415/5000 [01:57<00:13, 41.92it/s]

{'loss': 0.2366, 'learning_rate': 5.9e-06, 'epoch': 3.53}


 88%|████████▊ | 4425/5000 [01:57<00:13, 42.85it/s]

{'loss': 0.2145, 'learning_rate': 5.8e-06, 'epoch': 3.54}


 89%|████████▊ | 4435/5000 [01:57<00:13, 41.86it/s]

{'loss': 0.1775, 'learning_rate': 5.7000000000000005e-06, 'epoch': 3.54}


 89%|████████▉ | 4445/5000 [01:57<00:13, 41.65it/s]

{'loss': 0.2393, 'learning_rate': 5.600000000000001e-06, 'epoch': 3.55}


 89%|████████▉ | 4455/5000 [01:58<00:13, 41.39it/s]

{'loss': 0.2822, 'learning_rate': 5.500000000000001e-06, 'epoch': 3.56}


 89%|████████▉ | 4465/5000 [01:58<00:12, 41.93it/s]

{'loss': 0.2561, 'learning_rate': 5.4e-06, 'epoch': 3.57}


 90%|████████▉ | 4475/5000 [01:58<00:12, 42.06it/s]

{'loss': 0.3006, 'learning_rate': 5.3e-06, 'epoch': 3.58}


 90%|████████▉ | 4485/5000 [01:58<00:12, 41.52it/s]

{'loss': 0.2568, 'learning_rate': 5.2e-06, 'epoch': 3.58}


 90%|████████▉ | 4495/5000 [01:59<00:12, 41.25it/s]

{'loss': 0.1848, 'learning_rate': 5.1e-06, 'epoch': 3.59}


 90%|█████████ | 4505/5000 [01:59<00:12, 40.91it/s]

{'loss': 0.2245, 'learning_rate': 5e-06, 'epoch': 3.6}


 90%|█████████ | 4515/5000 [01:59<00:11, 41.21it/s]

{'loss': 0.2692, 'learning_rate': 4.9000000000000005e-06, 'epoch': 3.61}


 90%|█████████ | 4525/5000 [01:59<00:11, 41.32it/s]

{'loss': 0.2456, 'learning_rate': 4.800000000000001e-06, 'epoch': 3.62}


 91%|█████████ | 4535/5000 [02:00<00:11, 40.75it/s]

{'loss': 0.194, 'learning_rate': 4.7e-06, 'epoch': 3.62}


 91%|█████████ | 4545/5000 [02:00<00:11, 41.18it/s]

{'loss': 0.1829, 'learning_rate': 4.6e-06, 'epoch': 3.63}


 91%|█████████ | 4555/5000 [02:00<00:10, 41.75it/s]

{'loss': 0.2865, 'learning_rate': 4.5e-06, 'epoch': 3.64}


 91%|█████████▏| 4565/5000 [02:00<00:10, 41.95it/s]

{'loss': 0.2802, 'learning_rate': 4.4e-06, 'epoch': 3.65}


 92%|█████████▏| 4575/5000 [02:01<00:10, 41.44it/s]

{'loss': 0.2069, 'learning_rate': 4.2999999999999995e-06, 'epoch': 3.66}


 92%|█████████▏| 4585/5000 [02:01<00:10, 41.49it/s]

{'loss': 0.1461, 'learning_rate': 4.2000000000000004e-06, 'epoch': 3.66}


 92%|█████████▏| 4595/5000 [02:01<00:09, 41.80it/s]

{'loss': 0.2635, 'learning_rate': 4.1000000000000006e-06, 'epoch': 3.67}


 92%|█████████▏| 4605/5000 [02:01<00:09, 42.05it/s]

{'loss': 0.1991, 'learning_rate': 4.000000000000001e-06, 'epoch': 3.68}


 92%|█████████▏| 4615/5000 [02:02<00:09, 40.43it/s]

{'loss': 0.186, 'learning_rate': 3.9e-06, 'epoch': 3.69}


 92%|█████████▎| 4625/5000 [02:02<00:09, 40.92it/s]

{'loss': 0.2484, 'learning_rate': 3.8e-06, 'epoch': 3.7}


 93%|█████████▎| 4635/5000 [02:02<00:08, 41.55it/s]

{'loss': 0.2638, 'learning_rate': 3.7e-06, 'epoch': 3.7}


 93%|█████████▎| 4645/5000 [02:02<00:08, 41.02it/s]

{'loss': 0.2331, 'learning_rate': 3.6e-06, 'epoch': 3.71}


 93%|█████████▎| 4655/5000 [02:03<00:08, 40.71it/s]

{'loss': 0.2108, 'learning_rate': 3.5000000000000004e-06, 'epoch': 3.72}


 93%|█████████▎| 4665/5000 [02:03<00:08, 41.23it/s]

{'loss': 0.2999, 'learning_rate': 3.4000000000000005e-06, 'epoch': 3.73}


 94%|█████████▎| 4675/5000 [02:03<00:07, 41.42it/s]

{'loss': 0.1319, 'learning_rate': 3.3e-06, 'epoch': 3.74}


 94%|█████████▎| 4685/5000 [02:03<00:07, 41.60it/s]

{'loss': 0.2746, 'learning_rate': 3.2000000000000003e-06, 'epoch': 3.74}


 94%|█████████▍| 4695/5000 [02:04<00:07, 40.46it/s]

{'loss': 0.1957, 'learning_rate': 3.1e-06, 'epoch': 3.75}


 94%|█████████▍| 4705/5000 [02:04<00:07, 41.55it/s]

{'loss': 0.1757, 'learning_rate': 3e-06, 'epoch': 3.76}


 94%|█████████▍| 4715/5000 [02:04<00:06, 41.47it/s]

{'loss': 0.1643, 'learning_rate': 2.9e-06, 'epoch': 3.77}


 94%|█████████▍| 4725/5000 [02:04<00:06, 41.76it/s]

{'loss': 0.2765, 'learning_rate': 2.8000000000000003e-06, 'epoch': 3.78}


 95%|█████████▍| 4735/5000 [02:04<00:06, 41.74it/s]

{'loss': 0.1657, 'learning_rate': 2.7e-06, 'epoch': 3.78}


 95%|█████████▍| 4745/5000 [02:05<00:06, 41.61it/s]

{'loss': 0.1781, 'learning_rate': 2.6e-06, 'epoch': 3.79}


 95%|█████████▌| 4755/5000 [02:05<00:05, 41.98it/s]

{'loss': 0.2728, 'learning_rate': 2.5e-06, 'epoch': 3.8}


 95%|█████████▌| 4765/5000 [02:05<00:05, 41.87it/s]

{'loss': 0.1938, 'learning_rate': 2.4000000000000003e-06, 'epoch': 3.81}


 96%|█████████▌| 4775/5000 [02:05<00:05, 41.69it/s]

{'loss': 0.1513, 'learning_rate': 2.3e-06, 'epoch': 3.82}


 96%|█████████▌| 4785/5000 [02:06<00:05, 41.26it/s]

{'loss': 0.2494, 'learning_rate': 2.2e-06, 'epoch': 3.82}


 96%|█████████▌| 4795/5000 [02:06<00:04, 41.56it/s]

{'loss': 0.2402, 'learning_rate': 2.1000000000000002e-06, 'epoch': 3.83}


 96%|█████████▌| 4805/5000 [02:06<00:04, 41.69it/s]

{'loss': 0.2278, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.84}


 96%|█████████▋| 4815/5000 [02:06<00:04, 41.46it/s]

{'loss': 0.1543, 'learning_rate': 1.9e-06, 'epoch': 3.85}


 96%|█████████▋| 4825/5000 [02:07<00:04, 41.78it/s]

{'loss': 0.1779, 'learning_rate': 1.8e-06, 'epoch': 3.86}


 97%|█████████▋| 4835/5000 [02:07<00:03, 41.70it/s]

{'loss': 0.2455, 'learning_rate': 1.7000000000000002e-06, 'epoch': 3.86}


 97%|█████████▋| 4845/5000 [02:07<00:03, 41.60it/s]

{'loss': 0.2502, 'learning_rate': 1.6000000000000001e-06, 'epoch': 3.87}


 97%|█████████▋| 4855/5000 [02:07<00:03, 41.45it/s]

{'loss': 0.3235, 'learning_rate': 1.5e-06, 'epoch': 3.88}


 97%|█████████▋| 4865/5000 [02:08<00:03, 41.65it/s]

{'loss': 0.2418, 'learning_rate': 1.4000000000000001e-06, 'epoch': 3.89}


 98%|█████████▊| 4875/5000 [02:08<00:02, 41.73it/s]

{'loss': 0.2653, 'learning_rate': 1.3e-06, 'epoch': 3.9}


 98%|█████████▊| 4885/5000 [02:08<00:02, 41.72it/s]

{'loss': 0.1887, 'learning_rate': 1.2000000000000002e-06, 'epoch': 3.9}


 98%|█████████▊| 4895/5000 [02:08<00:02, 42.22it/s]

{'loss': 0.275, 'learning_rate': 1.1e-06, 'epoch': 3.91}


 98%|█████████▊| 4905/5000 [02:09<00:02, 41.82it/s]

{'loss': 0.3044, 'learning_rate': 1.0000000000000002e-06, 'epoch': 3.92}


 98%|█████████▊| 4915/5000 [02:09<00:02, 41.74it/s]

{'loss': 0.276, 'learning_rate': 9e-07, 'epoch': 3.93}


 98%|█████████▊| 4925/5000 [02:09<00:01, 41.14it/s]

{'loss': 0.2747, 'learning_rate': 8.000000000000001e-07, 'epoch': 3.94}


 99%|█████████▊| 4935/5000 [02:09<00:01, 41.78it/s]

{'loss': 0.25, 'learning_rate': 7.000000000000001e-07, 'epoch': 3.94}


 99%|█████████▉| 4945/5000 [02:10<00:01, 41.58it/s]

{'loss': 0.2183, 'learning_rate': 6.000000000000001e-07, 'epoch': 3.95}


 99%|█████████▉| 4955/5000 [02:10<00:01, 41.63it/s]

{'loss': 0.3079, 'learning_rate': 5.000000000000001e-07, 'epoch': 3.96}


 99%|█████████▉| 4965/5000 [02:10<00:00, 41.56it/s]

{'loss': 0.2194, 'learning_rate': 4.0000000000000003e-07, 'epoch': 3.97}


100%|█████████▉| 4975/5000 [02:10<00:00, 42.10it/s]

{'loss': 0.2733, 'learning_rate': 3.0000000000000004e-07, 'epoch': 3.98}


100%|█████████▉| 4985/5000 [02:10<00:00, 42.05it/s]

{'loss': 0.1962, 'learning_rate': 2.0000000000000002e-07, 'epoch': 3.98}


100%|█████████▉| 4995/5000 [02:11<00:00, 41.70it/s]

{'loss': 0.2299, 'learning_rate': 1.0000000000000001e-07, 'epoch': 3.99}


100%|██████████| 5000/5000 [02:11<00:00, 41.62it/s]The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16


{'loss': 0.14, 'learning_rate': 0.0, 'epoch': 4.0}


                                                   
100%|██████████| 5000/5000 [02:14<00:00, 41.62it/s]Saving model checkpoint to checkpoints\checkpoint-5000
Configuration saved in checkpoints\checkpoint-5000\config.json
Model weights saved in checkpoints\checkpoint-5000\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from checkpoints\checkpoint-3750 (score: 0.3155679702758789).
100%|██████████| 5000/5000 [02:14<00:00, 37.09it/s]

{'eval_loss': 0.33197230100631714, 'eval_accuracy': 0.8776, 'eval_runtime': 3.4175, 'eval_samples_per_second': 1463.055, 'eval_steps_per_second': 91.587, 'epoch': 4.0}
{'train_runtime': 134.8039, 'train_samples_per_second': 593.455, 'train_steps_per_second': 37.091, 'train_loss': 0.3134208901166916, 'epoch': 4.0}





TrainOutput(global_step=5000, training_loss=0.3134208901166916, metrics={'train_runtime': 134.8039, 'train_samples_per_second': 593.455, 'train_steps_per_second': 37.091, 'train_loss': 0.3134208901166916, 'epoch': 4.0})

In [2]:
# Change this to whichever checkpoint you want to evaluate
eval_checkpoint_directory = "./checkpoints/run-1/checkpoint-10000"

# Creates a Trainer to test a Hugging Face saved model
tester = init_tester(eval_checkpoint_directory)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file ./checkpoints/run-1/checkpoint-10000\config.json
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_ve

In [3]:
results = tester.predict(imdb["test"])  # Run model inference
print(results)  # Show the full prediction results
print("Test Accuracy:", results.metrics["test_accuracy"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 25000
  Batch size = 8
100%|██████████| 3125/3125 [00:19<00:00, 163.13it/s]

PredictionOutput(predictions=array([[ 3.5452178, -2.722677 ],
       [ 3.5177026, -2.6854327],
       [ 3.5448737, -2.7231743],
       ...,
       [ 2.499429 , -1.9069178],
       [ 3.5228763, -2.6918516],
       [ 2.7628484, -2.0171015]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 1]), metrics={'test_loss': 0.726195216178894, 'test_accuracy': 0.86272, 'test_runtime': 19.192, 'test_samples_per_second': 1302.627, 'test_steps_per_second': 162.828})
Test Accuracy: 0.86272





Your `init_trainer` function needs to support the following.
- The training configuration (total number of epochs, early stopping criteria if any) must match your answer for Problem 1c.
- Your `Trainer` needs to save the model obtained during each training run to a folder called `checkpoints`.
- You should leave the `model` keyword parameter blank and instead pass an argument to the `model_init` keyword parameter.
- It should evaluate models based on accuracy.
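
The `model_init` requirement exists because `Trainer.hyperparameter_search` needs to build a fresh model for every trial, so it must receive a function that creates a model rather than a model object. The following is a minimal sketch of that factory pattern only. The names `build_model` and `make_model_init` are hypothetical, and `build_model` returns a plain dictionary as a stand-in for the real model loading code.

```python
def build_model(use_bitfit):
    """Hypothetical stand-in for the code that loads BERT-tiny.

    A real version would load the pre-trained model and, when use_bitfit
    is True, freeze every parameter that is not a bias term.
    """
    return {"name": "prajjwal1/bert-tiny", "use_bitfit": use_bitfit}


def make_model_init(use_bitfit):
    """Return a zero-argument factory that captures use_bitfit."""
    return lambda: build_model(use_bitfit)


model_init = make_model_init(use_bitfit=True)

# Each call produces a new, independent object, which is what a
# hyperparameter search needs at the start of every trial.
first, second = model_init(), model_init()
print(first == second, first is second)  # True False
```

A factory like this would be passed as `Trainer(model_init=model_init, ...)` with the `model` argument left unset.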

Your `init_tester` function needs to support the following.
- The `Trainer` should only support testing and not training.
- It should evaluate models based on accuracy.


Please use the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training) as well as [this forum post](https://discuss.huggingface.co/t/using-trainer-at-inference-time/9378/3) for guidance. You may need to create new functions for this problem, and you may find it useful to learn about [lambda expressions](https://realpython.com/python-lambda/) if you don't know about them already.
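
As a rough illustration of what "evaluate models based on accuracy" involves, the sketch below computes accuracy from logits and labels in the `(predictions, label_ids)` form that a `Trainer` passes to its `compute_metrics` callback. It uses NumPy directly, which is an assumption made to keep the example self-contained; the 🤗 Evaluate library installed at the top of this notebook offers an equivalent metric. The function name `compute_accuracy` and the example values are illustrative.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Return accuracy for a (logits, labels) pair.

    logits has shape (num_examples, num_classes); the predicted class
    is the index of the largest logit in each row.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Illustrative values: four examples, two classes.
logits = np.array([[3.5, -2.7], [2.5, -1.9], [-1.0, 2.0], [0.2, 0.1]])
labels = np.array([0, 1, 1, 0])
metrics = compute_accuracy((logits, labels))
print(metrics)  # {'accuracy': 0.75}
```

The `Trainer` prefixes the returned key with the evaluation split name, which is why the logs above report `eval_accuracy` and `test_accuracy`.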

### Problem 2c: Set Up Hyperparameter Tuning (Code, 20 Points)

Finally, to complete the experiment setup, you will implement hyperparameter tuning using the [Optuna](https://optuna.org/) framework. Optuna is integrated with 🤗 Transformers, and it can be invoked via the `Trainer.hyperparameter_search` method. Please implement the function `hyperparameter_search_settings` in `train_model.py` by returning the correct keyword arguments for `Trainer.hyperparameter_search`. (Observe that, at the end of `train_model.py`, these keyword arguments are passed to `Trainer.hyperparameter_search` via dictionary unpacking.)

Your code should support the following requirements.
- Your hyperparameter tuning configuration must match your answer for Problem 1c.
- You must use Optuna for hyperparameter tuning.
- You must indicate to Optuna that the hyperparameter search should maximize accuracy.

Please use the following resources for guidance.
- [The Hugging Face tutorial on hyperparameter tuning](https://huggingface.co/docs/transformers/hpo_train)
- [The documentation for `Trainer.hyperparameter_search`](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/trainer#transformers.Trainer.hyperparameter_search)
- [The documentation for Optuna's `GridSampler`](https://optuna.readthedocs.io/en/v2.0.0/reference/generated/optuna.samplers.GridSampler.html)
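
The requirements above can be sketched roughly as follows. This is not a reference solution: the grids `LEARNING_RATES` and `BATCH_SIZES` are placeholder values, not the configuration from Problem 1c, and the names `hp_space`, `search_settings_sketch`, and `FirstChoiceTrial` are hypothetical. The stub trial class exists only so the sketch can be checked without installing Optuna.

```python
# Placeholder grids (assumption): replace with your Problem 1c configuration.
LEARNING_RATES = [1e-4, 3e-4]
BATCH_SIZES = [8, 128]


def hp_space(trial):
    """Map an Optuna trial to TrainingArguments field names."""
    return {
        "learning_rate": trial.suggest_categorical(
            "learning_rate", LEARNING_RATES
        ),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", BATCH_SIZES
        ),
    }


def search_settings_sketch():
    """Keyword arguments in the shape Trainer.hyperparameter_search accepts."""
    return {
        "direction": "maximize",  # higher accuracy is better
        "backend": "optuna",
        "hp_space": hp_space,
        "n_trials": len(LEARNING_RATES) * len(BATCH_SIZES),
        # Evaluation metrics are prefixed with "eval_" by the Trainer.
        "compute_objective": lambda metrics: metrics["eval_accuracy"],
    }


class FirstChoiceTrial:
    """Stand-in for an Optuna trial that always picks the first option."""

    def suggest_categorical(self, name, choices):
        return choices[0]


settings = search_settings_sketch()
sampled = settings["hp_space"](FirstChoiceTrial())
print(sampled)
```

To make the search an exhaustive grid rather than a random sample, a `sampler` entry holding an `optuna.samplers.GridSampler` built over the same search space can be added to the dictionary; according to the `Trainer.hyperparameter_search` documentation, extra keyword arguments are passed along to `optuna.create_study`.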

## Problem 3: Run Experiment (20 Points in Total)

To complete the assignment, you will now run your code and report on the results. It is recommended that you run your code on [Google Colaboratory](https://colab.research.google.com/) using a free GPU.

### Problem 3a: Train Models (Code and Written, 10 Points)

Please now run `train_model.py` as a Python script to carry out the following experimental procedure:
- first, fine-tune a BERT$_{\text{tiny}}$ model on the IMDb dataset _with_ BitFit;
- then, fine-tune a BERT$_{\text{tiny}}$ model on the IMDb dataset _without_ BitFit.

The `train_model.py` script should create a Pickle object containing information about the best hyperparameters found during hyperparameter tuning. Please submit this object, using the filenames `train_results_with_bitfit.p` and `train_results_without_bitfit.p` for your two training runs, respectively. Please also report the highest validation accuracy attained in each of your two training runs, as well as the hyperparameters used in those trials. Please format these results as a table such as the following.


| | Validation Accuracy | Learning Rate | Batch Size |
|---|---|---|---|
| Without BitFit | 0.8878 | 3e-4 | 128 |
| With BitFit | 0.6330 | 3e-4 | 8 |
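
If it helps when filling in the table, a saved results file can be read back with the standard `pickle` module. The sketch below round-trips a stand-in object through a temporary file and formats one table row. The dictionary layout (`objective`, `hyperparameters`) is an assumption; the object actually written by `train_model.py` may be structured differently. The helper names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the object saved by train_model.py (layout is an assumption).
best_run = {
    "objective": 0.8878,
    "hyperparameters": {
        "learning_rate": 3e-4,
        "per_device_train_batch_size": 128,
    },
}


def save_results(obj, path):
    """Write obj to path as a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_results(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


def table_row(label, run):
    """Format one row of the validation results table."""
    hp = run["hyperparameters"]
    return (
        f"| {label} | {run['objective']:.4f} "
        f"| {hp['learning_rate']:g} | {hp['per_device_train_batch_size']} |"
    )


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_results_without_bitfit.p"
    save_results(best_run, path)
    restored = load_results(path)

row = table_row("Without BitFit", restored)
print(row)
```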

### Problem 3b: Test Models and Report Results (Code and Written, 10 Points)

For each of your two training runs, please test the model that attained the best validation accuracy across all hyperparameter tuning trials. You may do so by running the `test_model.py` script. Once testing is complete, please report your results in the form of a table such as the following.

| | # Trainable Parameters | Test Accuracy |
|---|---|---|
| Without BitFit | 4386178 | 0.87432 |
| With BitFit | 3074 | 0.63416 |
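
The parameter counts in this table can be sanity-checked by hand from the model configuration printed earlier (`hidden_size` 128, `intermediate_size` 512, 2 hidden layers, 512 positions, 2 labels). The sketch below does that arithmetic in plain Python. The vocabulary size of 30522 and the token type vocabulary size of 2 are assumptions, since those fields are cut off in the configuration output above. The function names are illustrative.

```python
def linear(n_in, n_out):
    """(weight count, bias count) of a fully connected layer."""
    return n_in * n_out, n_out


def layer_norm(size):
    """(weight count, bias count) of a LayerNorm module."""
    return size, size


def count_parameters(hidden=128, intermediate=512, num_layers=2,
                     max_positions=512, vocab_size=30522,
                     type_vocab_size=2, num_labels=2):
    """Return (total parameters, bias parameters) for a BERT classifier."""
    # Word, position, and token type embedding tables have no bias terms.
    embedding_tables = (vocab_size + max_positions + type_vocab_size) * hidden
    encoder_layer = [
        linear(hidden, hidden),        # attention query
        linear(hidden, hidden),        # attention key
        linear(hidden, hidden),        # attention value
        linear(hidden, hidden),        # attention output projection
        layer_norm(hidden),
        linear(hidden, intermediate),  # feed-forward expansion
        linear(intermediate, hidden),  # feed-forward projection
        layer_norm(hidden),
    ]
    modules = (
        [layer_norm(hidden)]           # LayerNorm after the embeddings
        + encoder_layer * num_layers
        + [linear(hidden, hidden)]     # pooler
        + [linear(hidden, num_labels)] # classification head
    )
    weights = embedding_tables + sum(w for w, _ in modules)
    biases = sum(b for _, b in modules)
    return weights + biases, biases


total, biases = count_parameters()
print(total, biases)  # 4386178 3074
```

Under these assumptions, the second number is consistent with a BitFit setup in which every parameter named `bias` is trained, including the LayerNorm and classification head biases. In PyTorch, the same quantity is typically measured by summing `p.numel()` over the parameters with `requires_grad` set to `True`.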

The `test_model.py` script should create a Pickle object containing information about test results. Please submit this object, using the filenames `test_results_with_bitfit.p` and `test_results_without_bitfit.p` for your two tests.

Finally, please comment on your results. How do they compare to the results reported by Zaken et al. (2020)? What does this say about BitFit and its applicability to other pre-trained Transformers?