# Background

Houlsby et al. (2019) showed that adapters at lower layers have less impact than those at higher layers (see Figure 6 in the paper). In their experiment, the adapters were removed from various layers and the resulting drop in accuracy on the validation data was measured.

This finding is in line with the popular fine-tuning strategy of focusing on upper layers. One intuition is that the lower layers extract lower-level features that are shared among tasks, while the higher layers build features that are unique to different tasks.

# Problem statement

We will address three questions:

1) Can we get results comparable to the state of the art (SoTA) with fewer parameters than a fixed adapter size across all layers, by using a smaller adapter size (i.e. the number of units in the bottleneck) at lower layers and a larger size at higher layers?

2) Can we improve on SoTA results with approximately the same number of parameters as the fixed-size approach, by using a larger size at higher layers and a smaller size at lower layers?

3) How do different adapter configurations (non-linearity, etc.) affect performance?

We will investigate only the Houlsby task adapter architecture (Houlsby et al., 2019), not the language adapter architecture (Pfeiffer et al., 2020).
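
To make the parameter budget behind questions 1 and 2 concrete, the sketch below estimates the number of adapter parameters for a given bottleneck size per layer. It assumes the per-adapter count of 2md + d + m from Houlsby et al. (2019) (down- and up-projection with biases, hidden size d, bottleneck size m), two adapters per Transformer layer, and BERT-base dimensions (d = 768, 12 layers). Layer-norm and classification-head parameters are ignored, and the function names are illustrative only.

```python
from typing import Sequence

HIDDEN_SIZE = 768        # BERT-base hidden size (d)
NUM_LAYERS = 12          # BERT-base Transformer layers
ADAPTERS_PER_LAYER = 2   # Houlsby: one adapter after attention, one after feed-forward


def adapter_params(bottleneck: int, hidden: int = HIDDEN_SIZE) -> int:
    """Parameters of one bottleneck adapter: 2*m*d weights plus d + m biases."""
    return 2 * bottleneck * hidden + hidden + bottleneck


def total_adapter_params(bottleneck_per_layer: Sequence[int]) -> int:
    """Sum adapter parameters over all layers, given one bottleneck size per layer."""
    return sum(ADAPTERS_PER_LAYER * adapter_params(m) for m in bottleneck_per_layer)


# Fixed size 48 everywhere vs. size 24 in the four lowest layers and 48 elsewhere.
fixed = total_adapter_params([48] * NUM_LAYERS)
variable = total_adapter_params([24] * 4 + [48] * 8)
print(fixed, variable, f"{1 - variable / fixed:.1%} fewer")
```

Under these assumptions, halving the bottleneck in the four lowest layers removes roughly one sixth of the adapter parameters.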

# Approach

We will take the following approach:

1) We will investigate the sentiment analysis task on the SST-2 dataset.

2) We will use the pre-trained adapter https://adapterhub.ml/adapters/ukp/bert-base-uncased_sentiment_sst-2_houlsby/ as the baseline for our performance measurements.

3) We will experiment with different adapter sizes and compare their performance against the baseline.

4) We will experiment with various adapter configurations and compare their performance against the baseline.

# Pre-trained Houlsby adapter sentiment/sst-2@ukp

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelWithHeads, AdapterConfig, pipeline, Trainer, TrainingArguments, DataCollatorWithPadding, EarlyStoppingCallback
import transformers.adapters.composition as ac
from datasets import load_dataset, load_metric
import math
import numpy as np

In [2]:
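#BERT_LOCAL_PATH='./bert-base-uncased/'
# NOTE: this path points to a cased checkpoint, while the sentiment/sst-2@ukp adapter was trained for bert-base-uncased.
BERT_LOCAL_PATH='D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased'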

In [19]:
#model = AutoModelWithHeads.from_pretrained("bert-base-uncased", num_labels=2)
model = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)
config = AdapterConfig.load("houlsby")
print(config)
adapter_name = model.load_adapter("sentiment/sst-2@ukp", config=config)
model.set_active_adapters(adapter_name)

Some weights of the model checkpoint at D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased were not used when initializing BertModelWithHeads: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


AdapterConfig(original_ln_before=False, original_ln_after=True, residual_before_ln=True, adapter_residual_before_ln=False, ln_before=False, ln_after=False, mh_adapter=True, output_adapter=True, non_linearity='swish', reduction_factor=16, inv_adapter=None, inv_adapter_reduction_factor=None, cross_adapter=False, leave_out=[])


ValueError: Unable to resolve adapter without the name of a model. Please specify model_name.

In [3]:
dataset = load_dataset("sst", "default")

Reusing dataset sst (C:\Users\sawro\.cache\huggingface\datasets\sst\default\1.0.0\b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


In [10]:
print("Test size:", dataset["test"].num_rows)
# Each complete sentence is annotated with a float label that indicates its level of positive sentiment from 0.0 to 1.0.
# We can transform the above into a binary sentiment classification task by rounding each label to 0 or 1.
print("Sample test data:", dataset["test"][0])

print("Train size:", dataset["train"].num_rows)
print("Sample train data:", dataset["train"][0])

print("valid size:", dataset["validation"].num_rows)

Test size: 2210
Sample test data: {'sentence': 'Effective but too-tepid biopic', 'label': 0.5138900279998779, 'tokens': 'Effective|but|too-tepid|biopic', 'tree': '6|6|5|5|7|7|0'}
Train size: 8544
Sample train data: {'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'label': 0.6944400072097778, 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.", 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0'}
valid size: 1101
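
The comments in the cell above describe turning the float label into a binary label by rounding. A minimal sketch of that mapping follows, using the two sample labels printed above plus two made-up values; the helper name `to_binary_label` is hypothetical. Note that Python 3's `round` rounds halves to the nearest even integer, so a label of exactly 0.5 maps to 0 under this scheme.

```python
def to_binary_label(score: float) -> int:
    """Map a sentiment score in [0.0, 1.0] to 0 (negative) or 1 (positive) by rounding."""
    return int(round(score))


# First two values are the sample test/train labels shown above; the rest are illustrative.
samples = [0.5138900279998779, 0.6944400072097778, 0.25, 0.5]
binary = [to_binary_label(s) for s in samples]
print(binary)
```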


In [4]:
#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)

In [24]:
sentiment_analysis = pipeline(task="sentiment-analysis", model=model, tokenizer=tokenizer)

# LABEL_0 = negative; LABEL_1 = positive (the adapter's head_config.json has no label2id, so the labels are auto-populated as LABEL_0 and LABEL_1)
output_labels = {0: "LABEL_0", 1: "LABEL_1"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[int(round(dataset["test"][i]['label']))]
    result = sentiment_analysis(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("sentiment/sst-2@ukp pre-trained adapter accuracy on SST-2 test data: ", accuracy)

sentiment/sst-2@ukp pre-trained adapter accuracy on SST-2 test data:  0.8705882352941177
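
The loop above computes accuracy as the fraction of test sentences whose predicted label matches the rounded ground-truth label. The same calculation can be sketched as a small standalone function; the name `accuracy` and the toy label lists are illustrative and are not taken from the notebook's data.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy example: three of four predictions match.
preds = ["LABEL_0", "LABEL_1", "LABEL_1", "LABEL_0"]
refs = ["LABEL_0", "LABEL_1", "LABEL_0", "LABEL_0"]
print(accuracy(preds, refs))
```

On the 2210 test sentences, the reported accuracy of 0.8705882352941177 corresponds to 1924 correct predictions.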


In [25]:
sentiment_analysis = pipeline(task="sentiment-analysis", model=model, tokenizer=tokenizer)
sentiment_analysis(dataset["test"][0]['sentence'])

[{'label': 'LABEL_0', 'score': 0.7958005666732788}]

# Baseline: Training a Houlsby adapter on the SST-2 dataset

This baseline adapter, which uses the Houlsby architecture, is trained on the binary SST task.

In [9]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

config2 = AdapterConfig.load("houlsby")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2",
    config=config2
)

# Enable adapter training.
# The most crucial step when training an adapter module is to freeze all weights in the model except those of the adapter.
# Calling train_adapter() disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2"])
model2.set_active_adapters("sst-2")
#print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label must be formatted as long (CrossEntropyLoss expects long targets); otherwise it stays float even with the typecast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\sawro/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-base-uncased/re

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1036, 1036, 16608, 1005, 1005, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.678 | 0.643679 | 0.585831 |
| 2 | 0.5798 | 0.466579 | 0.79564 |
| 3 | 0.4265 | 0.40569 | 0.811989 |
| 4 | 0.4048 | 0.390768 | 0.828338 |
| 5 | 0.3993 | 0.409879 | 0.823797 |
| 6 | 0.3908 | 0.391416 | 0.833787 |
| 7 | 0.3836 | 0.381402 | 0.839237 |
| 8 | 0.3795 | 0.392848 | 0.840145 |
| 9 | 0.3785 | 0.382502 | 0.841054 |
| 10 | 0.3744 | 0.385497 | 0.84287 |


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests\checkpoint-267
Configuration saved in ./tests\checkpoint-267\sst-2\adapter_config.json
Module weights saved in ./tests\checkpoint-267\sst-2\pytorch_adapter.bin
Configuration saved in ./tests\checkpoint-267\sst-2\head_config.json
Module weights saved in ./tests\checkpoint-267\sst-2\pytorch_model_head.bin
Configuration saved in ./tests\checkpoint-267\sst-2\head_config.json
Module weights saved in ./tests\checkpoint-267\sst-2\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests\checkpoint-534
Con
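
Training stopped after epoch 10 of the 20 configured epochs, which is consistent with `EarlyStoppingCallback(early_stopping_patience=3)` monitoring `eval_loss`. The sketch below is a simplified re-implementation of the patience rule, not the library's code. It assumes that any strict decrease in validation loss counts as an improvement, and it is fed the validation losses reported in the training log above. The function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def early_stop_epoch(losses: Sequence[float], patience: int) -> Tuple[Optional[int], int]:
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0  # consecutive evaluations without improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation losses for epochs 1-10 of the baseline run.
val_losses = [0.643679, 0.466579, 0.40569, 0.390768, 0.409879,
              0.391416, 0.381402, 0.392848, 0.382502, 0.385497]
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

With `load_best_model_at_end=True`, the checkpoint restored at the end would then be the one from the best epoch rather than the last one.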

In [12]:
""" Evaluate the above pre-trained baseline adapter on SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

model2.add_classification_head(
    'sst-2',
    num_labels=1
)
config = AdapterConfig.load("./tests/sst-2/adapter_config.json")
adapter_name = model2.load_adapter("./tests/sst-2", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[int(round(dataset["test"][i]['label']))]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("Baseline adapter accuracy on SST-2 test data: ", accuracy)

Baseline adapter accuracy on SST-2 test data:  0.8384615384615385


In [20]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

config2 = AdapterConfig.load("houlsby", reduction_factor=12)
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-1",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-1",
    config=config2
)

# Enable adapter training.
# The most crucial step when training an adapter module is to freeze all weights in the model except those of the adapter.
# Calling train_adapter() disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-1"])
model2.set_active_adapters("sst-2-1")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label must be formatted as long (CrossEntropyLoss expects long targets); otherwise it stays float even with the typecast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-1", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.676 | 0.633846 | 0.573115 |
| 2 | 0.5314 | 0.414808 | 0.81653 |
| 3 | 0.4114 | 0.398112 | 0.82743 |
| 4 | 0.3974 | 0.387888 | 0.834696 |
| 5 | 0.3936 | 0.408601 | 0.826521 |
| 6 | 0.384 | 0.390064 | 0.839237 |
| 7 | 0.377 | 0.380815 | 0.839237 |
| 8 | 0.3733 | 0.391442 | 0.838329 |
| 9 | 0.3737 | 0.381776 | 0.847411 |
| 10 | 0.3667 | 0.385589 | 0.841962 |


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-1\checkpoint-267
Configuration saved in ./tests-1\checkpoint-267\sst-2-1\adapter_config.json
Module weights saved in ./tests-1\checkpoint-267\sst-2-1\pytorch_adapter.bin
Configuration saved in ./tests-1\checkpoint-267\sst-2-1\head_config.json
Module weights saved in ./tests-1\checkpoint-267\sst-2-1\pytorch_model_head.bin
Configuration saved in ./tests-1\checkpoint-267\sst-2-1\head_config.json
Module weights saved in ./tests-1\checkpoint-267\sst-2-1\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 
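
The training log reports 5340 total optimization steps and saves checkpoints at multiples of 267. The arithmetic behind those numbers can be sketched as follows, assuming a single device and no gradient accumulation (as the log indicates); the function name is hypothetical.

```python
import math
from typing import Tuple


def optimization_steps(num_examples: int, batch_size: int, epochs: int) -> Tuple[int, int]:
    """Return (steps_per_epoch, total_steps) for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 8544 training examples, batch size 32, 20 configured epochs.
steps_per_epoch, total_steps = optimization_steps(8544, 32, 20)
print(steps_per_epoch, total_steps)
```

Because one checkpoint is written per epoch, `checkpoint-267` and `checkpoint-534` correspond to the end of epochs 1 and 2.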

In [1]:
""" Evaluate the adapter trained above (reduction_factor=12) on SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

model2.add_classification_head(
    'sst-2',
    num_labels=1
)
config = AdapterConfig.load("./tests-1/sst-2-1/adapter_config.json")
adapter_name = model2.load_adapter("./tests-1/sst-2-1", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[int(round(dataset["test"][i]['label']))]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
# Bottleneck size = hidden_size / reduction_factor = 768 / 12 = 64
print("64-sized adapter accuracy on SST-2 test data: ", accuracy)

NameError: name 'AutoTokenizer' is not defined

In [23]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes: larger adapters for higher layers.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers
# or a mapping specifying the reduction factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
# Layers 0-3: adapter size 24 (768 / 32); all other layers: adapter size 48 (768 / 16)
config2 = AdapterConfig.load("houlsby", reduction_factor={'0':32, '1':32,'2':32,'3':32,'default':16})
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-2",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-2",
    config=config2
)

# Enable adapter training.
# The most crucial step when training an adapter module is to freeze all weights in the model except those of the adapter.
# Calling train_adapter() disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-2"])
model2.set_active_adapters("sst-2-2")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label must be formatted as long (CrossEntropyLoss expects long targets); otherwise it stays float even with the typecast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-2", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

  0%|          | 0/9 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1036, 1036, 16608, 1005, 1005, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6854,0.656857,0.541326
2,0.6159,0.511956,0.788374
3,0.445,0.409374,0.811989
4,0.4118,0.393737,0.822888
5,0.4071,0.414343,0.813806
6,0.3949,0.39445,0.823797
7,0.3881,0.383094,0.831063
8,0.3833,0.394562,0.830154
9,0.3831,0.385316,0.84287
10,0.3782,0.38742,0.839237


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-2\checkpoint-267
Configuration saved in ./tests-2\checkpoint-267\sst-2-2\adapter_config.json
Module weights saved in ./tests-2\checkpoint-267\sst-2-2\pytorch_adapter.bin
Configuration saved in ./tests-2\checkpoint-267\sst-2-2\head_config.json
Module weights saved in ./tests-2\checkpoint-267\sst-2-2\pytorch_model_head.bin
Configuration saved in ./tests-2\checkpoint-267\sst-2-2\head_config.json
Module weights saved in ./tests-2\checkpoint-267\sst-2-2\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 
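
The run above was configured for 20 epochs but logged only 10, which is consistent with `EarlyStoppingCallback(early_stopping_patience = 3)` monitoring `eval_loss`. The sketch below replays the logged validation losses through a simplified version of that rule. It is an illustration, not the library's implementation: `replay_early_stopping` is a hypothetical helper, and it assumes an improvement means a strictly lower loss, with no threshold.

```python
def replay_early_stopping(eval_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Simplified rule (an assumption, not the library code): an epoch improves
    only if its loss is strictly lower than the best so far; training stops
    once `patience` consecutive epochs fail to improve.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(eval_losses)


# Validation losses copied from the training log above (epochs 1-10).
val_losses = [0.656857, 0.511956, 0.409374, 0.393737, 0.414343,
              0.39445, 0.383094, 0.394562, 0.385316, 0.38742]

best_epoch, stop_epoch = replay_early_stopping(val_losses, patience=3)
print("best epoch:", best_epoch, "| stopped after epoch:", stop_epoch)
```

With `load_best_model_at_end=True`, the checkpoint from the best epoch is the one restored at the end of training, so the test accuracy reported afterwards belongs to that epoch rather than to the last one.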

In [25]:
""" Evaluate the adapter trained above on the SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

# Placeholder two-class head; load_adapter() below also loads the head saved
# with the adapter, which replaces this one.
model2.add_classification_head(
    'sst-2-2',
    num_labels=2
)
config = AdapterConfig.load("./tests-2/sst-2-2/adapter_config.json")
adapter_name = model2.load_adapter("./tests-2/sst-2-2", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive (int() because round(x, 0) returns a float)
    truth = output_labels[int(round(dataset["test"][i]['label']))]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)


var-sized adapter accuracy on SST-2 test data:  0.8357466063348417
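
The evaluation loop above turns the float sentiment scores into class labels with `round()`. Python's `round()` rounds halves to the nearest even integer, so a score of exactly 0.5 maps to 0 (negative). Whether the test split actually contains such scores is not shown here. The sketch below makes the threshold explicit and reproduces the same mapping for scores in [0, 1]. The names `binarize` and `accuracy` are hypothetical helpers, and the sample predictions are made up for illustration.

```python
ID2LABEL = {0: "Negative", 1: "Positive"}


def binarize(score, threshold=0.5):
    """Map a sentiment score in [0, 1] to 0 (negative) or 1 (positive).

    Uses a strict '>' so that a score of exactly 0.5 maps to 0, which is
    what int(round(score)) does under Python's round-half-to-even rule.
    """
    return int(score > threshold)


def accuracy(predicted_labels, scores):
    """Fraction of predicted label strings that match the binarized scores."""
    if len(predicted_labels) != len(scores):
        raise ValueError("predictions and scores must have the same length")
    correct = sum(
        pred == ID2LABEL[binarize(score)]
        for pred, score in zip(predicted_labels, scores)
    )
    return correct / len(scores)


# Made-up example: the third item has the ambiguous score 0.5.
example_predictions = ["Positive", "Negative", "Negative", "Positive"]
example_scores = [0.9, 0.1, 0.5, 0.3]
print(accuracy(example_predictions, example_scores))
```

Keeping the threshold in one function means the training labels (built in `tokenize_function`) and the evaluation labels can share the same rule.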


In [26]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes: larger adapters for the higher layers.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers,
# or a mapping specifying the reduction_factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
# Layers 0-5: adapter size 24 (768 / 32); layers 6-11: adapter size 48 (768 / 16); 12 layers in total
config2 = AdapterConfig.load("houlsby", reduction_factor={'0':32, '1':32,'2':32,'3':32,'4':32,'5':32,'default':16})
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-3",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-3",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter is to freeze all model weights except those of the adapter.
# train_adapter() does this: it disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-3"])
model2.set_active_adapters("sst-2-3")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# Labels must be torch.long for the cross-entropy loss. Without set_format they stay float, even with the int cast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-3", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6819,0.650549,0.550409
2,0.6108,0.499978,0.790191
3,0.4414,0.407693,0.811989
4,0.4095,0.396169,0.825613
5,0.4053,0.413016,0.819255
6,0.3963,0.395992,0.825613
7,0.3888,0.385401,0.840145
8,0.3852,0.395321,0.835604
9,0.3843,0.387065,0.840145
10,0.3799,0.389256,0.840145


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-3\checkpoint-267
Configuration saved in ./tests-3\checkpoint-267\sst-2-3\adapter_config.json
Module weights saved in ./tests-3\checkpoint-267\sst-2-3\pytorch_adapter.bin
Configuration saved in ./tests-3\checkpoint-267\sst-2-3\head_config.json
Module weights saved in ./tests-3\checkpoint-267\sst-2-3\pytorch_model_head.bin
Configuration saved in ./tests-3\checkpoint-267\sst-2-3\head_config.json
Module weights saved in ./tests-3\checkpoint-267\sst-2-3\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 

In [4]:
""" Evaluate the adapter trained above on the SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

# Placeholder two-class head; load_adapter() below also loads the head saved
# with the adapter, which replaces this one (see "Overwriting existing head" in the output).
model2.add_classification_head(
    'sst-2-3',
    num_labels=2
)
config = AdapterConfig.load("./tests-3/sst-2-3/adapter_config.json")
adapter_name = model2.load_adapter("./tests-3/sst-2-3", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive (int() because round(x, 0) returns a float)
    truth = output_labels[int(round(dataset["test"][i]['label']))]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModelWithHeads: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Overwriting existing head 'sst-2-3'


var-sized adapter accuracy on SST-2 test data:  0.8361990950226245
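
To compare configurations by size as well as by accuracy, the sketch below estimates how many adapter parameters a given `reduction_factor` setting implies. It rests on stated assumptions rather than on the library's own accounting: BERT-base dimensions (12 layers, hidden size 768), two bottleneck adapters per layer as in the Houlsby design, and only the down- and up-projection weights and biases are counted (layer norms and the classification head are left out). `adapter_params` and `total_adapter_params` are hypothetical helper names.

```python
def adapter_params(hidden_size, bottleneck):
    """Parameters of one bottleneck adapter: down- and up-projection with biases."""
    down = hidden_size * bottleneck + bottleneck
    up = bottleneck * hidden_size + hidden_size
    return down + up


def reduction_factor_for(layer, reduction_factor):
    """Resolve the reduction factor of a layer from an int or a per-layer mapping."""
    if isinstance(reduction_factor, dict):
        key = str(layer)
        if key in reduction_factor:
            return reduction_factor[key]
        return reduction_factor["default"]
    return reduction_factor


def total_adapter_params(reduction_factor, num_layers=12, hidden_size=768,
                         adapters_per_layer=2):
    """Sum the adapter parameters over all layers (assumptions in the text above)."""
    total = 0
    for layer in range(num_layers):
        rf = reduction_factor_for(layer, reduction_factor)
        bottleneck = hidden_size // rf
        total += adapters_per_layer * adapter_params(hidden_size, bottleneck)
    return total


# Fixed-size baselines and a "first eight layers small" configuration.
first_eight_small = {str(i): 32 for i in range(8)}
first_eight_small["default"] = 16

for name, rf in [("fixed rf=16", 16), ("fixed rf=32", 32),
                 ("layers 0-7 rf=32, rest rf=16", first_eight_small)]:
    print(f"{name}: {total_adapter_params(rf):,} adapter parameters")
```

Under these assumptions, every variable-size configuration in this notebook falls between the two fixed-size baselines, which is the trade-off the first question of the problem statement asks about.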


In [5]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes: larger adapters for the higher layers.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers,
# or a mapping specifying the reduction_factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
# Layers 0-7: adapter size 24 (768 / 32); layers 8-11: adapter size 48 (768 / 16); 12 layers in total
config2 = AdapterConfig.load("houlsby", reduction_factor={'0':32, '1':32,'2':32,'3':32,'4':32,'5':32,'6':32,'7':32,'default':16})
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-4",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-4",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter is to freeze all model weights except those of the adapter.
# train_adapter() does this: it disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-4"])
model2.set_active_adapters("sst-2-4")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# Labels must be torch.long for the cross-entropy loss. Without set_format they stay float, even with the int cast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-4", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()


The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6792,0.641744,0.569482
2,0.6018,0.489704,0.790191
3,0.4405,0.406582,0.811081
4,0.4102,0.395365,0.82198
5,0.4056,0.412662,0.814714
6,0.3958,0.397122,0.833787
7,0.3897,0.388018,0.840145
8,0.3838,0.397658,0.831063
9,0.3856,0.388746,0.841962
10,0.3817,0.391042,0.84287


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-4\checkpoint-267
Configuration saved in ./tests-4\checkpoint-267\sst-2-4\adapter_config.json
Module weights saved in ./tests-4\checkpoint-267\sst-2-4\pytorch_adapter.bin
Configuration saved in ./tests-4\checkpoint-267\sst-2-4\head_config.json
Module weights saved in ./tests-4\checkpoint-267\sst-2-4\pytorch_model_head.bin
Configuration saved in ./tests-4\checkpoint-267\sst-2-4\head_config.json
Module weights saved in ./tests-4\checkpoint-267\sst-2-4\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 

In [6]:
""" Evaluate the adapter trained above on the SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

# Placeholder two-class head; load_adapter() below also loads the head saved
# with the adapter, which replaces this one.
model2.add_classification_head(
    'sst-2-4',
    num_labels=2
)
config = AdapterConfig.load("./tests-4/sst-2-4/adapter_config.json")
adapter_name = model2.load_adapter("./tests-4/sst-2-4", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[int(round(dataset["test"][i]['label']))]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)


var-sized adapter accuracy on SST-2 test data:  0.8366515837104073
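
The per-layer `reduction_factor` dictionaries in these cells are written out by hand, which makes it easy to miss a layer. As a minimal sketch, the helper below builds the same kind of mapping from a list of layer indices and reports the resulting bottleneck size per layer. `reduction_factor_map` and `bottleneck_sizes` are hypothetical names, and the 12-layer, 768-unit dimensions are those of `bert-base-uncased` as printed in the model config.

```python
def reduction_factor_map(override_layers, override_rf, default_rf):
    """Build a mapping such as {'10': 16, '11': 16, 'default': 32}.

    Layers in `override_layers` get `override_rf`; all others fall back to
    `default_rf` through the 'default' key. Keys are strings, as in the
    notebook's hand-written dictionaries.
    """
    mapping = {str(layer): override_rf for layer in override_layers}
    mapping["default"] = default_rf
    return mapping


def bottleneck_sizes(mapping, num_layers=12, hidden_size=768):
    """Adapter bottleneck size for each layer implied by the mapping."""
    sizes = []
    for layer in range(num_layers):
        rf = mapping.get(str(layer))
        if rf is None:
            rf = mapping["default"]
        sizes.append(hidden_size // rf)
    return sizes


# Layers 0-7 small (rf=32), the remaining layers large (rf=16).
lower_eight_small = reduction_factor_map(range(8), override_rf=32, default_rf=16)
# Only the last two layers large (rf=16), everything else small (rf=32).
last_two_large = reduction_factor_map([10, 11], override_rf=16, default_rf=32)

print(lower_eight_small, bottleneck_sizes(lower_eight_small))
print(last_two_large, bottleneck_sizes(last_two_large))
```

Printing the bottleneck sizes before training is a cheap check that the mapping matches the layer description in the cell comment.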


In [7]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes: larger adapters for the higher layers.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers,
# or a mapping specifying the reduction_factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
# Layers 10 and 11 (the last two): adapter size 48 (768 / 16); layers 0-9: adapter size 24 (768 / 32); 12 layers in total
config2 = AdapterConfig.load("houlsby", reduction_factor={'10':16,'11':16, 'default': 32})
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-5",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-5",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter is to freeze all model weights except those of the adapter.
# train_adapter() does this: it disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-5"])
model2.set_active_adapters("sst-2-5")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# Labels must be torch.long for the cross-entropy loss. Without set_format they stay float, even with the int cast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-5",
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-5,
    num_train_epochs=20,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # For early stopping: track the validation loss and restore the best checkpoint at the end.
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    # label_names=["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    # Stop once the validation loss has not improved for 3 consecutive evaluations.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\sawro/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-base-uncased/re

BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo

  0%|          | 0/9 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1036, 1036, 16608, 1005, 1005, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6815,0.649141,0.532243
2,0.6116,0.511369,0.782016
3,0.4477,0.413461,0.808356
4,0.4121,0.397808,0.81653
5,0.4085,0.418374,0.82198
6,0.4007,0.399722,0.824705
7,0.3934,0.389604,0.832879
8,0.3888,0.401028,0.829246
9,0.3885,0.392585,0.832879
10,0.3842,0.393548,0.834696


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-5\checkpoint-267
Configuration saved in ./tests-5\checkpoint-267\sst-2-5\adapter_config.json
Module weights saved in ./tests-5\checkpoint-267\sst-2-5\pytorch_adapter.bin
Configuration saved in ./tests-5\checkpoint-267\sst-2-5\head_config.json
Module weights saved in ./tests-5\checkpoint-267\sst-2-5\pytorch_model_head.bin
Configuration saved in ./tests-5\checkpoint-267\sst-2-5\head_config.json
Module weights saved in ./tests-5\checkpoint-267\sst-2-5\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 
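
Training stopped after epoch 10 of the 20 requested, which matches what `EarlyStoppingCallback(early_stopping_patience=3)` is expected to do with the validation losses in the table above. The following is a minimal sketch of that patience logic, not the library's implementation: `simulate_early_stopping` is a hypothetical helper, and it assumes a lower-is-better metric with an improvement threshold of 0.

```python
def simulate_early_stopping(eval_losses, patience=3, threshold=0.0):
    """Return (stop_epoch, best_epoch), both 1-based, for a lower-is-better metric.

    Hypothetical helper: a simplified model of patience-based early stopping.
    """
    best_loss = None
    best_epoch = None
    bad_epochs = 0  # consecutive evaluations without improvement
    for epoch, loss in enumerate(eval_losses, start=1):
        if best_loss is None or loss < best_loss - threshold:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            return epoch, best_epoch
    return len(eval_losses), best_epoch


# Validation losses copied from the epoch table of the sst-2-5 run.
val_losses = [0.649141, 0.511369, 0.413461, 0.397808, 0.418374,
              0.399722, 0.389604, 0.401028, 0.392585, 0.393548]

stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 10 7
```

Under these assumptions the best validation loss is reached at epoch 7, so with `load_best_model_at_end=True` the saved adapter would correspond to epoch 7 rather than epoch 10.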

In [8]:
""" Evaluate the adapter trained above on SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

model2.add_classification_head(
    'sst-2-5',
    num_labels=1
)
config = AdapterConfig.load("./tests-5/sst-2-5/adapter_config.json")
adapter_name = model2.load_adapter("./tests-5/sst-2-5", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

loading file https://huggingface.co/bert-base-uncased/resolve/ma

var-sized adapter accuracy on SST-2 test data:  0.8343891402714932


In [10]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes: larger adapters (smaller reduction factor) in the higher layers.
# Here layers 0-7 use reduction factor 48 and layers 8-11 fall back to the default of 16.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers,
# or a mapping specifying the reduction factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", reduction_factor={'0':48, '1':48,'2':48,'3':48,'4':48,'5':48,'6':48,'7':48,'default':16})
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = dataset["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-6",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-6",
    config=config2
)

# Enable adapter training.
# The crucial step when training an adapter is to freeze all model weights except those of the adapter.
# train_adapter() does this: it disables training of all weights outside the named task adapter.
model2.train_adapter(["sst-2-6"])
model2.set_active_adapters("sst-2-6")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# Labels must be torch.long for the cross-entropy loss. Casting in tokenize_function alone is not enough:
# the column stays float unless the dataset format is set explicitly.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
torch_format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}  # renamed to avoid shadowing the built-in format()
tokenized_datasets.set_format(**torch_format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-6",
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-5,
    num_train_epochs=20,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # For early stopping: track the validation loss and restore the best checkpoint at the end.
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    # label_names=["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    # Stop once the validation loss has not improved for 3 consecutive evaluations.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()
#trainer.evaluate()
trainer.save_model()


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6785,0.64569,0.548592
2,0.6143,0.509996,0.776567
3,0.4518,0.412865,0.815622
4,0.4137,0.398721,0.817439
5,0.4091,0.41826,0.815622
6,0.3995,0.398249,0.830154
7,0.3924,0.388743,0.831971
8,0.3907,0.399008,0.839237
9,0.3897,0.39028,0.840145
10,0.3862,0.393056,0.84287


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-6\checkpoint-267
Configuration saved in ./tests-6\checkpoint-267\sst-2-6\adapter_config.json
Module weights saved in ./tests-6\checkpoint-267\sst-2-6\pytorch_adapter.bin
Configuration saved in ./tests-6\checkpoint-267\sst-2-6\head_config.json
Module weights saved in ./tests-6\checkpoint-267\sst-2-6\pytorch_model_head.bin
Configuration saved in ./tests-6\checkpoint-267\sst-2-6\head_config.json
Module weights saved in ./tests-6\checkpoint-267\sst-2-6\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 
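
The step counts reported by the trainer follow from the dataset size and the batch size. As a quick check (a sketch assuming a single device and no gradient accumulation, as the training log reports; `steps_for_run` is a hypothetical helper):

```python
import math


def steps_for_run(num_examples, batch_size, num_epochs):
    """Return (steps_per_epoch, total_steps) for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


# Values taken from the training log: 8544 examples, batch size 32, 20 epochs.
steps_per_epoch, total_steps = steps_for_run(8544, 32, 20)
print(steps_per_epoch, total_steps)  # 267 5340

# Early stopping ended the run after 10 epochs, i.e. after this many optimization steps:
steps_actually_run = steps_per_epoch * 10
print(steps_actually_run)  # 2670
```

This is why the first checkpoint directory is named `checkpoint-267`: it is written at the end of the first epoch.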

In [11]:
""" Evaluate the adapter trained above on SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

model2.add_classification_head(
    'sst-2-6',
    num_labels=1
)
config = AdapterConfig.load("./tests-6/sst-2-6/adapter_config.json")
adapter_name = model2.load_adapter("./tests-6/sst-2-6", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

var-sized adapter accuracy on SST-2 test data:  0.8357466063348417
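
The two test accuracies reported so far (0.8344 for `sst-2-5` and 0.8357 for `sst-2-6`) differ by about 0.14 percentage points. A rough way to judge whether such a gap is meaningful is the binomial standard error of an accuracy estimate. The sketch below assumes the test split has 2,210 sentences; that number is not printed in this notebook, but it is consistent with the reported fractions. `accuracy_std_error` is a hypothetical helper.

```python
import math


def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


N_TEST = 2210  # assumption: size of the test split, not shown in the notebook output

acc_5 = 0.8343891402714932  # reported for sst-2-5
acc_6 = 0.8357466063348417  # reported for sst-2-6

# Number of correctly classified sentences implied by each accuracy.
correct_5 = round(acc_5 * N_TEST)
correct_6 = round(acc_6 * N_TEST)

se = accuracy_std_error(acc_5, N_TEST)
diff = acc_6 - acc_5

print("correct predictions:", correct_5, correct_6)
print("standard error: %.4f, difference: %.4f" % (se, diff))
```

Under this assumption the two runs differ by only a handful of test sentences, and the gap is well below one standard error. Because both adapters are scored on the same sentences, a paired comparison would be the more appropriate test; this is only a first sanity check.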


In [13]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes.
# Note: this mapping gives layers 10 and 11 a larger reduction factor (96), i.e. smaller adapters,
# while layers 0-9 fall back to the default of 16.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers,
# or a mapping specifying the reduction factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", reduction_factor={'10':96, '11':96,'default':16})
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = dataset["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-7",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-7",
    config=config2
)

# Enable adapter training.
# The crucial step when training an adapter is to freeze all model weights except those of the adapter.
# train_adapter() does this: it disables training of all weights outside the named task adapter.
model2.train_adapter(["sst-2-7"])
model2.set_active_adapters("sst-2-7")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# Labels must be torch.long for the cross-entropy loss. Casting in tokenize_function alone is not enough:
# the column stays float unless the dataset format is set explicitly.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
torch_format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}  # renamed to avoid shadowing the built-in format()
tokenized_datasets.set_format(**torch_format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-7",
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-5,
    num_train_epochs=20,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # For early stopping: track the validation loss and restore the best checkpoint at the end.
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    # label_names=["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    # Stop once the validation loss has not improved for 3 consecutive evaluations.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()
#trainer.evaluate()
trainer.save_model()


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6792,0.634971,0.590372
2,0.5632,0.441669,0.809264
3,0.4198,0.39858,0.82743
4,0.4038,0.389816,0.830154
5,0.4014,0.408022,0.82743
6,0.3904,0.390612,0.836512
7,0.3841,0.381206,0.838329
8,0.38,0.391488,0.838329
9,0.3795,0.383406,0.84287
10,0.3733,0.384386,0.84287


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-6\checkpoint-267
Configuration saved in ./tests-6\checkpoint-267\sst-2-7\adapter_config.json
Module weights saved in ./tests-6\checkpoint-267\sst-2-7\pytorch_adapter.bin
Configuration saved in ./tests-6\checkpoint-267\sst-2-7\head_config.json
Module weights saved in ./tests-6\checkpoint-267\sst-2-7\pytorch_model_head.bin
Configuration saved in ./tests-6\checkpoint-267\sst-2-7\head_config.json
Module weights saved in ./tests-6\checkpoint-267\sst-2-7\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 

In [14]:
""" Evaluate the adapter trained above on SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

model2.add_classification_head(
    'sst-2-7',
    num_labels=1
)
config = AdapterConfig.load("./tests-7/sst-2-7/adapter_config.json")
adapter_name = model2.load_adapter("./tests-7/sst-2-7", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

var-sized adapter accuracy on SST-2 test data:  0.8371040723981901
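
To compare the parameter budgets of the `reduction_factor` mappings used in `In [10]` and `In [13]`, the reduction factors can be converted to bottleneck sizes and approximate parameter counts. This is a sketch under stated assumptions: the bottleneck size is taken as `hidden_size // reduction_factor`, each of the 12 layers holds two Houlsby adapters, and each adapter is counted as 2md + d + m parameters (down- and up-projection with biases, as in Houlsby et al., 2019). Any extra parameters the library may add are ignored, and the uniform factor of 16 is included only as a reference point. `bottleneck_sizes` and `adapter_params` are hypothetical helpers.

```python
HIDDEN_SIZE = 768        # d, from the BertConfig printed above
NUM_LAYERS = 12          # from the BertConfig printed above
ADAPTERS_PER_LAYER = 2   # assumption: Houlsby places one adapter after attention, one after the feed-forward block


def bottleneck_sizes(reduction_factor, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS):
    """Map an int or a {layer: factor, 'default': factor} mapping to one bottleneck size per layer."""
    if isinstance(reduction_factor, int):
        reduction_factor = {"default": reduction_factor}
    sizes = []
    for layer in range(num_layers):
        factor = reduction_factor.get(str(layer), reduction_factor.get("default"))
        sizes.append(hidden_size // factor)
    return sizes


def adapter_params(sizes, hidden_size=HIDDEN_SIZE):
    """Approximate adapter parameter count: 2*m*d + d + m per adapter, summed over all layers."""
    per_adapter = [2 * m * hidden_size + hidden_size + m for m in sizes]
    return ADAPTERS_PER_LAYER * sum(per_adapter)


configs = {
    "uniform 16 (reference)": 16,
    "In [10]": {'0': 48, '1': 48, '2': 48, '3': 48, '4': 48, '5': 48, '6': 48, '7': 48, 'default': 16},
    "In [13]": {'10': 96, '11': 96, 'default': 16},
}

for name, reduction_factor in configs.items():
    sizes = bottleneck_sizes(reduction_factor)
    print(name, sizes, adapter_params(sizes))
```

Under these assumptions the `In [10]` mapping uses roughly 1.0M adapter parameters and the `In [13]` mapping roughly 1.5M, against about 1.8M for a uniform factor of 16, while the test accuracies of the two variable-size runs differ only in the third decimal place.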


## Interesting Experiment 1

In [16]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes: a larger adapter in the top layer only.
# Here layer 11 uses reduction factor 48 and all other layers fall back to the default of 96.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers,
# or a mapping specifying the reduction factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", reduction_factor={'11':48,'default':96})
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = dataset["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-8",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-8",
    config=config2
)

# Enable adapter training
# The crucial step when training an adapter module is to freeze all weights in the model except those of the adapter.
# train_adapter() does this: it disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-8"])
model2.set_active_adapters("sst-2-8")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# Cast the label column to long: CrossEntropyLoss expects integer class labels, and the labels otherwise stay float even after the int cast in tokenize_function.
# See https://discuss.huggingface.co/t/dataset-set-format/1961
torch_format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**torch_format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-8", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

loading weights file https://huggingface.co/bert-base-uncased/re

BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1036, 1036, 16608, 1005, 1005, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6851 | 0.6618 | 0.518619 |
| 2 | 0.6541 | 0.594929 | 0.722071 |
| 3 | 0.5693 | 0.487377 | 0.789282 |
| 4 | 0.4583 | 0.419962 | 0.802906 |
| 5 | 0.4282 | 0.430441 | 0.804723 |
| 6 | 0.4163 | 0.41148 | 0.814714 |
| 7 | 0.411 | 0.401943 | 0.817439 |
| 8 | 0.4068 | 0.412195 | 0.819255 |
| 9 | 0.4048 | 0.403658 | 0.823797 |
| 10 | 0.4041 | 0.404804 | 0.825613 |
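
The ten epochs logged above are consistent with the early-stopping setup in the training cell (patience of 3 on `eval_loss`). The sketch below replays the validation losses through a simplified patience counter. It is an illustration of the idea, not the `EarlyStoppingCallback` implementation, and the function name `simulate_early_stopping` is hypothetical.

```python
def simulate_early_stopping(eval_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a simple patience rule on a loss curve.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation losses copied from the training log above.
val_losses = [0.6618, 0.594929, 0.487377, 0.419962, 0.430441,
              0.41148, 0.401943, 0.412195, 0.403658, 0.404804]

stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print("stopped after epoch:", stop_epoch)
print("best epoch by validation loss:", best_epoch)
```

Under this rule the lowest validation loss is at epoch 7, and epochs 8 to 10 use up the patience budget. If the log shown is complete, `load_best_model_at_end=True` would then restore the epoch-7 checkpoint, even though accuracy keeps rising slightly afterwards.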


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-8\checkpoint-267
Configuration saved in ./tests-8\checkpoint-267\sst-2-8\adapter_config.json
Module weights saved in ./tests-8\checkpoint-267\sst-2-8\pytorch_adapter.bin
Configuration saved in ./tests-8\checkpoint-267\sst-2-8\head_config.json
Module weights saved in ./tests-8\checkpoint-267\sst-2-8\pytorch_model_head.bin
Configuration saved in ./tests-8\checkpoint-267\sst-2-8\head_config.json
Module weights saved in ./tests-8\checkpoint-267\sst-2-8\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 
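
To make the effect of the per-layer `reduction_factor` mapping concrete, the sketch below estimates adapter parameter counts. It rests on assumptions not stated in the notebook: each bottleneck has `hidden_size // reduction_factor` units, a Houlsby configuration places two adapters in every transformer layer, and each adapter is one down-projection plus one up-projection with biases. Layer norms and the classification head are not counted, and the helper names are hypothetical, so treat the totals as estimates.

```python
HIDDEN_SIZE = 768   # hidden_size from the BertConfig printed in this notebook
NUM_LAYERS = 12     # num_hidden_layers from the same config


def bottleneck_size(layer, reduction_factor):
    """Resolve the bottleneck width of one layer from an int or a per-layer mapping."""
    if isinstance(reduction_factor, int):
        factor = reduction_factor
    else:
        factor = reduction_factor.get(str(layer), reduction_factor["default"])
    return HIDDEN_SIZE // factor


def adapter_params(bottleneck):
    """Weights and biases of one down-projection and one up-projection."""
    down = HIDDEN_SIZE * bottleneck + bottleneck
    up = bottleneck * HIDDEN_SIZE + HIDDEN_SIZE
    return down + up


def total_params(reduction_factor, adapters_per_layer=2):
    """Estimated adapter parameters summed over all layers (assumed two adapters per layer)."""
    return sum(
        adapters_per_layer * adapter_params(bottleneck_size(layer, reduction_factor))
        for layer in range(NUM_LAYERS)
    )


variable = {'11': 48, 'default': 96}   # the mapping used in the cell above
print("bottleneck at layer 11:", bottleneck_size(11, variable))
print("bottleneck at layer 0:", bottleneck_size(0, variable))
print("variable-size estimate:", total_params(variable))
print("fixed reduction_factor=96 estimate:", total_params(96))
```

Comparing the last two numbers shows how little the budget grows when only the top layer gets the larger bottleneck.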

In [17]:
""" Evaluate the above pre-trained adapter on SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

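# Placeholder head with the same two classes used in training.
# load_adapter() below also loads the head saved next to the adapter.
model2.add_classification_head(
    'sst-2-8',
    num_labels=2
)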
config = AdapterConfig.load("./tests-8/sst-2-8/adapter_config.json")
adapter_name = model2.load_adapter("./tests-8/sst-2-8", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

var-sized adapter accuracy on SST-2 test data:  0.8294117647058824


In [18]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes: a larger bottleneck for the top two layers.
# reduction_factor (int or Mapping): either an integer applied to all layers,
# or a mapping that sets the reduction factor for individual layers (keys are layer indices as strings).
# If the mapping does not cover every layer, give a default value, e.g. {'1': 8, '6': 32, 'default': 16}.
# A smaller reduction factor means a larger bottleneck: with hidden_size=768,
# 48 gives 768/48 = 16 units (layers 10 and 11) and 96 gives 768/96 = 8 units (all other layers).
config2 = AdapterConfig.load("houlsby", reduction_factor={'10':48,'11':48,'default':96})
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-9",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-9",
    config=config2
)

# Enable adapter training
# The crucial step when training an adapter module is to freeze all weights in the model except those of the adapter.
# train_adapter() does this: it disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-9"])
model2.set_active_adapters("sst-2-9")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# Cast the label column to long: CrossEntropyLoss expects integer class labels, and the labels otherwise stay float even after the int cast in tokenize_function.
# See https://discuss.huggingface.co/t/dataset-set-format/1961
torch_format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**torch_format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-9", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6841 | 0.656445 | 0.542234 |
| 2 | 0.6553 | 0.592941 | 0.742053 |
| 3 | 0.578 | 0.492446 | 0.784741 |
| 4 | 0.4637 | 0.41976 | 0.803815 |
| 5 | 0.4288 | 0.430902 | 0.80654 |
| 6 | 0.4163 | 0.411443 | 0.815622 |
| 7 | 0.4092 | 0.402027 | 0.823797 |
| 8 | 0.4064 | 0.411619 | 0.822888 |
| 9 | 0.405 | 0.402755 | 0.830154 |
| 10 | 0.4019 | 0.403245 | 0.832879 |


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-9\checkpoint-267
Configuration saved in ./tests-9\checkpoint-267\sst-2-9\adapter_config.json
Module weights saved in ./tests-9\checkpoint-267\sst-2-9\pytorch_adapter.bin
Configuration saved in ./tests-9\checkpoint-267\sst-2-9\head_config.json
Module weights saved in ./tests-9\checkpoint-267\sst-2-9\pytorch_model_head.bin
Configuration saved in ./tests-9\checkpoint-267\sst-2-9\head_config.json
Module weights saved in ./tests-9\checkpoint-267\sst-2-9\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: sentence, tokens, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to 

In [19]:
""" Evaluate the above pre-trained adapter on SST-2 test data """

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")

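# Placeholder head with the same two classes used in training.
# load_adapter() below also loads the head saved next to the adapter.
model2.add_classification_head(
    'sst-2-9',
    num_labels=2
)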
config = AdapterConfig.load("./tests-9/sst-2-9/adapter_config.json")
adapter_name = model2.load_adapter("./tests-9/sst-2-9", config=config)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

var-sized adapter accuracy on SST-2 test data:  0.8262443438914027


In [4]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Adapter in the top layer only.
# reduction_factor=16 applies to every layer that has an adapter and gives a bottleneck of 768/16 = 48 units.
# leave_out lists the layers (0-indexed) that get no adapter, so only layer 11 keeps one.
config2 = AdapterConfig.load("houlsby", reduction_factor=16, leave_out=[0,1,2,3,4,5,6,7,8,9,10])
model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-10",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-10",
    config=config2
)

# Enable adapter training
# The crucial step when training an adapter module is to freeze all weights in the model except those of the adapter.
# train_adapter() does this: it disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-10"])
model2.set_active_adapters("sst-2-10")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# Cast the label column to long: CrossEntropyLoss expects integer class labels, and the labels otherwise stay float even after the int cast in tokenize_function.
# See https://discuss.huggingface.co/t/dataset-set-format/1961
torch_format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**torch_format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-10", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModelWithHeads: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6811 | 0.652771 | 0.559491 |
| 2 | 0.6649 | 0.614935 | 0.696639 |
| 3 | 0.65 | 0.588942 | 0.71753 |
| 4 | 0.6387 | 0.559432 | 0.752044 |
| 5 | 0.6299 | 0.547342 | 0.736603 |
| 6 | 0.6184 | 0.527232 | 0.747502 |
| 7 | 0.611 | 0.508385 | 0.758401 |
| 8 | 0.602 | 0.506527 | 0.751135 |
| 9 | 0.5968 | 0.495741 | 0.762943 |
| 10 | 0.5886 | 0.488556 | 0.763851 |


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-10\checkpoint-267
Configuration saved in ./tests-10\checkpoint-267\sst-2-10\adapter_config.json
Module weights saved in ./tests-10\checkpoint-267\sst-2-10\pytorch_adapter.bin
Configuration saved in ./tests-10\checkpoint-267\sst-2-10\head_config.json
Module weights saved in ./tests-10\checkpoint-267\sst-2-10\pytorch_model_head.bin
Configuration saved in ./tests-10\checkpoint-267\sst-2-10\head_config.json
Module weights saved in ./tests-10\checkpoint-267\sst-2-10\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tokens, sentence, tree.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model c
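
The run above restricts adapters with `leave_out`, which lists the 0-indexed layers that receive no adapter. As a small illustration (the helper `adapted_layers` is hypothetical and only mirrors that description, it does not call the library), the layers that still carry an adapter can be listed like this:

```python
NUM_LAYERS = 12  # num_hidden_layers of bert-base, as in the config shown earlier


def adapted_layers(leave_out, num_layers=NUM_LAYERS):
    """Return the 0-indexed layers that keep an adapter when leave_out is applied."""
    skipped = set(leave_out)
    return [layer for layer in range(num_layers) if layer not in skipped]


top_one = adapted_layers([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # leave_out used in the run above
top_two = adapted_layers([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])      # leaves out one layer fewer
print(top_one)
print(top_two)
```

With adapters in a single layer, far fewer parameters are trainable. That is one plausible reason the validation loss in this run falls more slowly than in the runs with adapters in all twelve layers.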

In [20]:
""" Evaluate the above pre-trained adapter on SST-2 test data """

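#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# NOTE: the warning printed below shows BERT_LOCAL_PATH pointing at a bert-base-cased folder, while the adapter
# was trained on bert-base-uncased. This mismatch may explain the low test accuracy of this run.
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

# Placeholder head with the same two classes used in training.
# load_adapter() below overwrites it with the saved head (see "Overwriting existing head" in the output).
model2.add_classification_head(
    'sst-2-10',
    num_labels=2
)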
config = AdapterConfig.load("./tests-10/sst-2-10/adapter_config.json")
#adapter_name = model2.load_adapter("./tests-10/sst-2-10", config=config)
adapter_name = model2.load_adapter("./tests-10/sst-2-10", config=config, model_name=BERT_LOCAL_PATH)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

Some weights of the model checkpoint at D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased were not used when initializing BertModelWithHeads: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Overwriting existing head 'sst-2-10'


var-sized adapter accuracy on SST-2 test data:  0.5574660633484163
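
The evaluation loop above turns each test label into a class name by rounding it. A minimal sketch of that step in isolation, assuming the labels are float sentiment scores in [0, 1] (the printed dataset examples in this notebook show `'label': 1.0`). `binarize_label` and `accuracy` are hypothetical helper names used only for illustration, not part of any library:

```python
LABEL_NAMES = {0: "Negative", 1: "Positive"}


def binarize_label(score: float) -> int:
    """Map a sentiment score in [0, 1] to a class id (0 or 1) by rounding."""
    # round(x) without ndigits returns an int; round(x, 0) returns a float.
    return int(round(score))


def accuracy(predicted_names, scores):
    """Fraction of predicted class names that match the rounded scores."""
    if len(predicted_names) != len(scores):
        raise ValueError("predictions and scores must have the same length")
    if not scores:
        return 0.0
    correct = sum(
        pred == LABEL_NAMES[binarize_label(score)]
        for pred, score in zip(predicted_names, scores)
    )
    return correct / len(scores)


# Hypothetical predictions and scores, for illustration only.
print(accuracy(["Positive", "Negative", "Positive"], [0.9, 0.2, 0.3]))
```

Python rounds halves to the nearest even integer, so a score of exactly 0.5 maps to class 0 under this scheme.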


In [21]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Adapters are added only to the top two layers (10 and 11); leave_out skips layers 0-9.
# Both layers use the same reduction factor (16), i.e. the same adapter size.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers
# or a mapping specifying the reduction_factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", reduction_factor=16, leave_out=[0,1,2,3,4,5,6,7,8,9])
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = dataset["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-11",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-11",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter module is to freeze all weights in the model except for those of the adapter.
# Calling train_adapter() disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-11"])
model2.set_active_adapters("sst-2-11")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label column must be returned as long (CrossEntropyLoss expects long targets).
# Without this, it stays float even after the int cast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
torch_format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**torch_format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-11", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

Some weights of the model checkpoint at D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased were not used when initializing BertModelWithHeads: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1109, 2977, 1110, 17348, 1106, 1129, 1103, 6880, 5944, 112, 188, 1207, 169, 169, 17727, 112, 112, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 7296, 20452, 24156, 11819, 7582, 9146, 117, 2893, 118, 140, 15554, 1181, 3605, 8732, 3263, 1137, 6536, 17979, 1233, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


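| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6886 | 0.660441 | 0.678474 |
| 2 | 0.675 | 0.630797 | 0.732062 |
| 3 | 0.6584 | 0.602197 | 0.736603 |
| 4 | 0.644 | 0.572355 | 0.745686 |
| 5 | 0.6218 | 0.541964 | 0.754768 |
| 6 | 0.604 | 0.5138 | 0.769301 |
| 7 | 0.5732 | 0.487139 | 0.769301 |
| 8 | 0.5513 | 0.471312 | 0.782016 |
| 9 | 0.5283 | 0.459366 | 0.790191 |
| 10 | 0.5131 | 0.451029 | 0.796549 |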


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-10\checkpoint-267
Configuration saved in ./tests-10\checkpoint-267\sst-2-10\adapter_config.json
Module weights saved in ./tests-10\checkpoint-267\sst-2-10\pytorch_adapter.bin
Configuration saved in ./tests-10\checkpoint-267\sst-2-10\head_config.json
Module weights saved in ./tests-10\checkpoint-267\sst-2-10\pytorch_model_head.bin
Configuration saved in ./tests-10\checkpoint-267\sst-2-10\head_config.json
Module weights saved in ./tests-10\checkpoint-267\sst-2-10\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model c
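
The "Total optimization steps" figure in the training log follows from the dataset size, batch size, and epoch count. A small sketch of that arithmetic, assuming a single device and a simplified view of how the step count is derived; `total_optimization_steps` is a hypothetical helper, not a `Trainer` method:

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Number of optimizer updates for a full training run on one device."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # With gradient accumulation, several batches share one optimizer update.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


# Values taken from the training logs in this notebook.
print(total_optimization_steps(8544, 32, 1))   # steps in one epoch: 267
print(total_optimization_steps(8544, 32, 20))  # 20-epoch run: 5340
print(total_optimization_steps(8544, 32, 40))  # 40-epoch run: 10680
```

One epoch is 267 steps, which is why the first checkpoint directory in the logs is named `checkpoint-267`.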

In [22]:
""" Evaluate the adapter trained above on SST-2 test data """

#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

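# Placeholder head with two labels; load_adapter() below also loads the saved
# head for 'sst-2-11' and overwrites this one.
model2.add_classification_head(
    'sst-2-11',
    num_labels=2
)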
config = AdapterConfig.load("./tests-11/sst-2-11/adapter_config.json")
#adapter_name = model2.load_adapter("./tests-11/sst-2-11", config=config)
adapter_name = model2.load_adapter("./tests-11/sst-2-11", config=config, model_name=BERT_LOCAL_PATH)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[int(round(dataset["test"][i]['label']))]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

loading configuration file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\config.json
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\added_tokens.json. We won't load it.
Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\special_tokens_map.json. We won't load it.
loa

var-sized adapter accuracy on SST-2 test data:  0.8117647058823529
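
To compare parameter budgets between adapter placements, the trainable adapter parameters can be estimated from the hidden size and the reduction factor. The sketch below rests on stated assumptions rather than on the library's exact accounting: each adapter is a down-projection and an up-projection with biases, the bottleneck size is `hidden_size // reduction_factor`, the Houlsby layout places two adapters in each adapted layer, and any layer-norm or prediction-head parameters are ignored. Both function names are hypothetical.

```python
def bottleneck_adapter_params(hidden_size: int, reduction_factor: int) -> int:
    """Weights and biases of one down-projection plus one up-projection."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck   # W_down and b_down
    up = bottleneck * hidden_size + hidden_size    # W_up and b_up
    return down + up


def houlsby_adapter_params(hidden_size, reduction_factor, num_adapted_layers,
                           adapters_per_layer=2):
    """Estimated adapter parameters when the same size is used in every adapted layer."""
    per_adapter = bottleneck_adapter_params(hidden_size, reduction_factor)
    return num_adapted_layers * adapters_per_layer * per_adapter


# BERT-base hidden size is 768; compare 2, 3 and all 12 adapted layers.
for layers in (2, 3, 12):
    print(layers, houlsby_adapter_params(768, 16, layers))
```

Under these assumptions the count grows linearly with the number of adapted layers, so leaving out the lower layers cuts the adapter budget proportionally.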


In [31]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Adapters are added only to the top three layers (9, 10 and 11); leave_out skips layers 0-8.
# All three layers use the same reduction factor (16), i.e. the same adapter size.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers
# or a mapping specifying the reduction_factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", reduction_factor=16, leave_out=[0,1,2,3,4,5,6,7,8])
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = dataset["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-12",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-12",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter module is to freeze all weights in the model except for those of the adapter.
# Calling train_adapter() disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-12"])
model2.set_active_adapters("sst-2-12")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label column must be returned as long (CrossEntropyLoss expects long targets).
# Without this, it stays float even after the int cast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
torch_format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**torch_format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-12", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs= 40, #20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

loading weights file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\pytorch_model.bin
Some weights of the model checkpoint at D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased were not used when initializing BertMo

BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1109, 2977, 1110, 17348, 1106, 1129, 1103, 6880, 5944, 112, 188, 1207, 169, 169, 17727, 112, 112, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 7296, 20452, 24156, 11819, 7582, 9146, 117, 2893, 118, 140, 15554, 1181, 3605, 8732, 3263, 1137, 6536, 17979, 1233, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running training *****
  Num examples = 8544
  Num Epochs = 40
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 10680


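| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6906 | 0.666353 | 0.673933 |
| 2 | 0.6759 | 0.632749 | 0.743869 |
| 3 | 0.6549 | 0.591358 | 0.741144 |
| 4 | 0.6229 | 0.532643 | 0.772025 |
| 5 | 0.5631 | 0.475344 | 0.7802 |
| 6 | 0.5134 | 0.444248 | 0.80109 |
| 7 | 0.4756 | 0.434366 | 0.808356 |
| 8 | 0.4581 | 0.43681 | 0.802906 |
| 9 | 0.4563 | 0.431309 | 0.805631 |
| 10 | 0.4493 | 0.430215 | 0.804723 |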


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-12\checkpoint-267
Configuration saved in ./tests-12\checkpoint-267\sst-2-12\adapter_config.json
Module weights saved in ./tests-12\checkpoint-267\sst-2-12\pytorch_adapter.bin
Configuration saved in ./tests-12\checkpoint-267\sst-2-12\head_config.json
Module weights saved in ./tests-12\checkpoint-267\sst-2-12\pytorch_model_head.bin
Configuration saved in ./tests-12\checkpoint-267\sst-2-12\head_config.json
Module weights saved in ./tests-12\checkpoint-267\sst-2-12\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model c
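
The training cell above stops early when the validation loss fails to improve for a number of consecutive evaluations. A simplified sketch of that patience rule, not the actual `EarlyStoppingCallback` implementation (which also supports an improvement threshold); `early_stopping_epoch` is a hypothetical helper:

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation losses for the ten epochs shown in the training output above.
logged_losses = [0.666353, 0.632749, 0.591358, 0.532643, 0.475344,
                 0.444248, 0.434366, 0.43681, 0.431309, 0.430215]
print(early_stopping_epoch(logged_losses, patience=3))
```

For the ten logged epochs the rule never fires with a patience of 3: the only non-improving epoch (epoch 8) is followed by a new best at epoch 9, which resets the counter.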

In [32]:
""" Evaluate the adapter trained above on SST-2 test data """

#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

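# Placeholder head with two labels; load_adapter() below also loads the saved
# head for 'sst-2-12' and overwrites this one.
model2.add_classification_head(
    'sst-2-12',
    num_labels=2
)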
config = AdapterConfig.load("./tests-12/sst-2-12/adapter_config.json")
#adapter_name = model2.load_adapter("./tests-12/sst-2-12", config=config)
adapter_name = model2.load_adapter("./tests-12/sst-2-12", config=config, model_name=BERT_LOCAL_PATH)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[int(round(dataset["test"][i]['label']))]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\added_tokens.json. We won't load it.
Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\special_tokens_map.json. We won't load it.
loa

var-sized adapter accuracy on SST-2 test data:  0.834841628959276


In [34]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes: a bigger adapter (smaller reduction factor) for the higher layer.
# Layer 11 uses reduction factor 8 and layer 10 uses 16; leave_out skips layers 0-9.
# reduction_factor (int or Mapping): either an integer specifying the reduction factor for all layers
# or a mapping specifying the reduction_factor for individual layers.
# If not all layers are represented in the mapping, a default value should be given, e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", reduction_factor={'11':8,'10':16}, leave_out=[0,1,2,3,4,5,6,7,8,9])
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = dataset["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-13",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-13",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter module is to freeze all weights in the model except for those of the adapter.
# Calling train_adapter() disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-13"])
model2.set_active_adapters("sst-2-13")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label column must be returned as long (CrossEntropyLoss expects long targets).
# Without this, it stays float even after the int cast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
torch_format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**torch_format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-13", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=40, #20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 5)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

loading weights file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\pytorch_model.bin
Some weights of the model checkpoint at D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased were not used when initializing BertMo

BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo


{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1109, 2977, 1110, 17348, 1106, 1129, 1103, 6880, 5944, 112, 188, 1207, 169, 169, 17727, 112, 112, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 7296, 20452, 24156, 11819, 7582, 9146, 117, 2893, 118, 140, 15554, 1181, 3605, 8732, 3263, 1137, 6536, 17979, 1233, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running training *****
  Num examples = 8544
  Num Epochs = 40
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 10680


Epoch,Training Loss,Validation Loss,Accuracy
1,0.688,0.656714,0.705722
2,0.6724,0.624846,0.744777
3,0.6565,0.589789,0.743869
4,0.6345,0.550608,0.762035
5,0.6061,0.511957,0.764759
6,0.5754,0.477255,0.773842
7,0.5324,0.453309,0.792916
8,0.5052,0.446027,0.799273
9,0.4895,0.440136,0.805631
10,0.4799,0.437426,0.807448


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-13\checkpoint-267
Configuration saved in ./tests-13\checkpoint-267\sst-2-13\adapter_config.json
Module weights saved in ./tests-13\checkpoint-267\sst-2-13\pytorch_adapter.bin
Configuration saved in ./tests-13\checkpoint-267\sst-2-13\head_config.json
Module weights saved in ./tests-13\checkpoint-267\sst-2-13\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model c
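
The step counts reported in the training log follow directly from the dataset size and the batch size. A minimal sketch of that arithmetic, assuming a single device, no gradient accumulation, and that a final partial batch still counts as a step (the helper name `total_optimization_steps` is hypothetical and not part of the notebook):

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer updates for a run that is not stopped early."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Values taken from the log: 8544 training examples, batch size 32.
print(total_optimization_steps(8544, 32, 1))   # steps in one epoch
print(total_optimization_steps(8544, 32, 40))  # 40-epoch run
print(total_optimization_steps(8544, 32, 20))  # 20-epoch run
```

One epoch is 267 steps, which matches the `checkpoint-267` directory written after the first evaluation.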

In [35]:
""" Evaluate the above pre-trained adapter on SST-2 test data """

#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

model2.add_classification_head(
    'sst-2-13',
    num_labels=1
)
config = AdapterConfig.load("./tests-13/sst-2-13/adapter_config.json")
#adapter_name = model2.load_adapter("./tests-10/sst-2-10", config=config)
adapter_name = model2.load_adapter("./tests-13/sst-2-13", config=config, model_name=BERT_LOCAL_PATH)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

loading configuration file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\config.json
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\added_tokens.json. We won't load it.
Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\special_tokens_map.json. We won't load it.
loa

var-sized adapter accuracy on SST-2 test data:  0.8321266968325792
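
The evaluation loop above turns each gold label into a class with Python's built-in `round`, which suggests the labels are sentiment scores in [0, 1]. The sketch below isolates that step, assuming a threshold of 0.5 is what is intended; `binarize` and `accuracy` are hypothetical helper names, not functions from the notebook. One detail worth knowing is that `round` resolves ties to the nearest even integer, so a score of exactly 0.5 becomes 0 (Negative), while an explicit comparison makes the tie-breaking choice visible.

```python
def binarize(score, threshold=0.5):
    """Map a sentiment score in [0, 1] to 0 (negative) or 1 (positive)."""
    return 1 if score >= threshold else 0


def accuracy(predicted, gold_scores):
    """Fraction of predicted classes that match the binarized gold scores."""
    if len(predicted) != len(gold_scores):
        raise ValueError("predicted and gold_scores must have the same length")
    gold = [binarize(score) for score in gold_scores]
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Built-in round() sends the tie 0.5 to 0.0; the explicit threshold sends it to 1.
print(round(0.5, 0), binarize(0.5))

# Toy example (made-up scores): gold classes are [1, 0, 0, 1].
print(accuracy([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7]))
```

Whichever convention is chosen, it should be the same one used when the training labels are rounded in `tokenize_function`.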


# Baseline adapter with ReLU

In [36]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Baseline adapter: default Houlsby configuration (no per-layer reduction_factor mapping),
# with the adapter non-linearity set to ReLU.
# reduction_factor (int or Mapping) – Either an integer specifying the reduction factor for all layers 
# or a mapping specifying the reduction_factor for individual layers. 
# If not all layers are represented in the mapping a default value should be given e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", non_linearity='relu')
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-14",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-14",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter module is to freeze all weights in the model except for those of the adapter.
# Calling the train_adapter() method disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-14"])
model2.set_active_adapters("sst-2-14")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label column must be formatted as long (CrossEntropyLoss expects integer class labels).
# Otherwise it stays float, even with the cast to int in tokenize_function.
# Check this out: https://discuss.huggingface.co/t/dataset-set-format/1961
# Named format_args so that the built-in format() is not shadowed.
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-14", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

loading configuration file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\config.json
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\pytorch_model.bin
Some weights of the model checkpoint at D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased were not used when initializing BertMo

BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1109, 2977, 1110, 17348, 1106, 1129, 1103, 6880, 5944, 112, 188, 1207, 169, 169, 17727, 112, 112, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 7296, 20452, 24156, 11819, 7582, 9146, 117, 2893, 118, 140, 15554, 1181, 3605, 8732, 3263, 1137, 6536, 17979, 1233, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running training *****
  Num examples = 8544
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5340


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6858,0.651083,0.715713
2,0.5898,0.445808,0.808356
3,0.4373,0.411739,0.814714
4,0.418,0.401076,0.822888
5,0.4061,0.407494,0.814714
6,0.4009,0.396454,0.823797
7,0.394,0.393223,0.826521
8,0.3859,0.393133,0.82743
9,0.3834,0.391214,0.830154
10,0.3812,0.39113,0.82743


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-14\checkpoint-267
Configuration saved in ./tests-14\checkpoint-267\sst-2-14\adapter_config.json
Module weights saved in ./tests-14\checkpoint-267\sst-2-14\pytorch_adapter.bin
Configuration saved in ./tests-14\checkpoint-267\sst-2-14\head_config.json
Module weights saved in ./tests-14\checkpoint-267\sst-2-14\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, tokens, sentence.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model c
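
Training is configured with `EarlyStoppingCallback(early_stopping_patience = 3)` and `metric_for_best_model = 'eval_loss'`. As a rough illustration of what the patience setting means, the sketch below applies a simplified rule (stop once three consecutive evaluations fail to produce a new best loss) to the validation losses logged above. It is a plain-Python model of the idea, not the library's implementation, and `first_stop_epoch` is a hypothetical name.

```python
def first_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    evals_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return epoch
    return None


# Validation losses of the ReLU baseline, epochs 1-10, copied from the log.
relu_baseline_losses = [0.651083, 0.445808, 0.411739, 0.401076, 0.407494,
                        0.396454, 0.393223, 0.393133, 0.391214, 0.39113]
print(first_stop_epoch(relu_baseline_losses))

# Made-up losses that stop improving after epoch 2.
print(first_stop_epoch([0.50, 0.40, 0.41, 0.42, 0.43]))
```

In the ten logged epochs the validation loss fails to improve only once (epoch 5), so under this rule the run would not have stopped within the part of the log shown here.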

In [37]:
""" Evaluate the above pre-trained adapter on SST-2 test data """

#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

model2.add_classification_head(
    'sst-2-14',
    num_labels=1
)
config = AdapterConfig.load("./tests-14/sst-2-14/adapter_config.json")
#adapter_name = model2.load_adapter("./tests-10/sst-2-10", config=config)
adapter_name = model2.load_adapter("./tests-14/sst-2-14", config=config, model_name=BERT_LOCAL_PATH)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

loading configuration file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\config.json
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\added_tokens.json. We won't load it.
Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\special_tokens_map.json. We won't load it.
loa

var-sized adapter accuracy on SST-2 test data:  0.8447963800904977


In [5]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Variable adapter sizes
# bigger adapter size (smaller reduction factor) for higher layers.
# Adapters are added to layers 9-11 only; layers 0-8 are left out via leave_out.
# reduction_factor (int or Mapping) – Either an integer specifying the reduction factor for all layers 
# or a mapping specifying the reduction_factor for individual layers. 
# If not all layers are represented in the mapping a default value should be given e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", reduction_factor={'11':12,'10':16,'9':24}, leave_out=[0,1,2,3,4,5,6,7,8])
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-15",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-15",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter module is to freeze all weights in the model except for those of the adapter.
# Calling the train_adapter() method disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-15"])
model2.set_active_adapters("sst-2-15")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label column must be formatted as long (CrossEntropyLoss expects integer class labels).
# Otherwise it stays float, even with the cast to int in tokenize_function.
# Check this out: https://discuss.huggingface.co/t/dataset-set-format/1961
# Named format_args so that the built-in format() is not shadowed.
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-15", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=40, #20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

Some weights of the model checkpoint at D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased were not used when initializing BertModelWithHeads: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1109, 2977, 1110, 17348, 1106, 1129, 1103, 6880, 5944, 112, 188, 1207, 169, 169, 17727, 112, 112, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 7296, 20452, 24156, 11819, 7582, 9146, 117, 2893, 118, 140, 15554, 1181, 3605, 8732, 3263, 1137, 6536, 17979, 1233, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running training *****
  Num examples = 8544
  Num Epochs = 40
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 10680


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6846,0.650928,0.710263
2,0.6689,0.616813,0.742961
3,0.6459,0.573311,0.748411
4,0.6115,0.517451,0.762035
5,0.5555,0.471594,0.77475
6,0.5125,0.444654,0.796549
7,0.4785,0.436021,0.805631
8,0.4637,0.437372,0.801998
9,0.4607,0.433105,0.807448
10,0.4533,0.431662,0.805631


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-15\checkpoint-267
Configuration saved in ./tests-15\checkpoint-267\sst-2-15\adapter_config.json
Module weights saved in ./tests-15\checkpoint-267\sst-2-15\pytorch_adapter.bin
Configuration saved in ./tests-15\checkpoint-267\sst-2-15\head_config.json
Module weights saved in ./tests-15\checkpoint-267\sst-2-15\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model c

In [6]:
""" Evaluate the above pre-trained adapter on SST-2 test data """

#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

model2.add_classification_head(
    'sst-2-15',
    num_labels=1
)
config = AdapterConfig.load("./tests-15/sst-2-15/adapter_config.json")
#adapter_name = model2.load_adapter("./tests-10/sst-2-10", config=config)
adapter_name = model2.load_adapter("./tests-15/sst-2-15", config=config, model_name=BERT_LOCAL_PATH)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
accuracy = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    
    # 0=negative; 1=positive
    truth = output_labels[round(dataset["test"][i]['label'], 0)]
    result = sentiment_analysis2(dataset["test"][i]['sentence'])[0]
    
    if result['label'] == truth:
        correct_count += 1
    
    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')
    
accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

loading configuration file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\config.json
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\added_tokens.json. We won't load it.
Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\special_tokens_map.json. We won't load it.
loa

var-sized adapter accuracy on SST-2 test data:  0.8316742081447964
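
To relate this configuration to the parameter-count questions in the problem statement, the sketch below estimates the number of adapter parameters for a given `reduction_factor` and `leave_out` setting. It rests on several assumptions: the bottleneck size is `hidden_size // reduction_factor`; each adapted layer holds two bottleneck adapters (one after attention, one after the feed-forward block, as in Houlsby et al., 2019); and each adapter has 2md + d + m weights and biases for hidden size d and bottleneck size m. Layer-norm parameters and the classification head are ignored, and the function names are hypothetical.

```python
def adapter_params(hidden_size, reduction_factor):
    """Weights and biases of one bottleneck adapter (down- and up-projection)."""
    m = hidden_size // reduction_factor
    return 2 * hidden_size * m + hidden_size + m


def houlsby_params(reduction_factor, num_layers=12, hidden_size=768,
                   leave_out=(), adapters_per_layer=2):
    """Estimated adapter parameters for an int or per-layer mapping."""
    total = 0
    for layer in range(num_layers):
        if layer in leave_out:
            continue  # no adapter in this layer
        if isinstance(reduction_factor, dict):
            rf = reduction_factor.get(str(layer), reduction_factor.get("default"))
            if rf is None:
                raise KeyError(f"no reduction factor for layer {layer}")
        else:
            rf = reduction_factor
        total += adapters_per_layer * adapter_params(hidden_size, rf)
    return total


fixed = houlsby_params(16)
variable = houlsby_params({'11': 12, '10': 16, '9': 24}, leave_out=range(9))
print(fixed, variable, variable / fixed)
```

Under these assumptions, the three-layer configuration trained above uses a quarter of the adapter parameters of a fixed reduction factor of 16 applied to all twelve layers.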


# Layer Normalization Before and After Experiment

In [7]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Fixed adapter size (reduction_factor=16 for all layers),
# with layer normalization enabled both before and after the adapter (ln_before, ln_after).
# reduction_factor (int or Mapping) – Either an integer specifying the reduction factor for all layers 
# or a mapping specifying the reduction_factor for individual layers. 
# If not all layers are represented in the mapping a default value should be given e.g. {'1': 8, '6': 32, 'default': 16}
config2 = AdapterConfig.load("houlsby", reduction_factor=16, ln_after=True, ln_before=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-16",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-16",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter module is to freeze all weights in the model except for those of the adapter.
# Calling the train_adapter() method disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-16"])
model2.set_active_adapters("sst-2-16")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label column must be formatted as long (CrossEntropyLoss expects integer class labels).
# Otherwise it stays float, even with the cast to int in tokenize_function.
# Check this out: https://discuss.huggingface.co/t/dataset-set-format/1961
# Named format_args so that the built-in format() is not shadowed.
format_args = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
tokenized_datasets.set_format(**format_args, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-16", 
    do_train=True,
    do_eval=True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    learning_rate=1e-5, 
    num_train_epochs=40, #20,
    weight_decay = 0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    
    # for early stopping
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

loading configuration file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\config.json
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\pytorch_model.bin
Some weights of the model checkpoint at D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased were not used when initializing BertMo

BertModelWithHeads(
  (bert): BertModel(
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo

  0%|          | 0/9 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1109, 2977, 1110, 17348, 1106, 1129, 1103, 6880, 5944, 112, 188, 1207, 169, 169, 17727, 112, 112, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 7296, 20452, 24156, 11819, 7582, 9146, 117, 2893, 118, 140, 15554, 1181, 3605, 8732, 3263, 1137, 6536, 17979, 1233, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 1.0, 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .", 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running training *****
  Num examples = 8544
  Num Epochs = 40
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 10680
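
The step counts in the log above follow directly from the dataset size and batch size. A minimal sketch of that arithmetic, assuming one optimization step per batch (the log reports gradient accumulation steps = 1):

```python
import math

# Values copied from the training log above.
num_examples = 8544
batch_size = 32
num_epochs = 40

# One optimization step per batch, since gradient accumulation steps = 1.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 267
print(total_steps)      # 10680
```

Under this reading, the `checkpoint-267` directory that appears in the save messages corresponds to the end of the first epoch.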


Epoch,Training Loss,Validation Loss,Accuracy
1,0.7012,0.687025,0.555858
2,0.6933,0.662794,0.614896
3,0.6833,0.616834,0.682107
4,0.6623,0.606182,0.681199
5,0.6445,0.596259,0.702089
6,0.6327,0.576442,0.70663
7,0.6227,0.577296,0.707539
8,0.6106,0.571125,0.716621
9,0.6018,0.576025,0.713896
10,0.5999,0.579731,0.712988
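
To read the table above against the early-stopping setup (`metric_for_best_model='eval_loss'`, patience of 3), the following is a simplified sketch of a patience counter. It is not the library's implementation; `simulate_patience` is a hypothetical helper and only the ten epochs shown in the table are used.

```python
# Validation losses per epoch, copied from the table above.
val_losses = [0.687025, 0.662794, 0.616834, 0.606182, 0.596259,
              0.576442, 0.577296, 0.571125, 0.576025, 0.579731]

def simulate_patience(losses, patience):
    """Return (best_epoch, stop_epoch); stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None

best_epoch, stop_epoch = simulate_patience(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 8 None
```

Within the rows shown, the lowest validation loss is at epoch 8 and the counter has only reached 2 by epoch 10, so a stop would need at least one more epoch without improvement.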


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-16\checkpoint-267
Configuration saved in ./tests-16\checkpoint-267\sst-2-16\adapter_config.json
Module weights saved in ./tests-16\checkpoint-267\sst-2-16\pytorch_adapter.bin
Configuration saved in ./tests-16\checkpoint-267\sst-2-16\head_config.json
Module weights saved in ./tests-16\checkpoint-267\sst-2-16\pytorch_model_head.bin
Configuration saved in ./tests-16\checkpoint-267\sst-2-16\head_config.json
Module weights saved in ./tests-16\checkpoint-267\sst-2-16\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model c

In [8]:
""" Evaluate the above pre-trained adapter on SST-2 test data """

#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

# Use the same number of labels as in training (2: Negative/Positive)
model2.add_classification_head(
    'sst-2-16',
    num_labels=2
)
config = AdapterConfig.load("./tests-16/sst-2-16/adapter_config.json")
#adapter_name = model2.load_adapter("./tests-10/sst-2-10", config=config)
adapter_name = model2.load_adapter("./tests-16/sst-2-16", config=config, model_name=BERT_LOCAL_PATH)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    example = dataset["test"][i]

    # Labels are stored as floats; round to the nearest class id before the lookup
    truth = output_labels[int(round(example['label']))]
    result = sentiment_analysis2(example['sentence'])[0]

    if result['label'] == truth:
        correct_count += 1

    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')

accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)

Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\added_tokens.json. We won't load it.
Didn't find file D:/cs7643-dl/Project/cs7643-proj-ablation-study/bert-base-cased\special_tokens_map.json. We won't load it.
loa

var-sized adapter accuracy on SST-2 test data:  0.7036199095022625
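
The evaluation loop maps each test label to a class name by rounding it. A small sketch of that mapping, assuming the labels are float scores between 0 and 1 (the scores below are made up for illustration, and `to_class_name` is a hypothetical helper):

```python
output_labels = {0: "Negative", 1: "Positive"}

def to_class_name(score):
    # Round the float score to the nearest class id, then look up its name.
    return output_labels[int(round(score))]

for score in (0.3, 0.69, 1.0, 0.5):
    print(score, to_class_name(score))

# round(x, 0) returns a float, but the lookup still works because 1.0 == 1
# and both hash to the same value.
print(output_labels[round(1.0, 0)])  # Positive
```

Python rounds halves to the nearest even integer, so a score of exactly 0.5 ends up in the Negative class under this scheme.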


In [9]:
# Ref: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/text-classification/run_glue_alt.py

# Adapter size is controlled by reduction_factor (int or Mapping): either an integer specifying
# the reduction factor for all layers, or a mapping specifying the reduction factor for individual
# layers (a smaller factor gives a bigger adapter, e.g. for higher layers).
# If not all layers are represented in the mapping, a default value should be given,
# e.g. {'1': 8, '6': 32, 'default': 16}.
# This run uses a single reduction factor of 16 for all layers.
config2 = AdapterConfig.load("houlsby", reduction_factor=16, ln_after=True, ln_before=False)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased", config=config2, num_labels=2)
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

"""
# Add classification head
num_labels = 0
label_list = []
is_regression = dataset["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = datasets["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
"""

# Add classification head
num_labels = 2
label_list = ["Negative", "Positive"]

# Ref: https://docs.adapterhub.ml/prediction_heads.html
model2.add_classification_head(
    "sst-2-17",
    num_labels=num_labels,
    id2label={i: v for i, v in enumerate(label_list)} if num_labels > 0 else None,
)

# add a new adapter
model2.add_adapter(
    "sst-2-17",
    config=config2
)

# Enable adapter training
# The most crucial step when training an adapter module is to freeze all weights in the model except those of the adapter.
# Calling train_adapter() disables training of all weights outside the task adapter.
model2.train_adapter(["sst-2-17"])
model2.set_active_adapters("sst-2-17")
print(model2)


def tokenize_function(batch):
    tokenized_batch = tokenizer(batch['sentence'], padding=True, truncation=True)
    tokenized_batch["label"] = [int(round(num)) for num in batch["label"]]
    return tokenized_batch

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets['train'][0])

# The label must be formatted as long (CrossEntropyLoss requires long targets). Otherwise it stays float, even with the int cast in tokenize_function.
# See: https://discuss.huggingface.co/t/dataset-set-format/1961
format = {'type': 'torch', 'format_kwargs' :{'dtype': torch.long}}
tokenized_datasets.set_format(**format, columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
#print(type(tokenized_datasets['train'][0]['label']))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


"""
print(tokenized_datasets['train'][0])
input_tensor = torch.tensor([
    tokenizer.convert_tokens_to_ids(tokenized_datasets['train'][0]['tokens'])
])
logits = model2(input_tensor)
print(logits)  # two heads for binary classification
print(logits.view(-1, 2))
print()
"""

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./tests-17",
    do_train=True,
    do_eval=True,
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    learning_rate=1e-5,
    num_train_epochs=40,  # upper bound (previously 20); EarlyStoppingCallback may stop sooner
    weight_decay=0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    # for early stopping: track validation loss and reload the best checkpoint at the end
    metric_for_best_model='eval_loss',
    load_best_model_at_end=True
    #label_names = ["Negative", "Positive"]
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()
#trainer.evaluate()
trainer.save_model()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running training *****
  Num examples = 8544
  Num Epochs = 40
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 10680


Epoch,Training Loss,Validation Loss,Accuracy
1,0.7015,0.697624,0.495005
2,0.6973,0.660435,0.609446
3,0.6769,0.595673,0.700272
4,0.6398,0.643782,0.675749
5,0.6251,0.576386,0.711172
6,0.613,0.567321,0.696639
7,0.6016,0.564905,0.715713
8,0.5906,0.560147,0.722979
9,0.5778,0.544346,0.729337
10,0.5678,0.555389,0.730245
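
Because the best checkpoint is chosen by validation loss rather than accuracy, the two criteria can point at different epochs. A short sketch using only the rows of the table above (the tuple layout is just for illustration):

```python
# (epoch, validation loss, accuracy), copied from the table above.
rows = [
    (1, 0.697624, 0.495005),
    (2, 0.660435, 0.609446),
    (3, 0.595673, 0.700272),
    (4, 0.643782, 0.675749),
    (5, 0.576386, 0.711172),
    (6, 0.567321, 0.696639),
    (7, 0.564905, 0.715713),
    (8, 0.560147, 0.722979),
    (9, 0.544346, 0.729337),
    (10, 0.555389, 0.730245),
]

best_by_loss = min(rows, key=lambda r: r[1])  # lowest validation loss
best_by_acc = max(rows, key=lambda r: r[2])   # highest validation accuracy

print("best by eval_loss: epoch", best_by_loss[0])  # epoch 9
print("best by accuracy:  epoch", best_by_acc[0])   # epoch 10
```

Among the epochs shown, selecting on `eval_loss` would keep epoch 9, while the highest validation accuracy is at epoch 10.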


The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model checkpoint to ./tests-17\checkpoint-267
Configuration saved in ./tests-17\checkpoint-267\sst-2-17\adapter_config.json
Module weights saved in ./tests-17\checkpoint-267\sst-2-17\pytorch_adapter.bin
Configuration saved in ./tests-17\checkpoint-267\sst-2-17\head_config.json
Module weights saved in ./tests-17\checkpoint-267\sst-2-17\pytorch_model_head.bin
Configuration saved in ./tests-17\checkpoint-267\sst-2-17\head_config.json
Module weights saved in ./tests-17\checkpoint-267\sst-2-17\pytorch_model_head.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertModelWithHeads.forward` and have been ignored: tree, sentence, tokens.
***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32
Saving model c
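
To relate `reduction_factor` to the number of trainable parameters, here is a rough count for the bottleneck layers only. It is a sketch under stated assumptions: each adapter is a down-projection and an up-projection with biases, there are two adapters per transformer layer (as in the Houlsby architecture), layer-norm and classification-head parameters are ignored, and mapping keys are taken to be layer indices as strings. The helper names are hypothetical.

```python
HIDDEN_SIZE = 768        # "hidden_size" in the BertConfig output in this notebook
NUM_LAYERS = 12          # "num_hidden_layers" in the same output
ADAPTERS_PER_LAYER = 2   # assumption: one after attention, one after feed-forward

def bottleneck_params(hidden_size, reduction_factor):
    """Weights and biases of one down-projection plus one up-projection."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck
    up = bottleneck * hidden_size + hidden_size
    return down + up

def total_adapter_params(reduction_factor):
    """Accepts an int, or a per-layer mapping with a 'default' entry."""
    total = 0
    for layer in range(NUM_LAYERS):
        if isinstance(reduction_factor, dict):
            rf = reduction_factor.get(str(layer), reduction_factor["default"])
        else:
            rf = reduction_factor
        total += ADAPTERS_PER_LAYER * bottleneck_params(HIDDEN_SIZE, rf)
    return total

print(total_adapter_params(16))                                # 1789056
print(total_adapter_params({'1': 8, '6': 32, 'default': 16}))  # 1862832
```

With a reduction factor of 16 the bottleneck has 768 / 16 = 48 units. Lowering the factor for one layer enlarges that layer's adapters, and raising it shrinks them, which is how a per-layer mapping trades parameters between lower and higher layers.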

In [11]:
""" Evaluate the above pre-trained adapter on SST-2 test data """

#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(BERT_LOCAL_PATH, local_files_only=True)
#model2 = AutoModelWithHeads.from_pretrained("bert-base-uncased")
model2 = AutoModelWithHeads.from_pretrained(BERT_LOCAL_PATH, local_files_only=True, num_labels=2)

# Use the same number of labels as in training (2: Negative/Positive)
model2.add_classification_head(
    'sst-2-17',
    num_labels=2
)
config = AdapterConfig.load("./tests-17/sst-2-17/adapter_config.json")
#adapter_name = model2.load_adapter("./tests-10/sst-2-10", config=config)
adapter_name = model2.load_adapter("./tests-17/sst-2-17", config=config, model_name=BERT_LOCAL_PATH)
model2.set_active_adapters(adapter_name)

label_list = ["Negative", "Positive"]

sentiment_analysis2 = pipeline(task="sentiment-analysis", model=model2, tokenizer=tokenizer)

# 0=negative; 1=positive
output_labels = {0: "Negative", 1: "Positive"}
correct_count = 0
test_data_size = dataset["test"].num_rows
for i in range(test_data_size):
    example = dataset["test"][i]

    # Labels are stored as floats; round to the nearest class id before the lookup
    truth = output_labels[int(round(example['label']))]
    result = sentiment_analysis2(example['sentence'])[0]

    if result['label'] == truth:
        correct_count += 1

    print('Progress: %s / %s' % (i+1, test_data_size), end='\r')

accuracy = correct_count / test_data_size
print("var-sized adapter accuracy on SST-2 test data: ", accuracy)


var-sized adapter accuracy on SST-2 test data:  0.7226244343891403
