<a href="https://colab.research.google.com/github/tiginamaria/ITMO_NLP/blob/main/hw4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Soft deadline: `30.03.2022 23:59`

In this homework you will understand the fine-tuning procedure and get acquainted with Huggingface Datasets library

In [2]:
! pip install datasets
! pip install transformers

Collecting datasets
  Downloading datasets-2.0.0-py3-none-any.whl (325 kB)
[?25l[K     |█                               | 10 kB 18.0 MB/s eta 0:00:01[K     |██                              | 20 kB 13.7 MB/s eta 0:00:01[K     |███                             | 30 kB 7.5 MB/s eta 0:00:01[K     |████                            | 40 kB 6.7 MB/s eta 0:00:01[K     |█████                           | 51 kB 4.5 MB/s eta 0:00:01[K     |██████                          | 61 kB 5.3 MB/s eta 0:00:01[K     |███████                         | 71 kB 5.6 MB/s eta 0:00:01[K     |████████                        | 81 kB 4.6 MB/s eta 0:00:01[K     |█████████                       | 92 kB 5.0 MB/s eta 0:00:01[K     |██████████                      | 102 kB 4.3 MB/s eta 0:00:01[K     |███████████                     | 112 kB 4.3 MB/s eta 0:00:01[K     |████████████                    | 122 kB 4.3 MB/s eta 0:00:01[K     |█████████████                   | 133 kB 4.3 MB/s eta 0:00:01[

For our goals we will use [Datasets](https://huggingface.co/docs/datasets/) library and take `yahoo_answers_topics` dataset - the task of this dataset is to divide documents on 10 topic categories. More detiled information can be found on the dataset [page](https://huggingface.co/datasets/viewer/).


In [23]:
from datasets import load_dataset

In [24]:
dataset = load_dataset('yahoo_answers_topics') # the result is a dataset dictionary of train and test splits in this case

Reusing dataset yahoo_answers_topics (/root/.cache/huggingface/datasets/yahoo_answers_topics/yahoo_answers_topics/1.0.0/b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902)


  0%|          | 0/2 [00:00<?, ?it/s]

In [25]:
dataset['train'] = dataset['train'].shuffle(seed=42).select(range(5000))
dataset['test'] = dataset['test'].shuffle(seed=42).select(range(1000))

In [26]:
from datasets import set_caching_enabled
set_caching_enabled(False)

# Fine-tuning the model** (20 points)

In [27]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel)

import torch
from torch.utils.data import DataLoader
from datasets import load_metric

Fine-tuning procedure on the end task consists of adding additional layers on the top of the pre-trained model. The resulting model can be tuned fully (passing gradients through the all model) or partially.

**Task**: 
- load tokenizer and model
- look at the predictions of the model as-is before any fine-tuning


```
- Why don't you ask [MASK]?
- What is [MASK]
- Let's talk about [MASK] physics
```

- convert `best_answer` to the input tokens (supporting function for dataset is provided below) 

```
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```

- define optimizer, sheduler (optional)
- fine-tune the model (write the training loop), plot the loss changes and measure results in terms of weighted F1 score
- get the masked word prediction (sample sentences above) on the fine-tuned model, why the results as they are and what should be done in order to change that (write down your answer)
- Tune the training hyperparameters (and write down your results).

**Tips**:
- The easiest way to get predictions is to use transformers `pipeline` function 
- Do not forget to set `num_labels` parameter, when initializing the model
- To convert data to batches use `DataLoader`
- Even the `small` version of Electra can be long to train, so you can take data sample (>= 5000 and set seed for reproducibility)
- You may want to try freezing (do not update the pretrained model weights) all the layers exept the ones for classification, in that case use:


```
for param in model.electra.parameters():
      param.requires_grad = False
```


In [28]:
GENERATOR_MODEL_NAME = "google/electra-small-generator"
GENERATOR_TOKENIZER_NAME = "google/electra-small-generator"
DISCRIMINATOR_MODEL_NAME = "google/electra-small-discriminator"
DISCRIMINATOR_TOKENIZER_NAME = "google/electra-small-discriminator"

In [29]:
g_model = ElectraForMaskedLM.from_pretrained(GENERATOR_MODEL_NAME)
g_tokenizer = ElectraTokenizer.from_pretrained(GENERATOR_TOKENIZER_NAME)

loading configuration file https://huggingface.co/google/electra-small-generator/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ddf7554779ef5bd660812cf3b6c92a66e14e307bae0f8582015b43ce8f8de85c.e50e2a54975f5ef36835643600664f71c63e7f570a08222c48829a8d8e327dca
Model config ElectraConfig {
  "architectures": [
    "ElectraForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "transformers_version": "4.17.0",
  "type_vocab_size": 2,
  "use_cache": true,


In [66]:
d_model = ElectraForSequenceClassification.from_pretrained(DISCRIMINATOR_MODEL_NAME, num_labels=10)
d_tokenizer = ElectraTokenizer.from_pretrained(DISCRIMINATOR_TOKENIZER_NAME)

loading configuration file https://huggingface.co/google/electra-small-discriminator/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ca13c16218c6780ec76753d3afa19fcb7cc759e3f63ee87e441562d374762b3d.3dd1921e571dfa18c0bdaa17b9b38f111097812281989b1cb22263738e66ef73
Model config ElectraConfig {
  "architectures": [
    "ElectraForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9"
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,

In [31]:
fill_mask = pipeline(
    "fill-mask",
    model=g_model,
    tokenizer=g_tokenizer
)

In [32]:
test_samples = [
                "Why don't you ask [MASK]?", 
                "What is [MASK]", 
                "Let's talk about [MASK] physics"
                ]

for test_sample in test_samples:
  print(fill_mask(test_sample)[0]['sequence'])

why don't you ask me?
what is?
let's talk about quantum physics


In [64]:
from sklearn.metrics import f1_score

def model_f1_score(data, model):
  batch_size = 32
  batches = [data[k * batch_size, (k + 1) * batch_size] for k in range(data.shape[0] // batch_size)]

  f1 = []
  for batch in batches:
      with torch.no_grad():
          input_ids = torch.tensor(batch['input_ids']).to('cuda')
          attention_mask = torch.tensor(batch['attention_mask']).to('cuda')
          token_type_ids = torch.tensor(batch['token_type_ids']).to('cuda')
          labels = torch.tensor(batch['labels']).to('cuda')

          outputs = model(input_ids, attention_mask, token_type_ids)
          logits = outputs['logits']
          preds = torch.argmax(logits, 1)
          f1s = f1_score(labels.cpu().data, preds.cpu(), average='weighted')
          f1.append(f1s)
          
  return sum(f1) / len(f1)

In [33]:
def tokenize_function(examples):
    return d_tokenizer(examples["best_answer"], padding="max_length", truncation=True)

d_tokenized_datasets = dataset.map(tokenize_function, batched=True)

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [34]:
d_tokenized_datasets = d_tokenized_datasets.remove_columns(['question_title', 
                                                          'question_content',
                                                          'best_answer'
                                                          ])
d_tokenized_datasets = d_tokenized_datasets.rename_column('topic', 'labels')

In [36]:
d_tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [68]:
model_f1_score(d_tokenized_datasets['test'], d_model.to('cuda'))

0.13440860215053763

In [37]:
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    logits, labels = logits.cpu().data, labels.cpu().data

    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_score(y_true=labels, y_pred=logits)
    recall = recall_score(y_true=labels, y_pred=logits)
    precision = precision_score(y_true=labels, y_pred=logits)
    f1 = f1_score(y_true=labels, y_pred=logits)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

In [39]:
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        labels = inputs.get('labels')
        logits = outputs.get('logits')

        logits = torch.argmax(logits, 1)

        loss_fct = torch.nn.CrossEntropyLoss(reduction='none')
        loss = loss_fct(logits, labels)

        return (loss, outputs) if return_outputs else loss

data_collator = DataCollatorWithPadding(tokenizer=d_tokenizer)

training_args = TrainingArguments(
    "test-trainer",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
)

trainer = Trainer(
    model=d_model,
    args=training_args,
    train_dataset=d_tokenized_datasets['train'],
    eval_dataset=d_tokenized_datasets['test'],
    tokenizer=d_tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

result = trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: id. If id are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3125


Epoch,Training Loss,Validation Loss
1,2.0991,1.711269
2,1.6483,1.486377
3,1.4346,1.451521
4,1.1395,1.417876
5,1.0284,1.428418


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: id. If id are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkp

In [40]:
result

TrainOutput(global_step=3125, training_loss=1.419865966796875, metrics={'train_runtime': 1225.1066, 'train_samples_per_second': 20.406, 'train_steps_per_second': 2.551, 'total_flos': 735648921600000.0, 'train_loss': 1.419865966796875, 'epoch': 5.0})

In [41]:
d_model.eval()

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (embeddings_project): Linear(in_features=128, out_features=256, bias=True)
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0): ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=256, out_features=256, bias=True)
              (key): Linear(in_features=256, out_features=256, bias=True)
              (value): Linear(in_features=256, out_features=256, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_

In [65]:
model_f1_score(d_tokenized_datasets['test'], d_model)

0.618279569892473

F1-score before fine tuning was 0.13 while after 0.61 which is much more better.