<a href="https://colab.research.google.com/github/tiginamaria/ITMO_NLP/blob/main/hw4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Soft deadline: `30.03.2022 23:59`

In this homework you will learn the fine-tuning procedure and get acquainted with the Hugging Face Datasets library.

In [None]:
! pip install datasets
! pip install transformers

For this we will use the [Datasets](https://huggingface.co/docs/datasets/) library and the `yahoo_answers_topics` dataset, whose task is to classify documents into 10 topic categories. More detailed information can be found on the dataset [page](https://huggingface.co/datasets/viewer/).


In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('yahoo_answers_topics') # the result is a dataset dictionary of train and test splits in this case

Reusing dataset yahoo_answers_topics (/root/.cache/huggingface/datasets/yahoo_answers_topics/yahoo_answers_topics/1.0.0/b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902)



In [None]:
dataset['train'] = dataset['train'].shuffle(seed=42).select(range(5000))
dataset['test'] = dataset['test'].shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/yahoo_answers_topics/yahoo_answers_topics/1.0.0/b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902/cache-96f801c1cead3588.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/yahoo_answers_topics/yahoo_answers_topics/1.0.0/b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902/cache-7a6d924f5654ac9f.arrow


In [None]:
from datasets import set_caching_enabled
set_caching_enabled(False)

  


# Fine-tuning the model (20 points)

In [None]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel)

import torch
from torch.utils.data import DataLoader
from datasets import load_metric

Fine-tuning on an end task consists of adding extra layers on top of the pre-trained model. The resulting model can be tuned fully (passing gradients through the whole model) or partially (updating only some of the layers).

**Task**: 
- load the tokenizer and the model
- look at the masked-word predictions of the model as-is, before any fine-tuning, on these sentences:


```
- Why don't you ask [MASK]?
- What is [MASK]
- Let's talk about [MASK] physics
```

- convert `best_answer` to input tokens (a helper function for the dataset is provided below)

```
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```

- define the optimizer and, optionally, a scheduler
- fine-tune the model (write the training loop), plot how the loss changes, and measure the results in terms of weighted F1 score
- get the masked-word predictions (sample sentences above) from the fine-tuned model; explain why the results are what they are and what should be done to change them (write down your answer)
- tune the training hyperparameters (and write down your results).
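
For the optional scheduler step, a common choice is linear warmup followed by linear decay, which is the schedule `get_scheduler("linear", ...)` from transformers provides. The sketch below reproduces that learning-rate multiplier in plain Python so the shape of the schedule is easy to inspect. The function name `linear_schedule_factor`, the base learning rate, and the warmup length are assumptions made for illustration; the total of 3125 steps assumes 5000 examples, batch size 8, and 5 epochs.

```python
def linear_schedule_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to 1.
        return step / max(1, warmup_steps)
    # Decay: shrink linearly from 1 down to 0 at the final step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-5       # assumed base learning rate
warmup_steps = 100   # assumed warmup length
total_steps = 3125   # assumed: 5000 examples / batch size 8 * 5 epochs

for step in (0, 50, 100, 1000, 3125):
    lr = base_lr * linear_schedule_factor(step, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

In a hand-written training loop the scheduler is usually stepped once per optimizer update, right after `optimizer.step()`.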

**Tips**:
- The easiest way to get predictions is to use the transformers `pipeline` function
- Do not forget to set the `num_labels` parameter when initializing the model
- To convert data to batches, use `DataLoader`
- Even the `small` version of Electra can take a long time to train, so you can take a sample of the data (>= 5000 examples; set a seed for reproducibility)
- You may want to try freezing (not updating the pretrained weights of) all layers except the classification ones; in that case use:


```
for param in model.electra.parameters():
    param.requires_grad = False
```
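
To check that freezing worked, it helps to count the parameters that still require gradients. The sketch below uses a tiny stand-in model rather than Electra so that it runs without downloading weights. The class `TinyClassifier` and its attribute names `backbone` and `classifier` are hypothetical, with `backbone` playing the role of `model.electra`.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: a 'pretrained' backbone plus a classification head."""

    def __init__(self, hidden_size: int = 16, num_labels: int = 10):
        super().__init__()
        self.backbone = nn.Linear(32, hidden_size)  # plays the role of model.electra
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        return self.classifier(torch.relu(self.backbone(x)))


def count_trainable(model: nn.Module) -> int:
    """Number of scalar parameters that will receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = TinyClassifier()
total_before = count_trainable(model)

# Freeze the backbone only; the classification head stays trainable.
for param in model.backbone.parameters():
    param.requires_grad = False

trainable_after = count_trainable(model)
print(f"trainable parameters: {total_before} -> {trainable_after}")

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```

Filtering the optimizer's parameter list is optional, since frozen parameters receive no gradients anyway, but it makes the intent explicit.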


In [None]:
GENERATOR_MODEL_NAME = "google/electra-small-generator"
GENERATOR_TOKENIZER_NAME = "google/electra-small-generator"
DISCRIMINATOR_MODEL_NAME = "google/electra-small-discriminator"
DISCRIMINATOR_TOKENIZER_NAME = "google/electra-small-discriminator"

In [None]:
g_model = ElectraForMaskedLM.from_pretrained(GENERATOR_MODEL_NAME)
g_tokenizer = ElectraTokenizer.from_pretrained(GENERATOR_TOKENIZER_NAME)

In [None]:
d_model = ElectraForSequenceClassification.from_pretrained(DISCRIMINATOR_MODEL_NAME, num_labels=10)
d_tokenizer = ElectraTokenizer.from_pretrained(DISCRIMINATOR_TOKENIZER_NAME)

loading configuration file https://huggingface.co/google/electra-small-discriminator/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ca13c16218c6780ec76753d3afa19fcb7cc759e3f63ee87e441562d374762b3d.3dd1921e571dfa18c0bdaa17b9b38f111097812281989b1cb22263738e66ef73
Model config ElectraConfig {
  "architectures": [
    "ElectraForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9"
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,

In [None]:
fill_mask = pipeline(
    "fill-mask",
    model=g_model,
    tokenizer=g_tokenizer
)

In [None]:
test_samples = [
                "Why don't you ask [MASK]?", 
                "What is [MASK]", 
                "Let's talk about [MASK] physics"
                ]

for test_sample in test_samples:
  print(fill_mask(test_sample)[0]['sequence'])

why don't you ask me?
what is?
let's talk about quantum physics


In [None]:
def tokenize_function(examples):
    return d_tokenizer(examples["best_answer"], padding="max_length", truncation=True)

d_tokenized_datasets = dataset.map(tokenize_function, batched=True)
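
The two arguments in `tokenize_function` control the sequence length: `truncation=True` cuts inputs that exceed the maximum length, and `padding="max_length"` fills shorter ones up to it, with the attention mask marking which positions hold real tokens. The toy function below illustrates the effect on lists of token ids. The name `pad_or_truncate`, the pad id 0, the sample ids, and the maximum length of 6 are assumptions for illustration; this is not the tokenizer's actual implementation, which also takes care of special tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length ids plus a mask with 1 for real tokens and 0 for padding."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([101, 2054, 102], max_length=6)
long = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)
print(short)
print(long)
```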


In [None]:
d_tokenized_datasets = d_tokenized_datasets.remove_columns(['question_title', 
                                                          'question_content',
                                                          'best_answer'
                                                          ])
d_tokenized_datasets = d_tokenized_datasets.rename_column('topic', 'labels')

In [None]:
d_tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

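def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to predicted class ids before scoring.
    predictions = np.argmax(logits, axis=-1)

    # With 10 classes an averaging mode is required; the task asks for weighted F1.
    accuracy = accuracy_score(y_true=labels, y_pred=predictions)
    recall = recall_score(y_true=labels, y_pred=predictions, average="weighted")
    precision = precision_score(y_true=labels, y_pred=predictions, average="weighted")
    f1 = f1_score(y_true=labels, y_pred=predictions, average="weighted")

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}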
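
Weighted F1 is the per-class F1 score averaged with each class weighted by its number of true examples (its support). The plain-Python sketch below spells this out on a made-up three-class example. The function name `weighted_f1` and the toy labels are illustrative; in practice `f1_score(..., average="weighted")` from scikit-learn computes the same quantity.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy labels, made up for illustration.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class support, frequent classes dominate the score; macro averaging would instead treat all classes equally.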

In [None]:
from datasets import load_metric
f1_loss = load_metric('f1')

In [None]:
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer

# class CustomTrainer(Trainer):
#     def compute_loss(self, model, inputs, return_outputs=False):
#         outputs = model(**inputs)
#         labels = inputs.get('labels')
#         logits = outputs.get('logits')

#         logits = torch.argmax(logits, 1)
#         loss = torch.tensor(f1_score(y_true=labels.cpu(), y_pred=logits.cpu(), average='weighted'),
#                             requires_grad=True)

#         # loss_fct = torch.nn.CrossEntropyLoss(reduction='none')
#         # loss = loss_fct(logits, labels)

#         return (loss, outputs) if return_outputs else loss

data_collator = DataCollatorWithPadding(tokenizer=d_tokenizer)

training_args = TrainingArguments(
    "test-trainer",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
)

trainer = Trainer(
    model=d_model,
    args=training_args,
    train_dataset=d_tokenized_datasets['train'],
    eval_dataset=d_tokenized_datasets['test'],
    tokenizer=d_tokenizer,
    data_collator=data_collator,
)

result = trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: id. If id are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3125
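
The step count reported in the log follows from the dataset size, the batch size, and the number of epochs. A small sketch of that arithmetic, assuming a single device and no gradient accumulation as the log indicates (the helper name `total_optimization_steps` is illustrative):

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Number of optimizer updates over the whole run on a single device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(1, batches_per_epoch // grad_accum_steps)
    return updates_per_epoch * num_epochs


# 5000 examples / batch size 8 = 625 batches per epoch, times 5 epochs.
print(total_optimization_steps(num_examples=5000, batch_size=8, num_epochs=5))
```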


Epoch,Training Loss,Validation Loss


In [None]:
d_tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [None]:
train_dataset