# INF5007 Neural Networks / Neuroniniai tinklai
**LAB9**

## HOMEWORK TASK

You can use transformers in your individual work.

## Introduction to the Hugging Face API

Full course on how to use the API: https://huggingface.co/course/chapter0/1?fw=pt

In [3]:
from transformers import pipeline

## The Hugging Face API
The Hugging Face API was created as a space to train, load, and share large deep learning NLP models. The sections below show several of the tasks you can solve with the models provided in the library.

### Sentiment analysis
Model: https://huggingface.co/assemblyai/distilbert-base-uncased-sst2?text=I+like+you.+I+love+you

In [4]:
sentiment_pipeline = pipeline("sentiment-analysis")

data = ["I love you", "I hate you"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]
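The `score` in the output above is a probability for the predicted label. A minimal sketch of how such a score can be derived from a classifier's raw outputs, assuming the usual softmax over two logits; the logit values below are made up for illustration and do not come from the model:

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, in the order [NEGATIVE, POSITIVE].
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.2, 4.7]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
result = {"label": labels[best], "score": probs[best]}
print(result)
```

The larger the gap between the two logits, the closer the reported score gets to 1.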

### Text generation
Model: https://huggingface.co/microsoft/DialoGPT-small?text=Hey+my+name+is+Mariama%21+How+are+you%3F

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

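# Chat for 5 turns; the whole conversation so far is fed back to the model each turn
for step in range(5):
    # Encode the new user input and append the end-of-sequence token
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # Append the new input to the chat history (there is no history on the first turn)
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
    # Generate a response; the output contains the full history plus the new reply
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    # Decode and print only the newly generated tokens
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))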


DialoGPT: I'm back!
DialoGPT: Hi back
DialoGPT: Hi back
DialoGPT: Hi water
DialoGPT: I like turtles
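The loop above keeps the whole conversation as one growing sequence of token ids and slices off only the newly generated part for printing. A minimal sketch of that bookkeeping with plain Python lists; `fake_generate`, `EOS`, and all token ids are hypothetical stand-ins, not part of the `transformers` API:

```python
EOS = 0  # hypothetical end-of-sequence token id

def fake_generate(input_ids):
    """Stand-in for model.generate: returns the input followed by a fixed reply."""
    reply = [7, 8, EOS]
    return input_ids + reply

chat_history = []
replies = []
for user_turn in ([1, 2], [3, 4, 5]):
    new_input = user_turn + [EOS]            # user tokens plus end-of-sequence
    bot_input = chat_history + new_input     # history so far plus the new turn
    chat_history = fake_generate(bot_input)  # output holds the input and the reply
    reply = chat_history[len(bot_input):]    # keep only the newly generated tokens
    replies.append(reply)
    print("reply tokens:", reply, "| history length:", len(chat_history))
```

Because the history grows every turn, long conversations eventually hit the `max_length` limit passed to `generate`.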


### Masked language modeling
Model: https://huggingface.co/bert-base-multilingual-cased?text=I+like+%5BMASK%5D

In [8]:
unmasker = pipeline('fill-mask', model='bert-base-multilingual-cased')
unmasker("I live in [MASK]")

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.03385859727859497,
  'token': 10829,
  'token_str': 'London',
  'sequence': 'I live in London'},
 {'score': 0.03125724568963051,
  'token': 11619,
  'token_str': 'Italy',
  'sequence': 'I live in Italy'},
 {'score': 0.023685377091169357,
  'token': 12775,
  'token_str': 'Germany',
  'sequence': 'I live in Germany'},
 {'score': 0.020019808784127235,
  'token': 12962,
  'token_str': 'live',
  'sequence': 'I live in live'},
 {'score': 0.016861874610185623,
  'token': 18744,
  'token_str': 'Moscow',
  'sequence': 'I live in Moscow'}]
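Each entry in the output above pairs a candidate token with its probability and the completed sentence, ordered by score. A minimal sketch of that ranking step, assuming the probabilities for the masked position are already known; the tiny vocabulary and its values are made up, and `fill_mask` is a hypothetical helper, not the pipeline's implementation:

```python
import heapq

# Hypothetical probabilities for the masked position over a tiny vocabulary.
mask_probs = {"London": 0.034, "Italy": 0.031, "Germany": 0.024, "live": 0.020, "the": 0.001}

def fill_mask(template, probs, top_k=3):
    """Return the top_k most probable fillers for [MASK], best first."""
    best = heapq.nlargest(top_k, probs.items(), key=lambda item: item[1])
    return [
        {"score": score, "token_str": token, "sequence": template.replace("[MASK]", token)}
        for token, score in best
    ]

predictions = fill_mask("I live in [MASK]", mask_probs)
for prediction in predictions:
    print(prediction)
```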

### Text embeddings

In [9]:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")

text = "This is the text I'd like to train my deep learning model with"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3543,  0.0795,  0.4237,  ...,  0.7303,  0.2264,  0.1486],
         [-0.3508, -0.4988,  0.2034,  ...,  0.0907, -0.0760, -0.4666],
         [-0.2361, -0.4036, -0.0100,  ...,  0.4503, -0.2976,  0.1432],
         ...,
         [-0.1339, -0.1011, -0.1279,  ...,  0.5724, -0.2574, -0.2440],
         [ 0.1079,  0.1582, -0.0425,  ...,  0.8293,  0.1642, -0.0081],
         [ 0.3190,  0.0924,  0.5734,  ...,  0.2694,  0.3159,  0.0590]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 4.4070e-01, -2.4586e-01,  2.5098e-01, -3.7415e-01, -1.4130e-01,
          4.9311e-01,  3.5765e-01,  2.8654e-01, -5.3179e-01,  2.5917e-01,
         -2.0071e-01, -2.9737e-01, -2.2273e-01, -5.7293e-02,  1.8640e-01,
         -2.4702e-01,  7.8171e-01,  2.8670e-01,  3.2563e-01, -2.7157e-01,
         -9.9564e-01, -2.7828e-01, -5.3223e-01, -3.3896e-01, -4.7615e-01,
          6.3296e-02, -1.7648e-01,  2.6515e-01,  4.7409e-01, -3.262
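`last_hidden_state` holds one vector per token, so a single sentence vector needs some form of pooling. One common choice, an assumption here rather than something this notebook does, is to average the token vectors and compare sentences by cosine similarity. A minimal sketch with made-up 3-dimensional vectors (the real model uses much larger ones):

```python
import math

def mean_pool(token_vectors):
    """Average a list of token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical token embeddings for two short sentences.
sentence_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
sentence_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]

vec_a = mean_pool(sentence_a)
vec_b = mean_pool(sentence_b)
similarity = cosine_similarity(vec_a, vec_b)
print(vec_a, vec_b, similarity)
```

With real batches, padded positions should be excluded from the average by using the attention mask.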

## Fine-tuning your own model

We will use the IMDB movie reviews dataset to fine-tune a DistilBERT model for sentiment analysis. The dataset can be found here: https://huggingface.co/datasets/imdb

In [11]:
# Fine-tuning requires a lot of computing resources; check that a CUDA GPU is available
import torch
torch.cuda.is_available()

True

#### Install the datasets library

In [13]:
#!pip install datasets transformers huggingface_hub

#### Loading the dataset

In [14]:
from datasets import load_dataset
imdb = load_dataset("imdb")

Downloading: 4.31kB [00:00, 2.66MB/s]                   
Downloading: 2.17kB [00:00, 1.49MB/s]                   


Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /home/milita/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Downloading: 100%|██████████| 84.1M/84.1M [02:07<00:00, 659kB/s] 
                                           

Dataset imdb downloaded and prepared to /home/milita/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 1524.09it/s]


#### Train / Test split

In [15]:
# Shuffle with a fixed seed, then keep the first 3000 training and 300 test examples
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
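Shuffling with a fixed seed before selecting means the same subset is drawn on every run. A minimal sketch of the idea using only the standard library; `shuffled_subset` is a hypothetical helper, and because `datasets` uses its own random generator this will not reproduce the exact rows selected above:

```python
import random

def shuffled_subset(n_rows, n_keep, seed):
    """Shuffle row indices with a fixed seed and keep the first n_keep."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local generator, global state untouched
    return indices[:n_keep]

# The IMDB training split has 25,000 reviews.
first = shuffled_subset(25000, 3000, seed=42)
second = shuffled_subset(25000, 3000, seed=42)
print(first[:5], first == second)
```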

#### Tokenize the text data

In [16]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    # Truncate reviews that exceed the model's maximum input length
    return tokenizer(examples["text"], truncation=True)

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 20.9kB/s]
Downloading: 100%|██████████| 483/483 [00:00<00:00, 547kB/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 369kB/s] 
Downloading: 100%|██████████| 455k/455k [00:00<00:00, 503kB/s]  
100%|██████████| 3/3 [00:01<00:00,  2.31ba/s]
100%|██████████| 1/1 [00:00<00:00,  8.82ba/s]
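The tokenizer call above does not pad, so the data collator pads each batch to the length of its longest sequence at training time. A minimal sketch of that dynamic padding; `pad_batch` is a hypothetical helper and the token ids are made up for illustration:

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized reviews of different lengths.
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch)
```

Padding per batch rather than to one global length keeps batches of short reviews small.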


#### Load the model for training

In [17]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Downloading: 100%|██████████| 256M/256M [06:58<00:00, 640kB/s]  
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly init

#### Defining the evaluation metrics

In [18]:
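import numpy as np
from datasets import load_metric

# Load the metrics once instead of on every evaluation call.
# Note: newer releases of `datasets` removed `load_metric`; there, use
# `evaluate.load` from the separate `evaluate` package instead.
load_accuracy = load_metric("accuracy")
load_f1 = load_metric("f1")

def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the reference labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}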
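To make the two metrics concrete, here is a minimal sketch that computes accuracy and binary F1 by hand from logits and labels. The helper names and the toy data are made up for illustration; the notebook itself relies on the library metrics:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)  # true positives
    fp = sum(1 for p, y in pairs if p == positive and y != positive)  # false positives
    fn = sum(1 for p, y in pairs if p != positive and y == positive)  # false negatives
    accuracy = sum(1 for p, y in pairs if p == y) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Hypothetical logits for five reviews, columns ordered [negative, positive].
toy_logits = [[0.1, 2.0], [1.5, -0.3], [0.2, 0.9], [2.2, 0.1], [0.4, 0.6]]
toy_labels = [1, 0, 0, 0, 1]
metrics = accuracy_and_f1(toy_logits, toy_labels)
print(metrics)
```

F1 is the harmonic mean of precision and recall, so it penalizes a model that trades one for the other.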

#### Define the model training args

In [21]:
from transformers import TrainingArguments, Trainer
 
repo_name = "finetuning-sentiment-model-3000-samples"
 
training_args = TrainingArguments(
   output_dir=repo_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
)
 
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
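The arguments above set only the initial `learning_rate`. Assuming the `Trainer` defaults apply (a linear schedule with no warmup, since neither is overridden here), the rate decays from 2e-5 to zero over the run. A minimal sketch of such a schedule; `linear_lr` is a hypothetical helper written for illustration, and the step count of 376 is taken from the training log in the next section:

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 376  # reported as "Total optimization steps" in the training log
schedule = [linear_lr(step, total_steps) for step in (0, 188, 376)]
print(schedule)
```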


#### Train the model

In [22]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 3000
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 376
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mmilasong[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.13.5 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


 50%|████▉     | 187/376 [00:28<00:30,  6.29it/s]Saving model checkpoint to finetuning-sentiment-model-3000-samples/checkpoint-188
Configuration saved in finetuning-sentiment-model-3000-samples/checkpoint-188/config.json
Model weights saved in finetuning-sentiment-model-3000-samples/checkpoint-188/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment-model-3000-samples/checkpoint-188/tokenizer_config.json
Special tokens file saved in finetuning-sentiment-model-3000-samples/checkpoint-188/special_tokens_map.json
100%|█████████▉| 375/376 [00:58<00:00,  6.43it/s]Saving model checkpoint to finetuning-sentiment-model-3000-samples/checkpoint-376
Configuration saved in finetuning-sentiment-model-3000-samples/checkpoint-376/config.json
Model weights saved in finetuning-sentiment-model-3000-samples/checkpoint-376/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment-model-3000-samples/checkpoint-376/tokenizer_config.json
Special tokens file saved in finetuning-se

{'train_runtime': 62.3944, 'train_samples_per_second': 96.163, 'train_steps_per_second': 6.026, 'train_loss': 0.29486384290329953, 'epoch': 2.0}





TrainOutput(global_step=376, training_loss=0.29486384290329953, metrics={'train_runtime': 62.3944, 'train_samples_per_second': 96.163, 'train_steps_per_second': 6.026, 'train_loss': 0.29486384290329953, 'epoch': 2.0})
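The numbers in the training log can be cross-checked by hand. A short sketch of the arithmetic, using only values reported above (3000 examples, batch size 16, 2 epochs, a runtime of about 62.39 s) and assuming a single device with no gradient accumulation, as the log indicates:

```python
import math

num_examples, batch_size, num_epochs = 3000, 16, 2
train_runtime = 62.3944  # seconds, as reported in the log

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs
# save_strategy="epoch" writes one checkpoint at the end of each epoch
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps, checkpoint_steps)
print(round(steps_per_second, 2), round(samples_per_second, 2))
```

The step counts match the `checkpoint-188` and `checkpoint-376` directories in the log, and the throughput values agree with the reported ones up to rounding of the runtime.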

#### Evaluate the model

In [36]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text.
***** Running Evaluation *****
  Num examples = 300
  Batch size = 16
100%|██████████| 19/19 [00:03<00:00,  5.70it/s]


{'eval_loss': 0.3351273238658905,
 'eval_accuracy': 0.87,
 'eval_f1': 0.8729641693811074,
 'eval_runtime': 3.4058,
 'eval_samples_per_second': 88.085,
 'eval_steps_per_second': 5.579,
 'epoch': 2.0}