## Transformers with PyTorch
* Transformers was originally developed with PyTorch.
* It can also be used with TensorFlow.
* For PyTorch models, Transformers provides the `Trainer` class for training.

## Importing Dataset
* We load a dataset of movie reviews (Rotten Tomatoes).

In [1]:
!pip install -q datasets

In [2]:
from datasets import load_dataset

In [3]:
dataset = load_dataset("rotten_tomatoes")

* The dataset is downloaded once and cached. It contains three splits: train (8,530 examples), validation (1,066) and test (1,066).

## Loading Model
* We load a pre-trained model with a sequence-classification head.

In [4]:
from transformers import AutoModelForSequenceClassification

In [5]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Data Preprocessing
* We will use a tokenizer to preprocess the data.

In [6]:
from transformers import AutoTokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

* The dataset has two columns: `text` and `label`.
* We apply the tokenizer to the `text` column.

In [8]:
# tokenize the "text" column; with batched=True this receives a batch of rows
def tokenize_dataset(batch):
    return tokenizer(batch["text"])

In [9]:
# apply the tokenizer to the text column of every split
dataset = dataset.map(tokenize_dataset, batched=True)

## Padding

In [10]:
from transformers import DataCollatorWithPadding

In [11]:
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
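
* `DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is built, instead of padding the whole dataset to one fixed length.

The sketch below imitates that idea in plain Python, for illustration only. `pad_batch` and the sample token IDs are hypothetical and not part of the Transformers API, and the padding ID 0 is assumed to match the `[PAD]` token of `distilbert-base-uncased`.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# two hypothetical tokenized reviews of different lengths
batch = [[101, 2023, 102], [101, 1037, 2307, 3185, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps short reviews from being stretched to the length of the longest review in the whole dataset.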

## Setting Training Arguments

In [12]:
from transformers import TrainingArguments

In [13]:
# training arguments that hold the training hyperparameters
training_args = TrainingArguments(
    output_dir="my_bert_model",  # directory where checkpoints are saved
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    report_to="none",
)

## Model Training
* The `Trainer` class runs the PyTorch training loop for us.

In [14]:
from transformers import Trainer

In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
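
The `Trainer` above reports only the loss. One possible addition is an accuracy metric passed through the `compute_metrics` argument of `Trainer`. The following is a minimal sketch, assuming NumPy is available. The function name `compute_accuracy` and the sample arrays are illustrative and do not come from the notebook.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Return accuracy from a (logits, labels) pair, as handed over during evaluation."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # pick the highest-scoring class per row
    return {"accuracy": float((predictions == labels).mean())}


# hypothetical logits for four examples and their true labels
sample_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]])
sample_labels = np.array([0, 1, 1, 1])
result = compute_accuracy((sample_logits, sample_labels))
print(result)
```

It could be wired in with `Trainer(..., compute_metrics=compute_accuracy)` and would then be reported by `trainer.evaluate()`.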

In [16]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.4492        |
| 1000 | 0.3830        |
| 1500 | 0.2678        |
| 2000 | 0.2761        |


TrainOutput(global_step=2134, training_loss=0.3371227133910904, metrics={'train_runtime': 2630.01, 'train_samples_per_second': 6.487, 'train_steps_per_second': 0.811, 'total_flos': 195974132394480.0, 'train_loss': 0.3371227133910904, 'epoch': 2.0})

* Training stops after 2 epochs, the value we set with `num_train_epochs` in cell 13.
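
The `global_step=2134` in the training output follows from the size of the train split and the batch size. A quick check, assuming a single device and no gradient accumulation:

```python
import math

train_examples = 8530  # size of the train split reported when the dataset was loaded
batch_size = 8         # per_device_train_batch_size
epochs = 2             # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

This gives 1067 steps per epoch and 2134 steps in total, which matches the `TrainOutput`.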

## Prediction

In [17]:
text = "I love NLP. It's fun to analyze the NLP tasks with HuggingFace"

* We cannot pass this text to the model directly. First, we have to tokenize the text.

In [18]:
inputs = tokenizer(text, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  1045,  2293, 17953,  2361,  1012,  2009,  1005,  1055,  4569,
          2000, 17908,  1996, 17953,  2361,  8518,  2007, 17662, 12172,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [19]:
model_path = "/kaggle/working/my_bert_model/checkpoint-1000"

In [20]:
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)
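
The path above points at the checkpoint saved at step 1000, although training ran for 2134 steps. If the most recent checkpoint is wanted instead, one way is to pick the `checkpoint-<step>` folder with the highest step number. `latest_checkpoint` below is a hypothetical helper, sketched under the assumption that the folders inside `output_dir` follow that naming pattern.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    pattern = re.compile(r"checkpoint-(\d+)$")
    candidates = []
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # compare step numbers as integers, not as strings
    return max(candidates, key=lambda item: item[0])[1]
```

The step is compared as an integer because a plain string sort would place `checkpoint-500` after `checkpoint-2000`.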

In [21]:
# predict the label with PyTorch
import torch

In [22]:
with torch.no_grad():
    logits = model(**inputs).logits

In [23]:
logits.argmax().item()

1

* 1 stands for the positive label, so the model classifies this text as positive.
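
`argmax` returns only the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python on made-up logits, since the real values depend on the trained checkpoint. With PyTorch, the equivalent call would be `torch.softmax(logits, dim=-1)`. The `id2label` mapping is an assumption based on the label meaning noted above.

```python
import math


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "negative", 1: "positive"}  # assumed mapping
example_logits = [-1.2, 1.5]               # hypothetical logits, not real model output
probs = softmax(example_logits)
predicted = id2label[probs.index(max(probs))]
print(predicted, probs)
```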