# Transformers library from Hugging Face

Notes done by: Sebastian Sarasti

In this notebook, I show how to fine-tune a model through the transformers API.

## Case analyzed

The data consists of Stack Overflow questions, and the task is to label the quality of each question.

The model used is FNet.

### 1. Data loading and processing

In [None]:
%%capture
!pip install kaggle
!pip install accelerate -U
!pip install transformers[torch]

In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"sebas02","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d imoore/60k-stack-overflow-questions-with-quality-rate
!unzip /content/60k-stack-overflow-questions-with-quality-rate.zip

Dataset URL: https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questions-with-quality-rate
License(s): copyright-authors
Downloading 60k-stack-overflow-questions-with-quality-rate.zip to /content
 81% 17.0M/21.0M [00:00<00:00, 26.0MB/s]
100% 21.0M/21.0M [00:00<00:00, 31.2MB/s]
Archive:  /content/60k-stack-overflow-questions-with-quality-rate.zip
  inflating: train.csv               
  inflating: valid.csv               


In [None]:
import pandas as pd
import numpy as np

In [None]:
train = pd.read_csv("train.csv")
valid = pd.read_csv("valid.csv")

Review basic features of the dataset

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45000 entries, 0 to 44999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            45000 non-null  int64 
 1   Title         45000 non-null  object
 2   Body          45000 non-null  object
 3   Tags          45000 non-null  object
 4   CreationDate  45000 non-null  object
 5   Y             45000 non-null  object
dtypes: int64(1), object(5)
memory usage: 2.1+ MB


In [None]:
valid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            15000 non-null  int64 
 1   Title         15000 non-null  object
 2   Body          15000 non-null  object
 3   Tags          15000 non-null  object
 4   CreationDate  15000 non-null  object
 5   Y             15000 non-null  object
dtypes: int64(1), object(5)
memory usage: 703.2+ KB


Check for null values

In [None]:
train.isnull().sum()

Id              0
Title           0
Body            0
Tags            0
CreationDate    0
Y               0
dtype: int64

In [None]:
valid.isnull().sum()

Id              0
Title           0
Body            0
Tags            0
CreationDate    0
Y               0
dtype: int64

See what the dataframe looks like

In [None]:
train.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,2016-01-01 00:21:59,LQ_CLOSE
1,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,2016-01-01 02:03:20,HQ
2,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,2016-01-01 02:48:24,HQ
3,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,2016-01-01 03:30:17,HQ
4,34553755,hide/show fab with scale animation,<p>I'm using custom floatingactionmenu. I need...,<android><material-design><floating-action-but...,2016-01-01 05:21:48,HQ


In [None]:
valid.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552974,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,<sql><sql-server>,2016-01-01 01:44:52,LQ_EDIT
1,34554721,Retrieve all except some data of the another t...,I have two table m_master and tbl_appointment\...,<php><mysql><sql><codeigniter><mysqli>,2016-01-01 08:43:50,LQ_EDIT
2,34555135,Pandas: read_html,<p>I'm trying to extract US states from wiki U...,<python><pandas>,2016-01-01 09:55:22,HQ
3,34555448,Reader Always gimme NULL,"I'm so new to C#, I wanna make an application ...",<sql-server><c#-4.0>,2016-01-01 10:43:45,LQ_EDIT
4,34555752,php rearrange array elements based on condition,basically i have this array:\r\n\r\n array(...,<php>,2016-01-01 11:34:09,LQ_EDIT


The HTML tags and other markup symbols are removed with regular expressions

In [None]:
import re

In [None]:
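def clean_text(text):
  # remove line breaks
  text = re.sub(r"\n", "", text)
  text = re.sub(r"\r", "", text)
  # remove asterisks ("\*+" avoids matching the empty string)
  text = re.sub(r"\*+", "", text)
  # remove paragraph tags
  text = re.sub(r"<p>", "", text)
  text = re.sub(r"</p>", "", text)
  # remove apostrophes
  text = re.sub(r"\'", "", text)
  # remove line-break tags
  text = re.sub(r"<br/>n", "", text)
  text = re.sub(r"<br/>", "", text)
  # collapse double spaces and trim the ends
  text = re.sub(r"  ", " ", text)
  text = text.strip()
  return text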
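
The function above targets a fixed list of tags. As a rough alternative sketch (not the cleaning used in this notebook), the same idea can be generalized with one pattern that matches any tag, plus `html.unescape` for entities. The helper name `strip_html` and the sample string are assumptions made for illustration only.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # any HTML tag such as <p>, </p>, <br/>
SPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs, and newlines

def strip_html(text: str) -> str:
    """Hypothetical helper: drop tags, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", text)  # replace each tag with a space so words stay apart
    text = html.unescape(text)    # turn &amp; into &, &lt; into <, and so on
    return SPACE_RE.sub(" ", text).strip()

# Made-up sample in the style of the Body column
sample = "<p>I'm using <code>Timer</code> &amp; TimerTask.  </p>\n"
cleaned = strip_html(sample)
print(cleaned)
```

One caveat of this sketch: a literal `<` followed later by `>` in the text (for example a comparison written in prose) would also be treated as a tag, so it is only a starting point.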

Inspect one sample before applying the function

In [None]:
train.iloc[0, 2]

'<p>I\'m already familiar with repeating tasks every n seconds by using Java.util.Timer and Java.util.TimerTask. But lets say I want to print "Hello World" to the console every random seconds from 1-5. Unfortunately I\'m in a bit of a rush and don\'t have any code to show so far. Any help would be apriciated.  </p>\n'

In [None]:
train["Body"] = train["Body"].apply(clean_text)
valid["Body"] = valid["Body"].apply(clean_text)

In [None]:
train.iloc[0, 2]

'Im already familiar with repeating tasks every n seconds by using Java.util.Timer and Java.util.TimerTask. But lets say I want to print "Hello World" to the console every random seconds from 1-5. Unfortunately Im in a bit of a rush and dont have any code to show so far. Any help would be apriciated.'

In [None]:
train = train.drop(["Id", "Title", "Tags", "CreationDate"], axis=1)
valid = valid.drop(["Id", "Title", "Tags", "CreationDate"], axis=1)

In [None]:
train

Unnamed: 0,Body,Y
0,Im already familiar with repeating tasks every...,LQ_CLOSE
1,Id like to understand why Java 8 Optionals wer...,HQ
2,I am attempting to overlay a title over an ima...,HQ
3,"The question is very simple, but I just could ...",HQ
4,Im using custom floatingactionmenu. I need to ...,HQ
...,...,...
44995,I am new to this and I am asking for help to c...,LQ_CLOSE
44996,I am working on learning Python and was wonder...,LQ_CLOSE
44997,It looks like it costs 8 days per month in Azu...,LQ_CLOSE
44998,"""I _____ any questions.""I want to implement a ...",LQ_CLOSE


In [None]:
valid

Unnamed: 0,Body,Y
0,I am having 4 different tables like select fro...,LQ_EDIT
1,I have two table m_master and tbl_appointment[...,LQ_EDIT
2,"Im trying to extract US states from wiki URL, ...",HQ
3,"Im so new to C#, I wanna make an application t...",LQ_EDIT
4,basically i have this array: array(\t08:00-08...,LQ_EDIT
...,...,...
14995,"I have a menu, and Id like the div.right-contr...",LQ_CLOSE
14996,I try to multiply an integer by a double but I...,LQ_CLOSE
14997,URLS.PY //URLS.PY FILE from django.contrib i...,LQ_EDIT
14998,I have a controller inside which a server is c...,LQ_CLOSE


### 2. Datasets

In this section, a Dataset object is created, which is the native way to store and process data in Hugging Face.

The DatasetDict object is also presented. It groups several datasets, each holding a different split of the data, which makes it easy to apply the same transformation to all splits.

#### 2.1 Accessing data

In [None]:
%%capture
!pip install datasets

In [None]:
from datasets import Dataset, DatasetDict

In [None]:
D_train = Dataset.from_pandas(train)
D_valid = Dataset.from_pandas(valid)

In [None]:
dataset = DatasetDict({
    'train': D_train,
    'test': D_valid
})

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Body', 'Y'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['Body', 'Y'],
        num_rows: 15000
    })
})

You can access the data as you would with dictionaries

In [None]:
dataset["train"]

Dataset({
    features: ['Body', 'Y'],
    num_rows: 45000
})

In [None]:
dataset["train"][0]

{'Body': 'Im already familiar with repeating tasks every n seconds by using Java.util.Timer and Java.util.TimerTask. But lets say I want to print "Hello World" to the console every random seconds from 1-5. Unfortunately Im in a bit of a rush and dont have any code to show so far. Any help would be apriciated.',
 'Y': 'LQ_CLOSE'}

#### 2.2 Creating new columns

A new column is created from the available labels, encoding each class as an integer

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder().fit(dataset["train"]["Y"])

You always operate with dictionaries: the input of the function is a row, and you have to select the target value from it.

If you want to return a new value, you have to return it as a dictionary

In [None]:
def encoding_labels(value):
  return {"label": encoder.transform([value["Y"]])[0]}

In [None]:
dataset["train"]["Y"][0]

'LQ_CLOSE'

In [None]:
encoder.transform([dataset["train"]["Y"][0]])

array([1])
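
To make the result above less opaque, here is a minimal plain-Python sketch of what happens. LabelEncoder assigns integer ids following the sorted order of the unique classes, and Dataset.map merges the dictionary returned by the function into each row. The names `label_to_id` and `encode_row` and the toy rows are assumptions for illustration; this emulates the behavior and is not the library code.

```python
labels = ["LQ_CLOSE", "HQ", "HQ", "LQ_EDIT"]

# Integer ids follow the sorted order of the unique class names.
classes = sorted(set(labels))
label_to_id = {name: idx for idx, name in enumerate(classes)}

def encode_row(row: dict) -> dict:
    """Hypothetical stand-in for encoding_labels: takes a row, returns new columns."""
    return {"label": label_to_id[row["Y"]]}

# Toy rows with the same two columns as the dataset
rows = [{"Body": "example text", "Y": y} for y in labels]

# The returned dictionary is merged into the row, emulated here with dict unpacking.
encoded = [{**row, **encode_row(row)} for row in rows]
print(label_to_id)
print(encoded[0])
```

Under this ordering HQ maps to 0, LQ_CLOSE to 1, and LQ_EDIT to 2, which agrees with the 1 returned for 'LQ_CLOSE' above.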

In [None]:
dataset = dataset.map(encoding_labels)

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Body', 'Y', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['Body', 'Y', 'label'],
        num_rows: 15000
    })
})

### 3. Tokenization

In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = "google/fnet-base"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
def tokenizer_data(batch):
  # pad or truncate every example to 512 tokens so all rows have the same length
  return tokenizer(batch["Body"], padding="max_length", truncation=True, max_length=512)

In [None]:
data_tokenized = dataset.map(tokenizer_data, batched=True)

In [None]:
data_tokenized

DatasetDict({
    train: Dataset({
        features: ['Body', 'Y', 'label', 'input_ids', 'token_type_ids'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['Body', 'Y', 'label', 'input_ids', 'token_type_ids'],
        num_rows: 15000
    })
})

The data can also be converted from a dataset to a pandas dataframe

In [None]:
df_train = data_tokenized["train"].to_pandas()
df_test = data_tokenized["test"].to_pandas()

**From this point on, the model is fine-tuned with PyTorch**

1. The tensors are created in PyTorch

In [None]:
import torch

In [None]:
from torch.utils.data import DataLoader

In [None]:
input_ids_train = torch.from_numpy(np.array(df_train['input_ids'].tolist()))
token_type_ids_train = torch.from_numpy(np.array(df_train['token_type_ids'].tolist())).to(torch.long)
Y_train = torch.from_numpy(np.array(df_train['label'].tolist()))

In [None]:
# Y_train = torch.nn.functional.one_hot(Y_train)

In [None]:
input_ids_test = torch.from_numpy(np.array(df_test['input_ids'].tolist()))
token_type_ids_test = torch.from_numpy(np.array(df_test['token_type_ids'].tolist())).to(torch.long)
Y_test = torch.from_numpy(np.array(df_test['label'].tolist()))

In [None]:
# Y_test = torch.nn.functional.one_hot(Y_test)

2. A dataset is created in PyTorch

In [None]:
class DataNLP(torch.utils.data.Dataset):
  def __init__(self, input_ids, token_type_ids, Y):
    self.input_ids = input_ids
    self.token_type_ids = token_type_ids
    self.Y = Y
  def __len__(self):
    return len(self.input_ids)
  def __getitem__(self, idx):
    return self.input_ids[idx], self.token_type_ids[idx], self.Y[idx]

In [None]:
train_dataset = DataNLP(input_ids_train, token_type_ids_train, Y_train)
test_dataset = DataNLP(input_ids_test, token_type_ids_test, Y_test)

3. The dataloaders are created in PyTorch

In [None]:
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
# shuffling is not needed for evaluation
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
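
As a quick sanity check on the loaders, the number of batches per epoch can be worked out by hand. This is a small sketch that assumes the DataLoader default of keeping the last, smaller batch; the helper name `num_batches` is made up for illustration.

```python
import math

def num_batches(num_rows: int, batch_size: int) -> int:
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_rows / batch_size)

train_batches = num_batches(45000, 64)   # 45000 training rows
test_batches = num_batches(15000, 64)    # 15000 validation rows
print(train_batches, test_batches)
```

These counts are the totals that the progress bars report in the training section.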

### 4. Training

In [None]:
from transformers import AutoModelForPreTraining

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
FNET = AutoModelForPreTraining.from_pretrained("google/fnet-base")

In [None]:
import torch.nn as nn

In [None]:
class FNetTuned(nn.Module):
  def __init__(self):
    super(FNetTuned, self).__init__()
    # load the FNet architecture
    self.fnet = FNET.fnet
    # freeze the weights of the FNet
    for param in self.fnet.parameters():
      param.requires_grad = False
    # create a sequential layer
    self.seq1 = nn.Sequential(
        nn.Flatten(),
        # 393216 = 512 tokens x 768 hidden units
        nn.Linear(393216, 300),
        nn.ReLU(),
        nn.Linear(300, 30),
        nn.ReLU(),
        nn.Linear(30, 3)
    )

  def forward(self, input_ids, token_type_ids):
    x = self.fnet(input_ids=input_ids, token_type_ids=token_type_ids)
    last_hidden_state = x.last_hidden_state
    x = self.seq1(last_hidden_state)
    return x
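
The input size of the first linear layer comes from flattening the encoder output. The sketch below checks that arithmetic with a random tensor standing in for last_hidden_state (not real FNet output), assuming a sequence length of 512 and the hidden size of 768 shown in the model summary. It also counts the parameters of that first layer by hand.

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden = 2, 512, 768   # assumed shapes for this sketch

# Stand-in for last_hidden_state: random numbers with the expected shape
last_hidden_state = torch.randn(batch_size, seq_len, hidden)

# nn.Flatten keeps the batch dimension and merges all the others
flat = nn.Flatten()(last_hidden_state)
print(flat.shape)

# Weights plus biases of a Linear layer going from seq_len * hidden to 300 units
first_layer_params = seq_len * hidden * 300 + 300
print(first_layer_params)
```

Because the flattened size depends on the sequence length, this head only works when every input is padded or truncated to the same number of tokens.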

In [None]:
model = FNetTuned().to(device)

In [None]:
from tqdm import tqdm

In [None]:
# a = model(input_ids_test[:2].to(device), token_type_ids_test[:2].to(device))
# probabilities = torch.nn.functional.softmax(a, dim=1)
# predictions = torch.argmax(probabilities, dim=1)

In [None]:
# loss function
loss_fn = nn.CrossEntropyLoss()
# optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# training loop
for epoch in range(3):
  model.train()
  for batch in tqdm(train_loader, desc=f"Epoch {epoch+1} Training"):
    input_ids, token_type_ids, Y = batch[0].to(device), batch[1].to(device), batch[2].to(device)
    output = model(input_ids, token_type_ids)
    loss = loss_fn(output, Y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

  # validation
  model.eval()
  correct = 0
  total = 0
  with torch.no_grad():
    for batch in tqdm(test_loader, desc=f"Epoch {epoch+1} Validation"):
      input_ids, token_type_ids, Y = batch[0].to(device), batch[1].to(device), batch[2].to(device)
      output = model(input_ids, token_type_ids)
      probabilities = torch.nn.functional.softmax(output, dim=1)
      predictions = torch.argmax(probabilities, dim=1)
      correct += torch.sum(predictions == Y)
      total += len(Y)

    acc = correct / total
    print(f"Epoch {epoch+1}, Accuracy: {acc}")

Epoch 1 Training: 100%|██████████| 704/704 [18:24<00:00,  1.57s/it]
Epoch 1 Validation: 100%|██████████| 235/235 [05:53<00:00,  1.51s/it]


Epoch 1, Accuracy: 0.7825999855995178


Epoch 2 Training: 100%|██████████| 704/704 [18:20<00:00,  1.56s/it]
Epoch 2 Validation: 100%|██████████| 235/235 [05:53<00:00,  1.51s/it]


Epoch 2, Accuracy: 0.7919999957084656


Epoch 3 Training: 100%|██████████| 704/704 [18:20<00:00,  1.56s/it]
Epoch 3 Validation: 100%|██████████| 235/235 [05:53<00:00,  1.50s/it]


Epoch 3, Accuracy: 0.7851999998092651


Note: nn.CrossEntropyLoss takes integer class indices as the target, so the target variable does not need to be one-hot encoded
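
A note on the target encoding: the sketch below illustrates, in plain Python, that cross-entropy computed from an integer class index gives the same value as cross-entropy computed from the matching one-hot vector. The function names and the logits are made up for illustration; this is not PyTorch's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def ce_from_index(logits, target_index):
    """Cross-entropy when the target is an integer class id."""
    return -log_softmax(logits)[target_index]

def ce_from_one_hot(logits, one_hot):
    """Cross-entropy when the target is a one-hot vector."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))

logits = [2.0, 0.5, -1.0]                 # made-up scores for the 3 classes
loss_index = ce_from_index(logits, 1)
loss_one_hot = ce_from_one_hot(logits, [0, 1, 0])
print(loss_index, loss_one_hot)
```

This is why the training loop can pass the integer label column directly to the loss function.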

In [None]:
model

FNetTuned(
  (fnet): FNetModel(
    (embeddings): FNetEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=3)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(4, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (projection): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): FNetEncoder(
      (layer): ModuleList(
        (0-11): 12 x FNetLayer(
          (fourier): FNetFourierTransform(
            (self): FNetBasicFourierTransform()
            (output): FNetBasicOutput(
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            )
          )
          (intermediate): FNetIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): NewGELUActivation()
          )
          (output): FNetOutput(
            (dense): Linear(in_fe

### 5. Save model

The model is saved in a Hugging Face repo

A custom class is created to give the model the Hugging Face format

In [None]:
from huggingface_hub import PyTorchModelHubMixin

In [None]:
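class FinalModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self):
        super().__init__()
        # wrap the fine-tuned network defined above
        self.red = model

    def forward(self, input_ids, token_type_ids):
        # FNetTuned.forward expects both inputs
        return self.red(input_ids, token_type_ids)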

In [None]:
final_model = FinalModel()

In [None]:
final_model.push_to_hub("sebastiansarasti/stack_over_flow")

CommitInfo(commit_url='https://huggingface.co/sebastiansarasti/stack_over_flow/commit/478903775e470b17bb97fe55afec937912d866dd', commit_message='Push model using huggingface_hub.', commit_description='', oid='478903775e470b17bb97fe55afec937912d866dd', pr_url=None, pr_revision=None, pr_num=None)