# Train a language model (Masked Language Modelling) from scratch using Huggingface Transformers and a custom tokenizer

### Inspired by the great notebook by Hugging Face ([blog post](https://huggingface.co/blog/how-to-train)).

# Brief Introduction
This blog post is the first part of a series in which we build a product-name generator using a transformer model. For a few weeks I investigated different models and alternatives in Huggingface for training a text generation model. We have a short list of products with their descriptions, and our goal is to generate the name of each product. I ran some experiments with the Transformer model in TensorFlow as well as the T5 summarizer. Finally, to get deeper into Huggingface Transformers, I decided to tackle the problem with a somewhat more complex approach, an encoder-decoder model. It may not have been the best option, but I wanted to learn new things about Huggingface Transformers. The next post in the series will explore this concept in more depth.

Here, in this first part, we show how to train a tokenizer from scratch and how to use the Masked Language Modeling technique to create a RoBERTa model. This customized model will become the base model for our future encoder-decoder model.

# Our Solution
For our experiment we are going to train a RoBERTa model from scratch; it will become both the encoder and the decoder of a future model. Our domain is very specific: words and concepts about clothes, shapes, colors, and so on. We therefore want to define our own tokenizer built from our specific vocabulary, leaving out common words from other domains or use cases that are irrelevant to our final purpose.

*We can describe our training phase in three main steps*:
- Create and train a byte-level, **Byte-pair encoding tokenizer** with the same special tokens as RoBERTa
- Train a RoBERTa model from scratch using **Masked Language Modeling**, MLM.
- Warm start and **fine tune an encoder decoder model** based on our RoBERTa pretrained model.

In this post we’ll demo how to train a “small” RoBERTa model (6 layers, 768 hidden size, 12 attention heads), the same number of layers and heads as DistilBERT, on the language of a clothing shop. In the next notebook, we’ll fine-tune the model on a downstream text generation task.


# Loading the libraries

In [2]:
# Check that we have a GPU
!nvidia-smi

Tue Oct 25 13:23:58 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    45W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [4]:
!pip install datasets 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Collecting multiproce

In [5]:
import os
import pandas as pd
import tqdm
import math

# Loading the datasets

As we mentioned before, our dataset contains around 31,000 items of clothing from a major retailer, each with a long product description and a short product name, our target variable. First, we run an exploratory data analysis and observe that only a small number of rows contain outlier values. The word count follows a left-skewed distribution, with 75% of rows in the range of 50–60 words and a maximum of about 125 words. The target variable contains about 3 to 6 words.

Set the variables to the data folders:

In [6]:
# #Set the path to the data folder, datafile and output folder and files

# Version = 'v_1.5.2'

# root_folder = os.path.abspath(os.path.join('/content/drive/My Drive/07_research_main/lab_10', Version))

# # wikitext_folder = os.path.abspath(os.path.join(root_folder, '../dataset/wikitext'))
# # cc_news_folder = os.path.abspath(os.path.join(root_folder, '../dataset/cc_news'))
# model_folder = os.path.abspath(os.path.join(root_folder, 'model'))
# output_folder = os.path.abspath(os.path.join(root_folder, 'output'))
# # tokenizer_folder = os.path.abspath(os.path.join(root_folder, 'tokenizer'))

# # train_filename = 'train'
# # test_filename = 'test'
# outputfile = 'wikitext_submission.csv'

# # wikitext_trainfile_path = os.path.abspath(os.path.join(wikitext_folder,train_filename))
# # cc_news_trainfile_path = os.path.abspath(os.path.join(cc_news_folder,train_filename))
# # testfile_path = os.path.abspath(os.path.join(wikitext_folder,test_filename))
# outputfile_path = os.path.abspath(os.path.join(output_folder,outputfile))

In [7]:
def checkpath(path):
    if not os.path.exists(path):
        os.makedirs(path)

In [8]:
 # import dataset

root_folder = '/content/drive/My Drive/07_research_main/lab_10'

wikitext_folder = os.path.abspath(os.path.join(root_folder, 'dataset/wikitext'))
cc_news_folder = os.path.abspath(os.path.join(root_folder, 'dataset/cc_news'))

train_filename = 'train'
test_filename = 'test'

wikitext_trainfile_path = os.path.abspath(os.path.join(wikitext_folder,train_filename))
cc_news_trainfile_path = os.path.abspath(os.path.join(cc_news_folder,train_filename))
testfile_path = os.path.abspath(os.path.join(wikitext_folder,test_filename))

In [9]:
 # model version
Version = 'M_v_6.0.0' # 06

root_folder = os.path.abspath(os.path.join('/content/drive/My Drive/07_research_main/lab_10', Version))

model_folder = os.path.abspath(os.path.join(root_folder, 'model'))
checkpath(model_folder)

output_folder = os.path.abspath(os.path.join(root_folder, 'output'))

outputfile = 'wikitext_submission.csv'

outputfile_path = os.path.abspath(os.path.join(output_folder,outputfile))

In [10]:
#  # last model version
# Version = 'M_v_4.1.2' # 04

# root_folder = os.path.abspath(os.path.join('/content/drive/My Drive/07_research_main/lab_10', Version))

# pre_model_folder = os.path.abspath(os.path.join(root_folder, 'model'))

In [11]:
#  # tokenizer version
# Version = 'T_v_1.3.3'

# root_folder = os.path.abspath(os.path.join('/content/drive/My Drive/07_research_main/lab_10', Version))

# tokenizer_folder = os.path.abspath(os.path.join(root_folder, 'tokenizer'))

In [12]:
#  # dataset version
# Version = 'D_v_1.2.1'

# root_folder = os.path.abspath(os.path.join('/content/drive/My Drive/07_research_main/lab_10', Version))

# train_filename = 'train_dataset_list.csv'
# test_filename = 'test_dataset_list.csv'

# train_file_path = os.path.abspath(os.path.join(root_folder,train_filename))
# test_file_path = os.path.abspath(os.path.join(root_folder,test_filename))

Load the train datafile with the product descriptions and names:

In [13]:
from datasets import load_from_disk

In [14]:
wikitext_train_df = load_from_disk(wikitext_trainfile_path).to_pandas()
# wikitext_train_df.head()
# wikitext_train_df.info()

In [15]:
cc_news_train_df = load_from_disk(cc_news_trainfile_path).to_pandas()
cc_news_train_df.pop("title")
cc_news_train_df.pop("domain")
cc_news_train_df.pop("date")
cc_news_train_df.pop("description")
cc_news_train_df.pop("url")
cc_news_train_df.pop("image_url")
# cc_news_train_df.head()
# cc_news_train_df.info()

0         https://pointe-img.rbl.ms/simage/https%3A%2F%2...
1         https://pointe-img.rbl.ms/simage/https%3A%2F%2...
2         https://pointe-img.rbl.ms/simage/https%3A%2F%2...
3         https://pointe-img.rbl.ms/simage/https%3A%2F%2...
4         https://pointe-img.rbl.ms/simage/https%3A%2F%2...
                                ...                        
708236    https://res.cloudinary.com/jpress/image/fetch/...
708237    https://res.cloudinary.com/jpress/image/fetch/...
708238    http://KFMBFM.images.worldnow.com/images/15964...
708239    http://KFMBFM.images.worldnow.com/images/15965...
708240    http://KFMBFM.images.worldnow.com/images/15966...
Name: image_url, Length: 708241, dtype: object

In [16]:
# Load the train dataset
train_df = pd.concat([wikitext_train_df, cc_news_train_df], ignore_index=True)
# train_df.head()
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2509591 entries, 0 to 2509590
Data columns (total 1 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   text    object
dtypes: object(1)
memory usage: 19.1+ MB


In [17]:
train_df = train_df.sample(frac=1).reset_index(drop=True)
train_df.head()

Unnamed: 0,text
0,""",""size"":[970,250],""partnerId"":""AolHtb""},{""tar..."
1,One of the main uses of weather radar is to b...
2,"ST. CLOUD -- The Granite City Gearheads, a loc..."
3,
4,Plans were made to widen the highway to six l...


In [18]:
# Show the count of rows
print('Num Examples: ',len(train_df))
print('Null Values\n', train_df.isna().sum())
# Drop rows with Null values 
train_df.dropna(inplace=True)
print('Num Examples: ',len(train_df))

Num Examples:  2509591
Null Values
 text    0
dtype: int64
Num Examples:  2509591


In [19]:
train_df.head()

Unnamed: 0,text
0,""",""size"":[970,250],""partnerId"":""AolHtb""},{""tar..."
1,One of the main uses of weather radar is to b...
2,"ST. CLOUD -- The Granite City Gearheads, a loc..."
3,
4,Plans were made to widen the highway to six l...


Then, we read the test dataset:

In [20]:
# Load the test dataset 
test_df = load_from_disk(testfile_path).to_pandas()
print('Num Examples: ',len(test_df))
print('Null Values\n', test_df.isna().sum())
# there are no null values

Num Examples:  4358
Null Values
 text    0
dtype: int64


In [21]:
# ###
# train_df = train_df.sample(frac=0.00001, random_state=0, axis=0)
# test_df = test_df.sample(frac=0.001, random_state=0, axis=0)
# train_df.info()

# Build a Tokenizer


## Create the dataset to train a tokenizer

*To train a tokenizer we need to save our dataset as a set of text files*. We create a plain text file for every description value, replacing the newline characters inside each sample with spaces. We include both the train and test datasets:

In [22]:
# # Drop the files from the output dir
# txt_files_dir = "./text_split"
# !rm -rf {txt_files_dir}
# !mkdir {txt_files_dir}

In [23]:
# # Store values in a dataframe column (Series object) to files, one file per record
# def column_to_files(column, prefix, txt_files_dir):
#     # The prefix is a unique ID to avoid to overwrite a text file
#     i=prefix
#     #For every value in the df, with just one column
#     for row in column.to_list():
#       # Create the filename using the prefix ID
#       file_name = os.path.join(txt_files_dir, str(i)+'.txt')
#       try:
#         # Create the file and write the column text to it
#         f = open(file_name, 'wb')
#         f.write(row.encode('utf-8'))
#         f.close()
#       except Exception as e:  #catch exceptions(for eg. empty rows)
#         print(row, e) 
#       i+=1
#     # Return the last ID
#     return i


Add the training dataset to the text files:

In [24]:
# data = train_df["text"]
# # Remove end-of-line characters inside each text (Series.replace would only match whole values)
# data = data.str.replace("\n", " ", regex=False)
# # Set the ID to 0
# prefix=0
# # Create a file for every description value
# prefix = column_to_files(data, prefix, txt_files_dir)
# # Print the last ID
# print(prefix)

Also add the test dataset to the text files:

In [25]:
# data = test_df["text"]
# # Remove end-of-line characters inside each text (Series.replace would only match whole values)
# data = data.str.replace("\n", " ", regex=False)
# print(len(data))
# # Create a file for every description value
# prefix = column_to_files(data, prefix, txt_files_dir)
# print(prefix)

**Include the target variable for training?** Probably not:

In [26]:
# data = train_df["text"]
# data = data.str.replace("\n", " ", regex=False)
# print(len(data))
# prefix = column_to_files(data, prefix, txt_files_dir)
# print(prefix)
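
The export step in this section can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact code: `records_to_files` and the sample records are hypothetical stand-ins for `column_to_files` and the `text` columns. It also differs from the commented cells in two ways: it collapses all whitespace rather than only newlines, and it skips empty rows instead of reporting them.

```python
import tempfile
from pathlib import Path


def records_to_files(records, out_dir, start_id=0):
    """Write one UTF-8 text file per non-empty record and return the next free ID."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    next_id = start_id
    for text in records:
        # Collapse embedded newlines (and other whitespace runs) into single spaces
        cleaned = " ".join(str(text).split())
        if not cleaned:
            continue  # skip empty rows instead of writing empty files
        (out_dir / f"{next_id}.txt").write_text(cleaned, encoding="utf-8")
        next_id += 1
    return next_id


# Hypothetical records standing in for train_df["text"] and test_df["text"]
train_records = ["knit midi dress\nwith straps", "", "cotton shirt"]
test_records = ["wool coat"]

with tempfile.TemporaryDirectory() as tmp:
    next_id = records_to_files(train_records, tmp)
    # Continue numbering where the training files stopped, so nothing is overwritten
    next_id = records_to_files(test_records, tmp, start_id=next_id)
    written = sorted(p.name for p in Path(tmp).glob("*.txt"))
    first_text = (Path(tmp) / "0.txt").read_text(encoding="utf-8")

print(written, next_id)
```

Passing the returned ID into the next call plays the same role as the `prefix` variable in the cells above.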

## Train the tokenizer

The Stanford NLP group defines tokenization as:

"*Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.*"

A tokenizer breaks a string of characters, usually sentences of text, into tokens and maps each token to an integer ID. The simplest tokenizers split on whitespace (tabs, spaces, new lines) to produce words, but there are many other options, such as subwords.

We will use a **byte-level Byte-pair encoding tokenizer**. Byte pair encoding (BPE) is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. Used for tokenization, the same idea repeatedly merges the most frequent pair of adjacent symbols into a new vocabulary entry. The benefit of this method is that it starts building its vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens and we avoid unknown (UNK) tokens.

A great explanation of tokenizers can be found in the Hugging Face documentation: https://huggingface.co/transformers/tokenizer_summary.html.
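
To make the merge procedure concrete, here is a toy BPE trainer in plain Python. It is a simplified sketch, not the `tokenizers` implementation: it works on characters rather than bytes, ignores the `Ġ` space marker, and breaks ties with a rule chosen here for determinism. The function names and the toy word list are illustrative assumptions.

```python
from collections import Counter


def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_words, num_merges, min_frequency=2):
    """Learn up to `num_merges` merge rules from a list of words."""
    word_freqs = {tuple(w): f for w, f in Counter(corpus_words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically smallest pair
        best, count = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
        if count < min_frequency:
            break
        merges.append(best)
        word_freqs = merge_pair(word_freqs, best)
    return merges


toy_words = ["dress", "dress", "dresses", "red", "red", "shirt"]
toy_merges = train_bpe(toy_words, num_merges=3)
print(toy_merges)  # [('r', 'e'), ('d', 're'), ('dre', 's')]
```

Each learned pair corresponds to one line of a `merges.txt`-style file, in the order the merges were learned.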

In [27]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
# !pip install -q transformers==4.21.1
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1
!pip install datasets==1.0.2

Found existing installation: tensorflow 2.9.2
Uninstalling tensorflow-2.9.2:
  Successfully uninstalled tensorflow-2.9.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-kuwnwc4t
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-kuwnwc4t
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transf

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s pick its size to be 8,192 because our specific vocabulary is very limited and simple.

In [28]:
# from pathlib import Path

# from tokenizers import ByteLevelBPETokenizer

# from tokenizers.processors import BertProcessing

# import torch
# from torch.utils.data.dataset import Dataset

Now we can train our tokenizer on the text files containing our vocabulary. We need to specify the vocabulary size, the minimum frequency for a token to be included, and the special tokens. We choose a vocab size of 8,192 and a min frequency of 2 (you can tune this value depending on your max vocabulary size). Note that the commented-out cell below passes `vocab_size=50265`, RoBERTa's vocabulary size, instead.

The special tokens depend on the model; for RoBERTa we include a short list:
- \<s> or BOS, Beginning Of Sentence
- \</s> or EOS, End Of Sentence
- \<pad> the padding token
- \<unk> the unknown token
- \<mask> the masking token.

In [29]:
# %%time 
# paths = [str(x) for x in Path(".").glob("text_split/*.txt")]

# # Initialize a tokenizer
# tokenizer = ByteLevelBPETokenizer(lowercase=True)

# # Customize training
# tokenizer.train(files=paths, vocab_size=50265, min_frequency=2,
#                 show_progress=True,
#                 special_tokens=[
#                                 "<s>",
#                                 "<pad>",
#                                 "</s>",
#                                 "<unk>",
#                                 "<mask>",
# ])

In [30]:
# tokenizer

The count of samples is small and the tokenizer trains very fast. Now we can save the tokenizer to disk; later we will use it to train the language model:

In [31]:
# #Save the Tokenizer to disk
# tokenizer.save_model(tokenizer_folder)

We now have both a `vocab.json`, which lists the tokens ranked by frequency and is used to convert tokens to IDs, and a `merges.txt` file, which lists the learned BPE merge rules used to split text into tokens.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```
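
The way the two files cooperate at encoding time can be sketched as follows. This is an illustrative simplification: `bpe_encode` is a hypothetical helper, the merge list reuses only the `l a` and `o n` rules from the excerpt above, and the IDs assigned to ordinary tokens are invented for the example (only the five special-token IDs come from the listing).

```python
def bpe_encode(word, merges, vocab, unk_id):
    """Apply merge rules in priority order, then map the pieces to IDs."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect every adjacent pair that has a merge rule, with its priority
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        # Merge the highest-priority (lowest-rank) pair, leftmost first
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols, [vocab.get(s, unk_id) for s in symbols]


# Special-token IDs follow the vocab.json excerpt; the remaining IDs are made up
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 4,
         "l": 5, "a": 6, "o": 7, "n": 8, "la": 9, "on": 10}
merges = [("l", "a"), ("o", "n")]

print(bpe_encode("lala", merges, vocab, unk_id=3))  # (['la', 'la'], [9, 9])
print(bpe_encode("lona", merges, vocab, unk_id=3))  # (['l', 'on', 'a'], [5, 10, 6])
```

In a real byte-level tokenizer the `<unk>` fallback should rarely be needed, because every byte is already part of the base vocabulary.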

What is great is that our tokenizer is optimized for our very specific vocabulary. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. 

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens; of course, you’ll also be able to use it directly from `transformers`. We can instantiate our tokenizer using both files and test it with some text from our dataset.


In [32]:
# # Create the tokenizer using the vocab.json and merges.txt files
# tokenizer = ByteLevelBPETokenizer(
#     os.path.abspath(os.path.join(tokenizer_folder,'vocab.json')),
#     os.path.abspath(os.path.join(tokenizer_folder,'merges.txt'))
# )

In [33]:
# # Prepare the tokenizer
# tokenizer._tokenizer.post_processor = BertProcessing(
#     ("</s>", tokenizer.token_to_id("</s>")),
#     ("<s>", tokenizer.token_to_id("<s>")),
# )
# tokenizer.enable_truncation(max_length=512)
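
What the post-processor and the truncation setting do to a sequence of token IDs can be sketched in a few lines. This is a conceptual illustration rather than the `tokenizers` internals: `post_process` is a hypothetical helper, the IDs 0 and 2 for `<s>` and `</s>` follow the `vocab.json` excerpt above, and the sketch assumes the length limit counts the two special tokens.

```python
def post_process(token_ids, bos_id, eos_id, max_length):
    """Truncate so the result fits in max_length, then wrap with <s> ... </s>."""
    room = max(max_length - 2, 0)  # reserve two slots for the special tokens
    return [bos_id] + token_ids[:room] + [eos_id]


body = [9, 9, 5]  # made-up IDs for three ordinary tokens
print(post_process(body, bos_id=0, eos_id=2, max_length=512))  # [0, 9, 9, 5, 2]
print(post_process(body, bos_id=0, eos_id=2, max_length=4))    # [0, 9, 9, 2]
```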

Let's show some examples:

In [34]:
# tokenizer.encode("knit midi dress with vneckline straps.")

In [35]:
# # Show the tokens created
# tokenizer.encode("knit midi dress with vneckline straps.").tokens

# Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details). In summary: *It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates* .

As the model is BERT-like, we’ll train it on a task of **Masked language modeling**. It involves masking part of the input, about 10-20% of the tokens (15% in BERT and RoBERTa), and then training the model to predict the missing tokens. MLM is often used as a pretraining task **to give models the opportunity to learn textual patterns from unlabeled data**; the model can then be fine-tuned on a particular downstream task. The main benefit is that we do not need labeled data, which is hard to obtain: no text has to be annotated by humans in order to predict the missing values.
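
The masking step can be sketched in plain Python. This is a minimal illustration of the usual BERT-style recipe, not the code used for training in this notebook: 15% of non-special tokens are selected, and of those 80% become `<mask>`, 10% become a random token, and 10% stay unchanged. The function name, the sample IDs and the seed are assumptions; the label value `-100` matches the value the loss ignores in the model code.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels); labels are -100 wherever no prediction is needed."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Never mask special tokens; select the rest with probability mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# Hypothetical sequence: <s> (0), twenty ordinary tokens, </s> (2)
example_ids = [0] + list(range(5, 25)) + [2]
masked_ids, labels = mask_tokens(
    example_ids, mask_id=4, vocab_size=50265, special_ids={0, 1, 2, 3, 4}
)
print(masked_ids)
print(labels)
```

Positions whose label is `-100` contribute nothing to the loss, so the model is only graded on the selected tokens.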


We define some global parameters:

In [36]:
TRAIN_BATCH_SIZE = 80    # input batch size for training (default: 64)
VALID_BATCH_SIZE = 8    # input batch size for testing (default: 1000)
TRAIN_EPOCHS = 1        # number of epochs to train (default: 10)
LEARNING_RATE = 2e-4    # learning rate (default: 0.001)
WEIGHT_DECAY = 0.01
SEED = 42               # random seed (default: 42)
MAX_LEN = 128
SUMMARY_LEN = 7
BOOM = 4
SAVE_STEPS = 78420
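
As a quick sanity check on these values, the number of optimizer steps can be estimated by hand. The sketch assumes a single device, no gradient accumulation, one training example per dataframe row, and that the last partial batch is kept; the row count is the one reported by `train_df.info()` earlier.

```python
import math

NUM_TRAIN_EXAMPLES = 2509591  # rows reported for train_df earlier in the notebook
TRAIN_BATCH_SIZE = 80
TRAIN_EPOCHS = 1
SAVE_STEPS = 78420

# One optimizer step per batch; the last, smaller batch still counts as a step
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * TRAIN_EPOCHS

print(steps_per_epoch)            # 31370
print(total_steps >= SAVE_STEPS)  # False
```

Under these assumptions the run ends before `SAVE_STEPS` is reached, so whether an intermediate checkpoint gets written depends on how the example count and the number of devices differ in the actual training setup.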

In [37]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

## Define the model

We are going to train the model from scratch, not from a pretrained one. We create a model configuration for our RoBERTa model setting the main parameters:
- Vocabulary size
- Attention heads
- Hidden layers
- etc.

In [38]:
# from transformers import RobertaConfig

# config = RobertaConfig.from_pretrained('roberta-large')

# config.num_hidden_layers = 10

# # config.vocab_size = 8192 ###50265

In [39]:
# print(config)

Finally let's initialize our model using the configuration file. As we are training from scratch, we only initialize from a config that defines the architecture of the model, *without restoring previously trained weights*. The weights will be randomly initialized. 

In [40]:
import transformers
print('transformers version: %s' %(transformers.__version__))

transformers version: 4.24.0.dev0


In [41]:
from transformers import RobertaForMaskedLM, RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaLMHead, RobertaPreTrainedModel
from transformers.utils import (
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    logging,
    replace_return_docstrings,
)
from typing import List, Optional, Tuple, Union
from transformers.modeling_outputs import MaskedLMOutput
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from torch import Tensor
logger = logging.get_logger(__name__)

_CHECKPOINT_FOR_DOC = "roberta-base"
_CONFIG_FOR_DOC = "RobertaConfig"
_TOKENIZER_FOR_DOC = "RobertaTokenizer"

ROBERTA_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (`torch.LongTensor` of shape `({0})`):
            Indices of input sequence tokens in the vocabulary.
            Indices can be obtained using [`RobertaTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.
            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.
            [What are attention masks?](../glossary#attention-mask)
        token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
            Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
            1]`:
            - 0 corresponds to a *sentence A* token,
            - 1 corresponds to a *sentence B* token.
            [What are token type IDs?](../glossary#token-type-ids)
        position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
            config.max_position_embeddings - 1]`.
            [What are position IDs?](../glossary#position-ids)
        head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
            Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.
        inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
            model's internal embedding lookup matrix.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
            more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""

In [42]:
class GELU(nn.Module):
    # Sigmoid approximation of GELU: x * sigmoid(1.702 * x)
    def forward(self, x):
        return x * torch.sigmoid(1.702 * x)
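
The activation above is the sigmoid approximation of GELU. A small standard-library check, offered as an illustrative sketch with made-up helper names, compares it with the exact definition `x * Φ(x)` on a grid of inputs:

```python
import math


def gelu_exact(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x):
    """Sigmoid approximation used by the GELU module: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


# Evaluate both on [-6, 6] in steps of 0.1 and record the largest difference
grid = [i / 10 for i in range(-60, 61)]
max_gap = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in grid)
print(f"largest absolute gap on the grid: {max_gap:.4f}")
```

The two curves stay close over this range, which is why the cheaper sigmoid form is a common substitute.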

In [43]:
class Boom_new(nn.Module):
     def __init__(self, in_features: int, out_features: int, dropout=0.1, shortcut: bool = True, device=None, dtype=None) -> None:
         factory_kwargs = {'device': device, 'dtype': dtype}
         super(Boom_new, self).__init__()

         self.linear1 = nn.Linear(in_features, out_features)
         self.dropout = nn.Dropout(dropout) if dropout else None
         if not shortcut:
             self.linear2 = nn.Linear(out_features, in_features)
         self.shortcut = shortcut
         self.act = GELU()
 
     def forward(self, input: Tensor) -> Tensor:
         x = self.act(self.linear1(input))
         if self.dropout: x = self.dropout(x)
         if self.shortcut:
             ninp = input.shape[-1]
             x = torch.narrow(x, -1, 0, x.shape[-1] // ninp * ninp)
             x = x.view(*x.shape[:-1], x.shape[-1] // ninp, ninp)
             z = x.sum(dim=-2)
         else:
             z = self.linear2(x)
 
         return z
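
The shortcut branch of `Boom_new` can be hard to read from the tensor calls alone. The plain-Python sketch below mirrors it for a single vector; `boom_shortcut` is a hypothetical helper and the numbers are toy values. The expanded activation is cut into chunks of the input width and the chunks are summed element-wise, so no second linear layer is needed to return to the original size.

```python
def boom_shortcut(expanded, ninp):
    """Fold a widened vector back to width `ninp` by summing equal-sized chunks."""
    # Drop any tail that does not fill a whole chunk (mirrors torch.narrow)
    usable = len(expanded) // ninp * ninp
    # Split into consecutive chunks of width ninp (mirrors the view call)
    chunks = [expanded[i:i + ninp] for i in range(0, usable, ninp)]
    # Sum the chunks position by position (mirrors x.sum(dim=-2))
    return [sum(column) for column in zip(*chunks)]


# Toy case: input width 2, expansion factor 4, so the widened vector has 8 values
print(boom_shortcut(list(range(8)), ninp=2))  # [12, 16]
```

With the default `shortcut=True`, `linear2` is never created, which is consistent with the loading message further down listing only `Boom.linear1` among the newly initialized weights.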

In [44]:
class ModifiedRobertaForMaskedLM(RobertaPreTrainedModel):
    _keys_to_ignore_on_save = [r"lm_head.decoder.weight", r"lm_head.decoder.bias"]
    _keys_to_ignore_on_load_missing = [r"position_ids", r"lm_head.decoder.weight", r"lm_head.decoder.bias"]
    _keys_to_ignore_on_load_unexpected = [r"pooler"]

    def __init__(self, config):
        super().__init__(config)

        if config.is_decoder:
            logger.warning(
                "If you want to use `RobertaForMaskedLM` make sure `config.is_decoder=False` for "
                "bi-directional self-attention."
            )

        self.roberta = RobertaModel(config, add_pooling_layer=False)
        self.Boom = Boom_new(config.hidden_size, (config.hidden_size * BOOM))
        self.LINEAR = nn.Linear(config.hidden_size,config.hidden_size)
        self.lm_head = RobertaLMHead(config)

        # The LM head weights require special treatment only when they are tied with the word embeddings
        self.update_keys_to_ignore(config, ["lm_head.decoder.weight"])

        # Initialize weights and apply final processing
        self.post_init()

    def get_output_embeddings(self):
        return self.lm_head.decoder

    def set_output_embeddings(self, new_embeddings):
        self.lm_head.decoder = new_embeddings

    @add_start_docstrings_to_model_forward(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        processor_class=_TOKENIZER_FOR_DOC,
        checkpoint=_CHECKPOINT_FOR_DOC,
        output_type=MaskedLMOutput,
        config_class=_CONFIG_FOR_DOC,
        mask="<mask>",
        expected_output="' Paris'",
        expected_loss=0.1,
    )
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
            config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
            loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
        kwargs (`Dict[str, any]`, optional, defaults to *{}*):
            Used to hide legacy arguments that have been deprecated.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]
        sequence_output = self.Boom(sequence_output)
        sequence_output = self.LINEAR(sequence_output)
        prediction_scores = self.lm_head(sequence_output)

        masked_lm_loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

        if not return_dict:
            output = (prediction_scores,) + outputs[2:]
            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

        return MaskedLMOutput(
            loss=masked_lm_loss,
            logits=prediction_scores,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
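
The loss in `forward` relies on `CrossEntropyLoss` skipping every position labelled `-100` and averaging over the rest. A plain-Python sketch of that behaviour, with a hypothetical function name and toy logits, looks like this:

```python
import math


def masked_lm_loss(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # unmasked position: contributes nothing to the loss
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else float("nan")


# Three positions over a 4-token vocabulary; only the first one was masked
toy_logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
toy_labels = [2, -100, -100]
loss = masked_lm_loss(toy_logits, toy_labels)
print(loss)  # uniform logits over 4 classes give log(4)
```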

In [45]:
# model = ModifiedRobertaForMaskedLM(config=config)
# print('Num parameters: ',model.num_parameters())

In [46]:
from transformers import RobertaConfig

config = RobertaConfig.from_pretrained('roberta-large')

# config.num_hidden_layers = 10

# config.vocab_size = 8192 ###50265


In [47]:
# model = ModifiedRobertaForMaskedLM.from_pretrained(pre_model_folder)
# print('Num parameters: ',model.num_parameters())

In [48]:
# from transformers import RobertaForMaskedLM

model = ModifiedRobertaForMaskedLM.from_pretrained('roberta-large', config=config)
print('Num parameters: ',model.num_parameters())


Some weights of ModifiedRobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['LINEAR.bias', 'Boom.linear1.weight', 'LINEAR.weight', 'Boom.linear1.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Num parameters:  360660057
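
The reported count can be reproduced by hand, which is a useful check that the extra layers are wired as intended. The sketch below assumes the standard `roberta-large` configuration: the vocabulary, hidden, position and token-type sizes are visible in the printed model, the Boom factor is the `BOOM` constant, and the 24 layers and 4096 intermediate size are assumptions not shown in this notebook's output.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a LayerNorm: scale plus shift."""
    return 2 * n


vocab, hidden, layers, intermediate = 50265, 1024, 24, 4096
positions, token_types, boom = 514, 1, 4

embeddings = (vocab + positions + token_types) * hidden + layer_norm(hidden)
encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm(hidden)
    + linear(hidden, intermediate)    # feed-forward up-projection
    + linear(intermediate, hidden)    # feed-forward down-projection
    + layer_norm(hidden)
)
# The decoder weight is tied to the word embeddings, so the head only adds
# its dense layer, its LayerNorm and one bias vector of vocabulary size
lm_head = linear(hidden, hidden) + layer_norm(hidden) + vocab
# Layers added by ModifiedRobertaForMaskedLM: Boom.linear1 and LINEAR
added = linear(hidden, hidden * boom) + linear(hidden, hidden)

total = embeddings + layers * encoder_layer + lm_head + added
print(total)  # 360660057
```

The added layers account for 5,248,000 of these parameters; the rest is the standard `roberta-large` masked language model.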


In [49]:
print(model)

ModifiedRobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 1024, padding_idx=1)
      (position_embeddings): Embedding(514, 1024, padding_idx=1)
      (token_type_embeddings): Embedding(1, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (

Now let's create our tokenizer. The cell below loads the pretrained `roberta-large` tokenizer with `RobertaTokenizer.from_pretrained`; the commented-out line shows how to load the tokenizer trained and saved in the previous step from `tokenizer_folder` instead.

In [50]:
from transformers import RobertaTokenizer

# Create the tokenizer from a trained one
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', max_len=MAX_LEN)
# tokenizer = RobertaTokenizer.from_pretrained(tokenizer_folder, max_len=MAX_LEN)


In [51]:
tokenizer

PreTrainedTokenizer(name_or_path='roberta-large', vocab_size=50265, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})

## Building the training Dataset

We'll build a PyTorch dataset by subclassing the `Dataset` class. The `CustomDataset` receives a Pandas Series holding the text of each example (the `description` values in the original setup, the `text` column in the cells below) and the tokenizer used to encode those values. For every text in the Series, the dataset returns a tensor of token ids.

In order to evaluate the model during training, we will generate a training dataset and an evaluation dataset.


https://ryanong.co.uk/2020/06/11/day-163-how-to-build-a-language-model-from-scratch-implementation/

In [52]:
from torch.utils.data.dataset import Dataset

In [53]:
# get_train_dataset = pd.read_csv(train_file_path, dtype = 'Int64', keep_default_na = False)
# # get_train_dataset = get_train_dataset.sample(frac=1).reset_index(drop=True)
# get_train_dataset.head()

In [54]:
# ###
# get_train_dataset = get_train_dataset[0:250959]

In [55]:
# train_dataset_list_ = [[y for y in x if pd.notna(y)] for x in get_train_dataset.values.tolist()]

In [56]:
# get_test_dataset = pd.read_csv(test_file_path, dtype = 'Int64', keep_default_na = False)
# test_dataset_list_ = [[y for y in x if pd.notna(y)] for x in get_test_dataset.values.tolist()]

In [57]:
# len(train_dataset_list_)

In [58]:
# print(test_dataset_list_)

In [59]:
class CustomDataset(Dataset):
    def __init__(self, df, tokenizer):
        # Tokenize every text once, up front, and keep only the token ids.
        self.examples = []

        for example in df.values:
            # Truncate to MAX_LEN. No padding is needed here: `padding=True` on a
            # single text is a no-op, and batches are padded by the data collator.
            x = tokenizer.encode_plus(example, max_length=MAX_LEN, truncation=True)
            self.examples.append(x.input_ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return torch.tensor(self.examples[i])
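
The `__getitem__` comment says that padding happens at the batch level. As a rough illustration of what that means, here is a plain-Python sketch, not the collator's actual code: every example in a batch is right-padded to the length of the longest one. The helper `pad_batch` and the token ids are hypothetical; the pad id `1` is the `pad_token_id` of the RoBERTa configuration.

```python
PAD_TOKEN_ID = 1  # <pad> id in the RoBERTa config (pad_token_id)


def pad_batch(examples, pad_token_id=PAD_TOKEN_ID):
    """Right-pad lists of token ids to the length of the longest example."""
    longest = max(len(ids) for ids in examples)
    return [list(ids) + [pad_token_id] * (longest - len(ids)) for ids in examples]


# Hypothetical token ids for two texts of different lengths (0 = <s>, 2 = </s>).
padded = pad_batch([[0, 713, 16, 2], [0, 713, 2]])
```

Padding per batch instead of to `MAX_LEN` keeps batches of short descriptions small, which saves memory and compute.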

In [60]:
# class ModifiedCustomDataset(Dataset):
#     def __init__(self, df_list):
#         # or use the RobertaTokenizer from `transformers` directly.

#         self.examples = df_list

#     def __len__(self):
#         return len(self.examples)

#     def __getitem__(self, i):
#         # We’ll pad at the batch level.
#         return torch.tensor(self.examples[i])

Concatenate the training and test datasets, keeping only the description column (this step is commented out in the current run):

In [61]:
# Concatenate the train dataset and the test dataset for language modelling
#df=pd.concat([train_df['description'], test_df['description']], axis=0)
#print('Total: ',len(df), len(train_df), len(test_df))


Create the custom datasets, for training and evaluation:

In [62]:
# Create the train and evaluation dataset
train_dataset = CustomDataset(train_df['text'], tokenizer)
eval_dataset = CustomDataset(test_df['text'], tokenizer)

In [63]:
# # Create the train and evaluation dataset
# train_dataset = ModifiedCustomDataset(train_dataset_list_)
# eval_dataset = ModifiedCustomDataset(test_dataset_list_)

In [64]:
# len(train_dataset.examples)

In [65]:
# print(eval_dataset.examples)

## Define the Data Collator for masking our language

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

Once we have the dataset, a **Data Collator helps us mask our training texts**. It is a small helper that batches different samples of the dataset together into an object that PyTorch knows how to perform backprop on. Data collators form a batch from a list of dataset elements and may apply some processing such as padding or random masking. The `DataCollatorForLanguageModeling` class lets us set the probability with which tokens in the input are randomly masked (`mlm_probability`).

In [66]:
from transformers import DataCollatorForLanguageModeling

# Define the Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
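
To give an intuition of what the collator does with `mlm_probability=0.15`, here is a plain-Python sketch of BERT-style random masking. It is a simplification under stated assumptions, not the library code: each non-special token is selected with probability 0.15, and a selected token is replaced by `<mask>` 80% of the time, by a random token 10% of the time, and left unchanged otherwise (the 80/10/10 split is, to our understanding, the library default). The function `mask_tokens`, the constants and the example ids are hypothetical.

```python
import random

MASK_ID = 50264          # assumed id of <mask> in the roberta-large vocabulary
VOCAB_SIZE = 50265
SPECIAL_IDS = {0, 1, 2}  # <s>, <pad>, </s>: never selected for masking
IGNORE_INDEX = -100      # label for positions the loss should skip


def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    """Return (masked_input_ids, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if token_id in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue  # token is not selected: input unchanged, label ignored
        labels[i] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return masked, labels


# Hypothetical sequence: <s>, 200 arbitrary token ids, </s>.
example_ids = [0] + list(range(1000, 1200)) + [2]
masked_ids, labels = mask_tokens(example_ids, rng=random.Random(0))
```

Because the selection is random, the same text is masked differently every time it is batched, so the model sees different masked versions of each text across epochs.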

## Initialize and train our Trainer

When we want to train a transformer model, the basic approach is to create a `Trainer` object, which provides an API for feature-complete training and contains the basic training loop. First, we define the training arguments. There are many of them, but the most relevant are:
- `output_dir`, where the model artifacts will be saved
- `num_train_epochs`, the number of passes over the training set
- `per_device_train_batch_size`, the batch size per device

Then the `Trainer` object is created with the arguments, the input datasets and the data collator defined above:



In [67]:
from transformers import Trainer, TrainingArguments

print(model_folder)
# Define the training arguments
training_args = TrainingArguments(
    output_dir=model_folder,
    overwrite_output_dir=True,
    evaluation_strategy = 'epoch',
    num_train_epochs=TRAIN_EPOCHS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    save_steps=SAVE_STEPS,
    #eval_steps=4096,
    save_total_limit=1,
    fp16 = False,
    # load_best_model_at_end = True,
)
# Create the trainer for our model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    #prediction_loss_only=True,
)

/content/drive/My Drive/07_research_main/lab_10/M_v_6.0.0/model


We are now ready to train our model.

In [68]:
import torch
torch.cuda.empty_cache()

In [69]:
# Train the model
trainer.train()

***** Running training *****
  Num examples = 2509591
  Num Epochs = 1
  Instantaneous batch size per device = 80
  Total train batch size (w. parallel, distributed & accumulation) = 80
  Gradient Accumulation steps = 1
  Total optimization steps = 31370
  Number of trainable parameters = 360660057


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.6222 | 1.557066 |


***** Running Evaluation *****
  Num examples = 4358
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=31370, training_loss=2.124494476367001, metrics={'train_runtime': 45092.5829, 'train_samples_per_second': 55.654, 'train_steps_per_second': 0.696, 'total_flos': 5.949033318782723e+17, 'train_loss': 2.124494476367001, 'epoch': 1.0})
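
The numbers in the training log can be checked by hand. The short calculation below uses only values printed above and assumes a single device, no gradient accumulation, and a final partial batch that is kept (which is what the log suggests); the variable names are ours.

```python
import math

# Values copied from the training log above.
num_examples = 2_509_591
batch_size = 80
num_epochs = 1
train_runtime_s = 45_092.5829

# One optimization step per batch; the last batch may be smaller than 80.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Throughput figures as reported in TrainOutput.metrics.
samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(total_steps, round(samples_per_second, 3), round(runtime_hours, 1))
```

The step count matches the 31370 optimization steps in the log, and the runtime corresponds to roughly twelve and a half hours for a single epoch.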

As a result, we can watch how the loss decreases during training, and we can then evaluate the model on the validation set. Perplexity is the exponential of the evaluation loss, and under the MLM objective that loss is computed only over the masked tokens (15% of the total here), which the model predicts while seeing the rest of the sequence.

In [70]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 4358
  Batch size = 8


Perplexity: 4.62
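
As a small worked example of the relation between loss and perplexity, the snippet below converts in both directions using the figures printed in this notebook. The helper `perplexity` is our own name. The validation loss logged at the end of the epoch and the perplexity printed above come from two separate evaluation passes, and since masking is random they are likely to differ slightly.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)


# Validation loss logged at the end of epoch 1.
epoch_eval_loss = 1.557066
epoch_perplexity = perplexity(epoch_eval_loss)

# Going the other way: the loss implied by the perplexity printed above.
reported_perplexity = 4.62
implied_eval_loss = math.log(reported_perplexity)

print(f"{epoch_perplexity:.2f} {implied_eval_loss:.3f}")
```

A perplexity around 4.6 to 4.7 means that, on the masked positions, the model is on average about as uncertain as if it were choosing uniformly among four or five tokens.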


## Save our final model and tokenizer to disk

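Save the model and tokenizer in a way that they can be restored for a future downstream task, our encoder-decoder model. Note that `trainer.save_model` below writes only the model configuration and weights, as its log shows; the tokenizer has to be saved separately, for example with `tokenizer.save_pretrained(model_folder)`.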

In [71]:
trainer.save_model(model_folder)

Saving model checkpoint to /content/drive/My Drive/07_research_main/lab_10/M_v_6.0.0/model
Configuration saved in /content/drive/My Drive/07_research_main/lab_10/M_v_6.0.0/model/config.json
Model weights saved in /content/drive/My Drive/07_research_main/lab_10/M_v_6.0.0/model/pytorch_model.bin


# Checking the trained model using a Pipeline

Watching the training and evaluation losses go down is not enough. We would also like to apply our model and check whether it is learning anything interesting. An easy way to do that is the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models. **We can use the 'fill-mask' pipeline**, where we input a sequence containing a masked token (`<mask>`) and it returns a list of the most probable filled sequences, with their probabilities.
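
Conceptually, the fill-mask step comes down to a softmax over the vocabulary scores at the `<mask>` position followed by a top-k selection. The sketch below shows that idea in plain Python with a toy vocabulary; it is not the pipeline's implementation, and `top_k_fill`, the vocabulary and the scores are made up for illustration.

```python
import math


def top_k_fill(mask_logits, id_to_token, k=5):
    """Rank candidate tokens for a masked position by softmax probability."""
    # Numerically stable softmax over the vocabulary scores.
    max_logit = max(mask_logits)
    exps = [math.exp(x - max_logit) for x in mask_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable token ids, best first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"token_str": id_to_token[i], "score": probs[i]} for i in ranked]


# Toy vocabulary and made-up scores for the masked position in "midi <mask> with vneckline."
toy_vocab = ["dress", "skirt", "shirt", "with"]
toy_mask_logits = [3.0, 2.0, 0.5, -1.0]
predictions = top_k_fill(toy_mask_logits, toy_vocab, k=2)
```

Each returned entry pairs a candidate token with its probability, which mirrors the `token_str` and `score` fields in the pipeline's output.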


In [72]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=model_folder,
    tokenizer=model_folder
)

loading configuration file /content/drive/My Drive/07_research_main/lab_10/M_v_6.0.0/model/config.json
Model config RobertaConfig {
  "_name_or_path": "/content/drive/My Drive/07_research_main/lab_10/M_v_6.0.0/model",
  "architectures": [
    "ModifiedRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.24.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading configuration file /content/drive/My Drive/07_research_main/lab_10/M_v_6.0.0/model/config.json
Model config RobertaConfig {


OSError: ignored

In [None]:
# knit midi dress with vneckline
# =>
fill_mask("midi <mask> with vneckline.")

Once the pipeline loads, a simple prompt like the one above checks whether basic syntax/grammar works. Let’s also try a slightly more interesting prompt:



In [None]:
# The test text: Round neck sweater with long sleeves
fill_mask("Round neck sweater with <mask> sleeves.")